Kermit and Comma-Separated-Value Files

C-Kermit 9.0 and Comma-Separated-Value Files

Frank da Cruz
The Kermit Project
Columbia University
Last update: Fri Jun 10 09:17:37 2011

Comma-Separated Value (CSV) format is commonly output by spreadsheets and databases when exporting data into plain-text files for import into other applications. Here are the details:

Here is an example:

aaa, bbb, has spaces,,"ddd,eee,fff", " has spaces ","Muhammad ""The Greatest"" Ali"

The first two are regular fields. The second is a field that has an embedded space but in which any leading or trailing spaces are to be ignored. The fourth is an empty field, but still a field. The fifth is a field that contains embedded commas. The sixth has leading and trailing spaces. The last field has embedded quotation marks.

Prior to C-Kermit 9.0 Alpha.06, C-Kermit did not handle CSV files according to the specification above. Most seriously, there was no provision for a separator to be surrounded by whitespace that was to be considered part of the separator. Also there was no provision for quoting doublequotes inside of a quoted string.

Reading a CSV record

Now the \fsplit() function can handle any CSV-format string if you include the symbolic include set "CSV" as the 4th parameter. To illustrate, this program:

def xx {
   echo [\fcontents(\%1)]
   .\%9 := \fsplit(\fcontents(\%1), &a, \44, CSV)
   for \%i 1 \%9 1 { echo "\flpad(\%i,3). [\&a[\%i]]" }
   echo "-----------"
}
xx {a,b,c}
xx { a , b , c }
xx { aaa,,ccc," with spaces ",zzz }
xx { "1","2","3","","5" }
xx { this is a single field }
xx { this is one field, " and this is another  " }
xx { name,"Mohammad ""The Greatest"" Ali", age, 67 }
xx { """field enclosed in doublequotes""" }
exit

gives the following results:

[a,b,c]
  1. [a]
  2. [b]
  3. [c]
-----------
[ a , b , c ]
  1. [a]
  2. [b]
  3. [c]
-----------
[ aaa,,ccc," with spaces ",zzz ]
  1. [aaa]
  2. []
  3. [ccc]
  4. [ with spaces ]
  5. [zzz]
-----------
[ "1","2","3","","5" ]
  1. [1]
  2. [2]
  3. [3]
  4. []
  5. [5]
  6. []   ← Oops, this was fixed in Alpha.07
-----------
[ this is a single field ]
  1. [this is a single field]
-----------
[ this is one field, " and this is another  " ]
  1. [this is one field]
  2. [ and this is another  ]
  3. []   ← Ditto
-----------
[ name,"Mohammad ""The Greatest"" Ali", age, 67 ]
  1. [name]
  2. [Mohammad "The Greatest" Ali]
  3. [age]
  4. [67]
-----------
[ """field enclosed in doublequotes""" ]
  1. ["field enclosed in doublequotes"]
-----------

The separator \44 (comma) must still be specified as the break set (3rd \fsplit() parameter). When “CSV” is specified as the include set:

The Grouping Mask is automatically set to 1 (which specifies that the ASCII doublequote character (") is used for grouping;
The Separator Flag is automatically set to 1 so that adjacent field separators will not be collapsed;
All bytes (values 0 through 255) other than the break character are added to the include set;
Any leading whitespace is stripped from the first element unless it is enclosed in doublequotes;
Any trailing whitespace is trimmed from the end of the last element unless it is enclosed in doublequotes;
If the separator character has any spaces or tabs preceding it or following it, they are ignored and discarded;
The separator character is treated as an ordinary data character if it appears in a quoted field;
A sequence of two doublequote characters ("") within a quoted field is converted to a single doublequote.

There is also a new TSV symbolic include set, which is like CSV except without the quoting rules or the stripping of whitespace around the separator because, by definition, TSV fields do not contain tabs.

Of course you can specify any separator(s) you want with either the CSV, TSV, or ALL symbolic include sets. For example, if you have a TSV file in which you want the spaces around each Tab to be discarded, you can use:

\fsplit(variable, &a, \9, CSV)

\9 is Tab.

The new symbolic include sets can also be used with \fword(), which is just like \fsplit() except that it retrieves the nth word from the argument string, rather than an array of all the words. In C-Kermit you can get information about these or any other functions with the HELP FUNCTION command, e.g.:

C-Kermit> help func word

Function \fword(s1,n1,s2,s3,n2,n3) - Extracts a word from a string.
    s1 = source string.
    n1 = word number (1-based) counting from left; if negative, from right.
    s2 = optional break set.
    s3 = optional include set (or ALL, CSV, or TSV).
    n2 = optional grouping mask.
    n3 = optional separator flag:
       0 = collapse adjacent separators;
       1 = don't collapse adjacent separators.

  \fword() returns the n1th "word" of the string s1, according to the
  criteria specified by the other parameters.

  The BREAK SET is the set of all characters that separate words. The
  default break set is all characters except ASCII letters and digits.
  ASCII (C0) control characters are treated as break characters by default,
  as are spacing and punctuation characters, brackets, and so on, and
  all 8-bit characters.

  The INCLUDE SET is the set of characters that are to be treated as
  parts of words even though they normally would be separators.  The
  default include set is empty.  Three special symbolic include sets are
  also allowed:

    ALL (meaning include all bytes that are not in the break set)
    CSV (special treatment for Comma-Separated-Value records)
    TSV (special treatment for Tab-Separated-Value records)

  For operating on 8-bit character sets, the include set should be ALL.

  If the GROUPING MASK is given and is nonzero, words can be grouped by
  quotes or brackets selected by the sum of the following:

     1 = doublequotes:    "a b c"
     2 = braces:          {a b c}
     4 = apostrophes:     'a b c'
     8 = parentheses:     (a b c)
    16 = square brackets: [a b c]
    32 = angle brackets:  <a b c>

  Nesting is possible with {}()[]<> but not with quotes or apostrophes.

Returns string:
  Word number n1, if there is one, otherwise an empty string.

Also see:
  HELP FUNCTION SPLIT

C-Kermit>

Using `\fjoin()` to create Comma- or Tab-Separated Value Lists from Arrays

In Alpha.08 of C-Kermit 9.0, \fsplit()'s inverse function, \fjoin() received the capability of converting an array into a comma-separated or a tab-separated value list. Thus, given a CSV, if you split it into an array with \fsplit() and then join the array with \fjoin(), giving each function the new CSV parameter in the appropriate argument position, the result will be will be equivalent to the original, according to the CSV definition. It might not be identical, because if the result had extraneous spaces before or after the separating commas, these are discarded, but that does not affect the elements themselves. The new syntax for \fjoin() is:

\fjoin(&a,CSV): Given the array \&a[] or any other valid array designator, joins its elements into a comma-separated list according to the rules listed above.
\fjoin(&a,TSV): Joins the elements of the given array into a tab-separated list, also described above.

Previous calling conventions for \fjoin() are undisturbed, including the ability to specify a portion of an array, rather than the whole array:

declare \&a[] = 1 2 3 4 5 6 7 8 9
echo \fjoin(&a[3:7],CSV)
3,4,5,6,7

Using \fsplit() and \fjoin() it is now possible to convert a comma-separated value list into a tab-separated value list, and vice versa (which is not a simple matter of changing commas to tabs or vice versa).

Applications for CSV Files

Databases such as MS Access or MySQL can export tables or reports in CSV format, and then Kermit can read the resulting CSV file and do whatever you like with it; typically something that could not be done with the database query language itself (or that you didn't know how to do that way): create reports or datasets based on complex criteria or procedures, edit or modify some fields, etc, and then use \fjoin() to put each record back in CSV form so it can be reimported into a spreadsheet or database.

Here is a simple example in which we purge all records of customers who have two or more unpaid bills. The file is sorted so that each license purchase record is followed by its annual maintenance payment records in chronological order.

#!/usr/local/bin/kermit
.filename = somefile.csv        # Input file in CSV format
fopen /read \%c \m(filename)    # Open it
if fail exit                    # Don't go on if open failed
copy \m(filename) ./new         # Make a copy of the file

.oldserial = 00000000000        # Multiple records for each serial number
.zeros = 0                      # Unpaid bill counter

while true {                    # Loop
    fread /line \%c line        # Get a record
    if fail exit                # End of file
    .n := \fsplit(\m(line),&a,\44,CSV)    # Split the fields into an array
    if not equ "\m(oldserial)" "\&a[6]" { # Have new serial number?
        # Remove all records for previous serial number
        # if two or more bills were not paid...
        if > \m(zeros) 1 {
            grep /nomatch \m(oldserial) /output:./new2 ./new
            rename ./new2 ./new
        }
        .oldserial := \&a[6]    # To detect next time serial number changes
        .zeros = 0              # Reset unpaid bill counter
    }
    if equ "\&a[5]" "$0.00" {   # Element 5 is amount paid
        increment zeros         # If it's zero, count it.
    }
}
fclose \%c

Rewriting the file multiple times is inelegant, but this is a quick and dirty use-once-and-discard script, so elegance doesn't count. The example is interesting in that it purges certain records based on the contents of other records. Maybe there is a way to do this directly with SQL, but why use SQL when you can use Kermit?

Here is the same task but this time no shelling out, and this time we do change and add some fields and then join the result back into a CSV record and write it out to a new file. The object is to create a record for each license that shows not only the date and purchase price of the license but also the date and amount of the last maintenance payment, and to add new fields for sorting by anniversary (month and day):

#!usr/local/bin/kermit +
cd ~/somedirectory                      # CD to appropriate directory
if fail exit 1                          # Make sure we did
.filename := \%1                        # Filename from command line
if not def filename {                   # If none give usage message
    exit 1 "Usage: \%0: infile [ outfile ]"
}
fopen /read \%c \m(filename)            # Open the input CSV file
if fail exit                            # Make sure we did

.output := \%2                          # Output filename from command line
if not def output {                     # Supply one if not given
    .output := New_\m(filename)
}
fopen /write \%o \m(output)             # Open output file
if fail exit                            # Check that we did

.serial = 00000000000                   # Initialize serial number
.licenses = 0                           # and license counter

fread /line \%c line                        # First line is column labels
if fail exit                                # Check
fwrite /line \%o "\m(line),AMM_DD,AYYYY"    # Write new labels line

# Remaining lines are license purchases (K95B) followed by zero or more
# maintenance invoices (K95BM) for each license.

.datepaid = 00/00/0000                  # Initialize last maint payment date
.amtpaid = $0.00                        # Initialize last maint payment amount
set flag off                            # For remembering we're at end of file
while not flag {                        # Loop to read all records
    fread /line \%c line                # Read a record
    if fail set flag on                 # If EOF set flag for later
    .n := \fsplit(\m(line),&a,\44,CSV)  # Break record into array
    if ( flag || equ "\&a[3]" "K95B" ) { # License or EOF
        if fail exit 1 "FAILED: \v(lastcommand)"
        if licenses {                   # If this is not the first license
            .\&x[5] := \m(amtpaid)      # Substitute most recent amount paid
            .\&x[21] := \m(datepaid)    # Substitute most recent date paid
            void \fsplit(\&x[18],&d,/)  # Break up original (anniversary) date
            # and put mm_dd and yyyy in separate fields for sorting...
            fwrite /line \%o "\fjoin(&x,CSV),\flpad(\&d[1],2,0)_\flpad(\&d[2],2,0),\&d[3]"
            if fail exit 1 WRITE        # Check for error
            xecho .                     # Show progress as one dot per record
        }
        if flag break			# We're at EOF so we're finished
        increment licenses              # New license - count it
        array copy &a &x		# Keep this record while reading next
        .serial	:= \&a[6]		# Remember serial number
	.datepaid = 00/00/0000          # Initial maintenance payment date
	.amtpaid = $0.00                # and amount
        continue                        # and go back to read next record
    }
    if not eq "\m(serial)" "\&a[6]" {   # Catch out-of-sequence record
        echo
        echo "SEQUENCE: \m(serial)..\&a[6]: \&a[7] [\&a[1]]"
        continue
    }
    if equ "\&a[5]" "" .\&a[5] = $0.00  # If amount is empty make it $0.00
    if not equ "\&a[5]" "$0.00" {       # If amount is not $0.00
        .datepaid := \&a[21]            # remember date paid
        .amtpaid := \&a[5]              # and amount paid
    }
}
fclose ALL                              # Done - close all files and exit
exit 0 Done.

The result imports back into Excel, where it can be sorted, formatted, or otherwise manipulated as desired.

Using CSV Files: Extending Kermit's Data Structures

Now that we can parse a CSV record, what would we do with a CSV file – that is, a sequence of records? If we needed all the data available at once, we would want to load it into a matrix of (row,column) values. But Kermit doesn't have matrices. Or does it?

Kermit has several built-in data types, but you can invent your own data types as needed using Kermit's macro feature:

define variablename value

For example:

define alphabet abcdefghijklmnopqrstuvwxyz

This defines a macro named alphabet and gives it the value abcdefghijklmnopqrstuvwxyz. A more convenient notation (added in C-Kermit 7.0) for this is:

.alphabet = abcdefghijklmnopqrstuvwxyz

The two are exactly equivalent: they make a literal copy the "right hand side" as the value of the macro. Then you can refer to the macro anywhere in a Kermit command as "\m(macroname)":

echo "Alphabet = \m(alphabet)"

There is a second way to define a macro, which is like the first except that the right-hand side is evaluated first; that is, any variable references or function calls in the right-hand side are replaced by their values before the result is assigned to the macro. The command for this is ASSIGN rather than DEFINE:

define alphabet abcdefghijklmnopqrstuvwxyz
assign backwards \freverse(\m(alphabet))
echo "Alphabet backwards = \m(backwards)"

which prints:

Alphabet backwards = zyxwvutsrqponmlkjihgfedcba

This kind of assignment can also be done like this:

.alphabet = abcdefghijklmnopqrstuvwxyz
.backwards := \freverse(\m(alphabet))

Any command starting with a period is an assignment, and the operator (= or :=) tells what to do with the right-hand side before making the assignment.

In both the DEFINE and ASSIGN commands, the variable name itself is taken literally. It is also possible, however, to have Kermit compute the variable name. This is done (as described in Using C-Kermit, 2nd Ed., p.457), using parallel commands that start with underscore: _DEFINE and _ASSIGN (alias _DEF and _ASG). These are just like DEFINE and ASSIGN except they evaluate the variable name before making the assignment. For example:

define \%a one
_define \%a\%a\%a 111

would create a macro named ONEONEONE with a value of 111, and:

define \%a one
define number 111
_assign \%a\%a\%a \m(number)

would create the same macro with the same value, but:

define \%a one
define number 111
_define \%a\%a\%a \m(number)

would give the macro a value of "\m(number)".

You can use the _ASSIGN command to create any kind of data structure you want; you can find some examples in the Object-Oriented Programming section of the Kermit Script Library. In the following program we use this capability to create a two-dimensional array, or matrix, to hold the all the elements of the CSV file, and then to display the matrix:

fopen /read \%c data.csv                # Open CSV file
if fail exit 1

.\%r = 0                                # Row
.\%m = 0                                # Maximum columns
while true {
    fread /line \%c line                # Read a record
    if fail break                       # End of file
    .\%n := \fsplit(\m(line),&a,\44,CSV) # Split record into items
    incr \%r                            # Count this row
    for \%i 1 \%n 1 {                   # Assign items to this row of matrix
        _asg a[\%r][\%i] \&a[\%i]
    }
    if > \%i \%m { .\%m := \%i }        # Remember width of widest row
}
fclose \%c                              # Close CSV file
decrement \%m                           # (because of how FOR loop works)
echo MATRIX A ROWS: \%r COLUMNS: \%m    # Show the matrix

for \%i 1 \%r 1 {                       # Loop through rows
    for \%j 1 \%m 1 {                   # Loop through columns of each row
        xecho "\flpad(\m(a[\%i][\%j]),6)"
    }
    echo
}
exit 0

The matrix is called a and its elements are a[1][1], a[1][2], a[1][3], ... a[2][1], etc, and you can treat this data structure exactly like a two-dimensional array, in which you can refer to any element by its "X and Y coordinates". For example, if the CSV file contained numeric data you could compute row and column sums using simple FOR loops and Kermit's built-in one-dimensional array data type:

declare \&r[\%r]                        # Make an array for the row sums
declare \&c[\%m]                        # Make an array for the column sums
for \%i 1 \%r 1 {                       # Loop through rows
    for \%j 1 \%m 1 {                   # Loop through columns of each row
        increment \&r[\%i] \m(a[\%i][\%j]) # Accumulate row sum
        increment \&c[\%j] \m(a[\%i][\%j]) # Accumulate column sum
    }
}

Note that the sum arrays don't have to be initialized to zero because Kermit's INCREMENT command treats empty definitions as zero.

C-Kermit 9.0 and Comma-Separated-Value Files

Reading a CSV record

Using \fjoin() to create Comma- or Tab-Separated Value Lists from Arrays

Applications for CSV Files

Using CSV Files: Extending Kermit's Data Structures

Using `\fjoin()` to create Comma- or Tab-Separated Value Lists from Arrays