The Kermit Project |
Columbia
University
612 West 115th Street, New York NY 10025 USA • kermit@columbia.edu
| |||||||||
|
Frank da CruzThis page presents a short script that illustrates new features in C-Kermit 9.0 (added in Alpha.05) that provide script-level access to some of Kermit's character-set features.
fdc@columbia.edu
5 April 2010
Suppose you have a bilingual website in English and Spanish and you want to produce a report from the Web log showing (say) the 20 the most popular Google searches. This is an interesting problem because:
http://www.google.com/search?q=amazon+en+espa%%C3%%B1olThe search string exhibits hex encoding, and the clause “&ie=utf-8” identifies the character set. In another example, the search arrives in ISO-8859-1 Latin Alphabet 1:
&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:es-ES:official&client=firefox-a
http://www.google.com/search?hl=es&client=ms-rim&channel=browser
&ie=ISO-8859-1&q=color+Casta%%F1o+en+ingl%%E9s&btnG=Buscar&lr=
The web log has one line per website access. Each line consists of nine Tab-separated fields (\9 is Tab), of which the 8th is the referring URL and the 5th is the name of the file that was served. Here is a skeleton program that reads the file and selects entries in which HTML files were served in response to a Google search:
(The only reason we're filtering out non-Google searches is to keep the example simple; other search engines have different syntax.) Here we see an important new C-Kermit 9.0 feature:fopen /read filename \%i # Open the input file if fail exit 1 # Exit if file wasn't opened. while true { fread /line \%i line # Read line and trim blanks if fail break # Quit loop on end of file void \fsplit(\m(line),&a,\9,ALL,,1) # Split line into fields if not \findex(.html,\&a[5]) continue # Skip record if file not HTML .google := \findex(.google.,\&a[8]) # Did access come from Google? if not google continue # If not skip this record }
So first we split the record into its fields, one of which is the URL. Then we split the URL into its "clauses". Then we split each clause into its ID and its value. If we have query (q) clause its value is the (encoded) search string. If we have an Input Encoding (ie) clause, its value is the MIME character-set name. Once we have the search string, we do:void \fsplit(\&a[8],&b,&?,ALL,,1) # Split Google URL into clauses undef charset string # Clear the result variables for i 1 \fdim(&b) 1 { # Look for &q= and &ie= void \fsplit(\&b[i],&c,=,ALL) # Split this clause on "=" if equ "\&c[1]" "q" .string := \&c[2] # If 'q' set query string else if equ "\&c[1]" "ie" .charset := \&c[2] # If 'ie' set charset } if not def string continue # If no query string skip this record .string := \fsqueeze(\flower(\fdecodehex(\freplace(\m(string),+,\32))))
The inner function \freplace() simply replaces all plus signs with blanks. The new \fdecodehex() function undoes the hex encoding, so that %%E9, %%F1, etc, are replaced by the bytes they represent. Then the function flower() lowercases the result, and finally the new \fsqueeze() function converts all runs of internal whitespace to single blanks..string := \fsqueeze(\flower(\fdecodehex(\freplace(\m(string),+,\32))))
The \fdecodehex() function assumes that the hex escape sequence is %% but you can specify a different one as a second argument (for example, '=' for email in Quoted-Printable format):
\freplace(this+is+a+string) = this is a string \fdecodehex(one%%2C%%20two%%2C%%20three) = one, two, three \flower(This Is a STRING) = this is a string \fsqueeze(this is a string) = this is a string
\fdecodehex(one=2C=20two=2C=20three,=) = one, two, three
Now comes the interesting part:
The object of this program is to normalize the search strings so those that differ only in character set or alphabetic case or spacing can be counted together. In this example we're going to convert all the search strings to ISO 8859-1. The first clause above says "If the character set was declared in the URL, then convert the string from the declared character set to ISO-8859-1." The conversion is done with the new fcvtcset() function: given a string and two character-set names, it converts the string from the first set to the second (if the two are the same, it does nothing).if def charset { _increment cset<\fupper(\m(charset))> if debug echo "CONVERTING [\m(string)] \m(charset)->ISO-8859-1" .string := \fcvtcset(\m(string),\m(charset),iso-8859-1) if debug show mac string } else if equal "\fstringtype(\m(string))" "UTF8" { .string := \fcvtcset(\m(string),UTF-8,iso-8859-1) if debug show mac string _increment cset<UNDECLARED_UTF-8_DETECTED> } else { _increment cset<UNDECLARED> }
Now, in order for this to be driven by the character-set names used in Google's &ie= clause, Kermit had to become aware of the MIME character-set designators. Kermit has always had its own names for character sets, which preceded MIME's; e.g. LATIN1 for ISO Latin Alphabet 1. MIME has different names, e.g. ISO-8859-1 for Latin Alphabet 1. As of Alpha.05, C-Kermit 9.0 understands the MIME names as well, thus you can feed either Kermit names or MIME names to any of Kermit's commands or functions that take character-set names. A new equivalence table is available HERE.
We're not done yet: the majority of Google URLs are missing the character-set ID, yet many of them still contain UTF8 data, which would need to be converted to have a consistent set of strings. For this, C-Kermit 9.0 Alpha.05 has another new function, \fstringtype(), seen in the second clause above:
This function can tell whether its argument is 7-bit text, 8-bit text, UTF-8, or binary (not text at all). Luckily, the format of UTF-8 is just about unmistakable. When the IF statement is true, we convert the untagged UTF-8 to Latin-1. If the string is not UTF-8, we leave it alone because there is no reliable way to know that the character set is, and in any case, it's mostly likely Latin-1 or CP1252, which are compatible as far as Spanish is concerned.} else if equal "\fstringtype(\m(string))" "UTF8" {
Finally, a couple more transformations:
The first one lowercases any uppercase accented Spanish letters, since \flower() presently handles only A-Z, unaccented. Second, we replace the word "espanol" with "espaņol", because this is a very common case and it can be done safely..string := \fsubstitute(\m(string),ÁÉÍĶÚÜŅ,áéíķúüņ) .string := \freplace(\m(string),espanol,espaņol)
The _increment command is like the increment command (no underscore) except it computes the name of the variable before incrementing it._increment cset<\fupper(\m(charset))> _increment search<\m(string)>
To dump out the results, we convert the associative arrays to regular arrays, sort them in the desired order, and display them as follows:
The output looks like this:echo echo "Records: \flpad(\m(lines),8)" echo "Google: \flpad(\m(google),8)" .n := \faaconvert(search,&a,&b) echo "Unique: \flpad(\m(n),8)" .m := \faaconvert(cset,&c,&d) echo "Charsets: \flpad(\m(m),8)" echo echo Charsets by by frequency... array sort /reverse /numeric &d &c for i 1 m 1 { echo \flpad(\m(i),3). \flpad([\&d[i]],8) \&c[i] } if > n 20 .n = 20 echo echo Top \m(n) searches by frequency... array sort /reverse /numeric &b &a for i 1 n 1 { echo \flpad(\m(i),3). \flpad([\&b[i]],8) \&a[i] }
Without the normalization, there would be many variations of each string and each variation would be counted separately.Records: 28801 Google: 11442 Unique: 5546 Charsets: 6 Charsets by frequency... 1. [7646] UNDECLARED 2. [1962] UTF-8 3. [1512] UNDECLARED_UTF-8_DETECTED 4. [116] ISO-8859-1 5. [20] UTF8 6. [1] GB2312 Top 20 searches by frequency... 1. [649] amazon espaņol 2. [647] amazon 3. [418] amazon.com 4. [185] dia de la madre 2010 5. [117] dia de la madre 2010 colombia 6. [109] www.amazon.com 7. [106] dia de la madre 8. [95] cuando es el dia de la madre 9. [88] amazon.com espaņol 10. [75] amazon en espaņol 11. [69] dominican hair products 12. [65] www.amazon.com en espaņol 13. [64] dia international de la mujer 14. [46] dia del maestro 15. [46] nombres de paises en inglés 16. [44] dominican hair products for hair growth 17. [42] zapatos altos para dama 18. [42] amazon espaņol venezuela 19. [37] amazon.es 20. [37] dia de la madre 2010 chile
Home | Kermit 95 | C-Kermit | Scripts | Current | New | FAQ | Support |