wlex: a lexer generator for large encodings, derived from ocamllex

runtime system:
  Copyright (C) 2000, 2001, 2002 Alain Frisch
  distributed under the terms of the LGPL : see LICENSE

lexer generator patch:
  distributed under any license that suits your needs and complies
  with the restrictions of the QPL

email: Alain.Frisch@ens.fr
web: http://www.eleves.ens.fr:8080/home/frisch

--------------------------------------------------------------------------
Overview

This package consists of a lexer generator and the associated runtime
system.

NOTE
 * The lexer generator is derived from ocamllex; due to licence issues
 * (this part of the OCaml system is distributed under the terms of the
 * QPL), I have to distribute it as a patch. This implies that you will
 * need the whole OCaml source tree to build wlex. If you experience any
 * problem applying the patch, please don't hesitate to contact me.

The lexer architecture of wlex adds an extra layer (classification)
between the lexbuf and the lexer. This layer extracts "character classes"
from the lexbuf, and the lexer itself works with these classes, not
directly on characters. Usually, the number of classes is small (<< 256)
and the classification may consume more than one byte to produce the next
class. This makes it possible to parse wide character encodings such as
UTF-8 efficiently (the main motivation for wlex).

Classes form a partition of the accepted characters. A typical example is
to group all the letters into a class "letter", all the digits into a
class "digit", ... The regexps in the lexer specification (the .mll file)
are built on classes. For instance:

  let ident = letter (letter | digit)*

In some cases, it is necessary to change the design of the lexer. For
instance, it is a good idea to have a single rule for identifiers and
keywords, the distinction between them being made in the semantic action
of the rule. Another possibility is to declare all the characters from
the keywords as single classes. This is a very bad idea.
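
To make the classification layer more concrete, here is a minimal sketch
in plain OCaml of turning UTF-8 bytes into a sequence of class numbers.
It is not the wlex runtime: the names letter, digit, other, classify,
decode_utf8 and classes_of_string are hypothetical, the class assignment
covers ASCII only, and the decoder assumes well-formed UTF-8 (no
validation). The real engines read from a lexbuf rather than a string.

```ocaml
(* Hypothetical class numbers; class 0 is reserved for eof, as in wlex. *)
let eof = 0
let letter = 1
let digit = 2
let other = 3

(* Map a code point to its class. Only ASCII letters and digits are
   recognised here; everything else falls into [other]. *)
let classify (cp : int) : int =
  if (cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a) then letter
  else if cp >= 0x30 && cp <= 0x39 then digit
  else other

(* Decode the UTF-8 sequence starting at byte [i] of [s].
   Returns (code point, number of bytes consumed).
   Assumes well-formed input: continuation bytes are not checked. *)
let decode_utf8 (s : string) (i : int) : int * int =
  let b k = Char.code s.[i + k] in
  let c = b 0 in
  if c < 0x80 then (c, 1)
  else if c < 0xe0 then
    (((c land 0x1f) lsl 6) lor (b 1 land 0x3f), 2)
  else if c < 0xf0 then
    (((c land 0x0f) lsl 12)
     lor ((b 1 land 0x3f) lsl 6)
     lor (b 2 land 0x3f), 3)
  else
    (((c land 0x07) lsl 18)
     lor ((b 1 land 0x3f) lsl 12)
     lor ((b 2 land 0x3f) lsl 6)
     lor (b 3 land 0x3f), 4)

(* Produce the classes seen by the automaton for a whole string,
   terminated by the eof class. *)
let classes_of_string (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev (eof :: acc)
    else
      let cp, n = decode_utf8 s i in
      go (i + n) (classify cp :: acc)
  in
  go 0 []
```

The point of the sketch is that a multi-byte character yields a single
class, so the automaton never needs intermediate states for partially
read characters.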

During lexing, the classification is handled by an "engine". Some generic
engines are provided (a C implementation for speed; an ML implementation
if you want pure bytecode), especially to support UTF-8. Working with
such an encoding with ocamllex would introduce a *lot* of "waiting"
states and a *lot* of duplicated transitions in the automaton (avoiding
these is the motivation for wlex).

Another small modification from ocamllex is the possibility to give
parameters to lexer entry points. The mandatory parameters are the lexbuf
and the engine.

--------------------------------------------------------------------------
Requirements

You need recent releases of:

 - the OCaml compiler 3.00 or 3.01, !! including the source tree
   http://caml.inria.fr/ocaml/index.html
   (older versions may work)

 - Findlib (package manager for OCaml)
   http://www.ocaml-programming.de/packages/documentation/findlib/

--------------------------------------------------------------------------
Lexer specification (.mll file)

The syntax of .mll files is modified:

 - before the header, there is a new section which declares classes.
   It starts with the keyword *classes*, followed by class declarations.
   An ident declares a class with this name. A literal character 'x'
   declares a class with name char_ff, where ff is the hexadecimal code
   of x. A literal string "xyz" is equivalent to 'x' 'y' 'z'.
   The classes are assigned sequential numbers, starting with 1.
   The class 0 is predefined as eof.

 - the entry points accept extra arguments. Ex:
     rule token arg1 arg2 = ....

 - in a regexp, "_" means any class; an ident is interpreted as a regexp
   or as a class name

 - in a "[ ... ]" regexp, the dash is forbidden; a literal char 'x' or
   an ident references the corresponding class, which must be declared;
   a string "xyz" is equivalent to 'x' 'y' 'z'

--------------------------------------------------------------------------
Output of wlex

The output file is very close to the one generated by ocamllex.
The "empty token" error message specifies which lexer entry is involved.
Also, by default, the output file begins with class declarations.
For the declarations:

---
classes
  encoding_error
  xml_char         (* used only in negations [^ ...] (i.e : literal) *)
  base_char
  ascii_digit      (* 0..9  "subset" Digit *)
  ideographic
  combining_char
  xml_digit
  extender
  ".-_:'*/()[]@,|=!<>+-$"
  '"'
  " \t\n\r"
---

wlex produces:

---
let eof = 0
let encoding_error = 1
let xml_char = 2
let base_char = 3
let ascii_digit = 4
let ideographic = 5
let combining_char = 6
let xml_digit = 7
let extender = 8
let char_2e = 9
let char_2d = 10
let char_5f = 11
let char_3a = 12
...
let one_char_classes = [
  (0x2e, 09);
  (0x2d, 10);
  (0x5f, 11);
  (0x3a, 12);
  ...
]
let nb_classes = 34
---

It is possible to output these declarations to a separate .ml file with
the command:

  wlex <.mll file> -cf <.ml file>

--------------------------------------------------------------------------
Running the lexer

An engine is a function:

  Lexing.lex_tables -> int -> Lexing.lexbuf -> int

It runs the automaton starting at the initial state (second argument).
The engine's job is to extract bytes from the lexbuf (by calling the
refill_buff function when needed), to classify them into classes, and to
run the automaton (the transitions are labelled with class numbers).

The entry points in the file generated by wlex accept an engine as their
first argument (before the lexbuf).

Generic engines are provided. See wlex_engines_ml.mli and
wlex_engines.mli (same interface; the first one is the Caml
implementation; the second one is an interface to C functions).

Usually, when there are fewer than 256 classes, you will use:

  val engine_tiny_8bit: string -> lex_tables -> int -> lexbuf -> int

for 8-bit encodings like Latin-1 (ISO-8859-1), and:

  val engine_tiny_utf8:
    string -> (int -> int) -> lex_tables -> int -> lexbuf -> int

for UTF-8.

NOTE:
 * The same table works for UTF-8 and Latin-1 encodings.

These engines work with a table compacted into a string.
Each code point is looked up in this table to determine which class it
belongs to. The UTF-8 engine must also be given a function converting
code points outside the table (codepoint >= String.length table) to
class numbers.

--------------------------------------------------------------------------
Building and installation

On a UNIX system:

 - make wlex : apply the patch to ocamllex and compile wlex
   !!! You have to edit the Makefile and fill out the line
       OCAMLLEX_SRC = ...

 - make runtime : build the runtime modules
   !!! With OCaml 3.04, you can build a shared version of the library
       (default; to override, for instance if you have an older OCaml
       version, define SHARED_OR_STATIC=static at the beginning of the
       Makefile)

 - make install : findlib-install the runtime support package
   (name: wlexing) and copy the wlex executable to the bin directory.
   !! By default, wlex goes to /usr/local/bin; you have to edit the
      Makefile to change this.

 - make tester : compile a very dummy test program

For a more realistic example, see the Xpath package:
  http://www.eleves.ens.fr:8080/home/frisch/soft
(I wrote wlex to support UTF-8 in Xpath).
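
As an illustration of the table described under "Running the lexer", the
following sketch builds a classification table compacted into a string,
together with a fallback function for code points beyond the table. It
only mirrors the lookup rule stated there (codepoint >= String.length
table goes to the function); it does not call the wlex engines. The
class numbers follow the example declarations above, while the names
table, fallback and class_of, the assignment of code points to classes,
and the ideographic range are assumptions made for the example. It uses
the Bytes module, so it assumes a more recent OCaml than the releases
listed under Requirements.

```ocaml
(* Class numbers as in the example output (eof = 0 is predefined). *)
let xml_char = 2
let base_char = 3
let ascii_digit = 4
let ideographic = 5

(* Build a 256-entry table: byte k of the string holds the class of
   code point k. The assignment below is illustrative only. *)
let table : string =
  let t = Bytes.make 256 (Char.chr xml_char) in
  let set lo hi cls =
    for cp = lo to hi do Bytes.set t cp (Char.chr cls) done
  in
  set (Char.code 'A') (Char.code 'Z') base_char;
  set (Char.code 'a') (Char.code 'z') base_char;
  set (Char.code '0') (Char.code '9') ascii_digit;
  Bytes.to_string t

(* Hypothetical fallback for code points outside the table. *)
let fallback (cp : int) : int =
  if cp >= 0x4e00 && cp <= 0x9fa5 then ideographic else xml_char

(* The lookup rule from the text: use the table when the code point
   fits in it, otherwise ask the fallback function. *)
let class_of (table : string) (fallback : int -> int) (cp : int) : int =
  if cp < String.length table then Char.code table.[cp]
  else fallback cp
```

A table and function of this shape are what the first two arguments of
engine_tiny_utf8 (string and int -> int) are described as expecting;
check wlex_engines.mli for the exact conventions before relying on it.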