lexertl: The Modular Lexical Analyser Generator

Introduction

lexertl is a modern, modular lexical analyser generator. See the wikipedia definition to learn what a lexical analyser is at http://en.wikipedia.org/wiki/Lexical_analysis.

Traditionally, programs such as lex generate source code as their output and even only support one kind of programming language. With lexertl I am seeking to offer much more flexibility than that by exposing the state machine that is generated from a supplied lex specification. By doing this the user has much more freedom in how they process the data, which means it becomes easy to:

lexertl compiles equally well using GCC 4.0.1 on UNIX as well as Visual C++.

Design Goals

What is a Lexical Analyser?

A lexical analyser is a program that takes text as input and outputs substrings it recognizes. Each substring is referred to as a 'token' and has a numeric identifier assigned to it to uniquely identify its type (string, int, keyword etc.). A rule is defined for each token that must be recognised using Regular Expressions.

Regular Expressions

A good way to learn how regular expressions really work is to use the lexertl::debug::dump() function (see a later section on this page for how). In general, people are so used to Perl style expressions they can be confused when things like look-ahead assertions are not supported when using a lexer generator. I will therefore show a couple of examples of subtler than the average regular expressions.

SQL Server Style Strings

'(''|[^'])*'
Will match SQL Server style strings. The state machine is shown below:

C Style Comment Recognition

\/\*([^*]|\*+[^/*])*\*+\/
Matches a C style comment:

Note that the same state machine can be achieved with the following shorter regex:

\/\*(.|\n)*?\*\/
However if you take that approach be careful to declare the pattern near the end of your lexer rules. The abstemious repeat operator is very powerful and if used early in your rules will take precedence over later greedy repeats.

(Currently) Supported Regular Expression Syntax

Character Representations

Sequence

Meaning

\a

Alert (bell).

\b

Backspace.

\e

ESC character, x1B.

\n

Newline.

\r

Carriage return.

\f

Form feed, x0C.

\t

Horizontal tab, x09.

\v

Vertical tab, x0B.

\octal

Character specified by a three-digit octal code.

\xhex

Character specified by a hexadecimal code.

\cchar

Named control character.

"..."

All characters taken as literals between double quotes, except escape sequences.

Character Classes and class-like Constructs

Sequence

Meaning

[...]

A single character listed or contained within a listed range.

[^...]

A single character not listed and not contained within a listed range.

.

Any character.

\d

Digit character ([0-9]).

\D

Non-digit character ([^0-9]).

\s

Whitespace character ([ \t\n\r\f\v]).

\S

Non-whitespace character ([^ \t\n\r\f\v]).

\w

Word character ([a-zA-Z0-9_]).

\W

Non-word character ([^a-zA-Z0-9_]).

Unicode Character Classes

Sequence

Meaning

\p{C}

Other.

\p{Cc}

Other, Control.

\p{Cf}

Other, Format.

\p{Co}

Other, Private Use.

\p{Cs}

Other, Surrogate.

\p{L}

Letter.

\p{LC}

Letter, Cased.

\p{Ll}

Letter, Lowercase.

\p{Lm}

Letter, Modifier.

\p{Lo}

Letter, Other.

\p{Lt}

Letter, Titlecase.

\p{Lu}

Letter, Uppercase.

\p{M}

Mark.

\p{Mc}

Mark, Space Combining.

\p{Me}

Mark, Enclosing.

\p{Mn}

Mark, Nonspacing.

\p{N}

Number.

\p{Nd}

Number, Decimal Digit.

\p{Nl}

Number, Letter.

\p{No}

Number, Other.

\p{P}

Punctuation.

\p{Pc}

Punctuation, Connector.

\p{Pd}

Punctuation, Dash.

\p{Pe}

Punctuation, Close.

\p{Pf}

Punctuation, Final quote.

\p{Pi}

Punctuation, Initial quote.

\p{Po}

Punctuation, Other.

\p{Ps}

Punctuation, Open.

\p{S}

Symbol.

\p{Sc}

Symbol, Currency.

\p{Sk}

Symbol, Modifier.

\p{Sm}

Symbol, Math.

\p{So}

Symbol, Other.

\p{Z}

Separator.

\p{Zl}

Separator, Line.

\p{Zp}

Separator, Paragraph.

\p{Zs}

Separator, Space.

Alternation and Repetition

Sequence

Meaning

...|...

Try subpatterns in alternation.

*

Match 0 or more times (greedy).

+

Match 1 or more times (greedy).

?

Match 0 or 1 times (greedy).

{n}

Match exactly n times.

{n,}

Match at least n times (greedy).

{n,m}

Match at least n times but no more than m times (greedy).

*?

Match 0 or more times (abstemious).

+?

Match 1 or more times (abstemious).

??

Match 0 or 1 times (abstemious).

{n,}?

Match at least n times (abstemious).

{n,m}?

Match at least n times but no more than m times (abstemious).

{MACRO}

Include the regex MACRO in the current regex.

Anchors

Sequence

Meaning

^

Start of string or after a newline.

$

End of string or before a newline.

Matching

As you will see from the examples below, match_results or push_match_results are the structs passed to lookup. Like flex, end of input returns an id of 0 (you can set this to another value if you like) and an unrecognised token returns a single character. The id returned is npos in this case. This value is available from a static function npos() on match_results.

match_results has the following members:

    id_type id;
    id_type user_id;
    iter_type start;
    iter_type end;
    iter_type eoi;
    bool bol;
    id_type state;

push_match_results derives from match_results and adds the following:
    std::stack stack;
This struct is used when recursive rules have been defined.

Examples

Construct a Lexer and Tokenise Input

#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"

int main ()
{
    lexertl::rules rules_;
    lexertl::state_machine state_machine_;

    rules_.add ("[0-9]+", 1);
    rules_.add ("[a-z]+", 2);
    lexertl::generator::build (rules_, state_machine_);

    std::string input_ ("abc012Ad3e4");
    std::string::const_iterator iter_ = input_.begin ();
    std::string::const_iterator end_ = input_.end ();
    lexertl::match_results results_ (iter_, end_);

    do
    {
        lexertl::lookup (state_machine_, results_);
        std::cout << "Id: " << results_.id << ", Token: '" <<
            std::string (results_.start, results_.end) << "'\n";
    } while (results_.id != 0);

    return 0;
}

Restartable Lexing

It is possible to modify the results object to (re)start at a specific position. Because there are multiple state variables inside the results object, care needs to be taken to set it up correctly when you do this.

First of all the most important member to set is end. Although the lookup automatically copies this to start when fetching the next token, I recommend setting start at the same time for completeness (you could set id to npos too if you really wanted to!) If you are using "beginning of line" matching, then you need to set the bol flag accordingly, so that the lexer knows whether the next token starts at the beginning of a line or not. Finally, if you are using 'multi state' lexing, be sure to set the state member correctly.

Here is an example:

#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"

int main ()
{
    lexertl::rules rules_;
    lexertl::state_machine state_machine_;
    std::string input_ ("can\ncmd\na cmd\ncmd again\nanother cmd");
    // This time we set the iterator to end(), because we are going to reset
    // it anyway.
    std::string::const_iterator iter_ = input_.end ();
    std::string::const_iterator end_ = input_.end ();
    lexertl::match_results results_ (iter_, end_);

    rules_.add ("can", 1);
    rules_.add ("^cmd$", 2);
    rules_.add ("^cmd", 3);
    rules_.add ("cmd$", 4);
    rules_.add ("[a-z]+", 50);
    rules_.add ("\\s+", 100);
    lexertl::generator::build (rules_, state_machine_);

    // Skip the first 4 characters
    results_.start = results_.end = input_.begin () + 4;
    results_.bol = true;
    // Redundant statement, but just for example:
    results_.state = 0;

    do
    {
        lexertl::lookup (state_machine_, results_);
        std::cout << "Id: " << results_.id << ", Token: '" <<
            std::string (results_.start, results_.end) << "'\n";
    } while (results_.id != 0);

    return 0;
}

Recursive Rules

As I don't have the time to write a complimentary parser generator, I have added recursive rule support directly to lexertl. To make this work we need a couple of tweaks. To start with we need to introduce the concept of pushing and popping states. An exit state with '>' before the name means push, whereas '<' for the exit state means pop.

It turns out we also need a means for continuing even when a particular rule has matched. We achieve this by omitting an id when adding a rule. Finally, note that even when we specify an id (as normal) for the exit conditions, the rule will only finish when all previous pushes (with no id specified) are popped.

See the example below where all of these concepts are put together. The rules work together to match a string containing an even number of 0s and 1s.

#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"

int main ()
{
    lexertl::rules rules_;
    lexertl::state_machine sm_;
    std::string input_ ("110100a");
    std::string::const_iterator iter_ = input_.begin ();
    std::string::const_iterator end_ = input_.end ();
    lexertl::push_match_results results_ (iter_, end_);

    rules_.add_state ("ZERO");
    rules_.add_state ("ONE");
    rules_.add ("INITIAL", "0", ">ONE");
    rules_.add ("INITIAL", "1", ">ZERO");
    rules_.add ("ZERO", "0", 1, "<");
    rules_.add ("ZERO", "1", ">ZERO");
    rules_.add ("ONE", "0", ">ONE");
    rules_.add ("ONE", "1", 1, "<");
    lexertl::generator::build (rules_, sm_);

    do
    {
       lexertl::lookup (sm_, results_);
       std::cout << "Id: " << results_.id << ", Token: '" <<
            std::string (results_.start, results_.end) << "'\n";
    } while (results_.id != 0 && results_.id != results_.npos ());

    return 0;
}

Full Blown Grammars

A simple calculator example follows that utilises the possibility of pushing a state other than current:

#include "lexertl/generator.hpp"
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"

enum calc_id {eEOF, eNegate, eValue, eOperator, eOpen, eClose};

void gen_calc(lexertl::state_machine &sm_)
{
    lexertl::rules rules_;

    rules_.add_state("VALUE");
    rules_.add_state("OPERATOR");
    rules_.add_state("SUB_INITIAL");
    rules_.add_state("SUB_VALUE");
    rules_.add_state("SUB_OPERATOR");

    rules_.add("INITIAL", "-", eNegate, "VALUE");
    rules_.add("INITIAL,VALUE", "\\d+", eValue, "OPERATOR");
    rules_.add("OPERATOR", "[-+*/%]", eOperator, "INITIAL");

    rules_.add("INITIAL,VALUE", "\\(", eOpen, ">SUB_INITIAL:OPERATOR");
    rules_.add("SUB_INITIAL,SUB_VALUE", "\\(", eOpen,
        ">SUB_INITIAL:SUB_OPERATOR");

    rules_.add("SUB_INITIAL", "-", eNegate, "SUB_VALUE");
    rules_.add("SUB_INITIAL,SUB_VALUE", "\\d+", eValue, "SUB_OPERATOR");
    rules_.add("SUB_OPERATOR", "[-+*/%]", eOperator, "SUB_INITIAL");
    rules_.add("SUB_OPERATOR", "\\)", eClose, "<");

    rules_.add("*", "\\s+", sm_.skip(), ".");

    lexertl::generator::build(rules_, sm_);
}

void reduce (std::stack &operands_,
    std::stack &ops_)
{
    if (operands_.empty() || ops_.empty()) return;

    char op_ = ops_.top();
    int rhs_ = 0;

    switch (op_)
    {
    case '~':
        ops_.pop();
        operands_.top() *= -1;
        break;
    case '+':
        ops_.pop();
        rhs_ = operands_.top();
        operands_.pop();
        operands_.top () += rhs_;
        break;
    case '-':
        ops_.pop();
        rhs_ = operands_.top();
        operands_.pop();
        operands_.top () -= rhs_;
        break;
    case '*':
        ops_.pop();
        rhs_ = operands_.top();
        operands_.pop();
        operands_.top () *= rhs_;
        break;
    case '/':
        ops_.pop();
        rhs_ = operands_.top();
        operands_.pop();
        operands_.top () /= rhs_;
        break;
    case '%':
        ops_.pop();
        rhs_ = operands_.top();
        operands_.pop();
        operands_.top () %= rhs_;
        break;
    case '(':
        break;
    case ')':
        ops_.pop();

        while (!ops_.empty() && op_ != '(')
        {
            op_ = ops_.top ();
            reduce (operands_, ops_);
        }

        if (op_ == '(') ops_.pop();

        break;
    }
}

void calc(const std::string &input_,
    const lexertl::state_machine &sm_, int &output_)
{
    lexertl::basic_push_match_results results_
        (input_.c_str(), input_.c_str() + input_.size());
    int value_ = 0;
    std::stack operands_;
    std::stack ops_;

    do
    {
        lexertl::lookup (sm_, results_);

        if (results_.id == results_.npos())
        {
            throw std::runtime_error("Syntax error");
        }

        if (results_.id == eOperator)
        {
            switch (*results_.start)
            {
            case '~':
            case '(':
                break;
            case '+':
            case '-':
                reduce(operands_, ops_);
                break;
            default:
                if (!ops_.empty())
                {
                    const char op_ = ops_.top();

                    if (op_ != '~' && op_ != '+' &&
                        op_ != '-' && op_ != '(')
                    {
                        reduce(operands_, ops_);
                    }
                }

                break;
            }

            ops_.push(*results_.start);
        }
        else if (results_.id == eNegate)
        {
            ops_.push('~');
        }
        else if (results_.id == eValue)
        {
            std::string val_(results_.start, results_.end);

            value_ = atoi(val_.c_str());
            operands_.push(value_);
        }
        else if (results_.id == eOpen)
        {
            ops_.push('(');
        }
        else if (results_.id == eClose)
        {
            ops_.push(')');
            reduce(operands_, ops_);
        }
    } while (results_.id != 0);

    if (results_.state != 2)
    {
        throw std::runtime_error("Unterminated expression!");
    }

    while (!ops_.empty())
    {
        reduce(operands_, ops_);
    }

    if (operands_.size() != 1)
    {
        throw std::runtime_error("internal error: operands_.size() != 1");
    }

    output_ = operands_.top();
}

int main ()
{
    lexertl::state_machine sm_;
    std::string input_;
    int output_ = 0;

    gen_calc(sm_);

    do
    {
        try
        {
            std::cout << "Input: " << std::flush;
            std::getline(std::cin, input_);

            if (input_ == "q" || input_ == "Q")
                break;

            std::cout << input_ << " = ";
            calc(input_, sm_, output_);
            std::cout << output_ << std::endl;
        }
        catch (const std::exception &e)
        {
            std::cout << "ERROR: " << e.what() << std::endl;
        }
    } while (input_ != "q" && input_ != "Q");

    return 0;
}

Construct a Lexer and Dump the State Machine

#include "lexertl/debug.hpp"
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"

int main ()
{
    lexertl::rules rules_;
    lexertl::state_machine state_machine_;

    rules_.add ("[0-9]+", 1);
    rules_.add ("[a-z]+", 2);
    lexertl::generator::build (rules_, state_machine_);
    state_machine_.minimise ();
    lexertl::debug::dump (state_machine_, std::cout);
    return 0;
}

Generating a Char State Machine

If you wish to write your own code generator, you will find it useful to generate a char state machine as it contains states with simple transitions from character ranges to other states. Even in Unicode mode this allows you to observe real character ranges. In contrast state_machine has a two phase lookup and always slices unicode chars into bytes in order to compress the data.

#include "lexertl/debug.hpp"
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"

int main ()
{
    lexertl::rules rules_;
    lexertl::char_state_machine char_state_machine_;

    rules_.add ("[0-9]+", 1);
    rules_.add ("[a-z]+", 2);
    lexertl::char_generator::build (rules_, char_state_machine_);
    lexertl::debug::dump (char_state_machine_, std::cout);
    return 0;
}

See state_machine.hpp for the char_state_machine structure.

Lexing a file

There is currently one type of file iterator included with lexertl which is the file_shared_iterator:

#include "lexertl/file_shared_iterator.hpp"
#include <fstream>
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"

int main ()
{
    lexertl::rules rules_;
    lexertl::state_machine state_machine_;
    std::ifstream if_ ("C:\\Ben\\Dev\\lexertl\\policy.txt");
    lexertl::file_shared_iterator iter_ (if_);
    lexertl::file_shared_iterator end_;
    lexertl::basic_match_results<lexertl::file_shared_iterator, std::size_t>
        results_(iter_, end_);

    rules_.add ("[0-9]+", 1);
    rules_.add ("[a-z]+", 2);
    lexertl::generator::build (rules_, state_machine_);

    do
    {
        lexertl::lookup (state_machine_, results_);
        std::cout << "Id: " << results_.id << ", Token: '" <<
            std::string (results_.start, results_.end) << "'\n";
    } while (results_.id != 0);

    return 0;
}

Code Generation

Currently there is a single code generator that generates C++ table based lookup:

#include "lexertl/generator.hpp"
#include "lexertl/generate_cpp.hpp"
#include <iostream>
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"

int main ()
{
    lexertl::rules rules_;
    lexertl::state_machine sm_;

    rules_.add ("[0-9]+", 1);
    rules_.add ("[a-z]+", 2);
    lexertl::generator::build (rules_, sm_);
    sm_.minimise();
    lexertl::table_based_cpp::generate_cpp ("lookup", sm_, false, std::cout);
    return 0;
}

If you want to generate code for another language, then you should easily be able to use this generator as a template.

References

Acknowledgements

Thanks to Dave Handley for all his constructive criticism, performance testing, STL and general C++ tips.
Thanks to Régis Vaquette for all his testing and enthusiastic encouragement!
Thanks to Hartmut Kaiser for all his performance testing, feature requests and advice on compatibility with different compilers and, of course, boost.