lexertl: The Modular Lexical Analyser Generator

Introduction

lexertl is a modern, modular lexical analyser generator. See the wikipedia definition to learn what a lexical analyser is at http://en.wikipedia.org/wiki/Lexical_analysis.

Traditionally, programs such as lex generate source code as their output and even only support one kind of programming language. With lexertl I am seeking to offer much more flexibility than that by exposing the state machine that is generated from a supplied lex specification. By doing this the user has much more freedom in how they process the data, which means it becomes easy to:

lexertl compiles equally well using GCC 4.0.1 on UNIX as well as Visual C++.

Design Goals

Design Update

lexertl has moved to an iterator model in preparation for the boost review. Note that only non-const iterators are currently supported, but const_iterators will be added later.

What is a Lexical Analyser?

A lexical analyser is a program that takes text as input and outputs substrings it recognizes. Each substring is referred to as a 'token' and has a numeric identifier assigned to it to uniquely identify its type (string, int, keyword etc.). A rule is defined for each token that must be recognised using Regular Expressions.

(Currently) Supported Regular Expression Syntax

Character representations

Sequence

Meaning

\a

Alert (bell).

\b

Backspace.

\e

ESC character, x1B.

\n

Newline.

\r

Carriage return.

\f

Form feed, x0C.

\t

Horizontal tab, x09.

\v

Vertical tab, x0B.

\octal

Character specified by a three-digit octal code.

\xhex

Character specified by a hexadecimal code.

\cchar

Named control character.

"..."

All characters taken as literals between double quotes, except escape sequences.

Character classes and class-like constructs

Sequence

Meaning

[...]

A single character listed or contained within a listed range.

[^...]

A single character not listed and not contained within a listed range.

.

Any character.

\d

Digit character ([0-9]).

\D

Non-digit character ([^0-9]).

\s

Whitespace character ([ \t\n\r\f\v]).

\S

Non-whitespace character ([^ \t\n\r\f\v]).

\w

Word character ([a-zA-Z0-9_]).

\W

Non-word character ([^a-zA-Z0-9_]).

Alternation and Repetition

Sequence

Meaning

...|...

Try subpatterns in alternation.

*

Match 0 or more times (greedy).

+

Match 1 or more times (greedy).

?

Match 0 or 1 times (greedy).

{n}

Match exactly n times.

{n,}

Match at least n times (greedy).

{n,m}

Match at least n times but no more than m times (greedy).

*?

Match 0 or more times (abstemious).

+?

Match 1 or more times (abstemious).

??

Match 0 or 1 times (abstemious).

{n,}?

Match at least n times (abstemious).

{n,m}?

Match at least n times but no more than m times (abstemious).

{MACRO}

Include the regex MACRO in the current regex.

Anchors

Sequence

Meaning

^

Start of string or after a newline.

$

End of string or before a newline.

Examples

Construct a Lexer and Tokenise Input

#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
#include "lexertl/input.hpp"

lexertl::rules rules_;
lexertl::state_machine state_machine_;

rules_.add (_T("[0-9]+"), 1);
rules_.add (_T("[a-z]+"), 2);
lexertl::generator::build (rules_, state_machine_);

std::string input_ ("abc012Ad3e4");
lexertl::input in_ (&state_machine_, input_.c_str (), input_.c_str () + input_.end ());
lexertl::input::iterator iter_ = in_.begin ();
lexertl::input::iterator end_ = in_.end ();

while (iter_ != end_)
{
    size_t id_ = iter_->id;

    std::cout << "Id: " << id_ << ", Token: '" << std::string (iter_->start,
        iter_->end) << "'\n";
    ++iter_;
}

Restartable Lexing

It is possible to modify an iterator to (re)start at a specific position. Because there are multiple state variables inside the iterator, care needs to be taken to set it up correctly when you do this.

First of all the most important iterator member to set is end. Although the iterator automatically copies this to start when fetching the next token, I recommend setting start at the same time for completeness (you could set id to npos too if you really wanted to!) If you are using "beginning of line" matching, then you need to set the bol flag accordingly, so that the lexer knows whether the next token starts at the beginning of a line or not. Finally, if you are using 'multi state' lexing, be sure to set the state member correctly.

Here is an example:

#include "lexertl/generator.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
#include "lexertl/input.hpp"

lexertl::rules rules_;
lexertl::state_machine state_machine_;
std::string input_ (_T("can\ncmd\na cmd\ncmd again\nanother cmd"));
lexertl::input in_ (&state_machine_, input_.c_str (), input_.c_str () + input_.size ());
// This time we set the iterator to end(), because we are going to reset
// it anyway, therefore it is pointless getting begin() to fetch the
// first token!
lexertl::input::iterator iter_ = in_.end ();
lexertl::input::iterator end_ = in_.end ();

rules_.add (_T("can"), 1);
rules_.add (_T("^cmd$"), 2);
rules_.add (_T("^cmd"), 3);
rules_.add (_T("cmd$"), 4);
rules_.add (_T("[a-z]+"), 50);
rules_.add (_T("\\s+"), 100);
lexertl::generator::build (rules_, state_machine_);

// Skip the first 4 characters
iter_->start = iter_->end = input_.c_str () + 4;
iter_->bol = true;
// Redundant statement, but just for example:
iter_->state = 0;
// Fetch first token (from current position).
++iter_;

while (iter_ != end_)
{
    const std::size_t id_ = iter_->id;
    const std::basic_string token_ (iter_->start, iter_->end);

    ++iter_;
}

Construct a Lexer and Tokenise a File

#include "lexertl/file_input.hpp"
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
#include "lexertl/input.hpp"

lexertl::rules rules_;
lexertl::state_machine state_machine_;

rules_.add (_T("[0-9]+"), 1);
rules_.add (_T("[a-z]+"), 2);
lexertl::generator::build (rules_, state_machine_);

std::ifstream is_ ("<pathname>");
lexertl::file_input file_ (&state_machine_, is_);
lexertl::file_input::iterator iter_ = file_.begin ();
lexertl::file_input::iterator end_ = file_.end ();

while (iter_ != end_)
{
    size_t id_ = iter_->id;

    std::cout << "Id: " << id_ << ", Token: '" << std::string (iter_->start,
        iter_->end) << "'\n";
    ++iter_;
}

Construct a Lexer and Dump the State Machine

#include "lexertl/debug.hpp"
#include <iostream>
#include "lexertl/generator.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"

lexertl::rules rules_;
lexertl::state_machine state_machine_;

rules_.add (_T("[0-9]+"), 1);
rules_.add (_T("[a-z]+"), 2);
lexertl::generator::build (rules_, state_machine_);
lexertl::generator::minimise_dfa (state_machine_);
lexertl::debug::dump (state_machine_, std::cout);

Iterating Over the State Machine

If you wish to write your own code generator, you will find it useful to be able to iterate over any state_machine you generate:

#include "lexertl/generator.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
#include "lexertl/string_token.hpp"

lexertl::rules rules_;
lexertl::state_machine state_machine_;

rules_.add (_T("[0-9]+"), 1);
rules_.add (_T("[a-z]+"), 2);
lexertl::generator::build (rules_, state_machine_);

lexertl::state_machine::iterator iter_ = state_machine_.begin();
lexertl::state_machine::iterator end_ = state_machine_.end();

while (iter_ != end_)
{
    size_t state_ = iter_->state;

    if (iter_->transitions)
    {
        lexertl::string_token token_ = iter_->token;
        size_t goto_state_ = iter->goto_state;
    }

    ++iter_;
}
Here is a full list list of the members of the iterator: You can therefore process the state machine is a more structured way like this:
const size_t dfas_ = state_machine_.size();

for (std::size_t dfa_ = 0; dfa_ < dfas_; ++dfa_)
{
    const std::size_t states_ = iter_->states;

    for (std::size_t state_ = 0; state_ < states_; ++state_)
    {
        const std::size_t transitions_ = iter_->transitions;

        // Move to next state
        ++iter_;

        for (std::size_t t_ = 1; t_ < transitions_; ++t_)
        {
            // If there are any transitions, process them.
            // Note that moving to a new row automatically
            // fetches the first transition if one exists
            // which is why this loop start with an index of 1.
            ++iter_;
        }
    }
}

References

Acknowledgements

Thanks to Dave Handley for all his constructive criticism, performance testing, STL and general C++ tips.
Thanks to Régis Vaquette for all his testing and enthusiastic encouragement!
Thanks to Hartmut Kaiser for all his performance testing, feature requests and advice on compatibility with different compilers and, of course, boost.