lexertl is a modern, modular lexical analyser generator. See the
wikipedia definition to learn what a lexical analyser is at
http://en.wikipedia.org/wiki/Lexical_analysis.
Traditionally, programs such as
lex
generate source code as their output and even only support one kind of
programming language. With lexertl I am seeking to offer much more
flexibility than that by exposing the
state machine
that is generated from a supplied lex specification. By doing this the user
has much more freedom in how they process the data, which means it becomes easy to:
lexertl compiles equally well using GCC 4.0.1 on UNIX as
well as Visual C++.
A lexical analyser is a program that takes text as input and outputs substrings it recognizes. Each substring is referred to as a 'token' and has a numeric identifier assigned to it to uniquely identify its type (string, int, keyword etc.). A rule is defined for each token that must be recognised using Regular Expressions.
Sequence |
Meaning |
|---|---|
\a |
Alert (bell). |
\b |
Backspace. |
\e |
ESC character, x1B. |
\n |
Newline. |
\r |
Carriage return. |
\f |
Form feed, x0C. |
\t |
Horizontal tab, x09. |
\v |
Vertical tab, x0B. |
\octal |
Character specified by a three-digit octal code. |
\xhex |
Character specified by a hexadecimal code. |
\cchar |
Named control character. |
"..." |
All characters taken as literals between double quotes, except escape sequences. |
Sequence |
Meaning |
|---|---|
[...] |
A single character listed or contained within a listed range. |
[^...] |
A single character not listed and not contained within a listed range. |
. |
Any character. |
\d |
Digit character ([0-9]). |
\D |
Non-digit character ([^0-9]). |
\s |
Whitespace character ([ \t\n\r\f\v]). |
\S |
Non-whitespace character ([^ \t\n\r\f\v]). |
\w |
Word character ([a-zA-Z0-9_]). |
\W |
Non-word character ([^a-zA-Z0-9_]). |
Sequence |
Meaning |
|---|---|
...|... |
Try subpatterns in alternation. |
* |
Match 0 or more times (greedy). |
+ |
Match 1 or more times (greedy). |
? |
Match 0 or 1 times (greedy). |
{n} |
Match exactly n times. |
{n,} |
Match at least n times (greedy). |
{n,m} |
Match at least n times but no more than m times (greedy). |
*? |
Match 0 or more times (abstemious). |
+? |
Match 1 or more times (abstemious). |
?? |
Match 0 or 1 times (abstemious). |
{n,}? |
Match at least n times (abstemious). |
{n,m}? |
Match at least n times but no more than m times (abstemious). |
{MACRO} |
Include the regex MACRO in the current regex. |
Sequence |
Meaning |
|---|---|
^ |
Start of string or after a newline. |
$ |
End of string or before a newline. |
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
lexertl::rules rules_;
lexertl::state_machine state_machine_;
rules_.add (_T("[0-9]+"), 1);
rules_.add (_T("[a-z]+"), 2);
lexertl::generator::build (rules_, state_machine_);
std::string input_ ("abc012Ad3e4");
std::string::const_iterator iter_ = input_.begin ();
std::string::const_iterator end_ = input.end ();
lexertl::match_results results_ (iter_, end_);
do
{
lexertl::lookup(state_machine_, results_);
std::cout << "Id: " << state_machine_.id << ", Token: '" <<
std::string (results_.start, state_machine_->end) << "'\n";
} while (results_.id != 0);
It is possible to modify a the results object to (re)start at a specific position. Because there are multiple state variables inside the results object, care needs to be taken to set it up correctly when you do this.
First of all the most important member to set is end.
Although the lookup automatically copies this to start when fetching the next token,
I recommend setting start at the same time for completeness (you could set
id to npos too if you really wanted to!) If you are using
"beginning of line" matching, then you need to set the bol flag accordingly,
so that the lexer knows whether the next token starts at the beginning of a
line or not. Finally, if you are using 'multi state' lexing, be sure to set the
state member correctly.
Here is an example:
#include "lexertl/generator.hpp"
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
lexertl::rules rules_;
lexertl::state_machine state_machine_;
std::string input_ (_T("can\ncmd\na cmd\ncmd again\nanother cmd"));
// This time we set the iterator to end(), because we are going to reset
// it anyway.
std::string::const_iterator iter_ = input_.end ();
std::string::const_iterator end_ = input.end ();
lexertl::match_results results_ (iter_, end_);
rules_.add (_T("can"), 1);
rules_.add (_T("^cmd$"), 2);
rules_.add (_T("^cmd"), 3);
rules_.add (_T("cmd$"), 4);
rules_.add (_T("[a-z]+"), 50);
rules_.add (_T("\\s+"), 100);
lexertl::generator::build (rules_, state_machine_);
// Skip the first 4 characters
results_.start = results_.end = input_.begin () + 4;
results_.bol = true;
// Redundant statement, but just for example:
results_.state = 0;
do
{
lexertl::lookup(state_machine_, results_);
std::cout << "Id: " << state_machine_.id << ", Token: '" <<
std::string (results_.start, state_machine_->end) << "'\n";
} while (results_.id != 0);
#include "lexertl/debug.hpp"
#include <iostream>
#include "lexertl/generator.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
lexertl::rules rules_;
lexertl::state_machine state_machine_;
rules_.add (_T("[0-9]+"), 1);
rules_.add (_T("[a-z]+"), 2);
lexertl::generator::build (rules_, state_machine_);
lexertl::generator::minimise_dfa (state_machine_);
lexertl::debug::dump (state_machine_, std::cout);
If you wish to write your own code generator, you will find it useful to be able
to iterate over any state_machine you generate:
#include "lexertl/generator.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
#include "lexertl/string_token.hpp"
lexertl::rules rules_;
lexertl::state_machine state_machine_;
rules_.add (_T("[0-9]+"), 1);
rules_.add (_T("[a-z]+"), 2);
lexertl::generator::build (rules_, state_machine_);
lexertl::state_machine::iterator iter_ = state_machine_.begin();
lexertl::state_machine::iterator end_ = state_machine_.end();
while (iter_ != end_)
{
size_t state_ = iter_->state;
if (iter_->transitions)
{
lexertl::string_token token_ = iter_->token;
size_t goto_state_ = iter->goto_state;
}
++iter_;
}
Here is a full list list of the members of the iterator:
const size_t dfas_ = state_machine_.size();
for (std::size_t dfa_ = 0; dfa_ < dfas_; ++dfa_)
{
const std::size_t states_ = iter_->states;
for (std::size_t state_ = 0; state_ < states_; ++state_)
{
const std::size_t transitions_ = iter_->transitions;
// Move to next state
++iter_;
for (std::size_t t_ = 1; t_ < transitions_; ++t_)
{
// If there are any transitions, process them.
// Note that moving to a new row automatically
// fetches the first transition if one exists
// which is why this loop start with an index of 1.
++iter_;
}
}
}
Thanks to Dave Handley for all his constructive criticism, performance testing, STL
and general C++ tips.
Thanks to Régis Vaquette for all his testing and enthusiastic encouragement!
Thanks to Hartmut Kaiser for all his performance testing, feature requests and advice
on compatibility with different compilers and, of course, boost.