lexertl is a modern, modular lexical analyser generator. See the
wikipedia definition to learn what a lexical analyser is at
http://en.wikipedia.org/wiki/Lexical_analysis.
Traditionally, programs such as
lex
generate source code as their output and even only support one kind of
programming language. With lexertl I am seeking to offer much more
flexibility than that by exposing the
state machine
that is generated from a supplied lex specification. By doing this the user
has much more freedom in how they process the data, which means it becomes easy to:
lexertl compiles equally well using GCC 4.0.1 on UNIX as
well as Visual C++.
A lexical analyser is a program that takes text as input and outputs substrings it recognizes. Each substring is referred to as a 'token' and has a numeric identifier assigned to it to uniquely identify its type (string, int, keyword etc.). A rule is defined for each token that must be recognised using Regular Expressions.
A good way to learn how regular expressions really work is to use the
lexertl::debug::dump() function (see a later section on this
page for how). In general, people are so used to Perl style expressions
they can be confused when things like look-ahead assertions are not supported
when using a lexer generator. I will therefore show a couple of examples of subtler
than the average regular expressions.
'(''|[^'])*'
Will match SQL Server style strings. The state machine is shown below:
\/\*([^*]|\*+[^/*])*\*+\/Matches a C style comment:
Note that the same state machine can be achieved with the following shorter regex:
\/\*(.|\n)*?\*\/However if you take that approach be careful to declare the pattern near the end of your lexer rules. The abstemious repeat operator is very powerful and if used early in your rules will take precedence over later greedy repeats.
Sequence |
Meaning |
|---|---|
\a |
Alert (bell). |
\b |
Backspace. |
\e |
ESC character, x1B. |
\n |
Newline. |
\r |
Carriage return. |
\f |
Form feed, x0C. |
\t |
Horizontal tab, x09. |
\v |
Vertical tab, x0B. |
\octal |
Character specified by a three-digit octal code. |
\xhex |
Character specified by a hexadecimal code. |
\cchar |
Named control character. |
"..." |
All characters taken as literals between double quotes, except escape sequences. |
Sequence |
Meaning |
|---|---|
[...] |
A single character listed or contained within a listed range. |
[^...] |
A single character not listed and not contained within a listed range. |
. |
Any character. |
\d |
Digit character ([0-9]). |
\D |
Non-digit character ([^0-9]). |
\s |
Whitespace character ([ \t\n\r\f\v]). |
\S |
Non-whitespace character ([^ \t\n\r\f\v]). |
\w |
Word character ([a-zA-Z0-9_]). |
\W |
Non-word character ([^a-zA-Z0-9_]). |
Sequence |
Meaning |
|---|---|
\p{C} |
Other. |
\p{Cc} |
Other, Control. |
\p{Cf} |
Other, Format. |
\p{Co} |
Other, Private Use. |
\p{Cs} |
Other, Surrogate. |
\p{L} |
Letter. |
\p{LC} |
Letter, Cased. |
\p{Ll} |
Letter, Lowercase. |
\p{Lm} |
Letter, Modifier. |
\p{Lo} |
Letter, Other. |
\p{Lt} |
Letter, Titlecase. |
\p{Lu} |
Letter, Uppercase. |
\p{M} |
Mark. |
\p{Mc} |
Mark, Space Combining. |
\p{Me} |
Mark, Enclosing. |
\p{Mn} |
Mark, Nonspacing. |
\p{N} |
Number. |
\p{Nd} |
Number, Decimal Digit. |
\p{Nl} |
Number, Letter. |
\p{No} |
Number, Other. |
\p{P} |
Punctuation. |
\p{Pc} |
Punctuation, Connector. |
\p{Pd} |
Punctuation, Dash. |
\p{Pe} |
Punctuation, Close. |
\p{Pf} |
Punctuation, Final quote. |
\p{Pi} |
Punctuation, Initial quote. |
\p{Po} |
Punctuation, Other. |
\p{Ps} |
Punctuation, Open. |
\p{S} |
Symbol. |
\p{Sc} |
Symbol, Currency. |
\p{Sk} |
Symbol, Modifier. |
\p{Sm} |
Symbol, Math. |
\p{So} |
Symbol, Other. |
\p{Z} |
Separator. |
\p{Zl} |
Separator, Line. |
\p{Zp} |
Separator, Paragraph. |
\p{Zs} |
Separator, Space. |
Sequence |
Meaning |
|---|---|
...|... |
Try subpatterns in alternation. |
* |
Match 0 or more times (greedy). |
+ |
Match 1 or more times (greedy). |
? |
Match 0 or 1 times (greedy). |
{n} |
Match exactly n times. |
{n,} |
Match at least n times (greedy). |
{n,m} |
Match at least n times but no more than m times (greedy). |
*? |
Match 0 or more times (abstemious). |
+? |
Match 1 or more times (abstemious). |
?? |
Match 0 or 1 times (abstemious). |
{n,}? |
Match at least n times (abstemious). |
{n,m}? |
Match at least n times but no more than m times (abstemious). |
{MACRO} |
Include the regex MACRO in the current regex. |
Sequence |
Meaning |
|---|---|
^ |
Start of string or after a newline. |
$ |
End of string or before a newline. |
As you will see from the examples below, match_results or
push_match_results are the structs passed to lookup. Like flex,
end of input returns an id of 0 (you can set this to another value if you like)
and an unrecognised token returns a single character. The id returned
is npos in this case. This value is available from a static function
npos() on match_results.
match_results has the following members:
id_type id;
id_type user_id;
iter_type start;
iter_type end;
iter_type eoi;
bool bol;
id_type state;
id defines the numeric id for the current match.user_id can optionally be set as a second value
(e.g. an index into a table of semantic actions).start is the start of the token.end is the end of the token.eoi is the end of the input.bol is a flag indicating either start of input or
the last character read being \n.state is the current start state.push_match_results derives from match_results
and adds the following:
std::stack stack;
This struct is used when recursive rules have been defined.
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
int main ()
{
lexertl::rules rules_;
lexertl::state_machine state_machine_;
rules_.add ("[0-9]+", 1);
rules_.add ("[a-z]+", 2);
lexertl::generator::build (rules_, state_machine_);
std::string input_ ("abc012Ad3e4");
std::string::const_iterator iter_ = input_.begin ();
std::string::const_iterator end_ = input_.end ();
lexertl::match_results results_ (iter_, end_);
do
{
lexertl::lookup (state_machine_, results_);
std::cout << "Id: " << results_.id << ", Token: '" <<
std::string (results_.start, results_.end) << "'\n";
} while (results_.id != 0);
return 0;
}
It is possible to modify the results object to (re)start at a specific position. Because there are multiple state variables inside the results object, care needs to be taken to set it up correctly when you do this.
First of all the most important member to set is end.
Although the lookup automatically copies this to start when fetching the next token,
I recommend setting start at the same time for completeness (you could set
id to npos too if you really wanted to!) If you are using
"beginning of line" matching, then you need to set the bol flag accordingly,
so that the lexer knows whether the next token starts at the beginning of a
line or not. Finally, if you are using 'multi state' lexing, be sure to set the
state member correctly.
Here is an example:
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
int main ()
{
lexertl::rules rules_;
lexertl::state_machine state_machine_;
std::string input_ ("can\ncmd\na cmd\ncmd again\nanother cmd");
// This time we set the iterator to end(), because we are going to reset
// it anyway.
std::string::const_iterator iter_ = input_.end ();
std::string::const_iterator end_ = input_.end ();
lexertl::match_results results_ (iter_, end_);
rules_.add ("can", 1);
rules_.add ("^cmd$", 2);
rules_.add ("^cmd", 3);
rules_.add ("cmd$", 4);
rules_.add ("[a-z]+", 50);
rules_.add ("\\s+", 100);
lexertl::generator::build (rules_, state_machine_);
// Skip the first 4 characters
results_.start = results_.end = input_.begin () + 4;
results_.bol = true;
// Redundant statement, but just for example:
results_.state = 0;
do
{
lexertl::lookup (state_machine_, results_);
std::cout << "Id: " << results_.id << ", Token: '" <<
std::string (results_.start, results_.end) << "'\n";
} while (results_.id != 0);
return 0;
}
As I don't have the time to write a complimentary parser generator, I have
added recursive rule support directly to lexertl. To make this work
we need a couple of tweaks. To start with we need to introduce the concept of
pushing and popping states. An exit state with '>' before the name
means push, whereas '<' for the exit state means pop.
It turns out we also need a means for continuing even when a particular rule has matched. We achieve this by omitting an id when adding a rule. Finally, note that even when we specify an id (as normal) for the exit conditions, the rule will only finish when all previous pushes (with no id specified) are popped.
See the example below where all of these concepts are put together. The rules work together to match a string containing an even number of 0s and 1s.
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
int main ()
{
lexertl::rules rules_;
lexertl::state_machine sm_;
std::string input_ ("110100a");
std::string::const_iterator iter_ = input_.begin ();
std::string::const_iterator end_ = input_.end ();
lexertl::push_match_results results_ (iter_, end_);
rules_.add_state ("ZERO");
rules_.add_state ("ONE");
rules_.add ("INITIAL", "0", ">ONE");
rules_.add ("INITIAL", "1", ">ZERO");
rules_.add ("ZERO", "0", 1, "<");
rules_.add ("ZERO", "1", ">ZERO");
rules_.add ("ONE", "0", ">ONE");
rules_.add ("ONE", "1", 1, "<");
lexertl::generator::build (rules_, sm_);
do
{
lexertl::lookup (sm_, results_);
std::cout << "Id: " << results_.id << ", Token: '" <<
std::string (results_.start, results_.end) << "'\n";
} while (results_.id != 0 && results_.id != results_.npos ());
return 0;
}
A simple calculator example follows that utilises the possibility of pushing a state other than current:
#include "lexertl/generator.hpp"
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
enum calc_id {eEOF, eNegate, eValue, eOperator, eOpen, eClose};
void gen_calc(lexertl::state_machine &sm_)
{
lexertl::rules rules_;
rules_.add_state("VALUE");
rules_.add_state("OPERATOR");
rules_.add_state("SUB_INITIAL");
rules_.add_state("SUB_VALUE");
rules_.add_state("SUB_OPERATOR");
rules_.add("INITIAL", "-", eNegate, "VALUE");
rules_.add("INITIAL,VALUE", "\\d+", eValue, "OPERATOR");
rules_.add("OPERATOR", "[-+*/%]", eOperator, "INITIAL");
rules_.add("INITIAL,VALUE", "\\(", eOpen, ">SUB_INITIAL:OPERATOR");
rules_.add("SUB_INITIAL,SUB_VALUE", "\\(", eOpen,
">SUB_INITIAL:SUB_OPERATOR");
rules_.add("SUB_INITIAL", "-", eNegate, "SUB_VALUE");
rules_.add("SUB_INITIAL,SUB_VALUE", "\\d+", eValue, "SUB_OPERATOR");
rules_.add("SUB_OPERATOR", "[-+*/%]", eOperator, "SUB_INITIAL");
rules_.add("SUB_OPERATOR", "\\)", eClose, "<");
rules_.add("*", "\\s+", sm_.skip(), ".");
lexertl::generator::build(rules_, sm_);
}
void reduce (std::stack &operands_,
std::stack &ops_)
{
if (operands_.empty() || ops_.empty()) return;
char op_ = ops_.top();
int rhs_ = 0;
switch (op_)
{
case '~':
ops_.pop();
operands_.top() *= -1;
break;
case '+':
ops_.pop();
rhs_ = operands_.top();
operands_.pop();
operands_.top () += rhs_;
break;
case '-':
ops_.pop();
rhs_ = operands_.top();
operands_.pop();
operands_.top () -= rhs_;
break;
case '*':
ops_.pop();
rhs_ = operands_.top();
operands_.pop();
operands_.top () *= rhs_;
break;
case '/':
ops_.pop();
rhs_ = operands_.top();
operands_.pop();
operands_.top () /= rhs_;
break;
case '%':
ops_.pop();
rhs_ = operands_.top();
operands_.pop();
operands_.top () %= rhs_;
break;
case '(':
break;
case ')':
ops_.pop();
while (!ops_.empty() && op_ != '(')
{
op_ = ops_.top ();
reduce (operands_, ops_);
}
if (op_ == '(') ops_.pop();
break;
}
}
void calc(const std::string &input_,
const lexertl::state_machine &sm_, int &output_)
{
lexertl::basic_push_match_results results_
(input_.c_str(), input_.c_str() + input_.size());
int value_ = 0;
std::stack operands_;
std::stack ops_;
do
{
lexertl::lookup (sm_, results_);
if (results_.id == results_.npos())
{
throw std::runtime_error("Syntax error");
}
if (results_.id == eOperator)
{
switch (*results_.start)
{
case '~':
case '(':
break;
case '+':
case '-':
reduce(operands_, ops_);
break;
default:
if (!ops_.empty())
{
const char op_ = ops_.top();
if (op_ != '~' && op_ != '+' &&
op_ != '-' && op_ != '(')
{
reduce(operands_, ops_);
}
}
break;
}
ops_.push(*results_.start);
}
else if (results_.id == eNegate)
{
ops_.push('~');
}
else if (results_.id == eValue)
{
std::string val_(results_.start, results_.end);
value_ = atoi(val_.c_str());
operands_.push(value_);
}
else if (results_.id == eOpen)
{
ops_.push('(');
}
else if (results_.id == eClose)
{
ops_.push(')');
reduce(operands_, ops_);
}
} while (results_.id != 0);
if (results_.state != 2)
{
throw std::runtime_error("Unterminated expression!");
}
while (!ops_.empty())
{
reduce(operands_, ops_);
}
if (operands_.size() != 1)
{
throw std::runtime_error("internal error: operands_.size() != 1");
}
output_ = operands_.top();
}
int main ()
{
lexertl::state_machine sm_;
std::string input_;
int output_ = 0;
gen_calc(sm_);
do
{
try
{
std::cout << "Input: " << std::flush;
std::getline(std::cin, input_);
if (input_ == "q" || input_ == "Q")
break;
std::cout << input_ << " = ";
calc(input_, sm_, output_);
std::cout << output_ << std::endl;
}
catch (const std::exception &e)
{
std::cout << "ERROR: " << e.what() << std::endl;
}
} while (input_ != "q" && input_ != "Q");
return 0;
}
#include "lexertl/debug.hpp"
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
int main ()
{
lexertl::rules rules_;
lexertl::state_machine state_machine_;
rules_.add ("[0-9]+", 1);
rules_.add ("[a-z]+", 2);
lexertl::generator::build (rules_, state_machine_);
state_machine_.minimise ();
lexertl::debug::dump (state_machine_, std::cout);
return 0;
}
If you wish to write your own code generator, you will find it useful to
generate a char state machine as it contains states with simple transitions
from character ranges to other states. Even in Unicode mode this allows you
to observe real character ranges. In contrast state_machine has
a two phase lookup and always slices unicode chars into bytes in order to
compress the data.
#include "lexertl/debug.hpp"
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
int main ()
{
lexertl::rules rules_;
lexertl::char_state_machine char_state_machine_;
rules_.add ("[0-9]+", 1);
rules_.add ("[a-z]+", 2);
lexertl::char_generator::build (rules_, char_state_machine_);
lexertl::debug::dump (char_state_machine_, std::cout);
return 0;
}
See state_machine.hpp for the char_state_machine structure.
There is currently one type of file iterator included with lexertl
which is the file_shared_iterator:
#include "lexertl/file_shared_iterator.hpp"
#include <fstream>
#include "lexertl/generator.hpp"
#include <iostream>
#include "lexertl/lookup.hpp"
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
int main ()
{
lexertl::rules rules_;
lexertl::state_machine state_machine_;
std::ifstream if_ ("C:\\Ben\\Dev\\lexertl\\policy.txt");
lexertl::file_shared_iterator iter_ (if_);
lexertl::file_shared_iterator end_;
lexertl::basic_match_results<lexertl::file_shared_iterator, std::size_t>
results_(iter_, end_);
rules_.add ("[0-9]+", 1);
rules_.add ("[a-z]+", 2);
lexertl::generator::build (rules_, state_machine_);
do
{
lexertl::lookup (state_machine_, results_);
std::cout << "Id: " << results_.id << ", Token: '" <<
std::string (results_.start, results_.end) << "'\n";
} while (results_.id != 0);
return 0;
}
Currently there is a single code generator that generates C++ table based lookup:
#include "lexertl/generator.hpp"
#include "lexertl/generate_cpp.hpp"
#include <iostream>
#include "lexertl/rules.hpp"
#include "lexertl/state_machine.hpp"
int main ()
{
lexertl::rules rules_;
lexertl::state_machine sm_;
rules_.add ("[0-9]+", 1);
rules_.add ("[a-z]+", 2);
lexertl::generator::build (rules_, sm_);
sm_.minimise();
lexertl::table_based_cpp::generate_cpp ("lookup", sm_, false, std::cout);
return 0;
}
If you want to generate code for another language, then you should easily be able to use this generator as a template.
Thanks to Dave Handley for all his constructive criticism, performance testing, STL
and general C++ tips.
Thanks to Régis Vaquette for all his testing and enthusiastic encouragement!
Thanks to Hartmut Kaiser for all his performance testing, feature requests and advice
on compatibility with different compilers and, of course, boost.