Finally had a requirement to create a .NET interface to lexertl. So far I have done the proof of concept at work (C++/CLI) and can now call it from C#. Will see about releasing the source on CodeProject (as the wrapper was developed in work time..!) when it is in better shape.
Agreed with Hartmut to rename unique_id to user_id. The auto counter behaviour is gone and instead this id is optionally settable when adding a rule. This should mean this version is ready for integration with boost::spirit again, bar testing.
Fixed up the recursive rule handling. Improved the lookup so that leftmost longest matching is maintained when using recursive rules and added the possibility to push a state other than the current one.
The code generator still needs attention, but a new release is in order regardless.
I wondered whether it was a retrograde step to make minimise_dfa()
a member function of state_machine, but after having tried it I am
now convinced that the code is different enough that it is worth it.
char_state_machine can now also be minimised.
Made a change I have been meaning to make for ages: the EOL index is now
the last column in each row. This means if $ is not used in the
rules then no space is wasted for it in the state machine.
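To illustrate why putting the EOL column last helps, here is a minimal sketch. The `dfa_table` struct and its members are hypothetical and are not lexertl's real layout; the point is only that a trailing optional column can be dropped without renumbering the other columns.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout, not lexertl's real one: each DFA row holds one
// column per character equivalence class, plus an optional trailing
// column for the end-of-line ($) transition.
struct dfa_table
{
    std::size_t char_columns;        // number of character class columns
    bool uses_eol;                   // true if any rule contains $
    std::vector<std::size_t> cells;  // states * row_width() entries

    dfa_table(std::size_t states, std::size_t columns, bool eol) :
        char_columns(columns),
        uses_eol(eol),
        cells(states * (columns + (eol ? 1 : 0)), 0)
    {
    }

    // Width of a row: the EOL column only exists when it is needed.
    std::size_t row_width() const
    {
        return char_columns + (uses_eol ? 1 : 0);
    }

    // Index of the EOL cell for a state. Only valid if uses_eol is true.
    std::size_t eol_index(std::size_t state) const
    {
        return state * row_width() + char_columns;
    }
};
```

Because the character columns keep the same indexes either way, only the row width changes when $ is absent.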
When using recursive rules, lookup now checks whether the stack is empty before attempting to pop states.
I've realised that state machine minimisation should be a member function
for each state machine type and currently char_state_machine does not
support minimisation. I will look into this along with pushing recursive matching as
far as possible. I have started with a simple calculator example in main.cpp.
I will mull how far it is logical to push the recursive rules (i.e. are there
too many problems introduced by supporting simple recursive grammars.)
Decided to add iterator conversion from char32_t to UTF-8
and UTF-16 (mainly for seeing Unicode tokens on Linux).
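For anyone curious what such a conversion involves, here is a minimal sketch of encoding a single char32_t code point as UTF-8 and UTF-16. The function names `to_utf8` and `to_utf16` are made up for this example (the library itself exposes iterators rather than these functions), and the input is assumed to be a valid code point.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
inline std::string to_utf8(char32_t cp)
{
    std::string out;

    if (cp < 0x80)
    {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }

    return out;
}

// Encode one Unicode code point as UTF-16 (same validity assumption).
inline std::u16string to_utf16(char32_t cp)
{
    std::u16string out;

    if (cp < 0x10000)
    {
        // Basic Multilingual Plane: a single code unit.
        out += static_cast<char16_t>(cp);
    }
    else
    {
        // Supplementary planes: a high and low surrogate pair.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }

    return out;
}
```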
Looked in Unicode Explained from O'Reilly and decided to implement Simple Case Mapping (page 245).
Tried compiling with VC++ 7.1 and was pleased to note that it still compiles OK!
I have fixed warnings highlighted by that compiler and added a default constructor
as well as clear() and reset() functions to match_results. These
also work fine with VC++ 7.1.
The char types for rules and state_machine have now been separated.
This means that you can use char based rules with a char32_t
based state_machine (for example).
I'm now ploughing ahead with Unicode features. More Unicode character sets
have been added, rules and state machines can now cope with char32_t and I have
even added UTF-8 and UTF-16 iterators after being inspired by UTF8-CPP!
(http://utfcpp.sourceforge.net/)
I have decided to definitely allow char based rules with wide char state machines
and then hopefully that will be the end of the Unicode work for now.
Unicode character sets added
(http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values).
Syntax is, for example, \p{Lu}. So far this is still only using
wchar_t, as I'm still waiting for Microsoft to support the new
char types for the 32 bit char stuff.
It's a relief to finally have wide char slicing working correctly!
Added a parameter to the code generator for the lookup function name. Also added a parameter so that the DFA can use pointers directly (according to Anteru's testing this is 10% faster).
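As a rough illustration of the difference between the two DFA styles, here is a hand-built example recognising [0-9]+. This is not the generated code, just a sketch; the tables and the names `index_dfa`, `ptr_dfa`, `match_by_index` and `match_by_pointer` are invented for the example. In the index version each cell holds a state number that must be turned into a row address on every step, whereas in the pointer version each cell already holds the address of the next row.

```cpp
#include <cstddef>
#include <string>

// Column 0 = any other character, column 1 = digit.

// Index version: each cell holds the number of the next state (0 = dead).
const std::size_t index_dfa[3][2] = {
    {0, 0},   // state 0: dead
    {0, 2},   // state 1: start
    {0, 2}    // state 2: accepting
};

inline bool match_by_index(const std::string &s)
{
    std::size_t state = 1;

    for (char c : s)
    {
        state = index_dfa[state][c >= '0' && c <= '9'];
        if (state == 0) return false;
    }

    return state == 2;
}

// Pointer version: each cell points directly at the next row
// (nullptr = dead), so no index arithmetic is needed per step.
struct row
{
    const row *next[2];
    bool accept;
};

const row ptr_dfa[2] = {
    {{nullptr, &ptr_dfa[1]}, false},   // start
    {{nullptr, &ptr_dfa[1]}, true}     // accepting
};

inline bool match_by_pointer(const std::string &s)
{
    const row *state = &ptr_dfa[0];

    for (char c : s)
    {
        state = state->next[c >= '0' && c <= '9'];
        if (!state) return false;
    }

    return state->accept;
}
```

Both functions accept exactly the same strings; any speed difference will depend on the compiler and the input.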
I've generated code using each 'feature' and found and fixed lots of errors.
Also, the skip constant was still defined as std::size_t
which would have completely failed for non std::size_t based
state machines!
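To show the kind of failure involved, here is a small sketch. The helpers `skip_id` and `is_skip` are hypothetical, and the reserved value chosen here is an assumption rather than necessarily the one lexertl uses. The idea is simply that the constant has to be computed in terms of the state machine's own id_type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: a reserved "skip" id expressed in the state
// machine's own id_type (here: all bits set except the lowest).
template<typename id_type>
constexpr id_type skip_id()
{
    return static_cast<id_type>(~static_cast<id_type>(1));
}

// True if a stored token id is the skip marker for that id_type.
template<typename id_type>
constexpr bool is_skip(id_type id)
{
    return id == skip_id<id_type>();
}
```

With a 16 bit id_type the marker is 65534. A std::size_t sized constant squeezed into a 16 bit table cell gets truncated to that same 65534, which then no longer compares equal to the full width std::size_t constant, so skipping would never trigger.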
Next up is traits for lookup (performance) and user data for use by the
boost::spirit lexer.
At the request of Anteru (http://anteru.net/)
I have finally rewritten the C++ (table based)
code generator. I have introduced a 'features' enum and therefore generate only
the necessary code for any given state machine. I am still testing this code so I am not
updating the rss until that is complete. The zip already contains the latest code however.
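The general shape of a features driven generator can be sketched like this. The flag names and the `generate` function are illustrative only and are not the ones used in the library; the emitted text is just placeholder comments standing in for real code fragments.

```cpp
#include <string>

// Hypothetical feature flags, one bit per optional piece of lookup code.
enum feature : unsigned
{
    uses_bol = 1,        // some rule uses ^
    uses_eol = 2,        // some rule uses $
    uses_skip = 4,       // some rule skips its token
    uses_multi_state = 8 // more than one lexer state
};

// Emit only the fragments that the given state machine actually needs.
inline std::string generate(unsigned features)
{
    std::string code = "// core lookup loop\n";

    if (features & uses_bol)
        code += "// beginning-of-line handling\n";

    if (features & uses_eol)
        code += "// end-of-line handling\n";

    if (features & uses_skip)
        code += "// skip token handling\n";

    if (features & uses_multi_state)
        code += "// lexer state switching\n";

    return code;
}
```

For example, `generate(uses_eol | uses_skip)` would produce the core loop plus only the end-of-line and skip fragments.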
Inspired by the work of Roberto Ierusalimschy
(http://www.cs.tufts.edu/~nr/cs257/archive/roberto-ierusalimschy/peg.pdf),
it dawned on me that I could easily add recursive rule support to lexertl (after all
a lexer + stack = parser!) This technique would also be very interesting for a DFA based regex
library (it would be cool to be able to match nested brackets when grepping source code).
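The "lexer + stack" idea can be sketched without any library support at all. The following is a hand-written illustration, not lexertl's API: each bracket is treated as a token that pushes or pops a state, and the function name and state numbering are made up for the example. Note the check for an empty stack before popping.

```cpp
#include <stack>
#include <string>

// Returns true if every '(' in the input is closed by a matching ')'.
// Each '(' pushes the current state and each ')' pops back to it.
inline bool brackets_balanced(const std::string &input)
{
    std::stack<int> states;
    int current = 0; // the nesting depth doubles as the current state

    for (char c : input)
    {
        if (c == '(')
        {
            states.push(current);
            ++current;
        }
        else if (c == ')')
        {
            // A ')' with nothing to pop means the input is unbalanced.
            if (states.empty()) return false;

            current = states.top();
            states.pop();
        }
    }

    // Anything left on the stack is an unclosed '('.
    return states.empty();
}
```

A plain DFA cannot do this on its own, as it has no way of counting arbitrarily deep nesting; the stack is what supplies the memory.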
VC++ 6 support has finally been removed. The code base was actually already
broken on that compiler and there was no obvious way to fix it. I first wrote
a regex library in 2004 when VC++ 6 was still very much in use. With C++11 now
on the horizon, it seems fair enough to move on (and of course
boost dropped support for VC++ 6 ages ago).
I have finally whipped the new version into shape. You can now specify the
id_type for the state_machine if you wish. The default is
std::size_t as before.
The generator is now capable of generating a char_state_machine
directly and in fact will cope with a custom state machine type if you set it up right!
In conjunction with this, char_state_machine doesn't use iterators anymore
as the model didn't really work that well. You can dump state machines as before.
Note that a char_state_machine will store *real character ranges* even in
wchar_t mode, whereas state_machine always slices wide characters,
allowing it to consistently use a 256 entry first phase lookup (most of you won't care about that,
but for anyone doing code generators this is an important distinction).
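For those who do care, a two phase lookup can be sketched roughly as follows for plain char input. This is a hand-built example assuming ASCII, with invented names (`build_lookup`, `dfa`, `is_identifier`), and is not the library's actual table format. Phase one maps each of the 256 possible byte values to a column; phase two uses that column to index the current DFA row.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Phase one: map every byte value to a column (character class).
// Column 0 = other, column 1 = letter, column 2 = digit.
inline std::array<unsigned char, 256> build_lookup()
{
    std::array<unsigned char, 256> lookup{}; // all entries start at 0

    for (int c = 'a'; c <= 'z'; ++c) lookup[c] = 1;
    for (int c = 'A'; c <= 'Z'; ++c) lookup[c] = 1;
    for (int c = '0'; c <= '9'; ++c) lookup[c] = 2;

    return lookup;
}

// Phase two: DFA for [a-zA-Z][a-zA-Z0-9]* (next state 0 = no transition).
const std::size_t dfa[3][3] = {
    {0, 0, 0},   // state 0: dead
    {0, 2, 0},   // state 1: start, a letter moves to state 2
    {0, 2, 2}    // state 2: accepting, letters and digits stay here
};

inline bool is_identifier(const std::string &s)
{
    static const std::array<unsigned char, 256> lookup = build_lookup();
    std::size_t state = 1;

    for (char c : s)
    {
        // Byte -> column (phase one), then (state, column) -> state.
        state = dfa[state][lookup[static_cast<unsigned char>(c)]];
        if (state == 0) return false;
    }

    return state == 2;
}
```

The first phase table is always 256 entries when the input is consumed a byte at a time, which is why slicing wide characters keeps the lookup size fixed.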
The code generator is next in the list to be reworked.
I think I've finally got a proper solution to eol ambiguity resolution
(slapped wrist for not testing my previous solution more thoroughly).
As usual, doing the right thing (TM) was surprisingly easy, although I'm sure
there must be simpler approaches to ^ and $ handling without all this
post processing. I'm now happily in a position to tackle meatier issues, such as a more flexible
generator interface and another ponder on look-ahead. I think I can solve at least
half of the look-ahead problem with some regex syntax tree shenanigans, but I need both
a*/a and a/a* to work correctly to really solve it. I may even be able
to use some more post processing thinking about it. Maybe that's the easiest way.
Spoke to Hartmut and agreed to ditch the boost.lexer.zip and instead only update
that code base in boost's SVN repository. Any boost review is still ages away
and it is a moot point seeing as spirit has been using the library for years
already!
I finally fixed the ambiguity with $ and \n in the
lexertl code base. The boost version will be fixed soon.
I will also start work on some kind of abstract state machine interface with Hartmut so that
generator::build() can produce a char_state_machine directly soon.
As I haven't got around to making changes in preparation for the re2c style
code generator, I have switched lexertl.zip to the latest version which includes
the changes mentioned below. I'm thinking that it is probably better to build to
a char_state_machine directly in the generator class and that
it probably doesn't really need iterators. There will be an option as to whether to
group transitions by state or by char cluster. The latter is needed for a
re2c style code generator.
As I have recently started a revamp of lexertl I have decided to start a blog to keep everybody up to date. As this version is not feature complete yet, I have added a separate zip file which you can find here.
So far I have implemented the following improvements:
wchar_t based state machines (overridable).
lexertl::skip token constant.
^) link a singleton (as it can only occur at the beginning of a token).
debug::dump() now compresses ranges.
This dramatically reduces the list of (easier) features I wanted to add and just leaves the following for the immediate future:
file_iterator
size_t into a templated type for state machine creation.