The lexertl Blog

2011-11-10

Finally had a requirement to create a .NET interface to lexertl. So far I have done the proof of concept at work (C++/CLI) and can now call it from C#. Will see about releasing the source on CodeProject (as the wrapper was developed in work time!) when it is in better shape.

2011-10-27

Agreed with Hartmut to rename unique_id to user_id. The auto counter behaviour is gone and instead this id is optionally settable when adding a rule. This should mean this version is ready for integration with boost::spirit again, bar testing.

2011-10-23

Fixed up the recursive rule handling. Improved the lookup so that leftmost-longest matching is maintained when using recursive rules and added the possibility to push a state other than the current one.

The code generator still needs attention, but a new release is in order regardless.

2011-10-15

I wondered whether it was a retrograde step to make minimise_dfa() a member function of state_machine, but after having tried it I am now convinced that the code is different enough that it is worth it. char_state_machine can now also be minimised.

2011-10-14

Made a change I have been meaning to make for ages: the EOL index is now the last column in each row. This means if $ is not used in the rules then no space is wasted for it in the state machine.

When using recursive rules, lookup now checks whether the stack is empty before attempting to pop states.

I've realised that state machine minimisation should be a member function for each state machine type and currently char_state_machine does not support minimisation. I will look into this along with pushing recursive matching as far as possible. I have started with a simple calculator example in main.cpp. I will mull how far it is logical to push the recursive rules (i.e. are there too many problems introduced by supporting simple recursive grammars?).

2011-09-11

Decided to add iterator conversion from char32_t to UTF-8 and UTF-16 (mainly for seeing Unicode tokens on Linux).
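
For anyone curious, the per code point conversion boils down to something like the sketch below. This is only an illustration of the standard UTF-8 and UTF-16 encodings: the function names to_utf8 and to_utf16 are made up for this post and are not the lexertl iterator interface, and no validation of the input code point is attempted.

```cpp
#include <string>

// Hypothetical helper: encode one code point as 1 to 4 UTF-8 bytes.
std::string to_utf8(char32_t cp)
{
    std::string out;

    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }

    return out;
}

// Hypothetical helper: encode one code point as 1 or 2 UTF-16 units.
std::u16string to_utf16(char32_t cp)
{
    std::u16string out;

    if (cp < 0x10000)
    {
        out += static_cast<char16_t>(cp);
    }
    else
    {
        // Code points above the BMP become a surrogate pair.
        const char32_t v = cp - 0x10000;

        out += static_cast<char16_t>(0xD800 + (v >> 10));
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    }

    return out;
}
```

An output iterator wrapping something like this is enough to dump char32_t tokens to a UTF-8 terminal.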

2011-08-27

Looked in Unicode Explained from O'Reilly and decided to implement Simple Case Mapping (page 245).
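
Simple case mapping means each code point maps to at most one other code point, so it can be done with a sorted table and a binary search. The sketch below shows the idea only: the table is a tiny hand-picked excerpt, and the names case_pair, case_table and simple_to_lower are invented for illustration rather than taken from lexertl.

```cpp
#include <algorithm>
#include <iterator>

struct case_pair
{
    char32_t upper;
    char32_t lower;
};

// Illustrative excerpt only, sorted by upper. A real table would be
// generated from the Unicode character database.
static const case_pair case_table[] =
{
    {0x0041, 0x0061}, // LATIN CAPITAL LETTER A
    {0x00C9, 0x00E9}, // LATIN CAPITAL LETTER E WITH ACUTE
    {0x0391, 0x03B1}, // GREEK CAPITAL LETTER ALPHA
    {0x0410, 0x0430}  // CYRILLIC CAPITAL LETTER A
};

// Returns the simple lower case mapping of cp, or cp itself if the
// table has no entry for it.
char32_t simple_to_lower(char32_t cp)
{
    const case_pair* end = std::end(case_table);
    const case_pair* it = std::lower_bound(std::begin(case_table), end, cp,
        [](const case_pair& p, char32_t v) { return p.upper < v; });

    return (it != end && it->upper == cp) ? it->lower : cp;
}
```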

2011-08-21

Tried compiling with VC++ 7.1 and was pleased to note that it still compiles OK! I have fixed warnings highlighted by that compiler and added a default constructor as well as clear() and reset() functions to match_results. These also work fine with VC++ 7.1.

The char types for rules and state_machine have now been separated. This means that you can use char based rules with a char32_t based state_machine (for example).

2011-08-15

I'm now ploughing ahead with Unicode features. More Unicode character sets have been added, rules and state machines can now cope with char32_t, and I have even added UTF-8 and UTF-16 iterators after being inspired by UTF8-CPP (http://utfcpp.sourceforge.net/)!

I have decided to definitely allow char based rules with wide char state machines and then hopefully that will be the end of the Unicode work for now.

2011-07-30

Unicode character sets added (http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values). The syntax is, for example, \p{Lu}. So far this is still only using wchar_t, as I'm still waiting for Microsoft to support the new char types needed for 32 bit characters.

It's a relief to finally have wide char slicing working correctly!

2011-01-31

Added a parameter to the code generator for the lookup function name. Also added a parameter so that the DFA can use pointers directly (according to Anteru's testing this is 10% faster).

2011-01-24

I've generated code using each 'feature' and found and fixed lots of errors. Also, the skip constant was still defined as std::size_t which would have completely failed for non std::size_t based state machines!
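
To show why that matters, here is a minimal sketch of the problem. The names and the sentinel value (all bits set except the lowest) are chosen purely for illustration and are not lifted from the generated code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical: the skip sentinel expressed in terms of id_type.
template<typename id_type>
struct sentinels
{
    static id_type skip() { return static_cast<id_type>(~1); }
};

// Broken: the constant is always std::size_t wide, so a narrower id
// holding its own skip value never compares equal to it.
template<typename id_type>
bool is_skip_broken(id_type id)
{
    const std::size_t skip = static_cast<std::size_t>(~1);

    return id == skip;
}

// Fixed: the constant has the same width as the ids in the table.
template<typename id_type>
bool is_skip(id_type id)
{
    return id == sentinels<id_type>::skip();
}
```

With a 16 bit id_type the skip value is 0xFFFE, which the std::size_t version silently fails to recognise.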

Next up is traits for lookup (performance) and user data for use by the boost::spirit lexer.

2011-01-23

At the request of Anteru (http://anteru.net/) I have finally rewritten the C++ (table based) code generator. I have introduced a 'features' enum and therefore generate only the necessary code for any given state machine. I am still testing this code so I am not updating the RSS feed until that is complete. The zip already contains the latest code, however.

2010-12-24

Inspired by the work of Roberto Ierusalimschy (http://www.cs.tufts.edu/~nr/cs257/archive/roberto-ierusalimschy/peg.pdf), it dawned on me that I could easily add recursive rule support to lexertl (after all, a lexer + stack = parser!). This technique would also be very interesting for a DFA based regex library (it would be cool to be able to match nested brackets when grepping source code).
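
The nested bracket case gives a feel for how little is needed. The sketch below is a hand-written stand-in, not lexertl code: match_nested is a hypothetical function that scans like a lexer and uses an explicit stack to remember which closing bracket is expected next.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the length of the balanced bracket group that starts at
// text[0], or 0 if there is none.
std::size_t match_nested(const std::string& text)
{
    std::vector<char> stack; // closing brackets still expected

    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char c = text[i];

        if (c == '(') stack.push_back(')');
        else if (c == '[') stack.push_back(']');
        else if (c == '{') stack.push_back('}');
        else if (c == ')' || c == ']' || c == '}')
        {
            // Check the stack is not empty before popping.
            if (stack.empty() || stack.back() != c) return 0;

            stack.pop_back();

            // Outermost bracket closed: the match is complete.
            if (stack.empty()) return i + 1;
        }
        else if (stack.empty())
        {
            // The input must start with an opening bracket.
            return 0;
        }
    }

    return 0; // Ran out of input with brackets still open.
}
```

A plain DFA cannot do this on its own because the nesting depth is unbounded, which is exactly what the stack adds.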

2010-11-13

VC++ 6 support has finally been removed. The code base was actually already broken on that compiler and there was no obvious way to fix it. I first wrote a regex library in 2004 when VC++ 6 was still very much in use. With C++11 now on the horizon, it seems fair enough to move on (and of course boost dropped support for VC++ 6 ages ago).

2010-10-19

I have finally whipped the new version into shape. You can now specify the id_type for the state_machine if you wish. The default is std::size_t as before.

The generator is now capable of generating a char_state_machine directly and in fact will cope with a custom state machine type if you set it up right! In conjunction with this, char_state_machine doesn't use iterators anymore as the model didn't really work that well. You can dump state machines as before. Note that a char_state_machine will store *real character ranges* even in wchar_t mode, whereas state_machine always slices wide characters, allowing it to consistently use a 256 entry first phase lookup. (Most of you won't care about that, but for anyone doing code generators this is an important distinction.)
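
For the code generator people, here is a rough sketch of what slicing means. It is not the actual lexertl data layout: sliced_range and byte_table are hypothetical names, and 16 bit characters are assumed to keep it short. A single wide range is split into a high byte step and a low byte step, so every lookup indexes a 256 entry table.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One table per state: byte value -> next state.
using byte_table = std::array<std::size_t, 256>;

struct sliced_range
{
    enum : std::size_t { dead = 0, start = 1, accept = 2 };

    std::vector<byte_table> states;

    // Build a two step matcher for the real range [first, last].
    sliced_range(char16_t first, char16_t last) :
        states(3, byte_table{})
    {
        const unsigned first_hi = first >> 8;
        const unsigned last_hi = last >> 8;

        for (unsigned hi = first_hi; hi <= last_hi; ++hi)
        {
            byte_table low{};
            const unsigned lo_begin = hi == first_hi ? (first & 0xFFu) : 0u;
            const unsigned lo_end = hi == last_hi ? (last & 0xFFu) : 0xFFu;

            for (unsigned lo = lo_begin; lo <= lo_end; ++lo)
            {
                low[lo] = accept;
            }

            // The high byte selects the state that checks the low byte.
            states[start][hi] = states.size();
            states.push_back(low);
        }
    }

    bool matches(char16_t ch) const
    {
        const std::size_t next = states[start][ch >> 8];

        return next != dead && states[next][ch & 0xFF] == accept;
    }
};
```

A char_state_machine style representation would instead keep the single pair (first, last) and compare against it directly.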

The code generator is next in the list to be reworked.

2010-08-27

I think I've finally got a proper solution to eol ambiguity resolution (slapped wrist for not testing my previous solution more thoroughly). As usual, doing the right thing (TM) was surprisingly easy, although I'm sure there must be simpler approaches to ^ and $ handling without all this post processing. I'm now happily in a position to tackle meatier issues, such as a more flexible generator interface and another ponder on look-ahead. I think I can solve at least half of the look-ahead problem with some regex syntax tree shenanigans, but I need both a*/a and a/a* to work correctly to really solve it. Thinking about it, I may even be able to use some more post processing. Maybe that's the easiest way.

2010-07-28

Spoke to Hartmut and agreed to ditch the boost.lexer.zip and instead only update that code base in Boost's SVN repository. Any boost review is still ages away and it is a moot point anyway, seeing as spirit has been using the library for years already!

I finally fixed the ambiguity with $ and \n in the lexertl code base. The boost version will be fixed soon. I will also start work on some kind of abstract state machine interface with Hartmut so that generator::build() can produce a char_state_machine directly soon.

2010-03-01

As I haven't got around to making changes in preparation for the re2c style code generator, I have switched lexertl.zip to the latest version which includes the changes mentioned below. I'm thinking that it is probably better to build to a char_state_machine directly in the generator class and that it probably doesn't really need iterators. There will be an option as to whether to group transitions by state or by char cluster. The latter is needed for a re2c style code generator.

2009-09-29

As I have recently started a revamp of lexertl I have decided to start a blog to keep everybody up to date. As this version is not feature complete yet, I have added a separate zip file which you can find here.

So far I have implemented the following improvements:

This dramatically reduces the list of (easier) features I wanted to add and just leaves the following for the immediate future: