gram_grep is a search tool that goes far beyond the capabilities of grep. Searches can span multiple lines, can be chained together in a variety of ways, and can even utilise bison style grammars.
Maybe you want a search to ignore comments, or search only within strings. Maybe you have code that has SQL within strings and that SQL itself contains strings that you want to search in. The possibilities are endless and there is no limit to the sequence of sub-searches.
For example, here is how you would search for the text memory_file outside of C and C++ style comments:
gram_grep -Hn -v --flex-regexp "\/\/.*|\/\*(?s:.)*?\*\/" -F memory_file main.cpp
gram_grep allows multiple searches to be pipelined, unlike grep. Because of this, switches such as --ignore-case reset to their defaults each time a pattern is added to the pipeline.
As gram_grep searches can span multiple lines, you can specify --display-whole-match to show the entire match. Should you wish to limit a search to single lines, you can always use -E .+ before your search.
The vast majority of the switches offered by grep are now supported.
The characters &, <, >, ^ and | have a special meaning to the DOS shell. When any of these characters is used outside of a string, it must be escaped by the ^ character.
For example, if you wanted to pass the regexp [^0-9]|\\[0-9], you would pass it as [^^0-9]^|\\[0-9].
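The caret rule is mechanical, so it can be checked with a small script. The following is a minimal sketch in POSIX shell, assuming a hypothetical helper named caret_escape (it is not part of gram_grep). It only models the unquoted case described above.

```shell
# Hypothetical helper (not part of gram_grep): prefix each character that is
# special to the DOS shell (& < > ^ |) with a caret.
caret_escape() {
    printf '%s\n' "$1" | sed 's/[&<>^|]/^&/g'
}

escaped=$(caret_escape '[^0-9]|\\[0-9]')
printf '%s\n' "$escaped"
```

The printed text is what you would type at the DOS prompt for that regexp.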
If you wish to pass a double quote as part of a parameter then the entire parameter must be passed inside double quotes. In order for the double quote to be passed as a literal in this situation, it must be doubled up.
For example, if you wanted to pass the regexp "/*"(?s:.)*?"*/", you would pass it as """/*""(?s:.)*?""*/""".
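The doubling rule can be sketched the same way. This assumes a hypothetical helper named dos_quote (not part of gram_grep) that doubles each embedded double quote and then wraps the result in double quotes.

```shell
# Hypothetical helper (not part of gram_grep): double every embedded double
# quote, then wrap the whole parameter in double quotes.
dos_quote() {
    printf '"%s"\n' "$(printf '%s\n' "$1" | sed 's/"/""/g')"
}

quoted=$(dos_quote '"/*"(?s:.)*?"*/"')
printf '%s\n' "$quoted"
```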
The --dump-argv switch shows exactly what gram_grep receives, should you end up completely baffled!
On Unix-like shells, just use single quotes around your parameters. If you want to pass a single quote as part of your parameter, then terminate the string, escape the single quote, then restart the string.
e.g. for [^'] pass '[^'\'']'
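A quick way to confirm what the shell will hand over is to assign the quoted text to a variable and print it. This is a small sketch in POSIX shell; the variable name pattern is only for illustration.

```shell
# Close the quoted string, add an escaped quote (\'), then reopen the string.
pattern='[^'\'']'
printf '%s\n' "$pattern"    # prints [^']
```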
It quickly gets tedious trying to correctly escape characters in a command shell, so we switch to a configuration file to also exclude strings:
gram_grep -Hn --config=sample_configs/nosc.g main.cpp
The config file nosc.g looks like this:
%%
%%
%%
'([^'\\\r\n]|\\.)*' skip()
\"([^"\\\r\n]|\\.)*\" skip()
R\"\((?s:.)*?\)\" skip()
"//".*|"/*"(?s:.)*?"*/" skip()
memory_file 1
%%
Note how character literals are also skipped, just in case there is a character literal containing a double quote! Also note how we have moved our search for memory_file directly into the config file, as this part of the config lists regexes that are passed to a lexer generator. This means that we specify the things we want to match (using 1 for the id in this case) or explicitly skip (using skip() in this case) all within the same section. This mode alone has already given us far more searching power than traditional techniques.
If we wanted to search only in strings or comments, we would use 1 instead of skip() for those regexes and omit the memory_file line altogether. We would then pass memory_file with -F as a command line parameter, for example.
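As a sketch of that variation, the script below writes such a config to a file. The file name is hypothetical, and leaving character literals as skip() is an assumption made here so that a quote inside a character literal is not mistaken for the start of a string.

```shell
# Hypothetical file name; any path can be passed to --config.
config="${TMPDIR:-/tmp}/only_strings_comments.g"

# The quoted 'EOF' delimiter stops the shell from touching the regexes.
cat > "$config" <<'EOF'
%%
%%
%%
'([^'\\\r\n]|\\.)*' skip()
\"([^"\\\r\n]|\\.)*\" 1
R\"\((?s:.)*?\)\" 1
"//".*|"/*"(?s:.)*?"*/" 1
%%
EOF

# Then chain the fixed-string search after the config, for example:
#   gram_grep -Hn --config="$config" -F memory_file main.cpp
```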
gram_grep now conforms to the grep way of doing recursive searches. This means that if you specify --recursive or --dereference-recursive, you specify directory patterns instead of pathname patterns. See --include etc. if you want to filter on particular file types in these modes.
Note that it is possible to issue a command to check out files from source control:
gram_grep -Hn -r --include="*.csproj" -F v4.5.1 --replace=v4.5.2 --perform-output --checkout="tf.exe checkout $1" .
The above example would replace v4.5.1 with v4.5.2 in *.csproj files, checking out the files from TFS as they match.
Note that there are also switches --startup and --shutdown where you can run other commands at startup and exit respectively if required (e.g., "tf.exe workspace /new /collection:http://... refactor /noprompt" and "tf.exe workspace /delete /collection:http://... refactor /noprompt").
The config file has the following format:
<grammar/lexer directives>
%%
<grammar>
%%
<regexp macros>
%%
<regexes>
%%
As implied above, the grammar/lexer directives, grammar and regexp macros are all optional.
Here is an example of a simple grammar that recognises C++ strings split over multiple lines (strings.g):
/*
NOTE: in order to successfully find strings it is necessary to filter out comments and chars.
As a subtlety, comments could contain apostrophes (or even unbalanced double quotes in
an extreme case)!
*/
%token RawString String
%%
list: String { match = substr($1, 1, 1); };
list: RawString { match = substr($1, 3, 2); };
list: list String { match += substr($2, 1, 1); };
list: list RawString { match += substr($2, 3, 2); };
%%
%%
\"([^"\\\r\n]|\\.)*\" String
R\"\((?s:.)*?\)\" RawString
'([^'\\\r\n]|\\.)*' skip()
[ \t\r\n]+|"//".*|"/*"(?s:.)*?"*/" skip()
%%
Although the grammar is just about as simple as it gets, note the scripting added. Each string fragment is joined into a match, which can then be searched by a following search. This means we can search within C++ strings without worrying about how they are split over lines.
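To make the substr() arguments concrete: the two numbers are the character counts to omit from the left and from the right of the token, so 1, 1 strips the quotes from a normal string and 3, 2 strips R"( and )" from a raw string. The following POSIX shell sketch only emulates that joining; trim_ends is a hypothetical helper and says nothing about how gram_grep implements it.

```shell
# Hypothetical helper: drop `left` characters from the start of the text
# and `right` characters from the end, like substr($n, left, right).
trim_ends() {
    printf '%s\n' "$1" |
        awk -v l="$2" -v r="$3" '{ print substr($0, l + 1, length($0) - l - r) }'
}

# Emulates: match = substr($1, 1, 1);  match += substr($2, 3, 2);
match=$(trim_ends '"Hello, "' 1 1)
match=$match$(trim_ends 'R"(world)"' 3 2)
printf '%s\n' "$match"    # prints Hello, world
```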
Note how we have switched from using 1 as the matching regexp id to names which we have specified using %token and used in the grammar.
Example usage:
gram_grep -Hn --config=sample_configs/strings.g -F grammar main.cpp
The full list of scripting commands is given below. You can see their use in the more sophisticated examples that follow later. $n, $from and $to refer to the item in the production you are interested in (numbering starts at 1).
$n
) specifiers.
format('text', ...); (use {} for format specifiers)
replace_all('text', 'regexp', 'text')
system('text');
erase($n);
erase($from, $to);
erase($from.second, $to.first);
insert($n, 'text');
insert($n.second, 'text');
match = $n;
match = substr($n, <omit from left>, <omit from right>);
match += $n;
match += substr($n, <omit from left>, <omit from right>);
print('text');
replace($n, 'text');
replace($from, $to, 'text');
replace($from.second, $to.first, 'text');
replace_all($n, 'regexp', 'text');
This is a standalone syntax that does not currently support any function nesting or $n within the regexes. It has the following format:
regex_search($n, 'regex') { || regex_search($n, 'regex') }
By default, the entire grammar will match. However, there are times when you are only interested in whether specific parts of your grammar match. If you want to match only on particular grammar rules, use {} just before the terminating semi-colon for that rule. This technique is shown in a later example.
Most of the time, the only grammar/lexer directive you will care about will be %token. However, the following are supported:
Pattern selection and interpretation:
  -E, --extended-regexp         PATTERN is an extended regular expression (ERE)
  -F, --fixed-strings           PATTERN is a set of newline-separated fixed strings
  -G, --basic-regexp            PATTERN is a basic regular expression (BRE)
  -P, --perl-regexp             PATTERN is a Perl regular expression
  -e, --regexp=PATTERN          use PATTERN for matching
  -f, --file=FILE               take PATTERNS from FILE
  -i, --ignore-case             ignore case distinctions
      --no-ignore-case          do not ignore case distinctions (default)
  -w, --word-regexp             force PATTERN to match only whole words
  -x, --line-regexp             force PATTERN to match only whole lines

Miscellaneous:
  -s, --no-messages             suppress error messages
  -v, --invert-match            select non-matching text
  -V, --version                 print version information and exit
      --help                    display this help and exit

Output control:
  -m, --max-count=NUM           stop after NUM matches
  -b, --byte-offset             print the byte offset with output lines
  -n, --line-number             print line number with output lines
      --line-buffered           flush output on every line
  -H, --with-filename           print the filename for each match
  -h, --no-filename             suppress the prefixing filename on output
      --label=LABEL             print LABEL as filename for standard input
  -o, --only-matching           show only the part of a line matching PATTERN
  -q, --quiet, --silent         suppress all normal output
      --binary-files=TYPE       assume that binary files are TYPE;
                                TYPE is 'binary', 'text', or 'without-match'
  -a, --text                    equivalent to --binary-files=text
  -I                            equivalent to --binary-files=without-match
  -d, --directories=ACTION      how to handle directories;
                                ACTION is 'read', 'recurse', or 'skip'
  -r, --recursive               like --directories=recurse
  -R, --dereference-recursive   likewise, but follow all symlinks
      --include=GLOB            search only files that match GLOB (a file pattern)
      --exclude=GLOB            skip files that match GLOB
      --exclude-from=FILE       skip files that match any file pattern from FILE
      --exclude-dir=GLOB        skip directories that match GLOB
  -L, --files-without-match     print only names of FILEs containing no match
  -l, --files-with-matches      print only names of FILEs containing matches
  -c, --count                   print only a count of matches per FILE
  -T, --initial-tab             make tabs line up (if needed)
  -Z, --null                    print 0 byte after FILE name

Context control:
  -B, --before-context=NUM      print NUM lines of leading context
  -A, --after-context=NUM       print NUM lines of trailing context
  -C, --context=NUM             print NUM lines of output context
  -NUM                          same as --context=NUM
      --group-separator=SEP     print SEP on line between matches with context
      --no-group-separator=SEP  do not print separator for matches with context
      --color=[WHEN],
      --colour=[WHEN]           use markers to highlight the matching strings;
                                WHEN is 'always', 'never', or 'auto'

gram_grep specific switches:
      --checkout=CMD            checkout command (include $1 for pathname)
      --config=CONFIG_FILE      search using config file
      --display-whole-match     display a multiline match
      --dump                    dump DFA regexp
      --dump-argv               dump command line arguments
      --dump-dot                dump DFA regexp in DOT format
      --exec=CMD                execute the supplied command
      --extend-search           extend the end of the next match to be the end of the current match
      --flex-regexp             PATTERN is a flex style regexp
      --force-write             if a file is read only, force it to be writable
      --if=CONDITION            make search conditional
      --invert-match-all        only match if the search does not match at all
  -N, --line-number-parens      print line number in parentheses with output lines
      --perform-output          output changes to matching file
  -p, --print=TEXT              print TEXT instead of line of match
      --replace=TEXT            replace match with TEXT
      --return-previous-match   return the previous match instead of the current one
      --shutdown=CMD            command to run when exiting
      --startup=CMD             command to run at startup
      --summary                 show match count footer
      --utf8                    in the absence of a BOM assume UTF-8
  -W, --word-list=PATHNAME      search for a word from the supplied word list
      --writable                only process files that are writable

If an input file has a BOM (byte order mark), then that will be recognised. In the case of UTF-16, the contents will be automatically converted to UTF-8 in memory to allow uniform processing.
Unicode support can be enabled with the --utf8 switch.
Two things happen with this switch enabled:
--config, --flex-regexp). Note that the std::regex support (-E, -G, -P) does not currently support Unicode.
insert.g:
%token INSERT INTO Name String VALUES
%%
start: insert;
insert: INSERT into name VALUES;
into: INTO | %empty;
name: Name | Name '.' Name | Name '.' Name '.' Name;
%%
%%
(?i:INSERT) INSERT
(?i:INTO) INTO
(?i:VALUES) VALUES
\. '.'
(?i:[a-z_][a-z0-9@$#_]*|\[[a-z_][a-z0-9@$#_]*[ ]*\]) Name
'([^']|'')*' String
\s+|--.*|"/*"(?s:.)*?"*/" skip()
%%
The command line looks like this:
gram_grep -Hn -r --include="*.sql" --config=sample_configs/insert.g .
First the string extraction (strings.g):
%token RawString String
%%
list: String { match = substr($1, 1, 1); };
list: RawString { match = substr($1, 3, 2); };
list: list String { match += substr($2, 1, 1); };
list: list RawString { match += substr($2, 3, 2); };
%%
%%
\"([^"\\\r\n]|\\.)*\" String
R\"\((?s:.)*?\)\" RawString
'([^'\\\r\n]|\\.)*' skip()
[ \t\r\n]+|"//".*|"/*"(?s:.)*?"*/" skip()
%%
Or if we wanted to scan C#:
%token String VString
%%
list: String { match = substr($1, 1, 1); };
list: VString { match = substr($1, 2, 1); };
list: list '+' String { match += substr($3, 1, 1); };
list: list '+' VString { match += substr($3, 2, 1); };
%%
ws [ \t\r\n]+
%%
\+ '+'
\"([^"\\\r\n]|\\.)*\" String
@\"([^"]|\"\")*\" VString
'([^'\\\r\n]|\\.)*' skip()
{ws}|"//".*|"/*"(?s:.)*?"*/" skip()
%%
Now the grammar to search inside the strings (merge.g):
%token AS Integer INTO MERGE Name PERCENT TOP USING
%%
merge: MERGE opt_top opt_into name opt_alias USING;
opt_top: %empty | TOP '(' Integer ')' opt_percent;
opt_percent: %empty | PERCENT;
opt_into: %empty | INTO;
name: Name | Name '.' Name | Name '.' Name '.' Name;
opt_alias: %empty | opt_as Name;
opt_as: %empty | AS;
%%
%%
(?i:AS) AS
(?i:INTO) INTO
(?i:MERGE) MERGE
(?i:PERCENT) PERCENT
(?i:TOP) TOP
(?i:USING) USING
\. '.'
\( '('
\) ')'
\d+ Integer
(?i:[a-z_][a-z0-9@$#_]*|\[[a-z_][a-z0-9@$#_]*[ ]*\]) Name
\s+ skip()
%%
The command line looks like this:
gram_grep -Hn -r --include="*.cpp" --config=sample_configs/strings.g --config=sample_configs/merge.g .
Note the use of {} here to specify that we only care when the rule item: Name; matches.
%token Bool Char Name NULLPTR Number String Type
%%
start: decl;
decl: Type list ';';
list: item | list ',' item;
item: Name {};
item: Name '=' value;
value: Bool | Char | Number | NULLPTR | String;
%%
NAME [_A-Za-z][_0-9A-Za-z]*
%%
= '='
, ','
; ';'
true|TRUE|false|FALSE Bool
nullptr NULLPTR
BOOL|BSTR|BYTE|COLORREF|D?WORD|DWORD_PTR Type
DROPEFFECT|HACCEL|HANDLE|HBITMAP|HBRUSH Type
HCRYPTHASH|HCRYPTKEY|HCRYPTPROV|HCURSOR|HDBC Type
HICON|HINSTANCE|HMENU|HMODULE|HSTMT|HTREEITEM Type
HWND|LPARAM|LPCTSTR|LPDEVMODE|POSITION|SDWORD Type
SQLHANDLE|SQLINTEGER|SQLSMALLINT|UINT|U?INT_PTR Type
UWORD|WPARAM Type
bool|(unsigned\s+)?char|double|float Type
(unsigned\s+)?int((32|64)_t)?|long|size_t Type
{NAME}(\s*::\s*{NAME})*(\s*[*])+ Type
{NAME} Name
-?\d+(\.\d+)? Number
'([^'\\\r\n]|\\.)*' Char
\"([^"\\\r\n]|\\.)*\" String
[ \t\r\n]+|"//".*|"/*"(?s:.)*?"*/" skip()
%%
The command line looks like this:
gram_grep -Hn -r --include="*.h" --config=sample_configs/uninit.g .
Note the use of a variety of scripting commands:
%token Integer Name RawString String
%%
start: '(' format list ')' '.' 'str' '(' ')'
/* Erase the first "(" and the trailing ".str()" */
{ erase($1);
erase($5, $8); };
start: 'str' '(' format list ')'
/* Erase "str(" */
{ erase($1, $2); };
format: 'boost' '::' 'format' '(' string ')'
/* Replace "boost" with "std" */
/* Replace the format specifiers within the strings */
{ replace($1, 'std');
replace_all($5, '%(\d+[Xdsx])', '{:$1}');
replace_all($5, '%((?:\d+)?\.\d+f)', '{:$1}');
replace_all($5, '%x', '{:x}');
replace_all($5, '%[ds]', '{}');
replace_all($5, '%%', '%');
erase($6); };
string: String;
string: RawString;
string: string String;
string: string RawString;
list: %empty;
list: list '%' param
/* Replace "%" with ", " */
{ replace($2, ', '); };
param: Integer;
param: name
/* Replace any trailing ".c_str()" calls with "" */
{ replace_all($1, '\.c_str\(\)$', ''); };
name: Name opt_func
| name deref Name opt_func;
opt_func: %empty | '(' opt_param ')';
deref: '.' | '->' | '::';
opt_param: %empty | Integer | name;
%%
%%
\( '('
\) ')'
\. '.'
% '%'
:: '::'
-> '->'
boost 'boost'
format 'format'
str 'str'
-?\d+ Integer
\"([^"\\\r\n]|\\.)*\" String
R\"\((?s:.)*?\)\" RawString
'([^'\\\r\n]|\\.)*' skip()
[_a-zA-Z][_0-9a-zA-Z]* Name
\s+|"//".*|"/*"(?s:.)*?"*/" skip()
%%
The command line looks like this:
gram_grep -Hn --perform-output -r --include="*.cpp" --config=sample_configs/boost_format.g .
This example finds an if statement, its opening parenthesis and its closing parenthesis, and copes with any parentheses nested in between. We introduce the nonsense token anything so that we stop matching directly after the closing parenthesis, and we rely on lexer states to cope with the nesting.
Note the use of the %consume directive to avoid a warning that the token anything is not used by the grammar.
%token if anything
%consume anything
%x PREBODY BODY PARENS
%%
start: if '(' ')';
%%
any (?s:.)
char '([^'\\\r\n]|\\.)+'
name [A-Z_a-z][0-9A-Z_a-z]*
string \"([^"\\\r\n]|\\.)*\"|R\"\((?s:.)*?\)\"
ws [ \t\r\n]+|"//".*|"/*"(?s:.)*?"*/"
%%
<INITIAL>if<PREBODY> if
<PREBODY>[(]<BODY> '('
<PREBODY>.{+}[\r\n]<.> skip()
<BODY,PARENS>[(]<>PARENS> skip()
<PARENS>[)]<<> skip()
<BODY>[)]<INITIAL> ')'
<BODY,PARENS>{string}<.> skip()
<BODY,PARENS>{char}<.> skip()
<BODY,PARENS>{ws}<.> skip()
<BODY,PARENS>{name}<.> skip()
<BODY,PARENS>{any}<.> skip()
{string} anything
{char} anything
{ws} anything
{name} anything
{any} anything
%%
gram_grep -Hn -r --include="*.cpp;*.h" --config=sample_configs/block.g --extend-search --config=sample_configs/var.g --return-previous-match --invert-match-all -F "$1" .
block.g:
// Locate a top level braced block (i.e. function bodies)
// Note that we filter out class, struct and namespace
// in order to match any embedded blocks inside those constructs.
%token Name anything
%x BODY BRACES
%%
start: '{' '}';
%%
any (?s:.)
char '([^'\\]|\\.)+'
name [A-Z_a-z][0-9A-Z_a-z]*
string \"([^"\\]|\\.)*\"|R\"\((?s:.)*?\)\"
ws [ \t\r\n]+|\/\/.*|"/*"(?s:.)*?"*/"
%%
(class|struct|namespace|union)\s+{name}?[^;{]*\{ skip()
extern\s*["]C["]\s*\{ skip()
<INITIAL>\{<BODY> '{'
<BODY,BRACES>\{<>BRACES> skip()
<BRACES>\}<<> skip()
<BODY>\}<INITIAL> '}'
<BRACES,BODY>{string}<.> skip()
<BRACES,BODY>{char}<.> skip()
<BRACES,BODY>{ws}<.> skip()
<BRACES,BODY>{name}<.> skip()
<BRACES,BODY>{any}<.> skip()
{string} anything
{char} anything
{name} anything
{ws} anything
{any} anything
%%
var.g:
%captures
%token Name Keyword String Whitespace
%%
start: Name opt_template Whitespace (Name) opt_ws ';';
opt_template: %empty | '<' name '>';
name: Name | name '::' Name;
opt_ws: %empty | Whitespace;
%%
name [A-Z_a-z]\w*
%%
; ';'
< '<'
> '>'
:: '::'
#{name} Keyword
break Keyword
CExtDllState Keyword
CShellManager Keyword
CWaitCursor Keyword
continue Keyword
delete Keyword
enum Keyword
false Keyword
goto Keyword
namespace Keyword
new Keyword
return Keyword
throw Keyword
VTS_[0-9A-Z_]* Keyword
{name} Name
\"([^"\\\r\n]|\\.)*\" String
R\"\((?s:.)*?\)\" String
\s+ Whitespace
\/\/.* skip()
"/*"(?s:.)*?"*/" skip()
%%
All of these example configs are available in the zip with a .g extension.
There is now a Makefile which will allow you to build on Linux and also a CMakeLists.txt file if you prefer to go that route.