Notes for the 'grep' tool family
================================

The members of the 'grep' family are:

- grep, which uses the most traditional regular expressions, once
  derived from the NFA code in ed (g/re/p);

- egrep, the traditional Unix DFA matching utility;

- fgrep, which searches for a set of fixed strings using the algorithm
  by Aho and Corasick;

- new grep, which understands POSIX.2 regular expressions and uses the
  modified 'Unix regular expression library'. This provides both a
  DFA and a NFA and uses either of one dependent on the pattern.

The first three utilities still use the original Unix algorithms, but
were extended to handle multibyte characters and expressions of arbitrary
length and complexity. This includes an implementation of 'lazy transition
evaluation' for egrep (which was traditionally found in System V, but not
in Unix 32V and BSD Unix).

All search methods are embedded into a newly developed input framework,
which limits the line length by the amount of available memory only.

The combination of these changes and improvements in compiler technology
make these utilities up to twice as fast as the implementations found in
traditional Unix systems.

Performance comparison
======================

All tests were performed on a Intel Celeron 800 system running Linux
2.4.20, glibc 2.3.2, gcc 3.2.2, with a singlebyte locale, using a
completely cached input file of 128 MB size generated by

	od /dev/urandom | cut -d' ' -f2- | nl -w7 -nrz

As the lines of this file contain only digits and spaces, performance
of some of the following tests is different from that encountered on
searches in human or computer language texts. The results thus do not
serve as a benchmark, but highlight characteristical strengths and
weaknesses of the algorithms and their implementations. Measurements
with GNU grep have additionally been included for reference.

Search for a single character that does not occur in input:
--------------------------------------------------------------------
<grep> -c x file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	1.6		1.5		1.7		0.5
user	1.3		1.1		1.4		0.2
sys	0.3		0.2		0.3		0.2

Search for a single character that does often occur in input:
--------------------------------------------------------------------
<grep> -c 0 file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	1.0		1.0		1.2		1.3
user	0.7		0.7		0.9		1.0
sys	0.3		0.2		0.2		0.2

Search for three characters that do not occur in input:
--------------------------------------------------------------------
<grep> -c abc file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	1.6		1.5		1.7		0.6
user	1.3		1.3		1.4		0.4
sys	0.3		0.1		0.2		0.2

Search for a three character sequence that does often occur in input:
--------------------------------------------------------------------
<grep> -c 123 file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	2.9		1.5		1.7		1.0
user	2.7		1.3		1.4		0.7
sys	0.2		0.2		0.2		0.3

Search for a ten-character sequence that does not occur in input:
--------------------------------------------------------------------
<grep> -c abcdefghij file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	1.6		1.5		1.7		0.4
user	1.4		1.2		1.3		0.1
sys	0.2		0.3		0.3		0.2

GNU grep will usually be faster than any of the grep utilities included
here when the expression consists mostly of a fixed character substring,
thanks to its optimization techniques. egrep and new grep show the
characteristic performance of their DFA technique, which effectively
sets an upper time limit on most searches; they can be slightly faster
if the match occurs at early points of many lines. The speed of the NFA
used in traditional grep depends on the characters encountered in input.

The NFA in traditional grep will also run slower with metacharacters
in the expression, as the following tests show:

Search for an expression that starts with a dot:
--------------------------------------------------------------------
<grep> -c .foo file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	12.3		1.5		1.7		0.7
user	12.0		1.2		1.3		0.4
sys	 0.3		0.2		0.3		0.2

Search for bracket expressions:
--------------------------------------------------------------------
<grep> -c '[01][23][45][67]' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	7.9		1.5		1.7		2.3
user	7.5		1.2		1.3		2.0
sys	0.3		0.2		0.3		0.2

NFA stress test:
--------------------------------------------------------------------
<grep> -c '0.*1.*2.*3.*4.*5.*6.*7' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	8:56.1		1.6		1.8		3.2
user	8:54.1		1.3		1.5		2.9
sys	   0.3		0.3		0.3		0.2

But DFA performance can also degrade with some patterns, although
this happens less often in practice. The following sequence of tests
uses patterns of increasing complexity that eventually overflow the
cache used in any of the DFAs:

DFA stress test sequence:
--------------------------------------------------------------------
<grep> -c '.*0....' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	3.0		1.1		1.3		1.9
user	2.7		0.8		1.0		1.6
sys	0.3		0.2		0.3		0.2
--------------------------------------------------------------------
<grep> -c '.*0.....' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	3.1		1.2		2.7		2.0
user	2.8		0.8		2.4		1.7
sys	0.2		0.3		0.3		0.2
--------------------------------------------------------------------

Cache overflow in new grep.

<grep> -c '.*0......' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	3.6		2.1		4.1		2.0
user	3.3		1.8		3.8		1.8
sys	0.3		0.2		0.2		0.2
--------------------------------------------------------------------

Cache overflow in egrep.

<grep> -c '.*0.......' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	3.6		3.5		5.6		2.1
user	3.4		3.1		5.3		1.8
sys	0.2		0.3		0.2		0.2
--------------------------------------------------------------------
<grep> -c '.*0........' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	3.8		5.0		7.6		2.2
user	3.4		4.6		7.2		1.9
sys	0.3		0.3		0.2		0.2
--------------------------------------------------------------------
<grep> -c '.*0.........' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	3.9		6.8		10.1		2.2
user	3.6		6.4		 9.7		1.9
sys	0.2		0.3		 0.2		0.3
--------------------------------------------------------------------
<grep> -c '.*0..........' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	4.1		8.7		12.7		6.5
user	3.8		8.4		12.4		6.1
sys	0.2		0.3		 0.3		0.4
--------------------------------------------------------------------

Cache overflow in GNU grep.

<grep> -c '.*0...........' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	4.2		12.1		17.0		27.9
user	3.9		11.8		16.6		27.3
sys	0.2		 0.3		 0.2		 0.4
--------------------------------------------------------------------
<grep> -c '.*0............' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	4.3		15.1		20.8		2:12.8
user	4.0		14.8		20.5		2:12.1
sys	0.2		 0.2		 0.2		   0.5
--------------------------------------------------------------------
<grep> -c '.*0.............' file
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	5.0		18.7		24.4		11:47.9
user	4.6		18.3		24.0		11:46.6
sys	0.3		 0.2		 0.2		    0.5

GNU grep refills more positions of the cache than actually needed.
This eventually leads to huge performance decreases.

An expression feature that DFAs cannot provide are backreferences
(strictly speaking, a NFA also cannot provide it, but NFA-like search
methods can easily be extended thus). While egrep, which uses a DFA
only, cannot search for this construct, the new grep will use its
NFA here.

--------------------------------------------------------------------
<grep> -c '\(0\)1\1' m
--------------------------------------------------------------------
	grep		egrep		new grep	GNU grep 2.5
--------------------------------------------------------------------
real	14.1				12.4		1:29.6
user	13.8		n/a		12.0		1:28.6
sys	 0.3				 0.3		   0.5

	Gunnar Ritter					5/26/03
