When grep
or sed
are used with the option --extended-regexp
and the pattern {1,9999}
is a part of the regexp that is used, the performance of these commands become low. To be more clear, below are applied few tests.[1] [2]
- The relative performance of
grep -E
,egrep
andsed -E
is almost equal, so only the test that were made withgrep -E
are provided.
Test 1
$ time grep -E '[0-9]{1,99}' < /dev/null
real 0m0.002s
Test 2
$ time grep -E '[0-9]{1,9999}' < /dev/null
> real 0m0.494s
Test 3
$ time grep -E '[0123456789]{1,9999}' < /dev/null > real 21m43.947s
Test 4
$ time grep -E '[0123456789]+' < /dev/null
$ time grep -E '[0123456789]*' < /dev/null
$ time grep -E '[0123456789]{1,}' < /dev/null
$ time grep -P '[0123456789]{1,9999}' < /dev/null
real 0m0.002s
What is the reason of this significant difference of the performance?
Note that it's not the matching that takes time, but the building of the RE. You'll find that it uses quite a lot of RAM as well:
The number of allocs seems roughly proportional to the number of iterations, but the memory allocated seems to grow exponentially.
That's down to how GNU regexps are implemented. If you compile GNU
grep
withCPPFLAGS=-DDEBUG ./configure && make
, and run those commands, you'll see the exponential effect in action. Going deeper than that would mean going through a lot of theory on DFA and dive into the gnulib regexp implementation.Here, you can use PCREs instead that doesn't seem to have the same problem:
grep -Po '[0-9]{1,65535}'
(the maximum, though you can always do things like[0-9](?:[0-9]{0,10000}){100}
for 1 to 1,000,001 repetitions) doesn't take more time nor memory thangrep -Po '[0-9]{1,2}'
.