👨‍💻 Wesley Moore

Platform Support for GNU Extensions to Basic Regular Expressions

Recently I was reviewing some shell script a colleague had written:

if grep -e '@[^@]\+@' "$DIR/install.sh" ; then

I thought the \ before the + was a mistake, and also pointed out that if + was to be used we’d probably need to pass -E for extended regular expression (ERE) support. The colleague replied that \+ in a basic regular expression (BRE) was the same as + in ERE (one or more repetitions).

This was news to me! I wanted to know more, so I turned to the FreeBSD re_format(7) man page. Historically this is where most of my knowledge about the distinction between BREs and EREs came from. There was no mention of it there. I spun up a FreeBSD virtual machine and performed a quick test, which confirmed that \+ did in fact work.

Meanwhile my colleague replied that it turned out \+ was not part of the POSIX specification for BREs and was a GNU extension. The most recent POSIX spec says:

…it is implementation-defined whether \?, \+, and \| each match the literal character ?, +, or |, respectively, or behave as described for the ERE special characters ?, +, and |, respectively

with the additional note:

A future version of this standard may require \?, \+, and \| to behave as described for the ERE special characters ?, +, and |, respectively.

So treating \+ as + is not currently standardised.
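Since the behaviour is implementation-defined, one way to find out what a given grep does is to probe it. The following is a minimal sketch, not a definitive test: the probe function and its argument order are made up for illustration, and it relies only on grep -x (match the whole line), which POSIX specifies. Each pattern is tried against an input that only matches if the escaped character is an operator, then against one that only matches if it is a literal.

```shell
# probe LABEL PATTERN OPERATOR_INPUT LITERAL_INPUT
# (hypothetical helper) Reports how the local grep reads PATTERN as a BRE.
probe() {
    if printf '%s\n' "$3" | grep -x "$2" >/dev/null 2>&1; then
        # The whole line matched the input that needs the ERE-style meaning.
        printf '%s: operator\n' "$1"
    elif printf '%s\n' "$4" | grep -x "$2" >/dev/null 2>&1; then
        # The whole line matched only when read character for character.
        printf '%s: literal\n' "$1"
    else
        printf '%s: unknown\n' "$1"
    fi
}

probe '\+' 'a\+'  'aa' 'a+'
probe '\?' 'ab\?' 'a'  'ab?'
probe '\|' 'a\|b' 'b'  'a|b'
```

The output depends on the platform, which is the point: a grep that implements the GNU extension should report "operator" for all three, while one that treats them as literals should report "literal".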

I was curious about why it worked in FreeBSD grep, but was not mentioned in re_format(7), so poked around the FreeBSD source code. This led me to regcomp.c:

#ifdef LIBREGEX
	} else if (p->gnuext && EATTWO('\\', '?')) {
		INSERT(OQUEST_, pos);
		ASTERN(O_QUEST, pos);
	} else if (p->gnuext && EATTWO('\\', '+')) {
		INSERT(OPLUS_, pos);
		ASTERN(O_PLUS, pos);
#endif

The functionality was introduced in August 2020. The gnuext flag is set unless the REG_POSIX flag is set on the regex, which it is not when grep is in basic mode.

Next I turned to the source of the extension: glibc. Regex syntax is quite customisable in glibc. The definition of basic regexes, RE_SYNTAX_POSIX_BASIC, is:

# define RE_SYNTAX_POSIX_BASIC						\
  (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM | RE_CONTEXT_INVALID_DUP)

and RE_BK_PLUS_QM is:

/* If this bit is not set, then + and ? are operators, and \+ and \? are
     literals.
   If set, then \+ and \? are operators and + and ? are literals.  */
# define RE_BK_PLUS_QM (RE_BACKSLASH_ESCAPE_IN_LISTS << 1)

Digging into the origin of this GNU extension, I found that it’s been present in glibc since at least 1995. I wondered how widespread support for the extension was. The following were my findings:

Conclusion

It sure was fun poking through the code of a bunch of open-source operating systems. It was interesting to see all the implementations, and how widely they varied in readability. TRE in macOS was by far the most difficult to follow. musl was very clear as usual. FreeBSD was more complicated, but still relatively straightforward.

Ultimately the conclusion is that this is a non-standardised extension that is relatively widely supported, but not everywhere. So it is best to explicitly use extended regular expressions via -E or similar when their functionality is desired.
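As a rough sketch of what that looks like for the check that started all this, here are two spellings that stay within the standard: the ERE form with -E, and a BRE form using the interval \{1,\} for "one or more". The helper names and the sample file are hypothetical, and mktemp is assumed to be available (it is common but not part of POSIX).

```shell
# Hypothetical helpers: succeed if the file contains an @NAME@ style placeholder.
has_placeholder_ere() {
    # ERE: + is a standard operator once -E is given.
    grep -E -q '@[^@]+@' "$1"
}

has_placeholder_bre() {
    # BRE: the interval \{1,\} is the standard way to say "one or more".
    grep -q '@[^@]\{1,\}@' "$1"
}

# Sample file standing in for "$DIR/install.sh".
sample=$(mktemp)
printf 'prefix=@PREFIX@\n' > "$sample"

if has_placeholder_ere "$sample" && has_placeholder_bre "$sample"; then
    echo "placeholder found"
fi

rm -f "$sample"
```

Both forms should behave the same on any POSIX-conforming grep, so the choice between them is mostly about readability; the -E version is closer to what most people expect a regex to look like.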
