Recently I was reviewing some shell script a colleague had written:
if grep -e '@[^@]\+@' "$DIR/install.sh" ; then
I thought the \ before the + was a mistake, and also pointed out that if
+ was to be used we’d probably need to pass -E for extended regular
expression (ERE) support. The colleague replied that \+ in a basic regular
expression (BRE) was the same as + in ERE (one or more repetitions).
This was news to me! I wanted to know more, so I turned to the FreeBSD
re_format(7) man page. Historically this is where most of my
knowledge about the distinction between BREs and EREs came from. There was no
mention of it there. I spun up a FreeBSD virtual machine and performed a quick
test, which confirmed that \+ did in fact work.
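A test along these lines can be sketched as follows. This is not the exact command I ran; the helper name `bre_plus_behaviour` is made up for illustration, and it assumes only a POSIX shell and whichever `grep` is first on the PATH.

```shell
#!/bin/sh
# Probe how the local grep treats \+ in a basic regular expression.
# bre_plus_behaviour is a hypothetical helper name, not a standard tool.
bre_plus_behaviour() {
    # If \+ is a repetition operator, ^a\+$ matches a run of a's.
    if printf 'aaa\n' | grep -q '^a\+$'; then
        echo operator
    # If \+ is a literal plus sign, ^a\+$ matches the two characters "a+".
    elif printf 'a+\n' | grep -q '^a\+$'; then
        echo literal
    else
        echo unknown
    fi
}

bre_plus_behaviour
```

Anchoring the pattern with `^` and `$` is what lets the two inputs tell the behaviours apart: an unanchored `a\+` would match `a+` under either interpretation.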
Meanwhile my colleague replied that it turned out \+ was not part of
the POSIX specification for BREs and was a GNU extension. The most recent
POSIX spec says:
…it is implementation-defined whether \?, \+, and \| each match the literal character ?, +, or |, respectively, or behave as described for the ERE special characters ?, +, and |, respectively
with the additional note:
A future version of this standard may require \?, \+, and \| to behave as described for the ERE special characters ?, +, and |, respectively.
So treating \+ as + is not currently standardised.
I was curious about why it worked in FreeBSD grep, but was not mentioned in
re_format(7), so poked around the FreeBSD source code. This led me to
regcomp.c:
#ifdef LIBREGEX
        } else if (p->gnuext && EATTWO('\\', '?')) {
            INSERT(OQUEST_, pos);
            ASTERN(O_QUEST, pos);
        } else if (p->gnuext && EATTWO('\\', '+')) {
            INSERT(OPLUS_, pos);
            ASTERN(O_PLUS, pos);
#endif
The functionality was introduced in August 2020. The gnuext flag
is set unless the REG_POSIX flag is set on the regex, which it is not
when grep is in basic mode.
Next I turned to the source of the extension: glibc. Regex syntax is quite customisable in glibc.
The definition of basic regexes, RE_SYNTAX_POSIX_BASIC, is:
# define RE_SYNTAX_POSIX_BASIC \
(_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM | RE_CONTEXT_INVALID_DUP)
and RE_BK_PLUS_QM is:
/* If this bit is not set, then + and ? are operators, and \+ and \? are
literals.
If set, then \+ and \? are operators and + and ? are literals. */
# define RE_BK_PLUS_QM (RE_BACKSLASH_ESCAPE_IN_LISTS << 1)
Digging into the origin of this GNU extension, I found that it’s been present in glibc since at least 1995. I wondered how widespread support for the extension was. The following were my findings:
- ✅ Chimera Linux (and other musl based distributions)
- ✅ macOS
  - Seems to use a version of TRE from circa 2009.
  - Appears to have gained support for \+ in the Oct 2021 code dump.
  - The corresponding code does not appear to be present in upstream TRE.
- ✅ NetBSD
  - Supports it via sync with FreeBSD in Feb 2021.
- ❌ OpenBSD
- ❌ Illumos
- ✅ Redox OS
  - Uses the posix-regex crate, which does appear to implement the extension.
  - Since 2018.
- ✅ Haiku
  - Supported.
  - Since 2014 via import of gnuregex.
- ❌ SerenityOS
Conclusion
It sure was fun poking through the code of a bunch of open-source operating systems. It was interesting to see all the implementations, and how widely they varied in readability. TRE in macOS was by far the most difficult to follow. musl was very clear as usual. FreeBSD was more complicated, but still relatively straightforward.
Ultimately the conclusion is that this is a non-standardised
extension that is relatively widely supported, but not everywhere. So it is
best to explicitly use extended regular expressions via -E or similar when
their functionality is desired.
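For the line that started all this, a portable rewrite might look like the sketch below. The `sample_input` function and its contents are made up to stand in for "$DIR/install.sh". The first form uses ERE as suggested above; the other two stay in BRE and rely only on constructs POSIX does specify there, namely the interval expression \{1,\} and plain *.

```shell
#!/bin/sh
# Hypothetical sample input standing in for "$DIR/install.sh".
sample_input() {
    printf 'PREFIX=@prefix@\nplain line\n'
}

# ERE: + is a standard repetition operator here.
sample_input | grep -E '@[^@]+@'

# BRE: \{1,\} is the standard spelling of "one or more".
sample_input | grep '@[^@]\{1,\}@'

# BRE: writing the bracket expression once and then starring it
# also means "one or more".
sample_input | grep '@[^@][^@]*@'
```

Each of the three commands should print only the `PREFIX=@prefix@` line, without depending on how the local grep interprets \+.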