Bash: grep with LookBehind and LookAhead to isolate desired text

GNU grep supports Perl-compatible regular expressions (PCRE) through the -P flag, which provides a number of useful features. In this article, I’ll show how LookBehind and LookAhead assertions can give your shell scripts more precise parsing abilities.

For example, consider an XML file “test.xml” with the contents:

<root>
  <path>/my/data</path>
  <paths>/global/data</paths>
</root>

Using the -E extended regular expression flag, you could attempt to isolate the <path> value, but the output always includes the tag itself.

# returns the entire line
$ grep -Eo "<path>(.*)</path>" test.xml
<path>/my/data</path>

# a bit smarter, by taking characters until '<' reached
# but still leaves us with prefixed tag
$ grep -Eo "<path>[^ <]*" test.xml
<path>/my/data

From this imperfect parse, you might use some combination of sed/awk/cut to get to the final value. An easier approach is a LookBehind, which requires the starting tag to be present but leaves it out of the match entirely.

# use non-captured LookBehind to isolate value
$ grep -Po "(?<=path\>)[^<]+" test.xml
/my/data

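The “(?<=...)” syntax is a LookBehind assertion: the text inside it must come immediately before the match, but is not included in the output. In PCRE the “\>” is simply a literal “>”, so the backslash is optional here. This works perfectly for fixed-length prefixes, but if your prefix can vary in length, then continue reading about “\K” below.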
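As a rough sketch of how a fixed-length LookBehind might be used inside a script, the example below pulls a single value out of key=value text. The sample settings and the get_setting helper are hypothetical and only for illustration; it assumes the key contains no regex metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical sample data, standing in for a real config file
sample_config='db_host=10.0.0.5
db_port=5432
db_name=inventory'

# get_setting KEY : prints the value that follows "KEY="
get_setting() {
  local key="$1"
  # (?<=KEY=) requires the prefix but keeps it out of the output.
  # The prefix is a literal string, so its length is fixed.
  printf '%s\n' "$sample_config" | grep -Po "(?<=${key}=)\S+"
}

get_setting db_port
# prints: 5432
```

Because the pattern is not anchored, “db_port=” would also match in the middle of a line, so a real script may want a stricter pattern.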

Limitation of LookBehind

A drawback of LookBehind is that it requires a fixed-length pattern. For example, if we wanted to accommodate “<path>” as well as “<paths>”, it would be tempting to believe we could add “s?” to the end of the tag name.

# fails because LookBehind requires a fixed-length pattern
$ grep -Po "(?<=paths?\>)[^<]+" test.xml
grep: lookbehind assertion is not fixed length

In these situations, you can use “\K” inside the regex, which resets the start of the reported match: everything matched before “\K” is still required, but is dropped from the output.

# use \K to drop a variable-length prefix from the match
# notice both values from the xml are now returned
$ grep -Po "<paths?\>\K[^<]+" test.xml
/my/data
/global/data

# another example of using \K
# pulling out just domain name from either http|https site
$ echo "https://www.google.com/?q=foo" | grep -Po 'https?://\K([^ /?\"])*'
www.google.com
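
The same idea can be sketched for a prefix whose length varies in more than one way. Here the label may be “user” or “username” and may be followed by any amount of whitespace. The sample lines and the extract_users helper are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sample input with prefixes of differing length
sample_log='login user: alice
login user:    bob
logout username: carol'

# extract_users : reads lines on stdin, prints only the name after the label
extract_users() {
  # "user(name)?:\s+" varies in length, so it cannot go inside a LookBehind;
  # \K discards everything matched so far from the reported text.
  grep -Po 'user(name)?:\s+\K\w+'
}

printf '%s\n' "$sample_log" | extract_users
# prints:
# alice
# bob
# carol
```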

LookAhead

LookAhead works in the other direction: a “(?=...)” assertion requires certain text to follow the match, but leaves that text out of the output. Consider an example where the path must end with “/data”, but we are only interested in pulling back the base directory.

# the value must end with "/data" to match,
# but we only want the base directory location output
$ grep -Po "<paths?\>\K[^<]+(?=/data)" test.xml
/my
/global

Unlike LookBehind, a LookAhead assertion can contain a variable-length pattern.
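
To illustrate a variable-length LookAhead, the sketch below combines “\K” with a LookAhead that accepts either of two suffixes of different length. The file names and the extract_versions helper are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical file names; only the first two have an accepted suffix
sample_files='app-1.2.3.tar.gz
app-10.0.zip
app-2.0.exe'

# extract_versions : reads names on stdin, prints the version of each archive
extract_versions() {
  # \K drops the "app-" prefix from the output.
  # (?=...) requires ".tar.gz" or ".zip" at the end of the line
  # without including it, even though the two differ in length.
  grep -Po 'app-\K[0-9.]+(?=\.(tar\.gz|zip)$)'
}

printf '%s\n' "$sample_files" | extract_versions
# prints:
# 1.2.3
# 10.0
```

The “.exe” name produces no output because the LookAhead fails for it, so the line does not match at all.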

 

REFERENCES

rexegg.com, LookAhead and LookBehind

regular-expression.mobi, lookbehind and \K

octoparse.com, regex to match html tags

github.com/fabianlee, bash script examples of LookBehind

NOTES

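Match HTML tags using a capture group and a backreference. Without -o, grep prints each whole matching line, so both child elements of test.xml are returned.

$ grep -P '<(\S*?)[^>]*>(.*)?</\1>' test.xml
  <path>/my/data</path>
  <paths>/global/data</paths>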

grep -P cannot output an individual capture group (with -o it prints only the whole match), but pcregrep can by giving the group number to -o.

$ sudo apt-get install pcregrep -y
$ pcregrep -o2 '<(\S*?)[^>]*>(.*)?</\1>' test.xml
/my/data
/global/data

If you need a variable-length LookBehind, you can brute force it by alternating several fixed-length LookBehinds, but this does not seem worth it when \K does a better job.

$ grep -Po "(?:(?<=http://)|(?<=https://))([^ /?\"])*" greptest.txt