Extended and Perl regular expressions

Introduction

Depending on how much you already know about regular expressions, it may be a good idea to read a tutorial first; a search engine will turn up many of them. The syntax of basic and extended regular expressions (BREs and EREs) is described in IEEE Std 1003.1, 2003 Edition, in the section on regular expressions.

Much has been written about the matching process. This page, however, focuses on searching and replacing.

Basic regular expressions

The first thing we try is a simple search and replace operation with sed:

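$ command | sed 's/ /\&nbsp;/g' > file.html

This translates blanks into the &nbsp; HTML entity. (The ampersand is escaped because an unescaped & in a sed replacement stands for the matched text.) Pretty simple, huh?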
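
One detail worth spelling out: in a sed replacement, an unescaped & stands for the matched text, so the ampersand of the entity has to be written as \&. A small sketch with made-up sample input shows the difference:

```shell
# Sample input is hypothetical; any command output works the same way.
escaped=$(printf 'a b c\n' | sed 's/ /\&nbsp;/g')
echo "$escaped"      # a&nbsp;b&nbsp;c

# Without the backslash, & is replaced by the match itself (a blank),
# so every blank becomes a blank followed by "nbsp;".
unescaped=$(printf 'a b c\n' | sed 's/ /&nbsp;/g')
echo "$unescaped"    # a nbsp;b nbsp;c
```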

Extended regular expressions

If you want to use extended regular expressions, use sed's -r switch (GNU sed; -E is the spelling that BSD sed and current GNU sed both accept). The differences between BREs and EREs are described in the above-mentioned standard document, and I won't repeat them here.
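
To give a feel for the difference, here is the same substitution written both ways. This is only a sketch: it assumes a sed that accepts -E (GNU and BSD sed do; -r is the older GNU spelling), and the input is made up.

```shell
# BRE: grouping and interval operators need backslashes.
bre=$(printf 'aaa bbb\n' | sed 's/\(a\{1,\}\) \(b\{1,\}\)/\2 \1/')
# ERE: the same pattern with bare ( ) and +.
ere=$(printf 'aaa bbb\n' | sed -E 's/(a+) (b+)/\2 \1/')
echo "$bre"   # bbb aaa
echo "$ere"   # bbb aaa
```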

Perl regular expressions

Now let's start with the really interesting stuff: Perl regular expressions. They are a superset of extended regular expressions and are described in the manual pages perlre(1) and the tutorial perlretut(1). Unfortunately, sed doesn't support them. However, super-sed does.

An example: we want to remove any text that is enclosed by <DEL> and </DEL>. To achieve this, we need to turn off greediness. In Perl regular expressions this is done by appending a question mark to the * (or +) operator.

$ ssed -R 's/<DEL>.*?<\/DEL>//g'

Here we are using super-sed with the -R switch, which activates Perl regular expression syntax. The command transforms <DEL>delete me</DEL>foo<DEL>delete me, too</DEL>bar into foobar. Had we omitted the question mark, the * operator would have been greedy and the regular expression would have matched <DEL>delete me</DEL>foo<DEL>delete me, too</DEL> (from the first to the last DEL tag), so the result would have been bar instead of foobar.

Text blocks

Occasionally, you will want to apply a regular expression to more than a single line. In these situations, you need perl. Perl lets you modify a variable called the "input record separator" ($/); see perlvar(1), section $INPUT_RECORD_SEPARATOR. By default, this separator is a single newline, so perl processes the input one line at a time. I've written an example script, which I called extreg (but of course you can name it whatever you like):

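#!/bin/sh
# Usage: extreg <regexp> [<separator>]
if [ $# -gt 1 ]; then
   # Separator given: set $/ to it and process the input record by record.
   perl -we '$/="'"$2"'";while(<>){'"$1"';print $_}'
else
   # No separator: undefine $/ so the whole input is read as one record.
   perl -we 'undef $/;$_=<>;'"$1"';print $_;'
fi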

The syntax is easy: extreg <regexp> [<separator>]

regexp is one or more Perl substitution expressions (such as s/.../.../), separated by semicolons.

separator is an optional input record separator.

If you want to be portable, you should invoke perl directly. extreg is primarily meant for use in your private scripts or in an interactive shell.

By default, extreg applies the regular expression(s) to the whole input. You may, however, have to use the "s" option to make dots match newlines as well.

extreg 's/<#DEL>.*?<#\/DEL>/\[...\]/gs'

transforms:

<#DEL>This paragraph isn't visible.
It's removed with one single regular expression.<#/DEL>

This is <#DEL>quite<#/DEL> convenient.

into:

[...]

This is [...] convenient.

Some examples of how to use the separator:

extreg 'expression' '\n'

works line-by-line and

extreg 'expression' -

always reads up to the next hyphen.

extreg 'expression' ''

reads paragraphs (see perlvar(1), section $INPUT_RECORD_SEPARATOR, and perlfaq5(1), section How can I read in a file by paragraphs?, for more detailed information), whereas

extreg 'expression'

processes the whole text in one go.

The separator is especially important when processing amounts of data that don't fit into memory, or when data arrives slowly and extreg should not wait until all of it has arrived.

Take care to set an appropriate separator when using extreg in place of sed. Without one, extreg reads the whole input before producing any output, and anchors such as ^ and $ no longer match at every line, so an expression that worked in sed may stop working.

Some other gimmicks

With extreg you can use Perl code. For example, the following expression replaces the word "SECONDS" with the number of seconds elapsed since 1970:

extreg 's/SECONDS/time/ge'

(Note the e-option.) By using perl code, you can even nest regular expressions:

extreg 's{<#NBSP>(.*?)<#/NBSP>}{$_=$1;s/ /&nbsp;/g;$_}gse'

(Please note that the patterns are enclosed in curly braces instead of being separated by slashes.) Use this feature with caution, because such expressions quickly become unreadable and hard to debug.

Felix Wiemann <Felix.Wiemann@ososo.de>