Forum : The BROKeN BuBBLe

Regular Expressions

From Angus McLeod@VERT/ANJO to All on Monday, June 19, 2006 14:44:00

A while back we discussed Regular Expressions, and saw how they could be
used in PHP, JavaScript and Perl to SEARCH through text, and optionally,
to SUBSTITUTE text in the same way we might use a good text editor to FIND
and REPLACE. We learned that a Regular Expression is a sequence of
characters that must match exactly, possibly including META-CHARACTERS
that have special meaning and control how the RE will match.

Quick Recap
===========
Let's remember what meta-characters we know about so far:

/ Opens (and closes) the RE.
^ Match the beginning of the line.
$ Match the end of the line.
. Match any character (except newline).
[] Introduces a character class.

Quantifiers:

* Matches ZERO or more instances of preceding character.
+ Matches ONE or more instances of preceding character.
{} Specifies exact number or range of instances to match.

Within Character Classes:

- Specify range of characters (unless first in character class).
^ Negate character class

And last but not least:

\ Escapes the significance of other characters including itself.

Two additional points we want to remember:

Quantifiers themselves match nothing. So * does not match any
characters, and DOSish people may have a hard time remembering this at
first. The quantifiers only specify how many of the PRECEDING
character to match.

RE's are greedy. The RE /Syn.*net/ might match as follows:

'Mostly Synchronet software is used on the DOVE-net support network.'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The .* greedily gobbled up as much of the text as possible, while still
making a match.

Greed
=====
Usually, greedy quantifiers work just fine. But occasionally, we want a
match to be made in such a way that the quantifier is NOT greedy. We can
apply a meta-character to the quantifier to alter its greediness! We use
the ? meta-character in this case, and place it immediately after the
quantifier we do not want to be greedy. By utilizing this technique, the
RE /Syn.*?net/ is made UN-greedy. It would now match

'Mostly Synchronet software is used on the DOVE-net support network.'
        ^^^^^^^^^^
which is most likely what we intended in the first place!
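
Here is a minimal sketch of the greedy versus un-greedy behaviour. It uses Python's re module rather than Perl (my choice, not something from the discussion above), but the pattern syntax is the same for this case. The variable names are just for illustration.

```python
import re

text = "Mostly Synchronet software is used on the DOVE-net support network."

# Greedy: .* gobbles as much as it can while still allowing a match,
# so the match runs all the way to the "net" in "network".
greedy = re.search(r"Syn.*net", text)

# Un-greedy: .*? takes as little as it can, so the match stops at
# the first "net" it reaches.
lazy = re.search(r"Syn.*?net", text)

print(greedy.group())
print(lazy.group())
```

The only difference between the two patterns is the ? after the quantifier.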

New Meta-Characters
===================
Time to learn a few more!

The | meta-character allows us to specify alternation. Say we were
searching for one of the Stooges. We could use /larry|curly|moe/i which
will match any one of the three alternatives. The first alternative begins
at the previous pattern delimiter (or the start of the RE) and the last
alternative ends at the next pattern delimiter. Note that | has no
special meaning in a character class, so [larry|curly|moe] is just a
character class and equivalent to [larycumoe|]. Note the 'i' flag at the
end of the RE which we recall makes the match case-INsensitive.
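
A short sketch of alternation versus the look-alike character class, again in Python's re module (an assumption on my part; re.IGNORECASE plays the role of the 'i' flag). The sample strings are made up.

```python
import re

# Alternation: matches any one of the three names, in any letter case.
stooge = re.compile(r"larry|curly|moe", re.IGNORECASE)

# NOT alternation: inside [] the | is an ordinary character, so this
# matches exactly one character out of l, a, r, y, c, u, m, o, e or |.
char_class = re.compile(r"[larry|curly|moe]")

found = stooge.search("Where did MOE go?")
names = stooge.findall("Larry poked Curly")
class_hits = char_class.findall("a|b")

print(found.group(), names, class_hits)
```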

You can use () meta-characters to introduce grouping. By placing multiple
characters and meta-characters into a group, you can apply quantifiers to
them as a whole. Suppose you were searching through an HTML document for
places where three <br> tags occurred together. (I know <br> is
old-school. Shaddup!) We could use a RE something like /<br><br><br>/i
or we could do /(<br>){3}/i. The latter is not only shorter, but it is
easily extended. Suppose we want to match three OR MORE similar tags?
/(<br>){3,}/i does the trick for us, and solves a problem that we wouldn't
easily be able to solve without grouping! The () group the four characters
<, b, r, > together so that the quantifier {3,} can be applied to the
entire group.
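
To see a quantifier applied to a whole group, here is a small sketch in Python's re module (my choice of language; the HTML snippet is invented for the example). It looks for runs of three or more <br> tags.

```python
import re

# The () group the four characters <, b, r, > so {3,} applies to all of them.
three_or_more = re.compile(r"(<br>){3,}", re.IGNORECASE)
exactly_three = re.compile(r"(<br>){3}", re.IGNORECASE)

html = "one<br><br>two<BR><br><Br><br>three"

run = three_or_more.search(html)   # skips the pair after "one"
first3 = exactly_three.search(html)

print(run.group())
print(first3.group())
```

Without the group, /<br>{3,}/ would apply the quantifier to the > alone, which is a different pattern entirely.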

The () grouping meta-characters also work well as pattern delimiters for
alternations within a larger pattern. If we were looking for one of the
Stooges as a part of a larger RE, we would need some way to indicate the
start of the first alternative and the end of the last. We could use
/Comedian (Larry|Curly|Moe) is a Stooge/
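
A quick sketch of why the () matter here, using Python's re module (an assumption; the test sentences are invented). Without the group, the alternatives stretch to the ends of the whole RE.

```python
import re

grouped = re.compile(r"Comedian (Larry|Curly|Moe) is a Stooge")

# Without (), the three alternatives are "Comedian Larry", "Curly"
# and "Moe is a Stooge", which is probably not what was meant.
ungrouped = re.compile(r"Comedian Larry|Curly|Moe is a Stooge")

hit = grouped.search("Comedian Curly is a Stooge")
miss = grouped.search("Curly went home")
oops = ungrouped.search("Curly went home")

print(hit.group(1), miss, oops.group())
```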

The () meta-characters also allow us to create capture-buffers and do some
neat back-references. More on this later.

Useful Escapes
==============
Ok, we can do some clever stuff with character classes, like /[0-9]/ to
match any digit and /[a-z]/ to match any letter and so forth. Some of
this is so common, RE's have special 'shorthand' or escapes. The main
ones are:

\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character

And some useful zero-width assertions, such as:

\b Match a word boundary
\B Match a non-(word boundary)

These last two don't match a character -- they match the spot BETWEEN two
characters, providing the characters on each side of that spot conform.
A \b matches the spot between a \w and a \W (in any order).
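
A few of these escapes in action, sketched with Python's re module (my choice of language; it supports the same \w, \s, \d, \b and \B escapes). The sample strings are made up.

```python
import re

text = "cat concatenate cat5 bobcat"

# \b on both sides: only the stand-alone word "cat" qualifies.
whole_words = re.findall(r"\bcat\b", text)

# \B on both sides: only a "cat" buried inside a longer word qualifies.
inside_words = re.findall(r"\Bcat\B", text)

numbers = re.findall(r"\d+", "Node 10 has 325 users")
words = re.findall(r"\w+", "user_name@bbs")     # "_" counts as a word character
fields = re.split(r"\s+", "one  two\tthree")    # spaces and tabs are both \s

print(whole_words, inside_words, numbers, words, fields)
```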

Capture Buffers
===============
We have some text which includes dates in that irrational MM/DD/YYYY
format preferred by some nations. We want to convert it to a sensible
YYYY-MM-DD format. So we will use a RE of /\d\d[-\/]\d\d[-\/]\d\d\d\d/
which should match against text in the correct format. Let's note a few
things.

First, we match a date separator of - or / since either might appear in
our text. We use a character class [-\/] for the separator. The - does
not indicate a range of characters, because it comes at the beginning of
the class. We have to escape the / because that is the same character we
use to delimit our RE. So our class becomes [-\/]. But note that some
languages (Perl) allow the use of alternate characters for the RE
delimiter, so we could have written !\d\d[-/]\d\d[-/]\d\d\d\d! with ! as
our RE delimiter, making it unnecessary to escape the / in the character
class.

Second, having MATCHED our RE, how do we substitute the new date format
for the old? We need to pick up the numbers and swap them around! So we
introduce () grouping meta-characters for their capture-buffer properties.
s!(\d\d)[-/](\d\d)[-/](\d\d\d\d)!$3-$1-$2! does the trick.

We've added () capture-buffers for each of the three sets of digits in our
MATCH, and used $1, $2 and $3 to refer to the contents of those buffers in
our REPLACE. The numbers, of course, refer to the buffers in the order in
which they are found. If we applied this substitution to text containing
the date 06/19/2006, it should be replaced with the date 2006-06-19.

Note there is a potential problem here. The date separators are matched
with the identical character classes [-/]. The classes are identical, but
it doesn't mean they will match identical things. The text might contain
06-19/2006 which is probably NOT a date, since the two separator
characters differ. It would still match our RE though.
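
The substitution can be sketched in Python's re module as follows. Python is my choice here, not something from the original discussion, and one difference matters: Python's replacement string refers to capture buffers as \1, \2, \3 where Perl uses $1, $2, $3. The helper name to_iso is hypothetical.

```python
import re

# Same pattern as the Perl s!...! above: three capture buffers,
# with either - or / accepted at each separator position.
date_re = re.compile(r"(\d\d)[-/](\d\d)[-/](\d\d\d\d)")

def to_iso(text):
    """Rewrite MM/DD/YYYY (or MM-DD-YYYY) dates as YYYY-MM-DD."""
    return date_re.sub(r"\3-\1-\2", text)

print(to_iso("Posted on 06/19/2006."))

# The potential problem: mixed separators still match.
print(to_iso("06-19/2006"))
```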

Backreferences
==============
We used $1, $2, etcetera to refer to the contents of the capture-buffers
AFTER the match. When we refer to the contents of a capture-buffer IN the
match, it's called a backreference, and they look slightly different. We
use \1, \2, etcetera instead of $1, $2. With this in mind, we re-write
our substitution. It looks like this:

s!(\d\d)([-/])(\d\d)\2(\d\d\d\d)!$4-$1-$3!
        ^^^^^^      ^^
We have added a () around the first separator and used \2 to match the
second separator with a backreference to the first. Now the first
separator is matched against the character class as before, but
whatever matches (a - or a /) is put into the (second) capture buffer.
The second separator must now match the same character that matched the
first separator. In this way, either - or / can be used as a
separator, but both separators must be the same!

In this example we have seen () used for capturing buffers for use
after the match, and for backreferences during the match itself.
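
Here is the backreference version sketched in Python's re module (my choice of language). Inside the pattern, \2 works the same as in Perl; in the replacement, Python writes \4-\1-\3 where Perl writes $4-$1-$3. The helper name to_iso_strict is hypothetical.

```python
import re

# Buffer 2 captures the first separator; \2 then insists the second
# separator is the very same character.
strict_re = re.compile(r"(\d\d)([-/])(\d\d)\2(\d\d\d\d)")

def to_iso_strict(text):
    """Convert dates only when both separators are identical."""
    return strict_re.sub(r"\4-\1-\3", text)

print(to_iso_strict("06/19/2006"))   # converted
print(to_iso_strict("06-19-2006"))   # converted
print(to_iso_strict("06-19/2006"))   # left alone: separators differ
```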

Usefulness
==========
So all this RE talk is just fine, but how exactly can it be put to good
use? Because of the flexibility with which they can match text and the
ease with which matched text can be rearranged, RE's are great for
processing text in bulk to extract nuggets of useful information. Text in
bulk like say LOG files....

Who has been logging on to your board? And from where?

The RE /^\+\+\s+\((\d{4})\)\s+(.*?)\s+Logon/ matched against the
logfiles in /sbbs/data/logs should select logons, and place the user
number and user id in the two capture buffers. Here's how the RE
breaks down:

    /           # start the RE
    ^\+\+       # ++ at the start of the line
    \s+         # one or more whitespace
    \(          # a literal left bracket, escaped
    (\d{4})     # exactly four digits, placed in capture buffer #1
    \)          # a literal right bracket, escaped
    \s+         # one or more whitespace
    (.*?)       # any number of any characters ungreedily
                # matched, and placed in capture buffer #2
    \s+         # one or more whitespace
    Logon       # The word "Logon"
    /x          # end the RE with the 'x' flag

Note that we have escaped those characters that appear literally in the
input text, but will be taken for meta-characters in our RE (like the ++
at the beginning of the line and the () around the user number).

Also note that we end the RE with the 'x' flag. So WTF is that? The 'x'
flag in Perl allows us to break up our RE by adding additional whitespace
and including comments. In fact, you could cut'n'paste that RE exactly as
above, including whitespaces and comments, directly into a Perl program,
where it would work without modification and would be much easier to read.
(Of course, only a VERY wimpy Perl programmer would dream of doing any
such thing...)
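
For comparison, Python's re module has a counterpart to the 'x' flag called re.VERBOSE. Below is a sketch of the same commented RE in that form. The sample log line is hypothetical: I made it up to fit the pattern, and a real Synchronet log line may be laid out differently.

```python
import re

logon_re = re.compile(r"""
    ^\+\+       # ++ at the start of the line
    \s+         # one or more whitespace
    \(          # a literal left bracket, escaped
    (\d{4})     # exactly four digits, capture buffer 1 (user number)
    \)          # a literal right bracket, escaped
    \s+         # one or more whitespace
    (.*?)       # anything, matched ungreedily, capture buffer 2 (user id)
    \s+         # one or more whitespace
    Logon       # the word "Logon"
""", re.VERBOSE)

# Hypothetical log line, invented for this example.
sample = "++ (0001) Digital Man  Logon 14:44"

m = logon_re.search(sample)
user_number, user_id = m.group(1), m.group(2)
print(user_number, user_id)
```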

So here's a Linux command you should be able to run and see who has
been using your system and the frequency of use:

perl -ne "/^\+\+\s+\(\d{4}\)\s+(.*)\s+Logon/ && print \$1.qq(\n)" \
*.log|sort|uniq -c

Here's another:

perl -ne "/^\@\+.*\[(.*)\]/ and print \$1.qq(\n)" *.log|sort|uniq -c

that will show the IP addresses from which connections have been made.
Windows people can do EXACTLY the same, providing they install Perl
and the Unix Utilities (to get the 'uniq' utility).

http://www.activestate.com/Products/ActivePerl/
http://unxutils.sourceforge.net/

These same RE's could be used in PHP or JavaScript programs to produce
the same sort of results.
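
Python works too. Here is a rough sketch that does the job of the two one-liners above, with collections.Counter standing in for sort|uniq -c. The tally helper and every sample log line are hypothetical, invented to fit the two REs. Check them against your own logs before trusting the output.

```python
import re
from collections import Counter

LOGON_RE = re.compile(r"^\+\+\s+\(\d{4}\)\s+(.*)\s+Logon")
IP_RE = re.compile(r"^@\+.*\[(.*)\]")

def tally(lines, pattern):
    """Count how often each value of capture buffer 1 appears."""
    counts = Counter()
    for line in lines:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical log lines; real ones would be read from the *.log files.
sample_lines = [
    "@+ 14:40 Telnet connection accepted [192.0.2.10]",
    "++ (0001) Angus Logon 14:41",
    "@+ 15:02 Telnet connection accepted [192.0.2.77]",
    "++ (0002) Digital Man Logon 15:03",
    "@+ 16:30 Telnet connection accepted [192.0.2.10]",
    "++ (0001) Angus Logon 16:31",
]

logons = tally(sample_lines, LOGON_RE)
addresses = tally(sample_lines, IP_RE)
print(logons.most_common())
print(addresses.most_common())
```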

Conclusion
==========
Regular Expressions are a powerful force in the world of data analysis. We
can use them to manipulate our logfiles (including our web logs) and
extract valuable data. This data could later be presented in tabular form
as a BBS bulletin, or plotted graphically for display via the webserver.

Used to process message text they can help identify things of interest,
add Ctrl-A colour highlights to key words or perhaps to implement
profanity filters (ugh!).

External 'Door' programs have an unfortunate habit of generating ugly
bulletins in weirdo formats. Regular Expressions used as a part of a
simple pre-processor program can strip these bulletins down to basics,
rearrange the pertinent data in any way you like, and re-colourize to
suit the theme of your BBS. Texts enhanced with colour codes from an old
BBS can be filtered through a script that matches the old colour codes and
replaces them with the Synchronet equivalent.

Regular Expressions can make a data analysis script or program much easier
to code. The average BBS has masses of data that is subject to analysis,
so Regular Expressions have a valuable role to play in the management of
your BBS.

---
Playing: "Windmills" by "Toad The Wet Sprocket"
from the "Dulcinea" album
* Synchronet * Programatically generated on The ANJO BBS
