• Regular Expressions

    From Angus McLeod@VERT/ANJO to All on Monday, June 19, 2006 14:44:00
    A while back we discussed Regular Expressions, and saw how they could be
    used in PHP, JavaScript and Perl to SEARCH through text, and optionally,
    to SUBSTITUTE text in the same way we might use a good text editor to FIND
    and REPLACE. We learned that a Regular Expression is a sequence of
    characters that must match exactly, possibly including META-CHARACTERS
    that have special meaning and control how the RE will match.

    Quick Recap
    ===========
    Let's remember what meta-characters we know about so far:

    / Opens (and closes) the RE.
    ^ Match the beginning of the line.
    $ Match the end of the line.
    . Match any character (except newline).
    [] Introduces a character class.

    Quantifiers:

    * Matches ZERO or more instances of preceding character.
    + Matches ONE or more instances of preceding character.
    {} Specifies exact number or range of instances to match.

    Within Character Classes:

    - Specify range of characters (unless first in character class).
    ^ Negate character class

    And last but not least:

    \ Escapes the significance of other characters including itself.

    Two additional points we want to remember:

    Quantifiers themselves match nothing. So * does not match any
    characters, and DOSish people may have a hard time remembering this at
    first. The quantifiers only specify how many of the PRECEDING
    character to match.
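
    Here is a small sketch of the three quantifiers at work. It is written
    in JavaScript (one of the languages mentioned at the top) rather than
    Perl, and the sample strings and variable names are invented purely for
    illustration.

```javascript
// Quantifiers apply to the item immediately before them (here, the 'b').
const zeroOrMore = /ab*c/;   // 'a', then zero or more 'b', then 'c'
const oneOrMore = /ab+c/;    // 'a', then one or more 'b', then 'c'
const exactlyTwo = /ab{2}c/; // 'a', then exactly two 'b', then 'c'

const samples = ["ac", "abc", "abbc"];
const quantifierResults = samples.map((text) => ({
  text,
  star: zeroOrMore.test(text),
  plus: oneOrMore.test(text),
  two: exactlyTwo.test(text),
}));
console.log(quantifierResults);

// Unlike a DOS wildcard, * on its own does not stand for "anything".
// In /^a*c$/ it only repeats the 'a', so the 'b' in "abc" is never matched.
const dosStyle = /^a*c$/.test("abc");
// To get the DOS meaning, give * something to repeat: . (any character).
const reStyle = /^a.*c$/.test("abc");
console.log(dosStyle, reStyle);
```

    The last two lines are the DOSish trap in miniature: the first test
    fails and the second succeeds.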

    RE's are greedy. The RE /Syn.*net/ might match as follows:

    'Mostly Synchronet software is used on the DOVE-net support network.'
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    The .* greedily gobbled up as much of the text as possible, while still
    making a match.


    Greed
    =====
    Usually, greedy quantifiers work just fine. But occasionally, we want a
    match to be made in such a way that the quantifier is NOT greedy. We can
    apply a meta-character to the quantifier to alter its greediness! We use
    the ? meta-character in this case, and place it immediately after the
    quantifier we do not want to be greedy. By utilizing this technique, the
    RE /Syn.*?net/ is made UN-greedy. It would now match:

    'Mostly Synchronet software is used on the DOVE-net support network.'
            ^^^^^^^^^^
    which is most likely what we intended in the first place!
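
    The two matches above can be checked with a short JavaScript sketch. The
    variable names are my own; the sentence and both RE's are the ones used
    in the text.

```javascript
const sentence =
  "Mostly Synchronet software is used on the DOVE-net support network.";

// Greedy: .* takes as much as it can while still letting 'net' match.
const greedyMatch = sentence.match(/Syn.*net/)[0];

// Un-greedy: .*? takes as little as it can before 'net' matches.
const lazyMatch = sentence.match(/Syn.*?net/)[0];

console.log(greedyMatch);
console.log(lazyMatch);
```

    The greedy version runs all the way to the "net" in "network", while the
    un-greedy version stops at the end of "Synchronet".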


    New Meta-Characters
    ===================
    Time to learn a few more!

    The | meta-character allows us to specify alternation. Say we were
    searching for one of the Stooges. We could use /larry|curly|moe/i which
    will match any one of the three alternates. The first alternative begins
    at the preceding pattern delimiter (or the start of the RE) and the last
    alternative ends at the next pattern delimiter (or the end of the RE).
    Note that | has no special meaning in a character class, so
    [larry|curly|moe] is just a character class and equivalent to
    [larycumoe|]. Note the 'i' flag at the end of the RE which we recall
    makes the match case-INsensitive.
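
    To see the difference between alternation and a look-alike character
    class, here is a hedged JavaScript sketch. The test strings (including
    poor Shemp) and the variable names are made up for the example.

```javascript
const stooge = /larry|curly|moe/i;          // one of three whole names
const notAlternation = /[larry|curly|moe]/; // any ONE of: l a r y | c u m o e

// Alternation finds a whole name, in any mix of case thanks to the i flag.
const foundStooge = "Look out, CURLY!".match(stooge)[0];

// "Shemp" is not one of the three alternates...
const shempIsStooge = stooge.test("Shemp");

// ...but the character class is satisfied by a single letter from its set.
const shempHitsClass = notAlternation.test("Shemp");
const classMatch = "Shemp".match(notAlternation)[0];

console.log(foundStooge, shempIsStooge, shempHitsClass, classMatch);
```

    The character class matches "Shemp" on the strength of a single "e",
    which is rarely what anyone searching for a Stooge had in mind.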

    You can use () meta-characters to introduce grouping. By placing
    multiple characters and meta-characters into a group, you can apply
    quantifiers to them as a whole. Suppose you were searching through an
    HTML document for places where three <br> tags occurred together. (I
    know <br> is old-school. Shaddup!) We could use a RE something like
    /<br><br><br>/i or we could do /(<br>){3}/i. The latter is not only
    shorter, but it is easily extended. Suppose we want to match three OR
    MORE similar tags? /(<br>){3,}/i does the trick for us, and solves a
    problem that we wouldn't easily be able to solve without grouping! The
    () group the four characters <, b, r, > together so that the quantifier
    {3,} can be applied to the entire group.
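
    A quick JavaScript sketch of the grouped quantifier, using an invented
    scrap of HTML. The last test shows what the quantifier does when the
    group is left out.

```javascript
const threeOrMore = /(<br>){3,}/i;

// Four tags in a row (in mixed case), then a lone one later on.
const html = "<p>one<br><BR><br><br>two<br>three</p>";
const run = html.match(threeOrMore)[0];

// Only two tags together: not enough for {3,}.
const tooFew = threeOrMore.test("a<br><br>b");

// Without the (), {3} applies to the '>' alone, so this RE wants
// "<br" followed by three '>' characters.
const ungroupedOnTags = /<br>{3}/i.test("<br><br><br>");
const ungroupedOnOddity = /<br>{3}/i.test("<br>>>");

console.log(run, tooFew, ungroupedOnTags, ungroupedOnOddity);
```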

    The () grouping meta-characters also work well as pattern delimiters for
    alternations within a larger pattern. If we were looking for one of the
    Stooges as a part of a larger RE, we would need some way to indicate the
    start of the first alternate and the end of the last. We could use:

    /Comedian (Larry|Curly|Moe) is a Stooge/
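
    Why the () matter here can be shown with a small JavaScript sketch. The
    variable names and test strings are mine. Without the group, the
    alternates stretch all the way to the ends of the RE.

```javascript
const grouped = /Comedian (Larry|Curly|Moe) is a Stooge/;

// No group: the three alternates are now
// "Comedian Larry", "Curly", and "Moe is a Stooge".
const ungroupedAlt = /Comedian Larry|Curly|Moe is a Stooge/;

const groupedFull = grouped.test("Comedian Curly is a Stooge");
const groupedBare = grouped.test("Curly");
const ungroupedBare = ungroupedAlt.test("Curly");

console.log(groupedFull, groupedBare, ungroupedBare);
```

    The grouped RE insists on the whole sentence, while the ungrouped one is
    happy with a bare "Curly".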

    The () meta-characters also allow us to create capture-buffers and do
    some neat back-references. More on this later.


    Useful Escapes
    ==============
    Ok, we can do some clever stuff with character classes, like /[0-9]/ to
    match any digit and /[a-z]/ to match any lowercase letter and so forth.
    Some of this is so common, RE's have special 'shorthand' or escapes. The
    main ones are:

    \w Match a "word" character (alphanumeric plus "_")
    \W Match a non-"word" character
    \s Match a whitespace character
    \S Match a non-whitespace character
    \d Match a digit character
    \D Match a non-digit character

    And some useful zero-width assertions, such as:

    \b Match a word boundary
    \B Match a non-(word boundary)

    These last two don't match a character -- they match the spot BETWEEN
    two characters, providing the characters on each side of that spot
    conform. A \b matches the spot between a \w and a \W (in any order).
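
    Here is a brief JavaScript sketch of the shorthand escapes and the \b
    assertion. The sample line and the variable names are invented for the
    purpose.

```javascript
const sampleLine = "User_42 logged on at 14:44";

// \w+ : runs of "word" characters (letters, digits and underscore).
const wordRuns = sampleLine.match(/\w+/g);

// \d+ : runs of digits.
const digitRuns = sampleLine.match(/\d+/g);

// \b on both sides means "on" must stand alone as a word.
const wholeWord = /\bon\b/;
const inSentence = wholeWord.test("logged on at");
const insideWord = wholeWord.test("Synchronet");  // the "on" is mid-word
const withoutBoundary = /on/.test("Synchronet");  // no \b, so it matches

console.log(wordRuns, digitRuns, inSentence, insideWord, withoutBoundary);
```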


    Capture Buffers
    ===============
    We have some text which includes dates in that irrational MM/DD/YYYY
    format preferred by some nations. We want to convert it to a sensible
    YYYY-MM-DD format. So we will use a RE of

    /\d\d[-\/]\d\d[-\/]\d\d\d\d/

    which should match against text in the correct format. Let's note a few
    things.

    First, we match a date separator of - or / since either might appear in
    our text. We use a character class [-\/] for the separator. The - does
    not indicate a range of characters, because it comes at the beginning of
    the class. We have to escape the / because that is the same character we
    use to delimit our RE. So our class becomes [-\/]. But note that some
    languages (Perl) allow the use of alternate characters for the RE
    delimiter, so we could have written !\d\d[-/]\d\d[-/]\d\d\d\d! with ! as
    our RE delimiter, making it unnecessary to escape the / in the character
    class.

    Second, having MATCHED our RE, how do we substitute the new date format
    for the old? We need to pick up the numbers and swap them around! So we
    introduce () grouping meta-characters for their capture-buffer
    properties. The substitution

    s!(\d\d)[-/](\d\d)[-/](\d\d\d\d)!$3-$1-$2!

    does the trick.

    We've added () capture-buffers for each of the three sets of digits in
    our MATCH, and used $1, $2 and $3 to refer to the contents of those
    buffers in our REPLACE. The numbers, of course, refer to the buffers in
    the order in which they are found. If we applied this substitution to
    text containing the date 06/19/2006, it should be replaced with the date
    2006-06-19.
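
    The same substitution can be sketched in JavaScript, where replace()
    also understands $1, $2 and $3. The sample sentence is made up, and the
    g flag is my addition so that every date in the text is converted.

```javascript
// Three capture buffers: month, day, year. The / is escaped because
// it is also the delimiter of a JavaScript RE literal.
const usDate = /(\d\d)[-\/](\d\d)[-\/](\d\d\d\d)/g;

const before = "Posted 06/19/2006, edited 06-20-2006.";

// $3-$1-$2 puts the year first, then the month, then the day.
const after = before.replace(usDate, "$3-$1-$2");

console.log(after);
```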

    Note there is a potential problem here. The date separators are matched
    with the identical character classes [-/]. The classes are identical, but
    it doesn't mean they will match identical things. The text might contain
    06-19/2006 which is probably NOT a date, since the two separator
    characters differ. It would still match our RE though.


    Backreferences
    ==============
    We used $1, $2, etcetera to refer to the contents of the capture-buffers
    AFTER the match. When we refer to the contents of a capture-buffer IN
    the match, it's called a backreference, and it looks slightly different.
    We use \1, \2, etcetera instead of $1, $2. With this in mind, we re-write
    our substitution. It looks like this:

    s!(\d\d)([-/])(\d\d)\2(\d\d\d\d)!$4-$1-$3!
            ^^^^^^      ^^
    We have added a () around the first separator and used \2 to match the
    second separator with a backreference to the first. Now the first
    separator is matched against the character class as before, but
    whatever matches (a - or a /) is put into the (second) capture buffer.
    The second separator must now match the same character that matched the
    first separator. In this way, either - or / can be used as a
    separator, but both separators must be the same!

    In this example we have seen () used for capturing buffers for use
    after the match, and for backreferences during the match itself.
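
    A JavaScript sketch of the backreference version follows. The variable
    names are mine, and the test dates include the mixed-separator case
    discussed earlier.

```javascript
// Buffer 2 captures the first separator; \2 demands the same one again.
const strictDate = /(\d\d)([-\/])(\d\d)\2(\d\d\d\d)/;

const slashes = "06/19/2006".replace(strictDate, "$4-$1-$3");
const dashes = "06-19-2006".replace(strictDate, "$4-$1-$3");

// Mixed separators no longer match, so the text is left untouched.
const mixedMatches = strictDate.test("06-19/2006");
const mixed = "06-19/2006".replace(strictDate, "$4-$1-$3");

console.log(slashes, dashes, mixedMatches, mixed);
```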


    Usefulness
    ==========
    So all this RE talk is just fine, but how exactly can it be put to good
    use? Because of the flexibility with which they can match text and the
    ease with which matched text can be rearranged, RE's are great for
    processing text in bulk to extract nuggets of useful information. Text
    in bulk like say LOG files....

    Who has been logging on to your board? And from where?

    The RE /^\+\+\s+\((\d{4})\)\s+(.*?)\s+Logon/ matched against the
    logfiles in /sbbs/data/logs should select logons, and place the user
    number and user id in the two capture buffers. Here's how the RE
    breaks down:

    /            # start the RE
    ^\+\+        # ++ at the start of the line
    \s+          # one or more whitespace
    \(           # a literal left bracket, escaped
    (\d{4})      # exactly four digits, placed in capture buffer #1
    \)           # a literal right bracket, escaped
    \s+          # one or more whitespace
    (.*?)        # any number of any characters ungreedily
                 # matched, and placed in capture buffer #2
    \s+          # one or more whitespace
    Logon        # The word "Logon"
    /x           # end the RE with the 'x' flag

    Note that we have escaped those characters that appear literally in the
    input text, but will be taken for meta-characters in our RE (like the ++
    at the beginning of the line and the () around the user number).
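
    Here is a hedged JavaScript sketch of the same RE pulling out both
    capture buffers. JavaScript has no 'x' flag, so the pattern is glued
    together from commented pieces instead. The log lines, user names and
    variable names are all invented to fit the RE. They are NOT copied from
    a real Synchronet log, so check your own logfiles for the real layout.

```javascript
// Build the RE from commented pieces. String.raw keeps the
// backslashes exactly as typed.
const logonPattern = new RegExp(
  [
    String.raw`^\+\+`,   // ++ at the start of the line
    String.raw`\s+`,     // one or more whitespace
    String.raw`\(`,      // a literal left bracket
    String.raw`(\d{4})`, // four digits, capture buffer 1
    String.raw`\)`,      // a literal right bracket
    String.raw`\s+`,     // one or more whitespace
    String.raw`(.*?)`,   // anything, ungreedily, capture buffer 2
    String.raw`\s+`,     // one or more whitespace
    "Logon",             // the word "Logon"
  ].join("")
);

// Hypothetical log lines, shaped only to suit the RE above.
const sampleLog = [
  "++ (0001)  Alice Example  Logon",
  "   some other log entry",
  "++ (0042)  Bob Sysop  Logon",
];

const logons = [];
for (const entry of sampleLog) {
  const found = logonPattern.exec(entry);
  if (found) {
    logons.push({ number: found[1], name: found[2] });
  }
}
console.log(logons);
```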

    Also note that we end the RE with the 'x' flag. So WTF is that? The 'x'
    flag in Perl allows us to break up our RE by adding additional whitespace
    and including comments. In fact, you could cut'n'paste that RE exactly as
    above, including whitespaces and comments, directly into a Perl program,
    where it would work without modification and would be much easier to
    read. (Of course, only a VERY wimpy Perl programmer would dream of doing
    any such thing...)

    So here's a Linux command you should be able to run and see who has
    been using your system and the frequency of use:

    perl -ne "/^\+\+\s+\(\d{4}\)\s+(.*)\s+Logon/ && print \$1.qq(\n)" \
    *.log|sort|uniq -c

    Here's another:

    perl -ne "/^\@\+.*\[(.*)\]/ and print \$1.qq(\n)" *.log|sort|uniq -c

    that will show the IP addresses from which connections have been made.
    Windows people can do EXACTLY the same, providing they install Perl
    and the Unix Utilities (to get the 'uniq' utility).

    http://www.activestate.com/Products/ActivePerl/
    http://unxutils.sourceforge.net/

    These same RE's could be used in PHP or JavaScript programs to produce
    the same sort of results.
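
    As a rough idea of what that might look like, here is a JavaScript
    sketch that does the job of the first Perl one-liner plus sort|uniq -c.
    It works on an in-memory array rather than reading *.log files, and the
    log lines and names are hypothetical, shaped only to fit the RE.

```javascript
// Same RE as the first Perl one-liner: one capture buffer, the user id.
const logonRe = /^\+\+\s+\(\d{4}\)\s+(.*)\s+Logon/;

// Hypothetical log lines standing in for the contents of *.log.
const logLines = [
  "++ (0001) Alice Example Logon",
  "@+ connection line that is not a logon",
  "++ (0002) Bob Sysop Logon",
  "++ (0001) Alice Example Logon",
];

// Count how many times each user id was captured.
const tally = new Map();
for (const logLine of logLines) {
  const found = logonRe.exec(logLine);
  if (found) {
    tally.set(found[1], (tally.get(found[1]) || 0) + 1);
  }
}

// Sort by user id and print "count name", much like sort|uniq -c.
const tallyReport = [...tally.keys()]
  .sort()
  .map((name) => `${tally.get(name)} ${name}`);
console.log(tallyReport.join("\n"));
```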


    Conclusion
    ==========
    Regular Expressions are a powerful force in the world of data analysis. We
    can use them to manipulate our logfiles (including our web logs) and
    extract valuable data. This data could later be presented in tabular form
    as a BBS bulletin, or plotted graphically for display via the webserver.

    Used to process message text they can help identify things of interest,
    add Ctrl-A colour highlights to key words or perhaps to implement
    profanity filters (ugh!).

    External 'Door' programs have an unfortunate habit of generating ugly
    bulletins in weirdo formats. Regular Expressions used as a part of a
    simple pre-processor program can strip these bulletins down to basics,
    rearrange the pertinent data in any way you like, and re-colourize to
    suit the theme of your BBS. Texts enhanced with colour codes from an old
    BBS can be filtered through a script that matches the old colour codes
    and replaces them with the Synchronet equivalent.

    Regular Expressions can make a data analysis script or program much easier
    to code. The average BBS has masses of data that is subject to analysis,
    so Regular Expressions have a valuable role to play in the management of
    your BBS.

