_learning perl_, ch7, 2.I.2002, nh

(also see man perlre, from which this is liberally cribbed)

a REGULAR EXPRESSION is a pattern to be matched against a string.

  * regexps are enclosed in slashes: /.../
    if (/abc/){     # means, if($_ matches /abc/)
      print $_;
    }

  match operators:
  ---------------
  a | a plain letter matches itself
  . | any single character except \n
  * | zero or more of preceding character = {0,} (see below)
  + | one or more of preceding character  = {1,}
  ? | zero or one of preceding character  = {0,1}
  ^ | anchors the match at the start of the string (see below);
    | NEGATES only as the first character of a character class

  a pattern-matching CHARACTER CLASS is enclosed in [s]
                     ^^^^^^^^^^^^^^^
                     one and only one of the characters must be
                     present for the pattern to match!
                       [abcde] matches "rice" (at the 'c'), "dog" ('d')
                     - escape \[ and \] in ch. classes
                     - ranges of characters separated by dashes [a-z]
                     - escape \- if you want an actual '-'
                     - ^ right after the [ negates the class: [^aeiou] = any
                       character except a lowercase vowel
                     - escape \^s -- [^\^] = anything but a '^'

    Perl predefines some common character classes:

construct      | equivalent class | negated construct | eq. neg. class
---------------|------------------|-------------------|---------------
\d (a digit)   | [0-9]            | \D (not a digit)  | [^0-9]
\w (a word ch) | [a-zA-Z0-9_]     | \W (not a word ch)| [^a-zA-Z0-9_]
\s (a space ch)| [ \r\t\n\f]      | \S (! a space ch) | [^ \r\t\n\f]
---------------|------------------|-------------------|---------------

 o  \w matches ostensibly any "word character," but really any ch
    allowed in a perl variable name
 o  spaces are ( ), carriage returns (\r), tabs (\t), line feeds
    (\n), and form feeds (\f).
 o  character class shortcuts can be also used as part of other
    character classes
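
a small sketch of the class rules above; the sample strings are made up
for illustration and are not from the book:

```perl
use strict;
use warnings;

# [^aeiou] matches any character that is not a lowercase vowel,
# so digits and capitals qualify, not only consonants.
my @hits = grep { /[^aeiou]/ } ("aeiou", "a1", "E", "xyz");
print "@hits\n";

# A shortcut can sit inside a larger class: word characters or a dash.
my $slug_ok = ("learning-perl_7" =~ /^[\w-]+$/) ? 1 : 0;
print "slug ok: $slug_ok\n";

# \d picks out each digit, one character per match.
my @digits = "ch7, 2002" =~ /\d/g;
print scalar(@digits), " digits\n";
```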

--> patterns are GREEDY!  (leftmost is greediest)
      |
      `-> to control how many repetitions are allowed, use the
          GENERAL MULTIPLIER {} (still greedy within its bounds):
            x{5,10} -- matches 5 to 10 x's
            x{5,}   -- matches 5 or more x's
            x{5}    -- matches exactly 5 x's
            x{0,5}  -- matches 5 or fewer x's   # 0 obligatory

      `->  also, can put '?' after .* to make it NON-GREEDY (lazy):
           in /a.*c.*d/ the first .* runs up to the 2nd 'c' in
           "a xxx c xxxxxxxxxx c xxx d",
              BUT
           in /a.*?c.*d/ the first .*? stops at the first 'c'.
           (the overall match is the whole string either way; only
           the split between the pieces changes)
          
      `->  ? modifier works after any multiplier (?,+,*, and {m,n})
           (doesn't change meaning, just greediness)
          
      `->  is sometimes better to use ? rather than let an
           expression backtrack all over, adding to runtime
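
to see the difference, capture what the first .* swallows. a minimal
sketch using the sample string from above; the capture parens and the
variable names are added here for illustration:

```perl
use strict;
use warnings;

my $text = "a xxx c xxxxxxxxxx c xxx d";

# Capture what the first .* (or .*?) consumed.
my ($greedy) = $text =~ /a(.*)c.*d/;
my ($lazy)   = $text =~ /a(.*?)c.*d/;
print "greedy: [$greedy]\n";   # runs up to the second 'c'
print "lazy:   [$lazy]\n";     # stops before the first 'c'

# {m,n} is greedy too; a trailing ? makes it take as few as allowed.
my ($most)  = "xxxxxxxx" =~ /(x{2,5})/;
my ($least) = "xxxxxxxx" =~ /(x{2,5}?)/;
print length($most), " vs ", length($least), "\n";
```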

NOTE:   Backslashed metacharacters in Perl are alphanumeric, such as
        \b, \w, \n.  Unlike some other regular expression languages,
        there are no backslashed symbols that aren't alphanumeric.
        So anything that looks like \\, \(, \), \<, \>, \{, or \} is
        always interpreted as a literal character, not a
        metacharacter. This was once used in a common idiom to
        disable or quote the special meanings of regular expression
        metacharacters in a string that you want to use for a
        pattern. Simply quote all non-alphanumeric characters:

            $pattern =~ s/(\W)/\\$1/g;

        Now it is much more common to see either the quotemeta()
        function or the \Q escape sequence used to disable all
        metacharacters' special meanings like this:

            /$unquoted\Q$quoted\E$unquoted/
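
        a short sketch comparing the old idiom with quotemeta() and \Q;
        the string in $literal is a made-up example:

```perl
use strict;
use warnings;

my $literal = "a.b*c";    # hypothetical text that should match literally

(my $old_idiom = $literal) =~ s/(\W)/\\$1/g;   # the older idiom
my $quoted = quotemeta($literal);              # the built-in function
print "$old_idiom\n$quoted\n";                 # both print a\.b\*c

# Unquoted, . and * keep their special meanings; \Q...\E turns them off.
my $raw_hit  = ("axbbc" =~ /^$literal$/)     ? 1 : 0;
my $safe_hit = ("axbbc" =~ /^\Q$literal\E$/) ? 1 : 0;
my $self_hit = ("a.b*c" =~ /^\Q$literal\E$/) ? 1 : 0;
print "$raw_hit $safe_hit $self_hit\n";
```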

           
grouping patterns:
-----------------
  def: a SEQUENCE matches something static: /abc/
  def: a MULTIPLIER allows multiple instances of one character in a
       match

(parentheses as memory):
-----------------------
  memorize parts of string with ()
  THEN, recall them with \1..\9 (backslash followed by an integer,
                                 referring to the same-numbered pair
  |                              of parens, counting from one)
  `-> matches the same sequence
        /fred(.)barney\1/ 
      would match fred, followed by one character, followed by
      barney, followed by the same previous character:
        "fredxbarneyx", but NOT "fredxbarneyy"
      (whereas, 
        /fred.barney./ would match both "fredxbarneyx" AND
      "fredxbarneyy")

      corresponding variables are also set from the references:
      ========================================================
           if (/Time: (..):(..):(..)/) {
               $hours = $1;
               $minutes = $2;
               $seconds = $3;
           }
      |
      `-> or, you can just assign your own names:
          ($hours,$minutes,$seconds) = (/Time: (..):(..):(..)/);

    * When the bracketing construct ( ... ) is used, \<digit>
      matches the digit'th substring.  Outside of the pattern,
      always use "$" instead of "\" in front of the digit.
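
    a runnable sketch of both uses of the memory parens; the "Time:"
    input line is invented for illustration:

```perl
use strict;
use warnings;

# \1 inside the pattern must repeat whatever (.) captured.
for my $s ("fredxbarneyx", "fredxbarneyy") {
    my $verdict = ($s =~ /fred(.)barney\1/) ? "matches" : "no match";
    print "$s: $verdict\n";
}

# In list context a match returns its captures, so they can be named directly.
my $line = "Time: 12:34:56";    # sample input, made up
my ($hours, $minutes, $seconds) = $line =~ /Time: (..):(..):(..)/;
print "$hours h, $minutes m, $seconds s\n";
```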


alternation:
-----------
  /a|b|c/ matches any one of a, b, or c (= [abc])
  /nori|orange/ matches "nori" or "orange" 


anchoring patterns
~~~~~~~~~~~~~~~~~~
       \b -- 'word boundary'
       \B -- not a word boundary
       ^  -- beginning of string, IFF at beginning of pattern
       $  -- end of string, IFF at end of pattern
       \A -- only at beginning of string
       \Z -- only at end of string (or before newline at the end)
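
    a quick sketch of the anchors at work; the test string is made up:

```perl
use strict;
use warnings;

my $s = "cat concatenate cat\n";

# The "= () =" idiom counts matches by forcing list context.
my $anywhere = () = $s =~ /cat/g;       # every "cat", even inside a word
my $whole    = () = $s =~ /\bcat\b/g;   # only standalone words
my $inside   = () = $s =~ /\Bcat\B/g;   # only "cat" buried in a longer word
print "$anywhere $whole $inside\n";

# \Z tolerates one trailing newline; \A is pinned to the very start.
my $at_end   = ($s =~ /cat\Z/) ? 1 : 0;
my $at_start = ($s =~ /\Acat/) ? 1 : 0;
print "$at_end $at_start\n";
```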

    these escapes (from double-quoted strings) also work in patterns:
              \l          lowercase next char (think vi)
              \u          uppercase next char (think vi)
              \L          lowercase till \E (think vi)
              \U          uppercase till \E (think vi)
              \E          end case modification (think vi)
              \Q          quote (disable) pattern metacharacters till \E 

    In addition, Perl defines the following:

              \w  Match a "word" character (alphanumeric plus "_")
              \W  Match a non-word character
              \s  Match a whitespace character
              \S  Match a non-whitespace character
              \d  Match a digit character
              \D  Match a non-digit character

       (?=pattern)
                 A zero-width positive lookahead assertion.  For
                 example, /\w+(?=\t)/ matches a word followed by
                 a tab, without including the tab in $&.

       (?!pattern)
                 A zero-width negative lookahead assertion.  For
                 example /foo(?!bar)/ matches any occurrence of
                 "foo" that isn't followed by "bar".  Note
                 however that lookahead and lookbehind are NOT
                 the same thing.  You cannot use this for
                 lookbehind.
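
       a small sketch of both lookaheads; the input strings and the
       extra capture groups are assumptions added for illustration:

```perl
use strict;
use warnings;

# Positive lookahead: the tab must follow, but is not part of the match.
my $row = "name\tvalue";
my ($key) = $row =~ /(\w+)(?=\t)/;
print "[$key]\n";

# Negative lookahead: keep only the "foo"s not followed by "bar",
# and capture whatever word characters come after them.
my @suffixes = "foobar foobaz food" =~ /foo(?!bar)(\w*)/g;
print "@suffixes\n";
```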

pay attention to PRECEDENCE: multipliers bind tighter than |, so
  a|b*  
means "one 'a' OR any number of 'b's", whereas
  (a|b)* 
means "any number of 'a's or 'b's".
          |
          `-> NOTE that this set of (s) triggers the memory, so if
              you reference sth later, count that in the count!
          |
          `-> if you don't want this, use (?:a|b); the (?: ... ) form
              groups without memorizing
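
a sketch of the precedence rule and of how grouping affects the capture
count; the ^ and $ anchors and the test strings are added here so that
each pattern has to account for the whole string:

```perl
use strict;
use warnings;

for my $s ("a", "bbb", "abab") {
    my $alt  = ($s =~ /^(?:a|b*)$/) ? "yes" : "no";   # one 'a' OR a run of 'b's
    my $star = ($s =~ /^(?:a|b)*$/) ? "yes" : "no";   # any mix of 'a's and 'b's
    print "$s: a|b* $alt, (a|b)* $star\n";
}

# Capturing parens count toward $1, $2, ...; (?: ) does not.
my @with    = "abab-x" =~ /(a|b)*-(x)/;     # two groups: "x" lands in $2
my @without = "abab-x" =~ /(?:a|b)*-(x)/;   # one group:  "x" lands in $1
print scalar(@with), " vs ", scalar(@without), " captures\n";
```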

    _       _        _       _        _       _        _       _
_.+' `+._.+' `+.__.+' `+._.+' `+.__.+' `+._.+' `+.__.+' `+._.+' `+._.

matching operator:
~~~~~~~~~~~~~~~~~
  ------
  | =~ | -- selecting a different target
  ------ 
    `-> takes a regexp op on the right side, and changes the *target*
        of the op to sth else (besides $_)
            $a = "hello world";
            $a =~ /^he/;     # true
            $a =~ /(.)\1/;   # true (matches double 'l')
        
        o  target of this can also be <STDIN>
             if (<STDIN> =~ /^[yY]/){     # if the response is "yes"

  -> ignore case with the /i modifier
        /somepattern/i

  -> either you can backslash a lot of slashes, \/\/\/\/\/, or you can
    use your delimiter of choice:
        /^\/usr\/bin/
        m#/usr/bin#
        m@/usr/bin@  ... or whatever, just any nonalphanumeric,
                        nonwhitespace character will do.
        |
        `-> usually / will do, but if the pattern contains a lot of
            /s, pick a delimiter that doesn't appear in it, like #.
            likewise, if the pattern is full of #s, stick with /.

  -> variable interpolation
    ======================
        $what = "bird";
        $sentence = "a bird can fly";
        if ($sentence =~ /\b$what\b/){
          print "$sentence has the word $what in it!\n";
        }
      
      * if $what = "[box]", can disable the pattern-matching
        meaning of the [s] by saying /\Q$what\E/ (\Q, remember,
        disables metacharacters until \E).
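
        a sketch of what goes wrong without \Q; the test strings are
        made up:

```perl
use strict;
use warnings;

my $what = "[box]";    # looks like a character class once interpolated

# Unquoted, the pattern is the class [box]: any one of b, o, x will do.
my $loose  = ("oxen" =~ /$what/)     ? 1 : 0;
# Quoted, it only matches the five literal characters "[box]".
my $quoted = ("oxen" =~ /\Q$what\E/) ? 1 : 0;
my $exact  = ("put it in the [box]" =~ /\Q$what\E/) ? 1 : 0;
print "$loose $quoted $exact\n";
```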

    a few predefined read-only variables:
    -------------------------------------------
    | $& | what was matched                   |
    |----|------------------------------------|
    | $` | what comes before what was matched |
    |----|------------------------------------|
    | $' | what comes after what was matched  |
    -------------------------------------------
                $_ = "this is a sample string";
                /sa.*le/;  # matches 'sample'
                            # $& = "sample"
                            # $` = "this is a "
                            # $' = " string"


substitute operator: 
~~~~~~~~~~~~~~~~~~~  
    s/old/new/;
    
    * the /g modifier operates on all possible matches within the
      string, not just the first match

    * the /i modifier ignores case, just like in the match operator
 
    * the replacement string is variable-interpolated:
      $_ = "hello world";
      $new = "goodbye";
      s/world/$new/;    #$_ gets "hello goodbye"

    * $_ = "this is a test";
      s/(\w+)/<$1>/g;   # $_ now = "<this> <is> <a> <test>"
      
    * can make the delimiter anything you want, like above.  just use
      any nonalphanumeric, nonwhitespace ch 3 times:
          s#old#new#
    
    * can also use =~ operator, as with match operator:
          $x =~ s/old/new/;
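
    a sketch combining the points above (modifiers, =~, and a custom
    delimiter); the strings and the /opt/bin path are invented:

```perl
use strict;
use warnings;

my $text = "Old habits, old hat";
my $count = ($text =~ s/old/new/gi);   # s/// returns the number of replacements
print "$count: $text\n";

# A different delimiter avoids backslashing every / in a path.
my $path = "/usr/bin/perl";
$path =~ s#^/usr/bin#/opt/bin#;
print "$path\n";
```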


SPLIT & JOIN functions:
~~~~~~~~~~~~~~~~~~~~~~
   "split" breaks a string up with a regexp;
   "join" glues it together again.

  split:
  =====
     takes a regexp and a string
     ---------------------------
        $line = "this:is:a::test";
        @fields = split(/:/,$line);
          # fields now is ("this","is","a","","test")
    
     or, if you don't want the extra "" (from ::) in there,
        @fields = split(/:+/,$line);
          # fields now is ("this","is","a","test")
     
     commonly takes $_
        $_ = "some string";
        @words = split(/ /);   # same as @words = split(/ /,$_);
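
     a runnable sketch of the split variants above, plus the special
     ' ' separator, which the notes do not cover and which is shown
     only for contrast with / /:

```perl
use strict;
use warnings;

my $line = "this:is:a::test";
my @fields = split(/:/,  $line);   # the empty field between "::" is kept
my @packed = split(/:+/, $line);   # runs of colons act as one separator
print scalar(@fields), " vs ", scalar(@packed), " fields\n";

# split(/ /) splits on every single space; the special string ' '
# splits on runs of whitespace and skips leading whitespace.
my $spaced = "some  string";       # two spaces, made up for illustration
my @single = split(/ /, $spaced);
my @awk    = split(' ', $spaced);
print scalar(@single), " vs ", scalar(@awk), " words\n";
```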

  join:
  ====
      takes a string and a list of strings
      ------------------------------------
          $bigstring = join($glue,@fields);
              # puts whatever's in $glue between each field
              # $glue is *not* a regexp!! just some glue

          $passwdline = join(":",@fields);

      can cheat to get glue ahead of or after every element:
          
          $result = join("+", "", @fields);
              # "" is treated as an empty element, to be glued
              # together with the first data element of @fields,
              # so $result starts with a "+"
          or
          $result = join("+", @fields, "");
              # same trick at the other end: $result ends with "+"
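
      a sketch of join, including the empty-element trick; the field
      values are made up:

```perl
use strict;
use warnings;

# split and join undo each other when the glue equals the separator.
my $line = "this:is:a::test";
my $passwdline = join(":", split(/:/, $line));
print "$passwdline\n";

# An empty string at either end of the list buys one extra glue there.
my @fields   = ("a", "b", "c");
my $leading  = join("+", "", @fields);
my $trailing = join("+", @fields, "");
print "$leading $trailing\n";
```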