Perl Tutorial - Practical Extraction and Report Language (Perl)
Perl Regular Expressions
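The basic way to apply a regular expression is with the pattern binding operators =~ and !~. The =~ operator applies the pattern on its right to the string on its left and is true if the pattern matches; with s/// or tr/// it also modifies that string. The !~ operator is the negated test: it is true if the pattern does not match.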
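As a minimal sketch of the two binding operators, the following uses a made-up log line; the variable names and sample strings are assumptions chosen only for illustration.

```perl
use strict;
use warnings;

my $line = "error: disk full";   # sample input, invented for this sketch

# =~ applies the pattern on the right to the string on the left.
my $is_error = ($line =~ /^error:/) ? 1 : 0;     # pattern matches

# !~ is the negated test: true when the pattern does NOT match.
my $no_warning = ($line !~ /warning/) ? 1 : 0;   # 'warning' is absent

# Without a binding operator, the match is applied to $_.
my $hits = 0;
for ("error: a", "ok", "error: b") {
    $hits++ if /^error:/;
}

print "is_error=$is_error no_warning=$no_warning hits=$hits\n";
```

The last loop shows why many short scripts never write =~ at all: inside for, while (<>) and similar constructs the pattern is matched against $_ by default.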
Three operators are used with the binding operators in Perl. The first two take regular expressions; tr/// works on lists of characters rather than on a true regular expression.
- Match - m//
- Substitute - s///
- Transliterate - tr///
Pattern-matching options.
Option Description
g Match all possible patterns
i Ignore case
m Treat string as multiple lines
o Only evaluate once
s Treat string as single line
x Ignore white space in pattern
Pattern Modifiers
Code Description
g Global: match all occurrences of the regular expression
i Ignore case: match letters in either case
m Multiple lines: let ^ and $ match at the start and end of each line
o Only once: compile the regular expression only the first time
s Single line: let . match newlines as well
x Extra spaces: allow comments and whitespace in the regular expression syntax
Modifier | Operator | Description |
---|---|---|
i | $_=~s/PATTERN/REPLACEMENT/i; $_=~m/PATTERN/i; | Makes the match case insensitive. |
m | $_=~s/PATTERN/REPLACEMENT/m; $_=~m/PATTERN/m; | Makes the ^ and $ anchors match at the start and end of each line within the string, instead of only at the string boundaries. |
o | $_=~s/PATTERN/REPLACEMENT/o; $_=~m/PATTERN/o; | Compiles the pattern only once, even if it contains interpolated variables. |
s | $_=~s/PATTERN/REPLACEMENT/s; $_=~m/PATTERN/s; | Allows '.' to match a newline character. |
x | $_=~s/PATTERN/REPLACEMENT/x; $_=~m/PATTERN/x; | Allows white space and comments in the expression for clarity. |
g | $_=~s/PATTERN/REPLACEMENT/g; $_=~m/PATTERN/g; | Globally finds all matches. |
cg | $_=~m/PATTERN/cg; | Keeps the current search position (pos) when a global match fails, so a later match can continue from there. |
e | $_=~s/PATTERN/REPLACEMENT/e; | Evaluates the replacement as a Perl expression and uses its return value as the replacement text. |
c | $_=~tr/SEARCHLIST/REPLACEMENTLIST/cds; | Complements SEARCHLIST (operates on every character not listed). |
d | $_=~tr/SEARCHLIST/REPLACEMENTLIST/cds; | Deletes found but unreplaced characters. |
s | $_=~tr/SEARCHLIST/REPLACEMENTLIST/cds; | Squeezes runs of the same replaced character into a single character. |
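The modifiers are easier to compare side by side on small inputs. The sketch below is illustrative only; the sample strings and variable names are assumptions, not taken from the tutorial.

```perl
use strict;
use warnings;

my $text = "Perl\nperl is fun\nPERL";   # sample data, invented for this sketch

# /i ignores case; /g in list context returns every match.
my @all = $text =~ /perl/gi;

# /m lets ^ match at the start of every line, not just the string.
my @firsts = $text =~ /^(\w+)/mg;

# /s lets '.' match a newline as well.
my $dot_plain = ("a\nb" =~ /a.b/)  ? 1 : 0;
my $dot_s     = ("a\nb" =~ /a.b/s) ? 1 : 0;

# /x allows whitespace and comments inside the pattern.
my ($year) = "2024-05-17" =~ /
    (\d{4})   # four-digit year
    -         # literal dash
/x;

# /e evaluates the replacement as Perl code.
(my $doubled = "3 apples, 4 pears") =~ s/(\d+)/$1 * 2/ge;

# /c keeps pos() after a failed /g match instead of resetting it.
my $s = "aab";
my $found    = ($s =~ /a/g);    # succeeds, pos($s) is now 1
my $missed   = ($s =~ /x/gc);   # fails, but pos($s) is kept
my $kept_pos = pos($s);

print "@all | @firsts | $dot_plain $dot_s | $year | $doubled | $kept_pos\n";
```

Note that a comment inside an /x pattern must not contain the pattern delimiter, which is why the comments above avoid the / character.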
Simple Matching
m/abc/; # find 'abc'
m#abc#; # ...
m{abc}; # ...
/abc/; # ...
/abc def/; # find 'abc def'
/^abc/; # abc at beginning
/abc$/; # abc at the end
/^$/; # empty line
Substitution
s/a/b/; # first a->b
s/a/b/g; # all a->b
s/Hi!/Ho!/g; # 'Hi!' -> 'Ho!'
s/[[:cntrl:]]//g; # remove control chars
Translation
tr/a/b/; # all a->b
y/a/b/; # all a->b
tr/abc/x/; # a->x,b->x,c->x
tr/xxx/abc/; # only x->a
tr/a-z/A-Z/; # upper case
tr/A-Za-z/N-ZA-Mn-za-m/; # ROT13
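A short sketch of tr/// in use follows. It relies on the fact that tr returns the number of characters it matched; the sample strings are assumptions made up for the example.

```perl
use strict;
use warnings;

my $dna = "AACGTTTA";   # sample data, invented for this sketch

# With an empty replacement list the string is left unchanged,
# so the return value simply counts the matching characters.
my $count_a = ($dna =~ tr/A//);

# Ranges need no brackets: a-z maps onto A-Z position by position.
(my $upper = "hello world") =~ tr/a-z/A-Z/;

# ROT13 is its own inverse: applying it twice restores the text.
(my $rot  = "Hello") =~ tr/A-Za-z/N-ZA-Mn-za-m/;
(my $back = $rot)    =~ tr/A-Za-z/N-ZA-Mn-za-m/;

# d deletes characters that have no replacement; s squeezes runs.
(my $no_a     = "banana")    =~ tr/aeiou//d;
(my $squeezed = "aaabbbccc") =~ tr/a-z//s;

print "$count_a $upper $rot $back $no_a $squeezed\n";
```

The (my $copy = $original) =~ tr/.../.../ form is used here so the original string stays untouched.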
Quantities
/^\s?\S/; # 0..1 spaces
/^\s*\S/; # 0..n spaces
/^\s+\S/; # 1..n spaces
/a{3}/; # 3 times 'a'
/ab{3}/; # 'a' followed by 3 times 'b'
/(ab){3}/; # 3 times 'ab'
/a{3,4}/; # 3..4 times 'a'
/a{3,}/; # 3..n times 'a'
/a.+b/; # greedy (maximal) match
/a.+?b/; # non-greedy match
Grouping and Alternatives
/(abc)def/; # $1='abc'
/(a)b(cd)/; # $1='a',$2='cd'
/(a)(?:b)(c)/; # $1='a',$2='c'
/(start|begin)/; # either 'start' or 'begin'
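To show how capturing and non-capturing groups interact, here is a small sketch. The input string and variable names are hypothetical.

```perl
use strict;
use warnings;

my $entry = "begin backup at 10:15";   # sample input, invented for this sketch

# In list context a match returns its captures ($1, $2, ...).
# (?:...) groups without capturing, so the job name is skipped.
my ($verb, $hour, $min) =
    $entry =~ /^(start|begin)\s+(?:\w+)\s+at\s+(\d+):(\d+)$/;

# A failed match returns an empty list.
my @none = "stop now" =~ /^(start|begin)/;

print "$verb $hour $min ", scalar(@none), "\n";
```

Assigning the match result to a list avoids reading $1 and friends after a failed match, where they would still hold values from an earlier successful match.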
Special Characters
\d Digit
\D Non-Digit
\w Word Character
\W Non-Word Character
\s Whitespace
\S Non-Whitespace
Character Classes (POSIX; use inside a bracketed class, e.g. [[:alpha:]])
:alpha: alphabetic
:alnum: alpha numeric
:upper: upper case
:lower: lower case
:digit: \d
:xdigit: hex number
:print: printable
:space: \s
:blank: space, tab
:punct: punctuation
:graph: alnum and punct
:word: \w
:ascii: ASCII chars
:cntrl: control chars
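POSIX classes are only valid inside a bracketed character class, so the name appears with two pairs of brackets, as in [[:upper:]]. The following sketch uses an invented sample string to show a few of them.

```perl
use strict;
use warnings;

my $raw = "ab\tCD 12!?";   # sample data with an embedded tab, invented here

# Remove control characters (the tab) from a copy of the string.
(my $clean = $raw) =~ s/[[:cntrl:]]//g;

# Count upper-case letters with the "= () =" counting idiom.
my $n_upper = () = $raw =~ /[[:upper:]]/g;

# Collect all punctuation characters.
my @punct = $raw =~ /[[:punct:]]/g;

# A POSIX class can be combined with a quantifier like any other class.
my $only_alnum = ("abc123" =~ /^[[:alnum:]]+$/) ? 1 : 0;

print "$clean | $n_upper | @punct | $only_alnum\n";
```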
Greedy searches
Greedy means that each quantifier tries to match as much as it can.
The pattern /a.*a/ matches as many characters as possible between the first a and the last a.
If your text string is ababacdea, /a.*a/ will match the whole string.
You can control the greediness using a question mark.
Placed after a quantifier, the question mark makes it match the minimum number of times that still lets the overall pattern succeed.
The following table shows how to minimize the greediness.
Syntax Means
*? Match zero or more times, minimal number of times
+? Match one or more times, minimal number of times
?? Match zero or one time, minimal number of times
{num}? Match exactly num times (behaves the same as {num})
{num,}? Match at least num times, minimal number of times
{num,max}? Match at least num but not more than max times, minimal number of times
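The difference is easiest to see by capturing what each form matched. This sketch reuses the string ababacdea from the text; the HTML-like string is an assumption added for illustration.

```perl
use strict;
use warnings;

my $s = "ababacdea";

# Greedy: .* runs to the last 'a' in the string.
my ($greedy) = $s =~ /(a.*a)/;

# Non-greedy: .*? stops at the first 'a' that completes the match.
my ($lazy) = $s =~ /(a.*?a)/;

# A typical use: pulling out each tagged section separately.
my $html = "<b>one</b> and <b>two</b>";
my ($wide)  = $html =~ m{<b>(.*)</b>};
my @narrow  = $html =~ m{<b>(.*?)</b>}g;

print "$greedy | $lazy | $wide | @narrow\n";
```

The m{...} delimiters are used for the HTML example so the / in the closing tag does not need escaping.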
Anchoring Metacharacters
Metacharacter What It Matches
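^ Matches at the beginning of the string (or of each line with the m modifier)
$ Matches at the end of the string or before a final newline (or at the end of each line with the m modifier)
\A Matches the beginning of the string only
\Z Matches at the end of the string or before a final newline
\z Matches the end of string only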
\G Matches where previous m//g left off
\b Matches a word boundary (when not inside [ ])
\B Matches a nonword boundary
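The following sketch contrasts the anchors on a two-line string that ends in a newline. The string and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $s = "first\nsecond\n";   # sample data, invented for this sketch

# Without /m, ^ matches only at the start of the string.
my $plain_count = () = $s =~ /^\w+/g;

# With /m, ^ also matches after each embedded newline.
my $multi_count = () = $s =~ /^\w+/mg;

# \Z tolerates one trailing newline; \z does not.
my $cap_z   = ($s =~ /second\Z/) ? 1 : 0;
my $small_z = ($s =~ /second\z/) ? 1 : 0;

# \A is not affected by /m: it still means start of string only.
my $start = ($s =~ /\Asecond/m) ? 1 : 0;

# \b requires a word boundary on each side, so 'concat' is skipped.
my $words = () = "cat concat cat" =~ /\bcat\b/g;

print "$plain_count $multi_count $cap_z $small_z $start $words\n";
```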
Character Class: Anchored Characters
Metacharacter What It Matches
\b Matches a word boundary (when not inside [ ])
\B Matches a nonword boundary
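^ Matches at the beginning of the string (or of each line with the m modifier)
$ Matches at the end of the string or before a final newline (or at the end of each line with the m modifier)
\A Matches the beginning of the string only
\Z Matches at the end of the string or before a final newline
\z Matches the end of string only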
\G Matches where previous m//g left off
Character Class: Miscellaneous Characters
Metacharacter What It Matches
\12 Matches that octal value, up to \377
\x1B Matches that hex value (exactly two hex digits; use \x{...} for longer values)
\cX Matches that control character; e.g., \cC is <Ctrl>-C and \cV is <Ctrl>-V
\e Matches the ASCII ESC character, not backslash
\E Marks the end of changing case with \U or \L, or of quoting with \Q
\l Lowercase the next character only
\L Lowercase characters until the end of the string or until \E
\N Matches that named character; e.g., \N{greek:Beta}
\p{PROPERTY} Matches any character with the named property; e.g., \p{IsAlpha}
\P{PROPERTY} Matches any character without the named property
\Q Quote metacharacters until \E
\u Titlecase next character only
\U Uppercase until \E
\x{NUMBER} Matches Unicode NUMBER given in hexadecimal
\X Matches a Unicode "combining character sequence" (a base character plus any combining marks)
\[ Matches that metacharacter literally
\\ Matches a backslash
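Several of these escapes change how interpolated text is treated rather than matching a character themselves. The sketch below shows \Q...\E, \u and \U; the strings are assumptions invented for the example.

```perl
use strict;
use warnings;

my $literal = "a.b*c";   # text containing regex metacharacters, invented here

# Interpolated as-is, '.' and '*' keep their special meaning.
my $as_pattern = ("aXbbbc" =~ /\A$literal\z/) ? 1 : 0;

# \Q...\E quotes the metacharacters, so only the exact text matches.
my $as_text = ("aXbbbc" =~ /\A\Q$literal\E\z/) ? 1 : 0;
my $exact   = ("a.b*c"  =~ /\A\Q$literal\E\z/) ? 1 : 0;

# \u upper-cases the next character; \U upper-cases until \E.
(my $title = "hello world") =~ s/(\w+)/\u$1/g;
(my $shout = "hello world") =~ s/(hello)/\U$1\E/;

# \e matches ESC and \[ matches a literal bracket.
my $ansi = ("\e[0m" =~ /\A\e\[0m\z/) ? 1 : 0;

print "$as_pattern $as_text $exact | $title | $shout | $ansi\n";
```

Quoting with \Q is the usual safeguard when a pattern is built from user input or any other text that should be matched literally.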
Character Class: Remembered Characters
Metacharacter What It Matches
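(string) Used for backreferencing (see Examples 9.38 and 9.39)
\1 or $1 Refers to the text matched by the first set of parentheses (\1 inside the pattern, $1 after the match)
\2 or $2 Refers to the text matched by the second set of parentheses
\3 or $3 Refers to the text matched by the third set of parentheses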
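A brief sketch of both forms: \1 inside the pattern to require the same text again, and $1, $2 in the replacement to reuse it. The sample strings are assumptions made up for illustration.

```perl
use strict;
use warnings;

my $text = "this is is a test test";   # sample data, invented for this sketch

# \1 inside the pattern must match the same text that group 1 captured,
# which makes it a simple way to find doubled words.
my @dups = $text =~ /\b(\w+)\s+\1\b/g;

# $1 and $2 in the replacement refer to the captured groups.
(my $swapped = "Doe, John") =~ s/(\w+), (\w+)/$2 $1/;

print "@dups | $swapped\n";
```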
Character Class: Repeated Characters
Metacharacter What It Matches
x? Matches 0 or 1 x
x* Matches 0 or more occurrences of x
x+ Matches 1 or more occurrences of x
(xyz)+ Matches 1 or more patterns of xyz
x{m,n} Matches at least m occurrences of x and no more than n occurrences of x
Character Class: Whitespace Characters
Metacharacter What It Matches
\s Matches a whitespace character, such as spaces, tabs, and newlines
\S Matches nonwhitespace character
\n Matches a newline
\r Matches a return
\t Matches a tab
\f Matches a form feed
\b Matches a backspace (only inside [ ]; elsewhere it is a word boundary)
\0 Matches a null character
Checking for multiple occurrences
Pattern Interpretation
/a{1,4}/ Matches one, two, three, or four a characters.
/a{2}/ Matches exactly two a characters.
/a{0,2}/ Matches zero, one, or two a characters.
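The counts can be checked directly by capturing what each pattern consumed. This is a small sketch with invented test strings; note that without anchors a pattern such as /a{2}/ also succeeds inside a longer run of a characters.

```perl
use strict;
use warnings;

# How many characters does a{1,4} consume at the start of each string?
my @lens;
for my $str ("a", "aaa", "aaaaaa", "b") {
    push @lens, ($str =~ /^(a{1,4})/ ? length($1) : 0);
}

# a{0,2} accepts zero occurrences, so the empty string matches.
my $empty_ok = ("" =~ /^a{0,2}$/) ? 1 : 0;

# Anchored, a{2} rejects three characters; unanchored, it finds two of them.
my $three_exact = ("aaa" =~ /^a{2}$/) ? 1 : 0;
my $three_loose = ("aaa" =~ /a{2}/)   ? 1 : 0;

print "@lens | $empty_ok $three_exact $three_loose\n";
```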