SEARCHING USING REGULAR EXPRESSIONS
Regular expressions are special strings (sets of characters)
that are used to match text. If you needed to extract phone
numbers from a text file, you might use a regular expression.
If you needed to locate all files with digits in the middle of
their names, you might use a regular expression. If you wanted
to find all repeated words in a block of text, you might use a
regular expression. And if you wanted to replace all URLs in a
page with actual HTML links to those same URLs, yes, you might
use a regular expression.
Regular expressions are created using the regular expression
language, a specialized language designed to do everything that
was just discussed and much more. Like any language, regular
expressions have a special syntax and instructions that you
must learn.
Understanding Regular Expressions
The filtering examples in the previous two tutorials enabled
you to locate data using matches, comparisons, and wildcard
operators. For basic filtering (and even some not-so-basic
filtering) this might be enough. But as the complexity of
filtering conditions grows, so does the complexity of the WHERE
clauses themselves.
And this is where regular expressions become useful. Regular
expressions are special strings (sets of characters) that are
used to match text. If you needed to extract phone numbers from
a text file, you might use a regular expression. If you needed
to locate all files with digits in the middle of their names,
you might use a regular expression. If you wanted to find all
repeated words in a block of text, you might use a regular
expression. And if you wanted to replace all URLs in a page
with actual HTML links to those same URLs, yes, you might use a
regular expression (or two, for this last example).
Regular expressions are supported in all sorts of
programming languages, text editors, operating systems, and
more. And savvy programmers and network managers have long
regarded regular expressions as a vital component of their
technical toolboxes.
Regular expressions are created using the regular expression
language, a specialized language designed to do everything that
was just discussed and much more. Like any language, regular
expressions have a special syntax and instructions that you
must learn.
Note
To Learn More Full coverage of regular expressions is beyond
the scope of this tutorial. While the basics are covered here,
for a more thorough introduction to regular expressions you
might want to obtain a copy of my Sams Teach Yourself Regular
Expressions in 10 Minutes (ISBN 0672325667).
Using MySQL Regular Expressions
So what does this have to do with MySQL? As already
explained, all regular expressions do is match text, comparing
a pattern (the regular expression) with a string of text. MySQL
provides rudimentary support for regular expressions with WHERE
clauses, allowing you to specify regular expressions that are
used to filter data retrieved using SELECT.
Note
Just a Subset of the Regular Expression Language If you
are already familiar with regular expressions, take note. MySQL
only supports a small subset of what is supported in most
regular expression implementations, and this tutorial covers
most of what is supported.
This will all become much clearer with some
examples.
Basic Character Matching
We'll start with a very simple example. The following
statement retrieves all rows where column prod_name contains
the text 1000:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '1000'
ORDER BY prod_name;

• Output
+--------------+
| prod_name |
+--------------+
| JetPack 1000 |
+--------------+
• Analysis
This statement looks much like the ones that used LIKE (in
Tutorial 13, "Using Wildcard Filtering"), except that the
keyword LIKE has been replaced with REGEXP. This tells MySQL
that what follows is to be treated as a regular expression (one
that just matches the literal text 1000).
So, why bother using a regular expression? Well, in the
example just used, regular expressions really add no value (and
probably hurt performance), but consider this next example:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '.000'
ORDER BY prod_name;

• Output
+--------------+
| prod_name |
+--------------+
| JetPack 1000 |
| JetPack 2000 |
+--------------+
• Analysis
Here the regular expression .000 was used. . is a special
character in the regular expression language. It means match
any single character, and so both 1000 and 2000 matched and
were returned.
Of course, this particular example could also have been
accomplished using LIKE and wildcards (as seen in Tutorial
13).
Note
LIKE Versus REGEXP There is one very important
difference between LIKE and REGEXP. Look at these two
statements:
SELECT prod_name
FROM products
WHERE prod_name LIKE '1000'
ORDER BY prod_name;

and
SELECT prod_name
FROM products
WHERE prod_name REGEXP '1000'
ORDER BY prod_name;

If you were to try them both you'd discover that the first
returns no data and the second returns one row. Why is
this?
As seen in Tutorial 13, LIKE matches an entire column. If
the text to be matched existed in the middle of a column value,
LIKE would not find it and the row would not be returned
(unless wildcard characters were used). REGEXP, on the other
hand, looks for matches within column values, and so if the
text to be matched existed in the middle of a column value,
REGEXP would find it and the row would be returned. This is a
very important distinction.
So can REGEXP be used to match entire column values (so that
it functions like LIKE)? Actually, yes, using the ^ and $
anchors, as will be explained later in this tutorial.
Tip
Matches Are Not Case-Sensitive
Regular expression matching in MySQL (as of version 3.23.4) are
not case-sensitive (either case will be matched). To force
case-sensitivity, you can use the BINARY keyword, as
in
WHERE prod_name REGEXP BINARY
'JetPack .000'
Performing OR Matches
To search for one of two strings (either one or the other),
use | as seen here:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '1000|2000'
ORDER BY prod_name;

• Output
+--------------+
| prod_name |
+--------------+
| JetPack 1000 |
| JetPack 2000 |
+--------------+
• Analysis
Here the regular expression 1000|2000 was used. | is the
regular expression OR operator. It means match one or the
other, and so both 1000 and 2000 matched and were returned.
Using | is functionally similar to using OR statements in
SELECT statements, with multiple OR conditions being
consolidated into a single regular expression.
Tip
More Than Two OR Conditions
More than two OR conditions may be specified. For example,
'1000|2000|3000' would match 1000 or 2000 or
3000.
Matching One of Several Characters
. matches any single character. But what if you wanted to
match only specific characters? You can do this by specifying a
set of characters enclosed within [ and ], as seen here:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '[123] Ton'
ORDER BY prod_name;

• Output
+-------------+
| prod_name |
+-------------+
| 1 ton anvil |
| 2 ton anvil |
+-------------+
• Analysis
Here the regular expression [123] Ton was used. [123]
defines a set of characters, and here it means match 1 or 2 or
3, so both 1 ton and 2 ton matched and were returned (there was
no 3 ton).
As you have just seen, [] is another form of OR statement.
In fact, the regular expression [123] Ton is shorthand for
[1|2|3] Ton, which also would have worked. But the []
characters are needed to define what the OR statement is
looking for. To better understand this, look at the next
example:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '1|2|3 Ton'
ORDER BY prod_name;

• Output
+---------------+
| prod_name |
+---------------+
| 1 ton anvil |
| 2 ton anvil |
| JetPack 1000 |
| JetPack 2000 |
| TNT (1 stick) |
+---------------+
• Analysis
Well, that did not work. The two required rows were
retrieved, but so were three others. This happened because
MySQL assumed that you meant '1' or '2' or '3 ton'. The |
character applies to the entire string unless it is enclosed
with a set.
Sets of characters can also be negated. That is, they'll
match anything but the specified characters. To negate a
character set, place a ^ at the start of the set. So, whereas
[123] matches characters 1, 2, or 3, [^123] matches anything
but those characters.
Matching Ranges
Sets can be used to define one or more characters to be
matched. For example, the following will match digits 0 tHRough
9:
[0123456789]
To simplify this type of set, - can be used to define a range.
The following is functionally identical to the list of digits
just seen:
[0-9]
Ranges are not limited to complete sets[1-3] and [6-9] are
valid ranges, too. In addition, ranges need not be numeric, and
so [a-z] will match any alphabetical character.
Here is an example:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '[1-5] Ton'
ORDER BY prod_name;

• Output
+--------------+
| prod_name |
+--------------+
| .5 ton anvil |
| 1 ton anvil |
| 2 ton anvil |
+--------------+
• Analysis
Here the regular expression [1-5] Ton was used. [1-5]
defines a range, and so this expression means match 1 through
5, and so three matches were returned. .5 ton was returned
because 5 ton matched (without the . character).
Matching Special Characters
The regular expression language is made up of special
characters that have specific meanings. You've already seen .,
[], |, and -, and there are others, too. Which begs the
question, if you needed to match those characters, how would
you do so? For example, if you wanted to find values that
contain the . character, how would you search for it? Look at
this example:
• Input
SELECT vend_name
FROM vendors
WHERE vend_name REGEXP '.'
ORDER BY vend_name;

• Output
+----------------+
| vend_name |
+----------------+
| ACME |
| Anvils R Us |
| Furball Inc. |
| Jet Set |
| Jouets Et Ours |
| LT Supplies |
+----------------+
• Analysis
That did not work. . matches any character, and so every row
was retrieved.
To match special characters they must be preceded by \\. So,
\\- means find and \\. means find .:
• Input
SELECT vend_name
FROM vendors
WHERE vend_name REGEXP '\\.'
ORDER BY vend_name;

• Output
+--------------+
| vend_name |
+--------------+
| Furball Inc. |
+--------------+
• Analysis
That worked. \\. matches ., and so only a single row was
retrieved. This process is known as escaping, and all
characters that have special significance within regular
expressions must be escaped this way. This includes ., |, [],
and all of the other special characters used thus far.
\\ is also used to refer to meta-characters (characters that
have specific meanings), as listed in Table 9.1. bellow
Table 9.1. Whitespace
Metacharacters
|
Metacharacter
|
Description
|
|
\\f
|
Form feed
|
|
\\n
|
Line feed
|
|
\\r
|
Carriage
return
|
|
\\t
|
Tab
|
|
\\v
|
Vertical
table
|
Tip
To Match \ To match the
backslash character itself (\), you would need to
use \\\.
Note
\ or \\? Most regular expression implementation use a
single backslash to escape special characters to be able to use
them as literals. MySQL, however, requires two backslashes
(MySQL itself interprets one and the regular expression library
interprets the other).
Matching Character Classes
There are matches that you'll find yourself using
frequently, digits, or all alphabetical characters, or all
alphanumerical characters, and so on. To make working with
these easier, you may use predefined character sets known as
character classes. Table 9.2 bellow lists the character classes
and what they mean.
Table 9.2. Character
Classes
|
Class
|
Description
|
|
[:alnum:]
|
Any letter or digit,
(same as [a-zA-Z0-9])
|
|
[:alpha:]
|
Any letter (same as
[a-zA-Z])
|
|
[:blank:]
|
Space or tab (same as
[\\t ])
|
|
[:cntrl:]
|
ASCII control characters
(ASCII 0 tHRough 31 and
127)
|
|
[:digit:]
|
Any digit (same as
[0-9])
|
|
[:graph:]
|
Same as
[:print:] but excludes
space
|
|
[:lower:]
|
Any lowercase letter
(same as [a-z])
|
|
[:print:]
|
Any printable
character
|
|
[:punct:]
|
Any character that is
neither in [:alnum:] nor
[:cntrl:]
|
|
[:space:]
|
Any whitespace character
including the space (same as
[\\f\\n\\r\\t\\v ])
|
|
[:upper:]
|
Any uppercase letter
(same as [A-Z])
|
|
[:xdigit:]
|
Any hexadecimal digit
(same as [a-fA-F0-9])
|
Matching Multiple Instances
All of the regular expressions used thus far attempt to
match a single occurrence. If there is a match, the row is
retrieved and if not, nothing is retrieved. But sometimes
you'll require greater control over the number of matches. For
example, you might want to locate all numbers regardless of how
many digits the number contains, or you might want to locate a
word but also be able to accommodate a trailing s if one
exists, and so on.
This can be accomplished using the regular expressions
repetition meta-characters, listed in Table 9.3 bellow.
Table 9.3. Repetition
Metacharacters
|
Metacharacter
|
Description
|
|
*
|
0 or more
matches
|
|
+
|
1 or more matches
(equivalent to {1,})
|
|
?
|
0 or 1 match (equivalent
to {0,1})
|
|
{n}
|
Specific number of
matches
|
|
{n,}
|
No less than a specified
number of matches
|
|
{n,m}
|
Range of matches
(m not to exceed 255)
|
Following are some examples.
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '\\([0-9] sticks?\\)'
ORDER BY prod_name;

• Output
+----------------+
| prod_name |
+----------------+
| TNT (1 stick) |
| TNT (5 sticks) |
+----------------+
• Analysis
Regular expression \\([0-9]
sticks?\\) requires some explanation. \\( matches (, [0-9]
matches any digit (1 and 5 in this example), sticks? matches
stick and sticks (the ? after the s makes that s optional
because ? matches 0 or 1 occurrence of whatever it follows),
and \\) matches the closing ). Without ? it would have been
very difficult to match both stick and sticks.
Here's another example. This time we'll try to match four
consecutive digits:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '[[:digit:]]{4}'77
ORDER BY prod_name;

• Output
+--------------+
| prod_name |
+--------------+
| JetPack 1000 |
| JetPack 2000 |
+--------------+
• Analysis
As explained previously, [:digit:] matches any digit, and so
[[:digit:]] is a set of digits. {4} requires exactly four
occurrences of whatever it follows (any digit), and so
[[:digit:]]{4} matches any four consecutive digits.
It is worth noting that when using regular expressions there
is almost always more than one way to write a specific
expression. The previous example could have also been written
as follows:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '[0-9][0-9][0-9][0-9]'
ORDER BY prod_name;

Anchors
All of the examples thus far have matched text anywhere
within a string. To match text at specific locations, you need
to use anchors as listed in Table 9.4. bellow
|
Metacharacter
|
Description
|
|
^
|
Start of text
|
|
$
|
End of text
|
|
[[:<:]]
|
Start of word
|
|
[[:>:]]
|
End of word
|
For example, what if you wanted to find all products that
started with a number (including numbers starting with a
decimal point)? A simple search for [0-9\\.] (or
[[:digit:]\\.]) would not work because it would find matches
anywhere within the text. The solution is to use the ^ anchor,
as seen here:
• Input
SELECT prod_name
FROM products
WHERE prod_name REGEXP '^[0-9\\.]'
ORDER BY prod_name;

• Output
+--------------+
| prod_name |
+--------------+
| .5 ton anvil |
| 1 ton anvil |
| 2 ton anvil |
+--------------+
• Analysis
^ matches the start of a string. As such, ^[0-9\\.] matches
. or any digit only if they are the first characters within a
string. Without the ^, four other rows would have been
retrieved, too (those that have digits in the middle).
Note
The Dual Purpose ^ ^ has two uses. Within a set (defined
using [ and ]) it is used to negate that set. Otherwise it is
used to refer to the start of a string.
Making REGEXP Behave Like LIKE Earlier in this tutorial
I mentioned that LIKE and REGEXP behaved differently in that
LIKE matched an entire string and REGEXP matched substrings,
too. Using anchors, REGEXP can be made to behave just like LIKE
by simply starting each expression with ^ and ending it with
$.
Tip
Simple Regular Expression
Testing You can use SELECT to test regular expressions without
using database tables. REGEXP checks always return 0 (not a
match) or 1 (match). You can use REGEXP with literal strings to
test expressions and to experiment with them. The syntax would
look like this:
SELECT 'hello' REGEXP
'[0-9]';
This example would
obviously return 0 (as there are no digits in the text
hello).
Summary
In this tutorial, you learned the basics of regular
expressions, and how to use them in MySQL SELECT statements via
the REGEXP keyword.
[Next]
|