Consider a practical task: we have a phone number like
"+7(903)-123-45-67", and we need to turn it into pure digits: 79031234567.
To do so, we can find and remove anything that’s not a digit. Character classes can help with that.
A character class is a special notation that matches any symbol from a certain set.
To start, let’s explore the “digit” class. It’s written as \d and corresponds to “any single digit”.
For instance, let’s find the first digit in the phone number:
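A minimal sketch of that search, assuming the phone number is stored in a variable named phone (a name chosen here for illustration):

```javascript
const phone = "+7(903)-123-45-67";

// Without the g flag, match() returns only the first match,
// together with details such as its position in the string.
const firstDigit = phone.match(/\d/);

console.log(firstDigit[0]);    // 7
console.log(firstDigit.index); // 1
```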
Without the flag g, the regular expression only looks for the first match, that is, the first digit 7.
Let’s add the g flag to find all digits:
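A short sketch of the same search with the g flag; the variable names are illustrative only:

```javascript
const phone = "+7(903)-123-45-67";

// With the g flag, match() returns an array of all matches.
const digits = phone.match(/\d/g);

console.log(digits.length);   // 11
// Joining the matches gives the digits-only phone number.
console.log(digits.join("")); // 79031234567
```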
That was a character class for digits. There are other character classes as well.
The most used are:
\d (“d” is from “digit”)
- A digit: a character from 0 to 9.
\s (“s” is from “space”)
- A space symbol: includes spaces, tabs \t, newlines \n and a few other rare characters, such as \v, \f and \r.
\w (“w” is from “word”)
- A “wordly” character: either a letter of the Latin alphabet, a digit or an underscore _. Non-Latin letters (like Cyrillic or Hindi) do not belong to \w.
For instance, \d\s\w means a “digit” followed by a “space character” followed by a “wordly character”, such as 1 a.
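The three classes can be tried out with a small sketch like the one below. The sample strings are made up for illustration:

```javascript
// \d\s\w: a digit, then a space character, then a "wordly" character.
const found = "Room 1 a".match(/\d\s\w/);
console.log(found[0]); // 1 a

// \s also matches a tab, not only the space character.
const tabMatches = /\d\s\w/.test("1\ta");
console.log(tabMatches); // true

// \w covers only Latin letters, digits and the underscore.
const underscoreIsWordly = /\w/.test("_");
const cyrillicIsWordly = /\w/.test("я");
console.log(underscoreIsWordly); // true
console.log(cyrillicIsWordly);   // false
```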
A regexp may contain both regular symbols and character classes.
For instance, CSS\d matches the string CSS with a digit after it:
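A brief sketch, using sample sentences invented for this example:

```javascript
// The literal characters C, S, S must be followed by any single digit.
const withDigit = "Is there CSS4?".match(/CSS\d/);
console.log(withDigit[0]); // CSS4

// No digit right after CSS, so there is no match.
const withoutDigit = "CSS is fun".match(/CSS\d/);
console.log(withoutDigit); // null
```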
We can also use several character classes in one pattern. In the match, each regexp character class corresponds to exactly one character of the result.
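As an illustration, the sketch below combines six character classes; the sample sentence is an assumption made for this example. The result has six characters, one per class:

```javascript
// \s matches the space, \w\w\w\w matches "HTML", \d matches "5".
const result = "I love HTML5!".match(/\s\w\w\w\w\d/);

console.log(result[0]);        // " HTML5" (with the leading space)
console.log(result[0].length); // 6
```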
For every character class there exists an “inverse class”, denoted with the same letter, but uppercased.
The “inverse” means that it matches all other characters, for instance:
\D
- Non-digit: any character except \d, for instance a letter.
\S
- Non-space: any character except \s, for instance a letter.
\W
- Non-wordly character: anything but \w, e.g. a non-Latin letter or a space.
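A small sketch of the inverse classes on a made-up four-character string (a letter, a digit, a space and an underscore):

```javascript
const sample = "a1 _";

// \D: everything that is not a digit.
const nonDigits = sample.match(/\D/g);
console.log(nonDigits); // [ 'a', ' ', '_' ]

// \S: everything that is not a space character.
const nonSpaces = sample.match(/\S/g);
console.log(nonSpaces); // [ 'a', '1', '_' ]

// \W: everything that is not a Latin letter, digit or underscore.
const nonWordly = sample.match(/\W/g);
console.log(nonWordly); // [ ' ' ]
```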
At the beginning of the chapter we saw how to make a number-only phone number from a string like +7(903)-123-45-67: find all digits and join them.
An alternative, shorter way is to find non-digits \D and remove them from the string:
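A minimal sketch of that approach, using the standard replace method with an empty replacement string:

```javascript
const phone = "+7(903)-123-45-67";

// Replace every non-digit with an empty string, which removes it.
const digitsOnly = phone.replace(/\D/g, "");

console.log(digitsOnly); // 79031234567
```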
A dot . is a special character class that matches “any character except a newline”.
It can be used on its own or in the middle of a regexp:
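A quick sketch with illustrative strings, showing the dot alone and between other characters:

```javascript
// A lone dot matches the first character it finds.
const single = "Z".match(/./);
console.log(single[0]); // Z

// In the middle of a regexp, the dot stands for exactly one character.
const withHyphen = "CS-4".match(/CS.4/);
const withSpace = "CS 4".match(/CS.4/);
console.log(withHyphen[0]); // CS-4
console.log(withSpace[0]);  // CS 4 (a space is a character too)
```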
Please note that a dot means “any character”, but not the “absence of a character”. There must be a character to match it:
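For example, in this small sketch the string has nothing between S and 4 for the dot to consume:

```javascript
// The dot takes "4", and then there is no second "4" left for the pattern.
const noMatch = "CS4".match(/CS.4/);

console.log(noMatch); // null
```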
By default, a dot doesn’t match the newline character \n.
For instance, the regexp A.B matches A, and then B with any character between them, except a newline \n:
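A short sketch of that behavior, with two illustrative strings:

```javascript
// A newline between A and B: the dot refuses to match it.
const acrossNewline = "A\nB".match(/A.B/);
console.log(acrossNewline); // null

// Any other character between A and B is fine.
const acrossHyphen = "A-B".match(/A.B/);
console.log(acrossHyphen[0]); // A-B
```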
There are many situations when we’d like a dot to mean literally “any character”, newline included.
That’s what the flag s does. If a regexp has it, then a dot . matches literally any character:
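A sketch comparing the same pattern with and without the s flag. It assumes an engine that supports the flag:

```javascript
const text = "A\nB";

// Without s, the dot stops at the newline.
const withoutFlag = text.match(/A.B/);
console.log(withoutFlag); // null

// With s ("dotAll" mode), the dot matches the newline as well.
const withFlag = text.match(/A.B/s);
console.log(withFlag[0] === text); // true

// The dotAll property reports whether the flag is set.
console.log(/A.B/s.dotAll); // true
```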
Check https://caniuse.com/#search=dotall for the most recent state of support. At the time of writing it doesn’t include Firefox, IE, Edge.
Luckily, there’s an alternative that works everywhere. We can use a regexp like [\s\S] to match “any character”.
[\s\S] literally says: “a space character OR not a space character”. In other words, “anything”. We could use another pair of complementary classes, such as [\d\D]; which pair we pick doesn’t matter. Or even [^], which means “match any character except nothing”.
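The sketch below tries the three variants on the same two-line string; none of them needs the s flag:

```javascript
const text = "A\nB";

// A space character OR a non-space character: anything, newline included.
const spacePair = text.match(/A[\s\S]B/);
// Any other complementary pair works the same way.
const digitPair = text.match(/A[\d\D]B/);
// In JavaScript, the empty negated set [^] also matches any character.
const emptyNegated = text.match(/A[^]B/);

console.log(spacePair[0] === text);    // true
console.log(digitPair[0] === text);    // true
console.log(emptyNegated[0] === text); // true
```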
We can also use this trick if we want both kinds of “dots” in the same pattern: the actual dot . behaving the regular way (“not including a newline”), and a way to match “any character” with [\s\S] or alike.
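As a sketch of mixing both in one pattern (the strings are invented for illustration), the first gap below may be a newline while the second may not:

```javascript
// [\s\S] accepts the newline after A; the regular dot accepts "-" after B.
const mixed = "A\nB-C".match(/A[\s\S]B.C/);
console.log(mixed[0] === "A\nB-C"); // true

// The regular dot still rejects a newline in the second position.
const secondNewline = "A\nB\nC".match(/A[\s\S]B.C/);
console.log(secondNewline); // null
```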
Usually we pay little attention to spaces. For us, the strings 1-5 and 1 - 5 are nearly identical.
But if a regexp doesn’t take spaces into account, it may fail to work.
Let’s try to find digits separated by a hyphen:
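A sketch of the problem: the pattern below contains no spaces, so it finds nothing in a string that has them.

```javascript
// Works when the digits touch the hyphen...
const compact = "1-5".match(/\d-\d/);
console.log(compact[0]); // 1-5

// ...but fails when spaces surround the hyphen.
const spaced = "1 - 5".match(/\d-\d/);
console.log(spaced); // null
```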
Let’s fix it by adding spaces into the regexp \d - \d:
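A sketch of the corrected pattern, plus a variant that uses the \s class instead of literal spaces:

```javascript
// Literal spaces in the pattern match literal spaces in the string.
const fixed = "1 - 5".match(/\d - \d/);
console.log(fixed[0]); // 1 - 5

// \s works too, and would also accept a tab or another space character.
const withClass = "1 - 5".match(/\d\s-\s\d/);
console.log(withClass[0]); // 1 - 5
```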
A space is a character, equal in importance to any other character.
We can’t add or remove spaces from a regular expression and expect it to work the same.
In other words, in a regular expression all characters matter, spaces too.
There exist the following character classes:
\d – digits.
\D – non-digits.
\s – space symbols, tabs, newlines.
\S – all but \s.
\w – Latin letters, digits, underscore _.
\W – all but \w.
. – any character if with the regexp 's' flag, otherwise any except a newline \n.
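…But that’s not all!
Unicode, which JavaScript uses for strings, assigns many properties to characters, such as which language a letter belongs to or whether it is a punctuation sign.
We can search by these properties as well. That requires the flag u, covered in the next article.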