6 December 2019

Capturing groups

A part of a pattern can be enclosed in parentheses (...). This is called a “capturing group”.

That has two effects:

  1. It allows us to get a part of the match as a separate item in the result array.
  2. If we put a quantifier after the parentheses, it applies to the parentheses as a whole.

Examples

Let’s see how parentheses work in examples.

Example: gogogo

Without parentheses, the pattern go+ means the character g followed by o repeated one or more times. For instance, goooo or gooooooooo.

Parentheses group characters together, so (go)+ means go, gogo, gogogo and so on.

alert( 'Gogogo now!'.match(/(go)+/i) ); // "Gogogo"
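
To see the difference the parentheses make, the sketch below runs both patterns against the same string. The test string and variable names are made up for illustration, and console.log is used instead of alert so the snippet also runs outside a browser.

```javascript
// Without parentheses the quantifier + applies only to the letter "o".
const withoutGroup = 'gooo gogogo'.match(/go+/g);

// With parentheses the quantifier + applies to the whole group "go".
const withGroup = 'gooo gogogo'.match(/(go)+/g);

console.log(withoutGroup); // [ 'gooo', 'go', 'go', 'go' ]
console.log(withGroup);    // [ 'go', 'gogogo' ]
```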

Example: domain

Let’s make something more complex – a regular expression to search for a website domain.

For example:

mail.com
users.mail.com
smith.users.mail.com

As we can see, a domain consists of repeated words, a dot after each one except the last one.

In regular expressions that’s (\w+\.)+\w+:

let regexp = /(\w+\.)+\w+/g;

alert( "site.com my.site.com".match(regexp) ); // site.com,my.site.com

The search works, but the pattern can’t match a domain with a hyphen, e.g. my-site.com, because the hyphen does not belong to class \w.

We can fix it by replacing \w with [\w-] in every word except the last one: ([\w-]+\.)+\w+.
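
A quick check of the corrected pattern, as a minimal sketch (the sample string and the names domainRegexp and domains are illustrative):

```javascript
// Hyphens are allowed in every word except the last one.
const domainRegexp = /([\w-]+\.)+\w+/g;

const domains = 'site.com my-site.com'.match(domainRegexp);

console.log(domains); // [ 'site.com', 'my-site.com' ]
```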

Example: email

The previous example can be extended. We can create a regular expression for emails based on it.

The email format is: name@domain. Any word can be the name, hyphens and dots are allowed. In regular expressions that’s [-.\w]+.

The pattern:

let regexp = /[-.\w]+@([\w-]+\.)+[\w-]+/g;

alert("my@mail.com @ his@site.com.uk".match(regexp)); // my@mail.com, his@site.com.uk

That regexp is not perfect, but it mostly works and helps to catch accidental mistypes. The only truly reliable check for an email address is to send a letter to it.

Parentheses contents in the match

Parentheses are numbered from left to right. The search engine memorizes the content matched by each of them and makes it available in the result.

The method str.match(regexp), if regexp has no flag g, looks for the first match and returns it as an array:

  1. At index 0: the full match.
  2. At index 1: the contents of the first parentheses.
  3. At index 2: the contents of the second parentheses.
  4. …and so on…

For instance, we’d like to find HTML tags <.*?>, and process them. It would be convenient to have tag content (what’s inside the angles), in a separate variable.

Let’s wrap the inner content into parentheses, like this: <(.*?)>.

Now we’ll get both the tag as a whole <h1> and its contents h1 in the resulting array:

let str = '<h1>Hello, world!</h1>';

let tag = str.match(/<(.*?)>/);

alert( tag[0] ); // <h1>
alert( tag[1] ); // h1

Nested groups

Parentheses can be nested. In this case the numbering also goes from left to right.

For instance, when searching a tag in <span class="my"> we may be interested in:

  1. The tag content as a whole: span class="my".
  2. The tag name: span.
  3. The tag attributes: class="my".

Let’s add parentheses for them: <(([a-z]+)\s*([^>]*))>.

Here’s how they are numbered (left to right, by the opening paren): group 1 is (([a-z]+)\s*([^>]*)), group 2 is ([a-z]+), and group 3 is ([^>]*).

In action:

let str = '<span class="my">';

let regexp = /<(([a-z]+)\s*([^>]*))>/;

let result = str.match(regexp);
alert(result[0]); // <span class="my">
alert(result[1]); // span class="my"
alert(result[2]); // span
alert(result[3]); // class="my"

The zero index of result always holds the full match.

Then groups, numbered from left to right by an opening paren. The first group is returned as result[1]. Here it encloses the whole tag content.

Then in result[2] goes the group from the second opening paren ([a-z]+), the tag name, and then in result[3] the tag attributes: ([^>]*).

The contents of every group in the string: group 1 holds span class="my", group 2 holds span, and group 3 holds class="my".

Optional groups

Even if a group is optional and doesn’t exist in the match (e.g. has the quantifier (...)?), the corresponding result array item is present and equals undefined.

For instance, let’s consider the regexp a(z)?(c)?. It looks for "a" optionally followed by "z" optionally followed by "c".

If we run it on the string with a single letter a, then the result is:

let match = 'a'.match(/a(z)?(c)?/);

alert( match.length ); // 3
alert( match[0] ); // a (whole match)
alert( match[1] ); // undefined
alert( match[2] ); // undefined

The array has the length of 3, but all groups are empty.

And here’s a more complex match for the string ac:

let match = 'ac'.match(/a(z)?(c)?/)

alert( match.length ); // 3
alert( match[0] ); // ac (whole match)
alert( match[1] ); // undefined, because there's nothing for (z)?
alert( match[2] ); // c

The array length stays the same: 3. But there’s nothing for the group (z)?, so the result is ["ac", undefined, "c"].
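
Because an unmatched optional group is undefined, code that reads it may want a fallback. One possible approach, shown here only as a sketch, is array destructuring with default values; the placeholder '(none)' and the variable names are assumptions made for the example.

```javascript
// Defaults in destructuring apply exactly when the item is undefined,
// which is what an unmatched optional group produces.
const [whole, z = '(none)', c = '(none)'] = 'ac'.match(/a(z)?(c)?/);

console.log(whole); // ac
console.log(z);     // (none)
console.log(c);     // c
```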

Searching for all matches with groups: matchAll

matchAll is a new method, polyfill may be needed

The method matchAll is not supported in old browsers.

A polyfill may be required, such as https://github.com/ljharb/String.prototype.matchAll.

When we search for all matches (flag g), the match method does not return contents for groups.

For example, let’s find all tags in a string:

let str = '<h1> <h2>';

let tags = str.match(/<(.*?)>/g);

alert( tags ); // <h1>,<h2>

The result is an array of matches, but without details about each of them. But in practice we usually need contents of capturing groups in the result.

To get them, we should search using the method str.matchAll(regexp).

It was added to the JavaScript language long after match, as its “new and improved version”.

Just like match, it looks for matches, but there are 3 differences:

  1. It returns an iterable object rather than an array.
  2. When the flag g is present, it returns every match as an array with groups.
  3. If there are no matches, it returns an empty iterable object rather than null.

For instance:

let results = '<h1> <h2>'.matchAll(/<(.*?)>/gi);

// results - is not an array, but an iterable object
alert(results); // [object RegExp String Iterator]

alert(results[0]); // undefined (*)

results = Array.from(results); // let's turn it into array

alert(results[0]); // <h1>,h1 (1st tag)
alert(results[1]); // <h2>,h2 (2nd tag)

As we can see, the first difference is very important, as demonstrated in the line (*). We can’t get the match as results[0], because that object isn’t a pseudoarray. We can turn it into a real Array using Array.from. There are more details about pseudoarrays and iterables in the article Iterables.

There’s no need for Array.from if we’re looping over results:

let results = '<h1> <h2>'.matchAll(/<(.*?)>/gi);

for(let result of results) {
  alert(result);
  // first alert: <h1>,h1
  // second: <h2>,h2
}

…Or using destructuring:

let [tag1, tag2] = '<h1> <h2>'.matchAll(/<(.*?)>/gi);
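
The third difference listed earlier (an empty iterable instead of null) can be checked directly. A minimal sketch, with an illustrative input string and variable names:

```javascript
const text = 'no tags here';

// match returns null when nothing is found...
const viaMatch = text.match(/<(.*?)>/g);

// ...while matchAll returns an empty iterable, so looping over it is always safe.
const viaMatchAll = Array.from(text.matchAll(/<(.*?)>/g));

console.log(viaMatch);           // null
console.log(viaMatchAll.length); // 0
```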

Every match, returned by matchAll, has the same format as returned by match without flag g: it’s an array with additional properties index (match index in the string) and input (source string):

let results = '<h1> <h2>'.matchAll(/<(.*?)>/gi);

let [tag1, tag2] = results;

alert( tag1[0] ); // <h1>
alert( tag1[1] ); // h1
alert( tag1.index ); // 0
alert( tag1.input ); // <h1> <h2>

Why is a result of matchAll an iterable object, not an array?

Why is the method designed like that? The reason is simple: optimization.

The call to matchAll does not perform the search. Instead, it returns an iterable object, without the results initially. The search is performed each time we iterate over it, e.g. in the loop.

So only as many results are found as needed, no more.

E.g. there are potentially 100 matches in the text, but in a for..of loop we found 5 of them, decided that’s enough, and made a break. Then the engine won’t spend time finding the other 95 matches.
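
The early-exit pattern described above might look like the following sketch. The string and the limit of two matches are arbitrary choices for illustration; the snippet shows the loop shape and does not measure the work the engine skips.

```javascript
const manyTags = '<a> <b> <c> <d> <e>';
const firstTwo = [];

for (const m of manyTags.matchAll(/<(.*?)>/g)) {
  firstTwo.push(m[1]); // m[1] is the content of the first capturing group
  if (firstTwo.length === 2) break; // stop early, the iterator is not advanced further
}

console.log(firstTwo); // [ 'a', 'b' ]
```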

Named groups

Remembering groups by their numbers is hard. For simple patterns it’s doable, but for more complex ones counting parentheses is inconvenient. We have a much better option: give names to parentheses.

That’s done by putting ?<name> immediately after the opening paren.

For example, let’s look for a date in the format “year-month-day”:

let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
let str = "2019-04-30";

let groups = str.match(dateRegexp).groups;

alert(groups.year); // 2019
alert(groups.month); // 04
alert(groups.day); // 30

As you can see, the groups reside in the .groups property of the match.
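
Named groups are still numbered like ordinary ones, so the same contents remain available by index. A small sketch (the variable name dateMatch is illustrative):

```javascript
const dateMatch = '2019-04-30'.match(
  /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/
);

console.log(dateMatch[1]);          // 2019
console.log(dateMatch.groups.year); // 2019
console.log(dateMatch[3] === dateMatch.groups.day); // true
```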

To look for all dates, we can add flag g.

We’ll also need matchAll to obtain full matches, together with groups:

let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;

let str = "2019-10-30 2020-01-01";

let results = str.matchAll(dateRegexp);

for(let result of results) {
  let {year, month, day} = result.groups;

  alert(`${day}.${month}.${year}`);
  // first alert: 30.10.2019
  // second: 01.01.2020
}

Capturing groups in replacement

The method str.replace(regexp, replacement), which replaces matches of regexp in str (all of them if the flag g is set, otherwise only the first), allows us to use parentheses contents in the replacement string. That’s done using $n, where n is the group number.

For example,

let str = "John Bull";
let regexp = /(\w+) (\w+)/;

alert( str.replace(regexp, '$2, $1') ); // Bull, John

For named parentheses the reference will be $<name>.

For example, let’s reformat dates from “year-month-day” to “day.month.year”:

let regexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;

let str = "2019-10-30, 2020-01-01";

alert( str.replace(regexp, '$<day>.$<month>.$<year>') );
// 30.10.2019, 01.01.2020
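
Note that the flag g matters for replace: without it, only the first match is replaced. A short sketch with made-up names:

```javascript
const names = 'John Bull, Ann Lee';

// No flag g: only the first pair is swapped.
const firstOnly = names.replace(/(\w+) (\w+)/, '$2 $1');

// With flag g: every pair is swapped.
const everyPair = names.replace(/(\w+) (\w+)/g, '$2 $1');

console.log(firstOnly); // Bull John, Ann Lee
console.log(everyPair); // Bull John, Lee Ann
```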

Non-capturing groups with ?:

Sometimes we need parentheses to correctly apply a quantifier, but we don’t want their contents in results.

A group may be excluded by adding ?: in the beginning.

For instance, if we want to find (go)+, but don’t want the parentheses contents (go) as a separate array item, we can write: (?:go)+.

In the example below we only get the name John as a separate member of the match:

let str = "Gogogo John!";

// ?: excludes 'go' from capturing
let regexp = /(?:go)+ (\w+)/i;

let result = str.match(regexp);

alert( result[0] ); // Gogogo John (full match)
alert( result[1] ); // John
alert( result.length ); // 2 (no more items in the array)
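
Since a non-capturing group is skipped in the numbering, $1 in a replacement string refers to the first capturing group that follows it. A minimal sketch (the replacement text is invented for the example):

```javascript
const greeting = 'Gogogo John!';

// (?:go)+ takes no number, so $1 is the name captured by (\w+).
const swapped = greeting.replace(/(?:go)+ (\w+)/i, '$1 says go');

console.log(swapped); // John says go!
```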

Summary

Parentheses group together a part of the regular expression, so that the quantifier applies to it as a whole.

Parentheses groups are numbered left-to-right, and can optionally be named with (?<name>...).

The content, matched by a group, can be obtained in the results:

  • The method str.match returns capturing groups only without flag g.
  • The method str.matchAll always returns capturing groups.

If the parentheses have no name, then their contents are available in the match array by number. Named parentheses are also available in the property groups.

We can also use parentheses contents in the replacement string in str.replace: by the number $n or the name $<name>.

A group may be excluded from numbering by adding ?: at its start. That’s used when we need to apply a quantifier to the whole group, but don’t want it as a separate item in the results array. We also can’t reference such parentheses in the replacement string.

Tasks

The MAC address of a network interface consists of 6 two-digit hexadecimal numbers separated by a colon.

For instance: '01:32:54:67:89:AB'

Write a regexp that checks whether a string is a MAC address.

Usage:

let reg = /your regexp/;

alert( reg.test('01:32:54:67:89:AB') ); // true

alert( reg.test('0132546789AB') ); // false (no colons)

alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6)

alert( reg.test('01:32:54:67:89:ZZ') ) // false (ZZ is not valid hex)

A two-digit hex number is [0-9a-f]{2} (assuming the flag i is set).

We need one number of the form NN, followed by five more of the form :NN.

The regexp is: [0-9a-f]{2}(:[0-9a-f]{2}){5}

Now let’s make sure the pattern matches the whole text: it should start at ^ and end at $. That’s done by wrapping the pattern in ^...$.

Finally:

let reg = /^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i;

alert( reg.test('01:32:54:67:89:AB') ); // true

alert( reg.test('0132546789AB') ); // false (no colons)

alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6)

alert( reg.test('01:32:54:67:89:ZZ') ) // false (ZZ is not valid hex)

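Write a regexp that matches colors in the format #abc or #abcdef. That is: # followed by 3 or 6 hexadecimal digits.

Usage example:

let reg = /your regexp/g;

let str = "color: #3f3; background-color: #AA00ef; and: #abcd";

alert( str.match(reg) ); // #3f3,#AA00ef

P.S. There must be exactly 3 or 6 hex digits. Values such as #abcd should not match.

A regexp to search for the 3-digit color #abc: /#[a-f0-9]{3}/i

We can add exactly 3 more hex digits, no more and no fewer. These 3 digits are optional.

The simplest way is to append them directly: /#[a-f0-9]{3}([a-f0-9]{3})?/i

There is a neater way, though: /#([a-f0-9]{3}){1,2}/i

Here the pattern [a-f0-9]{3} is enclosed in parentheses, and the quantifier {1,2} is applied to it.

In action:

let reg = /#([a-f0-9]{3}){1,2}/gi;

let str = "color: #3f3; background-color: #AA00ef; and: #abcd";

alert( str.match(reg) ); // #3f3,#AA00ef,#abc

There’s a minor problem here: the pattern finds #abc in #abcd. To prevent that, we can add \b to the end:

let reg = /#([a-f0-9]{3}){1,2}\b/gi;

let str = "color: #3f3; background-color: #AA00ef; and: #abcd";

alert( str.match(reg) ); // #3f3,#AA00ef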

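Write a regexp that looks for all numbers, including integers, floating-point numbers and negative numbers.

For example:

let reg = /your regexp/g;

let str = "-1.5 0 2 -123.4.";

alert( str.match(reg) ); // -1.5, 0, 2, -123.4

Recalling the previous task, \d+(\.\d+)? matches a positive number with an optional decimal part.

So we only need to add an optional minus sign - at the beginning:

let reg = /-?\d+(\.\d+)?/g;

let str = "-1.5 0 2 -123.4.";

alert( str.match(reg) );   // -1.5, 0, 2, -123.4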

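An arithmetic expression consists of two numbers and an operator between them. For instance:

  • 1 + 2
  • 1.2 * 3.4
  • -3 / -6
  • -2 - 2

The operator is one of: "+", "-", "*" or "/".

There may be extra spaces at the beginning, at the end, or between the parts.

Write a function parse(expr) that takes an expression and returns an array of three items:

  1. The first number.
  2. The operator.
  3. The second number.

For example:

let [a, op, b] = parse("1.2 * 3.4");

alert(a); // 1.2
alert(op); // *
alert(b); // 3.4

As in the previous task, we use -?\d+(\.\d+)? to match a number.

An operator is [-+*/]. We put - first in the square brackets, because in the middle it would mean a character range, which is not what we want.

Note that in JavaScript a slash / inside /.../ must be escaped.

We need a number, an operator, and then another number, with optional spaces between them.

The full regular expression: -?\d+(\.\d+)?\s*[-+*/]\s*-?\d+(\.\d+)?

To get the result as an array, let’s put parentheses around the data we need, the numbers and the operator: (-?\d+(\.\d+)?)\s*([-+*/])\s*(-?\d+(\.\d+)?)

In action:

let reg = /(-?\d+(\.\d+)?)\s*([-+*\/])\s*(-?\d+(\.\d+)?)/;

alert( "1.2 + 12".match(reg) );

The result includes:

  • result[0] == "1.2 + 12" (full match)
  • result[1] == "1.2" (first group, the first number including the decimal part)
  • result[2] == ".2" (second group, the decimal part of the first number)
  • result[3] == "+" (third group, the operator)
  • result[4] == "12" (fourth group, the second number)
  • result[5] == undefined (fifth group, the last decimal part is absent, so it’s undefined)

We only need the numbers and the operator, without the decimal parts.

So let’s remove the extra capturing groups by adding ?:, for instance: (?:\.\d+)?

The final solution: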

function parse(expr) {
  let reg = /(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/;

  let result = expr.match(reg);

  if (!result) return;
  result.shift();

  return result;
}

alert( parse("-1.23 * 3.45") );  // -1.23, *, 3.45
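
The regexp in parse is not anchored, so it also accepts input with trailing text such as "1 + 2 = 3". If that matters, one option is to anchor the pattern with ^ and $ and allow spaces at both ends. The function below is a hypothetical variant named parseStrict, not part of the original solution; returning null on failure is also an assumption made for the sketch.

```javascript
// Hypothetical stricter variant: the whole string must be a single expression.
function parseStrict(expr) {
  const reg = /^\s*(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)\s*$/;

  const result = expr.match(reg);
  if (!result) return null;

  // Drop the full match and keep the three captured parts.
  return result.slice(1);
}

console.log(parseStrict('  -3 / -6 ')); // [ '-3', '/', '-6' ]
console.log(parseStrict('1 + 2 = 3'));  // null
```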