[Practical] Five minutes, and regular expressions are no longer your trouble

Preface: like a lot of programmers, I got my start with iOS, so I always want to write something about iOS that has a little meaning and value. These days, though, I am busy during the day with a Qt and C++ project, which leaves relatively little time for exploring iOS. Today I want to share some practical material on regular expressions. As programmers, we all know that regular expressions are commonly used for matching and searching content. But I believe few of us are gifted with an extraordinary memory, so most of us only look things up when we need them, and then slowly try and test. If this article seems useful after you read it, please bookmark it and pass it on to any friend who needs it, so that it can save you time when the need arises. That is my original intention in sharing this article.

Hee hee, in fact, five minutes is not enough to read every detail carefully, let alone master it. What I mean is that in five minutes you can skim the article and get a general feel for it. If you like it, bookmark it for later and come back to it as many times as you need.

Tips: if you reprint this article, please include its address. Because the article is laid out in markdown and the expressions contain special characters, there may be errors in places that I did not notice. If you find one, please point it out and I will correct it promptly so that it does not affect other readers. Sorry.


1, Metacharacters

Table 1: Common metacharacters

Code / expression    Explanation
.                    Matches any character except a newline
\w                   Matches a letter, digit, underscore, or Chinese character
\s                   Matches any whitespace character
\d                   Matches a digit
\b                   Matches the beginning or end of a word
^                    Matches the start of the string
$                    Matches the end of the string
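To get a feel for Table 1, here is a minimal sketch using Python's built-in re module (my choice for illustration; the article itself is not tied to any language). The sample text and variable names are made up.

```python
import re

text = "hi there, room 42"

# \b\w+\b: whole words (runs of word characters between word boundaries)
words = re.findall(r"\b\w+\b", text)

# \d+: runs of digits
numbers = re.findall(r"\d+", text)

# ^ and $ anchor a pattern to the start and end of the string
starts_with_hi = re.search(r"^hi\b", text) is not None
ends_with_digits = re.search(r"\d+$", text) is not None

# . matches any character except a newline
dot_matches_newline = re.search(r"a.b", "a\nb") is not None

print(words)     # ['hi', 'there', 'room', '42']
print(numbers)   # ['42']
print(starts_with_hi, ends_with_digits, dot_matches_newline)  # True True False
```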

2, Character escapes

If you want to find a metacharacter itself, for example . or *, there is a problem: you cannot specify them directly, because they will be interpreted as something else. You have to use \ to cancel the special meaning of these characters. Therefore, you should write \. and \*. Of course, to find \ itself, you have to write \\.
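A small sketch of escaping, again assuming Python's re module. The sample string is invented; re.escape is a standard helper that adds the backslashes for you when the text to find comes from somewhere else.

```python
import re

text = r"v1.5 costs 3*5 at C:\temp"

# Unescaped, "." is a wildcard, so the pattern 1.5 also matches "1x5".
wildcard_hit = re.search(r"1.5", "1x5") is not None

# Escaped, each metacharacter matches only itself.
literal_dot = re.findall(r"1\.5", text)
literal_star = re.findall(r"3\*5", text)
literal_backslash = re.findall(r"\\", text)

# re.escape builds the escaped form of arbitrary literal text.
escaped = re.escape("3*5")

print(wildcard_hit, literal_dot, literal_star, literal_backslash, escaped)
```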

3, Repetition

Table 2: Common quantifiers

Code / expression    Explanation
*                    Repeat zero or more times
+                    Repeat one or more times
?                    Repeat zero or one time
{n}                  Repeat exactly n times
{n,}                 Repeat n or more times
{n,m}                Repeat n to m times
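One way to check Table 2 is to run each quantifier against strings of increasing length. This is only a sketch in Python; the helper full_matches and the sample list are hypothetical names for illustration.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]

def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full_matches(r"a*")        # zero or more
plus = full_matches(r"a+")        # one or more
optional = full_matches(r"a?")    # zero or one
exactly2 = full_matches(r"a{2}")  # exactly two
at_least2 = full_matches(r"a{2,}")
two_to_three = full_matches(r"a{2,3}")

for name, hits in [("a*", star), ("a+", plus), ("a?", optional),
                   ("a{2}", exactly2), ("a{2,}", at_least2),
                   ("a{2,3}", two_to_three)]:
    print(name, hits)
```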

4, Character classes

  • [your set]: list the characters you want inside square brackets. [aeiou] matches any one of a, e, i, o, u, and [.?!] matches a punctuation mark (. or ? or !).
  • [0-9]: means exactly the same as \d, a single digit.
  • [a-zA-Z]: represents a single letter. [a-z0-9A-Z_] is equivalent to \w (if only English is considered, of course).
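The character classes above can be tried out like this. A sketch in Python with made-up sample strings; note that on plain ASCII input [0-9] and \d give the same result, which is the case shown here.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
punctuation = re.findall(r"[.?!]", "Really? Yes! Done.")

sample = "room 42, floor 7"
digits_by_range = re.findall(r"[0-9]+", sample)
digits_by_escape = re.findall(r"\d+", sample)

letters = re.findall(r"[a-zA-Z]+", "abc123DEF")

print(vowels)           # ['e', 'u', 'a']
print(punctuation)      # ['?', '!', '.']
print(digits_by_range)  # ['42', '7']
print(letters)          # ['abc', 'DEF']
```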

Example 4-1: \(?0\d{2}[) -]?\d{8}

Analysis: this expression matches telephone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let us analyze it piece by piece: first comes an escaped character \(, which may appear 0 or 1 times (?); then a 0; then two digits (\d{2}); then one of ), -, or a space, which may appear once or not at all (?); and finally eight digits (\d{8}).
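Here is a quick check of Example 4-1, sketched in Python. The name PHONE and the sample list are mine. The last two "true" results show that the pattern is more permissive than it looks: it also accepts numbers with unbalanced parentheses.

```python
import re

PHONE = re.compile(r"\(?0\d{2}[) -]?\d{8}")

samples = [
    "(010)88886666",
    "022-22334455",
    "02912345678",
    "010)12345678",   # missing the opening parenthesis
    "(022-87654321",  # missing the closing parenthesis
    "12345",
]

# fullmatch requires the whole string to fit the pattern
results = {s: PHONE.fullmatch(s) is not None for s in samples}

for number, ok in results.items():
    print(number, ok)
```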

5, Branching conditions

Unfortunately, if you look closely at Example 4-1, you will see that the expression also matches "incorrect" formats such as 010)12345678 or (022-87654321. To solve this problem, we need branching conditions.

A branching condition in a regular expression specifies several rules; text that satisfies any one of them counts as a match. To write one, separate the rules with | (a vertical bar).

Example 5-1: 0\d{2}-\d{8}|0\d{3}-\d{7}

Analysis: this expression matches two kinds of hyphen-separated phone numbers: one with a three-digit area code and an eight-digit local number (e.g. 010-12345678), the other with a four-digit area code and a seven-digit local number (e.g. 0376-2233445). 0\d{2}-\d{8} means "0, two digits, -, eight digits"; 0\d{3}-\d{7} means "0, three digits, -, seven digits"; and | can be read as "or". In other words, it finds text that matches either the former or the latter.

Note: when using branching conditions, pay attention to the order of the branches. The branches are tested from left to right, and once one branch matches, the others are not tried. For example, \d{5}-\d{4}|\d{5} and \d{5}|\d{5}-\d{4} behave differently: against 12345-6789 the first matches the whole string, while the second matches only 12345.
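The effect of branch order is easy to see in a short experiment. This is a sketch in Python; the variable names are invented.

```python
import re

text = "12345-6789"

# Longer alternative first: the whole string is matched.
long_first = re.search(r"\d{5}-\d{4}|\d{5}", text).group()

# Shorter alternative first: it succeeds immediately, so the
# second branch is never tried.
short_first = re.search(r"\d{5}|\d{5}-\d{4}", text).group()

# Example 5-1 accepts both hyphenated formats and rejects the
# unbalanced form that Example 4-1 let through.
AREA = re.compile(r"0\d{2}-\d{8}|0\d{3}-\d{7}")
accepted = [s for s in ["010-12345678", "0376-2233445", "010)12345678"]
            if AREA.fullmatch(s)]

print(long_first)   # 12345-6789
print(short_first)  # 12345
print(accepted)     # ['010-12345678', '0376-2233445']
```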

6, Grouping

We have already seen how to repeat a single character (put a quantifier directly after it), but what if you want to repeat several characters? You can enclose them in parentheses to form a subexpression (also called a group), and then specify the number of repetitions for that subexpression.

Example 6-1: (\d{1,3}\.){3}\d{1,3}

Analysis: this is a simple expression for matching IP addresses. To understand it, read it in this order: \d{1,3} matches one to three digits; (\d{1,3}\.){3} matches one to three digits followed by a period (together these form the group), repeated 3 times; finally \d{1,3} adds one more number of one to three digits.

Unfortunately, it also matches impossible IP addresses such as 256.300.888.999. If arithmetic comparison were available, this problem would be easy to solve, but regular expressions do not provide any mathematical functions, so we can only use longer groups, alternatives, and character classes to describe a correct IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).
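To compare the two IP expressions side by side, here is a minimal sketch in Python. LOOSE and STRICT are names I made up for the simple and the longer pattern.

```python
import re

LOOSE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
STRICT = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

candidates = ["192.168.0.1", "255.255.255.255", "256.300.888.999"]

loose_ok = [ip for ip in candidates if LOOSE.fullmatch(ip)]
strict_ok = [ip for ip in candidates if STRICT.fullmatch(ip)]

print(loose_ok)   # all three, including the impossible address
print(strict_ok)  # only the two valid addresses
```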

7, Negation

Sometimes you need to find characters that do not belong to an easily defined class. For example, you may want any character at all as long as it is not a digit. In that case you need negation.

Table 3: Common negation codes

Code / expression    Explanation
\W                   Matches any character that is not a letter, digit, underscore, or Chinese character
\S                   Matches any character that is not whitespace
\D                   Matches any non-digit character
\B                   Matches a position that is not the beginning or end of a word
[^x]                 Matches any character other than x
[^aeiou]             Matches any character other than a, e, i, o, u
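A few of the negated codes from Table 3 in action. This is a sketch in Python with made-up inputs; start() reports where in the string the match begins.

```python
import re

non_digits = re.findall(r"\D+", "tel: 555-1234")   # runs of non-digits
non_space = re.findall(r"\S+", "a  b\tc")          # runs of non-whitespace
word_only = re.sub(r"\W", "", "a-b_c!")            # drop non-word characters
vowels_only = re.sub(r"[^aeiou]", "", "regular expression")

text = "never verb"
at_boundary = re.search(r"er\b", text).start()  # the "er" that ends "never"
inside_word = re.search(r"er\B", text).start()  # the "er" inside "verb"

print(non_digits)   # ['tel: ', '-']
print(non_space)    # ['a', 'b', 'c']
print(word_only)    # ab_c
print(vowels_only)  # euaeeio
print(at_boundary, inside_word)  # 3 7
```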

8, Backreferences

When a subexpression is specified with parentheses, the text that matches it (that is, the content captured by the group) can be processed further in the expression or in other programs. By default, each group automatically gets a group number. The rule is: scanning from left to right and using each group's opening parenthesis as the marker, the first group to appear is number 1, the second is number 2, and so on.

A backreference repeats the text matched by an earlier group. For example, \1 stands for the text matched by group 1.

Example 8-1: \b(\w+)\b\s+\1\b

Analysis: this matches repeated words such as go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between a word start and a word end (\b(\w+)\b); this word is captured into group 1. Then come one or more whitespace characters (\s+), and finally the content captured by group 1, that is, the word just matched (\1).
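Example 8-1 can be used both to find repeated words and to remove them. A minimal sketch in Python; the sample sentence is invented, and in the replacement string \1 again refers to group 1.

```python
import re

DOUBLED = re.compile(r"\b(\w+)\b\s+\1\b")

text = "go go to see kitty kitty now"

# findall returns the content of group 1 for each match
repeated = DOUBLED.findall(text)

# Replace each doubled word with a single copy of it
cleaned = DOUBLED.sub(r"\1", text)

print(repeated)  # ['go', 'kitty']
print(cleaned)   # go to see kitty now
```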

Table 4: Common grouping syntax

Category                Code / syntax    Explanation
Capture                 (exp)            Matches exp and captures the text into an automatically numbered group
                        (?<name>exp)     Matches exp and captures the text into a group named name; can also be written (?'name'exp)
                        (?:exp)          Matches exp without capturing the matched text or assigning a group number
Zero-width assertion    (?=exp)          Matches the position before exp
                        (?<=exp)         Matches the position after exp
                        (?!exp)          Matches a position that is not followed by exp
                        (?<!exp)         Matches a position that is not preceded by exp
Comment                 (?#comment)      Has no effect on how the regular expression is processed; it provides a comment for people to read
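The entries in Table 4 can be tried one by one. The sketch below uses Python's re module, which is my assumption, and one dialect difference matters: Python writes a named group as (?P<name>exp) rather than (?<name>exp). All sample strings are made up.

```python
import re

# Named group. Python spells it (?P<name>exp) rather than (?<name>exp).
m = re.search(r"(?P<area>0\d{2})-(?P<number>\d{8})", "call 010-12345678")
area, number = m.group("area"), m.group("number")

# Non-capturing group: findall returns whole matches because nothing is captured.
industries = re.findall(r"industr(?:y|ies)", "industry and industries")

# (?=exp): the stem of each word that ends in "ing".
stems = re.findall(r"\b\w+(?=ing\b)", "singing and dancing")

# (?<=exp): digits that come right after a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $30 or 40")

# (?!exp): three digits that are not followed by another digit.
tails = re.findall(r"\d{3}(?!\d)", "12345 678")

# (?<!exp): seven digits that are not preceded by a lowercase letter.
sevens = re.findall(r"(?<![a-z])\d{7}", "abc1234567 7654321")

# (?#comment): ignored by the engine.
commented = re.search(r"\d+(?#one or more digits)", "a1").group()

print(area, number, industries, stems, prices, tails, sevens, commented)
```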

9, Greedy and lazy

When a regular expression contains a quantifier that accepts repetition, the usual behavior is to match as many characters as possible while still allowing the whole expression to match. Take the expression a.*b: it matches the longest string that starts with a and ends with b. If you use it to search aabab, it matches the entire string aabab. This is called greedy matching.

Sometimes we need lazy matching instead, that is, matching as few characters as possible. Any of the quantifiers above can be turned into its lazy form by putting a question mark ? after it. So .*? means: repeat any number of times, but use the fewest repetitions that still let the whole match succeed. Now look at the lazy version of the example:

a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab (the first to third characters) and ab (the fourth to fifth characters).

Note: why is the first match aab rather than ab? Because the match that starts earliest has the highest priority, and this rule takes precedence over laziness.
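The aabab example can be reproduced directly. A small sketch in Python; the variable names are mine.

```python
import re

text = "aabab"

greedy = re.findall(r"a.*b", text)   # as much as possible
lazy = re.findall(r"a.*?b", text)    # as little as possible

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
```

The lazy result starts with aab rather than ab because the engine reports the match that begins earliest, and only then makes it as short as it can.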

Table 5: Lazy quantifiers

Code / expression    Explanation
*?                   Repeat any number of times, but as few as possible
+?                   Repeat 1 or more times, but as few as possible
??                   Repeat 0 or 1 times, but as few as possible
{n,m}?               Repeat n to m times, but as few as possible
{n,}?                Repeat n or more times, but as few as possible
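Each lazy quantifier in Table 5 can be compared with its greedy twin. A sketch in Python; first_match is a hypothetical helper that returns what the pattern grabs at the start of the string.

```python
import re

def first_match(pattern, text="aaaa"):
    """Return the text matched at the start of the string."""
    return re.match(pattern, text).group()

greedy_range = first_match(r"a{2,4}")   # takes as many as allowed
lazy_range = first_match(r"a{2,4}?")    # stops at the minimum
lazy_open = first_match(r"a{2,}?")
lazy_plus = first_match(r"a+?")
lazy_optional = first_match(r"a??")     # prefers zero repetitions

# A common practical use: matching one tag at a time
html = "<b>bold</b>"
greedy_tags = re.findall(r"<.+>", html)
lazy_tags = re.findall(r"<.+?>", html)

print(greedy_range, lazy_range, lazy_open, lazy_plus, repr(lazy_optional))
print(greedy_tags)  # ['<b>bold</b>']
print(lazy_tags)    # ['<b>', '</b>']
```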

Note: this article draws on the article "Regular Expressions 30-Minute Tutorial"; for more details, please see the original.


Appendix table:

Code / expression    Explanation
\            Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^            Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$            Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*            Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". Equivalent to {0,}.
+            Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". Equivalent to {1,}.
?            Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". Equivalent to {0,1}.
{n}          n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}         n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}        m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
?            When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.            Matches any single character except "\n". To match any character including '\n', use a pattern like '[\s\S]'.
[xyz]        Character set. Matches any one of the characters it contains. For example, '[abc]' matches the 'a' in "plain".
[^xyz]       Negated character set. Matches any character not contained. For example, '[^abc]' matches the 'p', 'l', 'i', 'n' in "plain".
[a-z]        Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' to 'z'.
[^a-z]       Negated character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b           Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B           Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx          Matches the control character specified by x. For example, \cM matches a Control-M or carriage return. The value of x must be in A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d           Matches a digit character. Equivalent to [0-9].
\D           Matches a non-digit character. Equivalent to [^0-9].
\f           Matches a form feed. Equivalent to \x0c and \cL.
\n           Matches a newline character. Equivalent to \x0a and \cJ.
\r           Matches a carriage return. Equivalent to \x0d and \cM.
\s           Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S           Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t           Matches a tab. Equivalent to \x09 and \cI.
\v           Matches a vertical tab. Equivalent to \x0b and \cK.
\w           Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W           Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn          Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num         Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n           Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm          Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml         If n is an octal digit (0-3) and m and l are both octal digits (0-7), matches the octal escape value nml.
\un          Matches n, where n is a Unicode character represented by four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).

(pattern)    Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)  Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more concise expression than 'industry|industries'.
(?=pattern)  Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches the "Windows" in "Windows 2000", but not the "Windows" in "Windows 3.1". A lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches the "Windows" in "Windows 3.1", but not the "Windows" in "Windows 2000". A lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y          Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
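Several rows of the appendix table can be verified with a short script. The table describes the JScript/VBScript RegExp object; the sketch below assumes Python's re module instead, where the counterpart of the Multiline property is the re.MULTILINE flag. Variable names are made up.

```python
import re

# Multiline: with re.MULTILINE, ^ also matches right after a line break.
lines = "one\ntwo"
first_only = re.findall(r"^\w+", lines)
every_line = re.findall(r"^\w+", lines, re.MULTILINE)

# . does not match "\n"; [\s\S] matches any character at all.
dot = re.search(r"a.b", "a\nb")
any_char = re.search(r"a[\s\S]b", "a\nb")

# Positive and negative lookahead, using the examples from the table.
positive = re.compile(r"Windows (?=95|98|NT|2000)")
negative = re.compile(r"Windows (?!95|98|NT|2000)")
pos_hit = positive.search("Windows 2000")
pos_miss = positive.search("Windows 3.1")
neg_hit = negative.search("Windows 3.1")
neg_miss = negative.search("Windows 2000")

# Hexadecimal and Unicode escapes.
hex_a = re.fullmatch(r"\x41", "A")
copyright_sign = re.fullmatch(r"\u00A9", "\u00a9")

# Alternation inside a group.
ood = re.findall(r"(?:z|f)ood", "zood food good")

print(first_only, every_line)
print(dot, any_char.group() == "a\nb")
print(pos_hit.group(), pos_miss, neg_hit.group(), neg_miss)
print(ood)
```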


Username: ^[a-z0-9_-]{3,16}$
Password: ^[a-z0-9_-]{6,18}$
Hexadecimal value: ^#?([a-f0-9]{6}|[a-f0-9]{3})$
E-mail: ^([a-z0-9_.-]+)@([\da-z.-]+)\.([a-z.]{2,6})$
URL: ^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w.-]*)*/?$
IP address: ^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
HTML tag: ^<([a-z]+)([^<]+)*(?:>(.*)</\1>|\s+/>)$
Chinese characters in the Unicode encoding range: ^[\u4e00-\u9fa5]{0,}$

Regular expression matching Chinese characters: [\u4e00-\u9fa5]
Commentary: matching Chinese really is a headache; with this expression it becomes easier.

Double-byte characters (including Chinese characters): [^\x00-\xff]
Commentary: can be used to calculate the length of a string (a double-byte character counts as 2, an ASCII character as 1).

Regular expression matching blank lines: \n\s*\r
Commentary: can be used to delete blank lines.

Regular expression matching HTML tags: <(\S*?)[^>]*>.*?</\1>|<.*? />
Commentary: the versions circulating online are too poor; this one also matches only part of the cases and is still powerless against complex nested tags.

Leading and trailing whitespace: ^\s*|\s*$
Commentary: can be used to remove whitespace (including spaces, tabs, form feeds, and so on) at the beginning and end of a line; a very useful expression.

Regular expression matching e-mail addresses: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
Commentary: very useful for form validation.

Regular expression matching URLs: [a-zA-Z]+://[^\s]*
Commentary: the versions circulating online are very limited in function; this one basically meets the need.

Whether an account name is legal (starts with a letter, 5-16 bytes, letters, digits, and underscores allowed): ^[a-zA-Z][a-zA-Z0-9_]{4,15}$
Commentary: very useful for form validation.

Domestic telephone numbers: \d{3}-\d{8}|\d{4}-\d{7}
Commentary: matches forms such as 0511-4405222 or 021-87888822.

Tencent QQ numbers: [1-9][0-9]{4,}
Commentary: Tencent QQ numbers start from 10000.

Chinese postal codes: [1-9]\d{5}(?!\d)
Commentary: Chinese postal codes are 6 digits.

ID card numbers: \d{15}|\d{18}
Commentary: Chinese ID card numbers are 15 or 18 digits.

IP addresses: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
Commentary: useful for extracting IP addresses.

Matching specific numbers:
^[1-9]\d*$    // positive integer
^-[1-9]\d*$    // negative integer
^-?[1-9]\d*$    // integer
^[1-9]\d*|0$    // non-negative integer (positive integer + 0)
^-[1-9]\d*|0$    // non-positive integer (negative integer + 0)
^[1-9]\d*\.\d*|0\.\d*[1-9]\d*$    // positive floating-point number
^-([1-9]\d*\.\d*|0\.\d*[1-9]\d*)$    // negative floating-point number
^-?([1-9]\d*\.\d*|0\.\d*[1-9]\d*|0?\.0+|0)$    // floating-point number
^[1-9]\d*\.\d*|0\.\d*[1-9]\d*|0?\.0+|0$    // non-negative floating-point number (positive float + 0)
^(-([1-9]\d*\.\d*|0\.\d*[1-9]\d*))|0?\.0+|0$    // non-positive floating-point number (negative float + 0)
Commentary: useful when working with large amounts of data; remember to adjust them for your specific application.

Matching specific strings:
^[A-Za-z]+$    // string consisting of the 26 English letters
^[A-Z]+$    // string consisting of the 26 uppercase English letters
^[a-z]+$    // string consisting of the 26 lowercase English letters
^[A-Za-z0-9]+$    // string consisting of digits and the 26 English letters
^\w+$    // string consisting of digits, the 26 English letters, or underscores
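As a rough sketch of how a few of these expressions could be wired into form validation, here is a small Python example. The PATTERNS table, the is_valid helper, and the test values are all hypothetical. The second half illustrates why the numeric patterns may need adjusting: in ^[1-9]\d*|0$ the | splits the whole expression, so ^ applies only to the first branch and $ only to the second. Wrapping the alternatives in a group is one possible fix.

```python
import re

PATTERNS = {
    "username": r"^[a-z0-9_-]{3,16}$",
    "hex_value": r"^#?([a-f0-9]{6}|[a-f0-9]{3})$",
    "email": r"^([a-z0-9_.-]+)@([\da-z.-]+)\.([a-z.]{2,6})$",
    "account": r"^[a-zA-Z][a-zA-Z0-9_]{4,15}$",
}

def is_valid(kind, value):
    """Check a value against one of the patterns above."""
    return re.search(PATTERNS[kind], value) is not None

checks = {
    ("username", "my-us3r_n4m3"): is_valid("username", "my-us3r_n4m3"),
    ("username", "ab"): is_valid("username", "ab"),
    ("hex_value", "#a3c113"): is_valid("hex_value", "#a3c113"),
    ("hex_value", "#4d82h4"): is_valid("hex_value", "#4d82h4"),
    ("email", "john@doe.com"): is_valid("email", "john@doe.com"),
    ("email", "john@doe.something"): is_valid("email", "john@doe.something"),
    ("account", "a1234"): is_valid("account", "a1234"),
    ("account", "1abcd"): is_valid("account", "1abcd"),
}

# Anchors and alternation: the pattern as listed versus a grouped variant.
as_listed = r"^[1-9]\d*|0$"
grouped = r"^(?:[1-9]\d*|0)$"
listed_accepts = [s for s in ["120", "0", "abc0", "12abc"]
                  if re.search(as_listed, s)]
grouped_accepts = [s for s in ["120", "0", "abc0", "12abc"]
                   if re.search(grouped, s)]

for key, ok in checks.items():
    print(key, ok)
print(listed_accepts)   # ['120', '0', 'abc0', '12abc']
print(grouped_accepts)  # ['120', '0']
```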