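A regular expression is a pattern template that you define for a Linux utility to use when filtering text. A Linux utility (such as sed or gawk) matches the regular expression pattern against the data as it reads that data. If the data matches the pattern, it is accepted for processing. If the data does not match the pattern, it is rejected.
A regular expression pattern uses metacharacters to represent one or more characters in the data stream.
Although the Linux world has many different regular expression engines, the two most popular are the POSIX Basic Regular Expression (BRE) engine and the POSIX Extended Regular Expression (ERE) engine.
Most Linux utilities conform at least to the POSIX BRE engine specification and recognize all the pattern symbols it defines. Some utilities (such as sed) conform only to a subset of the BRE specification. This is a trade-off for speed, because sed tries to process the text in the data stream as quickly as possible.
The POSIX ERE engine is often found in programming languages that rely on regular expressions for text filtering. It provides advanced pattern symbols as well as special symbols for common patterns, such as digits, words, and alphanumeric characters. gawk uses the ERE engine to process its regular expressions.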
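One way to see the difference between the two engines is to feed the same pattern to sed in both modes. This is a minimal sketch that assumes a sed accepting the `-E` option (GNU and BSD sed both do) to switch from BRE to ERE; the helper names `match_bre` and `match_ere` are made up for illustration.

```shell
# Hypothetical helpers: print the input line ($1) only if it matches the pattern ($2).
match_bre() { printf '%s\n' "$1" | sed -n "/$2/p"; }      # default BRE mode
match_ere() { printf '%s\n' "$1" | sed -E -n "/$2/p"; }   # ERE mode via -E (assumed available)

match_bre "beet" 'be+t'    # BRE: + is an ordinary character, so nothing prints
match_bre "be+t" 'be+t'    # prints: be+t
match_ere "beet" 'be+t'    # ERE: + means "one or more", prints: beet
```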
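The most basic BRE pattern matches literal text characters in the data stream.
A regular expression does not care where in the data stream the pattern occurs, nor how many times it occurs. As soon as the pattern matches anywhere in the text string, the regular expression passes the string back to the Linux utility.
Regular expression patterns are case sensitive.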
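For instance, a lowercase pattern does not match text that differs only in case. A minimal sketch with an invented sample line:

```shell
# The pattern "this" is all lowercase, so it does not match "This".
echo "This is a test" | sed -n '/this/p'    # prints nothing
echo "This is a test" | sed -n '/This/p'    # prints: This is a test
```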
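You do not have to spell out whole words in the regular expression. As long as the defined text appears in the data stream, the regular expression matches.
Nor are you limited to single text words in the regular expression; spaces and numbers are allowed too.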
$ echo "This is line number 1" | sed -n '/ber 1/p'
This is line number 1
$
$ echo "This is line number1" | sed -n '/ber 1/p'
$
$ cat data1
This is a normal line of text.
This is a line with too  many spaces.
$ sed -n '/  /p' data1
This is a line with too many spaces.
$
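Regular expression patterns assign a special meaning to the following characters:
.*[]^${}\+?|()
To use one of them as a literal text character, escape it with a backslash (\).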
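The escaping rule can be sketched with sed as follows. The sample strings are invented, and `printf` is used instead of `echo` only so that the backslash in the last example reaches sed unchanged.

```shell
# An unescaped $ is the end-of-line anchor, so it matches every line.
printf '%s\n' 'no dollar sign here' | sed -n '/$/p'         # prints: no dollar sign here

# Escaped with a backslash, $ matches only a literal dollar sign.
printf '%s\n' 'The cost is $4.00' | sed -n '/\$/p'          # prints: The cost is $4.00
printf '%s\n' 'no dollar sign here' | sed -n '/\$/p'        # prints nothing

# The / delimiter and the backslash itself need escaping too.
printf '%s\n' '3 / 2' | sed -n '/\//p'                      # prints: 3 / 2
printf '%s\n' '\ is a special character' | sed -n '/\\/p'   # prints: \ is a special character
```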
Anchoring the start of a line
$ echo "The book store" | sed -n '/^book/p'
$
$ echo "Books are great" | sed -n '/^Book/p'
Books are great
$
Anchoring the end of a line
$ echo "This is a good book" | sed -n '/book$/p'
This is a good book
$ echo "This book is good" | sed -n '/book$/p'
$
$ echo "There are a lot of good books" | sed -n '/book$/p'
$
Combining anchors
$ cat data4
this is a test of using both anchors
I said this is a test
this is a test
I'm sure this is a test.
$ sed -n '/^this is a test$/p' data4
this is a test
$
$ cat data5
This is one test line.

This is another test line.
$ sed '/^$/d' data5
This is one test line.
This is another test line.
$
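The `^$` pattern matches a line with nothing between its start and its end. The sketch below uses a hypothetical `sample` helper to generate three lines with an empty one in the middle, then locates and removes that line; the `=` command asks sed to print the line number of each match.

```shell
# Hypothetical helper that emits two text lines separated by an empty line.
sample() { printf 'This is one test line.\n\nThis is another test line.\n'; }

sample | sed -n '/^$/='    # prints the line number of each empty line: 2
sample | sed '/^$/d'       # prints the two text lines with the empty line removed
```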
$ sed -n '/[ch]at/p' data6
The cat is sleeping.
That is a very nice hat.
$
$ echo "Yes" | sed -n '/[Yy]es/p'
Yes
$ echo "Yes" | sed -n '/[Yy][Ee][Ss]/p'
Yes
$ sed -n '/[0123]/p' data7
$ cat data8
60633
46201
223001
4353
22203
$ sed -n '
> /[0123456789][0123456789][0123456789][0123456789][0123456789]/p
> ' data8
60633
46201
223001
22203
$
$ sed -n '
> /^[0123456789][0123456789][0123456789][0123456789][0123456789]$/p
> ' data8
60633
46201
22203
$
$ sed -n '/[^ch]at/p' data6
This test is at line four.
$
$ sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p' data8
60633
46201
22203
$
$ sed -n '/[a-ch-m]at/p' data6
The cat is sleeping.
That is a very nice hat.
$
$ echo "I'm getting too fat." | sed -n '/[a-ch-m]at/p'
$
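Ranges also work inside a negated character class. The sketch below, using made-up sample values, prints only the lines that contain at least one character outside the range 0-9:

```shell
# [^0-9] matches any single character that is not a digit.
printf '%s\n' 60633 4620a 22203 | sed -n '/[^0-9]/p'    # prints: 4620a
```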
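In addition to letting you define your own character classes, the BRE engine provides special character classes that match specific types of characters.
Character class | Description |
---|---|
[[:alpha:]] | Matches any alphabetical character, uppercase or lowercase |
[[:alnum:]] | Matches any alphanumeric character: 0-9, A-Z, or a-z |
[[:blank:]] | Matches a space or tab character |
[[:digit:]] | Matches a numerical digit from 0 through 9 |
[[:lower:]] | Matches any lowercase alphabetical character a-z |
[[:print:]] | Matches any printable character |
[[:punct:]] | Matches a punctuation character |
[[:space:]] | Matches any whitespace character: space, tab, newline, form feed, vertical tab, and carriage return |
[[:upper:]] | Matches any uppercase alphabetical character A-Z |
Special character classes are used in a regular expression just like ordinary character classes: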
$ echo "abc" | sed -n '/[[:digit:]]/p'
$
$ echo "abc" | sed -n '/[[:alpha:]]/p'
abc
$ echo "abc123" | sed -n '/[[:digit:]]/p'
abc123
$ echo "This is, a test" | sed -n '/[[:punct:]]/p'
This is, a test
$ echo "This is a test" | sed -n '/[[:punct:]]/p'
$
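Special character classes can also be combined with anchors, or with other characters inside the same pair of brackets. A brief sketch with invented sample text:

```shell
# Lines that start with an uppercase letter.
printf '%s\n' "Hello world" "hello world" | sed -n '/^[[:upper:]]/p'    # prints: Hello world

# Lines made up only of letters, digits, and underscores.
printf '%s\n' "user_42" "user 42" | sed -n '/^[[:alnum:]_]*$/p'         # prints: user_42
```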
$ echo "I'm getting a color TV" | sed -n '/colou*r/p'
I'm getting a color TV
$ echo "I'm getting a colour TV" | sed -n '/colou*r/p'
I'm getting a colour TV
$
$ echo "this is a regular pattern expression" | sed -n '
> /regular.*expression/p'
this is a regular pattern expression
$
$ echo "bt" | sed -n '/b[ae]*t/p'
bt
$ echo "bat" | sed -n '/b[ae]*t/p'
bat
$ echo "bet" | sed -n '/b[ae]*t/p'
bet
$ echo "btt" | sed -n '/b[ae]*t/p'
btt
$ echo "baat" | sed -n '/b[ae]*t/p'
baat
$ echo "baaeeet" | sed -n '/b[ae]*t/p'
baaeeet
$ echo "baeeaeeat" | sed -n '/b[ae]*t/p'
baeeaeeat
$ echo "baakeeet" | sed -n '/b[ae]*t/p'
$
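Because the asterisk allows zero occurrences, a pattern that consists only of a starred item matches every line, including lines that do not contain the character at all. A small sketch of that pitfall and one way around it in BRE:

```shell
# z* can match the empty string, so the line prints even though it has no z.
echo "abc" | sed -n '/z*/p'      # prints: abc

# One literal z followed by z* gives the intended "one or more" meaning.
echo "abc" | sed -n '/zz*/p'     # prints nothing
echo "buzz" | sed -n '/zz*/p'    # prints: buzz
```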
$ echo "bt" | gawk '/b[ae]?t/{print $0}'
bt
$ echo "beeet" | gawk '/be+t/{print $0}'
beeet
$ echo "bt" | gawk '/b[ae]+t/{print $0}'
$
$ echo "bt" | gawk --re-interval '/be{1}t/{print $0}'
$
$ echo "beet" | gawk --re-interval '/be{1,2}t/{print $0}'
beet
$ echo "bat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
bat
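An interval can also be checked against several candidate strings at once. The sketch below uses `grep -E` as a stand-in ERE tool (an assumption made for portability, not something the examples above use), with anchors added so that the whole word must fit the pattern:

```shell
# {2,3} requires the preceding e to appear two or three times.
printf '%s\n' bt bet beet beeet beeeet | grep -E '^be{2,3}t$'
# prints:
# beet
# beeet
```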
expr1|expr2|...
$ echo "The cat is asleep" | gawk '/cat|dog/{print $0}'
The cat is asleep
$ echo "The dog is asleep" | gawk '/cat|dog/{print $0}'
The dog is asleep
$ echo "The sheep is asleep" | gawk '/cat|dog/{print $0}'
$
$ echo "He has a hat." | gawk '/[ch]at|dog/{print $0}'
He has a hat.
$
$ echo "Sat" | gawk '/Sat(urday)?/{print $0}'
Sat
$ echo "Saturday" | gawk '/Sat(urday)?/{print $0}'
Saturday
$
$ echo "cat" | gawk '/(c|b)a(b|t)/{print $0}'
cat
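Grouping, the question mark, and the pipe symbol can be combined in one pattern. As a hedged illustration, again using `grep -E` as an assumed ERE tool and invented day names as input, the following accepts the short or long form of two weekend days and nothing else:

```shell
# Each alternative is a group with an optional suffix; ^ and $ force a whole-line match.
printf '%s\n' Sat Saturday Sun Sunday Monday Satur | grep -E '^(Sat(urday)?|Sun(day)?)$'
# prints:
# Sat
# Saturday
# Sun
# Sunday
```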
$ cat countfiles
#!/bin/bash
# count the number of files in each directory in your PATH
# replace each colon with a space so the for loop can iterate over the directories
mypath=$(echo "$PATH" | sed 's/:/ /g')
count=0
for directory in $mypath
do
    # relies on word splitting, so file names containing spaces are overcounted
    check=$(ls "$directory")
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    count=0
done
$ ./countfiles
/usr/local/sbin - 0
/usr/local/bin - 2
/usr/sbin - 213
/usr/bin - 1427
/sbin - 186
/bin - 152
/usr/games - 5
/usr/local/games - 0
$
(123)456-7890
(123) 456-7890
123-456-7890
123.456.7890
^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}$
Broken down piece by piece: ^\(? [2-9] [0-9]{2} \)? (| |-|\.) [0-9]{3} ( |-|\.) [0-9]{4}$
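The pattern can be tried against a few candidate numbers before it goes into a script. This is a sketch under two assumptions: it uses `grep -E` rather than gawk, and it writes the optional first separator as `( |-|\.)?` instead of `(| |-|\.)`, which accepts the same strings but avoids an empty alternative that some regex libraries reject. The names `phone_re` and `is_phone` and the sample numbers are hypothetical.

```shell
# Portable spelling of the phone pattern (see the note above about the first separator).
phone_re='^\(?[2-9][0-9]{2}\)?( |-|\.)?[0-9]{3}( |-|\.)[0-9]{4}$'

# Succeeds (exit status 0) when the argument looks like a valid phone number.
is_phone() { printf '%s\n' "$1" | grep -E -q "$phone_re"; }

for n in "(317)555-1234" "(317) 555-1234" "317-555-1234" "317.555.1234" \
         "000-000-0000" "317-555-123"
do
    if is_phone "$n"; then
        echo "valid:   $n"
    else
        echo "invalid: $n"    # area code starts with 0 or 1, or a digit is missing
    fi
done
```

The first four numbers are reported as valid and the last two as invalid.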
$ cat isphone
#!/bin/bash
# script to filter out bad phone numbers
gawk --re-interval '/^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}$/{print $0}'
$
$ echo "317-555-1234" | ./isphone
317-555-1234
$ cat phonelist
000-000-0000
123-456-7890
...
$ cat phonelist | ./isphone
username@hostname
^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Broken down piece by piece: ^([a-zA-Z0-9_\-\.\+]+) @ ([a-zA-Z0-9_\-\.]+) \. ([a-zA-Z]{2,5})$
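A quick way to exercise the email pattern is sketched below. It assumes `grep -E` as the ERE tool and rewrites the bracket expressions without backslashes, placing the hyphen last, because POSIX bracket expressions treat a backslash as a literal character. The names `email_re` and `is_email` and the sample addresses are hypothetical.

```shell
# Same structure as the pattern above: user name, @, server name, dot, 2-5 letter top-level domain.
email_re='^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'

# Succeeds (exit status 0) when the argument fits the pattern.
is_email() { printf '%s\n' "$1" | grep -E -q "$email_re"; }

for addr in "rich@here.now" "rich.blum@here.now" "rich@here.n" \
            "rich@here-now" "rich/blum@here.now"
do
    if is_email "$addr"; then
        echo "valid:   $addr"
    else
        echo "invalid: $addr"
    fi
done
```

The first two addresses are reported as valid; the others fail because of a one-letter top-level domain, a missing dot in the host name, and a slash in the user name, respectively.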