Regular Expression

Regular expressions, also called regex or regexp, are widely used in many programming languages. A regular expression describes a pattern of strings using meta characters, which makes it possible to identify specific data in a data set efficiently.

The grep command uses regex to specify a search pattern. In this section, we'll explain selected regex meta characters using the grep command. The examples use the same data set as the previous section.

^ : begin with

^ anchors the pattern to the beginning of a line, so only lines that begin with the string after ^ match. For example, to search for the lines that begin with "I like a" in the data set prepared in the previous section, run the following command. Enclose the pattern in quotes so that the shell passes it to grep unchanged.

Command Line - INPUT
grep -r '^I like a' test

The command returns a line like the one below. The result doesn't include lines that have "I like a" in the middle of the sentence.

Command Line - RESPONSE
test/sample_1.txt:I like apples and he likes oranges

$ : end with

$ anchors the pattern to the end of a line, so only lines that end with the string before $ match. For example, to search for the lines that end with "s bananas" in the data set prepared in the previous section, run the following command.

Command Line - INPUT
grep -r 's bananas$' test

The command returns a result like the one below. The result doesn't include lines that have "s bananas" in the middle of the sentence.

Command Line - RESPONSE
test/sample_1.txt:I like grapes and he likes bananas
test/sample_1.txt:He likes grapes and she likes bananas
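The two anchors can also be tried without the files from the previous section. The sketch below pipes three made-up lines into grep; the sample lines and the variable names (sample, starts, ends, exact) are assumptions chosen only for illustration. It also shows that ^ and $ can be combined to match a whole line.

```shell
# Three hypothetical sample lines (not the data set from the previous section).
sample='I like apples
You say I like apples
I like apples a lot'

# ^ anchors the pattern to the start of a line.
starts=$(printf '%s\n' "$sample" | grep '^I like apples')

# $ anchors the pattern to the end of a line.
ends=$(printf '%s\n' "$sample" | grep 'I like apples$')

# ^ and $ together match only a line that is exactly the pattern.
exact=$(printf '%s\n' "$sample" | grep '^I like apples$')

printf 'starts with:\n%s\n\n' "$starts"
printf 'ends with:\n%s\n\n' "$ends"
printf 'exact line:\n%s\n' "$exact"
```

Only the first sample line appears in all three results, because it is the only line that both begins and ends with the pattern.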

[ ] : matches anything contained

[ ] matches any single character listed inside the brackets. A hyphen between two characters specifies a range. For example,
[A-Z] : any one uppercase letter
[a-z] : any one lowercase letter
[0-9] : any one digit

For the data set prepared in the previous section, run the command below to see how [ ] works. Without the quotes, the shell may treat [ ] as a file name wildcard before grep sees it.

Command Line - INPUT
grep -r '[AE]S' test

The command returns a result like the one below. The result includes the lines that contain AS or ES.

Command Line - RESPONSE
test/test_sub/sample_2.txt:I like the GRAPES
test/test_sub/sample_2.txt:I like the BANANAS
test/test_sub/sample_2.txt:He likes the GRAPES
test/test_sub/sample_2.txt:He likes the BANANAS
test/test_sub/sample_2.txt:She likes the GRAPES
test/test_sub/sample_2.txt:She likes the BANANAS
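The ranges listed above can be tried in the same way. This is a minimal sketch on made-up input; the sample lines and variable names are assumptions, and LC_ALL=C is set because the meaning of a range such as [A-Z] can depend on the locale.

```shell
# Hypothetical sample lines for illustration only.
sample='room 7
ROOM seven
room seven'

# [0-9] matches any single digit.
with_digit=$(printf '%s\n' "$sample" | LC_ALL=C grep '[0-9]')

# [A-Z] matches any single uppercase letter.
# LC_ALL=C makes the range follow plain ASCII order.
with_upper=$(printf '%s\n' "$sample" | LC_ALL=C grep '[A-Z]')

printf 'contains a digit: %s\n' "$with_digit"
printf 'contains an uppercase letter: %s\n' "$with_upper"
```

The line "room seven" appears in neither result because it contains no digit and no uppercase letter.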

[^ ] : matches anything NOT contained

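[^ ] matches any single character that is NOT listed inside the brackets.

For example, to search for the lines that contain "es" preceded by any character other than k, g, or p, run the following command. Occurrences such as "kes", "ges", and "pes" do not count as matches, although a returned line may still contain them elsewhere.

Command Line - INPUT
grep -r '[^kgp]es' test

The result will be like the one below. Every line is returned because of "les" in "apples".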

Command Line - RESPONSE
test/sample_1.txt:I like apples and he likes oranges
test/sample_1.txt:I like bananas and she likes apples
test/sample_1.txt:He likes apples and she likes oranges
test/sample_1.txt:He likes bananas and I like apples
test/sample_1.txt:She likes apples and I like oranges
test/sample_1.txt:She likes bananas and he likes apples
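The sketch below illustrates two details of [^ ] on made-up input (the sample lines and variable names are assumptions): a line whose only "es" occurrences follow k, g, or p is not returned, and [^ ] still requires one character to be present before "es".

```shell
# Hypothetical sample lines for illustration only.
sample='he likes grapes
she likes apples
es'

# Matches "es" only when the character before it is not k, g, or p.
not_kgp=$(printf '%s\n' "$sample" | grep '[^kgp]es')

printf '%s\n' "$not_kgp"
```

Only "she likes apples" is printed. In the first line every "es" follows k or p, and in the last line there is no character before "es" for [^kgp] to match.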

. (dot) : matches any one character

. (dot) matches any single character, including a space. Run the following command to search for the lines that contain a string of five characters that begins with "l" and ends with "a".

Command Line - INPUT
grep -r 'l...a' test

The result will be like the one below. Each line matches because "apples and" contains "les a".

Command Line - RESPONSE
test/sample_1.txt:I like apples and he likes oranges
test/sample_1.txt:He likes apples and she likes oranges
test/sample_1.txt:She likes apples and I like oranges
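To see exactly which characters a pattern matched, grep can print only the matching part of a line. The sketch below assumes a grep that supports the -o option (GNU and BSD grep do; it is not required by POSIX) and uses one line taken from the response above.

```shell
# One line from the sample response above.
line='I like apples and he likes oranges'

# -o prints only the part of the line that matched the pattern.
matched=$(printf '%s\n' "$line" | grep -o 'l...a')

# Brackets make the space inside the match visible.
printf '[%s]\n' "$matched"
```

The output is [les a]: the three dots matched "e", "s", and the space between "apples" and "and".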

* (asterisk) : matches zero or more occurrences of the preceding character

* (asterisk) matches zero or more occurrences of the character immediately before it. For example, AP*D matches AD, APD, APPD, APPPD, and so on.

To see how * works in the data set prepared, run the following command.

Command Line - INPUT
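grep -r 'AP*L*E' test

The command searches for a string that matches all the criteria below

  • begins with A
  • continues with zero or more P characters, followed by zero or more L characters
  • ends with E

For example, APPLE matches as a whole, and GRAPES matches because it contains APE.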

The command result will be like the one below.

Command Line - RESPONSE
test/test_sub/sample_2.txt:I like the APPLE
test/test_sub/sample_2.txt:I like the GRAPES
test/test_sub/sample_2.txt:He likes the APPLE
test/test_sub/sample_2.txt:He likes the GRAPES
test/test_sub/sample_2.txt:She likes the APPLE
test/test_sub/sample_2.txt:She likes the GRAPES
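The following sketch shows which part of each word the pattern AP*L*E actually matches. The word list and variable names are made up for illustration, and the -o option is assumed to be available (GNU and BSD grep support it).

```shell
# Hypothetical word list for illustration only.
sample='APPLE
GRAPES
ALE
AE
ORANGE'

# -o prints only the matched part of each line.
matched=$(printf '%s\n' "$sample" | grep -o 'AP*L*E')

printf '%s\n' "$matched"
```

The output is APPLE, APE, ALE, and AE, one per line. ORANGE produces nothing because its A is followed by N, which is neither P, L, nor E.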

Note: Asterisk

The meaning of * (asterisk) differs between a regular expression and a wildcard.

In a wildcard, * stands for any string of zero or more characters. It is not related in any way to the character before it. For example, A* matches any name that begins with A.

In a regular expression, * applies to the character before it. For example, A* matches an empty string, A, AA, AAA, and so on.
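The difference can be checked side by side. In this sketch, a shell case statement stands in for wildcard matching and grep for regular expression matching; the three test words and the variable names are assumptions chosen for illustration.

```shell
# Wildcard: A* means "A followed by any string".
glob_hits=''
for word in A AAA APPLE; do
  case "$word" in
    A*) glob_hits="$glob_hits $word" ;;
  esac
done

# Regular expression: ^A*$ means "a line made only of zero or more A characters".
regex_hits=$(printf '%s\n' A AAA APPLE | grep '^A*$')

printf 'wildcard A* matches:%s\n' "$glob_hits"
printf 'regex ^A*$ matches:\n%s\n' "$regex_hits"
```

The wildcard matches all three words, while the regular expression matches only A and AAA, because APPLE contains characters other than A.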