<strong>What are regular expressions?</strong>

- A formal notation for describing text patterns and operations on those patterns


- Implemented as a part of many languages


- Extremely powerful: they can express a wide class of string manipulations

<strong>What are they used for?</strong>

- Search and replace
- Reformatting one file format or layout into another
- Confirming that strings have a required format
- Parsing strings into tokens (for example, extracting the username and hostname from an email address)
- And many more

<strong>Examples of patterns (the highlighted text is the part matched by the pattern):</strong>

- Match the string "Hi, <strong>Hello</strong> World"

`Hello`


- Match the string "<strong>Hello</strong>" on a line by itself 

`^Hello$`


- Match a number in a string "This is number <strong>2</strong>" 

`[0-9]`
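
Patterns like the ones above can be tried out quickly with Python's standard `re` module, which is covered later in this notebook. The following is only a minimal sketch; the variable names are illustrative. Note that matching is case-sensitive by default, so a lowercase `hello` would not match `Hello`.

```python
import re

# "Hello" anywhere inside a longer string
m1 = re.search(r"Hello", "Hi, Hello World")
print(m1.group())

# "Hello" on a line by itself: ^ anchors the start, $ anchors the end
m2 = re.search(r"^Hello$", "Hello")
m3 = re.search(r"^Hello$", "Hi, Hello World")
print(m2 is not None, m3 is None)

# A single digit somewhere in the string
m4 = re.search(r"[0-9]", "This is number 2")
print(m4.group())

# Case matters: the lowercase pattern does not match
m5 = re.search(r"hello", "Hi, Hello World")
print(m5)
```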


<strong>How can we apply regular expressions?</strong>

- `find`
- `grep`
- `sed`
- `awk`


### Globally Finding Filenames with Patterns (`find`)

The `DATA` directory has files with the extensions `.DAT` and `.dat`.

How can we show only `.dat` files?


In [24]:
# Let's check what is in DATA
!ls -F DATA/


Icon?                 embryodis_c12_n1.dat  embryodis_c17_n1.dat
SIZE-1440/            embryodis_c13_n1.dat  embryodis_c18_n1.dat
avg_embryo_dist.DAT   embryodis_c14_n1.dat  embryodis_c19_n1.dat
embryodis_c10_n1.DAT  embryodis_c15_n1.dat
embryodis_c11_n1.dat  embryodis_c16_n1.dat


In [25]:
# At the command line -- use the wildcard (*)
# List all files beginning with zero or more 
# of any character, followed by the .dat string.

! ls DATA/*.dat

DATA/embryodis_c11_n1.dat DATA/embryodis_c14_n1.dat DATA/embryodis_c17_n1.dat
DATA/embryodis_c12_n1.dat DATA/embryodis_c15_n1.dat DATA/embryodis_c18_n1.dat
DATA/embryodis_c13_n1.dat DATA/embryodis_c16_n1.dat DATA/embryodis_c19_n1.dat


In [26]:
# For .DAT files
! ls DATA/*.DAT

DATA/avg_embryo_dist.DAT  DATA/embryodis_c10_n1.DAT


<strong>NOTE</strong>:
`ls` does not show what is in the subdirectories

### Globally Finding Filenames with Patterns (`find`)
The `find` command can recursively search many directories at once. One option is to use it with regular expressions. The syntax is:
```
find [path] -regex "<expression>"
```
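With this syntax, the top level of the search will be the indicated path. `find` will begin at that location and recursively test filenames against the regular expression (`-regex`). The expression for which it will seek matches is provided between the double quotes. Note that `-regex` has to match the whole path, not just a part of it, which is why the expressions below begin with `.*`.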

In [33]:
# find .dat files
! find DATA -regex ".*\.dat"
# Note that it shows all the files in DATA and 
# its subdirectory SIZE-1440 

DATA/embryodis_c11_n1.dat
DATA/embryodis_c12_n1.dat
DATA/embryodis_c13_n1.dat
DATA/embryodis_c14_n1.dat
DATA/embryodis_c15_n1.dat
DATA/embryodis_c16_n1.dat
DATA/embryodis_c17_n1.dat
DATA/embryodis_c18_n1.dat
DATA/embryodis_c19_n1.dat
DATA/SIZE-1440/embryodis_n1.dat
DATA/SIZE-1440/embryodis_n10.dat
DATA/SIZE-1440/embryodis_n11.dat
DATA/SIZE-1440/embryodis_n12.dat
DATA/SIZE-1440/embryodis_n13.dat
DATA/SIZE-1440/embryodis_n14.dat
DATA/SIZE-1440/embryodis_n15.dat
DATA/SIZE-1440/embryodis_n16.dat
DATA/SIZE-1440/embryodis_n17.dat
DATA/SIZE-1440/embryodis_n18.dat
DATA/SIZE-1440/embryodis_n19.dat


<strong>NOTE</strong>: While the wildcard is available on the command line, it doesn’t mean the same thing on the command line that it does in proper regular expression syntax. On the command line, .* means “one dot (.), then zero or more of any character.” In a regex, it means “zero or more of any character (.).”

The dot character (`.`) is a metacharacter in proper regular expressions. For this reason, the backslash is used before the real dot in “.dat” to indicate it should be taken literally. 

- (`.`) Match any single character
- (`*`) Match zero or more of the preceding element
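
To make the difference between the shell wildcard and the regular expression concrete, here is a small Python sketch that uses the standard `fnmatch` module for shell-style patterns and `re` for regular expressions. The file list is partly hypothetical (`DATA/notes.data` is invented to show that both patterns must match to the very end of the name).

```python
import fnmatch
import re

names = [
    "DATA/embryodis_c11_n1.dat",
    "DATA/avg_embryo_dist.DAT",
    "DATA/notes.data",            # hypothetical file, for illustration only
]

# Shell-style wildcard: * on its own means "zero or more of any character".
# fnmatchcase() keeps the comparison case-sensitive on every platform.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.dat")]

# Regex: . is "any character", * repeats it, and \. is a literal dot.
# fullmatch() requires the whole path to match, as find -regex does.
regex_hits = [n for n in names if re.fullmatch(r".*\.dat", n)]

print(glob_hits)
print(regex_hits)
```

Both lists contain only `DATA/embryodis_c11_n1.dat`: the two notations describe the same set of names with different syntax.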

In [34]:
# find .DAT files
! find DATA -regex ".*\.DAT"
# Note that it shows all the files in DATA and 
# its subdirectory SIZE-1440 

DATA/avg_embryo_dist.DAT
DATA/embryodis_c10_n1.DAT
DATA/SIZE-1440/embryodis_n10 (1).DAT


In [1]:
# find specific .dat files (not all)
# Say you want to show the files that start with embryodis_n 

! find DATA -regex ".*embryodis_n.*\.dat"

DATA/SIZE-1440/embryodis_n1.dat
DATA/SIZE-1440/embryodis_n10.dat
DATA/SIZE-1440/embryodis_n11.dat
DATA/SIZE-1440/embryodis_n12.dat
DATA/SIZE-1440/embryodis_n13.dat
DATA/SIZE-1440/embryodis_n14.dat
DATA/SIZE-1440/embryodis_n15.dat
DATA/SIZE-1440/embryodis_n16.dat
DATA/SIZE-1440/embryodis_n17.dat
DATA/SIZE-1440/embryodis_n18.dat
DATA/SIZE-1440/embryodis_n19.dat


In [3]:
# find all files that start with embryodis_n and end with either .dat or .DAT

! find DATA -regex ".*embryodis_n.*\.[Dd][Aa][Tt]"

# [Dd] means either D or d

DATA/SIZE-1440/embryodis_n1.dat
DATA/SIZE-1440/embryodis_n10.DAT
DATA/SIZE-1440/embryodis_n11.dat
DATA/SIZE-1440/embryodis_n12.dat
DATA/SIZE-1440/embryodis_n13.dat
DATA/SIZE-1440/embryodis_n14.dat
DATA/SIZE-1440/embryodis_n15.dat
DATA/SIZE-1440/embryodis_n16.dat
DATA/SIZE-1440/embryodis_n17.dat
DATA/SIZE-1440/embryodis_n18.dat
DATA/SIZE-1440/embryodis_n19.dat


In [8]:
# find all files that start with embryodis_n and end with either .dat or .DAT
# and whose number in the filename is in the range 13-17
# Note that the first digit (1) in 13 and 17 is common 

! find DATA -regex ".*embryodis_n1[3-7]\.[Dd][Aa][Tt]"

# [3-7] means the range from 3 to 7

DATA/SIZE-1440/embryodis_n13.dat
DATA/SIZE-1440/embryodis_n14.dat
DATA/SIZE-1440/embryodis_n15.dat
DATA/SIZE-1440/embryodis_n16.dat
DATA/SIZE-1440/embryodis_n17.dat


In [95]:
# find all files that start with embryodis_n and end with either .dat or .DAT
# and whose number in the filename is either 13 or 17
# Note that the first digit (1) in 13 and 17 is common 
# [37] means either 3 or 7 (a comma inside the brackets would itself be matched)

!find . -regex ".*embryodis_n1[37]\.[Dd][Aa][Tt]"

./DATA/SIZE-1440/embryodis_n13.dat
./DATA/SIZE-1440/embryodis_n17.dat
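
The bracket expressions used above behave the same way in Python's `re` module, which makes them easy to experiment with. The sketch below assumes a hypothetical list of file names modeled on the listings above; it is not read from the `DATA` directory.

```python
import re

# Hypothetical file names modeled on the listings above
names = [f"embryodis_n{i}.dat" for i in range(10, 20)] + ["embryodis_n13.DAT"]

in_range = re.compile(r"embryodis_n1[3-7]\.[Dd][Aa][Tt]")  # n13 through n17
either = re.compile(r"embryodis_n1[37]\.[Dd][Aa][Tt]")     # n13 or n17 only

range_hits = [n for n in names if in_range.fullmatch(n)]
either_hits = [n for n in names if either.fullmatch(n)]
print(range_hits)
print(either_hits)

# Inside brackets a comma is just another character to match,
# so [3,7] accepts "3", "7", and also ","
comma_hit = re.fullmatch(r"n1[3,7]", "n1,") is not None
print(comma_hit)
```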


### `grep`, `sed`, and `awk`

`grep`, `sed`, and `awk` are a family of tools that use regular expressions and are available on the command line. They each have different capabilities:

- The `grep` command has the basic syntax `grep pattern file`. `grep` grabs matched patterns and prints them.
- The `sed` command has the basic syntax `sed "s/pattern/replace/" file`. `sed` combines `grep` with a substitution command.
- The `awk` command has the basic syntax `awk pattern [action]`. `awk` handles columns.

#### Finding Patterns in Files (`grep`)

`grep` searches globally for regular expressions inside files, based on their content. For example, you could use it to see which of the ipynb files mention `sed`.

## Example

In [24]:
! cat DATA/regexp.dat

Hi

Hello World

May 16, 2017

20:36:17


In [25]:
! grep "Hi" DATA/regexp.dat

Hi


In [26]:
! grep "Hello" DATA/regexp.dat

Hello World


In [27]:
# Use a regular expression pattern
# grep the lines that contain a digit
! grep ".*[0-9].*" DATA/regexp.dat

May 16, 2017
20:36:17


In [28]:
# Use a regular expression pattern
# grep the lines that start with a digit 
! grep "^[0-9].*" DATA/regexp.dat

20:36:17


## Another Example

In [38]:
! cat DATA/phone.txt

390/234/3128 Tel
343-344-2425
342.234.4543
phone: 442.234.4543


In [39]:
# grep the lines that contain the character (.) 
! grep "\." DATA/phone.txt

342.234.4543
phone: 442.234.4543


In [40]:
# grep the lines that contain lowercase letters
! grep "[a-z]" DATA/phone.txt

390/234/3128 Tel
phone: 442.234.4543


In [42]:
# grep the lines that start with a lowercase letter
! grep "^[a-z]" DATA/phone.txt

phone: 442.234.4543


In [43]:
# grep the lines that end with a lowercase letter
! grep "[a-z]$" DATA/phone.txt

390/234/3128 Tel
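
The same filtering can be sketched in a few lines of Python, which may help clarify what `grep` is doing with each pattern. The `grep()` helper below is hypothetical (it is not part of any library), and the lines are copied from the `DATA/phone.txt` listing above instead of being read from disk.

```python
import re

# The four lines of DATA/phone.txt, copied from the listing above
lines = [
    "390/234/3128 Tel",
    "343-344-2425",
    "342.234.4543",
    "phone: 442.234.4543",
]

def grep(pattern, lines):
    """Return the lines in which the pattern occurs anywhere (hypothetical helper)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

with_dot = grep(r"\.", lines)          # lines containing a literal dot
starts_lower = grep(r"^[a-z]", lines)  # lines starting with a lowercase letter
ends_lower = grep(r"[a-z]$", lines)    # lines ending with a lowercase letter

print(with_dot)
print(starts_lower)
print(ends_lower)
```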


### Finding and Replacing a Complex Pattern

Sometimes you’ll need to reuse part of the pattern you matched, so `sed` has syntax to hold the match in memory. It uses escaped parentheses, and the remembered text can be recalled in the replacement as `\1`, `\2`, and so on. Specifically, the following syntax matches x and remembers the match:

```
\(x\)
```


## Example :

In [49]:
# Replace Hi with HI in DATA/regexp.dat file
# syntax is "s/pattern/replace/g"

! sed "s/Hi/HI/g" DATA/regexp.dat

# Note that sed does not change the original file 
# unless you ask it to with the -i flag

HI

Hello World

May 16, 2017

20:36:17


In [54]:
# Prefix the time on the last line of DATA/regexp.dat with "Date: "
# syntax is "s/pattern/replace/g"

! sed "s/\([0-9][0-9]:[0-9][0-9]:[0-9][0-9]\)/Date: \1/g" DATA/regexp.dat


Hi

Hello World

May 16, 2017

Date: 20:36:17


In [55]:
# OR

! sed "s/\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\)/Date: \1/g" DATA/regexp.dat


Hi

Hello World

May 16, 2017

Date: 20:36:17


In [89]:
# Replace May 16, 2017 by
# Day: 16 Month: May Year: 2017

! sed "s/\([a-zA-Z]\{3\}\) \([0-9]\{2\}\).*\([0-9]\{4\}\)/Day: \2 Month: \1 Year: \3/g" DATA/regexp.dat


Hi

Hello World

Day: 16 Month: May Year: 2017

20:36:17


## Another Example (DATA/dates.dat): 

In [90]:
# cat the file DATA/dates.dat
! cat DATA/dates.dat
# Note that the date format is different from line to line

2014-05-01
2014-09-10
2015-10-30
2014.06.24
2014/09/23
2010/12/29
2009/10/05


In [92]:
# We want every date in year-month-day format.
# First we match the year, month, and day of each date,
# then we rejoin them with hyphens.
# syntax is "s/pattern/replace/g"

! sed "s/\(20[01][0-9]\).*\([0-9][0-9]\).*\([0-9][0-9]\)/\1-\2-\3/g" DATA/dates.dat

2014-05-01
2014-09-10
2015-10-30
2014-06-24
2014-09-23
2010-12-29
2009-10-05
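
The `.*` between the groups matches any run of characters, which works for this file but is looser than it needs to be. A tighter variant, sketched here with Python's `re` module, allows only one of the three separators seen in the file (`-`, `.`, or `/`). The sample dates are copied from the listing above; the variable names are illustrative.

```python
import re

# A few of the dates from DATA/dates.dat, copied from the listing above
dates = ["2014-05-01", "2014.06.24", "2014/09/23", "2009/10/05"]

# Three capture groups separated by exactly one of - . /
pattern = re.compile(r"(20[01][0-9])[-./]([0-9]{2})[-./]([0-9]{2})")

# \1, \2, \3 refer back to the captured year, month, and day
normalized = [pattern.sub(r"\1-\2-\3", d) for d in dates]
print(normalized)
```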


### sed Extras
With `sed`, we can use the `d` command to delete all blank lines in the file:

## Example ( DATA/regexp.dat file ):

In [96]:
! cat DATA/regexp.dat

Hi

Hello World

May 16, 2017

20:36:17


In [99]:
# delete empty lines
! sed '/^$/d' DATA/regexp.dat

Hi
Hello World
May 16, 2017
20:36:17


In [102]:
# delete empty lines and replace Hello with HELLO 
# use -e flag
! sed -e '/^$/d' -e 's/Hello/HELLO/' DATA/regexp.dat

Hi
HELLO World
May 16, 2017
20:36:17


## Another Example ( DATA/phone.txt ):

In [103]:
# cat DATA/phone.txt
! cat DATA/phone.txt

390/234/3128 Tel
343-344-2425
342.234.4543
phone: 442.234.4543


In [106]:
# reformat to +1(???)???-????
# the + and the parentheses in the replacement are literal and need no backslash
! sed 's/.*\([0-9]\{3\}\).*\([0-9]\{3\}\).*\([0-9]\{4\}\).*/+1(\1)\2-\3/' DATA/phone.txt

+1(390)234-3128
+1(343)344-2425
+1(342)234-4543
+1(442)234-4543


### Manipulating Columns of Data (awk)

A lot of data in physics begins in a simple format: columns of numbers in plain-text documents. Fortunately for us, a command-line tool called `awk` was invented long ago to quickly and efficiently sort, modify, and evaluate such files. This tool, a sibling to sed and grep, uses regular expressions to get the job done.

As an introductory example, we can investigate the files in the filesystem. On a Linux platform, a list of colors available to the system is found in the /usr/share/X11 directory. On a Unix (Mac OS X) platform, it is made available in /usr/X11/share/X11.

In [111]:
# The file is very long, so we will show only the first 10 lines
!cat /usr/X11/share/X11/rgb.txt | head

255 250 250		snow
248 248 255		ghost white
248 248 255		GhostWhite
245 245 245		white smoke
245 245 245		WhiteSmoke
220 220 220		gainsboro
255 250 240		floral white
255 250 240		FloralWhite
253 245 230		old lace
253 245 230		OldLace


In [113]:
# grep OldLace color
# syntax: awk '/pattern/' file
!awk '/OldLace/' /usr/X11/share/X11/rgb.txt

253 245 230		OldLace


In [115]:
# grep the number set 144
!awk '/144/' /usr/X11/share/X11/rgb.txt
# Note that 144 here could be in the beginning, middle, or end.

112 128 144		slate gray
112 128 144		SlateGray
112 128 144		slate grey
112 128 144		SlateGrey
 30 144 255		dodger blue
 30 144 255		DodgerBlue
208  32 144		violet red
208  32 144		VioletRed
 30 144 255		DodgerBlue1
144 238 144		PaleGreen2
205  96 144		HotPink3
205  41 144		maroon3
144 238 144		light green
144 238 144		LightGreen


In [116]:
# How about if we want the colors that start with 144
!awk '/^144/' /usr/X11/share/X11/rgb.txt

144 238 144		PaleGreen2
144 238 144		light green
144 238 144		LightGreen


In [120]:
# the middle set is 144: one number, spaces, 144, spaces, another number
!awk '/^ *[0-9]+ +144 +[0-9]+/' /usr/X11/share/X11/rgb.txt

 30 144 255		dodger blue
 30 144 255		DodgerBlue
 30 144 255		DodgerBlue1


In [126]:
# 144 preceded by at least one other set, which here selects
# the lines whose last set is 144
!awk '/^.+ +.+144/' /usr/X11/share/X11/rgb.txt

112 128 144		slate gray
112 128 144		SlateGray
112 128 144		slate grey
112 128 144		SlateGrey
208  32 144		violet red
208  32 144		VioletRed
144 238 144		PaleGreen2
205  96 144		HotPink3
205  41 144		maroon3
144 238 144		light green
144 238 144		LightGreen


### Actions 
In addition to just replicating some of `grep`’s capabilities, `awk` can add an action. However, it can only do actions on a column-wise basis. Note that dollar signs ($) indicate columns:

In [135]:
# print the three number sets for all colors 
#awk '{print $1$2$3}' /usr/X11/share/X11/rgb.txt

In [136]:
# For lines that start with 112, print out the three number sets 
# without spaces
! awk '/^112/{print $1$2$3}' /usr/X11/share/X11/rgb.txt

112128144
112128144
112128144
112128144
112112112
112112112


In [137]:
# For lines that start with 112, print out the three number sets 
# with spaces (use ,)
! awk '/^112/{print $1, $2, $3}' /usr/X11/share/X11/rgb.txt


112 128 144
112 128 144
112 128 144
112 128 144
112 112 112
112 112 112


In [138]:
# Or we can modify just one line:
# "\t" is for tab
!awk NR==11'{print $1 $2 $3,"\t",$4}' /usr/X11/share/X11/rgb.txt

250240230 	 linen


In [139]:
# And finally, we can do math with awk:
!awk NR==11'{print $1,"+",$2,"+",$3,"=",$1+$2+$3}' /usr/X11/share/X11/rgb.txt

250 + 240 + 230 = 720
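
For comparison, the column handling that `awk` performs can be sketched in plain Python. The rows below are modeled on the `rgb.txt` listings above and are hard-coded, so the sketch does not depend on where the file lives on a given system.

```python
# Sample rows modeled on the rgb.txt listings above
rows = [
    "112 128 144\t\tslate gray",
    " 30 144 255\t\tdodger blue",
    "250 240 230\t\tlinen",
]

totals = []
for row in rows:
    # Like awk, split() with no argument breaks on runs of spaces and tabs
    fields = row.split()
    r, g, b = (int(x) for x in fields[:3])   # the equivalents of $1, $2, $3
    name = " ".join(fields[3:])              # the remaining columns
    totals.append((name, r + g + b))
    print(r, "+", g, "+", b, "=", r + g + b, name)
```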


## Python Regular Expressions
Everything we’ve seen how to do so far in this chapter is also possible in Python. Alternatives to all of these tools exist in the Python regular expression module `re`, which comes as part of the Python standard library. The `re` module allows Python-flavored regular expression pattern matching.

`grep`’s capabilities can be replaced with:

- `re.match(pattern, string)` to match a regular expression pattern to the beginning of a string
- `re.search(pattern, string)` to search a string for the presence of a pattern
- `re.findall(pattern, string)` to find all occurrences of a pattern in a string

Similarly, the capabilities of sed can be replaced with:

- `re.sub(pattern, replacement, string)` to substitute all occurrences of a pattern found in a string
- `re.subn(pattern, replacement, string)` to substitute all occurrences of a pattern found in a string and return the number of substitutions made
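
As a rough illustration of how the `sed` examples from earlier in this notebook might translate, the sketch below applies `re.sub()` and `re.subn()` to a string that mirrors the contents of `DATA/regexp.dat`. The text is hard-coded here, not read from the file.

```python
import re

# Mirrors the contents of DATA/regexp.dat shown earlier
text = "Hi\n\nHello World\n\nMay 16, 2017\n\n20:36:17\n"

# Like sed "s/Hi/HI/g": replace every occurrence
upper = re.sub(r"Hi", "HI", text)

# subn() also reports how many replacements were made;
# \g<0> in the replacement stands for the whole matched text
stamped, count = re.subn(r"[0-9]{2}:[0-9]{2}:[0-9]{2}", r"Date: \g<0>", text)

# Like sed '/^$/d': with re.MULTILINE, ^ matches at the start of every line,
# so "^\n" matches exactly the lines that are empty
compact = re.sub(r"^\n", "", text, flags=re.MULTILINE)

print(upper)
print(count)
print(compact)
```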

The `re` module provides a few more powerful utilities as well.

- `re.split(pattern, string)` splits a string by the occurrences of a pattern.
- `re.finditer(pattern, string)` returns an iterator yielding a match object for each match.
- `re.compile(pattern)` precompiles a regex for faster matches.
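
A short sketch of these three utilities follows, using one line modeled on the `rgb.txt` listing and the time string from `DATA/regexp.dat`. The variable names are illustrative.

```python
import re

line = "253 245 230\t\told lace"

# split(): break the line on runs of whitespace, at most 3 times,
# so the color name keeps its internal space
parts = re.split(r"\s+", line, maxsplit=3)
print(parts)

# finditer(): one match object per run of digits
numbers = [int(m.group()) for m in re.finditer(r"[0-9]+", line)]
print(numbers)

# compile(): build the pattern once, then reuse its methods
digits = re.compile(r"[0-9]+")
time_fields = digits.findall("20:36:17")
print(time_fields)
```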

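For `re.match()` and `re.search()`, if a match to a regular expression is not found, then `None` is returned. If a match is found, then a special `MatchObject` is returned. The other functions return strings, lists, or iterators instead; for example, `re.findall()` returns a list, which is empty when nothing matches.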

`MatchObjects` have methods and attributes that allow you to determine the position in the string of the match, the original regular expression pattern, and the values captured by any parentheses with the `MatchObject.groups()` method.
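
The following sketch shows the most commonly used parts of a match object. The input string `"Time 20:36:17"` is made up for illustration.

```python
import re

m = re.search(r"([0-9]{2}):([0-9]{2}):([0-9]{2})", "Time 20:36:17")

print(m.start(), m.end())   # indices where the match begins and ends
print(m.span())             # the same two numbers as a tuple
print(m.group(0))           # the whole matched text
print(m.groups())           # the text captured by each pair of parentheses
print(m.re.pattern)         # the original pattern string
```

Here the match starts at index 5, just after `"Time "`, and `groups()` returns `('20', '36', '17')`.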

Let’s try to match a date regular expression to some actual dates:

In [140]:
# First, import the regular expression module.
import re

In [142]:
# The string matches the pattern, so a match is returned.
re.match("20[01][0-9].*[0-9][0-9].*[0-9][0-9]", '2015-12-16')

<_sre.SRE_Match object; span=(0, 10), match='2015-12-16'>

In [143]:
# Assign the match to a variable name for later use
m = re.match("20[01][0-9].*[0-9][0-9].*[0-9][0-9]", '2015-12-16')
print(m)

<_sre.SRE_Match object; span=(0, 10), match='2015-12-16'>


In [144]:
# Find the index in the string of the start of the match.
m.start()

0

In [146]:
# Try to match the date pattern against something that is not a date.
m = re.match("20[01][0-9].*[0-9][0-9].*[0-9][0-9]", 'not-a-date')

In [147]:
# Note how None is returned when the match fails.
m is None

True

### The compile() method

To speed up matching many strings against a common pattern, it is often a good
idea to compile() the pattern first. Compiling can take much longer than a single
match, so it pays to do it once and reuse the result. Once you have a compiled
pattern, all of the same functions are available as methods of the pattern. Since
the pattern is already known, you don’t need to pass it in when you call match()
or search() or the other methods. Let’s compile a version of the date regular
expression that has capturing parentheses around the actual date values:

In [149]:
# Compile the regular expression and store it as the re_date variable.
re_date = re.compile("(20[01][0-9]).*([0-9][0-9]).*([0-9][0-9])")

In [150]:
# Use this variable to match against a string.
re_date.match('2014-28-01')

<_sre.SRE_Match object; span=(0, 10), match='2014-28-01'>

In [151]:
# Assign the match to a variable m for later use.
m = re_date.match('2014-28-01')

In [152]:
# Since the regular expression uses capturing parentheses, you can obtain the values
# within them using the groups() method. A tuple that has the same length as
# the number of capturing parentheses is returned.
m.groups()

('2014', '28', '01')
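
To show the compiled pattern being reused, the sketch below runs it over several strings and converts the captured groups to integers. The list of candidate strings is made up for illustration. Note that the pattern only checks the shape of the text: `'2014-28-01'` is accepted even though 28 is not a valid month.

```python
import re

# The same compiled pattern as above
re_date = re.compile("(20[01][0-9]).*([0-9][0-9]).*([0-9][0-9])")

# Illustrative inputs, including one string that is not a date
candidates = ["2014-05-01", "2014/09/23", "not-a-date", "2014-28-01"]

parsed = []
for text in candidates:
    m = re_date.match(text)
    if m is None:
        continue                     # no match: skip this string
    year, month, day = (int(g) for g in m.groups())
    parsed.append((year, month, day))

print(parsed)
```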