How to use Regular expressions in Python
This tutorial discusses how to use regular expressions in Python with simple programming examples.
To find patterns and extract parts of lines, we can use string methods such as split() and find(), together with lists and string slicing, to pull out a substring.
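As a minimal sketch of the string-method approach described above, the following combines find(), slicing, and split() to pull the host name out of an address line. The sample line and the variable names are illustrative assumptions, not part of the original examples.

```python
# Sample line (an illustrative assumption, in the style of a mail header)
line = 'From: xyz@abc.com'

# find() returns the index of '@'; slicing takes everything after it
at_pos = line.find('@')
host = line[at_pos + 1:]

# split() breaks the line into whitespace-separated words
words = line.split()

print(host)   # abc.com
print(words)  # ['From:', 'xyz@abc.com']
```

This works for a fixed layout, but every new layout needs new index arithmetic, which is where regular expressions help.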
# find function on strings
str1 = 'I am an Engg student'
print(str1.find('an'))
Output
5
The pattern ‘an’ is present in the string, so find() returns 5, the index where ‘an’ begins. In the next example, we search for ‘AN’, which is not present in the string, so the output is -1.
# find function on strings
str1 = 'I am an Engg student'
print(str1.find('AN'))
Output
-1
The same can be done more flexibly with regular expressions, which Python provides through the built-in re module.
Regular expressions are used in almost every programming language for searching and parsing strings.
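For comparison with find(), here is a small sketch of the same two lookups done with re.search(). The re.IGNORECASE flag is an addition not used elsewhere in this tutorial; it is shown only to illustrate one thing find() cannot do directly.

```python
import re

str1 = 'I am an Engg student'

# re.search() returns a match object, or None when nothing matches
match = re.search('an', str1)
position = match.start() if match else -1
print(position)  # 5

# Matching is case-sensitive by default, just like find()
print(re.search('AN', str1))  # None

# The re.IGNORECASE flag makes the search case-insensitive
ignore_case = re.search('AN', str1, re.IGNORECASE)
print(ignore_case.start())  # 5
```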
Special characters and character sequences:
^ – Matches the beginning of the line.
$ – Matches the end of the line.
. – Matches any single character except a newline (a wildcard).
\s – Matches a whitespace character.
\S – Matches a non-whitespace character (opposite of \s).
‘*’ – Applies to the immediately preceding character and matches zero or more occurrences of it.
‘+’ – Applies to the immediately preceding character and matches one or more occurrences of it.
[aeiou] – Matches a single character as long as that character is in the specified set.
[a-z0-9] – You can specify ranges of characters using a hyphen.
[^A-Za-z] – When the first character in the set notation is a caret, it inverts the logic, so the set matches any character that is not listed.
( ) – When parentheses are added to a regular expression, they are ignored for the purpose of matching but allow you to extract a particular subset of the matched string rather than the whole string when using findall().
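The constructs listed above can be tried out one at a time with re.findall(). This is only a sketch: the sample strings are hypothetical and chosen so that each pattern has an easy-to-check result.

```python
import re

# Hypothetical sample text, chosen to exercise each construct
text = 'From: xyz@abc.com 42'

print(re.findall('^F..m', text))        # ['From']  ^ anchors, . is a wildcard
print(re.findall('42$', text))          # ['42']    $ anchors to the end
print(re.findall('[0-9]+', text))       # ['42']    one or more digits
print(re.findall(r'\S+', text))         # ['From:', 'xyz@abc.com', '42']
print(re.findall('[aeiou]', 'regex'))   # ['e', 'e']
print(re.findall('[^A-Za-z]', 'a1b2'))  # ['1', '2']  inverted set
print(re.findall(r'(\S+)@', text))      # ['xyz']   only the parenthesised part
print(re.findall('ab*', 'a ab abbb'))   # ['a', 'ab', 'abbb']  zero or more b
print(re.findall('ab+', 'a ab abbb'))   # ['ab', 'abbb']       one or more b
```

Patterns containing a backslash are written as raw strings (r'...') so that Python passes the backslash through to the re module unchanged.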
The following programs demonstrate how to use regular expressions in Python
Search for a pattern using the search() function
# Find pattern Engg
import re
line = 'I am an Engg student'
if re.search('Engg', line):
    print(line)
else:
    print("Pattern not found")
print('FINISH')
Output
I am an Engg student
FINISH
# Find pattern Engg at the end of the string
import re
line = 'I am an Engg student'
if re.search('Engg$', line):
    print(line)
else:
    print("Pattern not found")
print('FINISH')
Output
Pattern not found
FINISH
# Find pattern Engg at the beginning of the string
import re
line = 'I am an Engg student'
if re.search('^Engg', line):
    print(line)
else:
    print("Pattern not found")
print('FINISH')
Output
Pattern not found
FINISH
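Both anchored examples above fail to match. As a complementary sketch, the same line does match when the anchored pattern really sits at the start or the end of the string. The variable names here are illustrative.

```python
import re

line = 'I am an Engg student'

# '^' anchors the pattern to the start of the string
starts_with_i_am = re.search('^I am', line) is not None

# '$' anchors the pattern to the end of the string
ends_with_student = re.search('student$', line) is not None

print(starts_with_i_am, ends_with_student)  # True True
```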
Contents of sample input file test1.txt
Name: Rahul
From: Karanataka
From: <xyz@abc.com>
To: pqr@abc.com
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
Mail From: <XYZ123@ABC.COM>
To: PQR123@ABC.COM
X-DSPAM-Confidence: 0.9125
X-DSPAM-Probability: 0.0080
# Search for lines that contain 'From'
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
Output
From: Karanataka
From: <xyz@abc.com>
Mail From: <XYZ123@ABC.COM>
# Search for lines that start with 'From'
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)
Output
From: Karanataka
From: <xyz@abc.com>
# Character matching in regular expressions
# Search for lines that start with 'F',
# followed by 2 characters, followed by 'm:'
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)
Output
From: Karanataka
From: <xyz@abc.com>
# Search for lines that start with From and have an at sign
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From.+@.+', line):
        print(line)
Output
From: <xyz@abc.com>
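To experiment with these search patterns without creating a file, the lines can be kept in a list. The sketch below uses a hypothetical helper, matching_lines(), and a few hypothetical lines in the style of test1.txt; neither is part of the original programs.

```python
import re

def matching_lines(pattern, lines):
    """Return the lines in which re.search() finds the pattern."""
    return [line for line in lines if re.search(pattern, line)]

# Hypothetical lines in the style of test1.txt
lines = [
    'Name: Rahul',
    'From: Karanataka',
    'From: xyz@abc.com',
    'Mail From: XYZ123@ABC.COM',
]

print(matching_lines('From:', lines))       # contains 'From:'
print(matching_lines('^From:', lines))      # starts with 'From:'
print(matching_lines('^F..m:', lines))      # 'F', any two characters, 'm:'
print(matching_lines('^From.+@.+', lines))  # starts with 'From' and has an '@'
```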
Extracting data using regular expressions
Extracting data using regular expressions – extract email address
Here the regular expression \S+@\S+ matches one or more non-whitespace characters, followed by the @ symbol, followed by one or more non-whitespace characters.
#Example-1
import re
s = 'A message from <csev@abc.com> to cse@xyz.com'
lst = re.findall(r'\S+@\S+', s)
print(lst)
Output
['<csev@abc.com>', 'cse@xyz.com']
In the output, the first email address has extra < and > symbols. To remove them, we need a regular expression that matches only letters and digits.
In the next example, the regular expression [a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+ matches one or more letters or digits ([a-zA-Z0-9]+), followed by the @ symbol, followed by one or more letters or digits, followed by a literal dot (\.) and again one or more letters or digits.
#Example-2
import re
s = 'A message from <csev@abc.com> to cse@xyz.com'
lst = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+', s)
print(lst)
Output
['csev@abc.com', 'cse@xyz.com']
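The backslash before the dot matters. The following sketch uses a hypothetical string with no dot after the @ sign to show the difference between an unescaped dot (any character) and an escaped one (a literal dot).

```python
import re

# Hypothetical string with no dot after the '@'
s = 'user@hostXcom'

# Unescaped '.' matches any character, so this still finds a "match"
loose = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+.[a-zA-Z0-9]+', s)
print(loose)   # ['user@hostXcom']

# Escaped '\.' requires a literal dot, so nothing is found
strict = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+', s)
print(strict)  # []
```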
Now the output is correct. The same can be obtained using a shorthand character class: \w matches letters, digits, and the underscore, so it is roughly a shortcut for [a-zA-Z0-9_]. Example 3 shows this.
#Example-3
import re
s = 'A message from <csev@abc.com> to cse@xyz.com'
lst = re.findall(r'\w+@\w+\.\w+', s)
print(lst)
Output
['csev@abc.com', 'cse@xyz.com']
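One difference between \w and [a-zA-Z0-9] is worth checking: \w also matches the underscore. A small sketch with a hypothetical string:

```python
import re

# Hypothetical string containing an underscore and a hyphen
s = 'user_1 a-b'

# \w treats the underscore as a word character
print(re.findall(r'\w+', s))          # ['user_1', 'a', 'b']

# The explicit set does not include the underscore
print(re.findall('[a-zA-Z0-9]+', s))  # ['user', '1', 'a', 'b']
```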
Python program to read the content of a file and extract an email address.
# Search for lines that have an at sign between characters
# The characters must be a letter or number
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\w+@\w+\.\w+', line)
    if len(x) > 0:
        print(x)
Output
['xyz@abc.com']
['pqr@abc.com']
['XYZ123@ABC.COM']
['PQR123@ABC.COM']
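The file-reading programs in this tutorial leave the file open. A possible variation, sketched below, uses a with block so the file is closed automatically. To keep the sketch self-contained it writes a small temporary file first; the file name sample.txt and its contents are assumptions made for illustration.

```python
import os
import re
import tempfile

# Hypothetical file contents in the style of test1.txt
sample = 'From: xyz@abc.com\nTo: pqr@abc.com\nX-DSPAM-Confidence: 0.8475\n'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.txt')
    with open(path, 'w') as out:
        out.write(sample)

    # The with block closes the file even if an error occurs
    emails = []
    with open(path) as hand:
        for line in hand:
            emails.extend(re.findall(r'\w+@\w+\.\w+', line))

print(emails)  # ['xyz@abc.com', 'pqr@abc.com']
```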
Python Programming examples to demonstrate Regular expressions in Python
Combining searching and extracting
Search for lines that start with ‘X’, followed by any non-whitespace characters and ‘:’, then a space and a decimal number.
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: [0-9]+\.[0-9]+', line)
    if len(x) > 0:
        print(x)
Output
['X-DSPAM-Confidence: 0.8475']
['X-DSPAM-Probability: 0.0000']
['X-DSPAM-Confidence: 0.9125']
['X-DSPAM-Probability: 0.0080']
Search for the same lines as before, but put parentheses around the number so that findall() returns only the number instead of the whole match.
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9]+\.[0-9]+)', line)
    if len(x) > 0:
        print(x)
Output
['0.8475']
['0.0000']
['0.9125']
['0.0080']
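findall() returns the numbers as strings. If only values greater than zero are wanted, one possible approach is to convert each string with float() and filter. The sketch below uses a short in-memory list of hypothetical lines rather than test1.txt.

```python
import re

# Hypothetical lines in the style of test1.txt
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'To: pqr@abc.com',
]

positive = []
for line in lines:
    # The parentheses make findall() return only the number part
    for number_text in re.findall(r'^X\S*: ([0-9]+\.[0-9]+)', line):
        value = float(number_text)
        if value > 0:
            positive.append(value)

print(positive)  # [0.8475]
```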
Python program to extract the USN from a text file
The contents of sample input file usn.txt
Mahesh Huddar 2GI04CS045 GIT BGM
Rahul 2HN15CS001 HIT NDS
import re
hand = open('usn.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('[1-4][a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2}[0-9]{2}[1-9]', line)
    if len(x) > 0:
        print(x)
Output
['2GI04CS045']
['2HN15CS001']
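The USN pattern uses the {2} quantifier, which means exactly two of the preceding item. The sketch below compiles the same pattern with re.compile() so it can be reused, and also tries a hypothetical USN ending in 0 to show that the final [1-9] in the pattern, as written, does not accept it.

```python
import re

# {2} means "exactly two of the preceding item"
usn_pattern = re.compile('[1-4][a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2}[0-9]{2}[1-9]')

line = 'Mahesh Huddar 2GI04CS045 GIT BGM'
print(usn_pattern.findall(line))  # ['2GI04CS045']

# Hypothetical USN ending in 0: the final [1-9] rejects it
print(usn_pattern.findall('2GI04CS010'))  # []
```

If such numbers need to be accepted, the last three digits could be written as [0-9]{3} instead; whether that suits the real USN format is an assumption to check.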
Escape character – how to remove the special meaning of characters?
# Escape character - how to remove the special meaning of characters
import re
x = 'We just received 10*2=20 for cookies.'
y = re.findall(r'[0-9]+\*[0-9]=[0-9]+', x)
print(y)
Output
['10*2=20']
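When a whole piece of text should be matched literally, the backslashes do not have to be added by hand. As a sketch, re.escape() from the standard library does this; the variable names are illustrative.

```python
import re

x = 'We just received 10*2=20 for cookies.'
literal = '10*2=20'

# Used directly, '*' means "zero or more of the preceding 0", so no match
print(re.findall(literal, x))  # []

# re.escape() adds the backslashes needed to match the text literally
pattern = re.escape(literal)
print(re.findall(pattern, x))  # ['10*2=20']
```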
Summary:
This tutorial discussed how to use regular expressions in Python with simple programming examples.