How to use Regular expressions in Python

 


This tutorial discusses how to use regular expressions in Python with simple programming examples.

To find patterns and extract parts of a line, we can use string methods such as split() and find(), together with lists and string slicing.

#find function on strings
str1 = 'I am an Engg student'
print (str1.find('an'))

Output

5

The pattern ‘an’ is present in the string, so find() returns 5, the index where ‘an’ first occurs. In the next example we search for ‘AN’, which is not present in the string because the search is case-sensitive, so the output is -1.

#find function on strings
str1 = 'I am an Engg student'
print (str1.find('AN'))

Output

-1

The same searches can be written more flexibly with regular expressions, which Python provides through the built-in re module.

Regular expressions are available in almost every programming language for searching and parsing strings.
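As a minimal sketch of that equivalence, the snippet below repeats the two find() lookups using re.search() from the standard re module. The variable names found and missing are chosen only for illustration.

```python
import re

str1 = 'I am an Engg student'

# re.search() returns a match object, or None when the pattern is absent
found = re.search('an', str1)
missing = re.search('AN', str1)

# start() gives the same index that str.find() reported
print(found.start())   # 5
print(missing)         # None
```

Unlike find(), which signals failure with -1, re.search() signals it with None, so the result can be tested directly in an if statement.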


Special characters and character sequences:

^ – Matches the beginning of the line.

$ – Matches the end of the line.

. – Matches any character (a wildcard).

\s – Matches a whitespace character.

\S – Matches a non-whitespace character (opposite of \s).

‘*’ – Applies to the immediately preceding character and matches zero or more occurrences of it.

‘+’ – Applies to the immediately preceding character and matches one or more occurrences of it.

[aeiou] – Matches a single character as long as that character is in the specified set.

[a-z0-9] – You can specify ranges of characters using the minus sign.

[^A-Za-z] – When the first character in the set notation is a caret, it inverts the logic.

( ) – When parentheses are added to a regular expression, they are ignored for the purpose of matching but allow you to extract a particular subset of the matched string rather than the whole string when using findall().
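The special characters listed above can be tried out in a short sketch. The sample strings and variable names here are made up for illustration and are not part of the tutorial's input files.

```python
import re

text = 'Order 66 shipped to Pune'

vowels = re.findall(r'[aeiou]', 'regex')          # set: one character from the list
digits = re.findall(r'[0-9]+', text)              # range plus '+': one or more digits
non_letters = re.findall(r'[^A-Za-z]', 'a1 b2')   # leading caret inverts the set
star = re.findall(r'ab*', 'a ab abb')             # '*': zero or more 'b'
plus = re.findall(r'ab+', 'a ab abb')             # '+': one or more 'b'
city = re.findall(r'to (\S+)$', text)             # parentheses: return only the group

print(vowels)        # ['e', 'e']
print(digits)        # ['66']
print(non_letters)   # ['1', ' ', '2']
print(star)          # ['a', 'ab', 'abb']
print(plus)          # ['ab', 'abb']
print(city)          # ['Pune']
```

The patterns are written as raw strings (r'...') so that Python passes backslashes such as \S to the re module unchanged.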

The following programs demonstrate how to use regular expressions in Python

Search for a pattern using the search() function

# Find pattern Engg

import re
line = 'I am an Engg student'
if re.search('Engg', line):
    print(line)
else:
    print ("Pattern not found")
    
print ('FINISH')

Output

I am an Engg student
FINISH

# Find pattern Engg at the end of the string

import re
line = 'I am an Engg student'

if re.search('Engg$', line):
    print(line)
else:
    print ("Pattern not found")

print ('FINISH')

Output

Pattern not found
FINISH

# Find pattern Engg at the beginning of the string

import re
line = 'I am an Engg student'

if re.search('^Engg', line):
    print(line)
else:
    print ("Pattern not found")

print ('FINISH')

Output

Pattern not found
FINISH
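Both anchored searches above fail because 'Engg' sits in the middle of the string. For contrast, here is a small sketch, using the same line, of anchored patterns that do match. The boolean variable names are illustrative.

```python
import re

line = 'I am an Engg student'

# '^' anchors the pattern to the start of the string, '$' to the end
starts_with_i_am = re.search(r'^I am', line) is not None
ends_with_student = re.search(r'student$', line) is not None
ends_with_engg = re.search(r'Engg$', line) is not None

print(starts_with_i_am, ends_with_student, ends_with_engg)  # True True False
```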

Contents of the sample input file test1.txt

Name: Rahul

From: Karanataka

From: <xyz@abc.com>

To: pqr@abc.com

X-DSPAM-Confidence: 0.8475

X-DSPAM-Probability: 0.0000

Mail From: <XYZ123@ABC.COM>

To: PQR123@ABC.COM

X-DSPAM-Confidence: 0.9125

X-DSPAM-Probability: 0.0080

# Search for lines that contain 'From'
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)

Output

From: Karanataka
From: <xyz@abc.com>
Mail From: <XYZ123@ABC.COM>

# Search for lines that start with 'From:'
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)

Output

From: Karanataka
From: <xyz@abc.com>

# Character matching in regular expressions:
# search for lines that start with 'F', followed by 2 characters, followed by 'm:'
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)

Output

From: Karanataka
From: <xyz@abc.com>

# Search for lines that start with From and have an at sign
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From.+@.+', line):
        print(line)

Output

From: <xyz@abc.com>

Extracting data using regular expressions

Extracting data using regular expressions – extract email address

Here the regular expression \S+@\S+ matches one or more non-whitespace characters, followed by an @ symbol, followed by one or more non-whitespace characters.

#Example-1

import re
s = 'A message from <csev@abc.com> to cse@xyz.com'
lst = re.findall(r'\S+@\S+', s)
print(lst)

Output

['<csev@abc.com>', 'cse@xyz.com']

In the output, the first email address has extra < and > symbols. To remove them, we use a regular expression that matches only letters and digits around the @ symbol.

In the next example, the regular expression [a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+ matches one or more letters or digits ([a-zA-Z0-9]+), followed by an @ symbol, one or more letters or digits, a literal dot (\.), and again one or more letters or digits.

#Example-2

import re
s = 'A message from <csev@abc.com> to cse@xyz.com'
lst = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+', s)
print(lst)

Output

['csev@abc.com', 'cse@xyz.com']

Now the output is correct. The same result can be obtained with a shorthand character class: \w matches letters, digits, and the underscore, so for ASCII text it is equivalent to [a-zA-Z0-9_]. Example 3 shows this.

#Example-3
import re
s = 'A message from <csev@abc.com> to cse@xyz.com'
lst = re.findall(r'\w+@\w+\.\w+', s)
print(lst)

Output

['csev@abc.com', 'cse@xyz.com']

Python program to read the content of a file and extract an email address.

# Search for lines that have an at sign between characters
# The characters must be a letter or number
import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\w+@\w+\.\w+', line)
    if len(x) > 0:
        print(x)

Output

['xyz@abc.com']
['pqr@abc.com']
['XYZ123@ABC.COM']
['PQR123@ABC.COM']
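When matching the dot in an email address, it matters whether the dot is escaped. An unescaped . matches any character, so it can accept text that is not an address. The sketch below compares the two forms on a made-up sample string; the names loose and strict are illustrative.

```python
import re

sample = 'contact user@host com or admin@site.org'

# Unescaped dot: any character, including the space in 'host com'
loose = re.findall(r'\w+@\w+.\w+', sample)

# Escaped dot: only a literal '.'
strict = re.findall(r'\w+@\w+\.\w+', sample)

print(loose)   # ['user@host com', 'admin@site.org']
print(strict)  # ['admin@site.org']
```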

Python Programming examples to demonstrate Regular expressions in Python

Combining searching and extracting

Search for lines that start with ‘X’, followed by any non-whitespace characters and ‘:’, then a space and a decimal number.

import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: [0-9]+\.[0-9]+', line)
    if len(x) > 0:
        print(x)

Output

['X-DSPAM-Confidence: 0.8475']
['X-DSPAM-Probability: 0.0000']
['X-DSPAM-Confidence: 0.9125']
['X-DSPAM-Probability: 0.0080']

Search for lines that start with ‘X’, followed by any non-whitespace characters and ‘:’, then a space and a decimal number. The parentheses around the number part make findall() return only the number rather than the whole match.

import re
hand = open('test1.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9]+\.[0-9]+)', line)
    if len(x) > 0:
        print (x)

Output

['0.8475']
['0.0000']
['0.9125']
['0.0080']
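findall() returns the numbers as strings. If only the values greater than zero are wanted, one possible approach is to convert each captured string to float and filter it. The sketch below uses an in-memory list in place of test1.txt so that it runs on its own; the name positive is illustrative.

```python
import re

# Lines copied from the sample file so the sketch runs without test1.txt
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.9125',
    'X-DSPAM-Probability: 0.0080',
]

positive = []
for line in lines:
    # The group in parentheses captures only the number
    for number in re.findall(r'^X\S*: ([0-9]+\.[0-9]+)', line):
        value = float(number)
        if value > 0:
            positive.append(value)

print(positive)  # [0.8475, 0.9125, 0.008]
```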

Python program to extract the USN from a text file

The contents of sample input file usn.txt

Mahesh Huddar 2GI04CS045 GIT BGM

Rahul 2HN15CS001 HIT NDS 

import re
hand = open('usn.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[1-4][a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2}[0-9]{3}', line)
    if len(x) > 0:
        print (x)

Output

['2GI04CS045']
['2HN15CS001']
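The {n} quantifier repeats the preceding set exactly n times. As a rough sketch, the same pattern can be split into groups so that each part of the USN is returned separately. The names region, college, year, branch, and roll are hypothetical labels assumed for this example; the tutorial does not define what each part means. The final three digits are written here as [0-9]{3}, which also accepts numbers ending in 0.

```python
import re

line = 'Mahesh Huddar 2GI04CS045 GIT BGM'

# One group per part of the USN; the labels below are assumptions
pattern = r'([1-4])([a-zA-Z]{2})([0-9]{2})([a-zA-Z]{2})([0-9]{3})'
usn = re.search(pattern, line)

if usn:
    region, college, year, branch, roll = usn.groups()
    print(usn.group(0))                         # 2GI04CS045
    print(region, college, year, branch, roll)  # 2 GI 04 CS 045
```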

Escape character – how to remove the special meaning of characters?

# A backslash before * makes it match a literal asterisk

import re
x = 'We just received 10*2=20 for cookies.'
y = re.findall(r'[0-9]+\*[0-9]+=[0-9]+', x)
print (y)

Output

['10*2=20']
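When the text to find is a fixed string rather than a pattern, the standard library function re.escape() can add the backslashes automatically. A minimal sketch, reusing the sentence above (the name literal is illustrative):

```python
import re

x = 'We just received 10*2=20 for cookies.'

# re.escape() escapes every character that has a special meaning in a pattern
literal = re.escape('10*2=20')
y = re.findall(literal, x)

print(y)  # ['10*2=20']
```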

Summary:

This tutorial discussed how to use regular expressions in Python with simple programming examples.
