How to Retrieve Web Pages with urllib
In this tutorial we will discuss how to retrieve web pages with urllib. We can send and receive data over HTTP manually using the socket library, but then we have to construct the request command ourselves, send it, and parse the received data to strip off the header information, as in the sketch below.
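For comparison, here is a minimal sketch of the manual socket approach (it assumes the same data.pr4e.org server used later in this tutorial; note how the response arrives with the headers still attached):

import socket

# open a TCP connection to the web server and send a raw GET request
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    # the received data still includes the HTTP headers; we would have
    # to find the blank line and strip them off ourselves
    print(data.decode(), end='')

mysock.close()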
There is a simpler way to perform this task in Python by using the urllib library.
urllib treats a web page much like a file. You simply pass the address of the web page you would like to retrieve, and urllib does the rest: it handles all of the details of the HTTP protocol and the header information.
The following fragment of code shows how to retrieve only the content of a web page using the urllib library.
import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
Output
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
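Since urlopen() returns a file-like object, you can also read the entire body in one call instead of looping line by line. A small variation on the fragment above:

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
text = fhand.read().decode()  # the whole body as a single string
print(text)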
Reading binary files using urllib
Sometimes you want to retrieve a non-text (or binary) file such as an image or video file. The data in these files is generally not useful to print out, but you can easily copy the contents of a URL to a local file on your hard disk using urllib.
import urllib.request

img = urllib.request.urlopen('http://hsit.ac.in/images/hit1.JPG').read()
fhand = open('12345.jpg', 'wb')
fhand.write(img)
fhand.close()
Output
The image will be written to the local file 12345.jpg.
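Note that this program reads the whole image into memory with a single read() call. For a large file that could be wasteful; here is a minimal sketch of copying the file in chunks instead (the 100,000-byte buffer size is an arbitrary choice):

import urllib.request

fhand = urllib.request.urlopen('http://hsit.ac.in/images/hit1.JPG')
fout = open('12345.jpg', 'wb')
size = 0
while True:
    info = fhand.read(100000)  # read at most 100,000 bytes at a time
    if len(info) < 1:
        break
    size = size + len(info)
    fout.write(info)
fout.close()
print(size, 'bytes copied to 12345.jpg')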
Parsing HTML using regular expressions
Here is a simple web page for a demonstration.
Web page address: http://www.example.com/page1.htm

<h1>The First Page</h1>
<p> If you like, you can switch to the
<a href="http://www.example.com/page2.htm"> Second Page</a>.
<a href="http://www.example.com/page3.htm"> Third Page</a>. </p>
<h1>End of First page</h1>
import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
# find every href="http://..." value in the raw bytes
links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())
Output
Enter - http://www.example.com/page1.htm
http://www.example.com/page2.htm
http://www.example.com/page3.htm
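If you want to experiment with the pattern without a network connection, you can run the same regular expression against an in-memory copy of the sample page. A small self-contained sketch:

import re

# an in-memory copy of the sample page shown above
html = b'''<h1>The First Page</h1>
<p> If you like, you can switch to the
<a href="http://www.example.com/page2.htm"> Second Page</a>.
<a href="http://www.example.com/page3.htm"> Third Page</a>. </p>
<h1>End of First page</h1>'''

links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())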
Sample program to extract the text in h1 tags
import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
# capture the text between <h1> and </h1> tags
headings = re.findall(b'<h1>(.*?)</h1>', html)
for heading in headings:
    print(heading.decode())
Output
Enter - http://www.example.com/page1.htm
The First Page
End of First page
Parsing HTML using BeautifulSoup
A number of Python libraries can be used to parse HTML and extract the required data from web pages. Each library has its strengths and weaknesses, and you can pick one based on the requirements of your application.
Here we will use the BeautifulSoup library to parse HTML web pages and extract links. If it is not already installed, it can be installed with pip install beautifulsoup4.
BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need from the web page.
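A quick illustration of that tolerance: the snippet below never closes its tags, yet BeautifulSoup still parses it and finds the link (the HTML string here is a made-up example):

from bs4 import BeautifulSoup

# a deliberately broken snippet: the <p> and <a> tags are never closed
broken = '<p>Unclosed paragraph <a href="http://www.example.com/page2.htm">A link'
soup = BeautifulSoup(broken, 'html.parser')
print(soup.a.get('href'))  # prints http://www.example.com/page2.htm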
import urllib.request
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
The program prompts for a web address, opens the page, reads the data, and passes it to the BeautifulSoup parser; it then retrieves all of the anchor tags and prints the href attribute of each.
Output
Enter - http://www.example.com/page1.htm
http://www.example.com/page2.htm
http://www.example.com/page3.htm
Program to extract the content of h1 tags – using BeautifulSoup
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('h1'))
The program prompts for a web address, opens the page, reads the data, and passes it to the BeautifulSoup parser; it then retrieves all of the h1 tags and prints them as a list.
Output
Enter - http://www.example.com/page1.htm
[<h1>The First Page</h1>, <h1>End of First page</h1>]
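find_all() returns whole Tag objects, markup included. If you only want the text inside each h1, a small variation using the tag's .text attribute works:

import urllib.request
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
for h1 in soup.find_all('h1'):
    print(h1.text)  # .text gives only the text content of the tag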
Program to display tags, tag contents, and tag attributes – using BeautifulSoup
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)
The program prompts for a web address, opens the page, reads the data, and passes it to the BeautifulSoup parser; it then retrieves all of the anchor tags and, for each one, prints the whole tag, its href attribute, its contents, and its attribute dictionary.
Output
Enter - http://www.example.com/page1.htm
TAG: <a href="http://www.example.com/page2.htm"> Second Page</a>
URL: http://www.example.com/page2.htm
Contents: Second Page
Attrs: {'href': 'http://www.example.com/page2.htm'}
TAG: <a href="http://www.example.com/page3.htm"> Third Page</a>
URL: http://www.example.com/page3.htm
Contents: Third Page
Attrs: {'href': 'http://www.example.com/page3.htm'}
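The urllib.error module imported in the programs above becomes useful when a fetch can fail. Here is a minimal sketch of guarding urlopen() with it, using the sample address from this tutorial:

import urllib.request, urllib.error

try:
    html = urllib.request.urlopen('http://www.example.com/page1.htm').read()
    print(len(html), 'bytes retrieved')
except urllib.error.HTTPError as e:
    # the server responded, but with an error status such as 404
    print('HTTP error:', e.code)
except urllib.error.URLError as e:
    # the server could not be reached at all
    print('Failed to reach the server:', e.reason)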
Summary:
This tutorial discussed how to retrieve web pages with urllib and how to parse HTML using regular expressions and BeautifulSoup. If you like the tutorial, share it with your friends. Like the Facebook page for regular updates and the YouTube channel for video tutorials.