How to collect data from the Web using BeautifulSoup

Web scraping is a popular topic these days. With the rise of Big Data for Machine Learning and Artificial Intelligence, it has become an essential tool for collecting large amounts of data from the modern web. Python is a great programming language for this task, and libraries such as Selenium, BeautifulSoup, and Scrapy can be used to mine the web and collect data.

This tutorial uses the BeautifulSoup library to parse the HTML returned by a website and the requests library to send HTTP requests to the server. To follow along, you need Python 3 installed on your system; if you don’t have it, you can follow our guide on installing Python on Linux.

Installing Requirements

One benefit of Python is that much of the work is already done for us in the form of packages that are ready to install and use. Two such libraries are requests, which sends HTTP requests, and BeautifulSoup, which parses the returned HTML. Neither is part of the Python standard library, so we need to install them ourselves. We can install both by running the following command in the terminal.

pip install requests beautifulsoup4

Running the above command installs the latest versions of the requests and BeautifulSoup libraries on our system.

Importance of Web Scraping

Web scraping has become very popular in recent years, and many people use it to collect data from the web. The web is one of the largest sources of data available, and much of that data can be used to build and train better machine learning models. Despite this abundance, we can’t use all of it directly because the web is poorly structured, which makes collecting data from some sources very difficult. Python has many libraries that make it easier to collect data from the unstructured web.

Legalities

At first, web scraping may look like an illegal activity, but scraping for educational purposes is generally not a problem as long as you follow the website’s terms and conditions. Many website owners still dislike web scraping because some programmers run scripts for long periods, which puts extra load on the server. We have included this tutorial for educational purposes only.

In this tutorial, we will scrape a single Wikipedia page, https://en.wikipedia.org/wiki/Python_(programming_language), and collect data from it. We recommend not running the Python script for a long time, as it may put extra load on the website’s server. If you build a program that puts load on Wikipedia’s servers, you could make a tax-deductible donation to Wikipedia, which can be used for its servers.
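One common way to respect a site's rules is to consult its robots.txt file before fetching a page. The sketch below uses Python's standard urllib.robotparser module on an inlined, made-up robots.txt. The rules, the user-agent name tutorial-bot, and the example.com URLs are assumptions for illustration only and do not reflect Wikipedia's actual policy.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, inlined so the example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "tutorial-bot" is a hypothetical user-agent name.
allowed = parser.can_fetch("tutorial-bot", "https://example.com/wiki/Python")
blocked = parser.can_fetch("tutorial-bot", "https://example.com/private/data")
print(allowed, blocked)
```

In a real script, parser.set_url() followed by parser.read() would download the site's live robots.txt instead of parsing an inlined string.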

Web Scraping using BeautifulSoup

Now let us begin collecting data from the web. In this tutorial, we will work with only a single page and see how we can collect data from it.

Sending the HTTP request

To connect to a website, we first need to send an HTTP request to the server. The server then replies with a response that contains a status code indicating whether the request succeeded, along with the requested content. We can send an HTTP request by using Python’s requests library. The following code shows a practical illustration.

# importing the required modules
import requests
# The URL we want to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending an HTTP request to the URL
r = requests.get(url)
# displaying the content of the URL 
print(r.text)

In the above code, we first imported the requests module and then used its get() function to send an HTTP request to the specified URL. The get() function returns a Response object whose attributes and methods carry information about the response sent back by the server.

After sending the HTTP request, we display the content of the web page by using the text attribute of the Response object. On running the above code, we will get the web page’s content printed in the terminal. See the below image for output.

sending an HTTP request to the server
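A request can fail or hang, so it often helps to set a timeout and check the status code before using the response. The following is a minimal sketch, not the only way to do it; the function name fetch_html and the User-Agent string are hypothetical choices made for this example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or raise if the request fails."""
    # Identify the script to the server; this User-Agent value is a made-up example.
    headers = {"User-Agent": "tutorial-scraper/0.1 (educational example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://en.wikipedia.org/wiki/Python_(programming_language)") would return the same text that r.text holds in the code above, but it raises an exception instead of silently returning an error page.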

Creating a BeautifulSoup object

After sending the HTTP request and getting the page’s HTML content, we need to parse the raw HTML data to extract meaningful information from the page. We can parse the HTML content by using the BeautifulSoup library, one of the best libraries for parsing HTML and XML data. To create a BeautifulSoup object, run the following code.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")

In the above code, we first imported the BeautifulSoup class from the bs4 module along with the requests module. Then we sent an HTTP request to the URL by using the requests module. Next, we created a BeautifulSoup object by passing it two arguments: r.text, which is the HTML content of the web page, and the string “html.parser”, which names the parser to use. Other parsers such as lxml are also available and are generally faster, but “html.parser” ships with Python, so we don’t need to install anything extra to use it.
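While experimenting, it can be convenient to parse a small HTML string instead of a live page, since that avoids repeated requests to the server. The sketch below does this with a made-up document; the variable names and the HTML itself are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used in place of a live page.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="main-heading">Hello</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
page_title = soup.title.string  # text inside the <title> element
heading = soup.h1.string        # text inside the first <h1> element
print(page_title, "|", heading)
```

Here soup.title and soup.h1 are shorthand for the first element with that tag name.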

Selecting an element

After getting the HTML content, we usually need particular elements from it. BeautifulSoup can find an HTML element, together with its content, by tag, class, id, or other attributes. Let us see how we can get a specific element.

Using Tag Name

We can select an HTML element with a specific tag by using BeautifulSoup. To get a specific HTML element using the tag name, use the find() method of the BeautifulSoup object and provide the tag name as the parameter. See the below code, for example.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")
# print the title element of the Webpage
print(bs.find('title'))
# print the h1 heading of the webpage
print(bs.find('h1'))

In the above code, we use the find() method of the BeautifulSoup object to get the title and the h1 tag of the HTML page.

Output:

getting HTML element by tag

Using class name

We have seen how to get an HTML element by its tag name, but we can also get an element by its class name. To do so, we use the class_ parameter of the find() method (with a trailing underscore, because class is a reserved word in Python). The below code shows a practical example.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")
# print the first a tag having the class external
print(bs.find("a", class_="external"))

In the above code, we use the find() method of the BeautifulSoup object with the parameter class_ set to “external”. The find() method searches the HTML content and returns the first <a> tag with the class external, which we then print.

Output:

getting HTML element by class

Using the ID

We can also get an HTML element by searching for its id. To do this, pass the id of the element to the id parameter of the find() method. The below code shows how we can achieve this.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")
# print the h1 tag having the id firstHeading
print(bs.find("h1", id="firstHeading"))

In the above code, we called the find() method with h1 and id=“firstHeading”, so running the program prints the h1 element with the id firstHeading to the console. Take a look at the following output image.

getting HTML element by id

Selecting an element by using its attributes

In some cases, we can’t select an element by its id or class. In that case, we can select it by its other HTML attributes. To do that, pass the attribute’s key: value pair as a dictionary to the find() method. See the below code, for example.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")
# print the first a tag having the href https://www.python.org
print(bs.find("a", {"href":"https://www.python.org/"}))

In the above code, we call the find() method with two arguments: the string “a” and a Python dictionary that maps an attribute name to its value. We use the href attribute with the value https://www.python.org/, so the code displays the first <a> tag having the attribute href=“https://www.python.org/”.

Output:

getting HTML element by attributes values
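When nothing matches, find() returns None rather than raising an error, so it is worth checking the result before using it. The sketch below runs the four selection styles from this section on a small made-up HTML snippet (an assumption for illustration, not Wikipedia's real markup) and shows the None case.

```python
from bs4 import BeautifulSoup

# Made-up HTML used for illustration only.
html = """
<div id="content">
  <h1 id="firstHeading" class="page-title">Python</h1>
  <a class="external" href="https://www.python.org/">Official site</a>
  <a href="/wiki/Guido_van_Rossum">Guido van Rossum</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("h1")                       # by tag name
by_class = soup.find("a", class_="external")   # by class
by_id = soup.find(id="firstHeading")           # by id
by_attr = soup.find("a", {"href": "/wiki/Guido_van_Rossum"})  # by attribute
missing = soup.find("table")                   # no <table> here, so this is None

if missing is None:
    print("No <table> element found")
print(by_class["href"])
```

Accessing an attribute or the text of a None result raises an error, which is a frequent cause of crashes when a page's layout changes.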

Selecting multiple elements Using findAll() method

The find() method returns only the first HTML element that matches the condition we provide. But selecting a single element is not always enough, as we may need every element that matches. For that, the BeautifulSoup object has another method, findAll(), that selects multiple items. (findAll() is the older spelling of find_all(); BeautifulSoup 4 accepts both, and find_all() is the name used in its current documentation.) We can use any of the selectors from the find() method, such as class, id, attributes, or tag name, in findAll() as well. The following code shows a practical example of how to use it.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")
# getting all the <a> tag with the class external
external_links = bs.findAll("a", class_="external")
# Display all the links containing the class external
for link in external_links:
    print(link)
print("\n\n")
# Displaying the total links we get
print(f"[+] There are {len(external_links)} links containing the class external")

In the above code, we use the findAll() method of the BeautifulSoup object to get all the links that have the class external.

Output:

getting all the a tag using the findAll() method
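The result of find_all() behaves like a list, so it can be used in a list comprehension, and it is simply empty when nothing matches. The sketch below assumes a small made-up HTML snippet and also shows the limit argument, which caps the number of results.

```python
from bs4 import BeautifulSoup

# Made-up HTML used for illustration only.
html = """
<ul>
  <li><a class="external" href="https://www.python.org/">Python</a></li>
  <li><a class="external" href="https://pypi.org/">PyPI</a></li>
  <li><a href="/wiki/Main_Page">Main page</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect only the href values of the links with the class "external".
external = soup.find_all("a", class_="external")
external_urls = [a["href"] for a in external]

# limit=1 stops the search after the first match.
first_only = soup.find_all("a", limit=1)

# No <table> exists in this snippet, so the result is empty.
no_tables = soup.find_all("table")

print(external_urls, len(first_only), len(no_tables))
```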

Getting the data of an HTML element

We have seen how to get HTML elements using BeautifulSoup, but getting the elements alone is not enough. We also need to collect the data inside them. To collect the text inside an HTML element, we use the text attribute of the element object. See the below code for an example.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")
# getting the title element of the Webpage
title = bs.find("title")
# printing the title of the page
print(f"The title of the Webpage is [[ {title.text} ]]")

On running the above code, we will get the text present inside the web page’s title element. Look at the below output image.

getting the inner text of an HTML element
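Text taken from real pages often carries stray whitespace. As a hedged illustration on a made-up paragraph, the sketch below compares the text attribute with the get_text() method, which can strip each piece of text and join the pieces with a separator.

```python
from bs4 import BeautifulSoup

# Made-up HTML with extra spaces around the text.
html = "<p>  Python is <b>dynamically</b> typed.  </p>"
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("p")

# .text keeps the whitespace exactly as it appears in the HTML.
raw_text = paragraph.text

# get_text() with strip=True trims each text fragment and joins them with " ".
clean_text = paragraph.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```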

We can also get the values of an HTML element’s attributes. To do this, we use the attrs dictionary of the element object. Let us see a practical example. In the below code, we get the class attribute of the h1 element present in the web page.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")
# getting the h1 element of the Webpage
h1 = bs.find("h1")
# displaying a list of all classes of the h1 element
print(h1.attrs['class'])

Output:

getting the attributes of an HTML element
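Indexing attrs with a name that is not present raises a KeyError, whereas the element's get() method returns None instead. The sketch below illustrates both on a made-up link; note that class is returned as a list because an element can have several classes.

```python
from bs4 import BeautifulSoup

# Made-up HTML used for illustration only.
html = '<a id="home" class="external text" href="https://www.python.org/">Python</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]          # same as link.attrs["href"]; KeyError if missing
classes = link.get("class")  # class can hold several values, so this is a list
title = link.get("title")    # the tag has no title attribute, so this is None

print(href, classes, title)
```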

You may be wondering where this data can be used and what benefits web scraping brings. Web scraping helps gather data for training machine learning models. For example, we could collect data about used cars, such as the model, original price, and number of seats, from some websites and build a machine learning model that predicts used-car prices from that data. So, many ideas can be realized with the help of web scraping and machine learning. If you want to learn more about machine learning, you can refer to our section on machine learning.
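To make the used-car idea concrete, the sketch below turns a small HTML table into a list of records and writes them out as CSV, a format most machine learning tools can read. The table markup, the id listings, and the car data are entirely hypothetical; a real listing site will use different tags and classes.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical listing markup with made-up data.
html = """
<table id="listings">
  <tr><th>Model</th><th>Price</th><th>Seats</th></tr>
  <tr><td>Car A</td><td>5000</td><td>5</td></tr>
  <tr><td>Car B</td><td>7500</td><td>4</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="listings").find_all("tr")

# The first row holds the column names; the remaining rows hold the data.
header = [th.get_text(strip=True).lower() for th in rows[0].find_all("th")]
records = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append(dict(zip(header, cells)))

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

All values are kept as strings here; converting prices and seat counts to numbers would be a separate cleaning step before training a model.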

An example Program

We have learned many things about web scraping using BeautifulSoup. Now let us build a program to see how it works. The program takes the URL of a web page, which is set in the url variable, and displays all the URLs found in the page’s <a> tags.

# importing the required modules
from bs4 import BeautifulSoup
import requests
# The URL to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# sending a HTTP request to the server
r = requests.get(url)
# Creating a BeautifulSoup object
bs = BeautifulSoup(r.text, "html.parser")
# getting all the <a> tags
links = bs.findAll("a")
# print all the URLs present in the webpage
for link in links:
    try:
        print(link.attrs['href'])
    except KeyError:
        # skip <a> tags that have no href attribute
        pass

Output:

displaying all the URLs of a Webpage
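Many href values on a page are relative (for example /wiki/... or #section) and cannot be requested as they are. One way to turn them into full URLs is urljoin() from Python's standard urllib.parse module, sketched below on a made-up snippet. Passing href=True to find_all() is an alternative to the try/except above, since it matches only tags that have an href attribute.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Made-up HTML used for illustration only.
html = """
<a href="/wiki/Guido_van_Rossum">Guido</a>
<a href="#History">History</a>
<a href="https://www.python.org/">Python</a>
<a name="anchor-without-href">No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

absolute_urls = []
# href=True matches only the <a> tags that actually have an href attribute.
for link in soup.find_all("a", href=True):
    absolute_urls.append(urljoin(base_url, link["href"]))

for url in absolute_urls:
    print(url)
```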

Programs like this are very useful. For example, assume we need to check that all the URLs on a client’s website are working and return a 200 response. We can write a Python script that gets all the links on a page, sends a request to each one to check it, and then follows the internal links to repeat the check on those pages. So web scraping in Python is a useful skill for every programmer.
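The link-checking idea could be sketched roughly as follows. This is a minimal outline under stated assumptions rather than a complete crawler: the function names internal_links and link_status are hypothetical, the check uses a HEAD request (which some servers do not support), and following the internal links recursively is left out.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def internal_links(html, page_url):
    """Return the absolute URLs in html that point to the same host as page_url."""
    host = urlparse(page_url).netloc
    soup = BeautifulSoup(html, "html.parser")
    found = set()
    for link in soup.find_all("a", href=True):
        # Resolve relative links and drop the #fragment part.
        absolute = urljoin(page_url, link["href"]).split("#")[0]
        if urlparse(absolute).netloc == host:
            found.add(absolute)
    return sorted(found)


def link_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        return response.status_code
    except requests.RequestException:
        return None
```

A script could fetch a page, pass its HTML to internal_links(), and report every URL for which link_status() does not return 200, ideally pausing between requests to keep the load on the server low.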

Conclusion

In this tutorial, we have seen how to perform web scraping with requests and BeautifulSoup. We have also built a program that displays all the links present on a web page. You may also refer to our guide on parsing JSON data using Python.
