BeautifulSoup is a Python library that makes it easy to parse HTML and XML files. Its API helps in searching, navigating, and modifying the parsed document tree. It is commonly used to parse data from scraped web pages, which is useful when a website does not provide a REST API for the information we need. BeautifulSoup cannot fetch web pages itself; it can only parse pages that have already been downloaded. To download a page, we need to use libraries like urllib, requests, etc. Behind the scenes, BeautifulSoup relies on other Python parsers (html.parser, lxml, html5lib) to build the DOM structure of the web page. The API is intuitive and easy to use. The current version is beautifulsoup4, which is the recommended version and works with Python 3.
As a part of this tutorial, we'll cover the API of the beautifulsoup library in detail, including the majority of the functionality it provides. The tutorial uses a simple HTML document to make things easier to understand. It focuses on retrieving tags and strings from that document and does not concentrate on methods that modify HTML documents. We have a different tutorial where we cover how to modify HTML documents using beautifulsoup. Please feel free to explore it from the below link.
Below we have highlighted important sections of the tutorial to give an overview of the material covered.
In the below cell, we have imported beautifulsoup library and printed the version of it that we'll be using in this tutorial.
import bs4
print("BeautifulSoup Version : {}".format(bs4.__version__))
In order to use beautifulsoup to parse HTML documents, we need to create a BeautifulSoup object which has information about parsed DOM tree of the HTML document. We can then call various methods and attributes available through this BeautifulSoup object to search and retrieve information from an HTML document.
Below we have created a sample HTML document that we'll be using through our tutorial. We'll be parsing this document and explaining various methods/attributes of BeautifulSoup object on this document.
sample_html= '''<html>
<head>
<title>CoderzColumn : Developed for Developers by Developers for the betterment of Development.</title>
<script src="static/script1.js"></script>
<script src="static/script2.js"></script>
<link rel="stylesheet" href="static/stylesheet.css" type="text/css" />
</head>
<body>
<p id='start'>Welcome to CoderzColumn</p>
<p id='main_para'>We regularly publish tutorials on various topics
(Python, Machine learning, Data Visualization, Digital Marketing, etc.) regularly explaining
how to use various Python libraries.</p>
<p id='sub_para'>Below are list of Important Sections of Our Website : </p>
<ul>
<li><a href='https://coderzcolumn.com/blogs'>Blogs</a></li>
<li><a href='https://coderzcolumn.com/tutorials'>Tutorials</a></li>
<li><a href='https://coderzcolumn.com/about'>About</a></li>
<li><a href='https://coderzcolumn.com/contact-us'>Contact US</a></li>
</ul>
<p id='end'>Please feel free to send us mail @ coderzcolumn07@gmail.com if you need any
information about any article or want us to publish article on particular topic.</p>
</body>
</html>'''
We can create a BeautifulSoup object by calling its constructor. The first argument to the constructor is the whole HTML/XML document, given as a string or as a file-like object pointing to an HTML/XML file. The second argument is a string specifying which underlying parser library to use. The possible values for the second argument are 'lxml', 'lxml-xml', 'html.parser' and 'html5lib'. When we create a BeautifulSoup object, it parses the whole input document and creates a DOM-like (tree-like) structure whose nodes are one of two types of objects: Tag and NavigableString.
BeautifulSoup will also parse malformed markup, but the underlying parsers repair it in different ways, so results are most predictable when the document is valid, i.e., all tags have end tags, etc. We'll explain how we can retrieve information from Tag and NavigableString objects in our upcoming sections.
Below we have created BeautifulSoup object for our HTML document. We have then printed it as well.
from bs4 import BeautifulSoup
soup = BeautifulSoup(sample_html, 'html.parser')
print(soup)
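As a minimal sketch of the constructor behaviour described above, the snippet below builds a soup from a string and from a file-like object and checks the two node types. The tiny document and the variable names (mini_doc, soup_from_str, soup_from_file) are made up for illustration and are not part of the tutorial's sample.

```python
import io

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical mini document, used only for this illustration.
mini_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# The first argument can be a string ...
soup_from_str = BeautifulSoup(mini_doc, "html.parser")
# ... or a file-like object with a read() method.
soup_from_file = BeautifulSoup(io.StringIO(mini_doc), "html.parser")

title_tag = soup_from_str.title   # a Tag node
title_text = title_tag.string     # the NavigableString inside it

print(type(title_tag).__name__)
print(type(title_text).__name__)
print(str(soup_from_str) == str(soup_from_file))
```

Both inputs should produce the same tree, so which one to use mostly depends on whether the page is already in memory or stored in a file.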
In this section, we'll explain how we can access individual HTML tags by treating tag names as properties of the BeautifulSoup object. The same works on any Tag object: accessing a tag name as a property returns the Tag object representing that tag.
If there is more than one HTML tag with the same name, the property call retrieves the first one. Later sections of the tutorial explain how to retrieve all tags with a particular name.
Below we have called the html property on our soup object to retrieve the whole HTML document. The majority of calls to methods and properties of the BeautifulSoup object return an object of type Tag or NavigableString. Whenever we print a Tag or NavigableString object, it prints the content of that HTML tag.
whole_page_src = soup.html
print("Object Type : {}".format(type(whole_page_src)))
print(whole_page_src)
Below we have retrieved the script tag from the document. As there is more than one script tag, it returns the first one. The next few cells show more examples of retrieving other tags from the soup object by treating their names as properties.
soup.script
soup.link
soup.title
soup.body
soup.p
soup.li
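The following sketch illustrates two details of tag-name access: only the first matching tag is returned, and a tag that does not exist yields None. It uses its own small document; the names mini_soup, first_p and missing are assumptions for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <p> tags and no <table>.
mini_soup = BeautifulSoup(
    "<div><p id='first'>one</p><p id='second'>two</p></div>", "html.parser"
)

first_p = mini_soup.p        # only the first <p> is returned
missing = mini_soup.table    # no <table> in the document, so this is None
nested = mini_soup.div.p     # property access can be chained through Tag objects

print(first_p["id"])
print(missing)
print(nested is first_p)
```

Because a missing tag gives None, chaining such as mini_soup.table.tr would raise an AttributeError, so it is worth checking for None when a tag may be absent.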
In this section, we'll explain how we can retrieve the attributes and values of those attributes of HTML tag.
Each Tag object has a property named attrs which returns a dictionary that has all attributes of that HTML tag. We can also modify an attribute's value or add a new attribute through this dictionary.
soup.link.attrs
soup.script.attrs
soup.a.attrs
Another way to retrieve the value of the particular attribute is by treating Tag object as a dictionary. Below we have explained with simple examples how to retrieve attribute values this way.
soup.link["type"]
soup.link["href"]
soup.a["href"]
The get() method can also be used to retrieve the value of an attribute. We call get() on the Tag object with an attribute name and it returns the value of that attribute, or None if the tag does not have it.
soup.link.get("href")
soup.link.get("type")
The get_attribute_list() method of the Tag object always returns the value of an attribute as a list, which is convenient when an attribute can hold more than one value (for example, class).
soup.link.get_attribute_list("type")
soup.link.get_attribute_list("href")
We can check whether an attribute is present in an HTML tag using the has_attr() method. It takes an attribute name as input and returns True if the attribute is present in the Tag object, otherwise False.
soup.a.has_attr("href")
soup.a
soup.a.has_attr("id")
soup.a.has_attr("target")
soup.p.has_attr("id")
soup.p
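To compare the attribute access styles side by side, here is a small sketch on a single hypothetical anchor tag. The markup and the variable names are assumptions made for this example; it was written with the html.parser backend in mind.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute.
mini_soup = BeautifulSoup(
    '<a id="home" class="nav active" href="/index.html">Home</a>', "html.parser"
)
link = mini_soup.a

href = link["href"]                       # dictionary-style access
target = link.get("target")               # missing attribute -> None
target_or_default = link.get("target", "_self")
classes = link["class"]                   # multi-valued attribute comes back as a list
id_values = link.get_attribute_list("id")  # single value wrapped in a list

# Dictionary-style access raises KeyError for a missing attribute.
try:
    link["target"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(href, target, target_or_default, classes, id_values, raised_key_error)
```

In short, square brackets are the strict option, get() is the forgiving one, and has_attr() is useful when we only need to know whether the attribute exists.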
There are various ways to access the text of the HTML tag in BeautifulSoup. We'll explain them one by one next.
We can access the total text of a Tag or soup object by reading the text property on it. It recursively collects the text of all tags nested inside a particular HTML tag and concatenates it.
Below we are retrieving the total text of the HTML doc by reading the text property on the soup object. It retrieves all text present in the document, including the text of the title tag in the head.
soup.text
print(soup.text)
print(soup.title.text)
print(soup.body.text.strip())
print(soup.p.text)
The second way of retrieving the text of a particular Tag is by calling get_text() method on it.
print(soup.body.get_text().strip())
print(soup.ul.get_text().strip())
The getText() method works just like get_text() method.
print(soup.ul.getText().strip())
print(soup.li.get_text().strip())
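The Tag object has two more properties that return its text: strings and stripped_strings. Both are generators that yield the strings inside the tag one by one; stripped_strings additionally strips surrounding whitespace and skips strings that contain only whitespace.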
soup.ul.strings
list(soup.ul.strings)
list(soup.body.strings)
list(soup.ul.stripped_strings)
list(soup.body.stripped_strings)
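The differences between these text accessors are easiest to see on a small list that contains newlines between tags. The snippet below is an illustrative sketch on a made-up document (mini_soup and the other names are not from the tutorial).

```python
from bs4 import BeautifulSoup

# Hypothetical list with newlines between the <li> tags.
mini_soup = BeautifulSoup(
    "<ul>\n<li><a>Blogs</a></li>\n<li><a>Tutorials</a></li>\n</ul>", "html.parser"
)
menu = mini_soup.ul

raw_text = menu.text                               # all strings concatenated as-is
joined = menu.get_text(" | ", strip=True)          # custom separator, whitespace stripped
all_strings = [str(s) for s in menu.strings]       # includes the newline-only strings
clean_strings = list(menu.stripped_strings)        # whitespace-only strings skipped

print(repr(raw_text))
print(joined)
print(all_strings)
print(clean_strings)
```

When the goal is a clean list of visible labels, stripped_strings or get_text() with strip=True usually saves a manual strip() pass.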
We can retrieve the name of any HTML tag by calling name property on Tag object itself.
soup.ul.name
soup.title.name
soup.title.parent.name
soup.body.parent.name
In this section, we'll explain how we can retrieve various HTML tags which are present inside of given HTML tags. These are generally referred to as children of HTML tag. There are various ways to retrieve children of the given HTML tag.
The contents property of a Tag object returns a list of the Tag and NavigableString objects which are direct children of the HTML tag.
Below we have retrieved children of a few HTML tags by calling contents property.
for content in soup.ul.contents:
    print(type(content))
soup.ul.contents
for content in soup.ul.contents:
    if isinstance(content, bs4.element.Tag):
        print([type(c) for c in content.contents], content.contents)
for content in soup.head.contents:
    print(type(content))
soup.head.contents
for content in soup.p.contents:
    print(type(content))
soup.p.contents
Another way to retrieve all children of the HTML tag is by calling children property of Tag object. Below we have explained the usage with a few simple examples.
soup.ul.children
for child in soup.ul.children:
    print(type(child))
list(soup.ul.children)
for child in soup.head.children:
    print(type(child))
list(soup.head.children)
The descendants property goes one step further than children: it iterates over all elements nested inside an HTML tag (children, grandchildren, and so on) in document order.
soup.ul.descendants
for descendant in soup.ul.descendants: ## Depth first (document order)
    print(type(descendant))
list(soup.ul.descendants)
for descendant in soup.head.descendants:
    print(type(descendant))
list(soup.head.descendants)
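As a compact comparison of contents, children and descendants, the sketch below runs all three on a small made-up list without whitespace between tags. The helper label() and the variable names are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-item list, written without whitespace to keep the output short.
mini_soup = BeautifulSoup(
    "<ul><li><a>Blogs</a></li><li><a>About</a></li></ul>", "html.parser"
)
menu = mini_soup.ul


def label(element):
    """Return the tag name for Tag objects and the text for strings."""
    return element.name if isinstance(element, Tag) else str(element)


direct_children = [label(c) for c in menu.children]          # one level deep
same_as_contents = list(menu.children) == menu.contents      # same items, list vs iterator
all_descendants = [label(d) for d in menu.descendants]       # every nested element

print(direct_children)
print(same_as_contents)
print(all_descendants)
```

The descendants output visits each li together with everything inside it before moving to the next li, which is the depth-first, document-order traversal.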
In this section, we'll explain how we can retrieve parent details of a given HTML tag. We can retrieve details about the immediate parent of the given HTML tag or all parents of the HTML tag till the root of the document which is the parent of all tags. The Tag object provides two properties to retrieve parents’ details.
The parent property returns just the immediate parent of the given HTML tag, i.e., the HTML tag which directly contains it.
soup.li
parent = soup.li.parent
print(type(parent))
parent
soup.a
soup.a.parent
soup.title
soup.title.parent
The parents property returns all parents of the given HTML tag. It yields the immediate parent first, followed by that parent's parent, and so on up to the root of the document.
soup.a
soup.a.parents
total_parents = list(soup.a.parents)
print("Total Number of Parents : {}".format(len(total_parents)))
print("=========First Immediate Parent =====================")
print(total_parents[0])
print("\n=========Parent of Immediate Parent (Grand Parent) =================")
print(total_parents[1])
soup.ul
total_parents = list(soup.ul.parents)
print("Total Number of Parents : {}".format(len(total_parents)))
print("=========First Immediate Parent =====================")
print(total_parents[0])
print("\n=========Parent of Immediate Parent (Grand Parent) =================")
print(total_parents[1])
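A quick way to see the whole chain of parents is to collect their names. The following is a minimal sketch on a hypothetical document; mini_soup, link and parent_chain are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a link nested inside a list.
mini_soup = BeautifulSoup(
    "<html><body><ul><li><a href='/blogs'>Blogs</a></li></ul></body></html>",
    "html.parser",
)
link = mini_soup.a

immediate_parent = link.parent.name              # the tag directly containing <a>
parent_chain = [p.name for p in link.parents]    # nearest parent first
root_parent = mini_soup.parent                   # the BeautifulSoup object has no parent

print(immediate_parent)
print(parent_chain)
print(root_parent)
```

The last entry of the chain is the BeautifulSoup object itself, which reports its name as "[document]".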
In this section, we'll explain how we can retrieve siblings of a given HTML tag. The siblings are tags that are at the same level as the given HTML tag and have the same immediate parent as the given HTML tag. The Tag object provides various properties to retrieve siblings of a given HTML tag.
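The next_sibling property of a Tag object returns the sibling that comes immediately after the given HTML tag. Note that this sibling can be a NavigableString, such as the newline between two tags, rather than a Tag.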
soup.p
out = soup.p.next_sibling
print(type(out))
out
soup.p.next_sibling.next_sibling
soup.li
soup.li.next_sibling
soup.li.next_sibling.next_sibling
The previous_sibling property of a Tag object returns the sibling that comes immediately before the given HTML tag.
out = soup.ul.previous_sibling
print(type(out))
out
soup.ul.previous_sibling.previous_sibling
soup.li
soup.script
soup.script.previous_sibling.previous_sibling
The next_siblings property returns a generator over all siblings that come after the given HTML tag, in the order in which they were parsed.
soup.li
soup.li.next_siblings
for sibling in soup.li.next_siblings:
    print(type(sibling))
list(soup.li.next_siblings)
soup.p
for sibling in soup.p.next_siblings:
    print(type(sibling))
list(soup.p.next_siblings)
The previous_siblings property returns a generator over all siblings that were parsed before the given HTML tag, starting with the nearest one.
soup.ul.previous_siblings
for sibling in soup.ul.previous_siblings:
    print(type(sibling))
list(soup.ul.previous_siblings)
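Since whitespace between tags is parsed as NavigableString siblings, it often helps to filter the sibling generators down to Tag objects. The sketch below shows one way to do that on a made-up document; all names in it are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraphs separated by newlines.
mini_soup = BeautifulSoup(
    "<div><p id='a'>A</p>\n<p id='b'>B</p>\n<p id='c'>C</p></div>", "html.parser"
)
first_p = mini_soup.p
last_p = mini_soup.div.contents[-1]

direct_next = first_p.next_sibling                  # the newline string, not a tag
next_tag = first_p.next_sibling.next_sibling        # the second <p>

# Keep only Tag siblings, skipping whitespace strings.
tags_after = [s["id"] for s in first_p.next_siblings if isinstance(s, Tag)]
tags_before = [s["id"] for s in last_p.previous_siblings if isinstance(s, Tag)]

print(repr(str(direct_next)), next_tag["id"], tags_after, tags_before)
```

Note that previous_siblings walks backwards, so the nearest sibling comes first in tags_before.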
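We can retrieve all elements which were parsed after the given HTML tag as well as all elements which were parsed before it.
The Tag object has various properties which can be used to move forward and backward through the document in parse order. The first of these is next, which returns the element that was parsed immediately after the given HTML tag; this can be a NavigableString as well as a Tag.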
out = soup.p.next
print(type(out))
out
soup.p.next.next
soup.p.next.next.next
soup.ul.next
soup.ul.next.next
The next_element property of a Tag object works exactly like the next property and returns the element which was parsed immediately after the given HTML tag.
out = soup.p.next_element
print(type(out))
out
soup.p.next_element.next_element
soup.p.next_element.next_element.next_element
soup.ul.next_element
soup.ul.next_element.next_element
The next_elements property returns a generator over all elements (Tag and NavigableString objects) which were parsed after the given HTML tag.
soup.ul.next_elements
for elem in soup.ul.next_elements:
    print(type(elem))
list(soup.ul.next_elements)
for elem in soup.a.next_elements:
    print(type(elem))
list(soup.a.next_elements)
The previous property returns the element (Tag or NavigableString) which was parsed immediately before the given HTML tag.
out = soup.ul.previous
print(type(out))
out
soup.ul.previous.previous
soup.ul.previous.previous.previous
The previous_element property works exactly like previous property.
out = soup.ul.previous_element
print(type(out))
out
soup.ul.previous_element.previous_element
soup.ul.previous_element.previous_element.previous_element
The previous_elements property returns a generator over all elements (Tag and NavigableString objects) which were parsed before the HTML tag on which it was called, starting with the most recent one.
soup.ul.previous_elements
for elem in soup.ul.previous_elements:
    print(type(elem))
list(soup.ul.previous_elements)[:7]
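The difference between sibling navigation and element (parse order) navigation can be seen in the short sketch below. It uses a hypothetical document and a small label() helper, both introduced only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document: a paragraph with nested markup followed by a span.
mini_soup = BeautifulSoup(
    "<div><p>Hi <b>there</b></p><span>end</span></div>", "html.parser"
)
para = mini_soup.p


def label(element):
    """Return the tag name for Tag objects and the text for strings."""
    return element.name if isinstance(element, Tag) else str(element)


sibling_after = label(para.next_sibling)    # stays on the same level: the <span>
element_after = label(para.next_element)    # dives into the tag: its first string
forward = [label(e) for e in para.next_elements]
# Only the three nearest elements before <span> are kept here.
backward = [label(e) for e in mini_soup.span.previous_elements][:3]

print(sibling_after, repr(element_after))
print(forward)
print(backward)
```

In other words, next_sibling skips over everything inside the tag, while next_element continues with whatever the parser saw next, including the tag's own contents.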
In this section, we'll explain various find_*() methods available from the Tag object that can be used to search for a particular HTML tag or tags. We'll discuss find(), find_all(), find_next(), find_previous(), find_all_next(), find_all_previous(), find_parent(), find_parents(), find_next_sibling(), find_previous_sibling(), find_next_siblings() and find_previous_siblings().
The name parameter is the name of the HTML tag that we are searching for. The attrs parameter accepts a dictionary in which each key is an attribute name and each value is the value that attribute must have for a tag to match. The text parameter accepts a string and matches Tag objects whose string content equals it; recent versions of BeautifulSoup call this parameter string and keep text as an older alias. Any other keyword argument, such as id, is treated as a filter on the attribute with that name, so we can retrieve a tag by id with all of these methods.
The find() method lets us find the first occurrence of a given HTML tag and returns None if nothing matches. We can provide tag names, attributes, and text details to it if we want to retrieve a particular HTML tag. Below we have explained the usage of the method with various examples.
out = soup.find("a")
print(type(out))
out
soup.ul.find("a")
soup.find(id="start")
soup.find("a", text="Tutorials")
soup.find("script", attrs={"src": "static/script1.js"})
soup.find("script", attrs={"src": "static/script2.js"})
soup.find("a", attrs={"href": "https://coderzcolumn.com/about"})
The find_all() method returns all tags with the given name. We can specify attributes and text if we want to retrieve only tags that satisfy particular attribute values and text, and the limit parameter caps the number of results returned. Below we have explained with a few examples how we can use the find_all() method.
out = soup.find_all("a")
print(type(out))
for i in out:
    print(type(i))
out
soup.find_all("li")
soup.find_all(id=["start", "end"])
soup.ul.find_all("li", limit=2)
soup.find_all("li", text=["Blogs","Tutorials", "About"])
soup.find_all("li", text=["Blogs","Tutorials", "about"])
soup.find_all("a", attrs={"href":"https://coderzcolumn.com/tutorials"})
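The sketch below gathers the most common find() and find_all() filters in one place. It runs on a small hypothetical menu rather than the tutorial's sample, and it uses the string keyword, which assumes a BeautifulSoup release that supports it (4.4.0 or later); the tutorial's own examples use the older text spelling.

```python
from bs4 import BeautifulSoup

# Hypothetical menu used only for this illustration.
mini_doc = """
<ul>
<li><a href="/blogs">Blogs</a></li>
<li><a href="/tutorials">Tutorials</a></li>
<li><a href="/about" id="about">About</a></li>
</ul>
"""
mini_soup = BeautifulSoup(mini_doc, "html.parser")

first_link = mini_soup.find("a")                                # first match only
by_attrs = mini_soup.find("a", attrs={"href": "/tutorials"})    # filter via attrs dict
by_id = mini_soup.find(id="about")                              # keyword = attribute filter
by_string = mini_soup.find("a", string="About")                 # exact, case-sensitive match
wrong_case = mini_soup.find("a", string="about")                # no match -> None

all_hrefs = [a["href"] for a in mini_soup.find_all("a")]
first_two = mini_soup.find_all("a", limit=2)                    # stop after two results
no_tables = mini_soup.find_all("table")                         # empty result, not None

print(first_link["href"], by_attrs.text, by_id["href"], by_string["href"])
print(wrong_case, all_hrefs, len(first_two), len(no_tables))
```

A practical consequence is that find() results should be checked for None, while find_all() results can always be iterated safely.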
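The find_next() method searches forward through the same sequence of elements that the next_elements property walks and returns the first one that matches the given filters.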
out = soup.ul.find_next()
print(type(out))
out
soup.ul.find_next("a")
soup.ul.find_next("a", text="Blogs")
soup.ul.find_next("a", attrs={"href": "https://coderzcolumn.com/blogs"}, text="Blogs")
The find_previous() method searches backward through the same sequence of elements that the previous_elements property walks and returns the first one that matches the given filters.
out = soup.ul.find_previous()
print(type(out))
out
soup.ul.find_previous("p")
We can retrieve all tags that were parsed after the given tag using find_all_next() method. We can filter tags if we want to retrieve only tags by given name or attributes.
out = soup.ul.find_all_next()
print(type(out))
out
soup.ul.find_all_next("a")
soup.ul.find_all_next(id="end")
soup.ul.find_all_next("a", text=["Blogs", "About"])
soup.ul.find_all_next("a", attrs={"href": "https://coderzcolumn.com/blogs"}, text="Blogs")
The find_all_previous() method returns all tags that were parsed before given HTML tag. We can filter tags by providing tag names or attribute details.
out = soup.ul.find_all_previous()
print(type(out))
out[:5]
soup.ul.find_all_previous("p")
soup.ul.find_all_previous(id="start")
soup.ul.find_all_previous("script")
soup.ul.find_all_previous("p", text="Welcome to CoderzColumn")
soup.ul.find_all_previous("p", text=["Welcome to CoderzColumn",
"Below are list of Important Sections of Our Website : "])
soup.ul.find_all_previous("p", attrs={"id":"start"})
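To contrast the forward and backward search methods with plain element navigation, here is a small sketch on a hypothetical document. Every search passes an explicit tag name; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical document: a list placed between two paragraphs.
mini_doc = (
    "<p id='start'>Intro</p>"
    "<ul>\n<li><a href='/blogs'>Blogs</a></li>\n"
    "<li><a href='/about'>About</a></li>\n</ul>"
    "<p id='end'>Bye</p>"
)
mini_soup = BeautifulSoup(mini_doc, "html.parser")
menu = mini_soup.ul

raw_next = menu.next_element                  # the newline string right after <ul>
next_li = menu.find_next("li")                # first <li> parsed after <ul>
next_p = menu.find_next("p")                  # skips the list and lands on the last <p>
links_after = [a["href"] for a in menu.find_all_next("a")]
prev_p = menu.find_previous("p")              # nearest <p> parsed before <ul>
prev_ps = [p["id"] for p in menu.find_all_previous("p")]

print(repr(str(raw_next)), next_li.a["href"], next_p["id"])
print(links_after, prev_p["id"], prev_ps)
```

Note that the forward searches also look inside the starting tag, because its descendants are parsed after it.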
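We can retrieve the parent of the given HTML tag using the find_parent() method, which works like the parent property we explained earlier. When a tag name or attribute filter is given, it walks up the tree and returns the nearest parent that matches.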
soup.li.find_parent()
soup.a.find_parent(name="li")
The find_parents() method returns a list of all parents of the given HTML tag. We can filter parents by specifying the name and attribute details in the method.
soup.a.find_parents(name="li")
soup.a.find_parents(name=["li","ul"])
all_parents = soup.a.find_parents()
print("Total Parents : {}".format(len(all_parents)))
all_parents[:3]
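The following sketch shows how name filters change what find_parent() and find_parents() return. The document and variable names are hypothetical and serve only as an illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a link nested inside a list.
mini_soup = BeautifulSoup(
    "<html><body><ul id='menu'><li><a href='/blogs'>Blogs</a></li></ul></body></html>",
    "html.parser",
)
link = mini_soup.a

nearest_li = link.find_parent("li")            # the immediate parent matches
nearest_ul = link.find_parent("ul")            # walks past <li> to reach <ul>
no_table = link.find_parent("table")           # nothing matches -> None
selected = [t.name for t in link.find_parents(["li", "ul"])]  # nearest first

print(nearest_li.name, nearest_ul["id"], no_table, selected)
```

This is handy when we start from a matched link and need the enclosing list or section that contains it.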
We can use the find_next_sibling() method to retrieve a sibling that was parsed after the given HTML tag. It works like the next_sibling property of the Tag object but lets us filter by tag name and attributes.
soup.li.find_next_sibling()
soup.p.find_next_sibling()
The find_previous_sibling() method returns a sibling that was parsed before the given HTML tag. It works like the previous_sibling property of the Tag object but lets us filter by tag name and attributes.
soup.find(id="end").find_previous_sibling()
soup.ul.find_previous_sibling()
The find_next_siblings() method retrieves all siblings of given HTML tag that were parsed after it. We can provide tag name and attribute details if we want to filter siblings based on those details.
soup.p.find_next_siblings()
soup.p.find_next_siblings(name="p")
soup.p.find_next_siblings(id="end")
soup.p.find_next_siblings("ul")
The find_previous_siblings() method retrieves all siblings of a given HTML tag that were parsed before it. We can provide tag name and attribute details if we want to filter siblings based on those details.
soup.find(id="end")
soup.find(id="end").find_previous_siblings(name="p")
soup.find(id="end").find_previous_siblings(id="start")
soup.find(id="end").find_previous_siblings("ul")
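As a final illustration, the sketch below compares the plain sibling properties with the filtered sibling searches on a small made-up document. All names are assumptions for this example, and each search passes an explicit tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical siblings separated by newlines.
mini_soup = BeautifulSoup(
    "<div><p id='start'>A</p>\n<ul><li>x</li></ul>\n"
    "<p id='mid'>B</p>\n<p id='end'>C</p></div>",
    "html.parser",
)
start = mini_soup.find(id="start")
end = mini_soup.find(id="end")

plain_next = start.next_sibling                          # the newline string
next_list = start.find_next_sibling("ul")                # first <ul> sibling after start
later_paras = [p["id"] for p in start.find_next_siblings("p")]
nearest_para = end.find_previous_sibling("p")            # closest <p> before end
earlier_paras = [p["id"] for p in end.find_previous_siblings("p")]
nothing_before = start.find_previous_sibling("p")        # no earlier sibling -> None

print(repr(str(plain_next)), next_list.name, later_paras)
print(nearest_para["id"], earlier_paras, nothing_before)
```

The filtered methods spare us from manually skipping whitespace strings and unrelated tags between the siblings we care about.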
This ends our small tutorial explaining how we can parse an HTML document and retrieve information about various HTML tags using the beautifulsoup library. Please feel free to let us know your views in the comments section.
If you want to