BeautifulSoup is a widely used Python library for parsing HTML/XML documents and retrieving information from them. Its simple API helps developers complete tasks faster. Apart from parsing and searching documents, the API of BeautifulSoup also provides methods which can be used to modify the HTML document itself. We can modify the text of tags, add new tags, change existing tag names, add attributes to tags, remove tags, etc. These tasks change both the contents and the structure of the HTML/XML document, and BeautifulSoup keeps the parsed tree consistent as we make them. We have already covered how to use BeautifulSoup to parse HTML documents in a separate tutorial which covers the majority of its API. Please feel free to check that tutorial as well.
As a part of this tutorial, we'll be primarily concentrating on how to use the API of BeautifulSoup to modify the parsed HTML document. Below we have listed important sections of the tutorial to give an overview of the material covered.
import bs4
print("BeautifulSoup Version : {}".format(bs4.__version__))
In this section, we have created a simple HTML document with a few tags. This will make things easier to understand when we add new tags, remove tags, modify attributes, etc.
We'll create a BeautifulSoup object by passing this HTML document as a string to the BeautifulSoup() constructor. The second argument to the constructor is a string naming the backend parser that it'll use to parse the HTML document. The resulting BeautifulSoup object holds the parsed HTML and provides the various methods that we'll use to modify it.
If you want to know in detail about BeautifulSoup object then please feel free to check our below tutorial.
sample_html= '''<html>
<head>
<title>CoderzColumn : Developed for Developers by Developers for the betterment of Development.</title>
<script src="static/script1.js"></script>
<script src="static/script2.js"></script>
<link rel="stylesheet" href="static/stylesheet.css" type="text/css" />
</head>
<body>
<p id='start'>Welcome to CoderzColumn</p>
<p id='main_para'>We regularly publish tutorials on various topics
(Python, Machine learning, Data Visualization, Digital Marketing, etc.) regularly explaining
how to use various Python libraries.</p>
<p id='sub_para'>Below are list of Important Sections of Our Website : </p>
<ul>
<li><a href='https://coderzcolumn.com/blogs'>Blogs</a></li>
<li><a href='https://coderzcolumn.com/tutorials'>Tutorials</a></li>
<li><a href='https://coderzcolumn.com/about'>About</a></li>
<li><a href='https://coderzcolumn.com/contact-us'>Contact US</a></li>
</ul>
<p id='end'>Please feel free to send us mail @ coderzcolumn07@gmail.com if you need any
information about any article or want us to publish article on particular topic.</p>
</body>
</html>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(sample_html, 'html.parser')
print(soup)
In this section, we'll explain how we can change the name of an HTML tag. Every Tag object in BeautifulSoup has a property named name which holds the name of the HTML tag. We can assign a new value to this property and it'll change the name of the tag in the document.
Below we have first created a copy of our original BeautifulSoup object. We have then explained modification on this new object. We'll be following this for every section where we'll create a copy of the original BeautifulSoup object and explain modifications on the copied object.
We have modified the name of a few HTML tags in copied BeautifulSoup object.
Please make a NOTE that we'll be using various methods available through BeautifulSoup to find tags in it. These methods are explained in detail in our first tutorial on BeautifulSoup hence we have not included their description here.
import copy
soup_new = copy.deepcopy(soup)
main_para = soup_new.find(id="main_para")
print(type(main_para))
main_para
main_para.name = "div"
main_para
soup_new.find(id="main_para")
soup_new.ul
soup_new.a.name = "link"
soup_new.ul
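To make the effect of renaming easier to see, here is a minimal sketch on a one-tag document. The snippet and the variable name demo_soup are made up for illustration and are not part of the sample document above.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (hypothetical, for illustration only).
demo_soup = BeautifulSoup("<p id='main_para'>Hello</p>", "html.parser")

para = demo_soup.find(id="main_para")
para.name = "div"  # rename the tag in place; attributes and text are kept

print(demo_soup)            # <div id="main_para">Hello</div>
print(demo_soup.find("p"))  # None, because no <p> tag is left
```

Note that only the tag name changes. The attributes and the contents of the tag stay as they were.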
In this section, we'll explain how we can modify the text stored between the start and end of a particular HTML tag. The text within the tag is stored as a NavigableString object. There are three ways to modify it: assigning to the '.string' property, calling append() and calling extend().
We'll explain all three ways of modifying text below with simple examples.
We can replace the existing text of HTML Tag by setting a new string value to '.string' property of Tag object. It'll replace any existing string with this new string value.
import copy
soup_new = copy.deepcopy(soup)
first_link = soup_new.a
first_link
first_link.string = "Blogs (143)"
first_link.string
soup_new.a.string
p_start = soup_new.find(id="start")
p_start
p_start.string = "Welcome to CoderzColumn, Have a Great Learning Experience."
p_start
soup_new.find(id="start")
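One thing worth keeping in mind is that assigning to '.string' replaces everything inside the tag, including child tags. The sketch below shows this on a made-up snippet; demo_soup is a hypothetical name used only here.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item list, not the sample document from this tutorial.
demo_soup = BeautifulSoup("<li><a href='/blogs'>Blogs</a></li>", "html.parser")

# Assigning to .string on the <li> discards its child <a> tag as well.
demo_soup.li.string = "Blogs"

print(demo_soup)    # <li>Blogs</li>
print(demo_soup.a)  # None, the link is gone
```

If the child tags need to survive, it is safer to set '.string' on the innermost tag, as the examples above do with the 'a' tag.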
The append() method of the Tag object accepts a string and adds it to the end of the tag's contents, after any existing text. It works like the append() method of a Python list.
import copy
soup_new = copy.deepcopy(soup)
p_start = soup_new.p
p_start
p_start.append(", Have a Great Learning Experience.")
p_start, soup_new.p
The extend() method accepts a list of strings and adds each of them, in order, to the end of the tag's contents. It works like the extend() method of a Python list.
import copy
soup_new = copy.deepcopy(soup)
p_start = soup_new.p
p_start
p_start.extend([", ", "Have a Great", " Learning Experience", "."])
p_start, soup_new.p
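The following sketch looks a little closer at what append() and extend() do to a tag. It assumes a made-up one-paragraph document (demo_soup is a hypothetical name) and shows that each added string becomes a separate child of the tag rather than being merged into the existing string.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration.
demo_soup = BeautifulSoup("<p>Welcome</p>", "html.parser")
p = demo_soup.p

p.append(" to")                  # adds one new string child
p.extend([" Coderz", "Column"])  # adds one child per list item

print(len(p.contents))  # 4 separate string children
print(p.string)         # None, because the tag now has more than one child
print(p.get_text())     # Welcome to CoderzColumn
print(p)                # <p>Welcome to CoderzColumn</p>
```

The rendered HTML looks like a single piece of text, but '.string' returns None once a tag has several children, so get_text() is the more convenient way to read the combined text back.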
We can retrieve the value of any attribute of an HTML tag by treating the Tag object like a Python dictionary. We can use the same approach to add a new attribute to the HTML tag as well.
import copy
soup_new = copy.deepcopy(soup)
p_start = soup_new.p
p_start
p_start["name"] = "Welcome Paragraph"
p_start, soup_new.p
link = soup_new.a
link
link["target"] = "_blank"
link, soup_new.a
We can easily modify the value of any existing attribute of an HTML tag by treating the Tag object as a dictionary-like object. We use the attribute name as the key and assign the new value to it.
import copy
soup_new = copy.deepcopy(soup)
link = soup_new.a
link
link["href"] = "https://coderzcolumn.com/blogs_latest"
link, soup_new.a
p_end = soup_new.find(id="end")
p_end
p_end["name"] = "End Paragraph"
p_end
soup_new.find(id="end")
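As a compact recap of adding and modifying attributes, the sketch below applies both operations to one tag and then inspects the result. The snippet and the name demo_soup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical single-link document, for illustration only.
demo_soup = BeautifulSoup('<a href="/blogs">Blogs</a>', "html.parser")
link = demo_soup.a

link["target"] = "_blank"      # new key: adds an attribute
link["href"] = "/blogs_latest"  # existing key: overwrites the value

print(link.attrs)  # {'href': '/blogs_latest', 'target': '_blank'}
print(link)        # <a href="/blogs_latest" target="_blank">Blogs</a>
```

The attrs property exposes all attributes of the tag as a plain dictionary, which is handy for checking the result of several assignments at once.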
In this section, we'll explain how we can create a new HTML tag and add it to the BeautifulSoup object. The standard way to create a new tag is by using the new_tag() method of the BeautifulSoup object. We need to provide an HTML tag name as a string to this method in order to create a new tag. It'll create a new Tag object and return it. We can then use various methods of the BeautifulSoup object to add this Tag object to the HTML document. We can also provide the attributes of the tag to new_tag() after the tag name, either as keyword arguments or as a dictionary through the attrs parameter.
The methods of BeautifulSoup and Tag objects that we'll use to add a new Tag to the HTML document are append(), insert(), insert_before() and insert_after().
In this section, we have explained how we can use append() method to add new Tag to HTML document. We have created a few 'li' HTML tags and added them to our existing unordered list tag.
import copy
soup_new = copy.deepcopy(soup)
unordered_list = soup_new.ul
unordered_list
new_option = soup_new.new_tag("li")
new_option
unordered_list.append(new_option)
unordered_list
soup_new.ul
new_option = soup_new.new_tag("li")
new_option
#new_link = soup_new.new_tag("a", attrs={'href':"https://coderzcolumn.com/privacy_policy"})
new_link = soup_new.new_tag("a", href="https://coderzcolumn.com/privacy_policy")
new_link.string = "Privacy Policy"
new_link
new_option.append(new_link)
new_option
soup_new.ul.append("\n")
soup_new.ul.append(new_option)
soup_new.ul.append("\n")
soup_new.ul
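The sketch below repeats the create-and-append steps on a smaller list so the final HTML fits on one line. It also uses the attrs dictionary form of new_tag() for an attribute named class, which cannot be passed as a keyword argument because class is a reserved word in Python. The snippet, the class value "extra" and the name demo_soup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical short list, not the sample document from this tutorial.
demo_soup = BeautifulSoup("<ul><li>Blogs</li></ul>", "html.parser")

# 'class' is a Python keyword, so it goes through the attrs dictionary.
item = demo_soup.new_tag("li", attrs={"class": "extra"})
link = demo_soup.new_tag("a", href="/privacy_policy")
link.string = "Privacy Policy"

item.append(link)          # nest the link inside the new list item
demo_soup.ul.append(item)  # add the item as the last child of <ul>

print(demo_soup)
# <ul><li>Blogs</li><li class="extra"><a href="/privacy_policy">Privacy Policy</a></li></ul>
```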
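In this section, we have explained how we can insert a new tag using the insert() method. We need to provide an index as the first argument to insert(), followed by the Tag object (or string) to insert. The index is a position among the children of the tag on which insert() is called (its contents list), not a character offset into its text, and it works like the insert() method of a Python list.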
import copy
soup_new = copy.deepcopy(soup)
link = soup_new.a
link
link.insert(5, " (143)")
link
unordered_list = soup_new.ul
unordered_list
new_option = soup_new.new_tag("li")
new_link = soup_new.new_tag("a", href="https://coderzcolumn.com/privacy_policy")
new_link.string = "Privacy Policy"
new_option.append(new_link)
new_option
unordered_list.insert(0, "\n")
unordered_list.insert(0, new_option)
unordered_list.insert(0, "\n")
unordered_list
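To illustrate that the index counts children, here is a minimal sketch on a list written without whitespace between the items. The snippet and the names demo_soup and item_b are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical list with no newlines, so each <li> is exactly one child.
demo_soup = BeautifulSoup("<ul><li>A</li><li>C</li></ul>", "html.parser")

item_b = demo_soup.new_tag("li")
item_b.string = "B"

# Position 1 means "between the first and second child of <ul>".
demo_soup.ul.insert(1, item_b)

print(demo_soup.ul)  # <ul><li>A</li><li>B</li><li>C</li></ul>
```

In the sample document of this tutorial the 'ul' tag also contains newline strings between the items, and those count as children too. As far as we can tell from the bs4 versions we have used, an index larger than the number of children simply places the new item at the end, which is why link.insert(5, " (143)") above appends to the link text. Please treat that as observed behaviour rather than a documented guarantee.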
The insert_before() and insert_after() methods are similar to insert(), but instead of taking an index they are called on an existing element and place the new tag or string immediately before or after it. Below we have explained with simple examples how we can use them to add tags to an HTML document.
import copy
soup_new = copy.deepcopy(soup)
p_inter1 = soup_new.new_tag("p", id="intermediate_para1")
p_inter1.string = "We have more than 250 Tutorials on Python."
p_inter1
soup_new.ul.insert_before(p_inter1)
soup_new.find(id="intermediate_para1")
p_inter2 = soup_new.new_tag("p", id="intermediate_para2")
p_inter2.string = "We have more than 50 Tutorials on Digital marketing."
p_inter2
soup_new.ul.insert_after(p_inter2)
soup_new.find(id="intermediate_para2")
soup_new.find_all("p")
bold = soup_new.new_tag("b")
bold.string = " (143)"
bold
soup_new.a
soup_new.a.string.insert_after(bold)
soup_new.a
italic = soup_new.new_tag("i")
italic.string = "All "
italic
soup_new.a.contents
soup_new.a.contents[0].insert_before(italic)
soup_new.a
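The sketch below shows both methods side by side on a made-up snippet. The names demo_soup, before and after are hypothetical and used only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: an empty list inside a <div>.
demo_soup = BeautifulSoup("<div><ul></ul></div>", "html.parser")

before = demo_soup.new_tag("p")
before.string = "Before"
after = demo_soup.new_tag("p")
after.string = "After"

# Both methods are called on the reference element, here the <ul> tag.
demo_soup.ul.insert_before(before)
demo_soup.ul.insert_after(after)

print(demo_soup)  # <div><p>Before</p><ul></ul><p>After</p></div>
```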
The Tag and BeautifulSoup objects provide a method named clear() which can be used to remove the text content as well as all sub-tags of a given tag. The method deletes everything inside the HTML tag on which it is called, while the tag itself and its attributes stay in the document.
Below we have explained with a few simple examples how clear() method works.
import copy
soup_new = copy.deepcopy(soup)
soup_new.ul
soup_new.ul.clear()
soup_new.ul
p_end = soup_new.find(id="end")
p_end
p_end.clear()
p_end
soup_new.find(id="end")
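As a quick check of what survives a call to clear(), here is a minimal sketch on a made-up list. demo_soup is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical list with an id attribute, for illustration only.
demo_soup = BeautifulSoup("<ul id='nav'><li>A</li><li>B</li></ul>", "html.parser")

demo_soup.ul.clear()  # drops both <li> children

print(demo_soup)  # <ul id="nav"></ul>
```

The 'ul' tag and its id attribute are still there. Only the contents are gone.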
In this section, we have explained how we can remove a particular HTML tag from the BeautifulSoup object. The Tag object provides a method named extract() which removes that Tag from the BeautifulSoup object holding the whole HTML document and returns it. We can call extract() on any Tag object, and also on a NavigableString, to remove it from the soup object.
Below we have explained a few examples demonstrating how extract() method works. We need to call extract() method on Tag object that we want to remove from the soup object.
import copy
soup_new = copy.deepcopy(soup)
soup_new.ul
soup_new.ul.a.extract()
soup_new.ul
soup_new.p
soup_new.p.string.extract()
soup_new.p
soup_new.li.parent
soup_new.li.parent.extract()
soup_new.body
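Because extract() returns the removed element, we can keep working with it after it has left the document. The sketch below demonstrates this on a made-up snippet; demo_soup and removed are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item list, for illustration only.
demo_soup = BeautifulSoup("<ul><li><a href='/blogs'>Blogs</a></li></ul>", "html.parser")

removed = demo_soup.a.extract()  # detach the link and keep a reference to it

print(removed)         # <a href="/blogs">Blogs</a>
print(removed.parent)  # None, it no longer belongs to any document
print(demo_soup)       # <ul><li></li></ul>
```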
In this section, we have explained how we can replace one HTML tag with another in an HTML document. The Tag object has a method named replace_with() which replaces the Tag it is called on with whatever is given to it. If we provide a string to replace_with(), the original HTML tag is replaced with that string. If we provide another Tag object, the original HTML tag is replaced with that new tag. We need to call replace_with() on the Tag object that we want to replace in the BeautifulSoup object.
Below we have explained with a few examples how we can use replace_with() to replace a particular HTML tag from a document.
import copy
soup_new = copy.deepcopy(soup)
soup_new.ul
first_link = soup_new.find("a")
first_link
first_link.replace_with("Blogs")
soup_new.ul
soup_new.ul.li.string, type(soup_new.ul.li.string)
new_link_tag = soup_new.new_tag("link", href="https://coderzcolumn.com/blogs")
new_link_tag
soup_new.ul.li.string.replace_with(new_link_tag)
soup_new.ul
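The sketch below walks through both kinds of replacement on a made-up snippet: first a tag is replaced with a string, then that string is replaced with a new tag. The names demo_soup, old_link and bold are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item snippet, for illustration only.
demo_soup = BeautifulSoup("<li><a href='/blogs'>Blogs</a></li>", "html.parser")

# 1. Replace the <a> tag with a plain string; the old tag is returned.
old_link = demo_soup.a.replace_with("Blogs")
print(demo_soup)  # <li>Blogs</li>
print(old_link)   # <a href="/blogs">Blogs</a>

# 2. Replace that string with a new <b> tag.
bold = demo_soup.new_tag("b")
bold.string = "Blogs"
demo_soup.li.string.replace_with(bold)
print(demo_soup)  # <li><b>Blogs</b></li>
```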
In this section, we have explained how we can wrap one HTML tag inside of another new HTML tag. The Tag object provides a method named wrap() which accepts another Tag object and wraps the Tag it is called on inside of the provided one. We need to call wrap() on the Tag object that we want to wrap and pass it the Tag object that should enclose it.
Below we have explained with examples how we can wrap one HTML tag inside of another using wrap() method.
import copy
soup_new = copy.deepcopy(soup)
soup_new.ul
bold = soup_new.new_tag("b")
bold
soup_new.ul.li.a.wrap(bold)
soup_new.ul
italic = soup_new.new_tag("i")
italic
soup_new.ul.li.a.string.wrap(italic)
soup_new.ul
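Here is a minimal sketch of wrap() on a made-up snippet, which also shows that the method returns the new wrapper tag. The names demo_soup and wrapper are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item snippet, for illustration only.
demo_soup = BeautifulSoup("<li><a href='/blogs'>Blogs</a></li>", "html.parser")

# Call wrap() on the inner tag and pass the tag that should enclose it.
wrapper = demo_soup.a.wrap(demo_soup.new_tag("b"))

print(wrapper.name)  # b
print(demo_soup)     # <li><b><a href="/blogs">Blogs</a></b></li>
```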
In this section, we have explained how we can replace an HTML tag with its contents in an HTML document. The Tag object provides a method named unwrap() that lets us replace the Tag with its contents inside of the BeautifulSoup object. We can call unwrap() on any Tag object: the tag itself is removed and its children take its place. This method is the opposite of the wrap() method we explained in the previous section.
Below we have explained with a few simple examples how we can use unwrap() method.
import copy
soup_new = copy.deepcopy(soup)
soup_new.ul
soup_new.ul.li.a.unwrap()
soup_new.ul
soup_new.ul.li.unwrap()
soup_new.ul
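The sketch below removes a wrapper tag from a made-up snippet and prints what unwrap() hands back. The names demo_soup and removed are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet where a link is wrapped in a <b> tag.
demo_soup = BeautifulSoup("<li><b><a href='/blogs'>Blogs</a></b></li>", "html.parser")

# Remove the <b> tag but keep its contents in place.
removed = demo_soup.b.unwrap()

print(demo_soup)  # <li><a href="/blogs">Blogs</a></li>
print(removed)    # <b></b>, the now-empty wrapper tag
```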
This ends our small tutorial explaining how we can modify the contents of an HTML document parsed as BeautifulSoup object. Please feel free to let us know your views in the comments section.