Beautiful Soup is a Python library that makes it easy to parse HTML and XML files. It helps with searching, navigating, and modifying the parse tree of a document.
It is particularly useful for scraping websites when a site does not provide a REST API for the information users need.
We'll use the urllib library to request URLs and fetch their contents, and then Beautiful Soup to parse the returned HTML as needed.
from bs4 import BeautifulSoup
import urllib.request  # "import urllib" alone does not reliably load the request submodule
#res = urllib.request.urlopen('https://www.quora.com')
res = urllib.request.urlopen('https://www.python.org')
soup = BeautifulSoup(res.read(), 'html.parser')
print(soup.prettify()[:200])  # prettify() returns the parsed document as an indented string; show the first 200 characters
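For experimenting without a network connection, the same constructor also accepts a plain string of markup. The sketch below assumes a small hypothetical snippet (`sample_html` is made up for illustration and is not taken from python.org) and uses the built-in `html.parser`. It uses its own variable names so that it does not overwrite the `soup` object used in the rest of this walkthrough.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="main" class="box wide"><strong>Notice:</strong> hello</div>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

print(sample_soup.title.string)   # text inside the <title> tag
print(sample_soup.div['id'])      # value of the id attribute
print(sample_soup.div['class'])   # class is multi-valued in HTML, so a list is returned
```

Working against a fixed string like this makes the results predictable, which is handy when a live page changes its layout.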
Let's analyze a few attributes and methods of the soup object created above. HTML tags can be accessed directly as attributes of the soup object (for example, soup.title), and a tag's HTML attributes can be read with dictionary-style indexing (for example, soup.script['src']) or all at once through its .attrs dictionary. See the examples below.
The BeautifulSoup object represents the whole document. It supports the same methods as any tag, and its .name attribute has the special value '[document]'.
The string data of each tag is stored as an instance of the NavigableString class. It can be converted to a plain Python string by calling str() on it.
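As a minimal sketch of the two points above, the snippet below parses a tiny hypothetical document (`doc_soup` and its markup are assumptions made for illustration) and shows the special document name and the conversion of a NavigableString to a plain string.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical one-line document used only for illustration.
doc_soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

# The BeautifulSoup object itself carries the special name.
print(doc_soup.name)

# The text inside <b> is stored as a NavigableString.
text = doc_soup.b.string
print(isinstance(text, NavigableString))

# str() produces an ordinary Python string that no longer
# holds a reference to the parse tree.
plain = str(text)
print(type(plain), plain)
```

Converting to a plain string is mainly useful when the text needs to outlive the parse tree, since a NavigableString keeps a reference to the whole document.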
print("Title tag : "+str(soup.title))
print("Tag Name : "+soup.title.name)
print("Script tag : "+str(soup.script))  # Returns only the first script tag; the page may contain more
print("Meta tag : "+str(soup.meta))  # Returns only the first meta tag; the page may contain more
print("Name of parent of first meta tag : "+soup.meta.parent.name)
print("All attributes of first meta tag as dictionary : "+str(soup.meta.attrs))
print("Value of src attribute of script tag : "+soup.script['src'])
print("String within strong tag : "+soup.div.strong.string)
print("Soup Object type : "+str(type(soup)))
print("Title tag type : "+str(type(soup.title)))
print("Type of strings of tag : "+str(type(soup.div.strong.string)))
print('BeautifulSoup name : '+soup.name)
print('First div tag id before modification : '+ soup.div['id'])
soup.div.name = 'div_main'  # Rename the first div tag; soup.div now refers to the next div in the document
print('Id of the tag soup.div refers to after the rename : '+ soup.div['id'])
print('div_main id before modification : '+ soup.div_main['id'])
soup.div_main['id'] = 'testId'
print('div_main id after modification : '+ soup.div_main['id'])
print('Deleting class attribute of div tag inside of div_main tag')
del soup.div_main.div['class']
print('Attributes of div tag inside of div_main tag after deletion of class attribute : '+ str(soup.div_main.div.attrs))
print('String of strong tag before replacement : '+soup.div.strong.string)
soup.div.strong.string.replace_with('Modified Notice:')
print('String of strong tag after replacement : '+soup.div.strong.string)
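Because python.org's markup changes over time, the same modifications can be tried on a fixed snippet. The following is a sketch on hypothetical markup (`edit_soup`, the tag name `section`, and the attribute values are assumptions for illustration) covering renaming a tag, changing an attribute, deleting an attribute, and replacing a string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
edit_soup = BeautifulSoup(
    '<div id="outer" class="box"><strong>Notice:</strong> text</div>',
    'html.parser',
)
tag = edit_soup.div

tag.name = 'section'       # rename the tag
tag['id'] = 'testId'       # overwrite an attribute value
del tag['class']           # remove an attribute entirely
tag.strong.string.replace_with('Modified Notice:')  # swap the text node

print(edit_soup)
```

After the rename, `edit_soup.div` no longer finds anything, since the only div is now a section; the tag has to be reached as `edit_soup.section` instead.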
find() returns the first tag that matches a search query.
find_all() returns a list of all tags that match a search query.
find_parent() returns the parent of a tag.
find_next() returns the next tag after a given tag that matches the criteria provided.
get() returns the value of a tag's attribute, not the tag itself.
print('First div tag : '+str(soup.find('div'))+'\n')
print('All strong tag : '+str(soup.find_all('strong'))+'\n')
print('Parent of first div tag : '+str(soup.div.find_parent().name)+'\n')
print('Find next tag after div tag : '+str(soup.div.find_next())+'\n')
print('Find next "strong" tag after div tag : '+str(soup.div.find_next('strong'))+'\n')
print('Getting an attribute that has more than one value : '+ str(soup.div.find_next('div')['class'])+'\n')
print('Getting an attribute that has more than one value : '+ str(soup.div.find_next('nav')['class'])+'\n')
print('Getting attribute list of tag : '+ str(soup.nav.get_attribute_list('class'))+'\n')
# The 'xml' parser requires the lxml package; with it, class is kept as a single string rather than split into a list
print('Multivalued attribute of XML soup : '+str(BeautifulSoup('<test class="dark light green">Test Data</test>','xml').test['class']))
print('Getting attributes using get() method : '+ soup.div_main.get('id'))
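The search methods can also be checked against a small, fixed document where the expected results are easy to see. This is a sketch on hypothetical markup (`search_html` and all ids, classes, and links in it are made up for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
search_html = """
<body>
  <div id="first"><strong>One</strong></div>
  <nav class="menu top"><a href="/home">Home</a></nav>
  <div id="second"><strong>Two</strong></div>
</body>
"""
search_soup = BeautifulSoup(search_html, 'html.parser')

first_div = search_soup.find('div')            # first matching tag
all_strong = search_soup.find_all('strong')    # list of every matching tag
parent = first_div.find_parent()               # immediate parent tag
next_tag = first_div.find_next()               # next tag in document order
next_nav = first_div.find_next('nav')          # next tag that is a <nav>

print(first_div['id'])
print([s.string for s in all_strong])
print(parent.name)
print(next_tag.name)
print(next_nav['class'])            # multi-valued attribute, returned as a list
print(next_nav.get('id'))           # get() returns None for a missing attribute
print(next_nav.a.get('href'))
```

Note that find_next() walks the document in parse order, so the tag it returns for `first_div` is the strong tag nested inside that div rather than the following sibling. Using get() instead of square brackets avoids a KeyError when an attribute may be absent.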