
Overview

Beautiful Soup is a Python library that makes it easy to parse HTML and XML files. It helps with searching, navigating, and modifying the parse tree of a document.

It is useful for scraping data from websites that do not provide REST APIs for the information users need.

Installation

  • pip install beautifulsoup4
  • easy_install beautifulsoup4
  • apt-get install python-bs4 (for Python 2)
  • apt-get install python3-bs4 (for Python 3)

We'll use the urllib library to request URLs and fetch their data, and then Beautiful Soup to parse the returned HTML as needed.

Guide on solving errors after installation:

Other python libraries as parser:
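As a rough illustration of choosing a parser, the sketch below asks for the third-party lxml parser and falls back to the built-in 'html.parser' when lxml is not installed. The helper name make_soup and the sample markup are assumptions made for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup for the demo.
HTML = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, else fall back to 'html.parser'."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(HTML)
print(soup.title.string)
```

For well-formed markup like this, both parsers should produce the same tree; they can differ in how they repair invalid HTML.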

In [1]:
from bs4 import BeautifulSoup
import urllib.request
In [2]:
#res = urllib.request.urlopen('https://www.quora.com')
res = urllib.request.urlopen('https://www.python.org')
soup = BeautifulSoup(res.read(), 'html.parser')
print(soup.prettify()[:200]) ## prettify() returns the parsed document as an indented string.
<!DOCTYPE doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>
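The fetch-then-parse step above can be wrapped in a small helper. This is a minimal sketch: the function name fetch_soup, the timeout value, and the demo markup are assumptions, and a data: URL stands in for a real address so the example runs without network access.

```python
import urllib.request
from urllib.parse import quote

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a URL and return it parsed with the built-in 'html.parser'."""
    # The context manager closes the response once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as res:
        return BeautifulSoup(res.read(), "html.parser")


# A data: URL keeps the demo offline; swap in a real address such as
# 'https://www.python.org' to fetch a live page.
demo_url = "data:text/html," + quote("<html><head><title>Demo page</title></head></html>")
soup = fetch_soup(demo_url)
print(soup.title.string)
```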

Analyze attributes & methods of parsed soup object

Let’s look at a few attributes and methods of the soup object created above. HTML tags can be accessed directly as attributes of the parsed soup object, and the attributes of an HTML tag are available through the tag's dictionary-like interface. See the examples below.

The BeautifulSoup object represents the whole document. It supports most of the same methods as a tag. Its .name attribute has the special value '[document]'.

The string data of each tag is stored as a NavigableString object. It can be converted to a plain Python string by calling str() on it (unicode() in Python 2).

In [3]:
print("Title tag : "+str(soup.title))
print("Tag Name : "+soup.title.name)
print("Script tag : "+str(soup.script)) ## Returns only the first script tag; the page may contain more than one.
print("Meta tag : "+str(soup.meta)) ## Returns only the first meta tag; the page may contain more than one.
print("Name of parent of first meta tag : "+soup.meta.parent.name)
print("All attributes of first meta tag as dictionary : "+str(soup.meta.attrs))
print("Value of src attribute of script tag : "+soup.script['src'])
print("String within strong tag : "+soup.div.strong.string)
print("Soup Object type : "+str(type(soup)))
print("Title tag type : "+str(type(soup.title)))
print("Type of strings of tag : "+str(type(soup.div.strong.string)))
print('BeautifulSoup name : '+soup.name)
Title tag : <title>Welcome to Python.org</title>
Tag Name : title
Script tag : <script src="/static/js/libs/modernizr.js"></script>
Meta tag : <meta charset="utf-8"/>
Name of parent of first meta tag : head
All attributes of first meta tag as dictionary : {'charset': 'utf-8'}
Value of src attribute of script tag : /static/js/libs/modernizr.js
String within strong tag : Notice:
Soup Object type : <class 'bs4.BeautifulSoup'>
Title tag type : <class 'bs4.element.Tag'>
Type of strings of tag : <class 'bs4.element.NavigableString'>
BeautifulSoup name : [document]
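The types described above can also be checked on a small inline document, which keeps the result independent of the live page. A minimal sketch, with markup invented for the demo:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup for the demo.
soup = BeautifulSoup("<div><strong>Notice:</strong> rest</div>", "html.parser")

text = soup.strong.string   # a NavigableString, still attached to the tree
plain = str(text)           # a plain Python string with no link to the tree

print(soup.name)                          # [document]
print(isinstance(soup, Tag))              # BeautifulSoup is a subclass of Tag
print(isinstance(text, NavigableString))
print(type(plain) is str)
```

Converting to a plain string is handy when the text needs to outlive the soup object, since a NavigableString keeps a reference to the tree it came from.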

Now let’s modify a few tags and their attributes in soup object

  • The del statement removes attributes of a tag. To remove a tag itself from the parsed document, use a method such as decompose() or extract().
  • The string of a tag cannot be edited in place, but it can be replaced as a whole with replace_with().
In [4]:
print('First div tag id before modification : '+ soup.div['id'])
soup.div.name = 'div_main' ## Rename the first div tag; soup.div now resolves to the next div in the document.
print('First div tag id after modification : '+ soup.div['id'])
print('div_main id before modification : '+ soup.div_main['id'])
soup.div_main['id'] = 'testId'
print('div_main id after modification : '+ soup.div_main['id'])
print('Deleting class attribute of div tag inside of div_main tag')
del soup.div_main.div['class']
print('Attributes of div tag inside of div_main tag after deletion of class attribute : '+ str(soup.div_main.div.attrs))
print('Value of strong tag before replacement : '+soup.div.strong.string)
soup.div.strong.string.replace_with('Modified Notice:')
print('Value of strong tag after replacement : '+soup.div.strong.string)
First div tag id before modification : touchnav-wrapper
First div tag id after modification : nojs
div_main id before modification : touchnav-wrapper
div_main id after modification : testId
Deleting class attribute of div tag inside of div_main tag
Attributes of div tag inside of div_main tag after deletion of class attribute : {'id': 'nojs'}
Value of strong tag before replacement : Notice:
Value of strong tag after replacement : Modified Notice:
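The same modifications can be tried on a small inline document. The sketch below uses invented markup and also shows that assigning to .string is another way to replace a tag's text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the page used above.
HTML = ('<div id="nojs" class="banner wide">'
        '<p><strong>Notice:</strong> Please enable Javascript.</p></div>')
soup = BeautifulSoup(HTML, "html.parser")

div = soup.div
div["id"] = "testId"   # overwrite an attribute value
del div["class"]       # delete an attribute

old = soup.strong.string               # NavigableString holding "Notice:"
old.replace_with("Modified Notice:")   # swap the whole string for a new one
# 'old' is now detached from the tree; the tag holds the new string.

soup.strong.string = "Edited again:"   # assigning to .string also replaces the text

print(div.attrs)
print(soup.strong.string)
```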
  • find() returns the first tag that matches a search query.
  • find_all() returns a list of all tags that match a search query.
  • find_parent() returns the closest parent of a tag that matches the given criteria (the direct parent when called without arguments).
  • find_next() returns the next tag in the document after the current one that matches the given criteria.
  • Multi-valued attributes such as class are returned as a list only for HTML documents. For XML documents, they are returned as a single string.
  • get() retrieves the value of a tag's attribute, not a tag.
In [5]:
print('First div tag : '+str(soup.find('div'))+'\n')
print('All strong tags : '+str(soup.find_all('strong'))+'\n')
print('Parent of first div tag : '+str(soup.div.find_parent().name)+'\n')
print('Find next tag after div tag : '+str(soup.div.find_next())+'\n')
print('Find next "strong" tag after div tag : '+str(soup.div.find_next('strong'))+'\n')
print('Getting attribute which has more than one value : '+ str(soup.div.find_next('div')['class'])+'\n')
print('Getting attribute which has more than one value : '+ str(soup.div.find_next('nav')['class'])+'\n')
print('Getting attribute list of tag : '+ str(soup.nav.get_attribute_list('class'))+'\n')
## The 'xml' parser requires the lxml package to be installed.
print('Multivalued attribute of XML soup : '+str(BeautifulSoup('<test class="dark light green">Test Data</test>','xml').test['class']))
print('Getting attributes using get() method : '+ soup.div_main.get('id'))
First div tag : <div id="nojs">
<p><strong>Modified Notice:</strong> While Javascript is not essential for this website, your interaction with the content will be limited. Please turn Javascript on for the full experience. </p>
</div>

All strong tags : [<strong>Modified Notice:</strong>, <strong><small>A</small> A</strong>, <strong>relaunched community-run job board</strong>]

Parent of first div tag : div_main

Find next tag after div tag : <p><strong>Modified Notice:</strong> While Javascript is not essential for this website, your interaction with the content will be limited. Please turn Javascript on for the full experience. </p>

Find next "strong" tag after div tag : <strong>Modified Notice:</strong>

Getting attribute which has more than one value : ['top-bar', 'do-not-print']

Getting attribute which has more than one value : ['meta-navigation', 'container']

Getting attribute list of tag : ['meta-navigation', 'container']

Multivalued attribute of XML soup : dark light green
Getting attributes using get() method : testId
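Because the live page can change at any time, the search methods above are easier to verify on a fixed snippet. A minimal sketch with invented markup, which also filters by class and reads a missing attribute with get():

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the demo.
HTML = """
<div id="main" class="top-bar do-not-print">
  <p><strong>Notice:</strong> first paragraph</p>
  <p class="note"><strong>Second</strong> paragraph</p>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

first_p = soup.find("p")                       # first matching tag
all_strong = soup.find_all("strong")           # list of every matching tag
note = soup.find("p", class_="note")           # filter on the class attribute
parent = soup.strong.find_parent("div")        # closest enclosing div
next_strong = soup.div.find_next("strong")     # next strong tag after the div opens
classes = soup.div["class"]                    # multi-valued attribute: a list in HTML
missing = soup.div.get("data-x")               # get() returns None for an absent attribute

print(len(all_strong), parent["id"], classes, missing)
```

Unlike indexing with square brackets, get() does not raise KeyError when the attribute is absent, which makes it the safer choice for optional attributes.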

Sunny Solanki