Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
Before we can start web scraping, we need to install Beautiful Soup. You can do this with pip, the package installer for Python. Simply run the following command in your terminal:
pip install beautifulsoup4
You will also need to install the requests module, which allows you to send HTTP requests using Python. You can install it using the following command:
pip install requests
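To confirm both packages installed correctly, you can import them and print their versions. This is just a quick sanity check; the exact version numbers on your machine will differ:

```python
# Quick sanity check that both packages are importable.
import bs4
import requests

print(bs4.__version__)       # e.g. 4.12.3 -- your version may differ
print(requests.__version__)  # e.g. 2.31.0 -- your version may differ
```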
Once you have Beautiful Soup installed, you can start using it to parse HTML. Here is a basic example:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
In this example, we first import the BeautifulSoup class from the bs4 module. Then we create an instance of this class and pass in the HTML document that we want to parse. The second argument 'html.parser' is the parser library that Beautiful Soup uses to parse the HTML.
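If you want to see the tree that Beautiful Soup built, the prettify() method returns the parsed document as a string with one tag per line, indented to show the nesting. A minimal sketch:

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# prettify() returns the document with each tag on its own line,
# indented according to its depth in the parse tree.
print(soup.prettify())
```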
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree.
print(soup.title)              # <title>The Dormouse's story</title>
print(soup.title.name)         # title
print(soup.title.string)       # The Dormouse's story
print(soup.title.parent.name)  # head
print(soup.p)                  # <p class="title"><b>The Dormouse's story</b></p>
print(soup.p['class'])         # ['title']
print(soup.a)                  # None
print(soup.find_all('a'))      # []
print(soup.find(id="link3"))   # None
Once you have navigated to the part of the parse tree that you are interested in, you can extract the data. Here is an example:
for link in soup.find_all('a'):
    print(link.get('href'))
This will print out all the URLs found within <a> tags in the HTML document. (In the small example document above there are no <a> tags, so nothing would be printed; on a real page you would see one URL per link.)
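Putting this together with the requests module installed earlier, a typical scraping script fetches a page over HTTP and hands the response body to Beautiful Soup. The URL below is a placeholder; substitute the page you actually want to scrape:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with the page you want to scrape.
url = "https://example.com"

response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)

soup = BeautifulSoup(response.text, 'html.parser')

# Print every href found in an <a> tag on the page.
for link in soup.find_all('a'):
    print(link.get('href'))
```

Setting a timeout and calling raise_for_status() are small habits that keep a scraper from hanging on a slow server or silently parsing an error page.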
In conclusion, Beautiful Soup is a powerful library that makes it easy to scrape information from web pages. It sits on an HTML or XML parser, providing Python-friendly ways of accessing data in these files.