Beautiful Soup is a Python library for pulling data out of HTML and XML files, and it is commonly used for web scraping. It builds a parse tree from the page source, which lets you extract data by walking the document's hierarchy instead of searching through raw text.
Before we can start web scraping, we need to install Beautiful Soup. You can do this with pip, the package installer for Python, by running the following command in your terminal:
pip install beautifulsoup4
Beautiful Soup only parses documents; it does not download them. To fetch pages, you will also need the requests module, which lets you send HTTP requests from Python. You can install it using the following command:
pip install requests
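To confirm that both installs worked, a quick check along these lines can be run in a Python session. It is only a sanity check, and it assumes both packages were installed into the interpreter you are running:

```python
# Quick sanity check that both packages can be imported.
import bs4
import requests

# Both packages expose their version as a string attribute.
print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)
```

If either import raises ModuleNotFoundError, pip most likely installed the package for a different Python interpreter.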
Once you have Beautiful Soup installed, you can start using it to parse HTML. Here is a basic example:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
In this example, we first import the BeautifulSoup class from the bs4 module. Then we create an instance of this class and pass in the HTML document that we want to parse. The second argument, 'html.parser', names the parser that Beautiful Soup should use to read the HTML; html.parser is the one included in Python's standard library, so it needs no extra install.
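The html_doc above is a literal string, but in practice the HTML usually comes from a live page fetched with the requests module. The sketch below shows one way to combine the two. Note that fetch_soup and its timeout default are hypothetical names chosen for this example, not part of either library:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object.

    fetch_soup is a hypothetical helper written for this example.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses rather than parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("https://example.com") (a placeholder URL) would return an object that can be used in the same way as the soup variable created from html_doc.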
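Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree. The following calls run against the soup object created above, with the output of each shown in the trailing comment: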
print(soup.title)               # <title>The Dormouse's story</title>
print(soup.title.name)          # title
print(soup.title.string)        # The Dormouse's story
print(soup.title.parent.name)   # head
print(soup.p)                   # <p class="title"><b>The Dormouse's story</b></p>
print(soup.p['class'])          # ['title']
print(soup.a)                   # None
print(soup.find_all('a'))       # []
print(soup.find(id="link3"))    # None
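The calls above only read from the tree. Modifying it works through the same tag objects, as in the following minimal sketch. The new text, the id value, and the choice to rename the tag are made up for illustration:

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# Replace the text inside the <b> tag.
soup.b.string = "A new title"

# Set an attribute the same way you would set a dictionary key.
soup.p["id"] = "intro"

# Rename the tag itself: <b> becomes <strong>.
soup.b.name = "strong"

# Print the modified paragraph to see the result.
print(soup.p)
```

The changes apply to the in-memory parse tree only; the original html_doc string is left untouched.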
Once you have navigated to the part of the parse tree that you are interested in, you can extract the data. Here is an example:
for link in soup.find_all('a'):
    print(link.get('href'))
This prints the href value of every <a> tag in the HTML document. The sample document above contains no <a> tags, so with it the loop prints nothing.
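Because the sample document has no links, here is a sketch of the same loop against a small document that does. The page content and base_url are hypothetical. The sketch also uses urljoin from the standard library to turn relative href values into absolute URLs, which is a common next step but not something Beautiful Soup does itself:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page and base URL, made up for this example.
base_url = "https://example.com/stories/"
html_doc = """
<p>
  <a href="/about">About</a>
  <a href="chapter1.html">Chapter 1</a>
  <a name="top">No href here</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for url in links:
    print(url)
```

Filtering with href=True avoids the None values that link.get('href') would return for anchors without an href.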
In conclusion, Beautiful Soup is a powerful library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser and provides Python-friendly ways of navigating, searching, and modifying the data in these files.