
    Python

    • Refreshing Python Basics
      • 1.1 Python Data Structures
      • 1.2 Syntax and Semantics
      • 1.3 Conditionals and Loops
    • Introduction to Object-Oriented Programming
      • 2.1 Understanding Classes and Objects
      • 2.2 Design Patterns
      • 2.3 Inheritance, Encapsulation, and Polymorphism
    • Python Libraries
      • 3.1 NumPy and Matplotlib
      • 3.2 Pandas and Seaborn
      • 3.3 SciPy
    • Handling Files and Exceptions
      • 4.1 Reading, Writing, and Manipulating Files
      • 4.2 Introduction to Exceptions
      • 4.3 Handling and Raising Exceptions
    • Regular Expressions
      • 5.1 Introduction to Regular Expressions
      • 5.2 Python’s re module
      • 5.3 Pattern Matching, Substitution, and Parsing
    • Databases and SQL
      • 6.1 Introduction to Databases
      • 6.2 Python and SQLite
      • 6.3 Presentation of Data
    • Web Scraping with Python
      • 7.1 Basics of HTML
      • 7.2 Introduction to Beautiful Soup
      • 7.3 Web Scraping Case Study
    • Python for Data Analysis
      • 8.1 Data Cleaning, Transformation, and Analysis Using Pandas
      • 8.2 Data Visualization Using Matplotlib and Seaborn
      • 8.3 Real-world Data Analysis Scenarios
    • Python for Machine Learning
      • 9.1 Introduction to Machine Learning with Python
      • 9.2 Scikit-learn Basics
      • 9.3 Supervised and Unsupervised Learning
    • Python for Deep Learning
      • 10.1 Introduction to Neural Networks and TensorFlow
      • 10.2 Deep Learning with Python
      • 10.3 Real-world Deep Learning Applications
    • Advanced Python Concepts
      • 11.1 Generators and Iterators
      • 11.2 Decorators and Closures
      • 11.3 Multithreading and Multiprocessing
    • Advanced Python Concepts
      • 12.1 Generators and Iterators
      • 12.2 Decorators and Closures
      • 12.3 Multithreading and Multiprocessing
    • Python Project
      • 13.1 Project Kick-off
      • 13.2 Mentor Session
      • 13.3 Project Presentation

    Web Scraping with Python

    Introduction to Beautiful Soup


    Beautiful Soup is a Python library used for web scraping: it pulls data out of HTML and XML files. It builds a parse tree from the page source, which you can then use to extract data in a hierarchical, more readable way.

    Installing and Setting Up Beautiful Soup

    Before we can start web scraping, we need to install Beautiful Soup. You can do this with pip, the package installer for Python. Simply type the following command in your terminal:

    pip install beautifulsoup4

    You will also need to install the requests module, which allows you to send HTTP requests using Python. You can install it using the following command:

    pip install requests

    Parsing HTML with Beautiful Soup

    Once you have Beautiful Soup installed, you can start using it to parse HTML. Here is a basic example:

    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    """

    soup = BeautifulSoup(html_doc, 'html.parser')

    In this example, we first import the BeautifulSoup class from the bs4 module. We then create an instance of that class, passing in the HTML document we want to parse. The second argument, 'html.parser', tells Beautiful Soup which parser library to use to parse the HTML.
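
    To inspect the parse tree Beautiful Soup has built, you can print it with the prettify() method. The short sketch below reuses the soup object created above; the output shown in the comments is approximate and abbreviated.

    # Print the parse tree with indentation to inspect its structure
    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  ...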

    Searching and Navigating the Parse Tree

    Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree.

    print(soup.title)              # <title>The Dormouse's story</title>
    print(soup.title.name)         # title
    print(soup.title.string)       # The Dormouse's story
    print(soup.title.parent.name)  # head
    print(soup.p)                  # <p class="title"><b>The Dormouse's story</b></p>
    print(soup.p['class'])         # ['title']
    print(soup.a)                  # None
    print(soup.find_all('a'))      # []
    print(soup.find(id="link3"))   # None
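
    find_all() also accepts filters such as attribute values, and select() takes CSS selectors. The brief sketch below runs against the same soup object as above; the expected results for this small document are shown in the comments.

    # Search by tag name and class attribute (class_ avoids clashing with the Python keyword)
    print(soup.find_all('p', class_='title'))
    # [<p class="title"><b>The Dormouse's story</b></p>]

    # The same query expressed as a CSS selector
    print(soup.select('p.title'))
    # [<p class="title"><b>The Dormouse's story</b></p>]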

    Extracting Information from a Website

    Once you have navigated to the part of the parse tree that you are interested in, you can extract the data. Here is an example:

    for link in soup.find_all('a'):
        print(link.get('href'))

    This will print out all the URLs found within <a> tags in the HTML document.
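
    In a real scraping task you first download the page with the requests module installed earlier, then hand the response text to Beautiful Soup. The sketch below uses https://example.com purely as a placeholder URL; swap in the page you actually want to scrape.

    import requests
    from bs4 import BeautifulSoup

    # Download the page; example.com is only a placeholder URL
    response = requests.get('https://example.com')
    response.raise_for_status()  # stop early if the request failed

    # Parse the downloaded HTML and print every link's href attribute
    soup = BeautifulSoup(response.text, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))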

    In conclusion, Beautiful Soup is a powerful library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser and provides Pythonic ways of accessing the data in those documents.

    Next up: Web Scraping Case Study