Python Web Scraping

Installing Required Libraries:

pip install beautifulsoup4
pip install requests

Basic Web Scraping Example:

Fetching HTML with requests:

import requests

url = "https://example.com"
response = requests.get(url)

html_content = response.text
print(html_content)

Parsing HTML with BeautifulSoup:

from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the page title
title = soup.title.text
print("Title:", title)

# Find all links
links = soup.find_all('a')
for link in links:
    print("Link:", link.get('href'))

Advanced Web Scraping:

Example: Scraping Quotes from a Website:

import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, 'html.parser')

# Extract quotes
quotes = soup.find_all('span', class_='text')

for quote in quotes:
    print("Quote:", quote.text)
    print("------")

Handling Pagination:

Example: Scraping Quotes from Multiple Pages:

import requests
from bs4 import BeautifulSoup

base_url = "http://quotes.toscrape.com/page/{}"
page_number = 1

while True:
    url = base_url.format(page_number)
    response = requests.get(url)

    if response.status_code != 200:
        break

    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

    quotes = soup.find_all('span', class_='text')

    # Stop when a page has no quotes; this site serves an empty
    # 200 page past the last one rather than a 404, so checking
    # the status code alone would loop forever
    if not quotes:
        break

    for quote in quotes:
        print("Quote:", quote.text)
        print("------")

    page_number += 1

Handling Dynamic Content:

For webpages whose content is loaded through JavaScript, requests only retrieves the initial HTML, so a tool such as Selenium may be required to automate browser actions and capture the rendered page.
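As a rough sketch, a Selenium version of the earlier quotes example might look like the following. It assumes the selenium package and a matching browser driver (here, chromedriver) are installed, and uses quotes.toscrape.com/js, the JavaScript-rendered variant of the site used above:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Run Chrome headlessly; assumes chromedriver is available on PATH
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

try:
    # On the /js variant, the quotes are inserted by JavaScript,
    # so requests alone would see an empty page
    driver.get("http://quotes.toscrape.com/js/")

    # Selenium exposes the DOM *after* the scripts have run
    for quote in driver.find_elements(By.CSS_SELECTOR, "span.text"):
        print("Quote:", quote.text)
finally:
    driver.quit()
```

The try/finally ensures the browser process is closed even if the page fails to load.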

Respect Website Policies:

Whenever you scrape a website, respect its terms of service and its robots.txt file. Some websites forbid scraping outright, while others impose specific conditions you must follow.
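Python's standard library can check a robots.txt policy before you fetch a URL. The sketch below uses urllib.robotparser with an invented robots.txt, shown inline for illustration; in practice you would point the parser at the site's real https://example.com/robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given user agent may fetch each URL
print(rp.can_fetch("*", "https://example.com/public-page"))   # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```

For a live site, `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` downloads and parses the real file.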

Web scraping is a powerful tool, and with that power comes the responsibility to use it ethically. Always read a website's terms and conditions before scraping data from it.