🌐 A Beginner’s Guide to Web Scraping with Python
Web scraping is the process of extracting data from websites using code. In Python, it's commonly done with libraries like BeautifulSoup, Requests, and Selenium.
If you're new to web scraping, this guide will help you start safely and effectively.
🧰 Tools You’ll Need
🔹 1. Install Required Libraries
```bash
pip install requests
pip install beautifulsoup4
```
Optionally, for dynamic websites:

```bash
pip install selenium
```
📄 2. Understanding HTML Basics
Before scraping, know how websites are structured:
<div>, <span>, <p>: containers
<a>: hyperlinks
<table>: tabular data
<ul>, <li>: lists
id, class: used to identify elements
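To see these building blocks in action, here is a tiny HTML snippet parsed with BeautifulSoup. The markup is invented for illustration; it just exercises the tags and attributes listed above:

```python
from bs4 import BeautifulSoup

# Made-up snippet using the tags described above
html = """
<div id="content">
  <p class="intro">Welcome!</p>
  <ul>
    <li><a href="/page1">Page 1</a></li>
    <li><a href="/page2">Page 2</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# id and class let you target specific elements
print(soup.find(id="content").find("p", class_="intro").text)  # Welcome!
print([a["href"] for a in soup.find_all("a")])  # ['/page1', '/page2']
```

Notice that `class_` has a trailing underscore in BeautifulSoup, because `class` is a reserved word in Python.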
✅ 3. Basic Scraping Example (Using requests + BeautifulSoup)
Let’s scrape the titles from quotes.toscrape.com, a site made for practice.
🧪 Code:
```python
import requests
from bs4 import BeautifulSoup

# Step 1: Send a GET request to the website
url = "http://quotes.toscrape.com"
response = requests.get(url)

# Step 2: Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Step 3: Extract the data
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")

# Step 4: Print the scraped data
for quote, author in zip(quotes, authors):
    print(f"{quote.text} — {author.text}")
```
✅ Output:
```
“The world as we have created it...” — Albert Einstein
“It is our choices, Harry...” — J.K. Rowling
...
```
📑 4. Selecting Elements
Use BeautifulSoup methods:
| Method | Description |
| --- | --- |
| `find()` | Finds the first matching tag |
| `find_all()` | Finds all matching tags |
| `select('css-selector')` | Uses CSS selectors (e.g., `.class`) |
| `.text` | Extracts the text inside the tag |
| `.get('href')` | Extracts attributes like links |
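A quick sketch of these methods side by side, run on a small invented snippet rather than a live site:

```python
from bs4 import BeautifulSoup

# Sample document (not from a real site)
html = '<div><a class="link" href="/a">A</a><a class="link" href="/b">B</a></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")          # first matching tag
all_links = soup.find_all("a")  # every matching tag
by_css = soup.select("a.link")  # CSS selector syntax

print(first.text)                # A
print(len(all_links))            # 2
print(first.get("href"))         # /a
print([a.text for a in by_css])  # ['A', 'B']
```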
🧠 5. Scraping Tips
Always inspect the HTML of a page using DevTools (F12).
Identify unique class, id, or tag to target elements accurately.
Loop through paginated data using next page URLs.
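The pagination tip boils down to a loop that keeps following the "next page" link until there is none. To keep this sketch self-contained, the network layer is faked with a dict of pages; in real code, `fetch` would call `requests.get` instead (the markup mimics quotes.toscrape.com's `li.next` link):

```python
from bs4 import BeautifulSoup

# Fake "site": two pages, the first linking to the next
PAGES = {
    "/page/1/": '<span class="text">Quote one</span>'
                '<li class="next"><a href="/page/2/">Next</a></li>',
    "/page/2/": '<span class="text">Quote two</span>',
}

def fetch(path):
    # Real code would do: requests.get(base_url + path).text
    return PAGES[path]

quotes, path = [], "/page/1/"
while path:
    soup = BeautifulSoup(fetch(path), "html.parser")
    quotes += [q.text for q in soup.find_all("span", class_="text")]
    next_link = soup.select_one("li.next > a")   # None on the last page
    path = next_link["href"] if next_link else None

print(quotes)  # ['Quote one', 'Quote two']
```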
⚠️ 6. Legal and Ethical Considerations
Always check the website’s robots.txt file (e.g., https://site.com/robots.txt).
Scrape public data only.
Do not overload the server with too many requests.
Use polite scraping:
```python
import time

time.sleep(1)  # Wait between requests
```
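One way to make the delay systematic is a small helper that enforces a minimum interval between requests, so you can't forget to sleep. This is just a sketch: `polite_fetch` and `MIN_INTERVAL` are invented names, and a real version would return `requests.get(url)`:

```python
import time

MIN_INTERVAL = 0.2   # seconds between requests (use 1.0+ on real sites)
_last_request = 0.0

def polite_fetch(url):
    """Stand-in for requests.get(url) that enforces a delay between calls."""
    global _last_request
    wait = _last_request + MIN_INTERVAL - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    # Real code would do: return requests.get(url)
    return f"fetched {url}"

start = time.monotonic()
for page in range(1, 4):
    polite_fetch(f"http://quotes.toscrape.com/page/{page}/")
elapsed = time.monotonic() - start
print(f"{elapsed:.2f}s for 3 requests")  # at least ~0.4s
```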
🚀 7. Scraping Dynamic Content (with Selenium)
Some sites load data using JavaScript. Use Selenium to interact with them:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

quote = driver.find_element(By.CLASS_NAME, "quote").text
print(quote)

driver.quit()
```
📦 8. Saving Scraped Data
You can save your data as a CSV:
```python
import csv

with open("quotes.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Quote", "Author"])
    for quote, author in zip(quotes, authors):
        writer.writerow([quote.text, author.text])
```
🏁 Summary
| Task | Tool/Command |
| --- | --- |
| Make web request | `requests.get(url)` |
| Parse HTML | `BeautifulSoup(html, "html.parser")` |
| Find elements | `find_all()`, `select()` |
| Extract data | `.text`, `.get('href')` |
| Handle dynamic content | Use Selenium |
| Save data | Use `csv`, `json`, or `pandas` |
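As a sketch of that last row, here is how some made-up scraped rows could be written out as both CSV and JSON using only the standard library (the data is invented; writing to `io.StringIO` instead of a file keeps the example self-contained):

```python
import csv
import io
import json

# Sample scraped rows (made-up data for illustration)
rows = [
    {"Quote": "Quote one", "Author": "Author A"},
    {"Quote": "Quote two", "Author": "Author B"},
]

# CSV via csv.DictWriter (swap io.StringIO for open("quotes.csv", "w", newline=""))
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Quote", "Author"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON via json.dumps; ensure_ascii=False keeps curly quotes readable
json_text = json.dumps(rows, ensure_ascii=False, indent=2)

print(csv_text)
print(json_text)
```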