Web Scraping with BeautifulSoup and Scrapy
What Is Web Scraping?
Web scraping is the process of extracting data from websites using automated tools or code.
Tools You’ll Learn:

| Tool | Description | Use Case |
| --- | --- | --- |
| BeautifulSoup | Lightweight library for parsing HTML/XML | Great for small, simple scraping tasks |
| Scrapy | Powerful web scraping framework | Ideal for large-scale, complex scraping projects |
Installation
Install both using pip:
pip install beautifulsoup4 requests lxml scrapy
PART 1: Web Scraping with BeautifulSoup

Example: Scrape Titles from a Blog Page
import requests
from bs4 import BeautifulSoup
# Target URL
url = 'https://example-blog.com'
# Send request
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml') # or 'html.parser'
# Extract article titles
titles = soup.find_all('h2', class_='post-title')
# Print results
for title in titles:
    print(title.text.strip())
✅ Key BeautifulSoup Functions
| Function | What It Does |
| --- | --- |
| soup.find() | Finds the first matching tag |
| soup.find_all() | Finds all matching tags |
| tag.text | Extracts inner text |
| tag['href'] | Extracts an attribute value |
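A minimal sketch of these four functions on an inline HTML snippet (the markup and class names here are invented for illustration; the built-in 'html.parser' is used so no extra parser is required):

```python
from bs4 import BeautifulSoup

# Invented HTML snippet for illustration
html = """
<div class="post">
  <h2 class="post-title"><a href="/first-post">First Post</a></h2>
  <h2 class="post-title"><a href="/second-post">Second Post</a></h2>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('h2', class_='post-title')           # first matching tag
all_titles = soup.find_all('h2', class_='post-title')  # all matching tags

print(first.text.strip())       # inner text → First Post
print(first.find('a')['href'])  # attribute value → /first-post
print(len(all_titles))          # number of matches → 2
```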
PART 2: Web Scraping with Scrapy
Scrapy is more structured and powerful — perfect for crawling multiple pages or websites.
Step 1: Create a Scrapy Project
scrapy startproject my_scraper
cd my_scraper
Step 2: Generate a Spider
scrapy genspider blog_spider example-blog.com
Step 3: Sample Spider Code
# my_scraper/spiders/blog_spider.py
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blog_spider'
    start_urls = ['https://example-blog.com']

    def parse(self, response):
        for post in response.css('h2.post-title'):
            yield {
                'title': post.css('a::text').get(),
                'link': post.css('a::attr(href)').get(),
            }
Step 4: Run the Spider
scrapy crawl blog_spider
Or to save the results:
scrapy crawl blog_spider -o posts.json
When to Use What?

| Task | Use BeautifulSoup | Use Scrapy |
| --- | --- | --- |
| Small scripts | ✅ | ❌ |
| Quick one-time jobs | ✅ | ❌ |
| Need to follow many links | ❌ | ✅ |
| High performance & large data | ❌ | ✅ |
| Exporting to CSV/JSON | ✅ (manually) | ✅ (built-in) |
| Handling AJAX or JavaScript | ❌ | ❌ (use Selenium or Playwright) |
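For the "manual" export column above: with BeautifulSoup you write the output file yourself, for example with the standard csv module. The rows here are a stand-in for scraped results:

```python
import csv

# Stand-in for rows scraped with BeautifulSoup
posts = [
    {'title': 'First Post', 'link': '/first-post'},
    {'title': 'Second Post', 'link': '/second-post'},
]

with open('posts.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()     # column header row
    writer.writerows(posts)  # one row per post
```

Scrapy makes the same export a command-line flag instead: scrapy crawl blog_spider -o posts.csv.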
Notes on Ethics & Legality
Always check the site's robots.txt (e.g., example.com/robots.txt)
Avoid overloading servers — use delays or polite crawling
Some sites may ban or block bots
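In Scrapy, polite crawling is configured in my_scraper/settings.py. A sketch — the values are illustrative and should be tuned per site:

```python
# my_scraper/settings.py (excerpt)
ROBOTSTXT_OBEY = True               # respect the site's robots.txt
DOWNLOAD_DELAY = 1.0                # wait ~1 second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # limit parallel hits per domain
AUTOTHROTTLE_ENABLED = True         # back off automatically under load
```

With requests + BeautifulSoup there is no built-in equivalent, so add a time.sleep() between requests yourself.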
Summary

| Tool | Pros | Cons |
| --- | --- | --- |
| BeautifulSoup | Easy to use, simple syntax | Not great for large projects |
| Scrapy | Fast, scalable, powerful | Steeper learning curve |