Monday, September 8, 2025


Web Scraping with BeautifulSoup and Scrapy

🕸️ What Is Web Scraping?


Web scraping is the process of extracting data from websites using automated tools or code.


🧰 Tools You'll Learn:

| Tool | Description | Use Case |
|------|-------------|----------|
| BeautifulSoup | Lightweight library for parsing HTML/XML | Great for small, simple scraping tasks |
| Scrapy | Powerful web scraping framework | Ideal for large-scale, complex scraping projects |

📦 Installation

Install both libraries (plus the requests HTTP client and the lxml parser) using pip:

pip install beautifulsoup4 requests lxml scrapy


๐Ÿ“ PART 1: Web Scraping with BeautifulSoup

๐Ÿงช Example: Scrape Titles from a Blog Page

import requests
from bs4 import BeautifulSoup

# Target URL
url = 'https://example-blog.com'

# Send request
response = requests.get(url)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.content, 'lxml')  # or 'html.parser'

# Extract article titles
titles = soup.find_all('h2', class_='post-title')

# Print results
for title in titles:
    print(title.text.strip())


✅ Key BeautifulSoup Functions

| Function | What It Does |
|----------|--------------|
| soup.find() | Finds the first matching tag |
| soup.find_all() | Finds all matching tags |
| tag.text | Extracts inner text |
| tag['href'] | Extracts an attribute value |
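Here is a minimal, self-contained demonstration of the four functions above, run against an inline HTML snippet (the markup and URLs are made up for illustration). It uses the built-in 'html.parser' so no extra parser is required:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <h2 class="post-title"><a href="/first-post">First Post</a></h2>
  <h2 class="post-title"><a href="/second-post">Second Post</a></h2>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

first = soup.find('h2', class_='post-title')        # first matching tag
print(first.text.strip())                           # -> First Post

for h2 in soup.find_all('h2', class_='post-title'): # all matching tags
    link = h2.find('a')
    print(link.text, link['href'])                  # inner text + attribute value
```

The same methods work identically whether the soup was built from a live `requests` response or a local string, which makes them easy to experiment with offline.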

🚀 PART 2: Web Scraping with Scrapy


Scrapy is more structured and powerful — perfect for crawling multiple pages or websites.


🛠️ Step 1: Create a Scrapy Project

scrapy startproject my_scraper
cd my_scraper


🛠️ Step 2: Generate a Spider

scrapy genspider blog_spider example-blog.com


🧪 Step 3: Sample Spider Code

# my_scraper/spiders/blog_spider.py
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blog_spider'
    start_urls = ['https://example-blog.com']

    def parse(self, response):
        for post in response.css('h2.post-title'):
            yield {
                'title': post.css('a::text').get(),
                'link': post.css('a::attr(href)').get(),
            }


🛠️ Step 4: Run the Spider

scrapy crawl blog_spider

Or to save the results:

scrapy crawl blog_spider -o posts.json
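The exported file is a JSON array of the dictionaries the spider yielded. A small sketch of loading and inspecting it afterwards — the file contents shown here are hypothetical:

```python
import json

# Hypothetical contents of posts.json after `scrapy crawl blog_spider -o posts.json`
raw = '[{"title": "First Post", "link": "/first-post"},' \
      ' {"title": "Second Post", "link": "/second-post"}]'

posts = json.loads(raw)
for post in posts:
    print(post['title'], '->', post['link'])

print(len(posts), 'posts scraped')
```

In practice you would use `json.load(open('posts.json'))` on the real file instead of an inline string.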


๐Ÿ” When to Use What?

Task Use BeautifulSoup Use Scrapy

Small scripts

Quick one-time jobs

Need to follow many links

High performance & large data

Exporting to CSV/JSON ✅ (manually) ✅ (built-in)

Handling AJAX or JavaScript ❌ (use Selenium or Playwright)

🔒 Notes on Ethics & Legality

- Always check the site's robots.txt (e.g., example.com/robots.txt)
- Avoid overloading servers: use delays or polite crawling
- Some sites may ban or block bots
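In Scrapy, the politeness points above map directly onto a few settings in my_scraper/settings.py. A minimal sketch — the specific values are illustrative, tune them per site:

```python
# my_scraper/settings.py (excerpt): polite-crawling settings

ROBOTSTXT_OBEY = True               # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0                # wait ~1 second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # limit parallel requests to one site
AUTOTHROTTLE_ENABLED = True         # adapt the delay to server response times
```

With BeautifulSoup there is no framework doing this for you, so the equivalent is an explicit `time.sleep()` between `requests.get()` calls.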


🧠 Summary

| Tool | Pros | Cons |
|------|------|------|
| BeautifulSoup | Easy to use, simple syntax | Not great for large projects |
| Scrapy | Fast, scalable, powerful | Steeper learning curve |
