Saturday, November 8, 2025

thumbnail

A Guide to SQL for Data Science

 ๐Ÿ“˜ What Is SQL?


SQL (Structured Query Language) is the standard language for working with data stored in relational databases.


For data scientists, SQL is essential because:


Most real-world data lives in databases.


It allows you to query, clean, aggregate, and join data efficiently.


It’s often the first step in data analysis before using Python, R, or visualization tools.


๐Ÿง  SQL Essentials for Data Science

1. Understanding Databases and Tables


Database: A collection of organized data.


Table: Similar to an Excel sheet — rows (records) and columns (fields).


Row: One record (e.g., one customer).


Column: One attribute (e.g., customer name).


Example:


id name age city

1 Alice 29 London

2 Bob 35 New York

⚙️ Basic SQL Commands

1. SELECT — Retrieve Data


The most important SQL command.


SELECT name, age

FROM customers;



Retrieves columns name and age from the customers table.


2. WHERE — Filter Data


Use conditions to filter rows.


SELECT *

FROM customers

WHERE age > 30;



Returns customers older than 30.


Common operators:

=, !=, >, <, >=, <=, BETWEEN, LIKE, IN


Example:


SELECT *

FROM customers

WHERE city IN ('London', 'New York');


3. ORDER BY — Sort Results

SELECT *

FROM customers

ORDER BY age DESC;



Sorts customers by age (descending).


4. LIMIT — Restrict the Number of Rows

SELECT *

FROM customers

LIMIT 5;



Shows only 5 rows.


5. DISTINCT — Remove Duplicates

SELECT DISTINCT city

FROM customers;



Returns unique city names.


๐Ÿ“Š Aggregation and Grouping


Data scientists often need summaries — totals, averages, counts, etc.


Aggregate Functions

Function Description

COUNT() Number of rows

SUM() Total sum

AVG() Average

MIN() Minimum value

MAX() Maximum value


Example:


SELECT city, COUNT(*) AS num_customers

FROM customers

GROUP BY city;



Counts customers per city.


HAVING — Filter Groups


Unlike WHERE, HAVING filters after grouping.


SELECT city, AVG(age) AS avg_age

FROM customers

GROUP BY city

HAVING AVG(age) > 30;


๐Ÿ”— Working with Multiple Tables (JOINS)


Data in databases is often spread across multiple related tables.


Customers Orders

id name

1 Alice

2 Bob

1. INNER JOIN


Returns only matching rows.


SELECT c.name, o.total

FROM customers c

JOIN orders o

ON c.id = o.customer_id;


2. LEFT JOIN


Returns all rows from the left table, even if there’s no match.


SELECT c.name, o.total

FROM customers c

LEFT JOIN orders o

ON c.id = o.customer_id;


3. RIGHT JOIN


Returns all rows from the right table, even if there’s no match.


SELECT c.name, o.total

FROM customers c

RIGHT JOIN orders o

ON c.id = o.customer_id;


4. FULL OUTER JOIN


Returns all records when there is a match in either table.


SELECT c.name, o.total

FROM customers c

FULL JOIN orders o

ON c.id = o.customer_id;


๐Ÿงฉ Data Cleaning with SQL


Data cleaning is a critical part of data science.


1. Handle Missing Data

SELECT *

FROM customers

WHERE age IS NULL;



or replace missing data:


UPDATE customers

SET age = 0

WHERE age IS NULL;


2. Remove Duplicates

SELECT DISTINCT *

FROM customers;


3. Data Type Conversion

SELECT CAST(age AS FLOAT)

FROM customers;


4. String Operations

SELECT UPPER(name), LOWER(city)

FROM customers;


SELECT CONCAT(first_name, ' ', last_name) AS full_name

FROM users;


๐Ÿงฎ Analytical SQL — For Data Scientists


SQL can perform data analysis directly inside the database.


1. Window Functions


Used for running totals, ranks, and moving averages.


SELECT

  customer_id,

  order_date,

  SUM(total) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total

FROM orders;


2. Ranking

SELECT

  name,

  total,

  RANK() OVER (ORDER BY total DESC) AS rank

FROM sales;


3. Case Statements


Conditional logic in SQL.


SELECT

  name,

  CASE

    WHEN age < 18 THEN 'Minor'

    WHEN age BETWEEN 18 AND 65 THEN 'Adult'

    ELSE 'Senior'

  END AS category

FROM customers;


๐Ÿ“ˆ Integrating SQL with Data Science Tools


Python: Use pandas + SQLAlchemy or sqlite3


import pandas as pd

import sqlite3


conn = sqlite3.connect('data.db')

df = pd.read_sql_query("SELECT * FROM customers WHERE age > 30", conn)



R: Use dbplyr or sqldf packages.


Visualization Tools: Tableau, Power BI, and Looker use SQL queries under the hood.


๐Ÿงพ Best Practices for Data Scientists


✅ Write clear, readable queries — use indentation.

✅ Always use aliases (AS) for clarity.

✅ Test queries on small samples using LIMIT.

✅ Avoid SELECT * in production — select only needed columns.

✅ Index frequently filtered columns to speed up queries.


๐Ÿง  Summary

Concept Key Command Use

Retrieve data SELECT Query tables

Filter WHERE, HAVING Conditions

Sort ORDER BY Ranking and sorting

Aggregate COUNT(), AVG() Summary statistics

Combine tables JOIN Merge datasets

Clean data IS NULL, DISTINCT, CAST() Prepare for analysis

Advanced analysis WINDOW, CASE Analytical logic

Learn Data Science Course in Hyderabad

Read More

Focus on specific tools and platforms used in the industry.

Tools & Technologies in Data Science

Transitioning from a Non-Technical Background to Data Science

Demystifying the Data Science Job Market

Visit Our Quality Thought Training Institute in Hyderabad

Get Directions 

Subscribe by Email

Follow Updates Articles from This Blog via Email

No Comments

About

Search This Blog

Powered by Blogger.

Blog Archive