๐ What Is SQL?
SQL (Structured Query Language) is the standard language for working with data stored in relational databases.
For data scientists, SQL is essential because:
Most real-world data lives in databases.
It allows you to query, clean, aggregate, and join data efficiently.
It’s often the first step in data analysis before using Python, R, or visualization tools.
๐ง SQL Essentials for Data Science
1. Understanding Databases and Tables
Database: A collection of organized data.
Table: Similar to an Excel sheet — rows (records) and columns (fields).
Row: One record (e.g., one customer).
Column: One attribute (e.g., customer name).
Example:
id name age city
1 Alice 29 London
2 Bob 35 New York
⚙️ Basic SQL Commands
1. SELECT — Retrieve Data
The most important SQL command.
SELECT name, age
FROM customers;
Retrieves columns name and age from the customers table.
2. WHERE — Filter Data
Use conditions to filter rows.
SELECT *
FROM customers
WHERE age > 30;
Returns customers older than 30.
Common operators:
=, !=, >, <, >=, <=, BETWEEN, LIKE, IN
Example:
SELECT *
FROM customers
WHERE city IN ('London', 'New York');
3. ORDER BY — Sort Results
SELECT *
FROM customers
ORDER BY age DESC;
Sorts customers by age (descending).
4. LIMIT — Restrict the Number of Rows
SELECT *
FROM customers
LIMIT 5;
Shows only 5 rows.
5. DISTINCT — Remove Duplicates
SELECT DISTINCT city
FROM customers;
Returns unique city names.
๐ Aggregation and Grouping
Data scientists often need summaries — totals, averages, counts, etc.
Aggregate Functions
Function Description
COUNT() Number of rows
SUM() Total sum
AVG() Average
MIN() Minimum value
MAX() Maximum value
Example:
SELECT city, COUNT(*) AS num_customers
FROM customers
GROUP BY city;
Counts customers per city.
HAVING — Filter Groups
Unlike WHERE, HAVING filters after grouping.
SELECT city, AVG(age) AS avg_age
FROM customers
GROUP BY city
HAVING AVG(age) > 30;
๐ Working with Multiple Tables (JOINS)
Data in databases is often spread across multiple related tables.
Customers Orders
id name
1 Alice
2 Bob
1. INNER JOIN
Returns only matching rows.
SELECT c.name, o.total
FROM customers c
JOIN orders o
ON c.id = o.customer_id;
2. LEFT JOIN
Returns all rows from the left table, even if there’s no match.
SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o
ON c.id = o.customer_id;
3. RIGHT JOIN
Returns all rows from the right table, even if there’s no match.
SELECT c.name, o.total
FROM customers c
RIGHT JOIN orders o
ON c.id = o.customer_id;
4. FULL OUTER JOIN
Returns all records when there is a match in either table.
SELECT c.name, o.total
FROM customers c
FULL JOIN orders o
ON c.id = o.customer_id;
๐งฉ Data Cleaning with SQL
Data cleaning is a critical part of data science.
1. Handle Missing Data
SELECT *
FROM customers
WHERE age IS NULL;
or replace missing data:
UPDATE customers
SET age = 0
WHERE age IS NULL;
2. Remove Duplicates
SELECT DISTINCT *
FROM customers;
3. Data Type Conversion
SELECT CAST(age AS FLOAT)
FROM customers;
4. String Operations
SELECT UPPER(name), LOWER(city)
FROM customers;
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM users;
๐งฎ Analytical SQL — For Data Scientists
SQL can perform data analysis directly inside the database.
1. Window Functions
Used for running totals, ranks, and moving averages.
SELECT
customer_id,
order_date,
SUM(total) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;
2. Ranking
SELECT
name,
total,
RANK() OVER (ORDER BY total DESC) AS rank
FROM sales;
3. Case Statements
Conditional logic in SQL.
SELECT
name,
CASE
WHEN age < 18 THEN 'Minor'
WHEN age BETWEEN 18 AND 65 THEN 'Adult'
ELSE 'Senior'
END AS category
FROM customers;
๐ Integrating SQL with Data Science Tools
Python: Use pandas + SQLAlchemy or sqlite3
import pandas as pd
import sqlite3
conn = sqlite3.connect('data.db')
df = pd.read_sql_query("SELECT * FROM customers WHERE age > 30", conn)
R: Use dbplyr or sqldf packages.
Visualization Tools: Tableau, Power BI, and Looker use SQL queries under the hood.
๐งพ Best Practices for Data Scientists
✅ Write clear, readable queries — use indentation.
✅ Always use aliases (AS) for clarity.
✅ Test queries on small samples using LIMIT.
✅ Avoid SELECT * in production — select only needed columns.
✅ Index frequently filtered columns to speed up queries.
๐ง Summary
Concept Key Command Use
Retrieve data SELECT Query tables
Filter WHERE, HAVING Conditions
Sort ORDER BY Ranking and sorting
Aggregate COUNT(), AVG() Summary statistics
Combine tables JOIN Merge datasets
Clean data IS NULL, DISTINCT, CAST() Prepare for analysis
Advanced analysis WINDOW, CASE Analytical logic
Learn Data Science Course in Hyderabad
Read More
Focus on specific tools and platforms used in the industry.
Tools & Technologies in Data Science
Transitioning from a Non-Technical Background to Data Science
Demystifying the Data Science Job Market
Visit Our Quality Thought Training Institute in Hyderabad
Subscribe by Email
Follow Updates Articles from This Blog via Email
No Comments