Exploratory Data Analysis (EDA) in 5 Minutes
EDA is the process of understanding your data before applying any machine learning models. It helps you find patterns, spot anomalies, check assumptions, and decide how to clean and prepare your data.
Let’s break it down into 5 simple steps you can follow quickly.
✅ Step 1: Understand the Structure of Your Data
Load the data using pandas:
import pandas as pd
df = pd.read_csv('your_data.csv')
Check basic info:
df.shape # Rows and columns
df.info() # Data types and non-null values
df.head() # Preview first 5 rows
df.describe() # Summary stats (mean, std, min, max)
Goal: Get a general feel for what you're working with.
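The checks above can be rolled into one per-column overview. The following is only a sketch: the toy DataFrame stands in for your_data.csv, and `structure_summary` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Hypothetical toy data standing in for your_data.csv
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [30000.0, 42000.0, None, 88000.0, 56000.0],
    "gender": ["F", "M", "F", "M", "F"],
})

def structure_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, non-null count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a value
    })

summary = structure_summary(df)
print(df.shape)   # (rows, columns)
print(summary)
```

A column whose non_null count is below the row count, or whose dtype is not what you expected, is usually the first thing worth investigating.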
✅ Step 2: Check for Missing or Duplicate Data
Find missing values:
df.isnull().sum()
Check for duplicates:
df.duplicated().sum()
Goal: Identify data quality issues early.
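Raw counts are easier to judge as percentages. Here is a minimal sketch on made-up data (the column values are assumptions for illustration) that reports the share of missing values per column and removes exact duplicate rows:

```python
import pandas as pd

# Hypothetical toy data with one repeated row and a few gaps
df = pd.DataFrame({
    "age": [25, 32, 32, None, 51],
    "income": [30000.0, 42000.0, 42000.0, 56000.0, None],
    "gender": ["F", "M", "M", "F", None],
})

# Percentage of missing values in each column
missing_pct = df.isnull().mean().mul(100).round(1)

# Number of rows that exactly repeat an earlier row
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = df.drop_duplicates().reset_index(drop=True)

print(missing_pct)
print(n_duplicates, len(deduped))
```

Whether to drop or fill the missing values depends on the column and the model, so this sketch only measures the problem.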
✅ Step 3: Understand Each Column (Univariate Analysis)
Categorical variables:
df['gender'].value_counts().plot(kind='bar')
Numerical variables:
df['age'].hist(bins=20)
Goal: Know the distribution of each variable.
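When there are many columns, plotting them one by one gets slow. A rough sketch, using assumed toy data, that splits columns by type and summarizes each group numerically:

```python
import pandas as pd

# Hypothetical toy data; income has one very large value
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52],
    "income": [28000, 31000, 40000, 52000, 61000, 150000],
    "gender": ["F", "M", "F", "F", "M", "F"],
})

# Split columns by type so each group gets a suitable summary
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Mean well above the median, or a large positive skew, hints at a long right tail
numeric_profile = df[numeric_cols].agg(["mean", "median", "skew"]).T

# Share of each category per categorical column
category_shares = {
    col: df[col].value_counts(normalize=True) for col in categorical_cols
}

print(numeric_profile)
print(category_shares["gender"])
```

The numbers do not replace the histograms and bar charts, but they help you decide which columns deserve a closer look.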
✅ Step 4: Find Relationships (Bivariate Analysis)
Numerical vs Numerical:
df.plot.scatter(x='age', y='income')
Categorical vs Numerical (for example, a numeric target):
import seaborn as sns
sns.boxplot(x='gender', y='income', data=df)
Correlation heatmap:
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # numeric_only skips text columns such as gender
Goal: Spot patterns and possible predictors.
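A heatmap is hard to read once there are many columns, so it can help to rank features against a single column of interest. This sketch assumes income is that column and uses invented data; `numeric_only=True` requires pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical toy data; years_experience is an invented column
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52],
    "years_experience": [1, 3, 8, 14, 20, 30],
    "income": [28000, 31000, 40000, 52000, 61000, 75000],
    "gender": ["F", "M", "F", "M", "M", "F"],
})

# Pearson correlation between numeric columns only (gender is skipped)
corr = df.corr(numeric_only=True)

# Rank the other numeric features by absolute correlation with income
ranked = corr["income"].drop("income").abs().sort_values(ascending=False)

# Numeric counterpart of the boxplot: median income per category
group_medians = df.groupby("gender")["income"].median()

print(ranked)
print(group_medians)
```

Correlation only captures linear relationships, so a low value here does not rule out a feature; the scatter plot is still worth a glance.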
✅ Step 5: Look for Outliers and Data Imbalance
Boxplots for outliers:
sns.boxplot(x=df['income']) # pass the column by keyword; recent seaborn versions expect named arguments
Target class imbalance:
df['churn'].value_counts(normalize=True).plot(kind='bar')
Goal: Decide if you need to fix outliers or balance your dataset.
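A boxplot draws its whiskers using the 1.5 x IQR rule by default, and the same rule can be applied in code to list the flagged rows. The sketch below uses made-up data; `iqr_outliers` is a hypothetical helper, and 1.5 is a common convention rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical toy data: one extreme income, 2 churners out of 8
df = pd.DataFrame({
    "income": [30000, 32000, 35000, 38000, 40000, 42000, 45000, 250000],
    "churn": [0, 0, 0, 0, 0, 0, 1, 1],
})

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outliers(df["income"])
outlier_rows = df[mask]

# Class shares and the ratio of largest to smallest class
class_shares = df["churn"].value_counts(normalize=True)
imbalance_ratio = class_shares.max() / class_shares.min()

print(outlier_rows)
print(class_shares)
print(imbalance_ratio)
```

Flagged values are candidates for review, not automatic deletions; an extreme income may be a data entry error or a genuine customer.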
Quick Summary
Task | Code/Tool Example
View structure | df.shape, df.info(), df.head(), df.describe()
Missing values | df.isnull().sum()
Duplicates | df.duplicated().sum()
Variable distribution | df['col'].hist(), df['col'].value_counts()
Relationships | df.plot.scatter(), sns.boxplot(), sns.heatmap()
Outliers & imbalance | sns.boxplot(), df['col'].value_counts(normalize=True)
Final Thought:
"You can't fix what you don't understand. EDA is about understanding."
Even a quick EDA helps avoid costly mistakes later when modeling.