Exploratory Data Analysis (EDA) in 5 Minutes



EDA is the process of understanding your data before applying any machine learning models. It helps you find patterns, spot anomalies, check assumptions, and decide how to clean and prepare your data.


Let’s break it down into 5 simple steps you can follow quickly.


✅ Step 1: Understand the Structure of Your Data


Load the data using pandas:


import pandas as pd

df = pd.read_csv('your_data.csv')



Check basic info:


df.shape           # Rows and columns

df.info()          # Data types and non-null values

df.head()          # Preview first 5 rows

df.describe()      # Summary stats (mean, std, min, max)



📌 Goal: Get a general feel for what you're working with.
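Here is a minimal sketch of the same checks that runs without a CSV file. The small DataFrame is made-up sample data standing in for 'your_data.csv', and `include="all"` is an optional extra that extends `describe()` to text columns.

```python
import pandas as pd

# Hypothetical sample data standing in for 'your_data.csv'
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "gender": ["F", "M", "F", "M", "F"],
    "income": [42000.0, 55000.0, 61000.0, None, 48000.0],
})

n_rows, n_cols = df.shape                    # number of rows and columns
dtypes = df.dtypes.astype(str).to_dict()     # column name -> dtype name
numeric_summary = df.describe()              # numeric columns only by default
full_summary = df.describe(include="all")    # also summarizes text columns

print(n_rows, n_cols)
print(dtypes)
print(full_summary)
```

Note that the `count` row in `describe()` only counts non-null values, so a count lower than the number of rows is an early hint of missing data.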


✅ Step 2: Check for Missing or Duplicate Data


Find missing values:


df.isnull().sum()



Check for duplicates:


df.duplicated().sum()



📌 Goal: Identify data quality issues early.
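Raw counts are easier to judge as percentages. The sketch below uses invented data with one missing value and one repeated row. Dropping duplicates with `drop_duplicates()` is one possible follow-up, not a step you always need.

```python
import pandas as pd

# Hypothetical data: one missing age, one fully repeated row
df = pd.DataFrame({
    "age": [25, 32, 32, 47, None],
    "gender": ["F", "M", "M", "F", "M"],
})

missing_pct = df.isnull().mean() * 100       # percent of missing values per column
n_duplicates = int(df.duplicated().sum())    # rows identical to an earlier row
deduped = df.drop_duplicates()               # keeps the first occurrence of each row

print(missing_pct)
print(n_duplicates, len(deduped))
```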


✅ Step 3: Understand Each Column (Univariate Analysis)


Categorical variables:


df['gender'].value_counts().plot(kind='bar')



Numerical variables:


df['age'].hist(bins=20)



📌 Goal: Know the distribution of each variable.
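If you want the numbers behind those plots, here is a rough sketch on made-up data. The column names match the examples above, but the values and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "F"],
    "age": [22, 25, 31, 38, 44, 59],
})

# Split columns by type so each gets the right kind of summary
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Categorical: share of rows in each category (what the bar chart shows)
gender_share = df["gender"].value_counts(normalize=True)

# Numerical: count values per bin, similar to what a histogram draws
age_bins = pd.cut(df["age"], bins=[20, 30, 40, 50, 60]).value_counts().sort_index()
age_skew = df["age"].skew()   # a positive value suggests a longer right tail

print(gender_share)
print(age_bins)
print(age_skew)
```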


✅ Step 4: Find Relationships (Bivariate Analysis)


Numerical vs Numerical:


df.plot.scatter(x='age', y='income')



Categorical vs Numerical Target:


import seaborn as sns

sns.boxplot(x='gender', y='income', data=df)



Correlation heatmap:


sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')  # skip text columns such as 'gender'



📌 Goal: Spot patterns and possible predictors.
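The plots above have simple numeric counterparts. This sketch uses invented values and assumes pandas 1.5 or newer, where `corr()` accepts `numeric_only`. It computes the correlation matrix that the heatmap colors, plus the per-group statistics that a boxplot draws.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30000, 38000, 56000, 60000, 45000, 34000],
    "gender": ["F", "M", "F", "M", "F", "M"],
})

# Numerical vs numerical: Pearson correlation, ignoring text columns
corr = df.corr(numeric_only=True)

# Categorical vs numerical: count, mean, quartiles, min and max per group
income_by_gender = df.groupby("gender")["income"].describe()

print(corr)
print(income_by_gender)
```

A correlation close to 1 or -1 points to a strong linear relationship, but it says nothing about causation and can miss non-linear patterns, so it is still worth looking at the scatter plot.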


✅ Step 5: Look for Outliers and Data Imbalance


Boxplots for outliers:


sns.boxplot(x=df['income'])



Target class imbalance:


df['churn'].value_counts(normalize=True).plot(kind='bar')



📌 Goal: Decide if you need to fix outliers or balance your dataset.
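To list the outliers instead of only seeing them, one common option is the 1.5 × IQR rule, which is also what boxplot whiskers use by default. The sketch below applies it to made-up numbers. The 1.5 multiplier is a convention, not a requirement, and the data values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: income in thousands, churn as 0/1
df = pd.DataFrame({
    "income": [40, 42, 45, 47, 50, 52, 55, 300],
    "churn": [0, 0, 0, 0, 0, 0, 1, 1],
})

# IQR rule: flag values more than 1.5 * IQR outside the middle 50%
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["income"] < lower) | (df["income"] > upper)]

# Class balance: share of rows per target class
class_share = df["churn"].value_counts(normalize=True)

print(outliers)
print(class_share)
```

Whether a flagged row is an error or a genuine extreme value is a judgment call that depends on your data, so inspect the rows before removing anything.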


🧭 Quick Summary

Task | Code/Tool Example
View structure | df.info(), df.describe()
Missing values | df.isnull().sum()
Duplicates | df.duplicated().sum()
Variable distribution | df['col'].hist(), value_counts()
Relationships | sns.boxplot(), scatter(), heatmap()
Outliers & imbalance | sns.boxplot(), value_counts()

🧠 Final Thought:


"You can't fix what you don't understand. EDA is about understanding."


Even a quick EDA helps avoid costly mistakes later when modeling.
