Outlier Detection Methods in Data Science
What is an Outlier?
An outlier is a data point that is significantly different from other observations. It can occur due to:
Errors (e.g. data entry mistakes)
Natural variation (e.g. a very high salary)
Fraud or anomalies
Detecting outliers is important because they can:
Skew statistical analysis
Affect model performance
Reveal important insights (like fraud or rare events)
Common Outlier Detection Methods
1. Statistical Methods
a) Z-Score
Measures how many standard deviations a value is from the mean.
python
from scipy import stats
import numpy as np

# Assumes df is a pandas DataFrame with a numeric column;
# nan_policy='omit' keeps missing values from turning every z-score into NaN
z_scores = np.abs(stats.zscore(df['column_name'], nan_policy='omit'))
outliers = df[z_scores > 3]  # Typically, |z| > 3 is treated as an outlier
b) IQR (Interquartile Range)
Flags values that lie more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3); the IQR spans the middle 50% of the data.
python
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column_name'] < Q1 - 1.5 * IQR) |
              (df['column_name'] > Q3 + 1.5 * IQR)]
✅ Best for: Univariate, small to medium datasets
❌ Not ideal for multivariate or skewed data
2. Visualization Techniques
Boxplots: Show outliers as points outside whiskers
Scatter plots: Help spot outliers in 2D data
Histograms (e.g., sns.histplot): Reveal unusual spikes or gaps in the distribution
python
import seaborn as sns
sns.boxplot(x=df['column_name'])
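The scatter and histogram views mentioned above follow the same pattern. A minimal sketch, assuming df also contains numeric columns named col_x and col_y (placeholder names, not from the original):
python
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram: isolated bars far out in the tails hint at outliers
sns.histplot(df['column_name'], bins=30)
plt.show()

# Scatter plot: points far from the main cloud stand out in 2D
sns.scatterplot(x=df['col_x'], y=df['col_y'])
plt.show()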
3. Distance-Based Methods
a) k-Nearest Neighbors (k-NN) / Local Outlier Factor (LOF)
Points that lie far from their nearest neighbors, or in much sparser regions than their neighbors (which is what LOF measures), are flagged as outliers.
python
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20)
y_pred = lof.fit_predict(X) # -1 = outlier, 1 = inlier
✅ Best for: Multivariate data
❌ Sensitive to scaling
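Because distance-based detectors compare raw feature distances, standardizing first keeps any single large-scale feature from dominating. A minimal sketch, assuming X is the same numeric feature matrix as above:
python
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

# Standardize features so each contributes equally to the distance metric
X_scaled = StandardScaler().fit_transform(X)

lof = LocalOutlierFactor(n_neighbors=20)
y_pred = lof.fit_predict(X_scaled)  # -1 = outlier, 1 = inlier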
4. Density-Based Methods
a) DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Groups points in dense regions into clusters; points that belong to no cluster are labeled noise and treated as outliers.
python
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
outliers = X[model.labels_ == -1] # -1 means outlier
✅ Great for: Spatial and clustering-based data
❌ Requires tuning parameters (eps, min_samples)
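One widely used heuristic for choosing eps (an addition beyond the original text) is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbor and read eps off the "elbow" of the curve. A minimal sketch:
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Distance from each point to its 5th-closest neighbor
# (the query point itself counts as the first neighbor here)
nn = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.ylabel("5th-nearest-neighbor distance")  # the elbow suggests eps
plt.show()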
5. Machine Learning-Based Methods
a) Isolation Forest
Builds trees to isolate observations. Outliers are isolated faster.
python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05)  # expected fraction of outliers
y_pred = iso.fit_predict(X)  # -1 = outlier, 1 = inlier
✅ Scalable, good for high-dimensional data
❌ Less interpretable than simple statistical rules
b) One-Class SVM
Learns a boundary around the bulk of the training data; points falling outside it are flagged as outliers.
python
from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(nu=0.05, kernel="rbf")  # nu ~ upper bound on the outlier fraction
y_pred = ocsvm.fit_predict(X)  # -1 = outlier, 1 = inlier
✅ Effective for complex shapes
❌ Can be slow on large datasets
Summary Table
Method | Type | Best For | Pros | Cons
Z-Score / IQR | Statistical | Univariate data | Simple, fast | Not good for multivariate data
Boxplot / Histogram | Visual | Small datasets | Easy to interpret | Subjective
k-NN / LOF | Distance | Multivariate data | More accurate | Sensitive to scale
DBSCAN | Density | Clustering + noise | Detects clusters | Parameter tuning needed
Isolation Forest | ML-based | High-dimensional data | Fast, scalable | Less interpretable
One-Class SVM | ML-based | Complex data structures | Handles non-linearity | Slow on large datasets
✅ Best Practices
Always scale your data (e.g., with StandardScaler) before applying distance-based methods.
Combine visual, statistical, and model-based techniques for more reliable detection.
Handle outliers depending on context (a sketch follows this list):
Remove them if they are errors
Cap or transform them if they are extreme but valid
Keep them if they are meaningful (e.g., in fraud detection)
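For the "cap or transform" case, a minimal sketch that reuses the IQR fences from earlier ('column_name' is the same placeholder used throughout):
python
# Cap (winsorize) extreme-but-valid values at the IQR fences
Q1, Q3 = df['column_name'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df['column_name_capped'] = df['column_name'].clip(lower=lower, upper=upper)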