Building Your First Data Science Project in Jupyter Notebook
Jupyter Notebook is one of the most popular tools for data science because it allows you to write and run code interactively alongside notes, visualizations, and explanations. Let’s walk through creating a basic data science project.
Step 1: Set Up Your Environment
Install Anaconda (recommended)
Anaconda is a Python distribution that comes pre-packaged with Jupyter Notebook and many data science libraries.
Download & install: https://www.anaconda.com/products/distribution
Alternatively, you can install Jupyter via pip:
pip install notebook
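If you go the pip route, it is optional but good practice to install into a virtual environment so Jupyter and its dependencies stay isolated from your system Python (the environment name ds-env below is just an example):
python -m venv ds-env
source ds-env/bin/activate   # on Windows: ds-env\Scripts\activate
pip install notebook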
Step 2: Launch Jupyter Notebook
Open your terminal or Anaconda Navigator.
Run the command:
jupyter notebook
This will open Jupyter in your browser.
Step 3: Create a New Notebook
In the Jupyter interface, click New > Python 3 to create a new notebook.
Rename your notebook (e.g., First_Data_Science_Project).
Step 4: Import Libraries
At the top of your notebook, import the necessary Python libraries for data science:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pandas: For data manipulation
numpy: For numerical operations
matplotlib/seaborn: For data visualization
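If any of these imports fail, the library is not installed in the environment Jupyter is running from. An optional sanity check is to print each library's version:
import matplotlib
# Confirm each library imported correctly and report its version
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("matplotlib:", matplotlib.__version__)
print("seaborn:", sns.__version__)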
Step 5: Load Your Dataset
You can use publicly available datasets. For example, the famous Titanic dataset:
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)
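Loading from the URL needs an internet connection. If you want the notebook to work offline later, one option is to save a local copy on the first run (the filename titanic.csv is just an example):
# Save a local copy so later runs don't need network access
data.to_csv('titanic.csv', index=False)
# On later runs you could load it with: data = pd.read_csv('titanic.csv')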
Check the first few rows:
data.head()
Step 6: Explore the Data (Exploratory Data Analysis - EDA)
Understand your data’s structure and contents.
# Summary statistics
data.describe()
# Data info
data.info()
# Check for missing values
data.isnull().sum()
Visualize distributions:
sns.countplot(x='Survived', data=data)
plt.title('Survival Counts')
plt.show()
sns.histplot(data['Age'].dropna(), bins=30)
plt.title('Age Distribution')
plt.show()
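To see how the numeric columns relate to each other, a correlation heatmap is a useful next plot. Here is a minimal sketch that restricts to numeric columns, since corr() only works on numbers:
# Correlation heatmap over the numeric columns only
numeric = data.select_dtypes(include='number')
sns.heatmap(numeric.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()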
Step 7: Clean the Data
Handle missing values or incorrect data:
# Fill missing Age values with the median (plain assignment avoids pandas' chained-assignment warning)
data['Age'] = data['Age'].fillna(data['Age'].median())
# Drop rows where 'Embarked' is missing
data = data.dropna(subset=['Embarked'])
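Re-run the missing-value check to confirm the cleaning worked:
# Should now report 0 missing values for both columns
data[['Age', 'Embarked']].isnull().sum()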
Step 8: Feature Engineering
Create or modify features to improve your model.
# Convert categorical columns to numeric
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
# Create a new feature: FamilySize
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
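As an optional extension, categorical columns with more than two values, such as Embarked (ports C, Q, S), can be one-hot encoded with pandas. The model in Step 9 doesn't use these new columns, so this is just a sketch of the technique:
# One-hot encode Embarked into Embarked_C / Embarked_Q / Embarked_S columns
embarked_dummies = pd.get_dummies(data['Embarked'], prefix='Embarked')
data = pd.concat([data, embarked_dummies], axis=1)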
Step 9: Build a Simple Machine Learning Model
Let’s predict survival using logistic regression.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Features and target
X = data[['Pclass', 'Sex', 'Age', 'FamilySize']]
y = data['Survived']
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the model (raise max_iter so the solver converges without warnings)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
Step 10: Summarize Your Findings
Write markdown cells to explain:
What the data represents
Key insights from EDA
How you handled missing data
The model used and its performance
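For example, a summary cell might look like this (the details are placeholders; fill in your own results):
## Summary
- **Data:** Titanic passenger records, one row per passenger
- **Key insight:** survival rates differ sharply by sex and passenger class
- **Missing data:** Age filled with the median; rows missing Embarked dropped
- **Model:** logistic regression on Pclass, Sex, Age, FamilySize; test accuracy: <your value>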
Bonus Tips
Save your notebook regularly.
Use Markdown cells to keep your project organized and readable.
Experiment with more complex models such as Random Forest, SVM, or XGBoost as you learn.
Try visualizing feature importance or confusion matrices to evaluate models (see the confusion matrix example in Step 9, and the Random Forest sketch below).
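As a starting point for the last two tips, here is a minimal sketch that swaps in a Random Forest (reusing the X_train/X_test split from Step 9) and plots the feature importances it learns:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", rf.score(X_test, y_test))
# Bar chart of the importance the forest assigns to each feature
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot(kind='barh')
plt.title('Feature Importance (Random Forest)')
plt.show()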