Building Your First Data Science Project in Jupyter Notebook
Jupyter Notebook is one of the most popular tools for data science because it allows you to write and run code interactively alongside notes, visualizations, and explanations. Let’s walk through creating a basic data science project.
Step 1: Set Up Your Environment
Install Anaconda (recommended)
Anaconda is a Python distribution that comes pre-packaged with Jupyter Notebook and many data science libraries.
Download & install: https://www.anaconda.com/products/distribution
Alternatively, you can install Jupyter via pip:
pip install notebook
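If you go the pip route, it is optional but good practice to install into a virtual environment so Jupyter and its dependencies stay isolated from your system Python (the environment name ds-env below is just an example):
python -m venv ds-env
source ds-env/bin/activate   # on Windows: ds-env\Scripts\activate
pip install notebook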
Step 2: Launch Jupyter Notebook
Open your terminal or Anaconda Navigator.
Run the command:
jupyter notebook
This will open Jupyter in your browser.
Step 3: Create a New Notebook
In the Jupyter interface, click New > Python 3 to create a new notebook.
Rename your notebook (e.g., First_Data_Science_Project).
Step 4: Import Libraries
At the top of your notebook, import the necessary Python libraries for data science:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pandas: For data manipulation
numpy: For numerical operations
matplotlib/seaborn: For data visualization
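If any of these imports fail, the library is not installed in the environment Jupyter is running from. An optional sanity check is to print each library's version:
import matplotlib
# Confirm each library imported correctly and report its version
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("matplotlib:", matplotlib.__version__)
print("seaborn:", sns.__version__)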
Step 5: Load Your Dataset
You can use publicly available datasets. For example, the famous Titanic dataset:
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)
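Loading from the URL needs an internet connection. If you want the notebook to work offline later, one option is to save a local copy on the first run (the filename titanic.csv is just an example):
# Save a local copy so later runs don't need network access
data.to_csv('titanic.csv', index=False)
# On later runs you could load it with: data = pd.read_csv('titanic.csv')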
Check the first few rows:
data.head()
Step 6: Explore the Data (Exploratory Data Analysis - EDA)
Understand your data’s structure and contents.
# Summary statistics
data.describe()
# Data info
data.info()
# Check for missing values
data.isnull().sum()
Visualize distributions:
sns.countplot(x='Survived', data=data)
plt.title('Survival Counts')
plt.show()
sns.histplot(data['Age'].dropna(), bins=30)
plt.title('Age Distribution')
plt.show()
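To see how the numeric columns relate to each other, a correlation heatmap is a useful next plot. Here is a minimal sketch that restricts to numeric columns, since corr() only works on numbers:
# Correlation heatmap over the numeric columns only
numeric = data.select_dtypes(include='number')
sns.heatmap(numeric.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()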
Step 7: Clean the Data
Handle missing values or incorrect data:
# Fill missing Age values with the median (plain assignment avoids pandas' chained-assignment warning)
data['Age'] = data['Age'].fillna(data['Age'].median())
# Drop rows where 'Embarked' is missing
data = data.dropna(subset=['Embarked'])
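Re-run the missing-value check to confirm the cleaning worked:
# Should now report 0 missing values for both columns
data[['Age', 'Embarked']].isnull().sum()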
Step 8: Feature Engineering
Create or modify features to improve your model.
# Convert categorical columns to numeric
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
# Create a new feature: FamilySize
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
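As an optional extension, categorical columns with more than two values, such as Embarked (ports C, Q, S), can be one-hot encoded with pandas. The model in Step 9 doesn't use these new columns, so this is just a sketch of the technique:
# One-hot encode Embarked into Embarked_C / Embarked_Q / Embarked_S columns
embarked_dummies = pd.get_dummies(data['Embarked'], prefix='Embarked')
data = pd.concat([data, embarked_dummies], axis=1)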
Step 9: Build a Simple Machine Learning Model
Let’s predict survival using logistic regression.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Features and target
X = data[['Pclass', 'Sex', 'Age', 'FamilySize']]
y = data['Survived']
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the model (raise max_iter so the solver converges without warnings)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
Step 10: Summarize Your Findings
Write markdown cells to explain:
What the data represents
Key insights from EDA
How you handled missing data
The model used and its performance
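For example, a summary cell might look like this (the details are placeholders; fill in your own results):
## Summary
- **Data:** Titanic passenger records, one row per passenger
- **Key insight:** survival rates differ sharply by sex and passenger class
- **Missing data:** Age filled with the median; rows missing Embarked dropped
- **Model:** logistic regression on Pclass, Sex, Age, FamilySize; test accuracy: <your value>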
Bonus Tips
Save your notebook regularly.
Use Markdown cells to keep your project organized and readable.
Experiment with more complex models such as Random Forest, SVM, or XGBoost as you learn.
Try visualizing feature importance or confusion matrices to evaluate models (see the confusion matrix example in Step 9, and the Random Forest sketch below).
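As a starting point for the last two tips, here is a minimal sketch that swaps in a Random Forest (reusing the X_train/X_test split from Step 9) and plots the feature importances it learns:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", rf.score(X_test, y_test))
# Bar chart of the importance the forest assigns to each feature
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot(kind='barh')
plt.title('Feature Importance (Random Forest)')
plt.show()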