How to Handle Large Datasets with Pandas

๐Ÿผ How to Handle Large Datasets with Pandas

Pandas is a powerful library for data analysis, but working with large datasets (gigabytes or more) can be slow or memory-intensive.

Here are proven techniques to handle large datasets efficiently using pandas.


⚙️ 1. Use Chunking with read_csv()

Instead of loading the whole file at once, read it in smaller parts.



import pandas as pd

chunksize = 100000  # number of rows per chunk

for chunk in pd.read_csv('large_file.csv', chunksize=chunksize):
    process(chunk)  # your own function that handles each chunk

🔄 Useful for processing logs, summaries, or filtering without exhausting memory.
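
For example, a per-category row count can be accumulated across chunks without ever holding the whole file in memory. A minimal sketch, reusing the 'category' column and file name from the other examples in this post:

import pandas as pd
from collections import Counter

counts = Counter()
for chunk in pd.read_csv('large_file.csv', chunksize=100000, usecols=['category']):
    # add this chunk's counts to the running total
    counts.update(chunk['category'].value_counts().to_dict())

print(counts.most_common(5))  # top categories, computed without loading the full file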


🧪 2. Use dtype and usecols to Reduce Memory Usage

Specify only the columns and types you need.



dtype = {
    'id': 'int32',
    'value': 'float32'
}

cols = ['id', 'value', 'category']

df = pd.read_csv('large_file.csv', usecols=cols, dtype=dtype)

✅ This can reduce memory usage by 50% or more.
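
To check the savings on your own data, compare memory usage with and without these options on a sample of rows (a minimal sketch; the column names follow the example above):

full = pd.read_csv('large_file.csv', nrows=100000)
slim = pd.read_csv('large_file.csv', nrows=100000,
                   usecols=['id', 'value', 'category'],
                   dtype={'id': 'int32', 'value': 'float32'})

print(f"default:   {full.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"optimized: {slim.memory_usage(deep=True).sum() / 1e6:.1f} MB")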


🔃 3. Convert Data Types After Loading

Sometimes columns default to larger types like float64 or object.



df['id'] = df['id'].astype('int32')
df['category'] = df['category'].astype('category')  # for repeated text

📉 Categorical types are ideal for columns with many repeated string values.
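
If you don't want to pick target types by hand, pd.to_numeric can downcast numeric columns automatically. A minimal sketch, reusing the 'id' and 'value' columns from the earlier example:

# let pandas choose the smallest numeric type that still fits the data
df['id'] = pd.to_numeric(df['id'], downcast='integer')      # e.g. int64 -> int8/int16/int32
df['value'] = pd.to_numeric(df['value'], downcast='float')  # e.g. float64 -> float32

print(df.dtypes)  # confirm the resulting types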


🚀 4. Filter or Sample Early

Only keep what you need from the start.



df = pd.read_csv('large_file.csv', nrows=100000)  # only read the first 100k rows

# or filter on the fly during chunking
filtered = []

for chunk in pd.read_csv('large_file.csv', chunksize=50000):
    small = chunk[chunk['status'] == 'active']
    filtered.append(small)

df = pd.concat(filtered)
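
Because read_csv also accepts a callable for skiprows, you can take a random sample at read time instead of keeping only the first rows. A sketch; the 1% fraction here is an arbitrary choice:

import random

keep_fraction = 0.01  # keep roughly 1% of the rows
df_sample = pd.read_csv(
    'large_file.csv',
    skiprows=lambda i: i > 0 and random.random() > keep_fraction,  # row 0 is the header
)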

🧊 5. Use Compression

Read and write compressed files to save disk space and I/O time.



df.to_csv('output.csv.gz', index=False, compression='gzip')  # index=False avoids writing an extra index column
df = pd.read_csv('output.csv.gz', compression='gzip')

Supported formats: .gz, .bz2, .zip, .xz
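
In recent pandas versions the compression is inferred from the file extension by default, so the compression argument can usually be left out. A minimal sketch:

df.to_csv('output.csv.gz', index=False)  # gzip inferred from the .gz extension
df = pd.read_csv('output.csv.gz')        # inferred on read as well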


🧵 6. Use Dask for Out-of-Core Processing

For very large datasets that don’t fit into memory, use Dask as a pandas-compatible alternative.



import dask.dataframe as dd

ddf = dd.read_csv('large_file.csv')
result = ddf.groupby('category').value.mean().compute()

✅ Dask loads and processes data in parallel, using disk storage when needed.
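
A common follow-up step (not shown in the snippet above) is to convert the CSV to partitioned Parquet once with Dask, so later runs read the faster columnar format instead. A sketch; requires pyarrow or fastparquet:

import dask.dataframe as dd

ddf = dd.read_csv('large_file.csv')
ddf.to_parquet('large_file_parquet/')         # one Parquet file per partition

ddf = dd.read_parquet('large_file_parquet/')  # much faster to reload later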


📌 7. Save Intermediate Results with Feather or Parquet

These formats are much faster than CSV for storing and loading data.



df.to_parquet('data.parquet')
df = pd.read_parquet('data.parquet')

✅ Great for iterative workflows or repeated access.
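
Feather works the same way and is typically even faster for short-lived intermediate files (it requires pyarrow to be installed). A minimal sketch:

df = df.reset_index(drop=True)   # Feather requires a default index
df.to_feather('data.feather')
df = pd.read_feather('data.feather')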


🧠 Summary: Tips for Handling Big Data with Pandas

Strategy | Benefit
chunksize | Load and process data in parts
dtype, usecols | Reduce memory use
Convert to category, int32 | Optimize memory
Early filtering/sampling | Avoid processing unneeded data
Use compressed files | Save disk space and load faster
Use Dask or Vaex for huge data | Handle out-of-core data efficiently
Use Parquet or Feather | Faster I/O

