How to Handle Large Datasets with Pandas
Pandas is a powerful library for data analysis, but working with large datasets (gigabytes or more) can be slow and memory-intensive.
The following techniques help you handle large datasets efficiently with pandas.
⚙️ 1. Use Chunking with read_csv()
Instead of loading the whole file at once, read it in smaller parts.
import pandas as pd

chunksize = 100000  # number of rows per chunk
for chunk in pd.read_csv('large_file.csv', chunksize=chunksize):
    process(chunk)  # custom function to handle each chunk
Useful for processing logs, summaries, or filtering without exhausting memory.
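For instance, here is a minimal sketch of a chunked aggregation; the file name and the 'category' column are assumptions for illustration.
import pandas as pd

# Count rows per category across chunks without ever holding the full file in memory
counts = pd.Series(dtype='int64')
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    counts = counts.add(chunk['category'].value_counts(), fill_value=0)

print(counts.sort_values(ascending=False))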
2. Use dtype and usecols to Reduce Memory Usage
Specify only the columns and types you need.
dtype = {
    'id': 'int32',
    'value': 'float32',
}
cols = ['id', 'value', 'category']
df = pd.read_csv('large_file.csv', usecols=cols, dtype=dtype)
✅ Depending on the data, this can reduce memory usage by 50% or more.
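To see the effect on your own data, compare memory usage before and after; this is a quick check that reuses the hypothetical large_file.csv from above.
import pandas as pd

df_full = pd.read_csv('large_file.csv')
df_slim = pd.read_csv(
    'large_file.csv',
    usecols=['id', 'value'],
    dtype={'id': 'int32', 'value': 'float32'},
)

# deep=True also counts the memory held by object (string) columns
print(f"full: {df_full.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"slim: {df_slim.memory_usage(deep=True).sum() / 1e6:.1f} MB")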
3. Convert Data Types After Loading
Sometimes columns default to larger types like float64 or object.
df['id'] = df['id'].astype('int32')
df['category'] = df['category'].astype('category') # for repeated text
Categorical types are ideal for columns with many repeated string values.
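With many columns, you can automate the conversion. The sketch below assumes df is the DataFrame loaded above, and the "mostly repeated values" threshold is just one reasonable rule of thumb.
import pandas as pd

# Downcast numeric columns to the smallest type that still fits their values
for col in df.select_dtypes(include='number').columns:
    downcast = 'integer' if df[col].dtype.kind in 'iu' else 'float'
    df[col] = pd.to_numeric(df[col], downcast=downcast)

# Convert low-cardinality text columns to 'category'
for col in df.select_dtypes(include='object').columns:
    if df[col].nunique() < 0.5 * len(df):  # mostly repeated values
        df[col] = df[col].astype('category')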
4. Filter or Sample Early
Only keep what you need from the start.
df = pd.read_csv('large_file.csv', nrows=100000)  # only read the first 100k rows

# or filter on the fly during chunking
filtered = []
for chunk in pd.read_csv('large_file.csv', chunksize=50000):
    small = chunk[chunk['status'] == 'active']
    filtered.append(small)
df = pd.concat(filtered)
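If you want a random sample rather than the first rows, read_csv accepts a callable for skiprows. A rough sketch, where the 1% sampling rate and file name are placeholders:
import random

import pandas as pd

random.seed(42)

# Keep the header (row 0) and roughly 1% of the data rows.
# The sample size is approximate, not exact.
df_sample = pd.read_csv(
    'large_file.csv',
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)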
5. Use Compression
Read and write compressed files to save disk space and I/O time.
df.to_csv('output.csv.gz', compression='gzip')
df = pd.read_csv('output.csv.gz', compression='gzip')
Supported formats include .gz, .bz2, .zip, and .xz.
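In recent pandas versions the compression is inferred from the file extension, so the explicit argument is optional. A small sketch, assuming df is the DataFrame from above:
import pandas as pd

# compression defaults to 'infer', so the .gz extension is enough
df.to_csv('output.csv.gz', index=False)
df = pd.read_csv('output.csv.gz')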
6. Use Dask for Out-of-Core Processing
For very large datasets that don’t fit into memory, use Dask as a pandas-compatible alternative.
import dask.dataframe as dd
ddf = dd.read_csv('large_file.csv')
result = ddf.groupby('category').value.mean().compute()
✅ Dask splits the data into partitions and processes them lazily and in parallel, spilling to disk when needed.
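Dask can also treat many files as one logical dataframe via a glob pattern; the directory layout below is an assumption for illustration.
import dask.dataframe as dd

# Read a directory of CSV shards as a single lazy dataframe
ddf = dd.read_csv('data/part-*.csv')

print(ddf.npartitions)                       # metadata only; nothing is loaded yet
summary = ddf['value'].describe().compute()  # triggers the parallel computation
print(summary)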
7. Save Intermediate Results with Feather or Parquet
These formats are much faster than CSV for storing and loading data.
df.to_parquet('data.parquet')
df = pd.read_parquet('data.parquet')
✅ Great for iterative workflows or repeated access.
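Feather works the same way (both formats need pyarrow installed); here df is assumed to be the DataFrame from above.
import pandas as pd

# Feather does not store a custom index, so reset it first
df = df.reset_index(drop=True)
df.to_feather('data.feather')
df = pd.read_feather('data.feather')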
Summary: Tips for Handling Big Data with Pandas
Strategy | Benefit
chunksize | Load and process data in parts
dtype, usecols | Reduce memory use
Convert to category, int32 | Optimize memory
Early filtering/sampling | Avoid processing unneeded data
Use compressed files | Save disk space and load faster
Use Dask or Vaex for huge data | Handle out-of-core data efficiently
Use Parquet or Feather | Faster I/O