Common Steps to Preprocess Text Data
1. Lowercasing
Convert all characters to lowercase to ensure uniformity.
text = text.lower()
2. Remove Noise
Eliminate unwanted characters such as punctuation, special symbols, and numbers.
import re
text = re.sub(r'[^a-zA-Z\s]', '', text)
3. Tokenization
Split the text into individual words or tokens.
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
4. Stop Words Removal
Remove common words that don't add much meaning (like "and", "the", "is").
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
5. Stemming or Lemmatization
Stemming reduces words to their base form by chopping off suffixes.
Lemmatization uses vocabulary and morphological analysis to return the dictionary form (lemma), which is usually more accurate.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# OR
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
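To get a feel for the difference between the two, here is a small comparison; the outputs shown in the comments are typical for NLTK, though exact results can vary by version:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # 'studi'   (crude suffix chopping)
print(lemmatizer.lemmatize("studies"))           # 'study'   (a valid dictionary word)
print(lemmatizer.lemmatize("running"))           # 'running' (default POS is noun)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'     (passing the POS improves results)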
6. Handling Spelling Errors (optional)
Correct spelling using libraries like TextBlob or SymSpell.
from textblob import TextBlob
corrected_text = str(TextBlob(text).correct())
7. Remove Extra Whitespace
Clean up unnecessary spaces.
text = ' '.join(text.split())
⚙️ Optional Preprocessing Steps (Depending on Use Case)
Removing HTML tags: If scraping data from the web.
from bs4 import BeautifulSoup
text = BeautifulSoup(html, "html.parser").get_text()
Handling emojis/emoticons: useful for sentiment analysis.
Custom word normalization: e.g., converting "u" to "you".
Text segmentation: needed for languages like Chinese (a combined sketch for these three steps follows this list).
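A rough sketch of these three optional steps, assuming the third-party emoji and jieba packages are installed (pip install emoji jieba); the normalization dictionary is only an illustrative example:
import emoji   # converts emojis to descriptive text tokens
import jieba   # word segmentation for Chinese text

# Emojis/emoticons: replace emojis with text aliases (exact alias depends on the emoji package version)
text = emoji.demojize("I love NLP ❤️")   # e.g. "I love NLP :red_heart:"

# Custom word normalization: map informal spellings to standard forms
norm_map = {"u": "you", "r": "are", "gr8": "great"}
tokens = [norm_map.get(tok, tok) for tok in text.split()]

# Text segmentation: split Chinese text into words
segments = jieba.lcut("自然语言处理很有趣")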
📦 Libraries Commonly Used
nltk – Tokenization, stopwords, stemming, lemmatization
spaCy – Fast, production-oriented NLP pipeline (a short spaCy sketch follows this list)
re – Regex for text cleaning
BeautifulSoup – HTML parsing
TextBlob – Spelling correction and sentiment analysis
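As an illustration of how spaCy condenses several of the steps above into one pipeline, here is a hedged sketch assuming the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess_spacy(text):
    doc = nlp(text.lower())
    # keep alphabetic, non-stopword tokens and take their lemmas
    return [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]

print(preprocess_spacy("The striped bats are hanging on their feet."))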
🔧 Example: Simple Preprocessing Pipeline
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens
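A quick usage example; the download calls fetch the NLTK resources the pipeline relies on (only needed once), and the printed output is roughly what you can expect:
import nltk
nltk.download('punkt')       # tokenizer models (newer NLTK versions may also need 'punkt_tab')
nltk.download('stopwords')   # stop word lists
nltk.download('wordnet')     # lemmatizer dictionary

print(preprocess("The 3 cats were running faster than the dogs!"))
# e.g. ['cat', 'running', 'faster', 'dog']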
🧪 Final Notes
Preprocessing depends on the model you're using. For deep learning models like BERT, minimal preprocessing is preferred (often just the model's own tokenization; see the sketch below).
Always consider the language and domain of your text data (e.g., legal, medical, casual).
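For example, with the Hugging Face transformers library (an illustrative sketch, not part of the pipeline above), the model's own tokenizer handles lowercasing and subword splitting, so raw text is passed in almost unchanged:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Preprocessing for BERT is usually minimal!", truncation=True)
# prints the subword tokens, including the special [CLS] and [SEP] markers
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))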