
How to Preprocess Text Data for NLP Applications

Common Steps to Preprocess Text Data

1. Lowercasing


Convert all characters to lowercase to ensure uniformity.


text = text.lower()


2. Remove Noise


Eliminate unwanted characters such as punctuation, special symbols, and numbers. Note that the regex below also strips accented and non-English letters, so adjust the character class for your language.


import re

text = re.sub(r'[^a-zA-Z\s]', '', text)


3. Tokenization


Split the text into individual words or tokens.


from nltk.tokenize import word_tokenize

# Requires the Punkt tokenizer models: nltk.download('punkt')
tokens = word_tokenize(text)


4. Stop Words Removal


Remove common words that don't add much meaning (like "and", "the", "is").


from nltk.corpus import stopwords

# Requires the stopwords corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

filtered_tokens = [word for word in tokens if word not in stop_words]


5. Stemming or Lemmatization


Stemming reduces words to a base form by chopping off suffixes. It is fast but crude, and can produce non-words (e.g., "studies" becomes "studi").


Lemmatization maps each word to its dictionary form (lemma) using vocabulary and morphological analysis, which is slower but more accurate.


from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]


# OR


from nltk.stem import WordNetLemmatizer

# Requires WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Note: lemmatize() treats words as nouns by default; pass pos='v' for verbs
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]


6. Handling Spelling Errors (optional)


Correct spelling using libraries like TextBlob or SymSpell.


from textblob import TextBlob

# Note: correct() is convenient but slow on large corpora
corrected_text = str(TextBlob(text).correct())
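
If you prefer SymSpell, here is a minimal sketch using the third-party symspellpy package and the English frequency dictionary it ships with:

import pkg_resources
from symspellpy import SymSpell

# Spell checker with a maximum edit distance of 2
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# Load the frequency dictionary bundled with symspellpy
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# lookup_compound corrects a whole sentence at once
corrected_text = sym_spell.lookup_compound(text, max_edit_distance=2)[0].term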


7. Remove Extra Whitespace


Clean up unnecessary spaces.


text = ' '.join(text.split())


⚙️ Optional Preprocessing Steps (Depending on Use Case)


Removing HTML tags: If scraping data from the web.


from bs4 import BeautifulSoup

text = BeautifulSoup(html, "html.parser").get_text()



Handling emojis/emoticons: For sentiment analysis, either remove them or convert them to text, as sketched below.
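
One simple approach, assuming the third-party emoji package (pip install emoji), is to convert emojis to descriptive text tokens:

import emoji

# Convert emojis to text aliases, e.g. a smiley becomes ":slightly_smiling_face:"
text = emoji.demojize(text)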


Custom word normalization: E.g., converting "u" to "you".
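
A minimal dictionary-based sketch (the mapping below is illustrative, not a standard resource):

# Illustrative slang map; extend it for your own domain
norm_map = {"u": "you", "r": "are", "gr8": "great"}

tokens = [norm_map.get(word, word) for word in tokens]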


Text segmentation: For languages written without spaces between words, such as Chinese.
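
For Chinese, the jieba library is a common choice; a minimal sketch:

import jieba

# Segment a Chinese sentence into a list of words
words = jieba.lcut("自然语言处理很有趣")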


📦 Libraries Commonly Used


nltk – Tokenization, stopwords, stemming, lemmatization


spaCy – Faster and more accurate NLP pipeline (see the sketch after this list)


re – Regex for text cleaning


BeautifulSoup – HTML parsing


TextBlob – Spelling correction and sentiment analysis
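
As a point of comparison with the NLTK snippets above, here is a minimal spaCy sketch that tokenizes, removes stop words, and lemmatizes in one pass. It assumes the small English model is installed (python -m spacy download en_core_web_sm):

import spacy

# Load the small English pipeline (tokenizer, tagger, lemmatizer, ...)
nlp = spacy.load("en_core_web_sm")

def spacy_preprocess(text):
    doc = nlp(text.lower())
    # Keep alphabetic, non-stop-word tokens and return their lemmas
    return [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]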


🧠 Example: Simple Preprocessing Pipeline

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet')

def preprocess(text):
    # Lowercase
    text = text.lower()

    # Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens
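
A quick usage example; with the NLTK data downloaded, a call like this returns the cleaned tokens:

print(preprocess("The cats are chasing the mice!"))
# e.g. ['cat', 'chasing', 'mouse']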


🧪 Final Notes


Preprocessing depends on the model you're using. For deep learning models like BERT, minimal preprocessing is preferred (often just tokenization).
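
For example, with a pretrained BERT model you normally pass raw text directly to the model's own tokenizer. A minimal sketch, assuming the Hugging Face transformers library:

from transformers import AutoTokenizer

# BERT's tokenizer handles lowercasing and subword splitting itself
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Preprocessing is model-dependent!", return_tensors="pt")
print(encoded["input_ids"])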


Always consider the language and domain of your text data (e.g., legal, medical, casual).
