Friday, October 3, 2025


Building a Text Classification Model with Deep Learning

Step-by-Step Guide to Building a Text Classification Model

1. Define the Problem


Text classification assigns predefined categories to text documents. Examples include:


Spam detection (spam or not)


Sentiment analysis (positive, neutral, negative)


Topic classification (sports, politics, tech, etc.)
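
Whatever the task, the training data has the same shape: each document paired with a class label. A toy sentiment example (the texts and labels below are invented purely for illustration):

sample_texts = [
    "I loved this movie, the acting was superb",  # positive
    "Terrible plot and wooden dialogue",          # negative
]
sample_labels = [1, 0]  # 1 = positive, 0 = negative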


2. Collect and Prepare the Dataset


You can use datasets from sources like:


Kaggle


Hugging Face Datasets


Scikit-learn (e.g., 20 Newsgroups; loaded in the snippet below)
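
For instance, a two-category slice of 20 Newsgroups loads in a couple of lines from scikit-learn (the category choice here is arbitrary):

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='train',
                                categories=['sci.space', 'rec.autos'])
texts, labels = newsgroups.data, newsgroups.target  # raw strings and integer labels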


Example: IMDB Movie Reviews Dataset


from tensorflow.keras.datasets import imdb

from tensorflow.keras.preprocessing.sequence import pad_sequences


# Load dataset
vocab_size = 10000  # keep only the 10,000 most frequent words
max_len = 200       # pad/truncate every review to 200 tokens

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)
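
The IMDB data arrives pre-tokenized as integer IDs. To sanity-check what a review looks like, you can invert the word index (Keras reserves indices 0-2 for padding, sequence start, and unknown tokens, so the raw indices are offset by 3):

word_index = imdb.get_word_index()
index_word = {index + 3: word for word, index in word_index.items()}  # account for reserved IDs

# Decode the first training review back to (approximate) text
decoded = " ".join(index_word.get(i, "?") for i in x_train[0])
print(decoded)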


3. Preprocess the Text


If you're not using a preprocessed dataset:


Clean the text (remove punctuation, lowercase, etc.; a small helper is sketched after this list)


Tokenize (convert text to integers)


Pad sequences to equal length
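
A small cleaning helper might look like this (one reasonable approach; adjust the regex to your data):

import re

def clean_text(text):
    """Lowercase and keep only letters, digits, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

texts = [clean_text(t) for t in texts]  # texts: raw strings collected in step 2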


Using Keras Tokenizer:


from tensorflow.keras.preprocessing.text import Tokenizer

from tensorflow.keras.preprocessing.sequence import pad_sequences


tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")  # words outside the top 10,000 map to <OOV>
tokenizer.fit_on_texts(texts)  # texts: the list of cleaned strings from above

sequences = tokenizer.texts_to_sequences(texts)  # strings -> lists of integer IDs
padded = pad_sequences(sequences, maxlen=200, truncating='post')  # equal-length sequences, cut at the end


4. Build the Model


Use an embedding layer plus a deep learning architecture (like an LSTM, GRU, or CNN). An LSTM version is shown first; a CNN variant follows it.


Example: LSTM Model

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout


model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128),  # input_length is optional and was removed in Keras 3
    LSTM(64, return_sequences=False),  # keep only the final hidden state
    Dropout(0.5),                      # regularization against overfitting
    Dense(1, activation='sigmoid')     # For binary classification
])


model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
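
For comparison, the CNN variant mentioned above swaps the LSTM for a 1-D convolution, which often trains faster on text (a sketch, not tuned):

from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D

cnn_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128),
    Conv1D(128, 5, activation='relu'),  # 128 filters over 5-token windows
    GlobalMaxPooling1D(),               # keep the strongest response per filter
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
cnn_model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])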


5. Train the Model

history = model.fit(

    x_train, y_train,

    epochs=5,

    validation_data=(x_test, y_test),

    batch_size=64

)
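
Five epochs is a reasonable starting point, but an early-stopping callback lets you set a higher epoch budget without overfitting; one common refinement (the patience of 2 epochs here is an arbitrary choice):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2,
                           restore_best_weights=True)

history = model.fit(
    x_train, y_train,
    epochs=20,  # upper bound; training stops early if val_loss stalls
    validation_data=(x_test, y_test),
    batch_size=64,
    callbacks=[early_stop]
)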


6. Evaluate the Model

loss, accuracy = model.evaluate(x_test, y_test)

print(f"Test Accuracy: {accuracy*100:.2f}%")



Optional: Plot accuracy/loss graphs to see model performance over epochs.
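
Using the history object returned by fit, one way to draw those curves with matplotlib:

import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()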


7. Make Predictions

predictions = model.predict(x_test)

predicted_classes = (predictions > 0.5).astype("int32")
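
To classify new, raw text, it must pass through exactly the same tokenization and padding as the training data. A minimal sketch, assuming the Keras tokenizer fitted in step 3 (the pre-tokenized IMDB arrays above would instead need the dataset's own word index):

new_texts = ["The film was a delightful surprise"]  # hypothetical new input

seqs = tokenizer.texts_to_sequences(new_texts)  # reuse the fitted tokenizer
padded_new = pad_sequences(seqs, maxlen=max_len)

probs = model.predict(padded_new)
print((probs > 0.5).astype("int32"))  # 1 = positive, 0 = negative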


8. (Optional) Save and Load the Model

# Save

model.save("text_classification_model.h5")


# Load

from tensorflow.keras.models import load_model

model = load_model("text_classification_model.h5")
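
HDF5 (.h5) still works but is treated as a legacy format; recent Keras versions prefer the native .keras format:

model.save("text_classification_model.keras")
model = load_model("text_classification_model.keras")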


🧠 Alternatives and Improvements


Use pretrained embeddings (like GloVe or Word2Vec)


Fine-tune transformers (like BERT) for better accuracy (a short sketch follows this list)


Try data augmentation for small datasets


Implement attention mechanisms
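
As a taste of the transformer route, here is a minimal fine-tuning sketch using Hugging Face's transformers library (assumes pip install transformers, and that texts and labels are the raw strings and integer labels from the earlier steps):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Encode raw strings into the input IDs and attention masks BERT expects
encodings = bert_tokenizer(texts, truncation=True, padding=True,
                           max_length=200, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(16)

bert_model.compile(
    optimizer=tf.keras.optimizers.Adam(2e-5),  # a small learning rate is typical for fine-tuning
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
bert_model.fit(dataset, epochs=2)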


🛠️ Tools and Libraries


TensorFlow / Keras – for building and training models


Scikit-learn – for metrics and preprocessing


NLTK / SpaCy – for natural language preprocessing


Hugging Face Transformers – for state-of-the-art models like BERT
