Using Hugging Face for NLP Projects
Build real-world NLP applications with pre-trained models in minutes.
What is Hugging Face?
Hugging Face is a company and open-source platform known for:
Transformers library (for NLP, vision, audio, and more)
Thousands of pretrained models (BERT, GPT, T5, etc.)
Datasets for ML/NLP tasks
Model inference, training, and deployment tools
Key Libraries You'll Use
Library | Use
transformers | Load, train, and use pre-trained models
datasets | Load and process popular NLP datasets
tokenizers | Fast tokenization
accelerate | Speed up training on CPUs/GPUs
gradio | Easily build and demo NLP apps
Installation
pip install transformers datasets
Optionally, add:
pip install gradio accelerate
Common NLP Tasks with Hugging Face
Task | Example
Text Classification | Sentiment analysis, topic detection
Named Entity Recognition (NER) | Extract names, locations, dates
Question Answering | Answer questions based on context
Text Generation | Generate creative or informative text
Translation | English → French, etc.
Summarization | Convert long text to short summaries
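Each of these tasks maps directly to a pipeline() task name. A minimal sketch of how you would instantiate each one (the default checkpoints are chosen by the library and may change between releases):

from transformers import pipeline

ner = pipeline("ner")                          # named entity recognition
qa = pipeline("question-answering")            # extractive QA over a context
generator = pipeline("text-generation")        # free-form text generation
translator = pipeline("translation_en_to_fr")  # English to French
summarizer = pipeline("summarization")         # long text to short summary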
Quick Start: Sentiment Analysis
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love using Hugging Face for NLP!")[0]
print(f"Label: {result['label']}, Confidence: {result['score']:.2f}")
Output:
Label: POSITIVE, Confidence: 0.99
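The pipeline also accepts a list of strings, which is convenient for scoring several inputs in one call:

results = classifier([
    "This library saves me hours.",
    "The documentation could be clearer.",
])
for r in results:
    print(r["label"], round(r["score"], 2))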
Tokenization and Model Inference (Manual)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Tokenize the input and return PyTorch tensors
inputs = tokenizer("Transformers are amazing!", return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits into class probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(probs)
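To turn those probabilities into a readable label, look up the model's built-in id2label mapping:

# Map the highest-probability index back to its label string
pred_id = probs.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # e.g. "POSITIVE"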
Using NLP Datasets
from datasets import load_dataset
dataset = load_dataset("imdb") # Sentiment dataset
print(dataset["train"][0]) # First review
Other popular datasets:
ag_news – News classification
squad – Question answering
conll2003 – NER
common_voice – Speech-to-text
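For quick experiments you rarely need a full split. A minimal sketch using the datasets slicing syntax:

from datasets import load_dataset

# Load only the first 1,000 training examples
small_train = load_dataset("imdb", split="train[:1000]")

# Shuffle and sample after loading
sample = small_train.shuffle(seed=42).select(range(100))
print(len(sample))  # 100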
Fine-Tuning a Pretrained Model (Text Classification)
Step 1: Load Dataset
from datasets import load_dataset
dataset = load_dataset("ag_news")
Step 2: Preprocess and Tokenize
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
tokenized_dataset = dataset.map(tokenize, batched=True)
Step 3: Train Model
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)

trainer.train()
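By default the Trainer reports only loss. To also track accuracy during evaluation, you can pass a compute_metrics function when constructing the Trainer; a minimal sketch using plain NumPy (no extra dependency assumed):

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred bundles the model's logits and the true labels
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

# Pass it in with Trainer(..., compute_metrics=compute_metrics)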
Demo Your Model with Gradio
import gradio as gr
from transformers import pipeline

# Sentiment pipeline (same as the quick start above)
classifier = pipeline("sentiment-analysis")

def predict_sentiment(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2f})"

gr.Interface(fn=predict_sentiment, inputs="text", outputs="text").launch()
Top Hugging Face Models by Task
Task | Model
Sentiment Analysis | distilbert-base-uncased-finetuned-sst-2-english
NER | dslim/bert-base-NER
QA | deepset/roberta-base-squad2
Text Generation | gpt2, tiiuae/falcon-7b-instruct
Summarization | facebook/bart-large-cnn
Translation | Helsinki-NLP/opus-mt-en-fr
Find more here: https://huggingface.co/models
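Any of these checkpoints can be plugged straight into pipeline() via the model argument, for example:

from transformers import pipeline

# Use a specific checkpoint instead of the task's default
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Placeholder text; pass a real article for a meaningful summary
summary = summarizer("Long article text goes here...", max_length=60, min_length=20)
print(summary[0]["summary_text"])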
Bonus: Hugging Face Hub
Upload and share your models/datasets:
pip install huggingface_hub
huggingface-cli login # use your HF token
Upload a model:
from huggingface_hub import notebook_login
notebook_login()
model.push_to_hub("your-model-name")
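If you fine-tuned with a tokenizer, push it under the same repo name so others can load the model and tokenizer together:

# Keep the tokenizer paired with the model on the Hub
tokenizer.push_to_hub("your-model-name")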
Final Tips
Use pipeline() for rapid prototyping
Fine-tune when pre-trained accuracy isn’t enough
Use datasets library for real-world NLP data
Use gradio or streamlit for demos
Use accelerate to scale training across GPUs and enable mixed precision