Machine Translation with Transformer in Python
Machine translation converts text from one language to another and enables tools like Google Translate. Modern translation models use Transformer-based architectures, which capture context efficiently. In this article we fine-tune a pre-trained Transformer model from Hugging Face to translate English to Hindi.
Understanding Transformers
The Transformer is a deep learning model introduced in the paper Attention Is All You Need by Vaswani et al. It is widely used in natural language processing (NLP) tasks, including machine translation, because of its ability to capture long-range dependencies in text. A Transformer consists of two main components:
- Encoder: Reads and processes the input text (English sentence in our case).
- Decoder: Generates the output text in the target language (Hindi sentence in our case).
The key mechanism in Transformers is self-attention, which helps the model focus on the relevant words in a sentence while translating. Unlike traditional sequential models, Transformers process entire sentences in parallel, making them highly efficient. We will use a pre-trained Transformer model from Helsinki-NLP, an open-source NLP project that provides various translation models. Specifically, we will fine-tune its English-to-Hindi model on our dataset.
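To make the self-attention idea more concrete, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the heart of the Transformer. The matrices and dimensions below are illustrative toy values, not taken from the translation model used later in this article.
Python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # weighted sum of value vectors

# Toy example: 3 tokens, 4-dimensional representations
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)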
Machine Translation using Transformers
Transformers have significantly improved the quality and efficiency of machine translation models. In this section, we will use Hugging Face's transformer models to perform English to Hindi translation.
1. Library installation
Before starting, ensure that you have the required libraries installed in your environment. Use the following commands to install them:
Python
!pip install datasets
!pip install transformers
!pip install sentencepiece
!pip install transformers[torch]
!pip install sacrebleu
!pip install evaluate
!pip install accelerate -U
!pip install gradio
!pip install kaleido cohere openai tiktoken typing-extensions==4.5.0
We will use the cfilt/iitb-english-hindi dataset available on Hugging Face.
The IIT Bombay English-Hindi corpus comprises parallel English-Hindi texts and monolingual Hindi texts, sourced over time from various existing platforms and from corpora established at the Center for Indian Language Technology (CFILT), IIT Bombay. It is a resource for training and evaluating English-Hindi machine translation models. Researchers and developers can use the dataset to improve the accuracy and performance of machine translation systems for these languages.
To get more specific details about the "cfilt/iitb-english-hindi" dataset, including its size, source, and any specific characteristics, check the official documentation or publications from CFILT or IITB.
2. Dataset loading
We will use the cfilt/iitb-english-hindi dataset available on Hugging Face for this tutorial. This dataset consists of parallel texts for English-Hindi translation and is ideal for training and evaluating machine translation models.
To load the dataset:
Python
from datasets import load_dataset
dataset = load_dataset("cfilt/iitb-english-hindi")
The dataset is a dictionary-like object containing splits like "train", "validation", and "test".
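As a quick, optional sanity check, you can print the dataset object and look at one translation pair:
Python
print(dataset)  # shows the train/validation/test splits and their sizes
print(dataset["validation"][0]["translation"])  # a single {'en': ..., 'hi': ...} pair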
3. Model and Tokenizer loading
We will use a pre-trained translation model, Helsinki-NLP/opus-mt-en-hi, for English to Hindi translation. The AutoTokenizer and AutoModelForSeq2SeqLM classes from the Hugging Face transformers library allow us to load the tokenizer and model.
Loading the model and tokenizer:
Python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

max_length = 256  # maximum sequence length used for tokenization and generation

# Load the pre-trained English-to-Hindi tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
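As another optional check, you can inspect the model configuration; the layer counts reported here are also relevant for the layer-freezing step later (the exact values depend on the checkpoint):
Python
# Number of encoder and decoder layers and the shared vocabulary size
print(model.config.encoder_layers, model.config.decoder_layers)
print(model.config.vocab_size)
print(model.num_parameters())  # total number of parameters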
Example Translation:
Let us look at the output of the pre-trained model on one of the validation examples. The input sentence is: 'Rajesh Gavre, the President of the MNPA teachers association, honoured the school by presenting the award'.
Python
article = dataset['validation'][2]['translation']['en']
inputs = tokenizer(article, return_tensors="pt")
translated_tokens = model.generate(
    **inputs, max_length=256
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
Output:
'एमएनएपी शिक्षकों के राष्ट्रपति, राजस्वीवर ने इस पुरस्कार को पेश करके स्कूल की प्रतिष्ठा की'
Let's check the expected output using the following code.
Python
dataset['validation'][2]['translation']['hi']
Output:
'मनपा शिक्षक संघ के अध्यक्ष राजेश गवरे ने स्कूल को भेंट देकर सराहना की।'
Let us now fine-tune the model.
4. Tokenize the dataset
To fine-tune the model, we need to preprocess the dataset. This involves tokenizing both the input (English) and target (Hindi) sentences and ensuring that they are properly formatted for the model.
Python
def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["hi"] for ex in examples["translation"]]
    # Tokenize the English inputs and the Hindi targets
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True)
    labels = tokenizer(text_target=targets, max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
We apply this preprocessing to each example of the validation and test splits using the map function.
Python
tokenized_datasets_validation = dataset['validation'].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["validation"].column_names,
    batch_size=2,
)

tokenized_datasets_test = dataset['test'].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["test"].column_names,
    batch_size=2,
)
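If you want to verify the preprocessing, each tokenized example now contains input_ids, attention_mask and labels; the sequences are still unpadded at this point because padding is handled later by the data collator:
Python
sample = tokenized_datasets_validation[0]
print(sample.keys())  # input_ids, attention_mask, labels
print(len(sample["input_ids"]), len(sample["labels"]))  # unpadded lengths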
5. Define the datacollator
DataCollatorForSeq2Seq is a class designed to collate and process batches of data for sequence-to-sequence (seq2seq) tasks. It takes batches of already-tokenized examples and formats them in a way suitable for training seq2seq models.
By using the data collator, you ensure that the input and target sequences are appropriately padded and formatted before being fed into the model during training. It handles tasks such as padding sequences to the maximum length in a batch, creating attention masks, and organizing the data into the required format for seq2seq training.
Python
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
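To see what the collator actually produces, you can feed it a couple of tokenized examples and inspect the padded batch; padded label positions are set to -100 so that they are ignored by the loss:
Python
batch = data_collator([tokenized_datasets_validation[i] for i in range(2)])
print(batch["input_ids"].shape)   # both sequences padded to the longest in the batch
print(batch["labels"][:, -5:])    # trailing -100 values mark padded label positions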
6. Freeze model layers
To fine-tune the model efficiently, we can freeze the earlier layers and update only the last few. Here we first make every parameter trainable and then freeze the lower encoder and decoder layers.
Python
# Make every parameter trainable first
for parameter in model.parameters():
    parameter.requires_grad = True

num_layers_to_freeze = 10  # Adjust as needed

# Freeze all but the last `num_layers_to_freeze` encoder layers
for layer_index, layer in enumerate(model.model.encoder.layers):
    if layer_index < len(model.model.encoder.layers) - num_layers_to_freeze:
        for parameter in layer.parameters():
            parameter.requires_grad = False

# Freeze all but the last `num_layers_to_freeze` decoder layers
for layer_index, layer in enumerate(model.model.decoder.layers):
    if layer_index < len(model.model.decoder.layers) - num_layers_to_freeze:
        for parameter in layer.parameters():
            parameter.requires_grad = False
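A quick way to confirm which parameters will actually be updated is to count trainable versus total parameters. Note that if num_layers_to_freeze is larger than the number of layers in the model, the freezing condition above is never true and all layers stay trainable:
Python
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")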
7. Model evaluation
We use SacreBLEU for evaluating the model's performance. BLEU (Bilingual Evaluation Understudy) is a metric commonly used for evaluating machine translation models.
Python
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Decode the generated token ids back to text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 (ignored label positions) with the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
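To see how the metric behaves on its own, you can call it directly on a toy prediction/reference pair; the sentences here are purely illustrative:
Python
example = metric.compute(
    predictions=["the cat is on the mat"],
    references=[["the cat sits on the mat"]],
)
print(round(example["score"], 2))  # corpus-level BLEU for this single pair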
8. Model training
We define the training parameters using Seq2SeqTrainingArguments from Hugging Face and initiate training with Seq2SeqTrainer.
Python
import torch
from transformers import Seq2SeqTrainingArguments

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

training_args = Seq2SeqTrainingArguments(
    "finetuned-nlp-en-hi",        # output directory for checkpoints
    gradient_checkpointing=True,  # trade extra compute for lower memory use
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    warmup_steps=2,
    max_steps=2000,
    fp16=True,                    # mixed-precision training (requires a GPU)
    optim='adafactor',
    per_device_eval_batch_size=16,
    metric_for_best_model="eval_bleu",
    predict_with_generate=True,
    push_to_hub=False,
)
We initiate training using the code below. Note that, to keep this demonstration short, the smaller test split is used as the training set and the validation split is used for evaluation.
Python
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets_test,
    eval_dataset=tokenized_datasets_validation,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
Output:
Step Training Loss
500 2.920800
1000 2.555000
1500 2.437100
2000 2.389700
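After training finishes, you can optionally compute BLEU on the validation set and save the fine-tuned weights for later use; the output directory name below is just an example:
Python
# Evaluate with generation so that compute_metrics (BLEU) is applied
print(trainer.evaluate(max_length=max_length))

# Save the fine-tuned model and tokenizer locally (example path)
trainer.save_model("finetuned-nlp-en-hi-final")
tokenizer.save_pretrained("finetuned-nlp-en-hi-final")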
Gradio App for Interactive Translation
We can create an interactive Gradio app to demonstrate the model's translation capability:
Python
import gradio as gr

def translate(text):
    # Tokenize the input, generate the translation and decode it back to text
    inputs = tokenizer(text, return_tensors="pt").to(device)
    translated_tokens = model.generate(**inputs, max_length=256)
    results = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
    return results

interface = gr.Interface(
    fn=translate,
    inputs=gr.Textbox(lines=2, placeholder='Text to translate'),
    outputs='text',
)
interface.launch()
Output:
Gradio interface
Get the complete notebook link here:
Notebook Link: click here.
Dataset: click here.
In this tutorial, we demonstrated how to use transformers for machine translation from English to Hindi. We covered dataset loading, model fine-tuning, and evaluation. We also created a Gradio app for interactive translation.