Credit goes to www.geeksforgeeks.org

Machine Translation with Transformer in Python

Last Updated : 11 Mar, 2025

Machine translation converts text from one language to another and enables tools like Google Translate. Modern translation models use Transformer-based architectures, which capture context efficiently. In this article we fine-tune a pre-trained Transformer model from Hugging Face to translate English to Hindi.

Understanding Transformers

The Transformer is a deep learning architecture introduced in the paper Attention Is All You Need by Vaswani et al. It is widely used in natural language processing (NLP) tasks, including machine translation, because it can model long-range dependencies in text. A Transformer consists of two main components:

  1. Encoder: Reads and processes the input text (English sentence in our case).
  2. Decoder: Generates the output text in the target language (Hindi sentence in our case).

The key mechanism in Transformers is self-attention, which lets the model weigh every other word in a sentence when building the representation of each word. Unlike recurrent models, which read tokens one at a time, Transformers process the whole sentence in parallel, which makes training efficient. We will use a pre-trained Transformer model from Helsinki-NLP, an open-source NLP project that provides various translation models. Specifically, we will fine-tune the English-to-Hindi model on our dataset.
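
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not the model's actual implementation: the token vectors are made-up numbers, and the same vectors are used as queries, keys, and values, whereas a real Transformer first passes them through learned projection matrices and uses several attention heads.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity between this token's query and every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy 2-dimensional embeddings for a three-token sentence (made-up numbers)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
for row in attended:
    print([round(x, 3) for x in row])
```

Every output row mixes information from all three tokens at once, which is why no step-by-step recurrence is needed.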

Machine Translation using Transformers

Transformers have significantly improved the quality and efficiency of machine translation models. In this section, we will use the Hugging Face transformers library to perform English-to-Hindi translation.

1. Libraries installation

Before starting, ensure that you have the required libraries installed in your environment. Use the following commands to install them:

Python
!pip install datasets
!pip install transformers[torch]
!pip install sentencepiece
!pip install sacrebleu
!pip install evaluate
!pip install accelerate -U
!pip install gradio
!pip install kaleido cohere openai tiktoken typing-extensions==4.5.0


We will use the cfilt/iitb-english-hindi dataset available on Hugging Face.

The IIT Bombay English-Hindi corpus comprises parallel English-Hindi texts and monolingual Hindi texts, sourced from various existing platforms and from corpora built over time at the Center for Indian Language Technology (CFILT), IIT Bombay. It is a resource for training and evaluating English-Hindi machine translation models.

For details such as the size, sources, and other characteristics of the "cfilt/iitb-english-hindi" dataset, check the official documentation or publications from CFILT, IIT Bombay.

2. Dataset loading

The cfilt/iitb-english-hindi dataset can be loaded directly from the Hugging Face Hub with the datasets library. Each record holds one English sentence and its Hindi translation.

To load the dataset:

Python
from datasets import load_dataset
dataset = load_dataset("cfilt/iitb-english-hindi")

The dataset is a dictionary-like object containing splits like "train", "validation", and "test".

3. Model and Tokenizer loading

We will use a pre-trained translation model, Helsinki-NLP/opus-mt-en-hi, for English to Hindi translation. The AutoTokenizer and AutoModelForSeq2SeqLM classes from the Hugging Face transformers library allow us to load the tokenizer and model.

Loading the model and tokenizer:

Python
max_length = 256

# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-hi")


Example Translation:

Let us look at the output of the pre-trained model on one example from the validation split. The input sentence is: 'Rajesh Gavre, the President of the MNPA teachers association, honoured the school by presenting the award'.

Python
article = dataset['validation'][2]['translation']['en']
inputs = tokenizer(article, return_tensors="pt")

translated_tokens = model.generate(**inputs, max_length=256)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Output:

'एमएनएपी शिक्षकों के राष्ट्रपति, राजस्वीवर ने इस पुरस्कार को पेश करके स्कूल की प्रतिष्ठा की'

Let's compare this with the reference translation stored in the dataset, using the following code.

Python
dataset['validation'][2]['translation']['hi']

Output:

'मनपा शिक्षक संघ के अध्यक्ष राजेश गवरे ने स्कूल को भेंट देकर सराहना की।'

The pre-trained model's output differs noticeably from the reference, so let us fine-tune the model.

4. Tokenize the dataset

To fine-tune the model, we need to preprocess the dataset. This involves tokenizing both the input (English) and target (Hindi) sentences and ensuring that they are properly formatted for the model.

Python
def preprocess_function(examples):
    # With batched=True, examples["translation"] is a list of {"en": ..., "hi": ...} dicts
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["hi"] for ex in examples["translation"]]

    # text_target tokenizes the Hindi side with the tokenizer's target-language
    # settings and stores the resulting ids under "labels"
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs
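
The following sketch shows the shape of the data flowing through a batched preprocessing function. It is purely illustrative: `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins that map each whitespace-separated word to an arbitrary integer, the two sentence pairs are made up, and none of it uses the real tokenizer or dataset.

```python
# A hypothetical batch in the shape that map(..., batched=True) passes in:
# one key per column, each holding a list with one entry per example.
batch = {
    "translation": [
        {"en": "Good morning", "hi": "सुप्रभात"},
        {"en": "Thank you", "hi": "धन्यवाद"},
    ]
}


def toy_tokenize(text, max_length):
    """Stand-in tokenizer: one arbitrary integer id per word, truncated."""
    ids = [sum(ord(ch) for ch in word) % 1000 for word in text.split()]
    return ids[:max_length]


def toy_preprocess(examples, max_length=256):
    """Mirror the structure of a seq2seq preprocessing function."""
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["hi"] for ex in examples["translation"]]
    return {
        "input_ids": [toy_tokenize(text, max_length) for text in inputs],
        "labels": [toy_tokenize(text, max_length) for text in targets],
    }


features = toy_preprocess(batch)
print(features)
```

The result is again a dictionary of lists, one entry per example, which is the format the map method expects back from a batched function.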


We apply this function to the validation and test splits using the map method.

Python
tokenized_datasets_validation = dataset["validation"].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["validation"].column_names,
    batch_size=2,
)

tokenized_datasets_test = dataset["test"].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["test"].column_names,
    batch_size=2,
)

5. Define the datacollator

DataCollatorForSeq2Seq is a class that collates already-tokenized examples into batches for sequence-to-sequence (seq2seq) training. It does not tokenize text itself.

The collator pads the input sequences to the longest sequence in each batch, extends the attention masks to match, and pads the labels with -100 so that padded positions are ignored by the loss. When the model is passed in, the collator can also prepare the decoder input ids from the labels.

Python
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
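
As a rough picture of what dynamic padding does to a batch, here is a simplified stand-in written in plain Python. It is a sketch under assumptions, not the library's code: the `collate` helper is hypothetical, the token ids and `pad_token_id=0` are made up, padding is always added on the right, and the real collator additionally returns tensors.

```python
def collate(features, pad_token_id, label_pad_id=-100):
    """Pad a list of {"input_ids", "labels"} examples to the batch maximum."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input = len(f["input_ids"])
        n_label = len(f["labels"])
        batch["input_ids"].append(
            f["input_ids"] + [pad_token_id] * (max_input - n_input)
        )
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * n_input + [0] * (max_input - n_input))
        # -100 is the target value that PyTorch's cross-entropy loss ignores by default
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_label - n_label))
    return batch


examples = [
    {"input_ids": [5, 6, 7], "labels": [11, 12]},
    {"input_ids": [8], "labels": [13, 14, 15, 16]},
]
padded = collate(examples, pad_token_id=0)
print(padded["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(padded["labels"])          # [[11, 12, -100, -100], [13, 14, 15, 16]]
```

Padding per batch rather than to a fixed global length keeps short batches small, which saves memory and compute.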

6. Model training parameters

Before training, we decide which parameters are updated. The code below first marks every parameter as trainable, then freezes the earlier encoder and decoder layers so that only the last few layers of each stack (plus the parameters outside the layer stacks) are trained.

Python
# Start with every parameter trainable
for parameter in model.parameters():
    parameter.requires_grad = True

# Number of final layers to leave trainable in each stack; earlier layers are frozen.
# If this is at least the number of layers in a stack (see model.config.encoder_layers
# and model.config.decoder_layers), no layer is frozen. Adjust as needed.
num_trainable_layers = 10

encoder_layers = model.model.encoder.layers
for layer_index, layer in enumerate(encoder_layers):
    if layer_index < len(encoder_layers) - num_trainable_layers:
        for parameter in layer.parameters():
            parameter.requires_grad = False

decoder_layers = model.model.decoder.layers
for layer_index, layer in enumerate(decoder_layers):
    if layer_index < len(decoder_layers) - num_trainable_layers:
        for parameter in layer.parameters():
            parameter.requires_grad = False
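
The comparison used in the loops above freezes every layer whose index is below the layer count minus the configured number, so that number is effectively how many final layers stay trainable. The small helper below (hypothetical, for illustration only, and using assumed layer counts) makes the arithmetic visible without loading a model.

```python
def frozen_layer_indices(total_layers, num_kept_trainable):
    """Indices frozen by the test: layer_index < total_layers - num_kept_trainable."""
    return [i for i in range(total_layers) if i < total_layers - num_kept_trainable]


print(frozen_layer_indices(6, 2))    # [0, 1, 2, 3]
print(frozen_layer_indices(6, 10))   # []
print(frozen_layer_indices(12, 10))  # [0, 1]
```

When the configured number is larger than the stack, the right-hand side of the comparison is negative and nothing is frozen, so it is worth checking the model's actual layer count before choosing the value.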

7. Model evaluation

We use SacreBLEU for evaluating the model's performance. BLEU (Bilingual Evaluation Understudy) is a metric commonly used for evaluating machine translation models.

Python
import evaluate

metric = evaluate.load("sacrebleu")

import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
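
For intuition about what a BLEU score measures, here is a minimal sketch of its core ingredient, the clipped (modified) n-gram precision, on a made-up English sentence pair. It is not what SacreBLEU computes in full: the real metric also applies its own tokenization, combines precisions for n-grams of order 1 to 4 with a geometric mean, and multiplies by a brevity penalty. The helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(prediction, reference, n):
    """Fraction of predicted n-grams found in the reference, with clipped counts."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    if not pred_counts:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return clipped / sum(pred_counts.values())


prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
p1 = modified_precision(prediction, reference, 1)  # 5 of 6 unigrams match
p2 = modified_precision(prediction, reference, 2)  # 3 of 5 bigrams match
print(round(p1, 3), round(p2, 3))
```

Clipping prevents a prediction from scoring well by repeating one correct word: "the the the" against "the cat" gets a unigram precision of 1/3, not 1.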

8. Model training

We define the training parameters using Seq2SeqTrainingArguments from Hugging Face and initiate training with Seq2SeqTrainer.

Python
import torch
from transformers import Seq2SeqTrainingArguments

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

training_args = Seq2SeqTrainingArguments(
    "finetuned-nlp-en-hi",           # output directory for checkpoints
    gradient_checkpointing=True,
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    warmup_steps=2,
    max_steps=2000,
    fp16=torch.cuda.is_available(),  # mixed precision needs a GPU
    optim='adafactor',
    per_device_eval_batch_size=16,
    metric_for_best_model="eval_bleu",
    predict_with_generate=True,
    push_to_hub=False,
)
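
To see how `learning_rate`, `warmup_steps`, and `max_steps` interact, the sketch below reproduces a linear warmup followed by linear decay. This rests on an assumption: the arguments above do not set a scheduler, and the sketch supposes the trainer's default linear schedule is in effect. The function name is hypothetical and the formula is a simplified illustration, not the library's code.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=2, max_steps=2000):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly so the rate reaches 0 at max_steps
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * remaining


for step in (0, 1, 2, 1001, 2000):
    print(step, linear_schedule_lr(step))
```

With only 2 warmup steps the rate reaches its peak almost immediately, so for most of the 2000 steps the schedule is simply a slow linear decay from 1e-5 to 0.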


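We start training with the code below. Note that it trains on the tokenized test split and evaluates on the validation split; the train split is not tokenized or used in this walkthrough.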

Python
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets_test,
    eval_dataset=tokenized_datasets_validation,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Output:

Step    Training Loss
500     2.920800
1000    2.555000
1500    2.437100
2000    2.389700

Gradio App for Interactive Translation

We can create an interactive Gradio app to demonstrate the model's translation capability:

Python
import gradio as gr


def translate(text):
  inputs = tokenizer(text, return_tensors="pt").to(device)
  translated_tokens = model.generate(**inputs,  max_length=256)
  results = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
  return results

interface = gr.Interface(fn=translate,inputs=gr.Textbox(lines=2, placeholder='Text to translate'),
                        outputs='text')

interface.launch()

Output:

[Figure: Gradio interface for machine translation using Transformers]

In this tutorial, we demonstrated how to use transformers for machine translation from English to Hindi. We covered dataset loading, model fine-tuning, and evaluation. We also created a Gradio app for interactive translation.

