BLEU Score for Evaluating Neural Machine Translation using Python
Neural Machine Translation (NMT) is used in NLP to translate text from a source language to a target language. To evaluate how well the translation is performed, we can compute the BLEU (Bilingual Evaluation Understudy) score in Python.
The BLEU score works by comparing machine-translated sentences to one or more human reference translations at the level of n-grams. A brevity penalty additionally lowers the score of translations that are shorter than their references. A BLEU score lies in the range from 0 to 1, and a higher value indicates better quality; achieving a perfect score is very rare. Note that the evaluation is based purely on n-gram overlap, so it does not consider other aspects of language such as coherence, tense, and grammar.
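As a concrete illustration of the n-gram matching described above, here is a small sketch (the `ngrams` helper is our own, not part of any library) that counts clipped 1-gram and 2-gram matches between a candidate and a reference:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

candidate = "it rain today".split()
reference = "it is raining today".split()

# Clipped matches: each candidate n-gram count is capped by its reference count
matches_1 = ngrams(candidate, 1) & ngrams(reference, 1)
matches_2 = ngrams(candidate, 2) & ngrams(reference, 2)

print(sum(matches_1.values()))  # 2 — "it" and "today" match
print(sum(matches_2.values()))  # 0 — no 2-gram of the candidate appears in the reference
```

Because "it rain" matches neither "it is" nor "is raining", every 2-gram precision is zero, which is exactly the situation that drives a BLEU score to 0.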
Formula
BLEU = BP * exp(1/n * sum_{i=1}^{n} log(p_i))
Here, the various terms have the following meanings −
BP is the Brevity Penalty. It lowers the score of candidate translations that are shorter than their references. With r the reference length and c the candidate (translation) length, its formula is given by −
BP = min(1, exp(1 - (r / c)))
n is the maximum order of n-gram matching
p_i is the modified (clipped) precision for i-grams, i.e. the fraction of candidate i-grams that also appear in a reference
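The formula above can be sketched in plain Python for the single-reference case. This is our own minimal implementation, not the library's (the datasets metric used below additionally clips counts across multiple references), but for Example 2 below the single closest reference gives the same result:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sketch of BLEU = BP * exp(1/n * sum(log p_i)) for one reference."""
    c, r = len(candidate), len(reference)
    bp = min(1.0, math.exp(1 - r / c))          # brevity penalty
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        matched = sum((cand & ngrams(reference, n)).values())
        total = sum(cand.values())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0:                    # any zero precision zeroes the score
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("we going on a trip".split(), "we are going on a trip".split())
print(score)  # ≈ 0.5789
```

Note that a single unmatched n-gram order sends the whole score to zero, since log(0) is undefined; library implementations offer smoothing to soften this.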
Algorithm
Step 1 − Import the load_metric function from the datasets library.
Step 2 − Call load_metric with "bleu" as its argument.
Step 3 − Split the machine-translated string into a list of words.
Step 4 − Repeat step 3 for each desired (reference) string.
Step 5 − Call bleu.compute with the predictions and references to obtain the BLEU value.
Example 1
In this example, we will use the Hugging Face datasets library to calculate the BLEU score for a German sentence machine translated to English.
Source text (German) − es regnet heute
Machine translated text − it rain today
Desired text − it is raining today, it was raining today
Although we can see that the translation is not done correctly, we can get a better view of the translation quality by finding the BLEU score.
Example
#import the load_metric function
from datasets import load_metric

#load the bleu metric
bleu = load_metric("bleu")

#set up the predicted (machine-translated) string as a list of tokens
predictions = [["it", "rain", "today"]]

#set up the desired (reference) strings
references = [
   [["it", "is", "raining", "today"], ["it", "was", "raining", "today"]]
]

#print the values
print(bleu.compute(predictions=predictions, references=references))
Output
{'bleu': 0.0, 'precisions': [0.6666666666666666, 0.0, 0.0, 0.0], 'brevity_penalty': 0.7165313105737893, 'length_ratio': 0.75, 'translation_length': 3, 'reference_length': 4}
You can see that the translation is not very good: no 2-gram of the candidate appears in either reference, so the higher-order precisions are 0 and the BLEU score comes out to be 0.0.
Example 2
In this example, we will again calculate the BLEU score, but this time for a French sentence that is machine translated to English.
Source text (French) − nous partons en voyage
Machine translated text − we going on a trip
Desired text − we are going on a trip, we were going on a trip
You can see that this time, the translated text is much closer to the desired text. Let us check the BLEU score for it.
Example
#import the load_metric function
from datasets import load_metric

#load the bleu metric
bleu = load_metric("bleu")

#set up the predicted (machine-translated) string as a list of tokens
predictions = [["we", "going", "on", "a", "trip"]]

#set up the desired (reference) strings
references = [
   [["we", "are", "going", "on", "a", "trip"], ["we", "were", "going", "on", "a", "trip"]]
]

#print the values
print(bleu.compute(predictions=predictions, references=references))
Output
{'bleu': 0.5789300674674098, 'precisions': [1.0, 0.75, 0.6666666666666666, 0.5], 'brevity_penalty': 0.8187307530779819, 'length_ratio': 0.8333333333333334, 'translation_length': 5, 'reference_length': 6}
You can see that this time, the translation was quite close to the desired output, and thus the BLEU score is also higher than 0.5.
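To connect this output back to the formula, the brevity_penalty and length_ratio fields can be reproduced directly from the reported lengths:

```python
import math

# Lengths reported in the output above
translation_length, reference_length = 5, 6

# BP = min(1, exp(1 - r/c)) with r = reference length, c = translation length
bp = min(1.0, math.exp(1 - reference_length / translation_length))
print(bp)  # ≈ 0.8187, matching 'brevity_penalty' above

ratio = translation_length / reference_length
print(ratio)  # ≈ 0.8333, matching 'length_ratio' above
```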
Conclusion
The BLEU score is a useful tool for checking the quality of a translation model and thus improving it to produce better results. However, while it gives a rough idea of model quality, it only rewards exact n-gram overlap, so it misses synonyms, paraphrases, and other nuances of language. This is why the BLEU score often correlates poorly with human judgment. Alternatives such as the ROUGE score, the METEOR metric, and the CIDEr metric are worth trying as well.