Welcome to our NLP model metrics discussion! In this article, we'll cover some of the basic evaluation metrics for your NLP model.


Whenever we build NLP models, we need a metric to measure how good the model is. Bear in mind that the “goodness” of a model can be interpreted in several ways, but here we generally mean the model's performance on new instances that weren’t part of the training data.

Whether a model can be judged successful for a specific task depends on two key factors:

  1. Whether the evaluation metric we have selected is the correct one for our problem.
  2. Whether we are following the correct evaluation process.

However, you might be wondering, why do we really need a metric? Let's find out below.

Why do we need a metric?

The main purpose of developing these AI solutions is to apply them to real-world problems and make our lives easier and better. However, our real world is not a simple one. So how do we decide which model to use and when? That is when these metrics come into use.

If we are trying to solve two problems with one model, we want to know the model’s performance on both tasks so that we can make an informed decision and be aware of the trade-offs we are making. This is also where the “goodness” of a model comes in. The real world is full of biases, and we don’t want our solutions to be biased, as that can have unforeseen consequences.

Let's say we are translating a text from language X to English. If a particular sentence about Group A is translated to “They did a good job.” while the same sentence about Group B is translated to “They did a great job.”, that is a clear sign that our model is biased towards Group B. Such biases should be known before the model is deployed in the real world, and metrics should help us surface them.

Even though learned biases have more to do with the training data and less to do with model building, having a metric or a standard for capturing biases would be a good practice to adopt.

Below we will cover some top metrics that you should consider to capture these biases.

Top Evaluation Metrics

  • BLEU

Bilingual Evaluation Understudy, or BLEU, is a precision-based metric for evaluating the quality of text that has been machine-translated from one natural language to another. It works by computing the 𝑛-gram overlap between the reference and the hypothesis. In particular, the BLEU 𝑛-gram precision is the ratio of the number of overlapping 𝑛-grams to the total number of 𝑛-grams in the hypothesis. To be precise, the numerator is the sum of the overlapping 𝑛-grams across all the hypotheses (i.e., all the test instances) and the denominator is the sum of the total 𝑛-grams across all the hypotheses. Because each precision is summed over all the hypotheses, BLEU is called a corpus-level metric: it gives one score over the entire corpus, as opposed to scoring individual sentences and then taking an average. The final score combines the precisions for several values of 𝑛 (typically 1 to 4) through a geometric mean and multiplies the result by a brevity penalty that discourages overly short hypotheses.
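The corpus-level precision described above can be sketched in a few lines of Python. This is a minimal illustration, not a full BLEU implementation: it assumes whitespace tokenization and a single reference per hypothesis, and the function names are made up for this example. Hypothesis counts are clipped by the reference counts so that a repeated word cannot be credited more often than it appears in the reference.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_ngram_precision(hypotheses, references, n):
    """Sum matches and totals over all hypotheses, then divide once."""
    matched, total = 0, 0
    for hyp, ref in zip(hypotheses, references):
        hyp_counts = ngrams(hyp.split(), n)
        ref_counts = ngrams(ref.split(), n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0

hyps = ["the cat sat on the mat", "a dog barked"]
refs = ["the cat is on the mat", "the dog barked loudly"]
p1 = corpus_ngram_precision(hyps, refs, 1)  # 7 matched unigrams out of 9
p2 = corpus_ngram_precision(hyps, refs, 2)  # 4 matched bigrams out of 7
```

Note that the division happens once, after summing over the whole corpus, which is what makes the score corpus-level rather than an average of sentence scores.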

  • NIST

The name NIST comes from the organization that proposed it, the US National Institute of Standards and Technology. The metric is a variant of BLEU with some alterations. Where BLEU calculates 𝑛-gram precision giving equal weight to each 𝑛-gram, NIST also considers how informative a particular 𝑛-gram is, weighting each matched 𝑛-gram by its information gain. The information gain for an 𝑛-gram made up of words 𝑤1, ..., 𝑤𝑛 is computed over the set of references from how often the full 𝑛-gram occurs relative to its first 𝑛−1 words. The idea is to give more credit when a matched 𝑛-gram is rare and less credit when it is common, which reduces the chance of gaming the metric by producing trivial 𝑛-grams.
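As a rough sketch of the information weight, the snippet below uses the commonly cited form log2(count(𝑤1..𝑤𝑛−1) / count(𝑤1..𝑤𝑛)), with counts taken over the reference set; for unigrams the numerator is the total number of reference words. The helper names and the whitespace tokenization are assumptions made for illustration only.

```python
import math
from collections import Counter

def ngram_counts(sentences, n):
    """Count all n-grams (as tuples) over a list of sentences."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return counts

def info_weight(ngram, references):
    """Information weight of an n-gram, computed over the references."""
    n = len(ngram)
    full = ngram_counts(references, n)[ngram]
    if full == 0:
        return 0.0  # never occurs in the references, so it cannot be matched
    if n == 1:
        # For unigrams, compare against the total number of reference words.
        prefix = sum(ngram_counts(references, 1).values())
    else:
        prefix = ngram_counts(references, n - 1)[ngram[:-1]]
    return math.log2(prefix / full)

refs = ["the cat sat on the mat", "the dog sat on the rug"]
w_the = info_weight(("the",), refs)        # common word, low weight
w_cat = info_weight(("cat",), refs)        # rare word, higher weight
w_the_cat = info_weight(("the", "cat"), refs)
w_on_the = info_weight(("on", "the"), refs)
```

Here "cat" receives more weight than "the", and the bigram "on the" receives none because "on" is always followed by "the" in these references.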


  • METEOR

There are two major drawbacks of BLEU:
(i) It does not take recall into account.
(ii) It only allows exact 𝑛-gram matching.

To overcome these drawbacks, METEOR (Metric for Evaluation of Translation with Explicit ORdering) was proposed. It is based on the F-measure and uses relaxed matching criteria. In particular, even if a unigram in the hypothesis does not have an exact surface-level match with a unigram in the reference but is still equivalent to it (say, it is a synonym), METEOR considers it a matched unigram.

More specifically, it first performs exact word matching, followed by stemmed-word matching, and finally synonym and paraphrase matching. It then computes the F-score using this relaxed matching strategy.

Because METEOR only considers unigram matches as opposed to 𝑛-gram matches, it seeks to reward longer contiguous matches using a penalty term known as the ‘fragmentation penalty’. To compute this, ‘chunks’ of matches are identified in the hypothesis, where contiguous hypothesis unigrams that are mapped to contiguous unigrams in a reference are grouped together into one chunk. Therefore, longer 𝑛-gram matches lead to fewer chunks, and the limiting case of one chunk occurs if there is a complete match between the hypothesis and the reference. On the other hand, if there are no bigram or longer matches, the number of chunks will be the same as the number of matched unigrams.
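The chunk counting can be illustrated with a short Python sketch. It assumes the unigram alignment has already been computed and is passed in as (hypothesis index, reference index) pairs. The penalty form, gamma * (chunks / matches) ** beta with gamma=0.5 and beta=3, follows the original METEOR formulation; later versions tune these parameters, so treat the defaults as an assumption.

```python
def count_chunks(alignment):
    """Count chunks in a list of (hyp_index, ref_index) matched pairs."""
    chunks = 0
    prev = None
    for h, r in sorted(alignment):
        # A new chunk starts unless this pair extends the previous
        # pair by one position in both the hypothesis and the reference.
        if prev is None or h != prev[0] + 1 or r != prev[1] + 1:
            chunks += 1
        prev = (h, r)
    return chunks

def fragmentation_penalty(alignment, gamma=0.5, beta=3.0):
    """Penalty grows as the matches become more fragmented."""
    matches = len(alignment)
    if matches == 0:
        return 0.0
    return gamma * (count_chunks(alignment) / matches) ** beta

full_match = [(0, 0), (1, 1), (2, 2)]   # one contiguous chunk
scattered = [(0, 2), (2, 0), (4, 4)]    # no bigram or longer matches
chunks_full = count_chunks(full_match)
chunks_scattered = count_chunks(scattered)
penalty_scattered = fragmentation_penalty(scattered)
```

The fully contiguous alignment collapses into a single chunk, while the scattered one has as many chunks as matched unigrams and therefore receives the largest penalty.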


  • ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is recall-based, unlike BLEU, which is precision-based. It is a set of metrics used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a human-produced reference summary or translation, or a set of such references. ROUGE includes several variants: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. ROUGE-N is similar to BLEU-N in counting the 𝑛-gram matches between the hypothesis and the reference.
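To make the contrast with BLEU concrete, here is a minimal sketch of ROUGE-N recall for a single hypothesis and reference pair. It assumes whitespace tokenization, and the function name is made up for this example. The only structural difference from a BLEU-style precision is the denominator: the number of 𝑛-grams in the reference rather than in the hypothesis.

```python
from collections import Counter

def rouge_n_recall(hypothesis, reference, n):
    """Fraction of reference n-grams that also appear in the hypothesis."""
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    hyp_counts, ref_counts = grams(hypothesis), grams(reference)
    # Clip each reference n-gram count by its count in the hypothesis.
    matched = sum(min(c, hyp_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return matched / total if total else 0.0

hyp = "the cat sat on the mat"
ref = "the cat is on the mat"
r1 = rouge_n_recall(hyp, ref, 1)  # 5 of 6 reference unigrams are covered
r2 = rouge_n_recall(hyp, ref, 2)  # 3 of 5 reference bigrams are covered
```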

  • CIDEr

CIDEr (Consensus-based Image Description Evaluation) was proposed in the context of image captioning, where each image is accompanied by multiple reference captions. It is based on the premise that 𝑛-grams that are relevant to an image would occur frequently in its set of reference captions.

CIDEr weighs each 𝑛-gram in a sentence based on its frequency in the corpus and in the reference set of the particular instance, using TF-IDF (term frequency and inverse document frequency). 𝑛-grams that appear frequently in the entire dataset (i.e., in the reference captions of different images) are less likely to be informative or relevant, and hence they are assigned a lower weight through the inverse-document-frequency (IDF) term.
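The TF-IDF weighting can be sketched as follows. This is a simplified illustration rather than the full CIDEr score, which also averages cosine similarities between weighted vectors over several values of 𝑛. It assumes whitespace tokenization, treats the reference set of each image as one "document", and the function name and example captions are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(sentence, reference_sets, n=1):
    """TF-IDF weight per n-gram; reference_sets holds one caption list per image."""
    def grams(text):
        toks = text.split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    num_images = len(reference_sets)
    doc_freq = Counter()
    for refs in reference_sets:
        seen = set()
        for ref in refs:
            seen.update(grams(ref))
        doc_freq.update(seen)  # count each n-gram once per image

    counts = Counter(grams(sentence))
    total = sum(counts.values())
    weights = {}
    for g, c in counts.items():
        df = max(doc_freq[g], 1)  # assumption: guard against unseen n-grams
        weights[g] = (c / total) * math.log(num_images / df)
    return weights

reference_sets = [
    ["a dog runs on the grass", "a brown dog on grass"],
    ["a cat sleeps on the sofa", "a cat on a couch"],
    ["a bird flies over the lake"],
]
w = tfidf_weights("a dog on the grass", reference_sets)
```

In this toy dataset, "a" and "the" occur in the references of every image and get zero weight, while "dog" and "grass" occur for only one image and get the highest weight.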


  • SPICE

SPICE (Semantic Propositional Image Caption Evaluation) is another image captioning metric. Instead of focusing on 𝑛-gram similarity, it gives more importance to the semantic propositions implied by the text. SPICE uses ‘scene graphs’ to represent semantic propositional content. The hypothesis and references are converted into scene graphs, and the SPICE score is computed as the F1-score between the scene-graph tuples of the proposed sentence and all reference sentences. For matching the tuples, SPICE also considers synonyms from WordNet, similar to METEOR.

One issue with SPICE is that it depends heavily on the quality of parsing. It also neglects fluency, assuming that the sentences are well-formed. It is thus possible for SPICE to assign a high score to captions that contain only objects, attributes, and relations but are grammatically incorrect.
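The F1 computation over scene-graph tuples can be sketched as below. The parsing step is assumed to have happened already: the tuples are written by hand, the function name is hypothetical, and only exact tuple matches are counted (no WordNet synonym matching).

```python
def tuple_f1(hyp_tuples, ref_tuples):
    """F1 between two collections of scene-graph tuples, exact matches only."""
    hyp, ref = set(hyp_tuples), set(ref_tuples)
    matched = len(hyp & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(hyp)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# Objects are 1-tuples, attributes 2-tuples, relations 3-tuples.
hyp_graph = [("girl",), ("girl", "young"), ("table",),
             ("girl", "sit-at", "table")]
ref_graph = [("girl",), ("girl", "young"), ("table",),
             ("table", "wooden"), ("girl", "stand-near", "table")]
score = tuple_f1(hyp_graph, ref_graph)  # precision 3/4, recall 3/5
```

Because the score only looks at the tuple sets, a hypothesis that yields the right tuples would score well regardless of how fluent the sentence is, which is the limitation noted above.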

The metrics that follow rely on contextualized embeddings. With static word embeddings, the embedding of a word does not depend on the context in which it is used; with contextualized embeddings, it does.

  • BERT

This metric uses BERT to obtain the word embeddings and shows that combining contextual embeddings with a simple average recall-based metric gives competitive results. The BERT score is the average recall score over all tokens, using a relaxed version of token matching based on BERT embeddings, i.e., by computing the maximum cosine similarity between the embedding of a reference token and any token in the hypothesis.

  • BERTScore

BERTScore, which takes its name from BERT (Bidirectional Encoder Representations from Transformers), computes the cosine similarity of each hypothesis token 𝑗 with each token 𝑖 in the reference sentence using contextualized embeddings. It uses a greedy matching approach instead of a time-consuming best-case matching approach, and then computes the F1 measure.
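Greedy matching over embeddings can be sketched with plain Python lists standing in for token embeddings. The two-dimensional vectors below are toy values chosen for illustration, not real BERT outputs, and the function names are hypothetical. Recall averages, over reference tokens, the best cosine similarity to any hypothesis token; precision does the same in the other direction.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_scores(hyp_vecs, ref_vecs):
    """Precision, recall, and F1 via greedy matching (inputs must be non-empty)."""
    # Each reference token is matched to its most similar hypothesis token.
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    # Each hypothesis token is matched to its most similar reference token.
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

ref_vecs = [[1.0, 0.0], [0.0, 1.0]]  # toy "embeddings" for two reference tokens
hyp_vecs = [[1.0, 0.0], [1.0, 1.0]]  # toy "embeddings" for two hypothesis tokens
precision, recall, f1 = greedy_match_scores(hyp_vecs, ref_vecs)
```

The recall value on its own corresponds to the recall-only variant described in the previous section, while the F1 value corresponds to the score described here.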

  • MoverScore

MoverScore takes inspiration from the Word Mover's Distance (WMD) metric to formulate an optimal matching metric, which uses contextualized embeddings to compute the Euclidean distances between words or 𝑛-grams. In contrast to BERTScore, which allows one-to-one hard matching of words, MoverScore allows many-to-one matching because it uses soft/partial alignments, similar to how WMD allows partial matching with word2vec embeddings. It has been shown to have competitive correlations with human judgments in four NLG tasks: machine translation, image captioning, abstractive summarization, and data-to-text generation.

Wrap Up

We have discussed some of the top evaluation metrics for NLP models.

We hope this article gave you some insight into them.

Stay tuned for more. See you next time!