We will talk about some of the classification metrics every data scientist must know.
Let's first do a quick revision about what classification is.
- Classification refers to a predictive modeling problem where a class label is predicted for a given example of input data.
In this article, we’ll focus only on binary classification (i.e. a dichotomy).
In simple terms, binary classification is the task of classifying the elements of a set into two groups on the basis of a classification rule.
A very common example of binary classification is spam detection, where the input data includes the email text and metadata (sender, sending time), and the output label is either “spam” or “not spam”.
So, once a machine learning model has been built to detect whether an email is spam, we need performance metrics to evaluate it.
One question that comes to mind: how do we know when to stop training and evaluating, and when the model is good enough?
This is where classification metrics such as accuracy, precision, recall, and AUC-ROC come in.
Let us discuss them in brief.
The table above is called a confusion matrix. It visualizes the performance of an algorithm, typically a supervised learning one, which helps you compare classification models and pick the best.
Note: In unsupervised learning, it is usually called a matching matrix.
Some terminology of the confusion matrix:
- condition positive (P): The number of real positive cases in the data.
- condition negative (N): The number of real negative cases in the data.
Let us say there is a vending machine that can detect fake coins. Here, a classification model predicts whether each coin is fake or not.
The class of interest is called the positive class. The vending machine is trying to detect fake coins, which makes “fake” the positive class.
True Positive – A fake coin classified as fake.
False Positive – A real coin classified as fake.
False Negative – A fake coin classified as not fake.
True Negative – A real coin classified as not fake.
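The four outcomes above can be counted directly from pairs of actual and predicted labels. Here is a minimal sketch in plain Python; the function name `confusion_counts` and the six sample coins are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="fake"):
    """Count TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # fake coin classified as fake
        elif p == positive and a != positive:
            fp += 1  # real coin classified as fake
        elif p != positive and a == positive:
            fn += 1  # fake coin classified as not fake
        else:
            tn += 1  # real coin classified as not fake
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical data: what each coin really is vs. what the machine decided.
actual = ["fake", "fake", "real", "real", "real", "fake"]
predicted = ["fake", "real", "real", "fake", "real", "fake"]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```

Every metric discussed later in the article is a ratio built from these four counts.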
The classification metrics commonly used for evaluation:
Definition: Accuracy is defined as the ratio of correctly classified observations to the total number of observations.
Accuracy = Correctly classified observations / All observations
i.e. Accuracy = (True Positives + True Negatives)/(True Positives + True Negatives + False Positives + False Negatives)
Accuracy is not always the best metric to measure model performance when dealing with class imbalance. Class imbalance is a situation where one class is more frequent than the other.
Consider a fraud classification problem where 98% of transactions are genuine and only 2% are fraudulent. A classifier that predicts every transaction as genuine will have an accuracy of 98%. While the accuracy is high, the model fails terribly at its real purpose, which is to detect fraudulent transactions. Thus, we need a more nuanced metric to assess the performance of a model in such cases.
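To make the fraud example concrete, here is a small sketch. It assumes 1,000 transactions (an arbitrary number chosen for illustration) with the 98% / 2% split from the text, and a model that labels everything as genuine.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all observations that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Assumed data: 1,000 transactions, 20 fraudulent (positive), 980 genuine.
# The model predicts "genuine" for every transaction.
tp, fn = 0, 20   # no fraud is ever flagged, so all 20 are missed
tn, fp = 980, 0  # every genuine transaction is labelled correctly

acc = accuracy(tp, tn, fp, fn)
fraud_caught = tp / (tp + fn)

print(acc)           # 0.98
print(fraud_caught)  # 0.0
```

The accuracy looks excellent, yet the share of fraud cases caught is zero, which is exactly the failure the paragraph describes.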
- Definition: Specificity is defined as the ratio of true negatives to the actual negatives in the data. i.e. out of the total actual negatives, what percentage have been correctly predicted as negative by the model.
Specificity = True Negative / (True Negative + False Positive)
Consider a model that predicts spam email. Here spam is our positive class, so specificity will be
Specificity = Genuine mails predicted as not spam / Total number of genuine mails
Specificity matters for spam filters because an email user would rather have spam reach the inbox (False Negative) than have a real email sent to the spam folder (False Positive).
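As a quick illustration of the specificity formula, the sketch below uses invented counts for a spam filter: 100 genuine mails, of which 10 were wrongly sent to the spam folder.

```python
def specificity(tn, fp):
    """True negatives divided by all actual negatives."""
    return tn / (tn + fp)


# Hypothetical counts, with spam as the positive class.
genuine_kept_in_inbox = 90  # true negatives
genuine_sent_to_spam = 10   # false positives

spec = specificity(genuine_kept_in_inbox, genuine_sent_to_spam)
print(spec)  # 0.9
```

A specificity of 0.9 means one in ten genuine mails is being lost to the spam folder, which most users would consider too many.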
Sensitivity (Recall or True Positive Rate)
- Definition: Sensitivity, also known as Recall or True Positive Rate, is the ratio of true positives to the actual positives in the data, i.e. out of the total actual positives, what percentage have been correctly predicted as positive by the model.
Sensitivity = True Positive / (True Positive + False Negative)
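Sensitivity can also be computed straight from label lists. The sketch below assumes labels encoded as 1 (positive) and 0 (negative); the sample data is invented, and returning 0.0 when there are no actual positives is a convention chosen here, not a rule from the article.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (labels are 0 or 1)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    if tp + fn == 0:
        return 0.0  # assumed convention when the data has no positives
    return tp / (tp + fn)


# Hypothetical labels: five actual positives, four of them detected.
actual = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 1]

recall = sensitivity(actual, predicted)
print(recall)  # 0.8
```

Note that the false positive at the sixth position does not affect the result: sensitivity only looks at the actual positives.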
- Definition: Precision is the ratio of true positives to the total number of observations predicted as positive, i.e. out of the observations predicted as positive, what percentage is correct?
Precision = True Positive / (True Positive + False Positive)
In the spam email example,
Precision = Spam mails predicted as spam / Total number of mails predicted as spam
A high precision implies that few real emails are predicted as spam, while a high sensitivity implies that most spam mails are predicted as spam, even if some real mails are also predicted as spam.
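The contrast between precision and sensitivity is easier to see with numbers. The sketch below compares two imaginary spam filters on the same 100 spam mails; all counts are invented for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


# Hypothetical filters, each evaluated on 100 spam mails (tp + fn = 100).
cautious = {"tp": 40, "fp": 2, "fn": 60}     # flags little, rarely wrong
aggressive = {"tp": 95, "fp": 30, "fn": 5}   # flags a lot, catches most spam

for name, c in (("cautious", cautious), ("aggressive", aggressive)):
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The cautious filter has the higher precision but misses most spam, while the aggressive filter catches most spam at the cost of flagging 30 real mails.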
- Definition: The F-measure is a measure of a test's accuracy. The F1 score is the harmonic mean of precision and recall. The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, reached if either the precision or the recall is zero. The F-score is often used in the field of information retrieval for measuring search, document classification, and query classification performance.
F1 Score = 2 * Precision * Recall / (Precision + Recall)
Simply stated, the F1 score balances the precision and recall of your classifier. If either your precision or your recall is low, the F1 score is low as well.
If you are a police inspector and you want to catch criminals, you want to be sure that the person you catch is a criminal (Precision), and you also want to capture as many criminals as possible (Recall). The F1 score manages this tradeoff.
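A short sketch shows why the harmonic mean punishes an imbalance between precision and recall more than a plain average would. The input values are arbitrary, and returning 0.0 when both inputs are zero is a convention assumed here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.5, 0.5)   # both moderate
lopsided = f1_score(0.9, 0.1)   # same arithmetic mean (0.5), very uneven

print(balanced)  # 0.5
print(lopsided)  # about 0.18
```

Both pairs average to 0.5, but the lopsided classifier scores far lower on F1 because its recall is poor.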
AUC is the area under the ROC curve. AUC-ROC indicates how well the predicted probabilities for the positive class are separated from those for the negative class.
What is the ROC curve?
A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.
This curve plots two parameters:
- True Positive Rate
- False Positive Rate
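To see where the points of the curve come from, the sketch below sweeps a threshold over a handful of predicted scores and records the two rates at each step. The labels and scores are invented, and the function assumes the data contains at least one positive and one negative example.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for t in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with the False Positive Rate on the x-axis and the True Positive Rate on the y-axis gives the ROC curve, which always runs from (0, 0) to (1, 1).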
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
- AUC is scale-invariant, as it measures how well predictions are ranked, rather than their absolute values.
Say you are a marketer who wants to find a list of users who will respond to a marketing campaign. AUC is a good metric here, since ranking the predictions by probability gives the order in which you would add users to the campaign list.
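One way to see the ranking interpretation is to compute AUC as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, which is equivalent to the area under the ROC curve. The sketch below uses that pairwise formulation on invented data; it is quadratic in the number of examples, so it is meant for illustration rather than large datasets.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
rescaled = [s / 10 for s in scores]  # same ordering, different scale

print(auc(labels, scores))    # 8/9, about 0.889
print(auc(labels, rescaled))  # identical, because only the ranking matters
```

Dividing every score by ten leaves the AUC unchanged, which is what scale invariance means in practice. A perfectly ranked list gives 1.0 and a perfectly reversed one gives 0.0.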
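- AUC is also classification-threshold-invariant: it measures the quality of the model's predictions regardless of which classification threshold is chosen. This invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error.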
For example, when doing email spam detection, you likely want to focus on minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.
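When one type of error matters more, a common alternative is to pick a specific threshold that caps that error and then look at what it costs on the other side. The sketch below is one possible approach, not a prescribed method: the helper name `pick_threshold`, the cap of zero false positives, and the data are all assumptions for illustration.

```python
def pick_threshold(labels, scores, max_fpr):
    """Lowest threshold whose false positive rate stays within max_fpr.

    Returns None if even the strictest threshold exceeds the cap.
    Assumes labels are 0 or 1 and that at least one negative exists.
    """
    negatives = labels.count(0)
    best = None
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        if fp / negatives <= max_fpr:
            best = t  # still within the cap, try a looser threshold
        else:
            break     # FPR only grows as the threshold drops further
    return best


# Hypothetical labels (1 = spam) and spam scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

threshold = pick_threshold(labels, scores, max_fpr=0.0)
caught = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
recall_at_threshold = caught / labels.count(1)

print(threshold)            # 0.8
print(recall_at_threshold)  # 2 of 3 spam mails caught
```

With no false positives allowed, one spam mail slips through to the inbox. That is the trade described above: fewer real emails lost, more spam tolerated.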
These are some of the metrics you must know as a data scientist to assess the performance of your classification model.
We have attached some affiliate links in this post to relevant resources as sharing knowledge is never a bad idea.
I hope you will read it carefully.
Thanks for being here and giving us a chance to make things simple for you.
Stay tuned for more!