Within this guide, we’ll go through the popular metrics for machine learning model evaluation.
When selecting machine learning models, it’s critical to have evaluation metrics that quantify model performance. In this post, we’ll focus on the more common supervised learning problems. There are several commonly used metrics for both classification and regression tasks, so it’s important to get an overview of them and choose the right one based on your business goals.
Following this overview, you’ll discover how to evaluate ML models using:
- Accuracy
- Confusion Matrix
- Area Under the ROC Curve (AUC)
- F1 Score
- Precision-Recall Curve
- Log/Cross Entropy Loss
- Mean Squared Error
- Mean Absolute Error
If you want to evaluate and select among different machine learning algorithms/models, this guide will help you get started.
Let’s roll!
We’ll start with the model evaluation techniques for machine learning classification problems. For simplicity, we’ll give examples for binary classification, where the output variable has only two possible classes. But many of these metrics can be extended to multiclass problems.
Classification Accuracy
This is the most intuitive model evaluation metric. When we make predictions by classifying the observations, the result is either correct (True) or incorrect (False). The classification accuracy measures the percentage of the correct classifications with the formula below:
Accuracy = # of correct predictions / # of total predictions
At first glance, the higher the accuracy, the better the model. Yet accuracy doesn’t tell the full story, especially for imbalanced datasets.
Imagine we are predicting fraudulent transactions in a sample of bank transactions. Let’s assume 97% of the dataset is legit and 3% is fraudulent, similar to the real-world situation. If we build a model that always classifies a transaction as legit, i.e., it never predicts fraud, its accuracy will still be a high 97%.
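To make this concrete, here is a minimal sketch using scikit-learn; the labels are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 0 = legit, 1 = fraud (97% legit, 3% fraud)
y_true = np.array([0] * 970 + [1] * 30)

# A "model" that always predicts legit, i.e., never flags fraud
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.97, despite catching zero fraud
```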
So we often need other metrics to evaluate our models. Let’s look at some more sophisticated metrics.
Confusion Matrix
The confusion matrix is a critical concept for classification evaluation. Many of the following metrics are derived from the confusion matrix. So it’s essential to understand this matrix before moving on.
Given N classes, a confusion matrix is an N × N table that summarizes the prediction results of a classification model. One axis of the matrix shows the classes/labels predicted by the model, while the other axis shows the actual classes.
Consider a binary problem where we are classifying an animal into either Unicorn or Horse. Based on a classification model’s prediction, there are four possible outcomes:
- True Positive: Predicted to be Positive (Unicorn), and it’s truly Positive.
- True Negative: Predicted to be Negative (Horse), and it’s truly Negative.
- False Positive: Predicted to be Positive (Unicorn), and it’s actually Negative.
- False Negative: Predicted to be Negative (Horse), and it’s actually Positive.
The terms "Positive" and "Negative" refer to the model’s prediction, and the terms "True" and "False" refer to whether that prediction corresponds to the actual observation.
The table below is the confusion matrix for the classification of Unicorn and Horse. Each cell in the matrix records the number of observations that fall into one of the four outcomes.

|                 | Predicted: Unicorn | Predicted: Horse |
|-----------------|--------------------|------------------|
| Actual: Unicorn | True Positive      | False Negative   |
| Actual: Horse   | False Positive     | True Negative    |
To generalize the case, the confusion matrix records the four scenarios as below:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
As you can see, we want the diagonal counts (TP and TN) to be high and the off-diagonal counts (FP and FN) to be low. The two types of errors also have standard names:
- Type I error: False Positive
- Type II error: False Negative
The classification accuracy we introduced earlier can be defined based on the confusion matrix:
Accuracy = # of correct predictions / # of total predictions = (TP + TN) / (TP + TN + FP + FN)
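As a quick illustration, here’s a minimal sketch using scikit-learn’s confusion_matrix; the labels below are hypothetical:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = Unicorn (Positive), 0 = Horse (Negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels [0, 1], scikit-learn orders the cells as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)  # 3 1 1 3 0.75
```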
ROC and AUC
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds.
The threshold determines the boundary between classes when the classifier makes a prediction. For example, an observation with a predicted score above the threshold can be classified as Unicorn, while one with a score below the threshold is classified as Horse. We can choose different classification thresholds for the same model.
Now let’s define TPR and FPR, which are derived from the confusion matrix concept.
True Positive Rate = TPR = TP/P = TP / (TP + FN) = Sensitivity
False Positive Rate = FPR = FP/N = FP / (FP + TN)
The TPR measures the probability of detection, which is why it’s also called sensitivity, while the FPR measures the probability of a false alarm. For a classification model, we need to balance cost and benefit: we want to maximize the TPR while minimizing the FPR.
The chart below shows an example ROC curve, with:
- x-axis being the FPR
- y-axis being the TPR
- each point representing the FPR and TPR at a different classification threshold.

The most “ideal” model has a ROC curve that reaches the top left corner (coordinate (0, 1)) of the plot: an FPR of zero, and a TPR of one. While this is not realistic, we can tell that the larger the two-dimensional Area Under the ROC Curve (AUC or AUROC), the better the model.
The AUC, which ranges between 0 and 1, evaluates the model irrespective of the chosen classification threshold. The AUC of a model equals the probability that the classifier ranks a randomly chosen Positive example higher than a randomly chosen Negative example. A model that predicts everything correctly has an AUC of 1.
Note that another metric, specificity, is related to the FPR; it measures the proportion of actual negatives that are predicted correctly.
Specificity = TN / (TN + FP) = 1 – FPR
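As a minimal sketch using scikit-learn, with made-up labels and predicted scores, we can compute the ROC curve points and the AUC like this:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted probabilities of the Positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

# One (FPR, TPR) pair per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

print(roc_auc_score(y_true, y_scores))  # 0.875 for this toy example
```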
F1 Score (Precision and Recall)
The F1 score is another metric based on the confusion matrix. It measures model performance by combining precision and recall.
What are precision and recall?
Precision = TP / (TP + FP)
Recall = Sensitivity = TPR = TP/ (TP + FN)
Precision measures the proportion of positive predictions that are correct. Recall (the same as TPR) measures the probability of detection, i.e., the proportion of actual positives that are predicted correctly. Ideally we would maximize both metrics, but in practice there is a tradeoff between them.
The traditional F score is the F1 score, which is the harmonic mean of the precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score tells us how precise as well as how robust the model is. An F1 score of 1, the highest possible value, indicates the best model.
We put equal importance on precision and recall in the F1 score. Yet, by weighting recall and precision differently in the calculation, we can generalize the F1 measure to other F scores that fit different business needs.
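Here is a minimal sketch using scikit-learn, with hypothetical labels; the fbeta_score call illustrates the weighted generalization just mentioned:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# Hypothetical labels: 1 = Positive, 0 = Negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))      # TP / (TP + FP) = 0.75
print(recall_score(y_true, y_pred))         # TP / (TP + FN) = 0.75
print(f1_score(y_true, y_pred))             # harmonic mean of the two = 0.75
print(fbeta_score(y_true, y_pred, beta=2))  # F2 puts more weight on recall
```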
Precision-Recall Curve
Like the ROC curve, the precision-recall curve shows the trade-off between two metrics (precision and recall) among different thresholds. Similarly, we can also look at the Area Under the Curve (AUC) for the precision-recall curve.
The precision-recall curve is more informative than the ROC curve when the classes are imbalanced, because it focuses on the model’s performance on the positive (minority) class.
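A minimal sketch using scikit-learn, reusing the made-up labels and scores from the ROC example:

```python
from sklearn.metrics import precision_recall_curve, auc

# Made-up true labels and predicted probabilities of the Positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

# One (precision, recall) pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

print(auc(recall, precision))  # area under the precision-recall curve
```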

Log Loss or Cross Entropy Loss
Log loss or cross-entropy loss is the loss function used in logistic regression and its extensions, such as neural networks. It’s the negative log-likelihood of the logistic model.
For a classification problem with binary output variable y of 0 or 1, and p = P(y = 1), the logarithmic loss or cross-entropy loss function is:
log loss = – log(P(y|p)) = – (y*log(p) + (1-y)*log(1-p))
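As a minimal sketch using scikit-learn, this is the same quantity computed on made-up labels and probabilities:

```python
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities p = P(y = 1)
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.35]

# Average of -(y*log(p) + (1-y)*log(1-p)) over the observations
print(log_loss(y_true, y_prob))
```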
It’s not as intuitive to interpret as the other metrics, but the smaller this loss, the better the model. For a clearer explanation, check out the logistic regression article below.
Further Reading: Logistic Regression for Machine Learning: complete Tutorial
This article explains the logistic regression, the fundamental classification machine learning algorithm, including the cross-entropy loss function.
So far, we’ve been talking about evaluation metrics for classification machine learning models. Let’s look at the basic metrics for regression problems.
Mean Squared Error
Mean Squared Error (MSE), the most common measure for regression prediction performance, is the average of the squared residuals (the difference between the actual value and the predicted value).
Assume we have n observations with actual values Y1, Y2, …, Yn and predicted values Ŷ1, Ŷ2, …, Ŷn. The MSE formula is below:
MSE = ((Y1 – Ŷ1)² + (Y2 – Ŷ2)² + … + (Yn – Ŷn)²) / n
As you can see, the smaller the MSE, the better the predictor fits the data.
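A minimal sketch using scikit-learn, with hypothetical actual and predicted values:

```python
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mean_squared_error(y_true, y_pred))  # average of the squared residuals: 0.375
```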
Mean Absolute Error
Besides squared error, we can also compute the average of the absolute value of residuals. This metric is called Mean Absolute Error (MAE):
MAE = (|Y1 – Ŷ1| + |Y2 – Ŷ2| + … + |Yn – Ŷn|) / n
MAE is easier to interpret and more robust to outliers compared to MSE.
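And the same hypothetical example for MAE:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mean_absolute_error(y_true, y_pred))  # average of the absolute residuals: 0.5
```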
Great!
Those are the most popular evaluation metrics for machine learning models. You should now have a better idea of how to evaluate the performance of your models; the particular choice of metric depends on your business needs.
Leave a comment if you have any questions or anything else you’d like to share.