Model Evaluation Metrics in Machine Learning

İrem Tanrıverdi

7 min read · Apr 15, 2021

Introduction

Machine learning has become very popular nowadays. We use machine learning to make inferences about new situations from historical data, and there are many algorithms for doing this: Linear Regression, Logistic Regression, Decision Trees, Naive Bayes, K-Means, and Random Forest are among the most commonly used. We rarely try just one algorithm when predicting from data. Often we fit several algorithms and continue with the one that makes better predictions. How do we decide which algorithm works better? Model evaluation metrics help us evaluate a model's accuracy and measure the performance of the trained model; in particular, they tell us how well the model generalizes to unseen data. By using different metrics for performance evaluation, we can improve the overall predictive power of our model before we roll it out for production on unseen data. Choosing the right metric is therefore critical, and different applications call for different metrics. Let's examine the evaluation metrics for assessing the performance of a machine learning model, which is a crucial step in any data science project because it aims to estimate the generalization accuracy of the model on future data.

  1. Classification Metrics

When the response is binary (taking only two values, e.g., 0 = failure and 1 = success), we use classification models such as logistic regression, decision trees, random forests, XGBoost, convolutional neural networks, etc. To evaluate these models, we use classification metrics.

1.1. Confusion Matrix (Accuracy, Sensitivity, and Specificity)

A confusion matrix tabulates the prediction results of a binary classifier against the true labels and is often used to describe the performance of a classification model. Its four cells are the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From these counts we compute Accuracy = (TP + TN) / (TP + TN + FP + FN), Sensitivity (true positive rate) = TP / (TP + FN), and Specificity (true negative rate) = TN / (TN + FP).

Figure 1. Confusion matrix table
Figure 2. Confusion table
  • Let's look at a sample R implementation of the confusion matrix, sketched below.
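A minimal sketch of what such an implementation might look like, assuming the PimaIndiansDiabetes data from the mlbench package and a logistic regression classifier (the original post's exact data and code are not shown; the numbers below refer to the author's reported output):

```r
library(mlbench)  # for the PimaIndiansDiabetes data (an assumed dataset)
library(caret)    # for createDataPartition() and confusionMatrix()

data(PimaIndiansDiabetes)
set.seed(123)

# 70/30 train/test split
idx   <- createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.7, list = FALSE)
train <- PimaIndiansDiabetes[idx, ]
test  <- PimaIndiansDiabetes[-idx, ]

# Logistic regression classifier, predictions as probabilities
fit  <- glm(diabetes ~ ., data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Classify with a 0.5 probability cutoff
pred <- factor(ifelse(prob > 0.5, "pos", "neg"),
               levels = levels(test$diabetes))

# Confusion matrix with Accuracy, Sensitivity, and Specificity
confusionMatrix(pred, test$diabetes, positive = "pos")
```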

We see all of Accuracy, Sensitivity, and Specificity in the confusion matrix output.

  • Accuracy is 0.76.
  • Sensitivity is 0.90, which is the ability of the test to correctly classify an individual who has diabetes as “have diabetes”.
  • Specificity is 0.54, which is the ability of the test to correctly classify an individual who does not have diabetes as “do not have diabetes”. In other words, the model is wrong 46% of the time when predicting for people who really do not have the disease.

1.2. Precision

When we have class imbalance, accuracy can become an unreliable metric for measuring performance, so we also need to look at class-specific performance metrics. Precision is one such metric: it is the positive predictive value, i.e., the proportion of predicted positive cases that are actually positive, Precision = TP / (TP + FP).

1.3. Recall (Sensitivity)

Recall is another important metric: it is the proportion of actual positive cases that are correctly identified, Recall = TP / (TP + FN). Note that recall is the same quantity as sensitivity.

1.4. F1-score

The F1 score combines two important error metrics, Precision and Recall: it is their harmonic mean, F1 = 2 × (Precision × Recall) / (Precision + Recall). This makes it a useful single-number summary for binary classification on imbalanced datasets.

  • We can see the confusion table alone by extracting the table element of the result, as shown below.
  • By pulling the byClass element of the resulting confusion matrix object, we can also see the F1 score, Precision, and Recall.
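Continuing the sketch above, with the same hypothetical model and test set:

```r
# mode = "prec_recall" makes caret report Precision/Recall/F1
cm <- confusionMatrix(pred, test$diabetes, positive = "pos",
                      mode = "prec_recall")

cm$table    # the confusion table alone
cm$byClass  # class-specific metrics, including Precision, Recall, and F1
```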

1.5. Receiver Operating Characteristic (ROC) Curve

Measuring the area under the ROC curve (AUC) is also a very useful way to evaluate a model. The ROC curve shows the performance of a binary classifier as a function of its cut-off threshold: it plots the sensitivity (true positive rate) against the false positive rate for various threshold values.

We can write a function that makes predictions at different probability cutoffs and then reports the accuracy, sensitivity, and specificity of each resulting classifier, as in the sketch below.
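A hypothetical helper along these lines, reusing prob and test from the earlier sketch (eval_cutoff is an illustrative name, not from the original post):

```r
# Accuracy, sensitivity, and specificity at a given probability cutoff
eval_cutoff <- function(prob, truth, cutoff) {
  pred <- factor(ifelse(prob > cutoff, "pos", "neg"), levels = levels(truth))
  cm   <- confusionMatrix(pred, truth, positive = "pos")
  c(cutoff      = cutoff,
    accuracy    = unname(cm$overall["Accuracy"]),
    sensitivity = unname(cm$byClass["Sensitivity"]),
    specificity = unname(cm$byClass["Specificity"]))
}

# Compare several cutoffs; each row is one classifier
t(sapply(c(0.3, 0.5, 0.7), eval_cutoff, prob = prob, truth = test$diabetes))
```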

  • The higher the AUC, the better the model: an AUC close to 1 corresponds to high sensitivity and specificity across thresholds.
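One common way to draw the curve and compute the AUC in R is the pROC package (an assumption here; the original post's code is not shown):

```r
library(pROC)  # for roc() and auc()

# ROC curve from the predicted probabilities on the test set
roc_obj <- roc(response = test$diabetes, predictor = prob)
plot(roc_obj, legacy.axes = TRUE)  # x-axis drawn as 1 - specificity
auc(roc_obj)                       # area under the curve
```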

1.6. Log Loss

Log Loss quantifies the quality of a classifier by penalizing false classifications: its value reflects how far the predicted probabilities are from the actual labels, so confident but wrong predictions are punished most heavily.
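A minimal sketch of binary log loss, reusing the predicted probabilities prob and the test labels from the earlier example:

```r
# 0/1 labels and clipped probabilities (to avoid log(0))
y   <- as.numeric(test$diabetes == "pos")
eps <- 1e-15
p   <- pmin(pmax(prob, eps), 1 - eps)

# Binary log loss: average negative log-likelihood of the true labels
-mean(y * log(p) + (1 - y) * log(1 - p))

# Equivalently, LogLoss() from the "MLmetrics" package
MLmetrics::LogLoss(y_pred = p, y_true = y)
```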

2. Regression Related Metrics

When the response is continuous (the target variable can take any value on the real line), we use regression models such as linear regression, random forests, XGBoost, convolutional neural networks, recurrent neural networks, etc. To evaluate these models, we use regression-related metrics.

2.1. Mean Absolute Error (MAE)

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction: MAE = (1/n) Σ |y_i − ŷ_i|. It is the average over the test sample of the absolute differences between prediction and actual observation, where all individual differences have equal weight.

2.2. Mean Square Error (MSE)

MSE tells us how close a regression line is to a set of points: it is the average squared difference between the predicted and actual values, MSE = (1/n) Σ (y_i − ŷ_i)². It is one of the most popular regression-related metrics.

2.3. Root Mean Square Error (RMSE)

The root mean square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. It is the square root of the MSE: RMSE = √MSE.

  • Let's look at a sample R implementation of the regression-related metrics, sketched after this list.
  • Mean square error is the mean of the squared differences between actual and predicted values. The residuals of a model give us the differences between actual and predicted values, so we can pull the residuals from the model and take the mean of their squares.
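A minimal sketch, assuming a linear regression on R's built-in mtcars data (the original post's data and model are not shown):

```r
# Fit a simple linear regression model
model <- lm(mpg ~ wt + hp, data = mtcars)

# MSE from the model residuals (residual = actual - fitted)
mean(residuals(model)^2)
```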

Or we can obtain the predicted values and take the differences manually:
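```r
# The same MSE computed manually from the predictions
pred <- predict(model, newdata = mtcars)
mean((mtcars$mpg - pred)^2)
```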

Or we can use the MSE() function in the “MLmetrics” library:
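```r
# MSE() from the "MLmetrics" package, using the predictions above
library(MLmetrics)
MSE(y_pred = pred, y_true = mtcars$mpg)
```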

  • Mean absolute error is the mean of the absolute differences between actual and predicted values. Since the residuals of the model give us these differences, we can pull the residuals from the model and take the mean of their absolute values, as below.
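```r
# MAE from the model residuals
mean(abs(residuals(model)))
```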

Or we can use the mae() function in the “Metrics” library:
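```r
# mae() from the "Metrics" package
library(Metrics)
mae(actual = mtcars$mpg, predicted = pred)
```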

  • Root mean square error is the square root of the mean squared difference between actual and predicted values. Again we can pull the residuals from the model and take the square root of their mean square, as below.
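```r
# RMSE is the square root of the mean squared residual
sqrt(mean(residuals(model)^2))
```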

Or we can use the rmse() function in the “Metrics” library:
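```r
# rmse() from the "Metrics" package
rmse(actual = mtcars$mpg, predicted = pred)
```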

Conclusion

To conclude, in this article we examined some of the popular machine learning metrics, both classification metrics and regression-related metrics, used for evaluating the performance of classification and regression models. We also discussed why choosing the right metric is important for obtaining good predictions.


