
Welcome to our Model Evaluation video series, led by Guillermo Germade, AI/ML & Backend Developer at Azumo.
This 7-part series guides you through one of the most critical phases of the machine learning workflow: evaluating models. From train/test splits to precision, recall, F1 scores, and ROC AUC curves, Guillermo explains how to measure whether models are “good enough” — both from a technical and a business perspective.
Whether you’re a project manager wanting to interpret results or a developer looking to compare models, these videos provide the tools to make sense of performance metrics and apply them to real-world problems like fraud detection, medical diagnostics, and spam filtering.
Introduction to Model Evaluation
Guillermo opens the series by explaining why model evaluation matters for both practitioners and business stakeholders. After a high-level overview of AI in the first training, this session zooms into supervised learning models (classification and regression), setting the stage with use cases and the metrics used to measure performance.
What Is Model Evaluation?
Here, Guillermo breaks down what evaluation means: testing models on unseen data to estimate how well their predictions will hold up on new cases. He explains why different models trained on the same dataset can perform differently and why, in practice, data scientists train multiple models and select the best one based on evaluation results rather than theory alone.
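As a minimal sketch of that model-selection idea (this is not code from the video; the synthetic dataset and the two candidate models are assumptions chosen for illustration), you can train several classifiers on the same data and keep whichever scores best on a held-out test set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in dataset (an assumption for illustration only).
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Two candidate models trained on exactly the same training data.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Evaluate on data the model has never seen during training.
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores, "-> selected model:", best)
```

The selection is driven entirely by measured performance on unseen data, which is the point of the lesson: the "better" model is the one the evaluation says is better, not the one theory suggests.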
The Data Science Process & Why Evaluation Matters
Using the example of fraud detection in banking, Guillermo walks through the entire data science process — from business understanding and data preparation to model training and deployment. He highlights how evaluation is the key stage where project managers can meaningfully engage with data scientists and decide whether models meet business goals. The section also touches on data requirements across domains, the role of structured vs. unstructured data, and introduces MLOps for monitoring deployed models.
Understanding Train/Test Split
This lesson introduces the train/test split, the foundation of model evaluation. Guillermo explains why separating data into training and testing sets prevents inflated performance estimates and common beginner mistakes, using credit card transaction data as the example.
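A minimal sketch of that split, assuming a synthetic placeholder dataset rather than the actual credit card data used in the lesson:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic placeholder for labeled transactions: features X, fraud labels y
# (~3% positive class, an assumed imbalance for illustration).
X, y = make_classification(n_samples=5000, n_features=8, weights=[0.97, 0.03], random_state=0)

# Hold out 20% of the rows for testing; stratify so the fraud rate stays
# similar in both splits. The model never sees the test set during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(len(X_train), "training rows,", len(X_test), "test rows")
```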
Decoding the Confusion Matrix: Simple Guide to AI Model Accuracy
Here, Guillermo introduces classification problems and the confusion matrix. He explains core metrics — precision, recall, and accuracy — and why trade-offs between false positives and false negatives matter. With real-world examples like fraud detection, he shows how businesses often prioritize recall (catching nearly all fraud) over precision (avoiding false alarms), depending on their objectives.
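To make those metrics concrete, here is a small sketch with invented labels (1 = fraud, 0 = legitimate) showing how the confusion matrix, accuracy, precision, and recall are typically computed with scikit-learn:

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Invented ground-truth labels and model predictions, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) -- how many alarms were real
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN) -- how much fraud was caught
```

In the fraud example from the video, raising recall (catching more fraud) usually costs precision (more false alarms), which is exactly the trade-off the business has to weigh in.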
F1 Score and ROC AUC Curve Explained
This section builds on the basics by introducing the F1 score, a single number balancing precision and recall, and the ROC AUC curve, a visual tool for comparing models. Guillermo shows how to interpret these metrics and why the area under the curve is such a powerful measure of overall performance.
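A brief sketch of how these two metrics are usually computed; the model and the synthetic data below are assumptions, not the examples from the video:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

# Assumed imbalanced synthetic dataset for illustration.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# F1 folds precision and recall into a single number (their harmonic mean).
print("F1     :", f1_score(y_test, model.predict(X_test)))

# ROC AUC works on predicted probabilities rather than hard labels: it summarizes
# the true-positive-rate vs. false-positive-rate trade-off across all thresholds.
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```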
Hands-On Model Evaluation Demo & Closing Remarks
In the final part, Guillermo demonstrates a live coding example using Python’s Scikit-learn. He trains a logistic regression model on a synthetic dataset of credit card transactions, calculates accuracy, precision, recall, and confusion matrices, and interprets the trade-offs between false positives and false negatives. He closes by showing how evaluation principles generalize across business domains and previews the next training on regression evaluation and different model types.
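The sketch below approximates the shape of that demo under stated assumptions (a synthetic, imbalanced dataset generated with `make_classification` stands in for the credit card transactions; this is not Guillermo's actual notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic, imbalanced "credit card transactions": roughly 3% of rows are fraud.
X, y = make_classification(
    n_samples=10_000, n_features=12, n_informative=6,
    weights=[0.97, 0.03], random_state=7,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

# Train a logistic regression classifier and evaluate it on the held-out set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# The confusion matrix exposes the false positive / false negative trade-off;
# the report adds accuracy, precision, recall, and F1 per class.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["legitimate", "fraud"]))
```

Swapping in a different business domain mostly means changing the data and the costs you attach to false positives and false negatives; the evaluation workflow itself stays the same, which is the closing point of the series.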
To Sum Up
Stay tuned for the continuation of this series, where we’ll move deeper into regression evaluation and explore a wider range of models. You’ll learn which metrics to use, how to interpret results, and how to improve your models — ensuring that your AI solutions deliver both technical accuracy and real-world value.