Effective testing for machine learning systems
In this blog post, we’ll cover what testing looks like for traditional software development, why testing machine learning systems can be different, and some approaches for writing effective tests for machine learning systems. We’ll also clarify the distinction between the closely related roles of evaluation and testing in the model development process. By the end of this blog post, I hope you’re convinced of both the extra work required to effectively test machine learning systems and the value of doing such work.
What’s different about testing machine learning systems?
In traditional software systems, humans write the logic which interacts with data to produce the desired behavior. Our software tests help ensure that this written logic matches the expected behavior.
However, in machine learning systems, humans provide the desired behavior as examples during training, and the model optimization process produces the logic of the system. How do we ensure that this learned logic will consistently produce our desired behavior?
Let’s begin by looking at the best practices for testing traditional software systems and developing high-quality software.
A typical software testing suite will cover:
- unit tests which run on atomic pieces of the codebase and can be run quickly during development,
- regression tests which replicate bugs that we’ve previously encountered and fixed,
- integration tests which are typically longer-running tests that examine higher-level behaviors that leverage multiple components in the codebase,
and follow conventions such as:
- don’t merge code unless all tests are passing,
- always write tests for newly introduced logic when contributing code,
- when contributing a bug fix, be sure to write a test to capture the bug and prevent future regressions.
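As a minimal sketch of the first two kinds of test, consider the snippet below. The `tokenize` helper and the bug it guards against are hypothetical, invented purely for illustration; the tests are written in the plain-assert style that runners such as pytest can collect.

```python
def tokenize(text):
    """Hypothetical helper: split text into lowercase whitespace-separated tokens."""
    return text.lower().split()


def test_tokenize_lowercases_and_splits():
    # Unit test: exercises one atomic function and runs almost instantly.
    assert tokenize("Hello World") == ["hello", "world"]


def test_tokenize_empty_string_returns_empty_list():
    # Regression test: pins down a (hypothetical) previously fixed bug
    # where empty input was mishandled, so the bug cannot quietly return.
    assert tokenize("") == []


if __name__ == "__main__":
    test_tokenize_lowercases_and_splits()
    test_tokenize_empty_string_returns_empty_list()
    print("all tests passed")
```

An integration test would follow the same pattern but exercise several such components together, which is why it usually takes longer to run.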
What’s the difference between model testing and model evaluation?
While reporting evaluation metrics is certainly good practice for quality assurance during model development, I don’t think it’s sufficient. Without a granular report of specific behaviors, we won’t be able to immediately understand the nuance of how behavior may change if we switch over to a new model. Additionally, we won’t be able to track (and prevent) behavioral regressions for specific failure modes that were previously addressed.
This can be particularly dangerous for machine learning systems since failures often happen silently. For example, you might improve the overall evaluation metric but introduce a regression on a critical subset of the data. Or you could unknowingly add a gender bias to the model by including a new dataset during training. We need more nuanced reports of model behavior to identify such cases, which is exactly where model testing can help.
For machine learning systems, we should be running model evaluation and model tests in parallel.
- Model evaluation covers metrics and plots which summarize performance on a validation or test dataset.
- Model testing involves explicit checks for behaviors that we expect our model to follow.
Both of these perspectives are necessary for building high-quality models.
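To make the silent-failure scenario concrete, here is a small sketch in which the overall accuracy improves while one subset regresses badly. The slice names, labels, and predictions are toy values made up for illustration, and `accuracy_by_slice` and `regressed_slices` are hypothetical helper names, not part of any library.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute accuracy overall and per slice.

    records: iterable of (slice_name, label, prediction) tuples.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for slice_name, label, prediction in records:
        for key in ("overall", slice_name):
            total[key] += 1
            correct[key] += int(label == prediction)
    return {key: correct[key] / total[key] for key in total}


def regressed_slices(old_scores, new_scores, tolerance=0.0):
    """Return the slices whose accuracy dropped by more than `tolerance`."""
    return sorted(
        name for name in old_scores
        if new_scores.get(name, 0.0) < old_scores[name] - tolerance
    )


# Toy data: every true label is 1; slice names are invented for illustration.
labels = [("common", 1)] * 8 + [("rare", 1)] * 2
old_predictions = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]  # 5/8 on "common", 2/2 on "rare"
new_predictions = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # 8/8 on "common", 0/2 on "rare"

old_scores = accuracy_by_slice((s, y, p) for (s, y), p in zip(labels, old_predictions))
new_scores = accuracy_by_slice((s, y, p) for (s, y), p in zip(labels, new_predictions))

if __name__ == "__main__":
    print("overall:", old_scores["overall"], "->", new_scores["overall"])
    print("regressed slices:", regressed_slices(old_scores, new_scores))
```

An evaluation report that shows only the overall number would favor the new model here, whereas a check on the per-slice scores flags the regression on the `rare` subset.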
How do you write model tests?
In my opinion, there are two broad classes of model tests that we’ll want to write.
- Pre-train tests allow us to catch some bugs early on and short-circuit a training job.
- Post-train tests use the trained model artifact to examine behaviors for a variety of important scenarios that we establish.
There are some tests that we can run without needing trained parameters. These tests include:
- check the shape of your model output and ensure it aligns with the labels in your dataset
- check the output ranges and ensure they align with our expectations (e.g. the output of a classification model should be a distribution with class probabilities that sum to 1)
- ensure a single gradient step on a batch of data yields a decrease in your loss
- make assertions about your datasets
- check for label leakage between your training and validation datasets
The main goal here is to catch some errors early so we can avoid a wasted training job.
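The checks listed above can be sketched without any trained parameters. The example below uses a toy softmax classifier written in plain Python as a stand-in for a real model; the function names (`predict_proba`, `gradient_step`, `check_...`) are hypothetical, and the leakage check assumes that "leakage" means identical inputs appearing in both the training and validation sets.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, x):
    """Toy linear classifier: `weights` holds one weight vector per class."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])


def cross_entropy(weights, batch):
    return -sum(math.log(predict_proba(weights, x)[y]) for x, y in batch) / len(batch)


def gradient_step(weights, batch, lr=0.1):
    """One step of gradient descent on the mean cross-entropy loss."""
    grads = [[0.0] * len(weights[0]) for _ in weights]
    for x, y in batch:
        probs = predict_proba(weights, x)
        for c, p in enumerate(probs):
            error = p - (1.0 if c == y else 0.0)
            for j, value in enumerate(x):
                grads[c][j] += error * value / len(batch)
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grads)]


def check_output_shape_and_range(weights, batch, n_classes):
    for x, _ in batch:
        probs = predict_proba(weights, x)
        assert len(probs) == n_classes                  # shape matches the label space
        assert all(0.0 <= p <= 1.0 for p in probs)      # valid probabilities
        assert math.isclose(sum(probs), 1.0, abs_tol=1e-9)


def check_single_step_decreases_loss(weights, batch):
    before = cross_entropy(weights, batch)
    after = cross_entropy(gradient_step(weights, batch), batch)
    assert after < before


def check_no_overlap(train, validation):
    # Assumption: leakage here means the same input appears in both splits.
    train_inputs = {tuple(x) for x, _ in train}
    assert not any(tuple(x) in train_inputs for x, _ in validation)


# Untrained weights and toy data, made up for illustration.
initial_weights = [[0.0, 0.0] for _ in range(3)]
train_batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
validation_batch = [([2.0, 0.0], 0), ([0.0, 2.0], 1)]

if __name__ == "__main__":
    check_output_shape_and_range(initial_weights, train_batch, n_classes=3)
    check_single_step_decreases_loss(initial_weights, train_batch)
    check_no_overlap(train_batch, validation_batch)
    print("pre-train checks passed")
```

With a real framework the same checks would wrap the actual model and data loaders, but the assertions themselves stay this simple.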
However, in order for us to understand model behaviors, we’ll need to test against trained model artifacts. These tests aim to interrogate the logic learned during training and provide us with a behavioral report of model performance.
Invariance Tests
Invariance tests allow us to describe a set of perturbations we should be able to make to the input without affecting the model’s output. We can use these perturbations to produce pairs of input examples (original and perturbed) and check for consistency in the model predictions. This is closely related to the idea of data augmentation, where we apply perturbations to inputs during training and preserve the original label.
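Here is one way such a test might look for a sentiment classifier, where swapping a person’s name should not change the prediction. The keyword-based `predict_sentiment` function is a toy stand-in for a trained model, and the names and sentences are invented for illustration.

```python
POSITIVE_WORDS = {"great", "love", "excellent"}
NEGATIVE_WORDS = {"terrible", "hate", "awful"}


def predict_sentiment(text):
    """Toy stand-in for a trained model: counts sentiment keywords."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failures(model, examples, perturb):
    """Return (original, perturbed) pairs where the prediction changed."""
    failures = []
    for text in examples:
        perturbed = perturb(text)
        if model(text) != model(perturbed):
            failures.append((text, perturbed))
    return failures


def swap_name(text):
    # Perturbation: replace one person's name with another.
    return text.replace("Mark", "Samantha")


examples = [
    "Mark was great.",
    "Mark said the food was terrible.",
    "I met Mark yesterday.",
]

if __name__ == "__main__":
    failures = invariance_failures(predict_sentiment, examples, swap_name)
    assert failures == [], f"predictions changed under name swap: {failures}"
    print("invariance test passed")
```

Returning the failing pairs rather than a bare pass/fail makes the result usable as a behavioral report, since each failure points at a concrete input to inspect.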
Directional Expectation Tests
Directional expectation tests, on the other hand, enable us to define a set of perturbations to the input which should have an expected effect on the model output.
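As an illustrative sketch, suppose the model predicts house prices. Holding everything else fixed, adding a bathroom should not lower the predicted price, and shrinking the floor area should not raise it. The linear `predict_price` function below is a toy stand-in for a trained model, and its coefficients and feature names are made up.

```python
def predict_price(house):
    """Toy stand-in for a trained regression model."""
    return 50_000 + 150 * house["sqft"] + 10_000 * house["bathrooms"]


def moves_in_expected_direction(model, example, feature, delta, expected_sign):
    """Perturb one feature and check the sign of the change in the output.

    expected_sign is +1 if the output should not decrease, -1 if it should
    not increase.
    """
    perturbed = dict(example)
    perturbed[feature] = example[feature] + delta
    change = model(perturbed) - model(example)
    return change * expected_sign >= 0


house = {"sqft": 1200, "bathrooms": 2}

if __name__ == "__main__":
    # More bathrooms should not decrease the predicted price.
    assert moves_in_expected_direction(predict_price, house, "bathrooms", +1, +1)
    # Less floor area should not increase the predicted price.
    assert moves_in_expected_direction(predict_price, house, "sqft", -200, -1)
    print("directional expectation tests passed")
```

In practice the same check would be run over many examples, since a learned model can satisfy the expectation in one region of the input space and violate it in another.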
Minimum Functionality Tests (aka data unit tests)
Just as software unit tests aim to isolate and test atomic components in your codebase, data unit tests allow us to quantify model performance for specific cases found in your data.
This allows you to identify critical scenarios where prediction errors lead to high consequences. You may also decide to write data unit tests for failure modes that you uncover during error analysis; this allows you to “automate” searching for such errors in future models.
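A data unit test can be sketched as a small, curated set of cases per scenario plus a minimum accuracy the model must reach on each. In the example below, the keyword-based `predict_sentiment` function is a toy stand-in for a trained model, and the suite names, cases, and thresholds are all invented for illustration.

```python
POSITIVE_WORDS = {"great", "love", "excellent"}
NEGATIVE_WORDS = {"terrible", "hate", "awful"}


def predict_sentiment(text):
    """Toy stand-in for a trained model: counts sentiment keywords."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def run_data_unit_tests(model, suites, thresholds):
    """Return {suite_name: (accuracy, passed)} for each curated suite."""
    report = {}
    for name, cases in suites.items():
        accuracy = sum(model(text) == label for text, label in cases) / len(cases)
        report[name] = (accuracy, accuracy >= thresholds[name])
    return report


# Each suite isolates one scenario; "negation" mimics a failure mode
# that might be discovered during error analysis.
suites = {
    "short_positive": [("Great.", "positive"), ("Love it", "positive")],
    "negation": [("Not great", "negative"), ("I do not love it", "negative")],
}
thresholds = {"short_positive": 1.0, "negation": 1.0}

report = run_data_unit_tests(predict_sentiment, suites, thresholds)

if __name__ == "__main__":
    for name, (accuracy, passed) in report.items():
        print(f"{name}: accuracy={accuracy:.2f} {'PASS' if passed else 'FAIL'}")
```

The toy model passes the `short_positive` suite but fails `negation`, because it ignores the word "not". Keeping that suite in place means any future model is automatically checked against the same failure mode.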
In traditional software tests, we typically organize our tests to mirror the structure of the code repository. However, this approach doesn’t translate well to machine learning models, since the learned logic is structured by the parameters of the model rather than by files or modules.
Machine learning systems are more difficult to test because we’re not explicitly writing the logic of the system. Nonetheless, automated testing is still a crucial tool for the development of high-quality software systems. These tests can provide us with a behavioral report of trained models, which can serve as a systematic approach to error analysis.
Developing machine learning models also relies on a large amount of “traditional software development” in order to prepare data inputs, design feature representations, perform data augmentation, orchestrate model training, expose interfaces to external systems, and much more. Thus, effective testing for machine learning systems requires both a traditional software testing suite (for model development infrastructure) and a model testing suite (for trained models).
Need some guidance in testing machine learning systems? Consider teaming up with a QA services provider like TestUnity. Our team of testing experts specializes in QA and has years of experience implementing tests with different testing software. Partner with our QA engineers, who can help your team adopt the most suitable testing practices. Get in touch with a TestUnity expert today.