Automated tests are strange creatures: they are code that validates code, written by fallible beings. The messier the system under test, the messier the code required to test it.
I have grown increasingly fond of property-based testing as a better technique for the automated verification of software, and in the following paragraphs I am going to explain why.
Let us start with some observations about automated tests. Firstly, tests are a way of encoding how a system is supposed to behave. These expectations fall into two categories: some are considered part of the interface of a subsystem, something that the user of the system can rely on; others are behaviour that the system currently exhibits but that is not part of the contract. The latter are commonly called “implementation details”. It is widely considered good practice to test only contracts, not implementation details.
There are multiple reasons for this: Code changes, and every time it does, the tests potentially need to be adapted. Interfaces, by definition, change less frequently than implementations, making tests around the interfaces more stable. A second reason is that tests that purely focus on the interface of a system can be considered documentation of said system. Someone looking for an example of how to use a component could look at a test.
That being said, many real-world systems I have come across do not clearly define their contract, which makes the distinction blurry.
This leads us to a second observation: Some systems are easier to test than others. If a system’s contract is clearly stated and easy to understand, writing tests is easier. If a system’s behaviour depends on global state, defining tests is more difficult. If the system is an isolated component, testing is easy. If the system depends on a complex set of dependencies, not so much.
We can also observe that tests are themselves code and can thus contain mistakes. This is particularly true when the tests grow complex: The more difficult the interface of a system, the more complex the test code becomes, and the higher the likelihood of bugs in the test.
Lastly, it is a common occurrence that tests are incomplete and do not validate the edge cases that will end up getting you into trouble.
This is particularly true if you are writing tests for your own code: It is not crazy to think that the same blind spots that caused you to miss an edge case in the implementation will also cause you to miss it in your test.
Property Based Testing
One way of looking at traditional unit tests is that they are example-based: For a specific input to the system you expect a specific output (or behaviour).
Property-based testing turns this around: Instead of validating particular inputs, you define properties that the system must always satisfy:
- The result set of a function call can never be empty.
- A function should return the same value, irrespective of the order of the elements in a collection it receives as input.
- The function should not throw an exception on any input.
- The output of the function should always be a positive number.
A property testing library like ScalaCheck will then randomly generate inputs and throw them at your test. If any of the inputs do not satisfy the property, a counterexample has been found and the test fails.
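The generate-and-check loop at the heart of such a library can be sketched in a few lines. The following is a minimal sketch in plain Python, not ScalaCheck's actual API; the names `check_property` and `gen_int_list` are illustrative:

```python
import random

def check_property(prop, generate, num_trials=100, seed=0):
    """Run `prop` against `num_trials` randomly generated inputs.

    Returns None if all trials pass, otherwise the first
    counterexample found."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        value = generate(rng)
        if not prop(value):
            return value  # counterexample found: the test fails
    return None

def gen_int_list(rng):
    """Generate a random list of integers (length 0 to 50)."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]

# A true property: reversing a list twice yields the original list.
def involutive_reverse(xs):
    return list(reversed(list(reversed(xs)))) == xs

assert check_property(involutive_reverse, gen_int_list) is None

# A false property ("sums are never negative") is quickly refuted.
counterexample = check_property(lambda xs: sum(xs) >= 0, gen_int_list)
assert counterexample is not None and sum(counterexample) < 0
```

Real libraries add far more machinery (composable generators, shrinking, configurable trial counts), but the core loop is exactly this: generate, check, report the first input that falsifies the property.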
This has multiple benefits. The first thing I want to note is that this approach removes some blind spots. It hits edge cases that you might not think of. For example: How often do you account for integer overflow in your own unit tests?
Secondly, writing properties also serves as a form of documentation. In a sense it is taking “testing the interface” to its logical conclusion: Instead of hard-coding selected interactions with the system under test it encodes exactly the interface that you expect. An interface is more than a return type, it is also all the implicit or explicit assumptions you make about the behaviour of a system that are not encoded in the types. Testable properties are an opportunity to explicitly formulate these invariants.
Lastly, I think that it encourages good API design: For property tests to work, you need to write code that has properties. If you cannot clearly state the properties of the function you wrote, it probably also fails the “what does this thing do” test.
Conversely, code that has properties is also more testable, making the code required for automated verification shorter and less prone to mistakes.
Randomness In Unit Tests
Some people might be concerned about the idea of introducing a source of randomness into their tests. Having experienced software builds that fluctuate between red and green without anyone really feeling inclined to fix it, I understand the concern. However, property testing frameworks come with a number of tools that, if combined with an appropriate change in mindset, solve most of the challenges with randomness.
First of all, let us note that using a random number generator does not automatically imply non-deterministic behaviour: As long as you initialize your random number generator with the same seed value, you will get the same results.
ScalaTest will print out the seed that was used on a failed run. Imagine you observe a failure during a CI run: you can copy this seed from the build log and explicitly initialize your test with it to reproduce the failure locally.
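As a sketch of the mechanism in plain Python (the function name and setup are illustrative, not ScalaTest's API): an explicitly seeded generator produces identical inputs on every run, so a seed copied from a build log makes the failing run reproducible.

```python
import random

def run_trials(seed, num_trials=100):
    """Generate test inputs deterministically from an explicit seed."""
    rng = random.Random(seed)  # explicit seed => identical inputs every run
    return [rng.randint(0, 10**6) for _ in range(num_trials)]

# The same seed reproduces the exact same inputs, e.g. locally
# after copying the seed from a failed CI log.
assert run_trials(seed=42) == run_trials(seed=42)

# A different seed explores a different slice of the input space.
assert run_trials(seed=42) != run_trials(seed=43)
```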
A second problem with randomness is that the counterexamples found are not necessarily intuitive for the human debugging the problem. To mitigate this issue, most property testing libraries implement the concept of “shrinking” – iteratively simplifying a counterexample while ensuring that the result still fails the test. That way a list of 300 elements might turn into a list of length 3.
That being said, I found that implementing good heuristics for shrinking is a tricky affair and in practice you might still be confronted with rather unwieldy counterexamples.
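A greedy shrinker for lists can be sketched as follows. This is a simplified illustration of the idea, not any particular library's implementation: repeatedly try structurally smaller candidates and keep only those that still fail the property.

```python
def shrink_list(xs, fails):
    """Greedily shrink a failing list `xs`.

    `fails(candidate)` returns True while the candidate still fails
    the property under test. Tries dropping one element at a time
    and keeps any smaller list that still fails."""
    current = list(xs)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate
                improved = True
                break
    return current

# Property under test: "the list contains no element greater than 100".
# A 300-element counterexample shrinks down to the single offending element.
big = list(range(200)) + [999] + list(range(99))
fails = lambda xs: any(x > 100 for x in xs)
assert shrink_list(big, fails) == [999]
```

Real shrinkers also simplify the elements themselves (e.g. shrinking 999 toward 101, the smallest value that still fails) and use smarter strategies than one-at-a-time deletion, which is where the tricky heuristics come in.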
In my experience, the biggest challenge with randomness in property tests is managing the distribution of inputs. In practice it may happen that a lot of the testing time is spent on inputs that are not very interesting. Imagine implementing a method that finds the intersection of two line segments. Naively generating input can quickly lead to many tests in which the lines do not touch at all. Interesting examples are those where the segments barely touch and those where the segments have multiple points in common. Making sure that these examples are part of the input sometimes requires some explicit consideration. So, even if it is true that property testing can remove some blind spots, you will still need to think about edge cases that may not yet be covered.
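Biasing the generator toward interesting cases can be as simple as mixing distributions. The sketch below is hypothetical (the helper names and the 40/30/30 split are my own choices): most of the time the second segment is derived from the first, so the pair is likely to touch or overlap.

```python
import random

def gen_segment(rng):
    """An unconstrained random line segment as a pair of 2D points."""
    p = (rng.uniform(-10, 10), rng.uniform(-10, 10))
    q = (rng.uniform(-10, 10), rng.uniform(-10, 10))
    return (p, q)

def gen_segment_pair(rng):
    """Generate two segments, biased toward touching/overlapping cases."""
    a = gen_segment(rng)
    roll = rng.random()
    if roll < 0.4:
        # Overlapping case: second segment lies on the same line as the
        # first, sharing the parameter range [t1, 1] with it.
        (x1, y1), (x2, y2) = a
        t1, t2 = rng.uniform(0, 1), rng.uniform(1, 2)
        b = ((x1 + t1 * (x2 - x1), y1 + t1 * (y2 - y1)),
             (x1 + t2 * (x2 - x1), y1 + t2 * (y2 - y1)))
    elif roll < 0.7:
        # Touching case: second segment starts exactly at the
        # first segment's endpoint.
        b = (a[1], gen_segment(rng)[0])
    else:
        # Unconstrained case: the segments usually miss each other.
        b = gen_segment(rng)
    return a, b
```

With a naive `(gen_segment, gen_segment)` pairing, nearly all inputs would be disjoint segments; the weighted generator spends most of its budget on the cases where an intersection routine can actually go wrong.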
The Naive-Complex Pattern
A particular pattern that works well with property-based testing is implementing a piece of functionality twice: once in an easy-to-understand but inefficient way, and once efficiently. The test then validates that the two implementations produce the same output on any input.
This works particularly well for algorithms and data structure operations. You might implement a set union as a naive removal of duplicates on an array, for example, before implementing a more complex hash-based approach.
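A sketch of the pattern, with illustrative names: a naive quadratic union serves as the obviously-correct reference, and a property check validates that a hash-based implementation agrees with it on random inputs.

```python
import random

def union_naive(xs, ys):
    """Obviously correct but O(n^2): append while skipping duplicates."""
    result = []
    for x in xs + ys:
        if x not in result:  # linear scan for every element
            result.append(x)
    return result

def union_fast(xs, ys):
    """Efficient hash-based union (the implementation under test)."""
    seen = set()
    result = []
    for x in xs + ys:
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result

# Property: on any input, the two implementations agree.
rng = random.Random(0)
for _ in range(500):
    xs = [rng.randint(0, 20) for _ in range(rng.randint(0, 30))]
    ys = [rng.randint(0, 20) for _ in range(rng.randint(0, 30))]
    assert union_fast(xs, ys) == union_naive(xs, ys)
```

The reference implementation is allowed to be slow and simple; its only job is to be easy to convince yourself of.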
If you have been interviewing at any technology company recently, you are probably familiar with the idea of implementing a brute-force solution to an algorithm challenge before inventing a better approach. Property-based tests can validate that the two are indeed equivalent.
I have noticed that there are areas where this approach tends to run into problems. One observation is that most of day-to-day software development is not implementing hash-based set union algorithms but rather gluing together various other subsystems. Code that just takes some input, transforms it and then sends the data to some external service is definitely testable – but the result will be much less elegant than some of the other examples we saw above.
Another limitation is tests for slow operations. In the examples above we accepted running the code through many not-very-relevant examples in order to find the one critical counterexample. With slow, long-running operations you may not have that luxury: Either you very carefully tune the distribution of the input, or you simply hand-pick examples as in regular unit testing.
Clearly, property based testing is just another tool to complement your testing strategy. There are many practical situations where it may not be applicable.
I want to also mention a couple of directions that I think could be worth exploring. What if we extrapolate this testing strategy just a little bit further?
One observation is that the quality of the test result can depend on the number of examples thrown at the system under test. This can be seen as a trade-off against build times. However, it does not have to: The quantity of test cases analyzed can scale almost linearly through parallel execution. A simple idea would be to spin up 50 containers during a CI build, each running the tests with a different seed, to achieve 50 times better coverage.
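Within a single process the same idea can be sketched with a thread pool: run the same property under many different seeds concurrently, so each worker explores a disjoint slice of the input space. The names below are hypothetical, and the toy property (sorting is idempotent) stands in for a real one.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def check_with_seed(seed, num_trials=1000):
    """Check the property 'sorting is idempotent' under one seed.

    Returns the seed of a failing run (so it can be reported and
    reproduced), or None if all trials pass."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 50))]
        if sorted(sorted(xs)) != sorted(xs):
            return seed
    return None

# Fan the same property out over 50 seeds in parallel; total coverage
# scales with the number of workers, not with wall-clock time.
with ThreadPoolExecutor(max_workers=8) as pool:
    failures = [s for s in pool.map(check_with_seed, range(50)) if s is not None]
assert failures == []
```

The CI variant is the same picture with containers instead of threads: each job gets its own seed, and any failing job reports its seed for local reproduction.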
A more advanced idea that comes to mind is guided failure analysis: We can instrument our code to measure which lines have been hit during a test run. This technique is commonly used to measure code coverage. But what if we looked at the lines hit during failing runs and lines hit during passing runs? Property-based testing ensures that most lines will be hit by different runs corresponding to different inputs.
Lines of code that were not part of many passing runs but part of several failing runs may be good candidates to start an investigation into the failure. The statistics behind this intuition are a bit more complicated than that, so I will leave this for a future blog post.
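The core intuition can nevertheless be sketched with simple hit counts: for each line, compare how often it appeared in failing versus passing runs. The scoring formula below (failing hits over total hits) is one simple choice among many, and the run data is a stand-in for what a coverage tool would record.

```python
def suspiciousness(passing_runs, failing_runs):
    """Score each line by failing_hits / (failing_hits + passing_hits).

    `passing_runs` and `failing_runs` are lists of sets of line
    numbers hit during each run."""
    lines = set().union(*passing_runs, *failing_runs)
    scores = {}
    for line in lines:
        fail_hits = sum(line in run for run in failing_runs)
        pass_hits = sum(line in run for run in passing_runs)
        scores[line] = fail_hits / (fail_hits + pass_hits)
    return scores

# Line 4 is hit by both failing runs but by no passing run,
# making it the prime suspect.
passing = [{1, 2, 3}, {1, 2, 5}]
failing = [{1, 2, 4}, {1, 4, 5}]
scores = suspiciousness(passing, failing)
assert max(scores, key=scores.get) == 4
```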
Another idea would be to keep track of inputs that historically turned out to be problematic and to learn from them to generate inputs that have a higher chance of tripping over the function under test. The input generator could be seen as an adversary that tries to come up with examples that are likely to cause problems. This does not sound like an easy task but I think it is worth exploring. What if there was a global database that kept track of data that tends to cause certain properties to fail?
A last point where I see potential is keeping track of code paths that each property covers and then weighing the quantity of examples generated towards the code paths that were actually changed.
I think I have made it clear that I really like this form of testing – not as a replacement for other existing practices, but as a complement. I would be happy to see wider adoption and broader interest, as it might allow us to dive into some of the directions I have outlined above.