Automated Testing Keeps the World Turning
I recently came across a blog written by a former developer at ORACLE.
The author highlighted the trials and tribulations of maintaining and modifying the codebase of Oracle (version 12) database. This version, by the way, is not some legacy piece of software, but is the current major version that is running all over the world. According to the author, the codebase is close to 25 million lines of C code and described as “unimaginable horror”. It is riddled with thousands of flags, multitudes of macros and extremely complex routines that were added by generations of developers trying to meet various deadlines. According to the author, this is a common experience for an Oracle developer at the company (their words, not mine):
- Start working on a new bug.
- Spend two weeks trying to understand the 20 different flags that interact in mysterious ways to cause this bug.
- Add one more flag to handle the new special scenario. Add a few more lines of code that checks this flag and works around the problematic situation and avoids the bug.
- Submit the changes to a test farm consisting of about 100 to 200 servers that would compile the code, build a new Oracle DB, and run the millions of tests in a distributed fashion.
- Go home. Come the next day and work on something else. The tests can take 20 hours to 30 hours to complete.
- Go home. Come the next day and check your farm test results. On a good day, there would be about 100 failing tests. On a bad day, there would be about 1000 failing tests. Pick some of these tests randomly and try to understand what went wrong with your assumptions. Maybe there are some 10 more flags to consider to truly understand the nature of the bug.
- Add a few more flags in an attempt to fix the issue. Submit the changes again for testing. Wait another 20 to 30 hours.
- Rinse and repeat for another two weeks until you get the mysterious incantation of the combination of flags right.
- Finally one fine day you would succeed with 0 tests failing.
- Add a hundred more tests for your new change to ensure that the next developer who has the misfortune of touching this new piece of code never ends up breaking your fix.
- Submit the work for one final round of testing. Then submit it for review. The review itself may take another 2 weeks to 2 months. So now move on to the next bug to work on.
- After 2 weeks to 2 months, when everything is complete, the code would be finally merged into the main branch.
Millions of tests
As I read this post, I started getting a little nervous, being a software developer and currently interacting with an Oracle 12 database. Quickly it became clear that millions of tests were acting as a true line of defense against the release of disastrous software. It was interesting to read that the author was diligent in adding tests not just because it was his job, but out of concern for ensuring that future developers will not break the bug fix being added. There are probably many ways of streamlining the development lifecycle described by the author, and maybe it is a topic for another blog, but it seems that having hundreds or thousands of failed tests is infinitely better than the alternative.
The black box
Product owners and business stakeholders are not the only people in an organization that may view their codebase as a black box. Developers may as well. As the number of lines of code grows over time, no institutional knowledge can cover the various nuances and complexities that various features introduce. When new developers are added to a project, the value of having automated tests becomes more pronounced. Organizations should strive to remove the perception of the code being a “black box” as much as possible. Failure to do so will result in loss of confidence in quality and viability of the products and features being developed, not to mention placing unnecessary burden on test teams.
I have heard horror stories from colleagues and friends about maintaining “legacy” systems. One question that comes to mind is if the system is being modified, is it really a legacy system, or is it just an old system that is still updated? In my experience, if a bug fix or a new feature is being added to an older system, the cost of not automating the corresponding tests most often results in some other seemingly unrelated bug being introduced. Sometimes it can be very difficult to automate tests especially in old systems. There may not be any available testing frameworks that could be easily plugged in. In such cases I would advocate for writing your own testing framework. It may not be as sophisticated as some of the commercial or open source packages, but it will help immensely. Testing automation of older systems can also ensure a smooth retirement of obsolete features that keep the codebase bloated and more complex that it needs to be.
In my experience, having more automated tests is better than having less. Over time, tests may be become redundant as they are being added. This is not the worst problem to have and can generally be managed as technical debt. What is worse, is presence of flaky tests. These are tests that change from passing to failing from day to day with no clear explanation. These result in significant time sink and loss of confidence in the product being developed. Getting rid of such tests should be the top priority. Ideally, they should be rewritten and made more robust, but in some cases, they may be completely removed as they do not reliably prove that the software is working as it should.
Test automation is not new, but it still appears to be lacking in many systems. Organizations should embrace the cost of test automation as part of the cost of developing new features and modifying existing ones. Doing so will help build confidence in ensuring quality for product owners as well as developers.