
A Practical Guide to Systems Testing

doug small (Intuit)
Operations
Beekman

TurboTax Online first went online in 1998 for the '97 tax year. It was a huge success and a milestone for our business and customers. It was originally hosted on what was then cutting-edge Windows NT4 and Compaq x8 Pentium II computers. Since then, I am proud to say that we have made a lot of mistakes, many of which we have learned from and some of which we will need to learn from again. Day in and day out at work, almost all of our efforts are focused on delivering our application to production so that our customers can use it and get value from it. Oftentimes we do a great job of white-box and black-box testing of the APIs and the UI, but it's easy to stop there. How do we know our app is really ready for production? Have we tested the complete system, including how our distributed apps run on the real network, on a real platform, with non-emulated dependencies? In this talk, I will outline my story at Intuit and what works for us with TurboTax Online. Major elements of systems testing that are often overlooked are:

Happy Path

These are the most basic tests. They are analogous to a pilot checking all the flight control surfaces prior to takeoff. Do the flaps work? Do all status lights report back green? Do you see traffic patterns as expected? Simple stuff.

Failure Testing

Testing the happy path is easy, but failure testing is more fun and leads to better insight into the application. Now that everything checks out OK, what happens if an instance of your app (Tomcat/JBoss) fails? Do your alerts trigger? Does the app fail the node out as expected? Do your load balancers operate as you would expect? How long did it take for the system to react to the event and get back to normal runtime status? What was the user experience during this time? Was the failure graceful from an end-user perspective, or did users lose data? These types of tests should be executed at all layers (web, app, and DB), as well as against traditional IT components where possible (such as server, load balancer or network appliance, and LAN/WAN failures).
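One way to answer the "how long did it take to get back to normal" question is to poll a health probe right after injecting the failure. The sketch below is only an illustration: `measure_recovery` and `FlakyProbe` are hypothetical names, and the probe is a stand-in for whatever check fits your system (for example, a request through the load balancer).

```python
import time


def measure_recovery(is_healthy, poll_interval_s=1.0, max_wait_s=300.0):
    """Poll is_healthy() until it returns True.

    Returns the elapsed seconds until the first healthy result,
    or None if the system did not recover within max_wait_s.
    """
    start = time.monotonic()
    while True:
        if is_healthy():
            return time.monotonic() - start
        if time.monotonic() - start >= max_wait_s:
            return None
        time.sleep(poll_interval_s)


class FlakyProbe:
    """Stand-in probe (hypothetical): unhealthy for the first N polls.

    A real probe would exercise the system from the outside,
    the way a customer request would.
    """

    def __init__(self, failures):
        self.remaining = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True


if __name__ == "__main__":
    probe = FlakyProbe(failures=3)
    elapsed = measure_recovery(probe, poll_interval_s=0.01, max_wait_s=5.0)
    print("recovered after", probe.calls, "polls, seconds:", elapsed)
```

Recording this number for each failure scenario gives you something to compare release to release.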

Dependent Service Testing

Most apps don't contain all the components required to deliver the customer experience by themselves; they often rely on other partners within your company or even on third parties. For example, your app might rely on a centralized OAuth app, an order processing application for credit card processing, or a marketing application. What happens to your customer experience or performance if one of these fails? Do users get HTTP 500 errors, or does the app handle these conditions gracefully? It is often useful to test both the case where a partner service is unavailable and the case where it is slow or timing out.
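Both conditions, unavailable and slow, can be exercised against the same wrapper. This is a minimal sketch assuming the partner call can be passed in as a function; `call_with_fallback` and the three fake partner functions are made up for illustration and are not part of any real client library.

```python
import concurrent.futures
import time


def call_with_fallback(partner_call, timeout_s, fallback):
    """Run partner_call(); return fallback if it raises or takes too long."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(partner_call)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # partner is slow or timing out
    except Exception:
        return fallback  # partner is unavailable or returned an error
    finally:
        # Do not block the caller waiting for a slow partner thread.
        pool.shutdown(wait=False)


# Hypothetical partner behaviors used as test doubles.
def partner_ok():
    return "live offer"


def partner_down():
    raise ConnectionError("partner unavailable")


def partner_slow():
    time.sleep(0.5)
    return "late offer"


if __name__ == "__main__":
    for fake in (partner_ok, partner_down, partner_slow):
        print(fake.__name__, "->", call_with_fallback(fake, 0.1, "default offer"))
```

The point of the test is that the customer sees the fallback in both failure cases rather than an error page.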

Catastrophic Testing

For TurboTax Online, we have had multiple near misses (a fire in the building leading to flooding, the infamous Southern California wildfires, a Marine F-18 Hornet crash, farmers digging up an ISP connection, and other bad operational controls). I like to say that if you have a DR (disaster recovery) solution and plan but have not yet tested it, then you don't have a DR solution. Catastrophic testing is just that. What happens if you lose your data center? Do you know how to restore service or continue operations somewhere else? Do you know how to turn the other data center back on when it returns? Catastrophic testing can also be done at a smaller scale: what if you lost your DB? Have you tested recovery of DB nodes and data? The list of scenarios is often very long and the tests are difficult to run, so it's best to write the list down and test what is most important, as it's probably not possible to test everything.

Connectivity Testing

Now that you are ready for production, you should know what ports your application needs in order to connect to customers, partners, and other network services. When you turn your application on, are you sure that every web and app server will have all the network ports open that it needs? Maybe it has too many open? Especially when you have a large application in a shared network, it's easy for most of the ports to be available, but that's not good enough for customers. Connectivity testing is simply verifying that every port needed is open on every server. This should include ports needed to support your DR strategy and partner service failure plans.
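A check like this is easy to script and run from every server. The following is a minimal sketch using only the standard library; the function names are illustrative, and the list of required host and port pairs is something you would maintain for your own application.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_connectivity(required, timeout=2.0):
    """Return the (host, port) pairs from `required` that could not be reached."""
    return [(h, p) for h, p in required if not port_is_open(h, p, timeout)]


if __name__ == "__main__":
    # Hypothetical dependencies; replace with the real list, including
    # the ports needed for DR and partner failover.
    required = [("db.example.internal", 1521), ("auth.example.internal", 443)]
    for host, port in check_connectivity(required):
        print("UNREACHABLE", host, port)
```

Running it from each web and app server, rather than from one central box, is what catches the server that was missed by a firewall change.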

Environmental Concerns

Do you have a homogeneous environment in production, or do you have a mixed platform, such as some bare-metal boxes and some instances hosted in your own internal cloud?

Privacy, Compliance, Security Testing

It's best to leave the formal testing of this to the professionals, but that doesn't let us off the hook. There are some best-practice tests that we can implement easily. Examples include validating that we are not logging user names, passwords, or other customer Personally Identifiable Information (PII); checking cookies and URLs to make sure there isn't unnecessary information or PII in them; and validating that we don't have configuration files with credentials in the clear (like your oracle.properties file). If you fall under SOX or PCI rules, there are probably some basic checks you can do, like validating that you don't store or log credit card data or that you have proper auditing in place. Other checks might validate that your app runs as the correct user with appropriate permissions and not as root, and that one user can't hijack another user's session (think of a public computer at a library or a public network at a coffee shop).
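As one example of such a basic check, a log scan can flag lines that look like they contain credentials or card numbers. This is a rough sketch and not a substitute for a formal audit: the regular expressions, the `find_sensitive` name, and the sample log lines are all invented for illustration, and the Luhn checksum is used only to cut down on false positives from other long numbers.

```python
import re

# Illustrative patterns; tune them to your own log formats.
SECRET_RE = re.compile(r"(password|passwd|pwd)\s*[=:]\s*\S+", re.IGNORECASE)
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(lines):
    """Return (line_number, kind) for each suspicious log line."""
    findings = []
    for n, line in enumerate(lines, start=1):
        if SECRET_RE.search(line):
            findings.append((n, "credential"))
        for match in CARD_RE.finditer(line):
            digits = re.sub(r"\D", "", match.group())
            if luhn_ok(digits):
                findings.append((n, "card-number"))
    return findings


if __name__ == "__main__":
    sample = [
        "INFO login ok customer_id=c-1029",
        "DEBUG auth password=hunter2",
        "INFO charge card 4111 1111 1111 1111 approved",
    ]
    print(find_sensitive(sample))
```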

Configuration Testing

This one bites us the most. It can be caused by something as simple as a developer setting up an environment on their MacBook, which then carries a setting such as JVM memory that is wrong for a larger production machine. Please take the time to validate that in production you are not relying on performance emulators, non-prod monitoring systems, customer acceptance partner environments, or a pre-prod order fulfillment system.
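A simple guard is to scan the effective production configuration for values that look like they point at non-production systems. The sketch below assumes the configuration has already been loaded into a flat dictionary; the token list, key names, and URLs are hypothetical and would need to match your own naming conventions.

```python
import re

# Hypothetical markers of non-production systems.
NON_PROD_TOKENS = {
    "localhost", "dev", "qa", "test", "stage", "staging",
    "preprod", "sandbox", "emulator", "mock",
}


def find_non_prod_settings(config):
    """Return {key: [matched tokens]} for values that look non-production."""
    suspicious = {}
    for key, value in config.items():
        # Split on anything that is not a letter or digit so that
        # "orders-preprod.example.com" yields the token "preprod".
        tokens = set(re.split(r"[^a-z0-9]+", str(value).lower()))
        hits = sorted(tokens & NON_PROD_TOKENS)
        if hits:
            suspicious[key] = hits
    return suspicious


if __name__ == "__main__":
    prod_config = {
        "orders.url": "https://orders-preprod.example.com/api",
        "monitoring.host": "metrics.example.com",
        "perf.backend": "http://perf-emulator.example.com",
    }
    for key, hits in find_non_prod_settings(prod_config).items():
        print("SUSPICIOUS", key, hits)
```

Matching whole tokens rather than substrings avoids flagging a value like "device" just because it contains "dev".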

Installation and Operational Control Testing

While your app is in production, it will break. Something will go wrong. It will probably happen at the point of your highest load, or when the network, system, and other operations teams are doing maintenance or deployments, which is usually at 2 a.m. Therefore it is critical that your operational processes are automated and rock solid. It is worth the time to write and test single scripts that can shut down, start up, or restart your app. Further, if there are runtime attributes

Monitoring and Server Logging

This is a full topic in itself, but when I am asked what we should monitor in our app, I respond with these three things as the basics of application monitoring:

  1. Response times of critical app components, along with how many times they are executed, measured over as small a time span as possible (sub-1-minute intervals).
  2. Response times and execution counts for remote service calls. These might be calls to partner services as well as to 3rd parties.
  3. Customer metrics, collected and reported. Can customers get into your site (you do have load, right?), and are they having a good experience? Be able to answer the question "how many customers were impacted?" when an issue comes up.

On the topic of logging, at Intuit we have learned to log success messages as well as failures. A log analytics tool like Splunk makes the logs very easy to search and report on, which leads to much faster root cause analysis of issues as well as real-time alerting. Another best practice with logging is to include a non-PII customer ID in logs so that research on customer behavior and impact is possible.

A nice-to-have is to collect server- and network-specific metrics as well, but these normally already get a lot of attention.
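To make the first two basics concrete, here is a small sketch of rolling raw timing samples up into per-interval counts and response times. The sample format (timestamp in seconds, component name, duration in milliseconds) and the `summarize` function are assumptions for illustration; in practice a log analytics tool would usually do this aggregation for you.

```python
from collections import defaultdict


def summarize(samples, interval_s=60):
    """Group samples into fixed time buckets per component.

    samples: iterable of (timestamp_s, component, duration_ms).
    Returns {(bucket_start_s, component): {"count", "avg_ms", "max_ms"}}.
    """
    buckets = defaultdict(list)
    for ts, component, duration_ms in samples:
        bucket_start = int(ts // interval_s) * interval_s
        buckets[(bucket_start, component)].append(duration_ms)
    return {
        key: {
            "count": len(values),
            "avg_ms": sum(values) / len(values),
            "max_ms": max(values),
        }
        for key, values in buckets.items()
    }


if __name__ == "__main__":
    # Made-up samples: two login calls in the first minute, one in the
    # second, plus one remote call to a hypothetical oauth partner.
    samples = [
        (0, "login", 100),
        (30, "login", 300),
        (61, "login", 50),
        (10, "oauth", 20),
    ]
    for key, stats in sorted(summarize(samples).items()):
        print(key, stats)
```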

Production Readiness

Lastly, there are some other production-readiness disciplines that we find useful. Implementing these three things will really improve the quality of your app in production.

War Room Games

  • We create scenarios (or reuse ones that happened recently) to build a virtual scavenger hunt for teams from our operations and engineering staff. I will include some sample questions and scenarios to illustrate how we do this and make it fun. The outcome is that everyone knows how to log in to the operational tools, can run basic queries, and knows how to look up information.

Product Performance Testing

  • We perform load tests in our perf environments, but we also run them in production. This validates that everything (networks, storage, servers, and so on) can perform at peak load. This step is probably the most important piece of our production readiness activities.

Escalation Procedure Review

  • A periodic review of your escalation procedures is critical. Apps change so much from release to release that the procedures given to operations often drift out of date, resulting in slow, confusing, and sometimes wrong steps being executed when there are problems.

doug small

Intuit

TurboTax Online Staff Systems Quality Engineer at Intuit