Courtesy of Adrian Colyer, who runs the kaleidoscopically illuminating blog The Morning Paper.

A checklist for designing and developing internet scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

The nub of the matter is summarised immediately:

Basic tenets
  • Does the design expect failures to happen regularly and handle them gracefully?
  • Have we kept things as simple as possible?
  • Have we automated everything?

Then some choice selections (text verbatim, some points excluded):

Overall Application Design & Development
  • Can the service survive failure without human administrative interaction?
  • Are failure paths frequently tested? (see the sketch after this list)
  • Have we documented all conceivable component failure modes and combinations thereof?
  • Does our design tolerate these failure modes? And if not, have we undertaken a risk assessment to determine whether the risk is acceptable?
  • Have we avoided single points of failure?
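
One way to make failure-path testing routine is to put each dependency behind an interface and unit-test the fallback behaviour directly, so the failure path runs on every build rather than only when production breaks. A minimal Go sketch; the Store, Service, and fallback-to-cache behaviour are illustrative assumptions, not something prescribed by the paper:

    package profile

    import (
        "errors"
        "testing"
    )

    // Store is the downstream dependency the service calls.
    type Store interface {
        Get(id string) (string, error)
    }

    // failingStore simulates the failure path we want to exercise.
    type failingStore struct{}

    func (failingStore) Get(string) (string, error) {
        return "", errors.New("store unavailable")
    }

    // Service degrades gracefully: when the store fails it serves a cached value
    // instead of surfacing the error to the caller.
    type Service struct {
        store Store
        cache map[string]string
    }

    func (s *Service) Profile(id string) (string, error) {
        v, err := s.store.Get(id)
        if err != nil {
            if cached, ok := s.cache[id]; ok {
                return cached, nil
            }
            return "", err
        }
        return v, nil
    }

    // TestProfileSurvivesStoreFailure exercises the failure path on every run
    // (it would normally live in a separate _test.go file).
    func TestProfileSurvivesStoreFailure(t *testing.T) {
        s := &Service{store: failingStore{}, cache: map[string]string{"42": "cached-profile"}}
        got, err := s.Profile("42")
        if err != nil || got != "cached-profile" {
            t.Fatalf("expected graceful fallback, got %q, err=%v", got, err)
        }
    }
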
Automatic Management and Provisioning
  • Are all of our operations restartable? (a sketch follows this list)
  • Is all persistent state stored redundantly?
  • Have we automated provisioning and installation?
  • Are configuration and code delivered by development in a single unit?
  • Is the unit created by development used all the way through the lifecycle (test and prod. deployment)?
  • Is there an audit log mechanism to capture all changes made in production?
  • Have we eliminated any dependency on local storage for non-recoverable information?
  • Is our deployment model as simple as it can possibly be? (Hard to beat file copy!)
  • Are we using a chaos monkey in production?
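
On restartable operations: a provisioning step written to converge on a desired state, rather than to perform an action, can be re-run safely after a crash or run twice by mistake. A small Go sketch; EnsureDataDir is an invented example, not something from the paper:

    package provision

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // EnsureDataDir is a restartable provisioning step: it inspects the current
    // state and only acts if the desired state is missing, so re-running it is
    // always harmless.
    func EnsureDataDir(root string) error {
        dir := filepath.Join(root, "data")
        info, err := os.Stat(dir)
        switch {
        case err == nil && info.IsDir():
            return nil // already provisioned; nothing to do
        case err == nil:
            return fmt.Errorf("%s exists but is not a directory", dir)
        case os.IsNotExist(err):
            return os.MkdirAll(dir, 0o755) // converge to the desired state
        default:
            return err
        }
    }
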
Dependency Management

(How to handle dependencies on other services / components).

  • Can we tolerate highly variable latency in service calls? Do we have timeout mechanisms in place, and can we retry interactions after a timeout (idempotency)? A sketch follows this list.
  • Are all retries reported, and have we bounded the number of retries?
  • Do we have circuit breakers in place to prevent cascading failures? Do they 'fail fast'?
  • Have we implemented inter-service monitoring and alerting?
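
To make the timeout, bounded-retry, and fail-fast questions concrete, here is a hedged Go sketch of a retry wrapper around an idempotent call: each attempt gets its own timeout, every retry is reported, the retry count is bounded, and a breakerOpen check (standing in for whatever circuit-breaker implementation you actually use) short-circuits calls to a dependency known to be down:

    package deps

    import (
        "context"
        "errors"
        "log"
        "time"
    )

    const maxAttempts = 3 // bound the number of retries

    // ErrCircuitOpen is returned when we fail fast instead of calling the dependency.
    var ErrCircuitOpen = errors.New("circuit open: failing fast")

    // callWithRetry wraps an idempotent service call with a per-attempt timeout
    // and a bounded, reported retry loop.
    func callWithRetry(ctx context.Context, breakerOpen func() bool, call func(context.Context) error) error {
        var err error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if breakerOpen() {
                return ErrCircuitOpen // don't pile retries onto a struggling dependency
            }
            attemptCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
            err = call(attemptCtx)
            cancel()
            if err == nil {
                return nil
            }
            // Every retry is reported, so hidden retry storms can't mask a failing dependency.
            log.Printf("dependency call failed (attempt %d/%d): %v", attempt, maxAttempts, err)
            time.Sleep(time.Duration(attempt) * 100 * time.Millisecond) // simple linear backoff
        }
        return err
    }
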
Release Cycle and Testing
  • Are we shipping often enough?
  • Have we defined specific criteria around the intended user experience? Are we continuously monitoring it?
  • Are we collecting the actual numbers rather than just summary reports? Raw data will always be needed for diagnosis (see the sketch after this list).
  • Have we minimized false positives in the alerting system?
  • Do we have a process in place to catch performance and capacity degradations in new releases?
  • Are we running tests using real data?
  • Do we have (and run) system-level acceptance tests?
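
On collecting the actual numbers: one approach is to emit a raw, structured record per request and derive any summaries from those records later, since summaries can always be recomputed from raw data but not the reverse. An illustrative Go sketch (the record fields are my assumptions, not a prescribed schema):

    package metrics

    import (
        "encoding/json"
        "log"
        "time"
    )

    // RequestRecord is one raw data point per request; p50/p99 latency, error
    // rates, and any other summary can be derived from these records later.
    type RequestRecord struct {
        Timestamp time.Time `json:"ts"`
        Route     string    `json:"route"`
        Status    int       `json:"status"`
        LatencyMS float64   `json:"latency_ms"`
    }

    // Record emits the raw measurement as a structured log line so it can be
    // shipped to long-term storage and sliced arbitrarily during diagnosis.
    func Record(route string, status int, latency time.Duration) {
        rec := RequestRecord{
            Timestamp: time.Now().UTC(),
            Route:     route,
            Status:    status,
            LatencyMS: float64(latency) / float64(time.Millisecond),
        }
        b, err := json.Marshal(rec)
        if err != nil {
            log.Printf("metrics: marshal failed: %v", err)
            return
        }
        log.Println(string(b))
    }
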
Hardware Selection and Standardization

(I deviate from the Hamilton paper here, on the assumption that you'll use at least an IaaS layer).

  • Do we depend only on standard IaaS compute, storage, and network facilities?
  • Have we avoided dependencies on specific hardware features?
  • Have we abstracted the network and naming (for service discovery)?
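
On abstracting the network and naming: application code can depend on a small resolver interface keyed by logical service name, so nothing hard-codes hosts or addresses. A Go sketch assuming DNS SRV records are available; a registry-backed implementation could be swapped in without touching callers:

    package discovery

    import (
        "context"
        "fmt"
        "net"
    )

    // Resolver abstracts naming: callers ask for a logical service name and never
    // hard-code hosts, ports, or anything tied to particular machines.
    type Resolver interface {
        Resolve(ctx context.Context, service string) (string, error)
    }

    // DNSResolver is one possible implementation, using SRV records.
    type DNSResolver struct {
        Domain string
    }

    func (r DNSResolver) Resolve(ctx context.Context, service string) (string, error) {
        _, addrs, err := net.DefaultResolver.LookupSRV(ctx, service, "tcp", r.Domain)
        if err != nil {
            return "", err
        }
        if len(addrs) == 0 {
            return "", fmt.Errorf("no addresses for service %q", service)
        }
        return fmt.Sprintf("%s:%d", addrs[0].Target, addrs[0].Port), nil
    }
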
Operations and Capacity Planning
  • Is there a devops team that takes shared responsibility for both developing and operating the service?
  • Do we have a discipline of only making one change at a time?
  • Can everything that might need to be configured or tuned in production be changed without a code change?
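
On tuning without a code change: keep operational tunables out of the code and read them from the environment (or a config file) with safe defaults. A Go sketch; the variable names and defaults are illustrative assumptions:

    package config

    import (
        "os"
        "strconv"
        "time"
    )

    // Tunables groups the values operations may need to adjust in production.
    // Changing them never requires a code change or a new build.
    type Tunables struct {
        MaxInflight    int
        RequestTimeout time.Duration
    }

    func Load() Tunables {
        return Tunables{
            MaxInflight:    envInt("SVC_MAX_INFLIGHT", 100),
            RequestTimeout: envDuration("SVC_REQUEST_TIMEOUT", 2*time.Second),
        }
    }

    func envInt(key string, def int) int {
        if v := os.Getenv(key); v != "" {
            if n, err := strconv.Atoi(v); err == nil {
                return n
            }
        }
        return def
    }

    func envDuration(key string, def time.Duration) time.Duration {
        if v := os.Getenv(key); v != "" {
            if d, err := time.ParseDuration(v); err == nil {
                return d
            }
        }
        return def
    }
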
Auditing, Monitoring, and Alerting
  • Are we tracking the alerts:trouble-ticket ratio (goal is near 1:1)?
  • Are we tracking the number of system health issues that don't have corresponding alerts? (goal is near zero)
  • Have we instrumented every customer interaction that flows through the system? Are we reporting anomalies?
  • Do we have automated testing that takes a customer view of the service?
  • Do we have individual accounts for everyone who interacts with the system?
  • Are we tracking all fault-tolerant mechanisms to expose failures they may be hiding?
  • Do we have sufficient assertions in the code base?
  • Are we keeping historical performance and log data?
  • Are we exposing suitable health information for monitoring? (sketch after this list)
  • Do our problem reports contain enough information to diagnose the problem?
  • Can we snapshot system state for debugging outside of production?
  • Are we recording all significant system actions? Both commands sent by users, and what the system internally does.
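
On exposing suitable health information: a common pattern is a health endpoint that runs a set of dependency checks and reports the results in machine-readable form, so external monitoring can poll the service directly. A hedged Go sketch (the Check type and handler shape are my own, not from the paper):

    package health

    import (
        "encoding/json"
        "net/http"
    )

    // Check reports the health of one dependency or subsystem.
    type Check func() error

    // Handler runs every registered check and reports the results, returning 503
    // if anything is unhealthy so monitors can alert on the status code alone.
    func Handler(checks map[string]Check) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            status := http.StatusOK
            results := make(map[string]string, len(checks))
            for name, check := range checks {
                if err := check(); err != nil {
                    results[name] = err.Error()
                    status = http.StatusServiceUnavailable
                } else {
                    results[name] = "ok"
                }
            }
            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(status)
            json.NewEncoder(w).Encode(results)
        })
    }

Mounting it at a path such as /healthz (a convention, not a standard) also gives the customer-view tests above a stable place to start from.
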
Graceful Degradation and Admission Control
  • Can we meter admission to slowly bring a system back up after a failure?
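
On metering admission: one simple approach is an admission gate whose concurrency limit starts small after a recovery and ramps up over time, so a cold system is not flattened by the full returning load. A hedged Go sketch; the linear ramp policy and the names here are illustrative:

    package admission

    import (
        "sync"
        "time"
    )

    // Gate meters admission after a recovery: it starts by admitting only a small
    // number of concurrent requests and raises that limit as time passes.
    type Gate struct {
        mu        sync.Mutex
        inflight  int
        start     time.Time
        baseLimit int           // limit immediately after recovery
        maxLimit  int           // steady-state limit
        rampEvery time.Duration // how often the limit grows by another baseLimit
    }

    func NewGate(baseLimit, maxLimit int, rampEvery time.Duration) *Gate {
        return &Gate{start: time.Now(), baseLimit: baseLimit, maxLimit: maxLimit, rampEvery: rampEvery}
    }

    // limit grows by baseLimit for every rampEvery interval since recovery began.
    func (g *Gate) limit() int {
        steps := int(time.Since(g.start)/g.rampEvery) + 1
        if l := g.baseLimit * steps; l < g.maxLimit {
            return l
        }
        return g.maxLimit
    }

    // Admit returns true if the request may proceed; callers must call Done when finished.
    func (g *Gate) Admit() bool {
        g.mu.Lock()
        defer g.mu.Unlock()
        if g.inflight >= g.limit() {
            return false // shed load until the ramp allows more
        }
        g.inflight++
        return true
    }

    func (g *Gate) Done() {
        g.mu.Lock()
        g.inflight--
        g.mu.Unlock()
    }
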
Customer and Press Communication Plan
  • Do we have a communications plan in place for issues such as wide-scale system unavailability, data loss or corruption, security breaches, privacy violations, etc.?