Courtesy of Adrian Colyer, who runs the kaleidoscopically illuminating blog The Morning Paper.

A checklist for designing and developing internet scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

The nub of the matter is summarised immediately:

Basic tenets
  • Does the design expect failures to happen regularly and handle them gracefully?
  • Have we kept things as simple as possible?
  • Have we automated everything?

Then some choice selections (text verbatim, some points excluded):

Overall Application Design & Development
  • Can the service survive failure without human administrative interaction?
  • Are failure paths frequently tested? (see the sketch after this list)
  • Have we documented all conceivable component failure modes and combinations thereof?
  • Does our design tolerate these failure modes? And if not, have we undertaken a risk assessment to determine whether the risk is acceptable?
  • Have we avoided single points of failure?
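
One way to make failure-path testing routine is to put each dependency behind an interface and unit-test the fallback behaviour directly, so the failure path runs on every build rather than only when production breaks. A minimal Go sketch; the Store, Service, and fallback-to-cache behaviour are illustrative assumptions, not something prescribed by the paper:

    package profile

    import (
        "errors"
        "testing"
    )

    // Store is the downstream dependency the service calls.
    type Store interface {
        Get(id string) (string, error)
    }

    // failingStore simulates the failure path we want to exercise.
    type failingStore struct{}

    func (failingStore) Get(string) (string, error) {
        return "", errors.New("store unavailable")
    }

    // Service degrades gracefully: when the store fails it serves a cached value
    // instead of surfacing the error to the caller.
    type Service struct {
        store Store
        cache map[string]string
    }

    func (s *Service) Profile(id string) (string, error) {
        v, err := s.store.Get(id)
        if err != nil {
            if cached, ok := s.cache[id]; ok {
                return cached, nil
            }
            return "", err
        }
        return v, nil
    }

    // TestProfileSurvivesStoreFailure exercises the failure path on every run
    // (it would normally live in a separate _test.go file).
    func TestProfileSurvivesStoreFailure(t *testing.T) {
        s := &Service{store: failingStore{}, cache: map[string]string{"42": "cached-profile"}}
        got, err := s.Profile("42")
        if err != nil || got != "cached-profile" {
            t.Fatalf("expected graceful fallback, got %q, err=%v", got, err)
        }
    }
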
Automatic Management and Provisioning
  • Are all of our operations restartable? (a sketch follows this list)
  • Is all persistent state stored redundantly?
  • Have we automated provisioning and installation?
  • Are configuration and code delivered by development in a single unit?
  • Is the unit created by development used all the way through the lifecycle (test and prod. deployment)?
  • Is there an audit log mechanism to capture all changes made in production?
  • Have we eliminated any dependency on local storage for non-recoverable information?
  • Is our deployment model as simple as it can possibly be? (Hard to beat file copy!)
  • Are we using a chaos monkey in production?
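
On restartable operations: a provisioning step written to converge on a desired state, rather than to perform an action, can be re-run safely after a crash or run twice by mistake. A small Go sketch; EnsureDataDir is an invented example, not something from the paper:

    package provision

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // EnsureDataDir is a restartable provisioning step: it inspects the current
    // state and only acts if the desired state is missing, so re-running it is
    // always harmless.
    func EnsureDataDir(root string) error {
        dir := filepath.Join(root, "data")
        info, err := os.Stat(dir)
        switch {
        case err == nil && info.IsDir():
            return nil // already provisioned; nothing to do
        case err == nil:
            return fmt.Errorf("%s exists but is not a directory", dir)
        case os.IsNotExist(err):
            return os.MkdirAll(dir, 0o755) // converge to the desired state
        default:
            return err
        }
    }
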
Dependency Management

(How to handle dependencies on other services / components).

  • Can we tolerate highly variable latency in service calls? Do we have timeout mechanisms in place, and can we retry interactions after a timeout (idempotency)? A sketch follows this list.
  • Are all retries reported, and have we bounded the number of retries?
  • Do we have circuit breakers in place to prevent cascading failures? Do they 'fail fast'?
  • Have we implemented inter-service monitoring and alerting?
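
To make the timeout, bounded-retry, and fail-fast questions concrete, here is a hedged Go sketch of a retry wrapper around an idempotent call: each attempt gets its own timeout, every retry is reported, the retry count is bounded, and a breakerOpen check (standing in for whatever circuit-breaker implementation you actually use) short-circuits calls to a dependency known to be down:

    package deps

    import (
        "context"
        "errors"
        "log"
        "time"
    )

    const maxAttempts = 3 // bound the number of retries

    // ErrCircuitOpen is returned when we fail fast instead of calling the dependency.
    var ErrCircuitOpen = errors.New("circuit open: failing fast")

    // callWithRetry wraps an idempotent service call with a per-attempt timeout
    // and a bounded, reported retry loop.
    func callWithRetry(ctx context.Context, breakerOpen func() bool, call func(context.Context) error) error {
        var err error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if breakerOpen() {
                return ErrCircuitOpen // don't pile retries onto a struggling dependency
            }
            attemptCtx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
            err = call(attemptCtx)
            cancel()
            if err == nil {
                return nil
            }
            // Every retry is reported, so hidden retry storms can't mask a failing dependency.
            log.Printf("dependency call failed (attempt %d/%d): %v", attempt, maxAttempts, err)
            time.Sleep(time.Duration(attempt) * 100 * time.Millisecond) // simple linear backoff
        }
        return err
    }
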
Release Cycle and Testing
  • Are we shipping often enough?
  • Have we defined specific criteria around the intended user experience? Are we continuously monitoring it?
  • Are we collecting the actual numbers rather than just summary reports? Raw data will always be needed for diagnosis (see the sketch after this list).
  • Have we minimized false positives in the alerting system?
  • Do we have a process in place to catch performance and capacity degradations in new releases?
  • Are we running tests using real data?
  • Do we have (and run) system-level acceptance tests?
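
On collecting the actual numbers: one approach is to emit a raw, structured record per request and derive any summaries from those records later, since summaries can always be recomputed from raw data but not the reverse. An illustrative Go sketch (the record fields are my assumptions, not a prescribed schema):

    package metrics

    import (
        "encoding/json"
        "log"
        "time"
    )

    // RequestRecord is one raw data point per request; p50/p99 latency, error
    // rates, and any other summary can be derived from these records later.
    type RequestRecord struct {
        Timestamp time.Time `json:"ts"`
        Route     string    `json:"route"`
        Status    int       `json:"status"`
        LatencyMS float64   `json:"latency_ms"`
    }

    // Record emits the raw measurement as a structured log line so it can be
    // shipped to long-term storage and sliced arbitrarily during diagnosis.
    func Record(route string, status int, latency time.Duration) {
        rec := RequestRecord{
            Timestamp: time.Now().UTC(),
            Route:     route,
            Status:    status,
            LatencyMS: float64(latency) / float64(time.Millisecond),
        }
        b, err := json.Marshal(rec)
        if err != nil {
            log.Printf("metrics: marshal failed: %v", err)
            return
        }
        log.Println(string(b))
    }
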
Hardware Selection and Standardization

(I deviate from the Hamilton paper here, on the assumption that you'll use at least an IaaS layer).

  • Do we depend only on standard IaaS compute, storage, and network facilities?
  • Have we avoided dependencies on specific hardware features?
  • Have we abstracted the network and naming (for service discovery)?
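
On abstracting the network and naming: application code can depend on a small resolver interface keyed by logical service name, so nothing hard-codes hosts or addresses. A Go sketch assuming DNS SRV records are available; a registry-backed implementation could be swapped in without touching callers:

    package discovery

    import (
        "context"
        "fmt"
        "net"
    )

    // Resolver abstracts naming: callers ask for a logical service name and never
    // hard-code hosts, ports, or anything tied to particular machines.
    type Resolver interface {
        Resolve(ctx context.Context, service string) (string, error)
    }

    // DNSResolver is one possible implementation, using SRV records.
    type DNSResolver struct {
        Domain string
    }

    func (r DNSResolver) Resolve(ctx context.Context, service string) (string, error) {
        _, addrs, err := net.DefaultResolver.LookupSRV(ctx, service, "tcp", r.Domain)
        if err != nil {
            return "", err
        }
        if len(addrs) == 0 {
            return "", fmt.Errorf("no addresses for service %q", service)
        }
        return fmt.Sprintf("%s:%d", addrs[0].Target, addrs[0].Port), nil
    }
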
Operations and Capacity Planning
  • Is there a devops team that takes shared responsibility for both developing and operating the service?
  • Do we have a discipline of only making one change at a time?
  • Can everything that might need to be configured or tuned in production be changed without a code change?
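
On tuning without a code change: keep operational tunables out of the code and read them from the environment (or a config file) with safe defaults. A Go sketch; the variable names and defaults are illustrative assumptions:

    package config

    import (
        "os"
        "strconv"
        "time"
    )

    // Tunables groups the values operations may need to adjust in production.
    // Changing them never requires a code change or a new build.
    type Tunables struct {
        MaxInflight    int
        RequestTimeout time.Duration
    }

    func Load() Tunables {
        return Tunables{
            MaxInflight:    envInt("SVC_MAX_INFLIGHT", 100),
            RequestTimeout: envDuration("SVC_REQUEST_TIMEOUT", 2*time.Second),
        }
    }

    func envInt(key string, def int) int {
        if v := os.Getenv(key); v != "" {
            if n, err := strconv.Atoi(v); err == nil {
                return n
            }
        }
        return def
    }

    func envDuration(key string, def time.Duration) time.Duration {
        if v := os.Getenv(key); v != "" {
            if d, err := time.ParseDuration(v); err == nil {
                return d
            }
        }
        return def
    }
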
Auditing, Monitoring, and Alerting
  • Are we tracking the alerts:trouble-ticket ratio (goal is near 1:1)?
  • Are we tracking the number of system health issues that don't have corresponding alerts? (goal is near zero)
  • Have we instrumented every customer interaction that flows through the system? Are we reporting anomalies?
  • Do we have automated testing that takes a customer view of the service?
  • Do we have individual accounts for everyone who interacts with the system?
  • Are we tracking all fault-tolerant mechanisms to expose failures they may be hiding?
  • Do we have sufficient assertions in the code base?
  • Are we keeping historical performance and log data?
  • Are we exposing suitable health information for monitoring? (sketch after this list)
  • Do our problem reports contain enough information to diagnose the problem?
  • Can we snapshot system state for debugging outside of production?
  • Are we recording all significant system actions? Both commands sent by users, and what the system internally does.
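
On exposing suitable health information: a common pattern is a health endpoint that runs a set of dependency checks and reports the results in machine-readable form, so external monitoring can poll the service directly. A hedged Go sketch (the Check type and handler shape are my own, not from the paper):

    package health

    import (
        "encoding/json"
        "net/http"
    )

    // Check reports the health of one dependency or subsystem.
    type Check func() error

    // Handler runs every registered check and reports the results, returning 503
    // if anything is unhealthy so monitors can alert on the status code alone.
    func Handler(checks map[string]Check) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            status := http.StatusOK
            results := make(map[string]string, len(checks))
            for name, check := range checks {
                if err := check(); err != nil {
                    results[name] = err.Error()
                    status = http.StatusServiceUnavailable
                } else {
                    results[name] = "ok"
                }
            }
            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(status)
            json.NewEncoder(w).Encode(results)
        })
    }

Mounting it at a path such as /healthz (a convention, not a standard) also gives the customer-view tests above a stable place to start from.
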
Graceful Degradation and Admission Control
  • Can we meter admission to slowly bring a system back up after a failure?
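
On metering admission: one simple approach is an admission gate whose concurrency limit starts small after a recovery and ramps up over time, so a cold system is not flattened by the full returning load. A hedged Go sketch; the linear ramp policy and the names here are illustrative:

    package admission

    import (
        "sync"
        "time"
    )

    // Gate meters admission after a recovery: it starts by admitting only a small
    // number of concurrent requests and raises that limit as time passes.
    type Gate struct {
        mu        sync.Mutex
        inflight  int
        start     time.Time
        baseLimit int           // limit immediately after recovery
        maxLimit  int           // steady-state limit
        rampEvery time.Duration // how often the limit grows by another baseLimit
    }

    func NewGate(baseLimit, maxLimit int, rampEvery time.Duration) *Gate {
        return &Gate{start: time.Now(), baseLimit: baseLimit, maxLimit: maxLimit, rampEvery: rampEvery}
    }

    // limit grows by baseLimit for every rampEvery interval since recovery began.
    func (g *Gate) limit() int {
        steps := int(time.Since(g.start)/g.rampEvery) + 1
        if l := g.baseLimit * steps; l < g.maxLimit {
            return l
        }
        return g.maxLimit
    }

    // Admit returns true if the request may proceed; callers must call Done when finished.
    func (g *Gate) Admit() bool {
        g.mu.Lock()
        defer g.mu.Unlock()
        if g.inflight >= g.limit() {
            return false // shed load until the ramp allows more
        }
        g.inflight++
        return true
    }

    func (g *Gate) Done() {
        g.mu.Lock()
        g.inflight--
        g.mu.Unlock()
    }
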
Customer and Press Communication Plan
  • Do we have a communications plan in place for issues such as wide-scale system unavailability, data loss or corruption, security breaches, privacy violations, etc.?