Unforeseen Dangers of the Leap Second
A sufficiently-advanced warning of the dangers of time-based programming appears in this month's Availability Digest:
There is a reason that there will be no space launches on June 30 or July 1, 2015. Scientists do not want to risk a computer malfunction due to a leap second being added at midnight UTC.
...
The leap second is such an erratic and infrequent occurrence that it is likely that many systems have not been built to account for it. Consequently, everyone should monitor their systems carefully as the next leap second approaches, just in case.
Last time around (June 30, 2012, 23:59:60) a number of major sites and installations went down. A Linux timer kernel patch was required:
While Reddit was struggling with its Cassandra servers, Gawker had issues with its Tomcat servers, and Mozilla had trouble with Hadoop. Both Hadoop and Tomcat also depend on Linux and Java, and it would seem they were hit by the same glitch.
To avoid the issues that come with NTP servers issuing a "60th second" of the day (i.e. 23:59:59, 23:59:60, 00:00:00), Google's approach is to implement a "leap smear" over the course of a day:
We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens. This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day. All of our servers were then able to continue as normal with the new year, blissfully unaware that a leap second had just occurred.
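To make the idea concrete, here is a minimal sketch of a smear, assuming a simple linear ramp over a 24-hour window that ends at the leap second. The window length, the linear shape and the names smear_offset and smeared_time are illustrative assumptions, not Google's actual implementation, whose smear curve may well differ.

```python
from datetime import datetime, timedelta, timezone

# Assumed parameters for illustration: the 2015 leap second and a 24-hour window.
LEAP_INSTANT = datetime(2015, 7, 1, 0, 0, 0, tzinfo=timezone.utc)
SMEAR_WINDOW = timedelta(hours=24)


def smear_offset(now, leap_instant=LEAP_INSTANT, window=SMEAR_WINDOW):
    """Return how much of the leap second (0.0 to 1.0 s) has been absorbed at `now`.

    `now` is an unsmeared clock reading, i.e. one that knows nothing about
    the leap second.
    """
    start = leap_instant - window
    if now <= start:
        return 0.0
    if now >= leap_instant:
        return 1.0
    # Linear ramp: timedelta / timedelta gives a float fraction of the window.
    return (now - start) / window


def smeared_time(now, leap_instant=LEAP_INSTANT, window=SMEAR_WINDOW):
    """Return the time a smearing server would report for the reading `now`.

    The clock is slowed slightly, so it ends the window one full second
    behind and never has to show 23:59:60.
    """
    return now - timedelta(seconds=smear_offset(now, leap_instant, window))


if __name__ == "__main__":
    noon = datetime(2015, 6, 30, 12, 0, 0, tzinfo=timezone.utc)
    print(smear_offset(noon))   # halfway through the window
    print(smeared_time(noon))
```

With a 24-hour window the clock only ever runs about 12 microseconds per second slow, which is far smaller than the drift most NTP clients already tolerate.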
As technically competent as Google's solution is, it's unlikely to get rolled out in time for Tuesday June 30th, 2015 (if at all). A mishandled leap second could affect anything from distributed transaction locking and kernel thread management to UIs in calendar and email applications. Last leap second, the obvious metric for affected servers was CPU load as syscalls went into loops.
According to some reports, resetting the date was enough to clear the problem, but "power cycling" a server is the oldest trick in the book (perhaps behind percussive maintenance) for a reason: if the initial error condition hasn't been handled, there's no guarantee of full recovery. Just another reason systems should be designed to fail gracefully.
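At the application level, failing gracefully can be as simple as accepting a 23:59:60 timestamp instead of crashing on it. The sketch below assumes timestamps arrive as "YYYY-MM-DD HH:MM:SS" strings in UTC. The function name parse_utc and the choice to clamp to :59 while flagging the leap second are illustrative assumptions, and other policies (rolling forward to 00:00:00, for instance) are equally defensible.

```python
import re
from datetime import datetime, timezone

_TIMESTAMP = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})")


def parse_utc(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return (datetime, is_leap_second).

    Python's datetime only allows seconds 0..59, so a leap second is clamped
    to :59 and reported through the flag rather than raising an error.
    """
    match = _TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {text!r}")
    year, month, day, hour, minute, second = map(int, match.groups())
    is_leap_second = second == 60
    if is_leap_second:
        second = 59  # clamp: datetime cannot represent :60
    parsed = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return parsed, is_leap_second


if __name__ == "__main__":
    print(parse_utc("2012-06-30 23:59:60"))
```

Returning the flag alongside the clamped value lets callers that care (billing, ordering, audit logs) treat the leap second specially, while everything else carries on.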