Most of this post is off topic; the conclusion is not.
On the afternoon of Martin Luther King Day, 1990, AT&T's
backbone network slowed to a crawl. The cause: a patch intended
to save time when a switch that had taken itself off line (a
rare, but routine and almost imperceptible event) rejoined the
network. The patch was flawed; a lock should have been taken
one instruction sooner.
Bell Labs had tested the daylights out of the patch by
subjecting a real switch in the lab to torturously heavy, but
necessarily artificial loads. It may also have been tested on
a switch in the wild before the patch was deployed throughout
the network, but that would not have helped.
The trouble was that a certain sequence of events happening
within milliseconds on calls both ways between two heavily
loaded switches could evoke a ping-pong of the switches leaving
and rejoining the network.
The phenomenon was contagious because of the enhanced odds of a
third switch experiencing the bad sequence with a switch that
was repeatedly taking itself off line. The basic problem (and
a fortiori the contagion) had not been seen in the lab because
the lab had only one of the multimillion-dollar switches to
play with.
The meltdown was embarrassing, to say the least. Yet nobody
ever accused AT&T of idiocy for not first testing on a private
network this feature that was inadvertently "designed to
compromise" switches.
Doug