Thanks to Matt Jolly of Vology for his assistance in the preparation of this analysis.
Click on the image for the complete report. Here are some highlights from Go Daddy's official report: “On Sept. 10, 2012, many Go Daddy customers experienced intermittent outages that lasted for several hours. There was immediate speculation about whether we were hacked. It was being reported as “fact” before our engineers had identified the root cause. The service disruption was not the work of an external source, but rather an internal network event triggered by a number of factors. There was not a single issue that caused the service disruption. Rather, it was the combination of multiple factors. The combined factors that contributed to the service disruption were:
- Router memory exhaustion
- Router hardware failure modes
Our network is equipped with extensive filters and partitioned to proactively prevent service disruptions from spreading beyond a single location. In this case, our BGP route reflector filters viewed the routes as legitimate and advertised them to the network. As our routers fell into software switching mode, they were unable to forward incoming and outgoing DNS* traffic fast enough. We brought our systems back online by throttling the DNS queries with traffic rate-limiters on all of our Internet connection points around the world. As the limiters took effect, we started to bring up each DNS data center and continually increased traffic with each new DNS pod coming online. The analogy here is similar to electricity** failing in a city and the need to manage the demand (everybody wants their electricity back at the same time) when a single power station is restored. If the entire town attempts to regain power at the same time against only one part of the grid infrastructure, often times this overwhelms that component and the system fails to regain full strength.”
Our analysis: the overwhelming volume of DNS queries caused the routers to run out of memory and fall back to software switching (read: high CPU utilization). Go Daddy had to severely limit DNS queries until the hardware returned to a stable condition. A nasty problem to chase! It looks like a legitimate network error, and it is impressive how quickly they recovered, all things considered. More than likely, some legacy routers were still in place, and their problems caused the initial bottlenecks that then escalated throughout the global DNS infrastructure.
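The recovery step in the report (throttle DNS queries, then raise the limit as each DNS pod comes back) can be illustrated with a token-bucket rate limiter. This is only a minimal sketch: the report does not say which rate-limiting mechanism Go Daddy used, and the class name `TokenBucket`, the helper `simulate`, and all the numbers are hypothetical.

```python
class TokenBucket:
    """Forward at most `rate` queries per second, with bursts up to `capacity`.

    Hypothetical illustration only; not Go Daddy's actual limiter.
    """

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.tokens = capacity    # start with a full bucket
        self.last = now           # timestamp of the previous refill

    def set_limits(self, rate, capacity):
        # Raise (or lower) the limit, e.g. when another DNS pod comes online.
        self.rate = rate
        self.capacity = capacity

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1      # spend one token to forward this query
            return True
        return False              # no token left: drop or delay the query


def simulate(limiter, arrivals):
    """Count how many queries arriving at the given timestamps are forwarded."""
    return sum(1 for t in arrivals if limiter.allow(t))
```

For example, with `TokenBucket(rate=5, capacity=5)` a burst of 20 queries arriving at the same instant lets only 5 through. Calling `set_limits(rate=10, capacity=10)` afterwards models opening the tap a little further once more capacity is available.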
This ends the analysis. Two tutorials follow: one on Route Reflectors and one on DNS-Domain Name System.
See the animated tutorial for an in-depth walkthrough.
1 – Problem: A full-mesh (fully interconnected) network of all PE-Provider Edge routers in all AS-Autonomous Systems is uneconomical and practically impossible to manage.
One Solution: If MPLS-Multi-Protocol Label Switching is used, LDP-Label Distribution Protocol requires a full mesh (shown here) of LDP sessions, and that mesh must be extended whenever a new PE-Provider Edge router is added. That is, the SP-Service Provider must perform Auto Discovery manually: look up all the other PEs that are part of that AS-Autonomous System or VPLS-Virtual Private LAN Service, then build a new session between the new PE and every other PE in that domain (AS or VPLS).
2 – Quick Review: What is CR-LDP?
CR-LDP-Constraint-based Routing Label Distribution Protocol is a signaling protocol used by MPLS-Multi-Protocol Label Switching routers to determine priorities and assign labels. That is, if you give IP packets special priority labels, then all the routers in the network, and in any other carrier networks the packets cross, must understand these labels and assign similar or higher priorities in order to maintain QoS-Quality of Service. Moreover, CR-LDP messages must give MPLS routers detailed information about what the labels mean.
3 – Solution: In contrast to the LDP approach, BGP-Border Gateway Protocol can be used for both signaling and Auto Discovery (instead of LDP for signaling plus manual Auto Discovery). When a new PE-Provider Edge router is added, only one BGP session, between the new PE and its nearest RR-Route Reflector, needs to be established. When a new VPLS-Virtual Private LAN Service is created, the PE advertises this service to the RR, which in turn advertises it to all the other RRs and their respective PEs. If a VPLS needs to cross AS-Autonomous Systems, each AS can assign a Route Target to that VPLS or up to 4,096 VLAN-Virtual LANs within each AS.
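To make the scaling difference concrete, the session counts for the two approaches can be compared with a little arithmetic. This is a sketch under a simplifying assumption that is ours, not the tutorial's: each PE keeps exactly one session to its nearest RR, and the RRs are fully meshed among themselves. The function names are hypothetical.

```python
def full_mesh_sessions(n_routers):
    """Sessions needed when every router peers with every other router."""
    return n_routers * (n_routers - 1) // 2


def route_reflector_sessions(n_clients, n_reflectors=1):
    """Sessions needed when each client peers with one RR only.

    Assumes one session per client plus a full mesh among the reflectors.
    """
    return n_clients + full_mesh_sessions(n_reflectors)


if __name__ == "__main__":
    # 100 routers in total: either all meshed, or 98 clients behind 2 RRs.
    print("full mesh:", full_mesh_sessions(100))
    print("with 2 RRs:", route_reflector_sessions(98, 2))
    # Cost of adding one more PE to an existing 100-router full mesh.
    print("new PE, full mesh:", full_mesh_sessions(101) - full_mesh_sessions(100))
```

Under these assumptions, adding one PE to a 100-router full mesh requires 100 new sessions, while adding it behind a Route Reflector requires one.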
4 – Solution: Using BGP-Border Gateway Protocol (not MPLS), routers are organized into clusters, and each cluster is assigned a router, called an RR-Route Reflector, that acts as its interface to the rest of the AS-Autonomous System. Members of a cluster are called Clients and peer (interface) only with the Route Reflector. Route Reflectors also reduce the number of unknown or spurious Route Announcements received from other sources, because Clients listen only to the RR.
5 – RR-Route Reflector Road Rules
- The RR filters route announcements and reflects (sends) only the “best” route to Clients.
- Clients send their routes to the RR, and the RR reflects (sends) them to the other Clients.
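The two road rules above can be sketched in a few lines. This is a deliberately simplified model, not a BGP implementation: real best-path selection compares many attributes, whereas this sketch assumes the shortest AS path wins. The `Route` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str        # destination, e.g. "10.0.0.0/8"
    next_hop: str      # where to send traffic for this prefix
    as_path_len: int   # assumption: shorter AS path means "better"
    learned_from: str  # name of the Client that announced the route


def best_routes(routes):
    """Keep one route per prefix: the one with the shortest AS path."""
    best = {}
    for route in routes:
        current = best.get(route.prefix)
        if current is None or route.as_path_len < current.as_path_len:
            best[route.prefix] = route
    return best


def reflect(routes, clients):
    """Return {client: routes}: each Client receives the best routes,
    except those it announced itself."""
    best = best_routes(routes)
    return {
        client: [r for r in best.values() if r.learned_from != client]
        for client in clients
    }
```

In this sketch a Client never talks to another Client directly. Everything it learns arrives through `reflect`, which mirrors the rule that Clients listen only to the RR.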
*Here is a brief explanation of DNS. DNS-Domain Name System requests are carried over UDP-User Datagram Protocol or TCP-Transmission Control Protocol on port 53, which is assigned for DNS functions. UDP is the default; TCP is used for zone transfers and for responses too large for a UDP datagram. The Address or A-Record (shown here) maps (defines) the name of a server/host machine to its 32-bit numeric IP address, such as 198.22.34.9. That is, the A-Address record states the hostname and IP address of a certain machine. To “resolve” or translate a hostname (called the official name or CNAME-Canonical Name) means to find its matching IP address. This is the record that an NS-Name Server would send to another name server to answer a resolution query.
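For readers who want to see what an A-record query looks like on the wire, here is a minimal sketch that builds the query bytes and converts a 4-byte A-record answer back to dotted notation. It only constructs and decodes data and sends nothing over the network. The function names and the default query ID are hypothetical.

```python
import ipaddress
import struct


def build_a_query(hostname, query_id=0x1234):
    """Build the bytes of a DNS query asking for the A record of `hostname`."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is sent as length-prefixed labels ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A (address) record, QCLASS 1 = IN (Internet).
    question = qname + struct.pack("!HH", 1, 1)
    return header + question


def a_record_to_ip(rdata):
    """Convert the 4 data bytes (32 bits) of an A record to dotted notation."""
    return str(ipaddress.IPv4Address(rdata))
```

A query built this way would normally be sent in a UDP datagram to port 53 of a name server. The four bytes `198, 22, 34, 9` in an answer decode to the address `198.22.34.9`.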
Click on the image for a detailed tutorial on DNS.
**Great analogy. However, since few people really understand electricity (see TECHtionary.com for tutorials on the grid and smart grid), a much better analogy would be NASCAR pileups, which GoDaddy is also very familiar with, such as Danica Patrick wrecking Sam Hornish Jr. at Talladega. Maybe GoDaddy should spend a little more money on its network rather than on race cars.