Last Thursday afternoon, at approximately 2:25pm, there was a loud sucking sound that can only be heard by network engineers conditioned to expect bad, ugly things to happen at inopportune times, and all upstream connectivity to our corporate office died.
Predictably, IT was immediately assisted by many, many helpful people wandering by our area, sending emails, making phone calls, or stopping us in the hall to ask if we knew that the network was down. Usually in these situations the first couple of people get a good explanation of what we think the problem is, and what an ETA might be. After the 10th person, however, my responses tend to devolve a bit and I either end up giving a curt one-word answer, or feigning shock and amazement.
I should explain how our network is laid out. We have our IP provider, SIP trunks, point-to-point circuits, VPN endpoints, and all of our external-facing servers in a very robust telecom hotel–The Westin Building, for those keeping score–in downtown Seattle. From there, we move everything over our DS3 to our corporate headquarters not far from Seattle. We also have many other dedicated circuits, IPsec tunnels, and assorted ballyhoo to other locations around the world, but for discussion here just keep in mind the three locations I’ve described.
So the DS3 that is our lifeline was down. It was after hours in our Canadian location so with any luck nobody would notice all night–they use a lot of critical services across another DS3, but that also routes through Seattle first. Additionally, it was a particularly nice day in Seattle (rare) and a lot of people were already out of the office when this link went down. Hopefully we could file a trouble ticket and get this resolved fairly quickly.
Within just a few minutes of filing said trouble ticket, I had a representative of the provisioning telecom on the line who said that, yes, they saw a problem and would be dispatching technicians. There were some other calls following that, but the short version is that by 5:30pm “everything was fixed” according to the telecom and would we please verify so they could close the ticket. Unfortunately, the problem was not fixed.
Now the fun began. To appease the telecom representative, I accepted the possibility that my DS3 controller card had coincidentally died or locked the circuit or some other bunch of weird pseudo-engineer guessing from the telecom representative. This meant I had to drive to our data center in Seattle, in rush hour traffic, to personally kick the offending router in the teeth.
After an hour or so of typically nasty Seattle rush-hour traffic I arrived at the data center and began testing. Our DS3 controller was showing AIS (Alarm Indication Signal) on the line, so more technicians were dispatched to find the offending problem. Meanwhile, I wandered over to the Icon Grill to get some dinner and an après-ski beverage or two.
Fast forward a few hours and the AIS condition on the DS3 controller was gone, but I now had an interface status of “up up (looped)” which is less than ideal, shall we say. I decided at this point to cut my losses, head home, and possibly get some sleep while the telecom engineers and their cohort tried to figure out how this might be my fault.
With some three hours of sleep or so, I woke up at 5am and started looking at all of my emails, listening to all of my voicemails, and generally cursing anyone within earshot, which mostly meant the cats, as my wife was still asleep. At this point I got on a conference bridge with the President of the telecom broker we use, and together we managed to drag in a rep from the provisioning company who could then drag in as many engineers as needed to get the problem solved. Not, however, before I was rather pointedly told by said provisioning woman that I would have to pay for all of this, since the problem was “obviously with my equipment, since her software showed no loops in the circuit.”
Once the engineers started hooking up testers to the circuit–physically this time–they could see a loop, but at the Seattle side (the side reporting the loop). Another engineer saw a loop on the headquarters side, and still a third saw no loop at all. As it turns out, the circuit was provisioned by Company “A” who then handed off to Company “B” and finally to Company “C” who terminated the circuit at the demarcation point at our headquarters. All for less than 20 miles, but I digress. Finally we all agreed to have an engineer from Company “C” come onsite, interrupt the circuit physically at the demarcation equipment, and look back down the link to see what he could see. As a precaution at this point, and tired of being blamed for ridiculous things, my staff and I physically powered down our routers on either side of the link. Since the loop stayed, that was the last time I had anyone point the finger my way. Small miracles and all of that.
Once the rep from Company “C” got onsite and interrupted the circuit for tests, he was still seeing “all green” down the line. Since the other engineers monitoring were still seeing a loop, they asked him to shut down the circuit. He did, and they still saw a loop. This was one of those “Aha” moments for all of us except the engineer from Company “C” who just couldn’t figure out what the problem might be. All of us suspected that the loop was between the Fujitsu OC-3 box at our demarc and the upstream OC-48 Fujitsu mux a couple of miles away, and we finally convinced this guy to go check out the OC-48. Sure enough, a few minutes after he left, our circuit came back up again. And we all rejoiced, and ate Robin’s Minstrels.
At the end of the day, we ended up with just short of 24 hours of downtime, for a DS3 from a major telecom provider that everyone here would recognize; 23 hours and 5 minutes, to be exact. So what was the problem, and the solution? Any telecom guys want to stop reading here and take a guess?
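For anyone who likes to check the arithmetic, here is a rough Python sketch of what 23 hours and 5 minutes works out to. The calendar date is a placeholder (the post only says "last Thursday"), and the availability figure assumes this was the only outage in a 365-day year, which is an illustrative assumption and not a measured number.

```python
from datetime import datetime, timedelta

# Placeholder date: any Thursday will do, since the real date is not stated.
OUTAGE_START = datetime(2024, 1, 4, 14, 25)  # a Thursday, 2:25pm
DOWNTIME = timedelta(hours=23, minutes=5)    # figure quoted in the post


def restored_at(start: datetime, downtime: timedelta) -> datetime:
    """Moment the link came back, given when it died and how long it was down."""
    return start + downtime


def annual_availability(downtime: timedelta) -> float:
    """Fraction of a 365-day year the link was up (assumes no other outages)."""
    return 1 - downtime / timedelta(days=365)


restored = restored_at(OUTAGE_START, DOWNTIME)
short_of_a_day = timedelta(hours=24) - DOWNTIME

print(restored.strftime("%A %H:%M"))
print(f"{short_of_a_day} short of 24 hours")
print(f"{annual_availability(DOWNTIME):.4%} availability for the year")
```

In other words, the link would have come back around 1:30pm on Friday, and that single outage alone is enough to keep a circuit below "three nines" for the year.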
As it turns out, the original cause of our link going down was this same engineer pulling the circuit by mistake. When the trouble ticket was originally filed, he rushed out and “fixed” his mistake. But what he hadn’t noticed the first time were two critical things:
(1) The circuit had failed over to the protect pair. Circuits like this one use one fiber pair for the normally used (or working) path, and a separate fiber pair for the fail-over (or protect) path.
(2) The protect pair at the OC-3 box at the demarcation point hadn’t ever been installed.
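The interaction between those two facts can be sketched as a toy model in Python. This is not how the Fujitsu gear is configured or implemented: the class and method names are hypothetical, and the model assumes non-revertive switching (the circuit stays on the protect pair after the working pair is repaired), which the post does not state but which would be consistent with the working pair testing "all green" while the circuit stayed down.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One fiber pair. Hypothetical model, not a vendor data structure."""
    name: str
    installed: bool = True
    healthy: bool = True

    def usable(self) -> bool:
        return self.installed and self.healthy


class ProtectedCircuit:
    """Working/protect pair with non-revertive switching (an assumption)."""

    def __init__(self, working: Path, protect: Path) -> None:
        self.working = working
        self.protect = protect
        self.active = working

    def fail_working(self) -> None:
        # Someone pulls the working pair; traffic is switched to protect.
        self.working.healthy = False
        self.active = self.protect

    def repair_working(self) -> None:
        # Non-revertive: repairing the working pair does NOT switch back.
        self.working.healthy = True

    def carries_traffic(self) -> bool:
        return self.active.usable()


circuit = ProtectedCircuit(Path("working"), Path("protect", installed=False))
circuit.fail_working()    # the mistaken pull
circuit.repair_working()  # the "fix"

print(circuit.working.usable())   # the working pair tests clean
print(circuit.carries_traffic())  # but the circuit is still down
```

Under those assumptions, the "fix" restores a path that nothing is using, while the circuit sits on a protect pair that was never there.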
For lessons learned here, the main thing that comes to me is that we absolutely have to find a way to get true redundancy on this link, even if it means connecting our own strings between tin cans. I should explain, by the way, that redundancy to this headquarters building is very difficult due to location: the last-mile provider is the same no matter who we use to provision the circuit. In addition, with one major fiber loop in the area, even if we could get redundancy on the last mile we would still be at the mercy of that loop. After this incident, we are looking at a fixed line-of-sight (LoS) wireless option that has recently become available. Apparently we can get at least 20Mb/s, although I haven’t heard any claims on the latency, so we’ll see.
I’m also shocked and appalled that three major telecoms, all working in concert, took almost a full day to run this problem to ground. I’m probably naive, but I expect more. The only saving grace in all of this is the level of professionalism and support I received from the telecom broker we use. They were absolutely on top of this from the beginning and shepherded the whole process along, even facilitating communications between the players with their own conference bridge for the better part of a day. If anyone needs any telecom services brokered, anywhere in the world, I’m told, contact Rick Crabbe at Threshold Communications.
With this summation done, my venting complete, and everything right with the world, I’m off for a beverage.