packetqueue.net

Musings on computer stuff, and things... and other stuff.

April 9, 2013 ASA

ASA Upgrade to 9.0(2)

Reading Time: 4 minutes

“In theory there is no difference between theory and practice. In practice there is.” Yogi Berra

Usually, no matter how much planning, testing, thinking about, or stalling you build into an upgrade project, the bill comes due at the end and isn’t what you expected. Maybe someone ordered the lobster, or multiple drams of Johnnie Walker Blue, but either way you have a situation to deal with… right now. How you deal with it determines many things about your skills and your character, but that’s for another post as I’m going to try my best to keep this one short and to the point.

I recently had the wonderful experience of upgrading an Active/Passive failover pair of ASA units to the newest of the new code, 9.0(2) from 8.4.2(8). After the 8.3 kerfuffle (NAT changes automagically, anyone?) I was particularly keen to not miss any possible gotchas in this upgrade. I also scheduled a larger than usual maintenance window–even though we didn’t expect any downtime–just in case.

I should interject here, for those of you aghast at the thought that we would possibly implement new, some would say bleeding-edge, code on production systems, a couple of things:

  1. We always run bleeding-edge code, usually because we have need of the bleeding-edge features the code brings. In this case, IPv6 features sorely lacking in prior code versions.
  2. We have adopted a very aggressive IPv6 stance, as I have written about before, and we tend to find our aspirations and designs are well out in front of significant portions of the code available for our equipment.
  3. Noting the prior two items again, I’ll also add that firewalls, in particular, seem to have code that is months or years behind the route/switch world. That holds true across all vendor platforms. Why? I don’t know, but that’s another post.

With our need-for-upgrade bona fides established, I dutifully read the entire release notes for the 9.0(x) ASA code. While I was excited about many of the new features–mostly around IPv6–and disappointed at others (no OSPFv3 address families? Really?), something immediately jumped out at me: page 19 and its disturbing title, “ACL Migration in Version 9.0.”

Any time you see the word “migration” in any documentation referring to an upgrade of production code or configuration, you know two things:

  1. It probably happens auto-magically, which is basically a synonym for “we’re going to bork your code but we’re only going to loosely tell you where, how, and why.”
  2. You’d better have good backups and be prepared, because a roll-back is likely going to be as painful as just plowing ahead.

To summarize the boring details for you, prior to the new code you had two categories of Access Control Lists (ACLs): those for IPv4 and those for IPv6. Inside each of those macro-levels you had the normal standard and extended lists and whatever other features. You applied the IPv4 and the IPv6 access-lists to the interface in whatever direction and that was that. True to the dual-stack model, you really were running two parallel networks and never the ’twain shall meet.

During and after the upgrade to 9.0(x) ASA code, a couple of things happen:

  1. IPv6 standard ACLs are no longer supported, and any you have are migrated to extended ACLs.
  2. If IPv4 and IPv6 ACLs are applied in the same direction on the same interface, they are merged.
  3. The new keywords any4 and any6 are added in place of the old any keyword.
  4. Supposedly, if certain conditions are met (and they were in my case) your IPv4 and IPv6 ACLs should be merged into one (they were not). A rough sketch of what a merged ACL looks like follows this list.
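
For anyone who has not seen the post-9.0 syntax yet, here is a minimal sketch of what a merged ACL can look like. The ACL name, addresses, and ports are purely illustrative (not pulled from my production config); the point is simply that IPv4 and IPv6 entries now live in the same extended ACL, distinguished by the any4 and any6 keywords:

  ! One unified extended ACL carrying both address families (illustrative)
  access-list OUTSIDE_IN extended permit tcp any4 host 203.0.113.10 eq 443
  access-list OUTSIDE_IN extended permit tcp any6 host 2001:db8::10 eq 443
  access-list OUTSIDE_IN extended deny ip any4 any4
  access-list OUTSIDE_IN extended deny ip any6 any6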

While it is a bit scary to have any vendor automagically migrating portions of your configuration to a new format, it happens and as long as they document well and you do your due diligence, things can work out just fine. Other times they completely go to hell because of an undocumented feature. This upgrade fell somewhere in the middle.

As it turns out, a critical fact was left out of the documentation. Namely, that all of your access-groups that had been applied in some direction or another would now, quite frankly, not be applied to anything. In other words, my firewalls were now letting anything out of the network and nothing in. I quickly applied my new access-lists to the interfaces a couple of times before I discovered that you can now only have one applied per interface, per direction (par for most IOS devices).

Since these were production firewalls and I had somewhat higher risk on the IPv4 side (we have a lot of rules, and a default-block outbound policy) than the IPv6 side, I did the following:

  1. I blocked IPv6 in and out, then applied the IPv4 lists to the interfaces in the correct directions.
  2. I hand-migrated (Notepad is your friend) the IPv6 access rules into the IPv4 lists and brought IPv6 access back online, roughly as sketched below.
  3. I then deleted the redundant (old) ACLs.
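
For illustration, the hand-migration and reapply steps looked something like the following. The ACL and interface names are placeholders rather than my real configuration; the two important points are that a single extended ACL now carries both address families, and that access-group has to be re-applied because the upgrade left my interfaces with nothing bound to them:

  ! Fold the old IPv6 rules into the unified extended ACL
  access-list OUTSIDE_IN extended permit tcp any6 host 2001:db8::10 eq 25
  ! Re-bind the ACL; only one ACL per interface, per direction now
  access-group OUTSIDE_IN in interface outside
  ! Verify what is actually applied and that hit counts increment
  show run access-group
  show access-list OUTSIDE_IN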

Everything came back, life was good, and mostly nobody noticed anything. What are the lessons learned from this experience? Besides “don’t upgrade ASAs”? How about these:

  1. Always have a backup of your configuration, preferably taken a few minutes before you start the upgrade (the one-liner below is all it takes). In this case I didn’t use the backups for more than a reference, but they were available if I had wanted to roll back.
  2. Know your configuration and your devices. This seems intuitive, but a lot of people would have gotten partway through this migration, seen that their ACLs were borked, and been lost. If you’re going to live on the edge, at least have a helmet.
  3. Read the documentation. I did, and while it didn’t directly help, I at least knew ahead of time what was likely to break. I also knew, once it broke, what the likely problem area was. To tie this into the CCIE lab (back to studying, so it’s on my mind), it’s a bit like being able to look at a network diagram and instinctively know where you’ll have problems (two routers doing redistribution between EIGRP and OSPF, check).
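
The backup itself really is a one-liner. A minimal example, assuming you have a reachable TFTP server (the address and filename here are placeholders):

  ! Stash a copy off-box, and make sure startup-config matches running-config
  copy running-config tftp://192.0.2.50/asa-pre-upgrade.cfg
  write memory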

At the end of the day, it all worked out for a variety of reasons listed above. Would I suggest any readers out there try this sort of “no net” upgrade to bleeding-edge code? Probably not. In my case, I’m a masochist it seems, and this is my therapy. Now on to my 6500 upgrade to 15.1(1)SY. I’m sure I’ll be writing about that not long from today.

November 30, 2010 Apple

iTunes Home Sharing

Reading Time: 3 minutes

iTunes Home Sharing

A descent into the hell of Bonjour and black turtle-necks

This is just another short example of what I’m expecting will be a recurring theme here on Packet Queue: attention to detail. As a network engineer, as in so many professions, paying attention to the little things can mean the difference between 10 minutes of troubleshooting and 3 days of unmitigated, sleep-deprived hell. Luckily enough for me, the example I’m about to give wasn’t 3 days by any means, and since it was personal and not business, the urgency wasn’t the same as if a WAN link had failed. That said, I wanted it fixed.

My wife just bought a new computer—her first Mac since the original—and during the initial moving of files and such, I discovered a nifty feature of iTunes: Home Sharing. Now, I have a large iTunes library at home already—something on the order of almost 180 gigabytes—and wanted her to be able to share that library on her new Mac. After all, we’re not pirates; we just want to have access to our shared music library on any computer or device in the house relatively seamlessly. So I read a quick little blurb on the how-tos and why-fores of Home Sharing (real men sometimes read directions) and turned it on. Aside from the crickets, nothing happened. Sacrebleu!

Bonjour?

Bonjour! ¡No Hablo!

No, not a greeting, but a name given by Apple to their zeroconf implementation that allows devices (printers, storage, shares, etc.) to auto-magically find one another. This is the service that was supposed to make my iTunes library shareable between computers. This is the service that was supposed to make everything in my dull world shiny again. Not being overly steeped in the Apple world, however, has made me naturally suspicious of anything that “just works,” as more often than not, said thing only “just works” if you “just use it in this one way.” That natural suspicion of mine was proven to be well-founded.

Upon reading up on Bonjour, I discovered that it uses mDNS (multicast DNS) to find services. Well, I thought, that would mean that multicast routing should fix my woes, and I set off to work my magic. Of course, I had missed a critical detail that would have saved me some time: the multicast DNS implementation that forms a part of Bonjour uses the multicast group address of 224.0.0.251. If you haven’t noticed the problem yet, neither did I right away. Had I noticed said problem, I wouldn’t have completely reconfigured my ASA and 2811 for multicast routing, and I wouldn’t have started tracing packets with Wireshark.

The multicast range runs from 224.0.0.0 through 239.255.255.255, as every first-year networking student probably knows. But that range is like all other ranges and has certain reserved addresses within it. In our case, the most interesting range is 224.0.0.0/24, which is known as the Local Network Control Block, or sometimes just link-local. Addresses in this range include the OSPF addresses 224.0.0.5 and .6 and the RIPv2 address 224.0.0.9, among others. The salient detail is that these multicast addresses are typically sourced with a TTL of 1 and are not to be sent off of the broadcast domain in which they originate.
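
To make that concrete: the multicast routing I turned on that night looked roughly like the sketch below (the interface names are placeholders for my own setup), and none of it can help, because groups in 224.0.0.0/24, 224.0.0.251 included, are link-local by definition and PIM will never forward them between VLANs:

  ! Enabling multicast routing on the 2811 (illustrative interfaces)
  ip multicast-routing
  interface FastEthernet0/0
   ip pim sparse-dense-mode
  interface FastEthernet0/1
   ip pim sparse-dense-mode
  ! None of this matters for mDNS: 224.0.0.251 is in the Local Network Control
  ! Block, is sent with a TTL of 1, and is never routed off the local segment.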

My wireless network, which my wife’s new Mac is on, is a different VLAN (and hence, different broadcast domain) from my wired network. In fact, between my three wireless networks and multiple lab networks, my home environment probably has something on the order of 25 different broadcast domains. Definitely not the norm for the average user, but also not uncommon if you start looking at more technical people or production environments. So, the bottom line is that Bonjour and iTunes won’t work in my environment without an mDNS proxy or some other trickery.

What bothers me most about this revelation is that a lot of Apple’s software and peripherals work on this same system. AirPort (Apple’s wireless), as well as their printer setup, shares, etc., all work using Bonjour and so are, from at least a simple viewpoint, broken across broadcast domains. I’m guessing from Google searches and such that it’s a minority of iTunes users who are concerned about this, and so it may not even make sense for Apple to address the problem. But if you extrapolate that out to everything else using Bonjour, and consider a corporate network environment, I have to wonder how much of this contributes to Apple’s lack of penetration into enterprise networks.

As always, if I’ve gotten details wrong or you’d just like to offer your own opinion and further the discussion, I can be reached here on this blog or via @someclown on Twitter.

November 21, 2010 Fail

Things I Hate, Episode 1

Reading Time: 3 minutes

Things I Hate, Episode 1

Brought to you by the Impediment-to-Sales Sales department

It was not long ago that I was sitting across a conference room table from our [insert large software vendor of choice here] representative, expecting to have a conversation about the features and benefits of upgrading one of our major software packages to the newest version. This was our internal mail system, and as we had quite a few interconnected systems and sites, along with what we already knew of the major architectural changes in the new version, we knew the upgrade wouldn’t be easy. So, we acquiesced to our salesperson’s requests and set up the meeting. That was our first mistake.

After the requisite initial pleasantries were exchanged, we began discussing the product in question. We weren’t sold just yet on actually doing the upgrade, so one of the first questions we wanted answered was basically just a simple “Why do we want to upgrade?” In other words, we already have a working system, so what does this newest version bring to the table vis-à-vis new features, benefits, manageability, etc.? Asking this question–and expecting a clear, useful answer–turned out to be not only an exercise in futility, but also mistake number two.

“It allows you to create pockets of collaboration by leveraging out-of-the-box, paradigm-shifting synergies of strategic planning.”

But what does it do?

“The new version better leverages vertical interest segments in a transitory user base, which represents a shifting paradigm in strategy-focused mind-share and thought-leadership.”

The conversation went on like that for a bit before we finally decided to cut our losses and move on to some other topics around our upcoming license renewal, etc. As it turns out, substantial pockets of the sales force at certain large software vendors seem to be trained in a language that sounds a lot like English, uses a lot of interesting words strung together in fairly obscure patterns, and in the end almost exactly fails to communicate anything at all useful. The unfortunate thing is that this was supposedly the expert in the product line who could answer our questions–he was brought along to the meeting specifically to speak “engineer-to-engineer.”

Now, I am not only a network engineer but also the IT Director for a multi-national manufacturing firm. I am used to straddling the line between engineering and management, and I actually pride myself on being able to communicate complex engineering principles to C-level executives in a way that makes sense and accomplishes something. I don’t think that I’m so far gone on the engineering side that I have to have a team of PhDs come in every time I want to learn about a product. That said, I do expect that my time will be respected, and when I want to know what your product has to offer, that you will take the radical step as a vendor of bringing along someone who knows what the hell they’re talking about.

The moral of the story is that we did not upgrade to that new product version then, nor have we since. Not out of spite or any bad feelings for the vendor as a whole, but simply because we finally found the answers we needed from a combination of white papers and some peer groups with whom we maintain relationships. For you vendors who can’t seem to articulate what your product actually does without using a hodge-podge of terms poached from a buzzword bingo card, my general gut reaction is that your product is probably not unique or helpful in any meaningful way–and that is not the first impression you as a vendor or salesperson want to make. I suspect I am not alone in that feeling, either.

October 4, 2010 ASA

Back from China

Reading Time: 4 minutes

Back from China

After roughly one week in Shanghai, China, to set up a new site on our corporate network, it is painfully apparent that I need to get back into this writing business. The dates on my posts belie my weak attempts at covering up my laziness. That said, there were at least a couple of things of note worth using up a few words on.

Note One (where it really is someone else’s fault)

Due to an unfortunate series of poor decisions, poor project management, and a quite sudden and unreasonable expectation of delivery dates, we [IT] were forced to poach some bandwidth from a sister company of ours that had a slight excess in the manufacturing facility where our new office was to be set up. By slight, I mean exactly 256kb. For those of you not accustomed to seeing the abbreviation for kilo, well, you’re too young. More on that in a moment.

All of that aside, we engineered the circuit all the way from the provider’s edge in Shanghai back to our facilities on the West Coast and verified traffic was flowing. Once we got to Shanghai, hooked up our router, and started building out the network behind it, we noticed that we couldn’t move any traffic at all. With a quick extended ping using the inside network interface as the source, as well as some traceroutes from other places, we verified that the provider had neglected to put a route in the BGP tables for the new network. Thank God for 24-hour NOC support; within 30 minutes that problem was resolved.
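
The checks themselves were nothing exotic, roughly the following (addresses and interface names invented for the example). The ping sourced from the inside interface is the key test, because it is the return route for the new inside prefix that the provider had never injected into BGP:

  ! Ping a known-good host back home, sourced from the new inside network
  ping 198.51.100.10 source GigabitEthernet0/1
  ! Trace from the router itself to see where the path dies
  traceroute 198.51.100.10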

Note Two (where the author tries to check the turn-signal fluid)

As we moved on to creating and joining up a shiny new R2 Read-Only Domain Controller (RODC), everything went off the rails. Timeouts galore. DNS wouldn’t resolve quickly enough to allow the new DC to join the forest. Off I go on a jolly search for default timeout settings, registry tweaks, and offline methods to install a DC (ugly at best), generally going further down the rabbit hole of complexity, even going so far as to direct another engineer working for me to prep for a call to Microsoft (never fun).

Having already racked a lot of gear, we decided to call it a night and come back fresh in the morning. I always find it helpful to contemplate problems like this over a good single-malt Scotch. So I did, a few times, and that led to morning and a face-palm moment. To wit:

In the cab on the way to the office the next day, it occurred to me that I should check the security on the firewalls back home. I knew I didn’t put any ACLs on that new link, knowing I’d be testing; I prefer to test in the absence of artificial problems, then crank down the screws once I’m confident of the design. I thought, however, that I had overlooked a NAT exemption or something else and decided to spend some quality time checking that portion of the infrastructure.

So, I got my coffee at the office (thankfully the office manager is a Kiwi who favors Starbucks) and started to look over the configuration of the firewalls. Right there was my face-palm moment: an ACL which, in some security-conscious delusion, I had put on the link in question, allowing ICMP traffic and denying everything else. *GAAAAAH* Long story short, I changed the ACL and everything “magically” started working.
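
For the curious, the offending configuration amounted to something like the sketch below (names and addressing invented for illustration). Remember that an ASA access list ends in an implicit deny, so “permit ICMP only” quietly becomes “deny everything that matters”:

  ! What was on the link: ICMP only, with the implicit deny catching the rest
  access-list CHINA_IN extended permit icmp any any
  access-group CHINA_IN in interface china
  ! The quick fix while still testing: open it up, tighten the screws later
  access-list CHINA_IN extended permit ip any any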

All I can say here is that it doesn’t matter who you are or how much experience with troubleshooting you have: always start with the simple stuff first. Some of the best advice I ever got was from an instructor of mine who was fond of saying “be the packet.” By that he just meant that you have to start from the packet’s point of view and slowly work through everything that happens to said packet from beginning to end. Wiring, ARPing, routing, whatever. Be the packet.

Also, don’t be afraid to admit mistakes. They will happen and hopefully other people around you can learn from them. At least after they’re done laughing. Which sometimes takes a while.

Note Three (hey you kids, get off my @#$ lawn)

I just wanted to take a quick moment to address the inevitable questions about our link at 256kb, and the musings that I obviously must have meant mb instead. Bandwidth is always seen as one of those more-is-better kinds of things, and ignoring the temptation to toss out the standard bandwidth/delay screed here, let me just tell you that you can get by with less than you think. By the way, those of us who can remember when a 2400 baud modem was blow-your-hair-back fast (multiple lines of text at once!) tend to be more pragmatic about these things.

On that link we currently have a domain controller running, several workstations with email, and our main corporate ERP software. We also, and this is what I really like, have two 7900-series IP phones running. These are both homed to our main CUCM, Unity, and IPCC servers back in the U.S. and have excellent call quality. In fact, we tested a phone call (no local calling for these phones in China) from those phones in Shanghai, through the CUCM in the U.S., and back to a U.S.-homed cell phone in Shanghai and got no discernible jitter.
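
If those numbers sound implausible for 256kb, a little back-of-the-envelope math helps. Assuming a compressed codec such as G.729 (a typical planning figure, not necessarily our exact configuration), each call costs roughly 25-30kb/s of WAN bandwidth once you include the IP/UDP/RTP overhead, so even two simultaneous calls leave something on the order of 200kb/s for everything else. Tight, but workable.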

Moral of the story? You can get by with less bandwidth than you think. Would I choose to? Hell no! 🙂

August 10, 2010 Cisco

New Post Delay…

Reading Time: 1 minute

For all of you paying any attention at all, I owe you an apology for the complete lack of writing the last week or so. Last week, however, a large pile of backordered Cisco gear showed up at the office and needed to be staged *immediately*, or as close to that as could reasonably be expected. We’re in the throes of a complete infrastructure upgrade, and everything sort of backed up unexpectedly with the recent delivery problems from Cisco. I would question how that becomes my problem, but after 16+ years doing this professionally, I already know the answer to that question. At any rate, look for a new post in my series on 802.1x in the next day or so.

July 27, 2010 DS3

TELCO FAIL

Reading Time: 5 minutes

Last Thursday afternoon, at approximately 2:25pm, there was a loud sucking sound that can only be heard by network engineers conditioned to expect bad, ugly things to happen at inopportune times, and all upstream connectivity to our corporate office died.

*Ka-phoot*

Predictably, IT was immediately assisted by many, many helpful people wandering by our area, sending emails, making phone calls, or stopping us in the hall to ask if we knew that the network was down. Usually in these situations the first couple of people get a good explanation of what we think the problem is and what an ETA might be. After the 10th person, however, my responses tend to devolve a bit, and I either end up giving a curt one-word answer or feigning shock and amazement.

I should explain here that, the way the architecture of our network works, we have our IP provider, SIP trunks, point-to-point circuits, VPN end-points, and all of our external-facing servers in a very robust telecom hotel–The Westin Building, for those keeping score–in downtown Seattle. From there, we move everything over our DS3 to our corporate headquarters not far from Seattle. We also have many other dedicated circuits, IPsec tunnels, and assorted ballyhoo to other locations around the world, but for discussion here just keep in mind the three locations I’ve described.

So the DS3 that is our lifeline was down. It was after hours at our Canadian location, so with any luck nobody would notice all night–they use a lot of critical services across another DS3, but that also routes through Seattle first. Additionally, it was a particularly nice day in Seattle (rare), and a lot of people were already out of the office when this link went down. Hopefully we could file a trouble ticket and get this resolved fairly quickly.

Within just a few minutes of filing said trouble ticket, I had a representative of the provisioning telecom on the line who said that, yes, they saw a problem and would be dispatching technicians. There were some other calls following that, but the short version is that by 5:30pm “everything was fixed” according to the telecom, and would we please verify so they could close the ticket. Unfortunately, the problem was not fixed.

Now the fun began. To appease the telecom representative, I accepted the possibility that my DS3 controller card had coincidentally died or locked the circuit, or some other bunch of weird pseudo-engineer guessing from the telecom representative. This meant I had to drive to our data center in Seattle, in rush hour traffic, to personally kick the offending router in the teeth.

After an hour or so of typically nasty Seattle rush-hour traffic I arrived at the datacenter and began testing. Our DS3 controller was showing AIS (alarm indication signal) on the line, so more technicians were dispatched to find the offending problem. Meanwhile, I wandered over to the Icon Grill to get some dinner and an après-ski beverage or two.

Fast forward a few hours and the AIS condition on the DS3 controller was gone, but I now had an interface status of “up up (looped),” which is less than ideal, shall we say. I decided at this point to cut my losses and head home and possibly get some sleep while the telecom engineers and their cohort tried to figure out how this might be my fault.
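
If you have never chased one of these, the view from the router side looks roughly like the following (slot/port numbers are placeholders and the output line is paraphrased from memory, not copied from the device). The “(looped)” flag means the router is seeing its own keepalives reflected back at it, which is why line protocol stays up while nothing useful passes:

  show controllers t3 1/0
  show interfaces serial 1/0
  ! Serial1/0 is up, line protocol is up (looped)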

With some three hours of sleep or so, I woke up at 5am and started looking at all of my emails, listening to all of my voicemails, and generally cursing anyone within earshot–mostly consisting of the cats–as my wife was still asleep. At this point I got on a conference bridge with the president of the telecom broker we use, and together we managed to drag in a rep from the provisioning company who could then drag in as many engineers as needed to get the problem solved. Not, however, before I was rather pointedly told by said provisioning woman that I would have to pay for all of this cost since the problem was “obviously with my equipment, since her software showed no loops in the circuit.”

Once the engineers started hooking up testers to the circuit–physically this time–they could see a loop, but at the Seattle side (the side reporting the loop). Another engineer saw a loop on the headquarters side, and still a third saw no loop at all. As it turns out, the circuit was provisioned by company “A,” who then handed off to company “B,” and finally to company “C,” who terminated the circuit at the demarcation point at our headquarters. All for less than 20 miles, but I digress. Finally we all agreed to have company “C” come onsite, interrupt the circuit physically at the demarcation equipment, and look back down the link to see what he could see. As a precaution at this point, and tired of being blamed for ridiculous things, my staff and I physically powered down our routers on either side of the link. Since the loop stayed, that was the last time I had anyone point the finger my way. Small miracles and all of that.

Once the rep from company “C” got onsite and interrupted the circuit for tests, he was still seeing “all green” down the line. Since the other engineers monitoring were still seeing a loop, they asked him to shut down the circuit. He did, and they still saw a loop. This was one of those “aha” moments for all of us except the engineer from company “C,” who just couldn’t figure out what the problem might be. All of us suspected that the loop was between the Fujitsu OC-3 box at our demarc and the upstream Fujitsu OC-48 mux a couple of miles away, and we finally convinced this guy to go check out the OC-48. Sure enough, a few minutes after he left, our circuit came back on again. And we all rejoiced, and ate Robin’s minstrels.

At the end of the day, we ended up with just short of 24 hours of downtime for a DS-3 from a major telecom provider that everyone here would recognize; 23 hours and 5 minutes, to be exact. So what was the problem, and the solution? Any telecom guys want to stop reading here and take a guess?

As it turns out, the original cause of our link going down was this same engineer pulling the circuit by mistake. When the trouble ticket was originally filed, he rushed out and “fixed” his mistake. But what he hadn’t noticed the first time were two critical things:

(1) The circuit had failed over to the protect pair. DS3 circuits use one pair of fiber for the normally used (or working) circuit, and a separate fiber pair for the fail-over (or protect) circuit.

(2) The protect pair at the OC-3 box at the demarcation point hadn’t ever been installed.

For lessons learned here, the main thing that comes to mind is that we absolutely have to find a way to get true redundancy on this link, even if it means connecting our own strings between tin cans. I should explain, by the way, that redundancy to this headquarters building is very difficult due to location: the last-mile provider is the same no matter who we use to provision the circuit. In addition, with one major fiber loop in the area, even if we could get redundancy on the last mile we would still be at the mercy of that loop. We are at this point, after this incident, looking at a fixed line-of-sight (LoS) wireless option that has recently become available. Apparently we can get at least 20Mb/s, although I haven’t heard any claims on the latency, so we’ll see.

I’m also shocked and appalled that three major telecoms, all working in concert, took almost a full day to run this problem to ground. I’m probably naive, but I expect more. The only saving grace in all of this is the level of professionalism and support I received from the telecom brokers we use. They were absolutely on top of this from the beginning and shepherded the whole process along, even facilitating communications between the players with their own conference bridge for the better part of a day. If anyone needs any telecom services brokered, anywhere in the world I’m told, contact Rick Crabbe at Threshold Communications.

With this summation done, my venting complete, and everything right with the world, I’m off for a beverage.
