
packetqueue.net

Musings on computer stuff, and things... and other stuff.


April 9, 2013 ASA

ASA Upgrade to 9.0(2)

Reading Time: 4 minutes

“In theory there is no difference between theory and practice. In practice there is.” (Yogi Berra)

Usually, no matter how much planning, testing, thinking about, or stalling you build into an upgrade project, the bill comes due at the end and isn’t what you expected. Maybe someone ordered the lobster, or multiple drams of Johnnie Walker Blue, but either way you have a situation to deal with… right now. How you deal with it determines many things about your skills and your character, but that’s for another post as I’m going to try my best to keep this one short and to the point.

I recently had the wonderful experience of upgrading an Active/Passive failover pair of ASA units to the newest of the new code, 9.0(2) from 8.4.2(8). After the 8.3 kerfuffle (NAT changes automagically, anyone?) I was particularly keen to not miss any possible gotchas in this upgrade. I also scheduled a larger than usual maintenance window–even though we didn’t expect any downtime–just in case.

I should interject here, for those of you aghast at the thought that we would possibly implement new, some would say, bleeding edge code on production systems, a couple of things:

  1. We always run bleeding edge code, usually because we have need of the bleeding edge features the code brings. In this case, IPv6 features sorely lacking in prior code versions.
  2. We have adopted a very aggressive IPv6 stance, as I have written about before, and we tend to find our aspirations and designs are well out in front of significant portions of the code available for our equipment.
  3. Noting the prior two items again, I’ll also add that firewalls, in particular, seem to have code that is months or years behind the route/switch world. That holds true across all vendor platforms. Why? I don’t know, but that’s another post.

With our need-for-upgrade bona fides established, I dutifully read the entire release notes for the 9.0(x) ASA code, and while I was excited at many of the new features–mostly around IPv6–and disappointed at others (No OSPFv3 address families? Really?), something immediately jumped out at me: page 19 and its disturbing title, “ACL Migration in Version 9.0”.

Any time you see the word “migration” in any documentation referring to an upgrade of production code or configuration, you know two things:

  1. It probably happens auto-magically, which is basically a synonym for “we’re going to bork your code but we’re only going to loosely tell you where, how, and why.”
  2. You’d better have good backups and be prepared, because a roll-back is likely going to be as painful as just plowing ahead.

To summarize the boring details for you, prior to the new code you had two categories of Access Control Lists (ACLs): those for IPv4 and those for IPv6. Inside each of those macro-levels you had the normal standard and extended lists and whatever other features. You applied the IPv4 and the IPv6 access-lists to the interface in whatever direction and that was that. True to the dual-stack model, you really were running two parallel networks and never the ‘twain shall meet.

During and after the upgrade to 9.0(x) ASA code, a couple of things happen:

  1. IPv6 standard ACLs are no longer supported, and any you have are migrated to extended ACLs.
  2. If IPv4 and IPv6 ACLs are applied in the same direction on the same interface, they are merged.
  3. The new keywords any4 and any6 are added in place of the old any keyword.
  4. Supposedly, if certain conditions are met (and they were in my case) your IPv4 and IPv6 ACLs should be merged into one (they were not).
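To make the merge concrete, here is a minimal sketch of what a unified post-9.0 ACL looks like. The ACL name, interface name, and addresses are hypothetical examples of mine, not pulled from my actual configuration:

```text
! Post-9.0 unified ACL: one list carries both IPv4 and IPv6 ACEs,
! with the old "any" keyword split into any4 and any6.
access-list OUTSIDE_IN extended permit tcp any4 host 192.0.2.10 eq 443
access-list OUTSIDE_IN extended permit tcp any6 host 2001:db8::10 eq 443
!
! And only one access-group per interface/direction in 9.0(x):
access-group OUTSIDE_IN in interface outside
```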

While it is a bit scary to have any vendor automagically migrating portions of your configuration to a new format, it happens, and as long as they document well and you do your due diligence, things can work out just fine. Other times they completely go to hell because of an undocumented feature. This upgrade fell somewhere in the middle.

As it turns out, a critical fact was left out of the documentation. Namely, that all of your access-groups that had been applied in some direction or another would now, quite frankly, not be applied to anything. In other words, my firewalls were now letting anything out of the network and nothing in. I quickly applied my new access-lists to the interfaces a couple of times before I discovered that you can now only have one applied in any direction (par for most IOS devices).

Since these were production units and I had higher risk on the IPv4 side (we have a lot of rules, and a default-block outbound policy) than on the IPv6 side, I did the following:

  1. I blocked IPv6 in and out, then applied the IPv4 lists to the interfaces in the correct directions.
  2. I hand-migrated (notepad is your friend) the IPv6 access rules into the IPv4 lists and brought IPv6 access back online.
  3. I then deleted the redundant (old) ACLs.
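Sketched as CLI (again with made-up ACL and interface names, not my real policy), the steps above look roughly like:

```text
! Re-apply the restored IPv4 lists; 9.0(x) allows only one ACL
! per interface per direction
access-group OUTSIDE_IN in interface outside
access-group INSIDE_OUT in interface inside
!
! After hand-merging the IPv6 ACEs into the unified lists,
! remove the now-redundant old IPv6 ACL by name
clear configure access-list OLD_V6_LIST
```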

Everything came back, life was good, and mostly nobody noticed anything. What are the lessons learned from this experience? Besides don’t upgrade ASAs? How about these:

  1. Always have a backup of your configuration, preferably taken a few minutes before you start the upgrade. In this case I didn’t use the backups for more than a reference, but they were available if I had wanted to roll back.
  2. Know your configuration and your devices. This seems intuitive, but a lot of people would have gotten partway through this migration, seen that their ACLs were borked, and been lost. If you’re going to live on the edge, at least have a helmet.
  3. Read the documentation. I did, and while it didn’t directly help, I at least knew ahead of time what was likely to break. I also knew once it broke what the likely problem area was. To tie this into the CCIE Lab (back to studying, so it’s on my mind), it’s a bit like being able to look at a network diagram and instinctively know where you’ll have problems (two routers doing redistribution between EIGRP and OSPF, check).
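On the backup point, a pre-upgrade snapshot of an ASA configuration takes seconds; the TFTP server address and filenames below are hypothetical:

```text
! Off-box copy, taken minutes before the upgrade
copy running-config tftp://192.0.2.50/asa-pre-9.0-backup.cfg
!
! A local copy on flash as well, in case the TFTP server
! is unreachable when you need to roll back
copy running-config disk0:/pre-9.0-backup.cfg
```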

At the end of the day, it all worked out for a variety of reasons listed above. Would I suggest any readers out there try this sort of “no net” upgrade to bleeding edge code? Probably not. In my case, I’m a masochist it seems, and this is my therapy. Now on to my 6500 upgrade to 15.1(1)SY. I’m sure I’ll be writing about that not long from today.


November 21, 2010 Fail

Things I Hate, Episode 1

Reading Time: 3 minutes


Brought to you by the Impediment-to-Sales Sales department

It was not long ago that I was sitting across a conference room table from our [insert large software vendor of choice here] expecting to have a conversation about the features and benefits of upgrading one of our major software packages to the newest version. This was our internal mail system, and as we had quite a few interconnected systems and sites, along with what we already knew of the major architectural changes in the new version, we knew the upgrade wouldn’t be easy. So, we acquiesced to our salesperson’s requests and set up the meeting. That was our first mistake.

After the requisite initial pleasantries were exchanged, we began discussing the product in question. We weren’t sold just yet on actually doing the upgrade, so one of the first questions we wanted answered was basically just a simple “Why do we want to upgrade?” In other words, we already have a working system, so what does this newest version bring to the table vis-à-vis new features, benefits, manageability, etc.? Asking this question–and expecting a clear, useful answer–turned out to be not only an exercise in futility, but also mistake number two.

“It allows you to create pockets of collaboration by leveraging out-of-the-box, paradigm-shifting synergies of strategic planning.”

But what does it do?

“The new version better leverages vertical interest segments in a transitory user base, which represents a shifting paradigm in strategy-focused mind-share and thought-leadership.”

The conversation went on like that for a bit before we finally decided to cut our losses and move on to some other topics around our upcoming license renewal, etc. As it turns out, substantial pockets of the sales force at certain large software vendors seem to be trained in a language that sounds a lot like English, uses a lot of interesting words strung together in fairly obscure patterns, and in the end almost exactly fails to communicate anything at all useful. The unfortunate thing is that this was supposedly the expert in the product line who could answer our questions–he was brought along to the meeting specifically to speak “engineer-to-engineer.”

Now, I am not only a network engineer but also the IT Director for a multi-national manufacturing firm. I am used to straddling the line between engineering and management, and actually pride myself on being able to communicate complex engineering principles to C-level executives in a way that makes sense and accomplishes something. I don’t think that I’m so far gone on the engineering side that I have to have a team of PhDs come in every time I want to learn about a product. That said, I do expect that my time will be respected, and that when I want to know what your product has to offer you will take the radical step as a vendor of bringing along someone who knows what the hell they’re talking about.

The moral of the story is that we did not upgrade to that new product version then, nor have we since. Not out of spite or any bad feelings for the vendor as a whole, but simply because we finally found the answers we needed from a combination of white papers and some peer groups with whom we maintain relationships. For you vendors who can’t seem to articulate what your product actually does without using a hodge-podge of terms poached from a buzz-word bingo card, my general gut reaction is that your product is probably not unique or helpful in any meaningful way–and that is not the first impression you as a vendor or salesperson want to make. I suspect I am not alone in that feeling, either.


July 27, 2010 DS3

TELCO FAIL

Reading Time: 5 minutes

Last Thursday afternoon, at approximately 2:25pm, there was a loud sucking sound that can only be heard by network engineers conditioned to expect bad, ugly things to happen at inopportune times, and all upstream connectivity to our corporate office died.

*Ka-phoot*

Predictably, IT was immediately assisted by many, many helpful people wandering by our area, sending emails, making phone calls, or stopping us in the hall to ask if we knew that the network was down. Usually in these situations the first couple of people get a good explanation of what we think the problem is, and what an ETA might be. After the 10th person, however, my responses tend to devolve a bit and I either end up giving a curt one-word answer, or feigning shock and amazement.

I should explain here that, because of the way our network is architected, we have our IP provider, SIP trunks, point-to-point circuits, VPN end-points, and all of our external-facing servers in a very robust telecom hotel–The Westin Building, for those keeping score–in downtown Seattle. From there, we move everything over our DS3 to our corporate headquarters not far from Seattle. We also have many other dedicated circuits, IPsec tunnels, and assorted ballyhoo to other locations around the world, but for discussion here just keep in mind the three locations I’ve described.

So the DS3 that is our lifeline was down. It was after hours in our Canadian location so with any luck nobody would notice all night–they use a lot of critical services across another DS3, but that also routes through Seattle first. Additionally, it was a particularly nice day in Seattle (rare) and a lot of people were already out of the office when this link went down. Hopefully we could file a trouble ticket and get this resolved fairly quickly.

Within just a few minutes of filing said trouble ticket, I had a representative of the provisioning telecom on the line who said that, yes, they saw a problem and would be dispatching technicians. There were some other calls following that, but the short version is that by 5:30pm “everything was fixed” according to the telecom, and would we please verify so they could close the ticket. Unfortunately, the problem was not fixed.

Now the fun began. To appease the telecom representative, I accepted the possibility that my DS3 controller card had coincidentally died or locked the circuit, or some other bit of weird pseudo-engineer guessing. This meant I had to drive to our data center in Seattle, in rush hour traffic, to personally kick the offending router in the teeth.

After an hour or so of typically nasty Seattle rush-hour traffic I arrived at the datacenter and began testing. Our DS3 controller was showing AIP on the line, so more technicians were dispatched to find the offending problem. Meanwhile, I wandered over to the Icon Grill to get some dinner and an après-ski beverage or two.

Fast forward a few hours and the AIP condition on the DS3 controller was gone, but I now had an interface status of “up up (looped)” which is less than ideal, shall we say. I decided at this point to cut my losses and head home and possibly get some sleep while the telecom engineers and their cohort tried to figure out how this might be my fault.

With some three hours of sleep or so, I woke up at 5am and started looking at all of my emails, listening to all of my voicemails, and generally cursing anyone within earshot–mostly consisting of the cats–as my wife was still asleep. At this point I got on a conference bridge with the President of the telecom broker we use, and together we managed to drag a rep in from the provisioning company who could then drag in as many engineers as needed to get the problem solved. Not, however, before I was rather pointedly told by said provisioning woman that I would have to pay for all of this cost since the problem was “obviously with my equipment, since her software showed no loops in the circuit.”

Once the engineers started hooking up testers to the circuit–physically this time–they could see a loop, but at the Seattle side (the side reporting the loop). Another engineer saw a loop on the headquarters side, and still a third saw no loop at all. As it turns out, the circuit was provisioned by company “A” who then handed off to company “B” and finally to company “C” who terminated the circuit at the demarcation point at our headquarters. All for less than 20 miles, but I digress. Finally we all agreed to have company “C” come onsite, interrupt the circuit physically at the demarcation equipment, and look back down the link to see what he could see. As a precaution at this point, and tired of being blamed for ridiculous things, I and my staff physically powered down our routers on either side of the link. Since the loop stayed, that was the last time I had anyone point the finger my way. Small miracles and all of that.

Once the rep from company “C” got onsite and interrupted the circuit for tests, he was still seeing “all green” down the line. Since the other engineers monitoring were still seeing a loop, they asked him to shut down the circuit. He did, and they still saw a loop. This was one of those “Aha” moments for all of us except the engineer from company “C”, who just couldn’t figure out what the problem might be. All of us suspected that the loop was between the Fujitsu OC-3 box at our demarc and the upstream OC-48 Fujitsu mux a couple of miles away, and we finally convinced this guy to go check out the OC-48. Sure enough, a few minutes after he left, our circuit came back on again. And we all rejoiced, and ate Robin’s Minstrels.

At the end of the day, we ended up with just short of 24 hours of downtime for a DS-3 from a major telecom provider that everyone here would recognize; 23 hours and 5 minutes, to be exact. So what was the problem, and the solution? Any telecom guys want to stop reading here and take a guess?

As it turns out, the original cause of our link going down was this same engineer pulling the circuit by mistake. When the trouble ticket was originally filed, he rushed out and “fixed” his mistake. But what he hadn’t noticed the first time were two critical things:

(1)    The circuit had failed over to the protect pair. DS3 circuits use one pair of fiber for the normally used (or working) circuit, and a separate fiber pair for the fail-over (or protect) circuit.

(2)    The protect pair at the OC-3 box at the demarcation point hadn’t ever been installed.

For lessons learned here, the main thing that comes to me is that we absolutely have to find a way to get true redundancy on this link, even if it means connecting our own strings between tin cans. I should explain, by the way, that redundancy to this headquarters building is very difficult due to location: the last-mile provider is the same no matter who we use to provision the circuit. In addition, with one major fiber loop in the area, even if we could get redundancy on the last mile we would still be at the mercy of that loop. We are at this point, after this incident, looking at a fixed LoS wireless option that has recently become available. Apparently we can get at least 20 Mb/s, although I haven’t heard any claims on the latency, so we’ll see.

I’m also shocked and appalled that three major telecoms, all working in concert, took almost a full day to run this problem to ground. I’m probably naive, but I expect more. The only saving grace in all of this is the level of professionalism and support I received from the telecom brokers we use. They were absolutely on top of this from the beginning and shepherded the whole process along, even facilitating communications between the players with their own conference bridge for the better part of a day. If anyone needs any telecom services brokered, anywhere in the world I’m told, contact Rick Crabbe at Threshold Communications.

With this summation done, my venting complete, and everything right with the world, I’m off for a beverage.


July 25, 2010 Uncategorized

ASA TU

Reading Time: 2 minutes

You ever have one of those weeks where everything that can go wrong, does? And even things that can’t go wrong still do? Last week was that week for me. But more on that later.

I’m finally relaxing with a beer in my home office, Friday afternoon, after said hell-week, when suddenly I notice that my desk phone has mysteriously powered off. Beyond the visual cue of no screen display, and a nagging suspicion that something was still not right in the world, I also heard it when it shut down (the Cisco 7940 models in particular seem to make some noise when turning on and off). Since this phone is using PoE from a Cisco ASA 5505, I glanced over at the 5505 to see what might be causing all of the unhappiness. Immediately I noticed that something wasn’t right, as the status light was orange for a second, then the whole unit rebooted. At that point the display lights go all green, then amber, then another reboot… ad nauseam.

What the @#$!?

Trying to log in via either SSH or ASDM yields no love at all, so I hook up a console cable. There it is: the ASA apparently doesn’t have any OS to boot.

Again with the @#$!?

So, I take the cover off and pull out the *cough* expensive-as-hell *cough* flash card to double-check things on a desktop computer. Sure enough, the computer reports that the flash card is not formatted. So I formatted it, reinstalled the OS and license information from–wait for it–the backup I had made recently. At this point I could have used this as another learning experience and re-configured the unit from scratch, just to test myself, but at the time I was trying to get some movie tickets purchased and had just gotten done with a very, very tiring week. Not surprisingly then, I took the easy way out and restored the configuration as well.
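For anyone who would rather not pull the flash card, my understanding is that an ASA with no bootable image can also be recovered from ROMMON by TFTP-booting an image over the network; the addresses and image filename below are hypothetical placeholders:

```text
! From the console, break into ROMMON, then point it at a TFTP server
rommon #0> ADDRESS=192.168.1.5
rommon #1> SERVER=192.168.1.10
rommon #2> GATEWAY=192.168.1.1
rommon #3> PORT=Ethernet0/0
rommon #4> IMAGE=asa842-k8.bin
rommon #5> tftpdnld
! Once the unit boots from the downloaded image,
! copy the OS back onto flash and restore the config
```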

After a quick reinstall, the unit came back up and everything was fine.

My main lesson learned here–and I’m wondering if this is something that has happened to other people, or happens frequently with the ASA units–is that the Cisco ASAs seem to occasionally wipe out the installed flash card (cards plural if you have 5510s or bigger). Either that, or the ridiculously expensive (like $300 or something) Cisco-branded 512 MB flash cards are flaky as hell. I don’t tend to go with that last option, however, simply because when flash cards go bad they’re not usually amenable to being re-formatted and working properly again: it’s usually game over, buy another one.

So, another random problem at least solved in the short term. I’ll keep everyone posted on whether or not this happens again. I haven’t seen anything indicating that this is a known issue, but admittedly I haven’t really been looking. Auto-magically wiping out the entire boot OS, configuration, licenses, etc. would seem to be a fairly non-useful type of bug to have–even by, say, Microsoft standards… let alone Cisco.

So, has any­one seen this before?


Copyright© 2023 · by Shay Bocks