Editor’s Note: If you haven’t already, check out the first installment in this (hopefully not ongoing) series at http://blog.packetqueue.net/asa-tu/
At approximately 1:58pm PST last Thursday the two edge ASA 5510 units at our corporate headquarters dropped off the network. At the time I was in a different office up in Quebec, Canada, so I delegated the problem to one of the other engineers, who worked with TAC to bring the units back online. That process took much longer than expected, and I won’t bore you with the details. What I will bore you with, however, are a few observations I have now that we have more time and experience working with Cisco’s ASA product line:
- The ASA has some sort of systemic, though exceedingly rare, problem on 8.3(x) and newer code.
- Said problem causes the units to reboot and take out the system flash (disk0:) but not user flash (disk1:).
- The flash appears to be erased, but it is in fact the MBR that is gone, not the data (we used a hardware forensic disk analysis unit to verify this).
- Cisco doesn’t have enough data points yet to even acknowledge this is an issue. I don’t believe they’re “hiding” a problem; I just don’t think enough people have experienced the particular set of circumstances that would cause this and subsequently reported back to Cisco.
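For anyone without a forensic unit who wants to check the "MBR gone, data intact" observation above, here is a minimal sketch in Python. It is not the method we used; it assumes you have pulled the flash card, taken a raw image of it on another machine, and that the card uses 512-byte sectors. The function names and the idea of treating all-0x00 or all-0xFF sectors as "blank" are my own illustrative choices. A conventional MBR ends with the bytes 0x55 0xAA at offsets 510 and 511, so a missing signature combined with plenty of non-blank sectors afterwards is consistent with a trashed MBR rather than an erased card. Note that the signature being present does not by itself prove the partition table is sane.

```python
import io

SECTOR_SIZE = 512            # assumption: card uses 512-byte sectors
MBR_SIGNATURE = b"\x55\xaa"  # boot signature at offsets 510-511 of a valid MBR


def has_mbr_signature(sector0: bytes) -> bool:
    """Return True if the first sector ends with the 0x55AA boot signature."""
    return len(sector0) >= SECTOR_SIZE and sector0[510:512] == MBR_SIGNATURE


def is_blank(sector: bytes) -> bool:
    """Treat a sector as blank if every byte is 0x00 or every byte is 0xFF."""
    return sector.count(0x00) == len(sector) or sector.count(0xFF) == len(sector)


def summarize_image(stream) -> dict:
    """Check sector 0 for an MBR signature, then count non-blank sectors after it."""
    sector0 = stream.read(SECTOR_SIZE)
    scanned = 0
    with_data = 0
    while True:
        sector = stream.read(SECTOR_SIZE)
        if not sector:
            break
        scanned += 1
        if not is_blank(sector):
            with_data += 1
    return {
        "mbr_signature_ok": has_mbr_signature(sector0),
        "sectors_scanned": scanned,
        "sectors_with_data": with_data,
    }


if __name__ == "__main__":
    # Synthetic stand-in for a card image: wiped sector 0, one sector of data,
    # two blank sectors. For a real image, open the file in binary mode instead.
    fake_image = bytes(SECTOR_SIZE) + b"ASA" * 170 + b"\x00\x00" + bytes(2 * SECTOR_SIZE)
    print(summarize_image(io.BytesIO(fake_image)))
```

To run it against a real image, replace the synthetic `io.BytesIO` object with `open(path, "rb")`, where `path` is wherever you saved the image. Work on a copy of the image, never on the card itself.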
My own suspicions about the root cause are below, though I’d welcome any additional thoughts from anyone with experiences in this area. I should also point out that I have heard from at least two other people that they have experienced this exact problem.
- The behavior and crash lead me to believe that the ASA experiences, at the point of failure, the equivalent of a Windows “BSOD”. This would point to either the memory or the motherboard itself, as these are the primary hardware-based causes of this type of crash in any system. Most other crashes can be recovered from and produce data.
- The ASA accesses the flash on initial load, but then runs from memory. The flash cards in these units had trashed MBRs, which leads me to believe that the ASA was touching the MBR at the time of the crash. That is inconsistent with what I know about how the ASA is supposed to operate. It’s possible it was just accessing the flash to write a crash dump and crashed partway through. That makes some sense to me.
- All failures I have experienced and heard of from others have at least a few things in common. They are all on 8.3(x) code. They were all upgraded in the field by the user to support 8.3(x): this code required a memory and flash upgrade, so you had to buy the upgrades from Cisco and install them yourself. The units were also all manufactured immediately following the Cisco manufacturing slowdown in 2008/2009, when lead times were running to several months. This makes me a bit suspicious that quality control on either the memory or the units themselves could be to blame. I’ve tried to verify this with revision numbers, etc., but I haven’t been able to gather enough data from “out there” to settle on it as a cause.
I hope this helps someone out there, and I truly am interested in getting more information from anyone who has it. Cisco is taking our units back, but pulling them aside before refurbishment so that their engineers can dissect them. If I find anything out from that, I’ll post the findings here.
The configuration and build-out of the ASA 5510 units is as follows:
- 1 GB of memory, 512MB of system flash, 256MB of user flash. Licenses for the IPS Module, Security-Plus, Botnet filter, AnyConnect Essentials, Mobile, etc. Actually, just about every license is on board; these units are at this point maxed on everything. Utilization is still at a reasonable level.
- Configuration includes multiple IPsec site-to-site VPNs, SSL VPN for Mac, Linux, Windows, iPad, and iPhone clients, sub-interfaces, stateful failover, both IPv4 and IPv6, OSPF with static redistribution, and full IPS functionality.