Sometimes things go right for a change.
I finally got my dual-ISP, single-hub DMVPN with mGRE, IPsec and EIGRP working. I may have to do some tuning of the failover to smooth things out, but it’s working, and that’s what’s important. I can kill the primary ISP connection, and after a delay of about 20 seconds traffic resumes flowing to my spokes over the secondary ISP connection.
The tools that go into this:
- DMVPN
- mGRE tunnels
- IPsec
- IP SLAs
- EEM
- EIGRP
I’m hoping to write a how-to regarding this soon.
Well, it’s been a week for Cisco bugs. A couple of ASA 5510s that I manage succumbed to a bug that causes a gradual performance decrease until they are no longer usable or accessible remotely. Luckily it was a known bug that was fixed in a newer firmware version. A quick upgrade after-hours and things are up and running smoothly again.
One thing I found out: upgrading two ASAs that are set up for active/standby failover is SLICK. Cisco has a “zero-downtime” upgrade process whereby you upgrade the standby unit, fail over to it, upgrade the primary, fail back, etc… Everything’s upgraded and no one noticed because the failovers are so seamless.
On a related note, I got confirmation from Cisco regarding my mGRE bug. They want me to try another workaround for the initial bug, and they found another bug that I was hitting, having to do with clearing NAT translations, that is fixed in a later IOS version. So I’m supposed to upgrade the IOS to fix the second bug and retry the workaround for the first bug to see if the problem goes away. The only problem is, it’s for a customer, and I hate testing a fix on a production router. Unfortunately, I’ve never been able to reproduce the bug in the test lab – maybe because I can’t simulate the same kind of constant traffic the production system is seeing.
We’ll see, maybe it will all be fixed and everyone will be happy. I know I will.
Well, apparently there’s some issue with the Cisco 2801, 12.4(20)T IOS, and DMVPN with mGRE tunnels. It starts spewing errors, goes into a CPUHOG, and eventually crashes unpleasantly. Switch back to normal IPsec tunnels managed with crypto maps, and the problem goes away.
Cisco tried to tell me it was due to ping packets over 1500 bytes going across GRE tunnels over my cellular interfaces. Well, I don’t have any cellular interfaces, but I do have GRE tunnels across my Tunnels. So, I tried turning off virtual-reassembly like they said, but the router still crashes. I’m waiting to hear back from them with another workaround or something. Hopefully I’ll have time to do some stress-testing of my lab setup to see if I can get it to fail in the same way with heavy traffic loads.
Just when I thought I had everything all figured out, the world bites me in the butt. I had all my GRE tunnels, route-maps, IPsec and ISAKMP transform-sets, DMVPN hub and spoke, EEM applet... Put it all into production and the hub router starts crashing periodically. Opened another TAC case, got a suggestion, but that didn’t fix it. It apparently is unhappy with something about the configuration I gave it and is having a software crash. Sent TAC another 2 crashinfo files and am waiting to hear back from them now, but I had to revert the setup to the old way of doing things so it will quit crashing.
It’s almost like I wished for it and it was dropped into my lap. EEM – Embedded Event Manager from Cisco. It’s a tool that in its most basic form lets you trigger a series of commands based on a track condition.
So, create an SLA that pings the default gateway through the primary interface to the first ISP.
Create a track statement that tracks the reachability of the gateway based on the SLA.
Create an event manager applet that triggers on changes in the track from up to down and vice versa. When either of those two events happens, trigger commands to force a clear of the NAT translations.
It’s all so obvious and easy once you know it exists.
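For reference, the three steps above might look roughly like the sketch below. This is not my production config: the gateway address 192.0.2.1, the interface FastEthernet0/0, the SLA and track numbers, and the applet name are all placeholders, and the exact keywords vary between IOS releases (older trains use "ip sla monitor" and "track ... rtr" instead).

```
! Step 1: SLA probe that pings the primary ISP gateway out the primary interface
! (192.0.2.1 and FastEthernet0/0 are placeholders)
ip sla 1
 icmp-echo 192.0.2.1 source-interface FastEthernet0/0
 frequency 5
ip sla schedule 1 life forever start-time now
!
! Step 2: track object that follows the reachability result of SLA 1
track 10 ip sla 1 reachability
!
! Step 3: EEM applet that fires on any state change of track 10
! and clears the dynamic NAT translations
event manager applet CLEAR-NAT-ON-FAILOVER
 event track 10 state any
 action 1.0 cli command "enable"
 action 2.0 cli command "clear ip nat translation *"
```

Using "state any" keeps it to a single applet for both directions; two applets, one for "state down" and one for "state up", would work as well if the two transitions ever need different commands.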
So tomorrow I’m going to do final testing of the whole thing, and hopefully adjust a few things to make the transitions smoother. Today when I was testing, I had 3-second routing table transitions when the primary tunnel dropped and EIGRP was forced to re-route. That’s pretty good, but that’s also in a test lab with virtually unlimited bandwidth between devices. I’m probably also going to have to do something to keep EIGRP from flapping up and down too much on an intermittent failure of the primary. In reality, it will need to fail over to the secondary and then not fail back until the primary has been back up for a minute.
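One possible way to get that one-minute hold-down, assuming the failover hangs off a track object like the one described above, is a delay on the track itself. This is only a sketch I haven't tested here; the track and SLA numbers are placeholders.

```
! Report "down" immediately, but only report "up" again
! after the probe has succeeded continuously for 60 seconds
track 10 ip sla 1 reachability
 delay up 60
```

This only dampens things that key off the track state, such as the EEM applet. It does nothing by itself about EIGRP neighbors flapping over the tunnel, so that part may still need separate tuning.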
More on the results tomorrow….
Well, back to work today after a long-needed 4-day weekend. I managed to avoid working on the whole DMVPN thing the last few days, so I’m diving back in with new energy and an open mind. Time to lick this whole thing once and for all. Working on it here at home I was a [...]
Simple fix from Cisco TAC – turn on route-cache cef on the tunnel interface, set the MTU to 1400, and set the MSS to 1300. Other than the route-cache cef, I coulda sworn I’d done those things several different times and in different combinations. Hopefully the TAC rep can explain further. I’m up against another [...]
Here’s my latest struggle. Why can I ping and traceroute through an IPsec-encrypted GRE tunnel, but not browse? Every article I read that offers to fix this for me points to MTU and MSS settings, due to fragmentation of packets caused by the additional overhead of GRE and IPsec encapsulation, but I’ve beaten my head [...]
Well, I think I figured this out before TAC contacted me back. I had been running a continuous ping from the inside of the LAN to an outside public IP so that I could watch the failover when I cut the primary ISP dead. The only problem is, the continuous ping was keeping the NAT [...]
My latest frustration – getting a Cisco router to swing from one Internet connection to another one, and to get it to drop the NAT translations at the same time. Starting to get on my nerves that this won’t work, so time to get Cisco TAC involved. They’ll probably tell me in about 60 seconds [...]