EEM Tricks: Automatic Failover (Internet)

When our company decided to deploy local Internet breakouts in every office (for cloud readiness), there was a design concern around high availability. Even though our firewalls are deployed as HA pairs, we decided not to overdesign the service provider (SP) edge sublayer. In particular, we decided not to deploy more than one external switch. Even if we had, 99% of branches would still have only one circuit delivered over a single physical medium. If the switch and/or the ISP fails, manual intervention (recabling or routing adjustments) would be required. Since we already had regional Internet breakouts, including them in the design as the failover component was an obvious choice. The question was how to make the user experience as seamless as possible when the local Internet breakout fails. EEM was there to help!

The following diagram shows what we wanted to achieve from a high-level perspective. The plan was to track the state of the local Internet breakout and, in case of failure, force traffic via the regional Internet breakout using dynamic routing (the regional Internet breakouts inject a default route into BGP).

IP SLA and tracking objects were the obvious choice. However, I then realized that I could not meet my requirements using IOS tracking objects alone because of their limited capabilities. In particular, my requirement said:

In case of failure, local Internet service MUST be stable for 10 minutes before failing back.

Tracking objects support delayed triggering via the delay up/down command. However, both timers only accept values between 0 and 180 seconds (up to 3 minutes). This is where I realized that I had to build something custom and find a different way to fulfill the requirement.
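For illustration, this is roughly what the tracking-object approach would have looked like. It is only a sketch: the object number 10 is an arbitrary choice that does not appear elsewhere in this post, and it assumes the track ... ip sla syntax of recent IOS releases.

```
! Hypothetical tracking object 10 bound to IP SLA operation 99.
! 180 seconds is the largest value the delay command accepts,
! so a 600-second hold-down cannot be expressed this way.
track 10 ip sla 99 reachability
 delay down 180 up 180
```

With this configuration the object would report up again after at most 3 minutes of successful probes, well short of the 10 minutes the requirement asks for.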

Instead, I decided to use IP SLA with a reaction configuration as the triggering mechanism for an EEM applet, as shown below:

ip route <sla-destination> 255.255.255.255 <interface> <next-hop> permanent
!
ip sla 99
 icmp-echo <sla-destination> source-interface <sla-source-interface>
 frequency 30
!
ip sla schedule 99 life forever start-time now
ip sla enable reaction-alerts
!
! ------ OPTION #1: LINK HARD DOWN DETECTION ------
ip sla reaction-configuration 99 react timeout threshold-type consecutive 10 action-type triggerOnly
! ------ OPTION #2: LINK FLAPPING DETECTION  ------
ip sla reaction-configuration 99 react timeout threshold-type XofY 6 10 action-type triggerOnly

Note! You will have to replace the placeholders in angle brackets (such as <sla-destination>) with values applicable to your environment.

Key things to understand here.

  • There’s a permanent static route configured (in our case to 8.8.8.8/32) pointing to the local FW/local Internet breakout. The permanent keyword makes this route ALWAYS stay in the RIB, no matter what. The destination can be anything, as long as it is located on the public Internet and is not an important service from an enterprise perspective. You can use the ISP’s gateway, but then the script will not be able to detect failures inside the ISP network (the gateway can be up and reachable while the ISP backbone is down).
  • IP SLA object 99 uses ICMP Echo to monitor the reachability of the external service via the local Internet breakout (again, in our case it is 8.8.8.8). We ping this destination every 30 seconds.
  • The IP SLA schedule is configured to run the IP SLA object indefinitely.
  • To make the results of the IP SLA object usable, the IP SLA reaction-alerts feature is enabled. Once enabled, this feature sends IP SLA notifications to all registered applications (in our case, EEM).
  • As you can see, I have provided two different ways to configure the IP SLA reaction configuration.
    • We use Option #1 in our branches with unstable Internet service (such as the Middle East). It triggers an IP SLA event/notification if 10 consecutive pings fail. With IP SLA configured to send an ICMP Echo request every 30 seconds, failover happens after the circuit has been down for about 300 seconds, or 5 minutes.
    • Option #2 is suitable for branches with stable Internet service, as it can also detect a flapping condition (in addition to hard down). XofY 6 10 means that if any 6 of the last 10 ICMP Echo probes have not been acknowledged, the failover condition is triggered.

The triggerOnly keyword in the IP SLA reaction-configuration means that no SNMP trap is generated; only an internal notification is sent to all registered applications (EEM).
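To check that the probe and its reaction are behaving as described, the following exec commands can be used. This is only a sketch: it assumes operation 99 from the configuration above, and the exact output varies by IOS release.

```
! Probe configuration and latest results (return code, successes/failures)
show ip sla configuration 99
show ip sla statistics 99
! Reaction thresholds configured for operation 99
show ip sla reaction-configuration 99
```

A steadily growing failure counter in the statistics output, while the circuit is known to be healthy, usually points at the permanent /32 route or the source interface rather than at the reaction configuration.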

All good so far: we have an IP SLA object that keeps track of our Internet circuit and triggers an event when a failure condition (option 1 or 2) is detected. What’s next? We have to capture this event with an EEM applet and make the required changes. Now it’s time to show the complete algorithm (you may want to open it in a separate window).

There are three main components. The first one is the IP SLA TRIGGER function, which we have just discussed. I don’t think you will have any problems understanding its algorithm. Only one thing needs a few more words: when an IP SLA reaction triggers a notification, it can have one of two states/values, Occurred or Cleared.

An Occurred notification is sent when the IP SLA reaction configuration detects a fault condition, i.e. 10 consecutive probes failed or 6 out of 10 probes failed (depending on configuration). A Cleared notification, however, is sent immediately after the very first probe succeeds. This means something else has to be in place to make sure the default static route is not injected back as soon as the first probe is acknowledged. As you can see in the diagram, I have defined two different applets:

  • EEM IP SLA Reaction (Parent) Applet is triggered by an IP SLA reaction notification (a fault, or one successful probe after a fault has been triggered). This applet does the following:
    • Checks if a failure occurred (IP SLA notification equals Occurred, see above) and if so:
      • It checks if RECOVERY mode is active and if so, it immediately cancels it (the link is still flapping!)
      • Otherwise, it removes the static route from the RIB and lets the dynamic protocols do the rest
    • If the IP SLA notification equals Cleared (one probe was successful), it creates the EEM Recovery (Child) Applet
  • EEM Recovery (Child) Applet starts unconditionally after a 600-second delay and performs two tasks:
    • Restores the static default route in the RIB pointing to the local Internet breakout
    • Removes itself from the running-config

Here’s the complete code of both EEM applets:

! TRACKING OBJECT
! Used as global BOOLEAN variable to track RECOVERY state (accessible from Tcl)
track 99 stub-object
!
! Global Tcl variable (quotes)
event manager environment qt "
!
! IP SLA Tracking (Parent) Applet
event manager applet track-route authorization bypass
 description Track IP SLA object and Take action to remove or re-install static route
 event tag 1.0 ipsla operation-id 99 reaction-type timeout

 action 001 cli command "enable"
 action 002 cli command "config t"

 action 005 info type routername

 action 009 comment IF IP SLA Reaction triggered Unreachable state
 action 010 if $_ipsla_condition eq "Occurred"
  action 011 comment READ Recovery Mode state (via stub tracking object)
  action 012 track read 99

  action 013 comment IF Recovery Mode state is not Active, then remove static route from RIB
  action 014 if $_track_state eq down
   action 015 cli command "no ip route 0.0.0.0 0.0.0.0 <exit-interface> <next-hop> tag 100"
   action 016 mail server <smtp-server> from alert@$_info_routername to <recipient> subject " Internet DOWN "
  action 017 comment ELSE IF Recovery Mode state is Active, Kill Child Applet to cancel Recovery
  action 018 else
   action 019 cli command "no event manager applet track-route-recovery"
   action 020 track set 99 state down
  action 021 end

 action 022 comment ELSE IF IP SLA Reaction triggered Reachable state
 action 023 else
  action 024 comment SET Recovery Mode to Active
  action 025 track set 99 state up
  action 026 comment CREATE Child Applet to wait X seconds before re-installing static route
  action 027 cli command "event manager applet track-route-recovery authorization bypass"
  action 028 cli command " description Default Route Recovery Applet"
  action 029 cli command " event timer countdown time 600"
  action 030 cli command " action 1.0 track set 99 state down"
  action 031 cli command " action 1.1 info type routername"
  action 032 cli command " action 2.1 cli command $qt enable $qt"
  action 033 cli command " action 2.2 cli command $qt config t $qt"
  action 034 cli command " action 2.3 cli command $qt ip route 0.0.0.0 0.0.0.0 <exit-interface> <next-hop> tag 100 $qt"
  action 035 cli command " action 2.4 mail server <smtp-server> from alert@$_info_routername to <recipient> subject $qt Internet UP $qt"
  action 036 cli command " action 2.5 cli command $qt no event manager applet track-route-recovery $qt"
 action 040 end

Ok, so now it is very important to understand what I am doing here…

  • The IP SLA object triggers a notification, which can have a value of
    • Occurred (the IP SLA reaction occurred when 10 consecutive, or 6 out of 10, probes failed), or
    • Cleared (when one probe was acknowledged after the Occurred condition had already taken place)
  • The parent applet is executed every time IP SLA generates a notification (see above)
    • In case of the Cleared condition (this is when we assume the link has recovered, but we are not 100% sure because only ONE probe succeeded), the parent applet initiates the RECOVERY procedure: it sets the RECOVERY state to Active and adds the child applet to the running configuration. The child applet waits 600 seconds and then restores the local Internet breakout by reconfiguring the static default route. You may ask “how do you signal the RECOVERY state?”. It was tricky. An EEM applet has no access to variables outside of its own scope (such as global vars), so there’s no simple way to preserve a variable’s state/value between different runs of the same applet. I found a neat solution (well, I think it’s neat and it works!). An EEM applet can READ and SET the state of a STUB TRACKING object. This is a TRACKING object that doesn’t track anything; its state has to be changed manually. So, I set TRACK object 99 to up (RECOVERY is Active) or down (RECOVERY is Inactive). Look at it as a BOOLEAN variable.
    • In case of the Occurred condition (we know it’s a fault), the parent applet checks if RECOVERY mode is Active. If so, we assume that the link is still unstable (we are in RECOVERY mode, but the Occurred/fault condition was triggered again). Therefore, the parent simply kills the child applet by removing it from the running configuration. This cancels the recovery: no routes are restored, no hassle.
    • In case of the Occurred condition with no RECOVERY state, we assume the operational link has failed and failover MUST be initiated immediately. The parent applet removes the static route from the routing table.
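While testing the failover, the state described above can be inspected from the CLI. A minimal sketch, assuming the object number and applet names used in this post:

```
! RECOVERY flag: Up means recovery is in progress, Down means it is not
show track 99
! Is the static default route installed, or is the BGP default in use?
show ip route 0.0.0.0
! Parent applet, plus the child applet while the 600-second countdown runs
show event manager policy registered
```

If track 99 is Up but no track-route-recovery applet is registered, the two have drifted out of sync and the stub object may need to be reset by hand.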

As you can see, my applet also sends an email notification when the link goes up or down (that was my colleague’s suggestion, thanks Mike!). I haven’t shown this in the diagram, but if you read the code carefully you will see it. A couple more things require special attention:

  • $_info_routername is a special variable available to EEM that contains the configured hostname. I use it in the email sender address to easily identify the source of the problem. To get access to this variable, a special command has to be executed first: info type routername (action 005)
  • $_track_state contains the state of the tracked object. To make sure this variable refers to the right tracking object, execute track read <obj-id> first (action 012)
  • $_ipsla_condition contains the value of the IP SLA reaction notification (Occurred or Cleared)
  • $qt is an environment variable. Basically, it is a variable that you create outside of the EEM applet and use inside it (but the applet cannot change it once it is defined). I use this variable to store a double quote. Pay attention to how the child applet is created. I cannot use nested quotes, as that would break the syntax, and there is no escape character such as \ like the one you can use in C, Python or PHP. The solution is to define the quote as an environment variable and use it wherever nested quotes are required.
  • Both applets are configured with ‘authorization bypass’. If you have AAA configured, make sure these keywords are specified, otherwise EEM script execution will fail.
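To make the $qt substitution concrete, this is roughly what the child applet should look like in the running configuration once the parent has created it. It is a sketch derived by hand from the parent’s actions 027 to 036, not captured output. It assumes the parent expands $_info_routername when it issues action 035, so <hostname> stands for the device’s configured hostname, and the exact spacing inside the quotes may differ.

```
! Expected result of parent actions 027-036 (hand-derived, not captured output)
event manager applet track-route-recovery authorization bypass
 description Default Route Recovery Applet
 event timer countdown time 600
 action 1.0 track set 99 state down
 action 1.1 info type routername
 action 2.1 cli command "enable"
 action 2.2 cli command "config t"
 action 2.3 cli command "ip route 0.0.0.0 0.0.0.0 <exit-interface> <next-hop> tag 100"
 action 2.4 mail server <smtp-server> from alert@<hostname> to <recipient> subject "Internet UP"
 action 2.5 cli command "no event manager applet track-route-recovery"
```

Comparing this against show running-config after a Cleared notification is a quick way to confirm that every $qt was replaced by a real double quote.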

Please note that even if you save the config during failover (i.e. when the default route has been removed from the running config), the setup is intelligent enough to recover after a reboot, although it may take an additional 10 minutes (flapping protection). Of course, to make it work, the complete SLA and EEM configuration must be applied to the network device (in our case it was the branch’s core switch).

This script is in production and has already been tested in a number of offices. We do receive alerts when the Internet fails and recovers. You may want to fine-tune it to your specific requirements (recovery after more or less than 600 seconds, a different failover condition, etc.). You can use it freely in your environment, but please preserve authorship if you’re going to share it anywhere on the web. I won’t mind having a few references back 😀

I hope this was helpful and useful.
