Sunday, March 23, 2014

When upgrades go bad: Installing JunOS from USB on an SRX router

So, I screwed up pretty badly. I decided to upgrade the JunOS release on this Juniper SRX210 router to the one recommended by Juniper at the time I type this, 11.4R10.3. When it booted up after the install, it crashed during the boot process. Well, I could have spent the time kicking myself, but I am doing this upgrade off-hours and I did account for things going badly in my downtime estimate. Also, this router is part of a redundant router setup using the Virtual Router Redundancy Protocol (VRRP), so its being down will not affect production. In other words, this is more of an annoyance than a real issue. Since I have to deal with this anyway, how about we learn how to restore the OS on this Juniper router?

I tried a few ways and thought that the easiest one was to use a USB drive. Of course, that will not work well if you are not physically close to said router (other things will also not work well in those circumstances, but that is another topic). Since I am, I am doing the USB install.

Procedure

  1. Get a USB drive. I know, this is a pretty obvious step, but it is step 1. Ideally use a 1GB/2GB USB drive, formatted as FAT16/FAT32. Honestly, I do not know how critical that is, but my experience with Cisco, which seems not to like the higher-capacity ones, made me leery. On the plus side, you should be able to find those rather easily as people replace their old ones with newer, larger ones. If not, there are always the usual sources such as eBay or Amazon.
  2. Download the OS image you are going to use, say junos-srxsme-11.4R10.3-domestic.tgz, and copy it onto the USB drive. If you are smarter than me, you would have already gone to the Juniper downloads site and gotten all the OS images you need, placing them on your file server. I wasn't, so I had to go to the SRX210 download page and fetch it.
  3. Have your trusty serial cable at hand and connect it to the router's console port. The default setup is the time-honored 9600 8N1. If you changed it, make sure you wrote that down somewhere. I am lazy and I kinda like that setting.
  4. Connect the USB drive to the router.
  5. Reboot the router after you attach the USB drive to it. It needs to know the drive exists as it boots up. Otherwise, it will bark like this when you try to install:

    loader> install file:///junos-srxsme-11.4R10.3-domestic.tgz
    cannot open package (error 22)
    loader>

  6. Now, if you boot with the USB drive already connected to the router, it will first say something like this:

    Running U-Boot CRC Test... OK.
    Flash:  4 MB
    USB:   scanning bus for devices... 4 USB Device(s) found
           scanning bus for storage devices... 2 Storage Device(s) found
    Clearing DRAM........ done
    BIST check passed.

    Some of you will have noticed the "2 Storage Device(s) found" message. It is talking about the internal one (probably where the OS should be) and the external drive.

  7. Now, when you see

    POST Passed
    Press SPACE to abort autoboot in 1 seconds

    Please keep your fingers in your pockets. If you press space here, you will end up at the => prompt (U-Boot). If you wait, you will then see

    Protected 1 sectors
    Loading /boot/defaults/loader.conf
    /kernel data=0xb0f9c0+0x134788 DA(some hot action happening here)

    Have your space-bar finger on standby, for the next message will be

    Hit [Enter] to boot immediately, or space bar for command prompt.
  8. Now press the space bar and you get the loader> prompt. Feed it the install command and it will start doing the install thingie:

    loader> install file:///junos-srxsme-11.4R10.3-domestic.tgz
    /kernel data=0xae82f0+0x12d2b8 syms=[0x4+0x88ce0+0x4+0xc6af6]
    Kernel entry at 0x801000d8 ...
    init regular console
    GDB: debug ports: uart
    GDB: current port: uart
    KDB: debugger backends: ddb gdb
    KDB: current backend: ddb
    Copyright (c) 1996-2013, Juniper Networks, Inc.
    All rights reserved.
    Copyright (c) 1992-2006 The FreeBSD Project.
    Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
            The Regents of the University of California. All rights reserved.
    JUNOS 11.4R10.3 #0: 2013-11-15 06:56:20 UTC
    [...]
  9. After a while (I got bored and went to make me some tea), you will see it recreate the ssh key pairs and then finally be ready for business (apologies for the bad cut-n-pasting but my terminal console was being cute):

    |
    |                 |
    |  .o  ..         |
    |.+o .o.o.
    |X . .. .. E      |
    |oo ..            |
    |  .+             |
    |.-+
    root@uranus% omplete
    Setting initial options: .
    Starting optface configuration:
    additional daemons: eventd.
    Additional rout;/boot/modules -> /bo;
    kld netpfe drv: ifpfed_dialer default_adtwork setup:.
    Starting final network daemons:.
    setting ldconfig.
    Initial rc.mips initialization:.
    Local package initializationup access
    .
    kern.securelevel: -1 -> 1
    Creating JAIL MFS partitirade.uboot="0xBFC00000"
    boot.upgrade.loader="0xBFE00000"
    Boot mILE SYSTEM CLEAN; SKIPPING CHECKS
    clean, 78249 free (17 frags, ar 20 16:46:25 CDT 2014
    
    uranus (ttyu0)

    Note that it remembered the hostname for the router. I still went through the configs before letting it join the router cluster. But that is pretty much it! Router is back in business.
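
If you want to be a bit more paranoid about step 2, here is a minimal sketch of copying the image onto an already-mounted USB drive and checking that it survived the trip. It assumes a Linux workstation with md5sum, that you grabbed an MD5 checksum for the image when you downloaded it, and that the drive is mounted somewhere you can write to. The copy_image function and its arguments are names I made up for illustration.

```shell
#!/bin/sh
# copy_image IMAGE MOUNTPOINT EXPECTED_MD5
# Hypothetical helper: verifies the image against the checksum you were given,
# copies it to the mounted USB drive, then verifies the copy as well.
copy_image() {
    image=$1
    mountpoint=$2
    expected=$3

    # Check the download before bothering the USB drive with it.
    actual=$(md5sum "$image" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch for $image" >&2
        return 1
    fi

    # Copy and flush buffers so the data is really on the drive.
    cp "$image" "$mountpoint/" || return 1
    sync

    # Re-read the copy; succeeds only if it matches the expected checksum.
    copied=$(md5sum "$mountpoint/$(basename "$image")" | cut -d' ' -f1)
    [ "$copied" = "$expected" ]
}
```

Something like copy_image junos-srxsme-11.4R10.3-domestic.tgz /mnt/usb followed by the checksum would do it, where /mnt/usb is just an example mount point. Unmount the drive before pulling it out.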

Closing Thoughts

  1. The universe is Murphian; things will go wrong. Try not to stress about that.
  2. When you schedule downtime for upgrades, account for things going badly in your time estimates.
  3. The hardest thing to do is figuring out what can go wrong. But you could ask yourself "If this upgrade halts the server or just this service, what would be my backup plan?" and then see if you can answer that question.
  4. Next time I need to upgrade the OS on this or another router, I will have the firmware/OS on standby on a USB drive. I do not know about you, but I have found that when I am prepared, everything works out perfectly.
  5. If you can afford it, redundancy is a wonderful thing.
  6. Always save your configs somewhere, well, safe. Having to recreate them from scratch is a bit of a drag.
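
For thought 6, here is a minimal sketch of pulling the active config off the router with scp. It assumes SSH is enabled on the router, that you can log in as root, and that the active config sits in /config/juniper.conf.gz. The backup_path and backup_config functions and the /srv/backups directory are hypothetical.

```shell
#!/bin/sh
# backup_path HOST DIR DATE
# Builds the local file name for a dated config backup.
backup_path() {
    printf '%s/%s-%s.conf.gz\n' "$2" "$1" "$3"
}

# backup_config HOST DIR
# Copies the router's config into DIR, stamped with today's date.
# Assumption: root SSH access and the config location shown below.
backup_config() {
    host=$1
    dir=$2
    dest=$(backup_path "$host" "$dir" "$(date +%Y%m%d)")
    mkdir -p "$dir" && scp "root@$host:/config/juniper.conf.gz" "$dest"
}
```

Running backup_config uranus /srv/backups before the upgrade window would have given me something to diff against afterwards.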

Monday, March 17, 2014

generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information ()

This will be a quick post about something that was biting my ass these last few days and what the real cause was. After you read it, you are welcome to laugh at my expense. Go ahead! I deserve it!

I was working on a Kerberos/LDAP (Linux) server and needed to debug the connection to a given client. The LDAP connection uses TLS, GnuTLS specifically since the two machines were Ubuntu servers, which means we also had to worry about certs. And since Kerberos is in the picture, we need to configure for that. To help in solving other issues, which I should comment about later (at least those were clever problems, not like this one), I was running slapd in debug mode,

/usr/sbin/slapd -d 256 -h "ldap:/// ldapi:/// ldaps:///" -g openldap -u openldap -F /etc/ldap/slapd.d

and that did help solve the other issue I had. Some of you will notice I am also running ldaps (port 636), which I really do not need since TLS should take care of the encryption thingie. But I digress, so let's go back on topic. What I then noticed was some very odd problems with LDAP. For instance, if I created a Kerberos ticket and then tried to run ldapsearch, I would get the following error:

root@services:~# export KRB5CCNAME=/tmp/host.tkt
root@services:~# ldapsearch -vvv
ldap_initialize(  )
SASL/GSSAPI authentication started
ldap_sasl_interactive_bind_s: Other (e.g., implementation specific) error (80)
        additional info: SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information ()
root@services:~#

Here is what the server sees:

53261bde conn=1043 fd=19 ACCEPT from IP=192.168.1.181:44610 (IP=0.0.0.0:389)
53261bde conn=1043 op=0 EXT oid=1.3.6.1.4.1.1466.20037
53261bde conn=1043 op=0 STARTTLS
53261bde conn=1043 op=0 RESULT oid= err=0 text=
53261bde conn=1043 fd=19 TLS established tls_ssf=128 ssf=128
53261bde conn=1043 op=1 BIND dn="" method=163
53261bde SASL [conn=1043] Failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information ()
53261bde conn=1043 op=1 RESULT tag=97 err=80 text=SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information ()
53261bde conn=1043 op=2 UNBIND
53261bde conn=1043 fd=19 closed

Since I do not have many clever things to talk about and fill the space until the solution, how about if we talk about what some of those lines mean?

  • IP=192.168.1.181:44610 (IP=0.0.0.0:389): Client 192.168.1.181 is connecting from its port 44610 to my port 389.
  • oid=1.3.6.1.4.1.1466.20037: Start TLS extended request (per RFC 2830).
  • BIND dn="": anonymous if we are doing a SIMPLE bind. If, however, we are doing a SASL bind, it is not used.
  • tag=97: result from a client bind operation.
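
The numbers in tag=97 and method=163 are just the decimal form of the protocol's BER tags, which is easier to see in hex. A minimal sketch of the conversion follows. The ldap_code function is a made-up helper, and its little table only covers the bind-related values I am sure about.

```shell
#!/bin/sh
# ldap_code DECIMAL
# Prints the hex form of a code from the slapd log plus what it stands for.
ldap_code() {
    hex=$(printf '0x%02x' "$1")
    case "$1" in
        96)  echo "$hex BindRequest" ;;
        97)  echo "$hex BindResponse" ;;
        128) echo "$hex simple bind" ;;
        163) echo "$hex SASL bind" ;;
        *)   echo "$hex (not in this table)" ;;
    esac
}

ldap_code 97    # the tag= value from the RESULT line
ldap_code 163   # the method= value from the BIND line
```

So method=163 on the BIND line says the client asked for a SASL bind, which fits the GSSAPI error that follows.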

As you noticed, at least from reading the title of this post, the error line is this "generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information ()" thingie. Here is where it annoyed me to no end: what minor code? It is supposed to put some kind of message between the parentheses, like "No principal in keytab matches desired name" or "Ticket expired". Then I would be able to search online for something. Instead, zilch. I could not find a single entry where the minor code parentheses thingie was empty. Not very helpful today, are we?

Solution

So, what was wrong? Me. User error. Do you remember how I was running slapd? Do you also remember the part about Kerberos? Well, in /etc/default/slapd (that'll be /etc/sysconfig/ldap for you Red Hat/CentOS/Fedora folks) I have defined

export KRB5_KTNAME=/etc/ldap/ldap.keytab

which means slapd then knows where the keytab containing the ldap service principal hides. Can you see where this is going? No? Let's look again at how I am running slapd, shall we?

/usr/sbin/slapd -d 256 -h "ldap:/// ldapi:/// ldaps:///" -g openldap -u openldap -F /etc/ldap/slapd.d

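As you can see, I did not pass KRB5_KTNAME to slapd. Starting slapd by hand skips the init script, and with it /etc/default/slapd, so the export never happened and slapd had no idea where its keytab was. As soon as I fed that variable to slapd, all was once again well in the Land of Ooo.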
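
To avoid tripping over this again, here is a minimal sketch of a wrapper that sources a defaults file before starting a command by hand, so the manual debug run sees the same exported variables as the service would. The run_with_defaults function is a name I made up, and it assumes the defaults file is plain shell that is safe to source.

```shell
#!/bin/sh
# run_with_defaults DEFAULTS_FILE COMMAND [ARGS...]
# Sources DEFAULTS_FILE in a subshell, then replaces that subshell with COMMAND.
# Variables the file exports (KRB5_KTNAME, for instance) reach COMMAND, and
# nothing leaks back into the calling shell.
run_with_defaults() {
    defaults=$1
    shift
    ( . "$defaults" && exec "$@" )
}
```

With that, the debug run becomes run_with_defaults /etc/default/slapd /usr/sbin/slapd -d 256 -h "ldap:/// ldapi:/// ldaps:///" -g openldap -u openldap -F /etc/ldap/slapd.d. For a one-off, prefixing the original command with KRB5_KTNAME=/etc/ldap/ldap.keytab does the same job.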