Wednesday, January 16, 2019

Failover and swapping controllers in an Equallogic SAN, command line involved

Equallogic, as some of you know, is an old line of storage systems sold by Dell that can do at best RAID6 on a good day. It has been out of support for a few years now; Dell bought EMC and would rather we used that instead. We do have those too, but this article is about the Equallogic we still use.

On a Tuesday evening I got an email from one of them telling me it was unhappy. As is my recurring motif, I do not like GUIs to manage devices. This one did nothing to change my mind: it just refused to start. So it is time for plan B. Most of the storage appliances and hypervisors out there are built on either Linux or FreeBSD. The last time I had to probulate this one, I found it runs FreeBSD, and that usually means we can ssh into it. So we do; I do not think I need to show how to do that: if you have sshed into one machine, you have sshed into them all.

Now that we are inside, we need to find out what's up (I am cheating a bit because the email did tell me that the device PS4100E-01 was unhappy, which is why I am probulating it specifically):

STORGroup01> member select PS4100E-01 show
_____________________________ Member Information ______________________________
Name: PS4100E-01                       Status: online
TotalSpace: 15.98TB                    UsedSpace: 3.86TB
SnapSpace: 192.92GB                    Description:
Def-Gateway: 192.168.1.1               Serial-Number:
Disks: 12                                CN-TWO_AND_A_HALF
Spares: 1                              Controllers: 2
CacheMode: write-thru                  Connections: 4
RaidStatus: ok                         RaidPercentage: 0.000%
LostBlocks: false                      HealthStatus: critical
LocateMember: disable                  Controller-Safe: disabled
Version: V9.1.4 (R443182)              Delay-Data-Move: disable
ChassisType: DELLSBB2u12 3.5           Accelerated RAID Capable: no
Pool: default                          Raid-policy: raid6
Service Tag: H9CHSW1                   Product Family: PS4100
All-Disks-SED: no                      SectorSize: 512
Language-Kit-Version: de, es, fr, ja,  ExpandedSnapDataSize: N/A
  ko, zh                               CompressedSnapDataSize: N/A
CompressionSavings: N/A                Data-Reduction: no-capable-hardware
Raid-Rebuild-Delay-State: disabled     Raid-Expansion-Status: enabled
_______________________________________________________________________________

____________________________ Health Status Details ____________________________

Critical conditions::
Critical hardware component failure.

Warning conditions::
None
_______________________________________________________________________________


____________________________ Operations InProgress ____________________________


ID StartTime            Progress Operation Details                             

-- -------------------- -------- -----------------------------------------------
STORGroup01>
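
If you would rather have a script pull the interesting fields out of that wall of text, something like the following Python sketch could work. It assumes the output has already been captured into a string (over ssh, say) and that each value is a single token right after its label; member_field is a made-up helper name, not something the array provides.

```python
import re

def member_field(output, name):
    """Return the first token after 'name:' in the listing, or None.

    Hypothetical helper: it only understands simple 'Label: value' pairs
    and returns just the first whitespace-delimited token of the value.
    """
    # The lookbehind keeps "Status" from matching inside "RaidStatus"
    # or "Raid-Expansion-Status".
    pattern = r"(?<![\w-])" + re.escape(name) + r":[ \t]*(\S+)"
    match = re.search(pattern, output)
    return match.group(1) if match else None

# A few lines copied from the listing above.
SAMPLE = """\
Name: PS4100E-01                       Status: online
RaidStatus: ok                         RaidPercentage: 0.000%
LostBlocks: false                      HealthStatus: critical
"""

if __name__ == "__main__":
    for field in ("Status", "RaidStatus", "HealthStatus"):
        print(field, "=", member_field(SAMPLE, field))
```

Multi-word values such as the Version line would come back truncated to their first token, which is fine for status-style fields but not for everything.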

Yep, the Health Status Details section tells us it is unhappy; we knew that already. But, what does the log say?

STORGroup01> show recentevents
[...]
6492:881:PS4100E-01:SP: 8-Jan-2019 19:34:51.700834:cache_driver.cc:1056:WARNING:
28.3.17:Active control module cache is now in write-through mode. Array performa
nce is degraded.

6491:880:PS4100E-01:SP: 8-Jan-2019 19:34:51.700833:emm.c:355:ERROR:28.4.85:Criti
cal hardware component failure, as shown next.
        C2F power module is not operating.

6490:879:PS4100E-01:SP: 8-Jan-2019 19:34:51.700832:emm.c:2363:ERROR:28.4.47:Crit
ical health conditions exist.
 Correct immediately before they affect array operation.
        Critical hardware component failure.
        There are 1 outstanding health conditions. Correct these conditions before they
 affect array operation.
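
For those who prefer filtering over squinting, here is a rough Python sketch that splits events like the ones above into fields. The header layout is inferred purely from these three events, the field names (seq, member, severity, code and so on) are my own labels, and it assumes events are separated by blank lines and that every line break inside an event can be glued back together.

```python
import re

# Layout guessed from the sample events; not an official format description.
EVENT_RE = re.compile(
    r"(?P<seq>\d+):(?P<sub>\d+):(?P<member>[^:]+):(?P<proc>[^:]+):\s*"
    r"(?P<date>\d{1,2}-[A-Za-z]{3}-\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+):"
    r"(?P<source>[^:]+):(?P<line>\d+):(?P<severity>[A-Z]+):"
    r"(?P<code>\d+(?:\.\d+)+):(?P<text>.*)"
)

def parse_events(raw):
    """Return a list of dicts, one per event that matches the guessed layout."""
    events = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        # The terminal wraps long lines mid-word, so join without a separator.
        joined = "".join(block.splitlines())
        match = EVENT_RE.match(joined)
        if match:
            events.append(match.groupdict())
    return events

# Two events copied from the log above, wrapping included.
SAMPLE = """\
6492:881:PS4100E-01:SP: 8-Jan-2019 19:34:51.700834:cache_driver.cc:1056:WARNING:
28.3.17:Active control module cache is now in write-through mode. Array performa
nce is degraded.

6491:880:PS4100E-01:SP: 8-Jan-2019 19:34:51.700833:emm.c:355:ERROR:28.4.85:Criti
cal hardware component failure, as shown next.
        C2F power module is not operating.
"""

if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        if event["severity"] == "ERROR":
            print(event["member"], event["code"], event["text"])
```

Gluing the lines back together also merges real line breaks, so the message text keeps its indentation but loses its newlines; that seemed an acceptable trade for a quick filter.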

OK, controller thingie is not happy. But, this device has 2 of them. Which one is it?

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 65               ChipsetTemperature: 44
LastBootTime: 2018-04-23:15:06:55      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011                             
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2018-04-23:15:13:17      SerialNumber:
Manufactured: 031S                       CN-I_AM_FEELING_GREAT
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011                             
_______________________________________________________________________________

______________________________ Cache Information ______________________________
CacheMode: write-thru                  Controller-Safe: disabled
Low-Battery-Safe: enabled                                                      
_______________________________________________________________________________
STORGroup01>

Fun fact: I haven't the foggiest idea of which slot is 1 and which one is 0. We will have to find that out as we go along.

SPOILER ALERT: Making ASSumptions here is a bad idea.

The Controller

We sent the above info -- model, firmware -- to the vendor we have a support contract with, who sent us a controller card. Here it is in its purple gloriousness.

The configuration is stored on an SD card in the socket my finger is pointing at. When swapping the controller, the laziest thing to do is to put the old SD card in the new controller. This way we do not have to configure it.


We know one of the controllers is unhappy, but which one? You see, this storage device only needs one controller to work. But, it is an enterprise device; the reason to have two is that the second one is on standby: if the primary has a problem, service can fail over to the secondary/backup. If you look at the picture below, you will see one controller has both lights green while the one below it has the top light green and the bottom one orange. That orange light indicates it is either the backup, secondary, failover, or unused controller. The OR is very important here.

Now, this system is designed to be hot swappable but with one proviso: you can only swap the controller that is not currently in use. That would make sense, since chances are the bad controller failed and the backup took over. So, the failed device should be the one in standby.

So, I swapped it. And then checked again.

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 64               ChipsetTemperature: 44
LastBootTime: 2018-04-23:15:07:14      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:12:56:01      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________

It seems I replaced the secondary (serial number CN-I_AM_THE_NEW_GUY), and the primary one (CN-I_AM_FEELING_DEPRESSED) is still the problematic one. Why didn't it fail over when it realized it had a problem? I don't know; what I know is that I still need to replace the problematic controller. At least now we know SlotID 0 is the top controller and SlotID 1 the bottom. Small progress but progress nevertheless.
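
In hindsight, the show controllers output already said which controller was active and which one had the failed battery. A minimal Python sketch of that sanity check follows, assuming the bad controller is the one whose BatteryStatus is not ok (as it was here) and that the listing has been captured as text; parse_controllers and swap_advice are hypothetical names of my own.

```python
import re

def parse_controllers(output):
    """Return one dict per controller: slot number, status and battery state."""
    controllers = []
    # Every controller block in the listing starts with a "SlotID:" label.
    for block in output.split("SlotID:")[1:]:
        slot = re.match(r"\s*(\d+)", block)
        # The lookbehind stops "Status:" from matching inside "BatteryStatus:".
        status = re.search(r"(?<![\w-])Status:[ \t]*(\S+)", block)
        battery = re.search(r"BatteryStatus:[ \t]*(\S+)", block)
        if slot:
            controllers.append({
                "slot": int(slot.group(1)),
                "status": status.group(1) if status else "unknown",
                "battery": battery.group(1) if battery else "unknown",
            })
    return controllers

def swap_advice(controllers):
    """Map each unhealthy slot to a suggestion.

    Assumption: 'unhealthy' means the battery state is anything but 'ok'.
    """
    advice = {}
    for ctrl in controllers:
        if ctrl["battery"] == "ok":
            continue
        if ctrl["status"] == "secondary":
            advice[ctrl["slot"]] = "safe to pull"
        elif ctrl["status"] == "active":
            advice[ctrl["slot"]] = "fail over first"
        else:
            advice[ctrl["slot"]] = "wait, state is " + ctrl["status"]
    return advice

# The relevant lines from the first listing in this article.
SAMPLE = """\
SlotID: 0                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: failed
SlotID: 1                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
"""

if __name__ == "__main__":
    print(swap_advice(parse_controllers(SAMPLE)))
```

The point of returning a suggestion per slot rather than a yes/no is that a controller in the middle of a reboot reports unknown for everything, and that is a "wait" rather than a "pull".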

It turns out there is a way to force the failover from the command line: the restart command. But, since I have 3 different storage devices -- PS4100E-01, PS4100E-00, and PS6100E -- in this setup and am currently accessing them from the system that controls all of them, and restart does not seem to allow me to specify which device I want to reboot,

STORGroup01> restart
            - Optional argument to restart.
 
STORGroup01>

we should ssh into PS4100E-01 (IP 192.168.1.3 for those who are wondering) and then issue the command there. We do not need to worry about it restarting the secondary controller because it only reboots the active one, which causes it to fail over to the secondary.

STORGroup01> restart

Restarting the system will result in the active and secondary control
modules switching roles. Therefore, the current active control module
will become the secondary after the system restart.


After you enter the restart command, the active control module will fail over.
To continue to use the serial connection when the array restarts, connect the
serial cable to the new active control module.

Do you really want to restart the system? (yes/no) [no]:yes
Restarting at Tue Jan 15 13:40:48 EST 2019 -- please wait...
Waiting for the secondary to synchronize with us (max 900s)
....Rebooting the active controller
Connection to 192.168.1.3 closed by remote host.
Connection to 192.168.1.3 closed.
raub@desktop:~$

And, yes, it is that anticlimactic. I just got kicked off PS4100E-01; our network access to the storage device goes through the controller. Now we need to wait for it to come back. I am lazy so I will use ping to let me know when the network interface is back up.

raub@desktop:~$ ping 192.168.1.3
PING 192.168.1.3 (192.168.1.3) 56(84) bytes of data.

64 bytes from 192.168.1.3: icmp_seq=10 ttl=254 time=3.50 ms
64 bytes from 192.168.1.3: icmp_seq=10 ttl=254 time=3.51 ms (DUP!)
64 bytes from 192.168.1.3: icmp_seq=11 ttl=254 time=90.8 ms
64 bytes from 192.168.1.3: icmp_seq=12 ttl=254 time=1.14 ms
64 bytes from 192.168.1.3: icmp_seq=13 ttl=254 time=1.49 ms
64 bytes from 192.168.1.3: icmp_seq=14 ttl=254 time=1.64 ms
64 bytes from 192.168.1.3: icmp_seq=15 ttl=254 time=1.06 ms
^C
--- 192.168.1.3 ping statistics ---
15 packets transmitted, 6 received, +1 duplicates, 60% packet loss, time 14078ms
rtt min/avg/max/mdev = 1.068/14.741/90.815/31.072 ms
raub@desktop:~$

What I can't show in this article is that it took a while until I started getting packets back. In fact, I went to make coffee (which required me to find coffee first), then came back and ran ping a few times until I got the above output. These devices are in no hurry when rebooting.
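
An even lazier option than rerunning ping by hand is to let a script poll until something answers on the ssh port. The following is a minimal Python sketch using only the standard library; wait_for_port is a name I made up, and the default timeout and interval are arbitrary guesses, not values the array documents anywhere.

```python
import socket
import time

def wait_for_port(host, port=22, timeout=1800.0, interval=10.0):
    """Poll until a TCP connection to host:port succeeds.

    Returns True as soon as a connection is accepted, or False if
    `timeout` seconds go by without one. Defaults are arbitrary.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves the port is listening,
            # not that logins already work.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # refused, unreachable or timed out: try again later
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Example call, using the address from this article:
#     wait_for_port("192.168.1.3")
```

Checking the ssh port instead of ICMP also covers the gap between the interface answering pings and sshd actually being ready, although a listening port is still no guarantee that the CLI is fully up.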

Once we see ping replies, we can check if ssh is running again, then log back in and ask what's up with the controllers:

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: unknown
Model: ()                              BatteryStatus: unknown
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:43:19      SerialNumber:
Manufactured:                          ECOLevel:
CM Rev.:                               FW Rev.:
BootRomVersion:                        BootRomBuilDate:
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 60               ChipsetTemperature: 43
LastBootTime: 2019-01-15:12:56:00      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________

When the output was captured, the controller in slot 0 had not come back up from the restart command. What matters, however, is that the controller in slot 1 became the active one, meaning we can swap the unhappy controller. If we wait a bit, we will see the controller in slot 0 come back as the secondary, still reporting a failed battery,

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:43:09      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________

which means we can now swap the top guy (slot 0) with the old controller we accidentally removed from slot 1. After a few minutes, all is well in the world:

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:55:33      SerialNumber:
Manufactured: 031S                       CN-I_AM_FEELING_GREAT
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 62               ChipsetTemperature: 42
LastBootTime: 2019-01-15:12:55:43      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________

There might be a GUI way to do all of that but I am not smart enough to click on things.
