EqualLogic, as some of you know, is a line of old storage systems sold by Dell that can do RAID6 at best, on a good day. It has been out of support for a few years now; Dell bought EMC and would rather we used that instead. We do have those too, but this article is about the EqualLogic we still use.
On a Tuesday evening I got an email from one of them telling me it was unhappy. As is my recurring motif, I do not like GUIs for managing devices. This one did nothing to change that belief: it just refused to start. So it is time for plan B. Most storage appliances and hypervisors out there are built on either Linux or FreeBSD. The last time I had to probulate this one, I found it runs FreeBSD, and that usually means we can ssh into it. So we do; I do not think I need to show how, because if you have ssh'd into one machine, you have ssh'd into them all.
Now that we are inside, we need to find out what's up (I am cheating a bit because the email did tell me that the device PS4100E-01 was unhappy, which is why I am probulating it specifically):
STORGroup01> member select PS4100E-01 show
_____________________________ Member Information ______________________________
Name: PS4100E-01                       Status: online
TotalSpace: 15.98TB                    UsedSpace: 3.86TB
SnapSpace: 192.92GB                    Description:
Def-Gateway: 192.168.1.1               Serial-Number:
Disks: 12                                CN-TWO_AND_A_HALF
Spares: 1                              Controllers: 2
CacheMode: write-thru                  Connections: 4
RaidStatus: ok                         RaidPercentage: 0.000%
LostBlocks: false                      HealthStatus: critical
LocateMember: disable                  Controller-Safe: disabled
Version: V9.1.4 (R443182)              Delay-Data-Move: disable
ChassisType: DELLSBB2u12 3.5           Accelerated RAID Capable: no
Pool: default                          Raid-policy: raid6
Service Tag: H9CHSW1                   Product Family: PS4100
All-Disks-SED: no                      SectorSize: 512
Language-Kit-Version: de, es, fr, ja,  ExpandedSnapDataSize: N/A
  ko, zh                               CompressedSnapDataSize: N/A
CompressionSavings: N/A                Data-Reduction: no-capable-hardware
Raid-Rebuild-Delay-State: disabled     Raid-Expansion-Status: enabled
_______________________________________________________________________________
____________________________ Health Status Details ____________________________
Critical conditions::
Critical hardware component failure.
Warning conditions::
None
_______________________________________________________________________________
____________________________ Operations InProgress ____________________________
ID StartTime            Progress Operation Details
-- -------------------- -------- -----------------------------------------------
STORGroup01>
Yep, the Health Status Details section tells us it is unhappy; we knew that already. But what does the log say?
STORGroup01> show recentevents
[...]
6492:881:PS4100E-01:SP: 8-Jan-2019 19:34:51.700834:cache_driver.cc:1056:WARNING:28.3.17:Active control module cache is now in write-through mode. Array performance is degraded.
6491:880:PS4100E-01:SP: 8-Jan-2019 19:34:51.700833:emm.c:355:ERROR:28.4.85:Critical hardware component failure, as shown next. C2F power module is not operating.
6490:879:PS4100E-01:SP: 8-Jan-2019 19:34:51.700832:emm.c:2363:ERROR:28.4.47:Critical health conditions exist. Correct immediately before they affect array operation. Critical hardware component failure. There are 1 outstanding health conditions. Correct these conditions before they affect array operation.
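If you would rather not eyeball a long event log, the colon-separated lines above are regular enough to filter with a few lines of Python. This is only a sketch: the field names (event_id, member_event_id, source and so on) are my own guesses at what each column means, not anything the firmware documents, and the sample is two of the events above pasted in as single lines.

```python
import re

# Field names are assumptions made for this sketch; the array does not label them.
EVENT_RE = re.compile(
    r"^(?P<event_id>\d+):(?P<member_event_id>\d+):(?P<member>[^:]+):(?P<source>[^:]+):"
    r"\s*(?P<date>\d{1,2}-[A-Za-z]{3}-\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+):"
    r"(?P<file>[^:]+):(?P<line>\d+):(?P<severity>[A-Z]+):\s*(?P<code>[\d.]+):"
    r"(?P<message>.*)$"
)

# Two events copied from the output above, each on one line.
SAMPLE_EVENTS = """\
6492:881:PS4100E-01:SP: 8-Jan-2019 19:34:51.700834:cache_driver.cc:1056:WARNING:28.3.17:Active control module cache is now in write-through mode. Array performance is degraded.
6491:880:PS4100E-01:SP: 8-Jan-2019 19:34:51.700833:emm.c:355:ERROR:28.4.85:Critical hardware component failure, as shown next. C2F power module is not operating.
"""


def parse_event(line):
    """Return the event fields as a dict, or None if the line does not look like an event."""
    match = EVENT_RE.match(line.strip())
    return match.groupdict() if match else None


def events_with_severity(text, severity):
    """Parse every line of text and keep only events of the given severity."""
    parsed = (parse_event(line) for line in text.splitlines())
    return [event for event in parsed if event and event["severity"] == severity]


for event in events_with_severity(SAMPLE_EVENTS, "ERROR"):
    print(event["member"], event["code"], event["message"])
```

The point is just to pull the ERROR lines out of the noise; the message text is still what tells you which part is sulking.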
OK, controller thingie is not happy. But, this device has 2 of them. Which one is it?
STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 65               ChipsetTemperature: 44
LastBootTime: 2018-04-23:15:06:55      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
  (R443182)                            BootRomBuilDate: Mon Jun 27 10:20:45
                                         EDT 2011
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2018-04-23:15:13:17      SerialNumber:
Manufactured: 031S                       CN-I_AM_FEELING_GREAT
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
  (R443182)                            BootRomBuilDate: Mon Jun 27 10:20:45
                                         EDT 2011
_______________________________________________________________________________
______________________________ Cache Information ______________________________
CacheMode: write-thru                  Controller-Safe: disabled
Low-Battery-Safe: enabled
_______________________________________________________________________________
STORGroup01>
Fun fact: I haven't the foggiest idea of which slot is 1 and which one is 0. We will have to find that out as we go along.
SPOILER ALERT: Making ASSumptions here is a bad idea.
The Controller
We sent out the above info -- model, firmware -- to the vendor we have a support contract with, which sent a controller card. Here it is in its purple gloriousness.
The configuration is stored on an SD card in the socket my finger is pointing at. When swapping the controller, the laziest thing to do is to put the old SD card in the new controller. This way we do not have to configure it.
We know one of the controllers is unhappy, but which one? You see, this storage device only needs one controller to work. But it is an enterprise device; the reason to have two is that the second one is on standby: if the primary has a problem, service can fail over to the secondary/backup. If you look at the picture below, you will see one controller has both lights green, while the one below it has the top light green and the bottom one orange. That orange light indicates it is either the backup, secondary, failover, or unused controller. The OR is very important here.
Now, this system is designed to be hot swappable but with one proviso: you can only swap the controller that is not currently in use. That would make sense, since chances are the bad controller failed and the backup took over. So, the failed controller should be the one in standby.
So, I swapped it. And then checked again.
STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 64               ChipsetTemperature: 44
LastBootTime: 2018-04-23:15:07:14      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
  (R443182)                            BootRomBuilDate: Mon Jun 27 10:20:45
                                         EDT 2011
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:12:56:01      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
  (R443182)                            BootRomBuilDate: Mon Jun 27 10:20:45
                                         EDT 2011
_______________________________________________________________________________
It seems I replaced the secondary (serial number CN-I_AM_THE_NEW_GUY), and the primary one (CN-I_AM_FEELING_DEPRESSED) is still the problematic one. Why didn't it fail over when it realized it had a problem? I don't know; what I do know is that I still need to replace the problematic controller. At least now we know SlotID 0 is the top controller and SlotID 1 the bottom one. Small progress, but progress nevertheless.
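With hindsight, the check I should have done before pulling anything is to read Status and BatteryStatus together for each slot. Here is a small Python sketch of that check. It only looks at the three fields it needs from the show controllers output, the function names and the wording of the advice are mine, and the rule it encodes (only pull a controller that is not active) is simply the proviso described earlier, not an official procedure.

```python
import re

# A controller record starts with "SlotID: N  Status: word"; its BatteryStatus follows.
CONTROLLER_RE = re.compile(
    r"SlotID:\s*(?P<slot>\d+)\s+Status:\s*(?P<status>\w+)"
    r".*?BatteryStatus:\s*(?P<battery>\w+)",
    re.DOTALL,
)

# Trimmed copy of the output from before the first swap.
SAMPLE = """
SlotID: 0                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: failed
SlotID: 1                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
"""


def parse_controllers(text):
    """Return one dict per controller with its slot, status and battery state."""
    return [
        {"slot": int(m["slot"]), "status": m["status"], "battery": m["battery"]}
        for m in CONTROLLER_RE.finditer(text)
    ]


def swap_plan(controllers):
    """Say, for each controller that is not healthy, what to do before touching it."""
    plan = []
    for c in controllers:
        if c["battery"] == "ok":
            continue  # nothing wrong with this one, leave it alone
        if "unknown" in (c["status"], c["battery"]):
            plan.append(f"slot {c['slot']}: state unknown, wait and check again")
        elif c["status"] == "active":
            plan.append(f"slot {c['slot']}: battery {c['battery']} but controller is active; fail over first, then swap")
        else:
            plan.append(f"slot {c['slot']}: battery {c['battery']} and controller is not active; safe to swap")
    return plan


for step in swap_plan(parse_controllers(SAMPLE)):
    print(step)
```

Had I run something like this first, it would have told me the faulty controller was the active one, and that the standby light was not pointing at the culprit.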
It turns out there is a way to programmatically force the failover: the restart command. But, since I have 3 different storage devices -- PS4100E-01, PS4100E-00, and PS6100E -- in this setup and am currently accessing them from the system that controls all of them, and restart does not seem to allow me to specify which device I want to reboot,
STORGroup01> restart- Optional argument to restart.
STORGroup01>
we should ssh into PS4100E-01 (IP 192.168.1.3 for those who are wondering) and then issue the command there. We do not need to worry about it restarting the secondary controller because it only reboots the active one, which causes it to fail over to the secondary.
STORGroup01> restart
Restarting the system will result in the active and secondary control modules switching roles. Therefore, the current active control module will become the secondary after the system restart.
After you enter the restart command, the active control module will fail over. To continue to use the serial connection when the array restarts, connect the serial cable to the new active control module.
Do you really want to restart the system? (yes/no) [no]:yes
Restarting at Tue Jan 15 13:40:48 EST 2019 -- please wait...
Waiting for the secondary to synchronize with us (max 900s)
....Rebooting the active controller
Connection to 192.168.1.3 closed by remote host.
Connection to 192.168.1.3 closed.
raub@desktop:~$
And, yes, it is that anticlimactic. I just got kicked off PS4100E-01; our network access to the storage device is through the controller. Now we need to wait for it to come back. I am lazy so I will use ping to let me know when the network interface is back up.
raub@desktop:~$ ping 192.168.1.3
PING 192.168.1.3 (192.168.1.3) 56(84) bytes of data.
64 bytes from 192.168.1.3: icmp_seq=10 ttl=254 time=3.50 ms
64 bytes from 192.168.1.3: icmp_seq=10 ttl=254 time=3.51 ms (DUP!)
64 bytes from 192.168.1.3: icmp_seq=11 ttl=254 time=90.8 ms
64 bytes from 192.168.1.3: icmp_seq=12 ttl=254 time=1.14 ms
64 bytes from 192.168.1.3: icmp_seq=13 ttl=254 time=1.49 ms
64 bytes from 192.168.1.3: icmp_seq=14 ttl=254 time=1.64 ms
64 bytes from 192.168.1.3: icmp_seq=15 ttl=254 time=1.06 ms
^C
--- 192.168.1.3 ping statistics ---
15 packets transmitted, 6 received, +1 duplicates, 60% packet loss, time 14078ms
rtt min/avg/max/mdev = 1.068/14.741/90.815/31.072 ms
raub@desktop:~$
What I can't show in this article is that it took a while until I started getting packets back. In fact, I went to make coffee (which required me to find coffee first), came back, and then ran ping a few times until I got the above output. These devices are in no hurry when rebooting.
Once we see ping replies, we can check whether ssh is running again, then log back in and ask what's up with the controllers:
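If you would rather not rerun ping by hand between sips of coffee, a small polling loop can do the waiting. The sketch below is in Python and checks the ssh port rather than ICMP, since ssh is what we need next anyway. The function name, the 15 minute ceiling and the 5 second interval are arbitrary choices of mine, and port 22 assumes ssh is listening on its default port.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=900.0, interval=5.0, connect_timeout=3.0):
    """Poll host:port until it accepts a TCP connection.

    Returns the number of seconds waited, or None if timeout seconds
    pass without a successful connection.
    """
    start = time.monotonic()
    while True:
        try:
            # create_connection raises OSError while the port is closed or unreachable.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return time.monotonic() - start
        except OSError:
            pass
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)


# Example call for the member used in this article (address taken from above):
#   waited = wait_for_port("192.168.1.3")
#   print("ssh is back" if waited is not None else "still down, go find more coffee")
```

A successful connection only means something is listening on that port again; you still want to log in and look at the controllers before pulling anything.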
STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: unknown
Model: ()                              BatteryStatus: unknown
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:43:19      SerialNumber:
Manufactured:
ECOLevel:                              CM Rev.:
FW Rev.:                               BootRomVersion:
                                       BootRomBuilDate:
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 60               ChipsetTemperature: 43
LastBootTime: 2019-01-15:12:56:00      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
  (R443182)                            BootRomBuilDate: Mon Jun 27 10:20:45
                                         EDT 2011
_______________________________________________________________________________
When the output was captured, the controller in slot 0 had not yet come back up from the restart command. What matters, however, is that the controller in slot 1 became the active one, meaning we can swap the unhappy controller. If we wait a bit, we will see the controller in slot 0 reported as the secondary, with its battery still failed,
STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:43:09      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
  (R443182)                            BootRomBuilDate: Mon Jun 27 10:20:45
                                         EDT 2011
_______________________________________________________________________________
which means we can now swap the top guy (slot 0) with the old controller we accidentally removed from slot 1. After a few minutes, all is well in the world:
STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:55:33      SerialNumber:
Manufactured: 031S                       CN-I_AM_FEELING_GREAT
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
  (R443182)                            BootRomBuilDate: Mon Jun 27 10:20:45
                                         EDT 2011
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 62               ChipsetTemperature: 42
LastBootTime: 2019-01-15:12:55:43      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
  (R443182)                            BootRomBuilDate: Mon Jun 27 10:20:45
                                         EDT 2011
_______________________________________________________________________________
There might be a GUI way to do all of that but I am not smart enough to click on things.