Showing posts with label nas. Show all posts

Wednesday, January 16, 2019

Failover and swapping controllers in an Equallogic SAN, command line involved

Equallogic, as some of you know, is an old line of storage systems sold by Dell that can do RAID6 at best on a good day. It has been out of support for a few years now; Dell bought EMC and would rather we used that instead. We do have those too, but this article is about the Equallogic we still use.

On a Tuesday evening I got an email from one of them telling me it was unhappy. As is my recurring motif, I do not like GUIs to manage devices. This one decided it would do nothing to change that belief and instead just refused to start. So it is time for plan B. Most of the storage appliances and hypervisors out there are built on either Linux or FreeBSD. The last time I had to probulate this one, I found it runs FreeBSD. And that usually means we can ssh into it. And so we do; I think I do not need to show how to do that: if you have ssh'd into one machine, you have ssh'd into them all.

Now that we are inside, we need to find out what's up (I am cheating a bit because the email did tell me that the device PS4100E-01 was unhappy, which is why I am probulating it specifically):

STORGroup01> member select PS4100E-01 show
_____________________________ Member Information ______________________________
Name: PS4100E-01                       Status: online
TotalSpace: 15.98TB                    UsedSpace: 3.86TB
SnapSpace: 192.92GB                    Description:
Def-Gateway: 192.168.1.1               Serial-Number:
Disks: 12                                CN-TWO_AND_A_HALF
Spares: 1                              Controllers: 2
CacheMode: write-thru                  Connections: 4
RaidStatus: ok                         RaidPercentage: 0.000%
LostBlocks: false                      HealthStatus: critical
LocateMember: disable                  Controller-Safe: disabled
Version: V9.1.4 (R443182)              Delay-Data-Move: disable
ChassisType: DELLSBB2u12 3.5           Accelerated RAID Capable: no
Pool: default                          Raid-policy: raid6
Service Tag: H9CHSW1                   Product Family: PS4100
All-Disks-SED: no                      SectorSize: 512
Language-Kit-Version: de, es, fr, ja,  ExpandedSnapDataSize: N/A
  ko, zh                               CompressedSnapDataSize: N/A
CompressionSavings: N/A                Data-Reduction: no-capable-hardware
Raid-Rebuild-Delay-State: disabled     Raid-Expansion-Status: enabled
_______________________________________________________________________________

____________________________ Health Status Details ____________________________

Critical conditions::
Critical hardware component failure.

Warning conditions::
None
_______________________________________________________________________________


____________________________ Operations InProgress ____________________________


ID StartTime            Progress Operation Details                             

-- -------------------- -------- -----------------------------------------------
STORGroup01>

Yep, the Health Status Details section tells us it is unhappy; we knew that already. But what does the log say?

STORGroup01> show recentevents
[...]
6492:881:PS4100E-01:SP: 8-Jan-2019 19:34:51.700834:cache_driver.cc:1056:WARNING:
28.3.17:Active control module cache is now in write-through mode. Array performa
nce is degraded.

6491:880:PS4100E-01:SP: 8-Jan-2019 19:34:51.700833:emm.c:355:ERROR:28.4.85:Criti
cal hardware component failure, as shown next.
        C2F power module is not operating.

6490:879:PS4100E-01:SP: 8-Jan-2019 19:34:51.700832:emm.c:2363:ERROR:28.4.47:Crit
ical health conditions exist.
 Correct immediately before they affect array operation.
        Critical hardware component failure.
        There are 1 outstanding health conditions. Correct these conditions before they
 affect array operation.
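If the event list is long, it can help to filter it by severity instead of reading it top to bottom. What follows is a minimal sketch in Python, not anything the array provides: it assumes the output of show recentevents was saved to a text file, that events are separated by blank lines, and that the terminal wrapped long lines at 80 columns as in the capture above. The field names (id, seq, member, source, severity, code) are my own labels, guessed from the sample lines.

```python
import re

# Field names are assumptions based on the sample output, not documented names.
EVENT_RE = re.compile(
    r"(?P<id>\d+):(?P<seq>\d+):(?P<member>[^:]+):(?P<proc>[^:]+):\s*"
    r"(?P<date>\d{1,2}-[A-Za-z]{3}-\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+):"
    r"(?P<source>[^:]+):(?P<line>\d+):(?P<severity>[A-Z]+):(?P<code>[\d.]+):"
    r"(?P<message>.*)",
    re.DOTALL,
)


def unwrap(chunk, width=80):
    """Undo terminal wrapping: a line exactly `width` characters long is
    assumed to continue on the next line without a break."""
    parts = []
    glue_next = False
    for line in chunk.splitlines():
        if glue_next and parts:
            parts[-1] += line          # continuation of a wrapped line
        else:
            parts.append(line.strip())  # a genuine new line of the event
        glue_next = len(line) == width
    return " ".join(p for p in parts if p)


def parse_events(text, width=80):
    """Split the saved output on blank lines and parse each event header."""
    events = []
    for chunk in re.split(r"\n\s*\n", text.strip("\n")):
        match = EVENT_RE.match(unwrap(chunk, width))
        if match:                       # skips prompts, "[...]" and the like
            events.append(match.groupdict())
    return events


def errors_only(events):
    """Keep only the events whose severity is ERROR."""
    return [e for e in events if e["severity"] == "ERROR"]
```

Calling errors_only(parse_events(saved_text)) on the capture above would leave the two emm.c events and drop the write-through warning.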

OK, controller thingie is not happy. But, this device has 2 of them. Which one is it?

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 65               ChipsetTemperature: 44
LastBootTime: 2018-04-23:15:06:55      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011                             
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2018-04-23:15:13:17      SerialNumber:
Manufactured: 031S                       CN-I_AM_FEELING_GREAT
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011                             
_______________________________________________________________________________

______________________________ Cache Information ______________________________
CacheMode: write-thru                  Controller-Safe: disabled
Low-Battery-Safe: enabled                                                      
_______________________________________________________________________________
STORGroup01>

Fun fact: I haven't the foggiest idea of which slot is 1 and which one is 0. We will have to find that out as we go along.

SPOILER ALERT: Making ASSumptions here is a bad idea.
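For those who would rather not eyeball the two-column output, here is a small Python sketch that pulls the slot, status, and battery state out of a saved copy of the show controllers output. It is only an illustration under my own assumptions: the function and key names are made up, and it relies on the SlotID, Status, and BatteryStatus labels appearing exactly as they do in the capture above.

```python
import re

# Patterns for the two fields we care about. "\bStatus:" does not match
# inside "BatteryStatus:" because there is no word boundary before "Status".
FIELDS = {
    "status": re.compile(r"\bStatus:\s*(\S+)"),
    "battery": re.compile(r"\bBatteryStatus:\s*(\S+)"),
}
SLOT_RE = re.compile(r"\bSlotID:\s*(\d+)")


def parse_controllers(text):
    """Return one dict per controller: {'slot', 'status', 'battery'}."""
    controllers = []
    current = None
    for line in text.splitlines():
        slot = SLOT_RE.search(line)
        if slot:
            # A SlotID line starts a new controller record.
            current = {"slot": int(slot.group(1))}
            controllers.append(current)
        if current is None:
            continue  # still in the banner before the first controller
        for key, pattern in FIELDS.items():
            found = pattern.search(line)
            if found:
                current[key] = found.group(1)
    return controllers


def failed_battery_slots(controllers):
    """Slots whose BatteryStatus is reported as 'failed'."""
    return [c["slot"] for c in controllers if c.get("battery") == "failed"]
```

Fed the output above, this would report slot 0 as active with a failed battery and slot 1 as secondary with a good one. It says nothing about which physical slot is which; that still has to be found out the hard way.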

The Controller

We sent out the above info -- model, firmware -- to the vendor we have a support contract with, which sent a controller card. Here it is in its purple gloriousness.

The configuration is stored on an SD card in the socket my finger is pointing at. When swapping the controller, the laziest thing to do is put the old SD card in the new controller. This way we do not have to configure it.


We know one of the controllers is unhappy, but which one? You see, this storage device only needs one controller to work. But it is an enterprise device; the reason to have two is that the second one is on standby: if the primary has a problem, service can fail over to the secondary/backup. If you look at the picture below, you will see one controller has both lights green while the one below it has the top light green and the bottom one orange. That orange light indicates it is either the backup, secondary, failover, or unused controller. The OR is very important here.

Now, this system is designed to be hot swappable but with one proviso: you can only swap the controller that is not currently in use. That would make sense, since chances are the bad controller failed and the backup took over. So, the failed controller should be the one in standby.

So, I swapped it. And then checked again.

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 64               ChipsetTemperature: 44
LastBootTime: 2018-04-23:15:07:14      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:12:56:01      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________

It seems I replaced the secondary (serial number CN-I_AM_THE_NEW_GUY), and the primary one (CN-I_AM_FEELING_DEPRESSED) is still the problematic one. Why didn't it fail over when it realized it had a problem? I don't know; what I do know is that I still need to replace the problematic controller. At least now we know that SlotID 0 is the top controller and SlotID 1 the bottom one. Small progress, but progress nevertheless.
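The mistake above boils down to assuming the failed controller was the standby one. A check like the following, run before pulling anything, would have flagged it. This is a hedged sketch in Python: the dictionaries with slot, status, and battery keys are a made-up representation of the show controllers output, and a failed battery is used as the only sign of trouble because that is what this particular array reported.

```python
def next_step(controllers):
    """Suggest what to do next as an (action, slot) tuple.

    Actions (names are illustrative, not array terminology):
      'recheck'        - a controller is in a state other than active or
                         secondary (for example still rebooting); look again
      'none'           - no controller reports a failed battery
      'failover_first' - the bad controller is active; restart to fail over
      'swap'           - the bad controller is secondary; safe to hot swap
    """
    for c in controllers:
        if c["status"] not in ("active", "secondary"):
            return ("recheck", c["slot"])

    bad = [c for c in controllers if c["battery"] == "failed"]
    if not bad:
        return ("none", None)

    victim = bad[0]
    if victim["status"] == "active":
        # Hot swap only applies to the controller that is not in use.
        return ("failover_first", victim["slot"])
    return ("swap", victim["slot"])


# State of the array before anything was touched:
before = [
    {"slot": 0, "status": "active", "battery": "failed"},
    {"slot": 1, "status": "secondary", "battery": "ok"},
]
print(next_step(before))
```

With the state captured at the start of this article, the suggestion is to fail over first and only then swap slot 0, which is exactly the order I did not follow.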

It turns out there is a way to programmatically force the failover: the restart command. But I have 3 different storage devices -- PS4100E-01, PS4100E-00, and PS6100E -- in this setup, I am currently accessing them from the system that controls all of them, and restart does not seem to allow me to specify which device I want to reboot,

STORGroup01> restart
            - Optional argument to restart.
 
STORGroup01>

so we should ssh into PS4100E-01 itself (IP 192.168.1.3 for those who are wondering) and then issue the command there. We do not need to worry about it restarting the secondary controller because it only reboots the active one, which causes it to fail over to the secondary.

STORGroup01> restart

Restarting the system will result in the active and secondary control
modules switching roles. Therefore, the current active control module
will become the secondary after the system restart.


After you enter the restart command, the active control module will fail over.
To continue to use the serial connection when the array restarts, connect the
serial cable to the new active control module.

Do you really want to restart the system? (yes/no) [no]:yes
Restarting at Tue Jan 15 13:40:48 EST 2019 -- please wait...
Waiting for the secondary to synchronize with us (max 900s)
....Rebooting the active controller
Connection to 192.168.1.3 closed by remote host.
Connection to 192.168.1.3 closed.
raub@desktop:~$

And, yes, it is that anticlimactic. I just got kicked off PS4100E-01; our network access to the storage device is through the controller. Now we need to wait for it to come back. I am lazy so I will use ping to let me know when the network interface is back up.

raub@desktop:~$ ping 192.168.1.3
PING 192.168.1.3 (192.168.1.3) 56(84) bytes of data.

64 bytes from 192.168.1.3: icmp_seq=10 ttl=254 time=3.50 ms
64 bytes from 192.168.1.3: icmp_seq=10 ttl=254 time=3.51 ms (DUP!)
64 bytes from 192.168.1.3: icmp_seq=11 ttl=254 time=90.8 ms
64 bytes from 192.168.1.3: icmp_seq=12 ttl=254 time=1.14 ms
64 bytes from 192.168.1.3: icmp_seq=13 ttl=254 time=1.49 ms
64 bytes from 192.168.1.3: icmp_seq=14 ttl=254 time=1.64 ms
64 bytes from 192.168.1.3: icmp_seq=15 ttl=254 time=1.06 ms
^C
--- 192.168.1.3 ping statistics ---
15 packets transmitted, 6 received, +1 duplicates, 60% packet loss, time 14078ms
rtt min/avg/max/mdev = 1.068/14.741/90.815/31.072 ms
raub@desktop:~$

What I can't show in this article is that it took a while until I started getting packets back. In fact, I went to make coffee (which required me to find coffee first), came back, and then ran ping a few times until I got the above output. These devices are in no hurry when rebooting.
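Instead of rerunning ping between coffees, the waiting can be left to a small polling loop. The sketch below is in Python and makes a few assumptions of its own: that ssh on the array listens on the default port 22, that trying a TCP connection is an acceptable stand-in for ping, and that checking every 10 seconds for up to 30 minutes is reasonable (both numbers are arbitrary). The function names are made up for this example.

```python
import socket
import time


def port_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(probe, interval=10.0, max_wait=1800.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() every `interval` seconds until it returns True.

    Returns the number of attempts it took. Raises TimeoutError once
    `max_wait` seconds have passed without success. `sleep` and `clock`
    are parameters only so the loop can be exercised without real waiting.
    """
    deadline = clock() + max_wait
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(f"gave up after {attempts} attempts")
        sleep(interval)


# Example (not run here): block until ssh on the array answers again.
# wait_until(lambda: port_open("192.168.1.3", 22))
```

A TCP check on the ssh port has one advantage over ping here: when it succeeds, the next step (logging back in) is known to be possible.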

Once we see ping replies, we can check if ssh is running again and then log back in and ask what's up with the controllers:

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: unknown
Model: ()                              BatteryStatus: unknown
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:43:19      SerialNumber:
Manufactured:                          ECOLevel:
CM Rev.:                               FW Rev.:
BootRomVersion:                        BootRomBuilDate:
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 60               ChipsetTemperature: 43
LastBootTime: 2019-01-15:12:56:00      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________

When the output was captured, the controller in slot 0 had not come back up from the restart command. What matters, however, is that the controller in slot 1 became the active one, meaning we can swap the unhappy controller. If we wait a bit, we will see the controller in slot 0 come back as the secondary, still reporting its failed battery,

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: failed
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:43:09      SerialNumber:
Manufactured: 0327                       CN-I_AM_FEELING_DEPRESSED
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________

which means we can now swap the top guy (slot 0) with the old controller we accidentally removed from slot 1. After a few minutes, all is well in the world:

STORGroup01> member select PS4100E-01 show controllers
___________________________ Controller Information ____________________________
SlotID: 0                              Status: secondary
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 0                ChipsetTemperature: 0
LastBootTime: 2019-01-15:13:55:33      SerialNumber:
Manufactured: 031S                       CN-I_AM_FEELING_GREAT
ECOLevel: C00                          CM Rev.: A04
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1                              Status: active
Model: 70-0476(TYPE 12)                BatteryStatus: ok
ProcessorTemperature: 62               ChipsetTemperature: 42
LastBootTime: 2019-01-15:12:55:43      SerialNumber:
Manufactured: 025E                       CN-I_AM_THE_NEW_GUY
ECOLevel: C00                          CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.4 BootRomVersion: 3.6.4
   (R443182)                           BootRomBuilDate: Mon Jun 27 10:20:45
                                          EDT 2011
_______________________________________________________________________________

There might be a GUI way to do all of that but I am not smart enough to click on things.

Thursday, August 06, 2009

Disassembling a Brocade 2800 fabric switch (and finding its serial port/resetting its password in the process!)

What we have here is a Brocade 2800 fabric switch, but you probably knew that from reading the title of today's chapter. In previous jobs I have worked with 24000 and 48000 Brocade switches like this hairball here:

Yes, they are bigger (256 fibre ports for the 24000 and 512 for the 48000 vs 16 for the 2800) and faster (up to 4 Gb/s compared to 1 Gb/s) but they all do the same thing: they were originally designed to be used in storage, connecting SAN and NAS devices to the hosts that use them. The fabric part of the name means just that. But we are here to talk about the 2800, so sit down and take notes because I will not repeat myself.

As mentioned before, the 2800 has 16 fibre ports... and one ethernet port. The ethernet port, a 100Base-TX one, is normally used as your console: you assign an IP to it, connect to the switch either using a web interface or ssh, and do some switch managing.

Tearing it apart.

All that talk about what these switches can do is most interesting, but that is not what we are here to do today. We want to see what is inside the box! So, let's get busy!

The first thing we should do is to remove the two redundant power supplies. To do so, you first pop the clip that goes around the power connector; you will note it becomes a neat handle (shown here is the left power supply sitting on the top of the right, which is still in the switch).

Then, you press down on the tab the clip hinges around. That unlocks the power supply from the case, allowing you to slide it out from the front.

Here is another view of the hole the right power supply hides. Note the tabs on the left. They are what hold the top in place, which we will be removing in the next step.

Ok, I was lying. We are not removing the top of the switch just yet. First, we need to remove the plastic cover that surrounds the front LCD and the buttons that control the menu on the display. It will come off if you grab it on one side and work it out, I promise! Just do not be too aggressive.

Once the cover is off, well, the cover is off. Now we can go back to the business of removing the top cover.

This next step requires a bit of force and traction: you need to slide the lid forward, which means you (or a friend) will have to hold the switch in place while wiggling the cover and slowly moving it. I did that by myself but will say finding a place to hold the cover was tricky. This is in my opinion the hardest step in this entire adventure.

With judicious use of curse words, strength, and even finesse, it should eventually slide forward until it can't go any further. Yeah, as the picture below shows, it is not much, but remember all it has to do is line up its tabs with the holes you saw through the power supply bays. Now, you have to grab the back of the cover and tilt it up around the front. Then, a bit of wiggling should get the cover out.

Here is what the switch looks like topless. Those wires on the top come from the front panel. You can also see the fans on the rear. Also note the tilted plate those wires go through. It is probably that way to help direct the air from the motherboard to the fans; we will talk about them in a moment.

And here is a rear view of the case. Can you see the motherboard hiding there?

I promised I was going to talk a bit about the fans, and talk about them I will. If you look on the back of the case, you will find 4 thumbscrews that look like this:

They hold the fan assembly to the case. So, if you unscrew them, you can remove the fans without much fuss.

Overall I will say this is a very easy switch to take apart, which means you can replace any of its components in minutes.

Hunting for the serial port.

If you are like me, you will make sure you keep all documentation on the switch somewhere safe... or safer than piling up by the trash bin. The more confidential stuff, like its configuration and passwords, you probably keep in a secure encrypted file somewhere; it sure beats post-it notes! But what do you do if you lose your password or the previous guy left without letting anybody know it? Well, there are two things you can do: call Brocade and have them send you a file to reset the password (which they will gladly do depending on your contract) or take matters into your own hands. Brocade was actually nice: in addition to the normal ethernet console port, they hid a serial port inside the switch. As I looked online for info on it, it seemed that its location was only told to the initiated after they had proven their mastery of the secret network society handshake.

Jokes aside, I have never found pictures of the inside of this switch, much less detailed information on the location of the serial port (I guess it would rather be called IDC but I will ignore its wishes). And this is why I decided to do this entire writeup. The part about tearing the switch apart is more of a bonus feature; this is the neat stuff:

Getting to the serial port will require one tool and one tool only: a flathead screwdriver. If you look on the front of the switch, you will find a slotted cylinder thingie about in the middle of the ports; that is where you put the screwdriver on:

That actually turns a long shaft which ends inside the case. The threaded end of the shaft turns inside a nut that is attached to a bracket. Just click on the picture below to see it; it is kinda behind where the wires from the front panel go down.

As you unscrew it, the drawer containing the motherboard and the switch ports will slide forward. It will not go very far; you will know when you have unscrewed it as far as it can go. After that you will need to work it out. It is tight in there. I wiggled it and eventually it slid out enough that I could grab it and pull it out like a drawer:

Here is what the motherboard looks like once it is removed from the case. Now, what about the serial port you talked about? Calm down and look to its upper left corner, by the left large white connector. Can't see it? Let me zoom a bit.

How about now? Can you see it? Hint: I circled it in yellow.

And here is yet an even closer look. Can you see the pin numbers? Hint: White dot usually indicates pin 1.

If you want to make your serial cable, here is the pinout:

I will add how to reset the password later.