I went to a customer today to configure a NetApp FAS2020 for SnapMirror replication. For the sake of argument, let’s call this filer FLR2. After hooking up the fibre channel interconnect to a second disk shelf (containing SATA disks), I noticed that all disks from the onboard shelf were missing. They had been added to a second FAS2020 (FLR1), which runs all the production VMs and thus benefits more from the 300GB 10K SAS disks than FLR2, which was going to do only SnapMirror replication from FLR1 over a 20 Mbit internet connection. Without the SAS disks in place, which held aggr0 and the root volume (vol0), there wasn’t much to boot from. Hence, I had myself a very expensive piece of useless storage.

Because my experience with NetApp filers has been relatively minimal up until this point, I needed to dive in deeply to figure out how the root volume works, why NetApp has placed it on the ‘data’ disks (and not on the CompactFlash storage), and most importantly, how to fix this mess.

The bootlog

I began by examining the output of the boot process. I hooked up the serial connection and saw the following:

AMI BIOS8 Modular BIOS
Copyright (C) 1985-2006,  American Megatrends, Inc. All Rights Reserved
Portions Copyright (C) 2006 Network Appliance, Inc. All Rights Reserved
BIOS Version 3.0
+++++++++++++++

Boot Loader version 1.3
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2006 Network Appliance Inc.

CPU Type: Mobile Intel(R) Celeron(R) CPU 2.20GHz

Starting AUTOBOOT press Ctrl-C to abort...
Loading:.................0x200000/33899484 0x22543dc/31974692 0x40d2900
 /2557760 Entry at 0x00200000
Starting program at 0x00200000
cpuid 0x80000000: 0x80000004 0x0 0x0 0x0
Press CTRL-C for special boot menu

[fci.initialization.failed:error]: Initialization failed on
 Fibre Channel adapter 0b.

NetApp Release 7.2.6.1: Wed Dec 10 21:26:02 PST 2008
Copyright (c) 1992-2008 NetApp.
Starting boot on Mon Nov  2 15:50:00 GMT 2009

[nvram.battery.turned.on:info]: The NVRAM battery is turned ON.
 It is turned OFF during system shutdown.
[diskown.isEnabled:info]: software ownership has been enabled
 for this system
[coredump.spare.none:info]: No sparecore disk was found.
 WARNING: 0 disks found!

Storage Adapters found:
2 Fibre Channel Storage Adapters found!
1 SAS Adapters found!
0 Parallel SCSI Storage Adapters found!
1 ATA Adapters found!

Target Adapters found:
0 Fibre Channel Target Adapters found!
1 iSCSI Target Adapters found!
1 Unknown Target Adapters found!

WARNING: there do not appear to be any disks attached to the system.

Check that disks have been assigned ownership to this
system (ID 135058219) using the 'disk show' and 'disk assign'
commands from maintenace mode.

No root volume found.
Rebooting... (ctrl-c to break reboot loop)

OK, so the bootloader is still functioning properly, but complaining about the lack of disks, which makes perfect sense, because the ownership of the disks hasn’t been claimed yet. More importantly, even if the filer had claimed the disks, there wouldn’t be anything on those disks, so the boot process would’ve stopped in a similar fashion anyway.
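To make the reading of that boot log a bit more mechanical, here is a minimal Python sketch that pulls the bracketed `[event.name:severity]: message` lines out of captured console text and flags the two plain-text lines that stopped this boot. The function names and the list of markers are my own assumptions for illustration, not anything NetApp ships, and wrapped continuation lines are simply ignored.

```python
import re

# Matches console lines of the form "[event.name:severity]: message".
EVENT_RE = re.compile(
    r"^\[(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)$"
)

# Plain-text lines from the boot log above that mean the filer cannot boot.
# This list is an assumption based on this one log, not an exhaustive set.
FATAL_MARKERS = ("WARNING: 0 disks found!", "No root volume found.")


def parse_events(log_text):
    """Return (event, severity, message) tuples for each bracketed line.

    Only the first physical line of a wrapped message is captured.
    """
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(
                (match["event"], match["severity"], match["message"])
            )
    return events


def find_boot_problems(log_text):
    """List error-severity events plus any known fatal plain-text lines."""
    problems = [
        f"{event}: {message}"
        for event, severity, message in parse_events(log_text)
        if severity == "error"
    ]
    for marker in FATAL_MARKERS:
        if marker in log_text:
            problems.append(marker)
    return problems


if __name__ == "__main__":
    sample = (
        "[fci.initialization.failed:error]: Initialization failed on\n"
        " Fibre Channel adapter 0b.\n"
        "[coredump.spare.none:info]: No sparecore disk was found.\n"
        " WARNING: 0 disks found!\n"
        "No root volume found.\n"
    )
    for problem in find_boot_problems(sample):
        print(problem)
```

Fed the log above, a sketch like this would surface the Fibre Channel adapter error alongside the missing disks and missing root volume, which are the lines that actually matter here.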

The solution

So, the filer doesn’t have any disks assigned to it and there are no volumes defined on those disks. Let’s fix that! The steps are listed here, followed by the console log:

  1. We’ll start out by pressing [CTRL] – [C] to enable the very special boot menu.
  2. Select option 4a to assign ownership of a couple of disks, initialize them and create a flexible root volume.

AMI BIOS8 Modular BIOS
Copyright (C) 1985-2006,  American Megatrends, Inc. All Rights Reserved
Portions Copyright (C) 2006 Network Appliance, Inc. All Rights Reserved
BIOS Version 3.0
+++++++++++++++

Boot Loader version 1.3
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2006 Network Appliance Inc.

CPU Type: Mobile Intel(R) Celeron(R) CPU 2.20GHz

Starting AUTOBOOT press Ctrl-C to abort...
Loading:.................0x200000/33899484 0x22543dc/31974692 0x40d2900
 /2557760 Entry at 0x00200000
Starting program at 0x00200000
cpuid 0x80000000: 0x80000004 0x0 0x0 0x0
Press CTRL-C for special boot menu
Special boot options menu will be available.

[fci.initialization.failed:error]: Initialization failed on
 Fibre Channel adapter 0b.

NetApp Release 7.2.6.1: Wed Dec 10 21:26:02 PST 2008
Copyright (c) 1992-2008 NetApp.
Starting boot on Mon Nov  2 15:57:08 GMT 2009

[nvram.battery.turned.on:info]: The NVRAM battery
 is turned ON. It is turned OFF during system shutdown.
[diskown.isEnabled:info]: software ownership has
 been enabled for this system

(1)  Normal boot.
(2)  Boot without /etc/rc.
(3)  Change password.
(4)  Assign ownership and initialize disks for root volume.
(4a) Same as option 4, but create a flexible root volume.
(5)  Maintenance mode boot.
Selection (1-5)? 4a

The system has 0 disks whereas it needs 3 to boot, will try to assign the
required number.

[diskown.changingOwner:info]: changing ownership for disk 0a.18
 from unowned (ID -1) to  (ID 135058219)
[diskown.changingOwner:info]: changing ownership for disk 0a.23
 from unowned (ID -1) to  (ID 135058219)
[diskown.changingOwner:info]: changing ownership for disk 0a.27
 from unowned (ID -1) to  (ID 135058219)

Zero disks and install a new file system? y
This will erase all the data on the disks, are you sure? y
Zeroing disks takes about 280 minutes.

[coredump.spare.none:info]: No sparecore disk was found.

[raid.disk.zero.done:notice]: Disk 0a.18 Shelf 1 Bay 2
 [X269_SMOOS01TSSX NA01]: disk zeroing complete
[raid.disk.zero.done:notice]: Disk 0a.27 Shelf 1 Bay 11
 [X269_SMOOS01TSSX NA01]: disk zeroing complete
[raid.disk.zero.done:notice]: Disk 0a.23 Shelf 1 Bay 7
 [X269_SMOOS01TSSX NA01]: disk zeroing complete

[raid.vol.disk.add.done:notice]: Addition of Disk
 /aggr0/plex0/rg0/0a.27 Shelf 1 Bay 11 [X269_SMOOS01TSSX NA01]
 to aggregate aggr0 has completed successfully
[raid.vol.disk.add.done:notice]: Addition of Disk
 /aggr0/plex0/rg0/0a.23 Shelf 1 Bay 7 [X269_SMOOS01TSSX NA01]
 to aggregate aggr0 has completed successfully
[raid.vol.disk.add.done:notice]: Addition of Disk
 /aggr0/plex0/rg0/0a.18 Shelf 1 Bay 2 [X269_SMOOS01TSSX NA01]
 to aggregate aggr0 has completed successfully

[wafl.vol.add:notice]: Aggregate aggr0 has been added to the system.
[fmmbx_instanceWorke:info]: no mailbox instance on local side
[fmmb.current.lock.disk:info]: Disk 0a.18 is a local HA mailbox disk.
[fmmb.current.lock.disk:info]: Disk 0a.23 is a local HA mailbox disk.
[fmmbx_instanceWorke:info]: normal mailbox instance on local side

After configuring the hostname, network settings for e0a and the bmc-adapter, I was ready to hit NetApp FilerView to configure the rest. But the web server that should be running FilerView wasn’t running. Back to the console log once more:
[init_java:warning]: Java disabled:  Missing /etc/java/rt131.jar.
[httpd_servlet:warning]: Java Virtual Machine not accessible
[asup.config.minimal.unavailable:warning]: Minimal Autosupports
 unavailable. Could not read /etc/asup_content.conf
[sysconfig.sysconfigtab.openFailed:notice]: sysconfig: table of
 valid configurations (/etc/sysconfigtab) is missing.

Hmm, it seems the filer is still missing a lot of files, all in the /etc folder, which should reside on the root volume. Most importantly, the Java bits needed to use FilerView weren’t there. It looks like the Data ONTAP installation still hadn’t been fully recovered. As I noticed this filer’s ONTAP version (7.2.6.1) was old anyway, I decided to upgrade the software in the usual fashion. I browsed to the NOW page for ONTAP v7.3.2 and checked out the release notes, the changed features page and the requirements for running this version, and read up on the known problems and limitations and the important cautions.

I was convinced that this release was the right one for me. As this was the first time I did an actual upgrade of ONTAP, I decided to read up on the Data ONTAP Upgrade Guide as well. While I was at it, I downloaded the complete set of documentation for Data ONTAP. Lastly, I grabbed the ‘System Files’ and ‘Netboot Image’.

Upgrading itself seemed easy: the filer wasn’t performing any tasks or I/O and wasn’t in any cluster. This meant I could simply yank out the power cables if I wanted to. Not that that’s such a good idea when upgrading firmware, but still. The second thing that made the upgrade easier was that the BIOS/NABL (v3.0), Diagnostics (v5.3.7) and the BMC controller firmware (v1.2) were all up to date.

Since the wipe-out of ONTAP by removing the SAS disks also destroyed any reference to licenses, I could not place the ONTAP upgrade software on the filer using CIFS or NFS, or even iSCSI or FCP. The only option I could use was to download it through HTTP. So I placed the System File onto an HTTP server, ready for the filer to pick up the bits and do its magic.
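If there is no web server handy, any machine with Python 3.7 or later can stand in for one. The following is a minimal sketch using only the standard library. The directory layout, port and function names are my own assumptions; I did not use this exact script at the customer.

```python
import functools
import http.server


def make_server(directory, port=8000, bind=""):
    """Build an HTTP server that serves files from `directory`.

    `directory` is assumed to contain the downloaded System File,
    for example in an `ontap/` subfolder to match the URL used below.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((bind, port), handler)


def serve(directory, port=8000):
    """Serve `directory` until interrupted with Ctrl-C."""
    server = make_server(directory, port)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

Calling `serve("/srv/www", port=8000)` (a hypothetical path) would expose `/srv/www/ontap/732_setup_e.exe` as `http://<host>:8000/ontap/732_setup_e.exe`. Binding to port 80, as the URL in the log below implies, usually requires elevated privileges. With the file reachable over HTTP, the filer can fetch it: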

flr2> software update http://webserver/ontap/732_setup_e.exe
software: copying to 732_setup_e.exe
software: 100% file read from location.
software: /etc/software/732_setup_e.exe has been copied.
software: installing software, this could take a few minutes...
software: installation of 732_setup_e.exe completed.

[rc:info]: software: installation of 732_setup_e.exe completed.
 Please type "download" to load the new software, and "reboot"
 subsequently for the changes to take effect.

flr2> download 
[download.request:notice]: Operator requested download initiated
download: Downloading boot device
download: If upgrading from a version of Data ONTAP prior to 7.3, please ensure
download: there is at least 3% of available space on each aggregate before upgrading.
 Additional information can be found in the release notes.

Version 1 ELF86 kernel detected.............
download: Downloading boot device (Service Area).........
[download.requestDone:notice]: Operator requested download completed

flr2> reboot 
[kern.shutdown:notice]: System shut down because : "reboot".

Wow, that was way easier than I expected. Copy the file to the filer, install the .exe image, activate the new code on the filer’s boot device and reboot the unit. During the reboot, a couple of messages about updating software on one of the shelves flew by:
[kern.version.change:notice]: Data ONTAP kernel version was changed from
 NetApp Release 7.2.6.1 to NetApp Release 7.3.2.
[sas.adapter.firmware.download:info]: Updating firmware on SAS adapter 0c
 from version 1.24.03.00 to version 1.26.03.00.
[dfu.firmwareUpToDate:info]: Firmware is up-to-date on all disk drives
[sfu.firmwareDownrev:warning]: Disk shelf firmware needs to be updated on
 1 disk shelf.
[sfu.ctrllerElmntsPerShelf:info]: [storage download shelf]: 1 ES
 controller element can be updated on 0c.shelf0.
[sfu.downloadStarted:info]: Update of disk shelf firmware started
 on 1 shelf.
[sfu.downloadingController:info]: [storage download shelf]: Downloading
 SAS.0500.SFW on disk shelf controller module B on 0c.shelf0.
[sfu.suspendDiskIO:info]: Suspending disk I/O to reboot shelf modules for
 shelf firmware update.
[sfu.rebootRequest:info]: Issuing a request to reboot disk shelf
 0c.shelf0 module B.
[ses.shelf.fw.update.pfu.nrdnt:info]: SCSI Enclosure Services (SES)
 shelf firmware update will be performed using Package Firmware
 Update (PFU) due to not having redundant paths.
[sfu.adapterSuspendIO:info]: Suspending I/O to SAS adapter 0c for
 40 seconds while shelf firmware is updated.
[sfu.resumeDiskIO:info]: Resuming disk I/O after shelf firmware update.
[sfu.downloadSuccess:info]: [storage download shelf]: Firmware file
 SAS.0500.SFW downloaded on 0c.shelf0.
[sfu.downloadSummary:info]: Shelf firmware updated on 1 shelf.
[sfu.firmwareUpToDate:info]: Firmware is up-to-date on all disk shelves.

Nothing shocking to notice here, just confirmation that the ONTAP kernel version has been upgraded, that the firmware on SAS adapter 0c was updated, and that the firmware on the shelf controller module of 0c.shelf0 was updated and the module rebooted afterwards. Also, good to see the extra confirmation that all disk shelves and disk drives have current firmware versions after upgrading!
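For anyone who wants to keep a record of what changed during such a reboot, here is a small Python sketch that extracts the "from ... to ..." version pairs from console output like the log above. It assumes the two phrasings seen in this log ("version X" and "NetApp Release X") and that wrapped lines start with whitespace. The function name is mine.

```python
import re

# Matches "from version A to version B" and
# "from NetApp Release A to NetApp Release B" once wrapped lines are joined.
CHANGE_RE = re.compile(
    r"from (?:version|NetApp Release) (\S+) "
    r"to (?:version|NetApp Release) (\S+)"
)


def version_changes(log_text):
    """Return (old, new) version pairs found in console output."""
    # Re-join messages that the console wrapped onto an indented line.
    joined = re.sub(r"\n[ \t]+", " ", log_text)
    return [
        (old.rstrip("."), new.rstrip("."))
        for old, new in CHANGE_RE.findall(joined)
    ]


if __name__ == "__main__":
    sample = (
        "[kern.version.change:notice]: Data ONTAP kernel version was "
        "changed from\n"
        " NetApp Release 7.2.6.1 to NetApp Release 7.3.2.\n"
        "[sas.adapter.firmware.download:info]: Updating firmware on SAS "
        "adapter 0c\n"
        " from version 1.24.03.00 to version 1.26.03.00.\n"
    )
    for old, new in version_changes(sample):
        print(old, "->", new)
```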

Concluding

Don’t ever take out the disks if the root volume is running on them. Period.

If you do, like this customer did, you’ll have yourself a couple of hours o’ fun restoring and upgrading ONTAP, as well as a heart-to-heart with the people who removed them in the first place, which is likely to take up the remainder of the day.

:)