This chapter describes how to troubleshoot DiskSuite.
Use the descriptions that follow to proceed directly to the section that provides step-by-step instructions for a particular task.
This chapter describes some DiskSuite problems and their appropriate solution. It is not intended to be all-inclusive but rather to present common scenarios and recovery procedures.
Here are the prerequisites for the steps in this section:
Have the following information on hand when troubleshooting a DiskSuite problem:
The /etc/opt/SUNWmd/md.cf file is a backup file of the DiskSuite configuration for a "local" diskset. Whenever you make a configuration change, the md.cf file is automatically updated (except for hot sparing). You never edit the md.cf file directly.
If your system loses the information maintained in the metadevice state database, and as long as no metadevices were created or changed in the meantime, you can use the md.cf file to recover your DiskSuite configuration.
Note - The md.cf file does not maintain information on active hot spares. Thus, if hot spares were in use when the DiskSuite configuration was lost, those metadevices that were hot-spared will likely be corrupted.
Refer to Chapter 1, "Getting Started," for information on creating state database replicas.
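As a rough sketch of the recovery flow (assuming the standard DiskSuite 4.x file locations; adapt the paths if your installation differs), you would copy the md.cf backup into the md.tab file that metainit reads, review the copied entries, and then run the two metainit commands shown below, first with -n to verify the configuration without committing it:
---------------------------------------------------
# cp /etc/opt/SUNWmd/md.cf /etc/opt/SUNWmd/md.tab
---------------------------------------------------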
-------------------
# metainit -n -a
-------------------
----------------
# metainit -a
----------------
By default, the DiskSuite configuration supports 128 metadevices and state database replicas sized at 1034 blocks. The default number of disksets is four. All of these values can be changed if necessary, and the tasks in this section tell you how.
This task describes how to increase the number of metadevices from the default value of 128.
------------
# boot -r
------------
Here is a sample md.conf file configured for 256 metadevices.
------------------------------------------------------------
#
#ident "@(#)md.conf 1.7     94/04/04 SMI"
#
# Copyright (c) 1992, 1993, 1994 by Sun Microsystems, Inc.
#
name="md" parent="pseudo" nmd=256 md_nsets=4;
------------------------------------------------------------
The default number of disksets for a system is 4. If you need to configure more than the default, you can increase this value up to 32. The number of shared disksets is always one less than the md_nsets value, because the local set is included in md_nsets.
This task shows you how to increase the number of disksets from the default value of 4.
------------
# boot -r
------------
Here is a sample md.conf file configured for five disksets. The value of md_nsets is six, which allows five shared disksets in addition to the local diskset.
------------------------------------------------------------
#
#ident "@(#)md.conf 1.7     94/04/04 SMI"
#
# Copyright (c) 1992, 1993, 1994 by Sun Microsystems, Inc.
#
name="md" parent="pseudo" nmd=255 md_nsets=6;
------------------------------------------------------------
After checking the prerequisites and reading the preliminary information, use the metadb command to add larger state database replicas, then delete the old, smaller state database replicas. Refer to the metadb(1M) man page for more information.
----------------------------------------------------------
# metadb -a -l 2068 c1t0d0s3 c1t1d0s3 c2t0d0s3 c2t1d0s3
# metadb -d c1t0d0s7 c1t1d0s7 c2t0d0s7 c2t1d0s7
----------------------------------------------------------
The first metadb command adds state database replicas whose size is specified by the -l 2068 option (2068 blocks). This is double the default replica size of 1034 blocks. The second metadb command removes those smaller state database replicas from the system.
When DiskSuite encounters a problem, such as being unable to write to a metadevice due to physical errors at the slice level, it changes the status of the metadevice, for example, to "Maintenance." However, unless you are constantly looking at DiskSuite Tool or running metastat(1M), you may never see these status changes in a timely fashion.
There are two ways that you can automatically check for DiskSuite errors:
The first method is described in "Integrating SNMP Alerts With DiskSuite."
The following section describes the kind of script you can use to check for DiskSuite errors.
One way to continually and automatically check for a bad slice in a metadevice is to write a script that is invoked by cron. Here is an example:
--------------------------------------------------------------------------------
#
#ident "@(#)metacheck.sh 1.3     96/06/21 SMI"
#
# Copyright (c) 1992, 1993, 1994, 1995, 1996 by Sun Microsystems, Inc.
#
#
# DiskSuite Commands
#
MDBIN=/usr/opt/SUNWmd/sbin
METADB=${MDBIN}/metadb
METAHS=${MDBIN}/metahs
METASTAT=${MDBIN}/metastat

#
# System Commands
#
AWK=/usr/bin/awk
DATE=/usr/bin/date
MAILX=/usr/bin/mailx
RM=/usr/bin/rm

#
# Initialization
#
eval=0
date=`${DATE} '+%a %b %e %Y'`
SDSTMP=/tmp/sdscheck.${$}
${RM} -f ${SDSTMP}
MAILTO=${*:-"root"}        # default to root, or use arg list

#
# Check replicas for problems, capital letters in the flags indicate an error.
#
dbtrouble=`${METADB} | tail +2 | \
    ${AWK} '{ fl = substr($0,1,20); if (fl ~ /[A-Z]/) print $0 }'`
if [ "${dbtrouble}" ]; then
    echo "" >>${SDSTMP}
    echo "SDS replica problem report for ${date}" >>${SDSTMP}
    echo "" >>${SDSTMP}
    echo "Database replicas are not active:" >>${SDSTMP}
    echo "" >>${SDSTMP}
    ${METADB} -i >>${SDSTMP}
    eval=1
fi

#
# Check the metadevice state, if the state is not Okay, something is up.
#
mdtrouble=`${METASTAT} | \
    ${AWK} '/State:/ { if ( $2 != "Okay" ) print $0 }'`
if [ "${mdtrouble}" ]; then
    echo "" >>${SDSTMP}
    echo "SDS metadevice problem report for ${date}" >>${SDSTMP}
    echo "" >>${SDSTMP}
    echo "Metadevices are not Okay:" >>${SDSTMP}
    echo "" >>${SDSTMP}
    ${METASTAT} >>${SDSTMP}
    eval=1
fi

#
# Check the hotspares to see if any have been used.
#
hstrouble=`${METAHS} -i | \
    ${AWK} '/blocks/ { if ( $2 != "Available" ) print $0 }'`
if [ "${hstrouble}" ]; then
    echo "" >>${SDSTMP}
    echo "SDS Hot spares in use ${date}" >>${SDSTMP}
    echo "" >>${SDSTMP}
    echo "Hot spares in usage:" >>${SDSTMP}
    echo "" >>${SDSTMP}
    ${METAHS} -i >>${SDSTMP}
    eval=1
fi

#
# If any errors occurred, then mail the report to root, or whoever was called
# out in the command line.
#
if [ ${eval} -ne 0 ]; then
    ${MAILX} -s "SDS problems ${date}" ${MAILTO} <${SDSTMP}
    ${RM} -f ${SDSTMP}
fi

exit ${eval}
--------------------------------------------------------------------------------
For information on invoking scripts in this way, refer to the cron(1M) man page.
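For example, root's crontab might contain an entry along the following lines to run the check hourly and mail any problems to root (the script path and schedule are placeholders; adjust them for your site):
---------------------------------------------------
0 * * * * /local/scripts/metacheck.sh root
---------------------------------------------------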
Note - This script serves as a starting point for automating DiskSuite error checking. You may need to modify it for your own configuration.
Because DiskSuite enables you to mirror root (/), swap, and /usr, special problems can arise when you boot the system, either through hardware or operator error. The tasks in this section are solutions to such potential problems.
Table 7-1 describes these problems and points you to the appropriate solution.
Table 7-1 Common DiskSuite Boot Problems
The System Does Not Boot Because ... | Refer To ... |
---|---|
The /etc/vfstab file contains incorrect information. | "How to Recover From Improper /etc/vfstab Entries (Command Line)" |
There are not enough state database replicas. | "How to Recover From Insufficient State Database Replicas (Command Line)" |
A boot device (disk) has failed. | "How to Recover From a Boot Device Failure (Command Line)" |
The boot mirror has failed. | "SPARC: How to Boot From the Alternate Device (Command Line)" or "x86: How to Boot From the Alternate Device (Command Line)" |
If you have made an incorrect entry in the /etc/vfstab file, for example, when mirroring root (/), the system will at first appear to boot properly and then fail. To remedy this situation, you need to edit /etc/vfstab while in single-user mode.
The high-level steps to recover from improper /etc/vfstab file entries are:
In the following example, root (/) is mirrored with a two-way mirror, d0. The root (/) entry in /etc/vfstab has somehow reverted back to the original slice of the file system, but the information in /etc/system still shows booting to be from the mirror d0. The most likely reason is that the metaroot(1M) command was not used to maintain /etc/system and /etc/vfstab, or an old copy of /etc/vfstab was copied back.
The incorrect /etc/vfstab file would look something like the following:
---------------------------------------------------------------------------------
#device            device              mount    FS     fsck   mount    mount
#to mount          to fsck             point    type   pass   at boot  options
#
/dev/dsk/c0t3d0s0  /dev/rdsk/c0t3d0s0  /        ufs    1      no       -
/dev/dsk/c0t3d0s1  -                   -        swap   -      no       -
/dev/dsk/c0t3d0s6  /dev/rdsk/c0t3d0s6  /usr     ufs    2      no       -
#
/proc              -                   /proc    proc   -      no       -
fd                 -                   /dev/fd  fd     -      no       -
swap               -                   /tmp     tmpfs  -      yes      -
---------------------------------------------------------------------------------
Because of the errors, you automatically go into single-user mode when the machine is booted:
-------------------------------------------------------------------
ok boot
...
SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
Copyright (c) 1983-1995, Sun Microsystems, Inc.
configuring network interfaces: le0.
Hostname: antero
mount: /dev/dsk/c0t3d0s0 is not this fstype.
setmnt: Cannot open /etc/mnttab for writing
INIT: Cannot create /var/adm/utmp or /var/adm/utmpx
INIT: failed write of utmpx entry:" "
INIT: failed write of utmpx entry:" "
INIT: SINGLE USER MODE
Type Ctrl-d to proceed with normal startup,
(or give root password for system maintenance): <root-password>
-------------------------------------------------------------------
At this point, root (/) and /usr are mounted read-only. Follow these steps:
Note - Be careful to use the correct metadevice for root.
--------------------------------------------------------------
# fsck /dev/md/rdsk/d0
** /dev/md/rdsk/d0
** Currently Mounted on /
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
2274 files, 11815 used, 10302 free (158 frags, 1268 blocks, 0.7% fragmentation)
--------------------------------------------------------------
---------------------------------------------------------
# mount -o rw,remount /dev/md/dsk/d0 /
mount: warning: cannot lock temp file </etc/.mnt.lock>
---------------------------------------------------------
--------------------------------
# /usr/opt/SUNWmd/metaroot d0
--------------------------------
This edits the /etc/system and /etc/vfstab files to specify that the root (/) file system is now on metadevice d0.
The root (/) entry in the /etc/vfstab file should appear as follows so that the entry for the file system correctly references the mirror:
---------------------------------------------------------------------------------
#device            device              mount    FS     fsck   mount    mount
#to mount          to fsck             point    type   pass   at boot  options
#
/dev/md/dsk/d0     /dev/md/rdsk/d0     /        ufs    1      no       -
/dev/dsk/c0t3d0s1  -                   -        swap   -      no       -
/dev/dsk/c0t3d0s6  /dev/rdsk/c0t3d0s6  /usr     ufs    2      no       -
#
/proc              -                   /proc    proc   -      no       -
fd                 -                   /dev/fd  fd     -      no       -
swap               -                   /tmp     tmpfs  -      yes      -
---------------------------------------------------------------------------------
The system returns to normal operation.
If for some reason the state database replica quorum is not met, for example, due to a drive failure, the system cannot be rebooted. In DiskSuite terms, the state database has gone "stale." This task explains how to recover.
The high-level steps in this task are:
In the following example, a disk containing two replicas has gone bad. This leaves the system with only two good replicas, and the system cannot reboot.
-------------------------------------------------------------------
ok boot
...
Hostname: demo
metainit: demo: stale databases

Insufficient metadevice database replicas located.

Use metadb to delete databases which are broken.
Ignore any "Read-only file system" error messages.
Reboot the system when finished to reload the metadevice database.
After reboot, repair any broken database replicas which were deleted.

Type Ctrl-d to proceed with normal startup,
(or give root password for system maintenance): <root-password>
Entering System Maintenance Mode

SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
-------------------------------------------------------------------
-----------------------------------------------------------------------------
# /usr/opt/SUNWmd/metadb -i
        flags           first blk       block count
     a m  p  lu         16              1034            /dev/dsk/c0t3d0s3
     a    p  l          1050            1034            /dev/dsk/c0t3d0s3
     M    p             unknown         unknown         /dev/dsk/c1t2d0s3
     M    p             unknown         unknown         /dev/dsk/c1t2d0s3
...
-----------------------------------------------------------------------------
The system can no longer detect state database replicas on slice /dev/dsk/c1t2d0s3, which is part of the failed disk. The metadb command flags the replicas on this slice as having a problem with the master blocks.
At this point, the root (/) file system is read-only. You can ignore the mddb.cf error messages:
------------------------------------------------------------
# /usr/opt/SUNWmd/metadb -d -f c1t2d0s3
metadb: demo: /etc/opt/SUNWmd/mddb.cf.new: Read-only file system
------------------------------------------------------------
---------------------------------------------------------------------------
# /usr/opt/SUNWmd/metadb -i
        flags           first blk       block count
     a m  p  lu         16              1034            /dev/dsk/c0t3d0s3
     a    p  l          1050            1034            /dev/dsk/c0t3d0s3
---------------------------------------------------------------------------
------------------------------
# halt
...
boot
...
# format /dev/rdsk/c1t2d0s0
...
------------------------------
---------------------------------------------------------------
# /usr/opt/SUNWmd/metadb -a -c 2 c1t2d0s3
# /usr/opt/SUNWmd/metadb
        flags           first blk       block count
     a m  p  luo        16              1034            /dev/dsk/c0t3d0s3
     a    p  luo        1050            1034            /dev/dsk/c0t3d0s3
     a       u          16              1034            /dev/dsk/c1t2d0s3
     a       u          1050            1034            /dev/dsk/c1t2d0s3
---------------------------------------------------------------
The metadb command with the -c 2 option adds two state database replicas to the same slice.
If you have a root (/) mirror and your boot device fails, you'll need to set up an alternate boot device.
The high-level steps in this task are:
In the following example, the boot device containing two of the six state database replicas and the root (/), swap, and /usr submirrors fails.
Initially, when the boot device fails, you'll see a message similar to the following. This message may differ among various architectures.
-------------------------------------------------------------------------------
Rebooting with command:
Boot device: /iommu/sbus/dma@f,81000/esp@f,80000/sd@3,0   File and args: kadb
kadb: kernel/unix
The selected SCSI device is not responding
Can't open boot device
...
-------------------------------------------------------------------------------
When you see this message, note the device. Then, follow these steps:
Since only two of the six state database replicas in this example are in error, you can still boot. If this were not the case, you would need to delete the stale state database replicas in single-user mode. This procedure is described in "How to Recover From Insufficient State Database Replicas (Command Line)."
When you created the mirror for the root (/) file system, you should have recorded the alternate boot device as part of that procedure. In this example, disk2 is that alternate boot device.
------------------------------------------------------------------
ok boot disk2
...
SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
Copyright (c) 1983-1995, Sun Microsystems, Inc.
Hostname: demo
...
demo console login: root
Password: <root-password>
Last login: Wed Dec 16 13:15:42 on console
SunOS Release 5.1 Version Generic [UNIX(R) System V Release 4.0]
...
------------------------------------------------------------------
------------------------------------------------------------------
# /usr/opt/SUNWmd/metadb
        flags           first blk       block count
     M    p             unknown         unknown         /dev/dsk/c0t3d0s3
     M    p             unknown         unknown         /dev/dsk/c0t3d0s3
     a m  p  luo        16              1034            /dev/dsk/c0t2d0s3
     a    p  luo        1050            1034            /dev/dsk/c0t2d0s3
     a    p  luo        16              1034            /dev/dsk/c0t1d0s3
     a    p  luo        1050            1034            /dev/dsk/c0t1d0s3
------------------------------------------------------------------
The system can no longer detect state database replicas on slice /dev/dsk/c0t3d0s3, which is part of the failed disk.
----------------------------------------------------------------
# /usr/opt/SUNWmd/metastat
d0: Mirror
    Submirror 0: d10
      State: Needs maintenance
    Submirror 1: d20
      State: Okay
...

d10: Submirror of d0
    State: Needs maintenance
    Invoke: "metareplace d0 /dev/dsk/c0t3d0s0 <new device>"
    Size: 47628 blocks
    Stripe 0:
        Device              Start Block  Dbase  State        Hot Spare
        /dev/dsk/c0t3d0s0          0     No     Maintenance

d20: Submirror of d0
    State: Okay
    Size: 47628 blocks
    Stripe 0:
        Device              Start Block  Dbase  State        Hot Spare
        /dev/dsk/c0t2d0s0          0     No     Okay

d1: Mirror
    Submirror 0: d11
      State: Needs maintenance
    Submirror 1: d21
      State: Okay
...

d11: Submirror of d1
    State: Needs maintenance
    Invoke: "metareplace d1 /dev/dsk/c0t3d0s1 <new device>"
    Size: 69660 blocks
    Stripe 0:
        Device              Start Block  Dbase  State        Hot Spare
        /dev/dsk/c0t3d0s1          0     No     Maintenance

d21: Submirror of d1
    State: Okay
    Size: 69660 blocks
    Stripe 0:
        Device              Start Block  Dbase  State        Hot Spare
        /dev/dsk/c0t2d0s1          0     No     Okay

d2: Mirror
    Submirror 0: d12
      State: Needs maintenance
    Submirror 1: d22
      State: Okay
...

d12: Submirror of d2
    State: Needs maintenance
    Invoke: "metareplace d2 /dev/dsk/c0t3d0s6 <new device>"
    Size: 286740 blocks
    Stripe 0:
        Device              Start Block  Dbase  State        Hot Spare
        /dev/dsk/c0t3d0s6          0     No     Maintenance

d22: Submirror of d2
    State: Okay
    Size: 286740 blocks
    Stripe 0:
        Device              Start Block  Dbase  State        Hot Spare
        /dev/dsk/c0t2d0s6          0     No     Okay
----------------------------------------------------------------
In this example, the metastat output shows that the following submirrors need maintenance: d10 (/dev/dsk/c0t3d0s0), d11 (/dev/dsk/c0t3d0s1), and d12 (/dev/dsk/c0t3d0s6).
------------------------------
# halt
...
Halted
...
ok boot
...
# format /dev/rdsk/c0t3d0s0
------------------------------
Note that you must reboot from the other half of the root (/) mirror. You should have recorded the alternate boot device when you created the mirror.
----------------
# halt
...
ok boot disk2
----------------
------------------------------------------------------------------
# /usr/opt/SUNWmd/metadb
        flags           first blk       block count
     M    p             unknown         unknown         /dev/dsk/c0t3d0s3
     M    p             unknown         unknown         /dev/dsk/c0t3d0s3
     a m  p  luo        16              1034            /dev/dsk/c0t2d0s3
     a    p  luo        1050            1034            /dev/dsk/c0t2d0s3
     a    p  luo        16              1034            /dev/dsk/c0t1d0s3
     a    p  luo        1050            1034            /dev/dsk/c0t1d0s3
# /usr/opt/SUNWmd/metadb -d c0t3d0s3
# /usr/opt/SUNWmd/metadb -c 2 -a c0t3d0s3
# /usr/opt/SUNWmd/metadb
        flags           first blk       block count
     a m  p  luo        16              1034            /dev/dsk/c0t2d0s3
     a    p  luo        1050            1034            /dev/dsk/c0t2d0s3
     a    p  luo        16              1034            /dev/dsk/c0t1d0s3
     a    p  luo        1050            1034            /dev/dsk/c0t1d0s3
     a       u          16              1034            /dev/dsk/c0t3d0s3
     a       u          1050            1034            /dev/dsk/c0t3d0s3
------------------------------------------------------------------
-----------------------------------------------
# /usr/opt/SUNWmd/metareplace -e d0 c0t3d0s0
Device /dev/dsk/c0t3d0s0 is enabled
# /usr/opt/SUNWmd/metareplace -e d1 c0t3d0s1
Device /dev/dsk/c0t3d0s1 is enabled
# /usr/opt/SUNWmd/metareplace -e d2 c0t3d0s6
Device /dev/dsk/c0t3d0s6 is enabled
-----------------------------------------------
After some time, the resyncs will complete. You can now return to booting from the original device.
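To watch the resync progress from the command line, you can rerun metastat on a mirror and look for the resync percentage (a quick check using the metadevice names from this example; the percentage shown is illustrative):
------------------------------------------------
# /usr/opt/SUNWmd/metastat d0 | grep -i resync
    Resync in progress: 15 % done
------------------------------------------------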
When mirroring root (/), you might need the path to the alternate boot device later if the primary device fails.
In this example, you would determine the path to the alternate root device by using the ls -l command on the slice that is being attached as the second submirror to the root (/) mirror.
--------------------------------------------------------------------
# ls -l /dev/rdsk/c1t3d0s0
lrwxrwxrwx  1 root  root  55 Mar  5 12:54 /dev/rdsk/c1t3d0s0 ->
../../devices/sbus@1,f8000000/esp@1,200000/sd@3,0:a
--------------------------------------------------------------------
Here you would record the string that follows the /devices directory: /sbus@1,f8000000/esp@1,200000/sd@3,0:a
On systems with the OpenBoot PROM, DiskSuite users can use the nvalias command to define a "backup root" devalias for the secondary root mirror. For example:
-----------------------------------------------------------------
ok nvalias backup_root /sbus@1,f8000000/esp@1,200000/sd@3,0:a
-----------------------------------------------------------------
In the event of a primary root disk failure, you would then only need to enter:
-----------------------
ok boot backup_root
-----------------------
In this example, you would determine the path to the alternate boot device by using the ls -l command on the slice that is being attached as the second submirror to the root (/) mirror.
--------------------------------------------------------------------
# ls -l /dev/rdsk/c1t0d0s0
lrwxrwxrwx  1 root  root  55 Mar  5 12:54 /dev/rdsk/c1t0d0s0 ->
../../devices/eisa/eha@1000,0/cmdk@1,0:a
--------------------------------------------------------------------
Here you would record the string that follows the /devices directory: /eisa/eha@1000,0/cmdk@1,0:a
To boot a SPARC system from the alternate boot device, type:
--------------------------------
ok boot alternate-boot-device
--------------------------------
The procedure, "How to Record the Path to the Alternate Boot Device (Command Line)," describes how to determine the alternate boot device.
Use this task to boot an x86 system from the alternate boot device.
After a moment, a screen similar to the following is displayed:
--------------------------------------------------------------
Solaris/x86 Multiple Device Boot Menu

Code   Device   Vendor     Model/Desc                Rev
============================================================
 10    DISK     COMPAQ     C2244                     0BC4
 11    DISK     SEAGATE    ST11200N SUN1.05          8808
 12    DISK     MAXTOR     LXT-213S SUN0207          4.24
 13    CD       SONY       CD-ROM CDU-8812           3.0a
 14    NET      SMC/WD     I/O=300 IRQ=5
 80    DISK     First IDE drive (Drive C:)
 81    DISK     Second IDE drive (Drive D:)

Enter the boot device code:
--------------------------------------------------------------
----------------------------------------------------------------------
Solaris 2.4 for x86            Secondary Boot Subsystem, vsn 2.11

                    <<<Current Boot Parameters>>>
Boot path: /eisa/eha@1000,0/cmdk@0,0:a
Boot args: /kernel/unix

Type  b [file-name] [boot-flags] <ENTER>   to boot with options
or    i <ENTER>                            to enter boot interpreter
or    <ENTER>                              to boot with defaults

                    <<<timeout in 5 seconds>>>
----------------------------------------------------------------------
-------------------------------------------------
>setprop boot-path /eisa/eha@1000,0/cmdk@1,0:a
>^D
-------------------------------------------------
The Control-D character sequence quits the interpreter.
This section describes how to replace SCSI disks that are not part of a SPARCstorage Array in a DiskSuite environment.
The high-level steps to replace a SCSI disk that is not part of a SPARCstorage Array are:
For a simple slice: use normal recovery procedures
For a stripe or concatenation: newfs entire metadevice, restore from backup
For a mirror: reattach detached submirrors
For a RAID5 metadevice: resync (enable) affected slices
For a trans metadevice: run fsck(1M)
Errors may be reported for the replicas located on the failed disk. In this example, c0t1d0 is the problem device.
-----------------------------------------------------------------------
# metadb
        flags           first blk       block count
     a m     u          16              1034            /dev/dsk/c0t0d0s4
     a       u          1050            1034            /dev/dsk/c0t0d0s4
     a       u          2084            1034            /dev/dsk/c0t0d0s4
     W   pc luo         16              1034            /dev/dsk/c0t1d0s4
     W   pc luo         1050            1034            /dev/dsk/c0t1d0s4
     W   pc luo         2084            1034            /dev/dsk/c0t1d0s4
-----------------------------------------------------------------------
The output above shows three state database replicas on Slice 4 of each of the local disks, c0t0d0 and c0t1d0. The W in the flags field of the c0t1d0s4 slice indicates that the device has write errors. Three replicas on the c0t0d0s4 slice are still good.
Caution -
If, after deleting the bad state database replicas, you are left with three or fewer, add more state database replicas before continuing. This ensures that your system reboots correctly. The number of replicas on a slice is obtained by counting the number of times that slice appears in the metadb output in Step 2. In this example, the three state database replicas that exist on c0t1d0s4 are deleted.
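One quick way to count the replicas on a given slice is to count its appearances in the metadb output, for example (a simple sketch using the slice from this example):
---------------------------------
# metadb | grep -c c0t1d0s4
3
---------------------------------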
-----------------------
# metadb -d c0t1d0s4
-----------------------
The metastat command can show the affected mirrors. In this example, one submirror, d10, is also using c0t1d0s4. The mirror is d20.
--------------------------------
# metadetach d20 d10
d20: submirror d10 is detached
--------------------------------
------------------------------
# metahs -d hsp000 c0t1d0s6
hsp000: Hotspare is deleted
------------------------------
---------------
# halt
...
ok boot -s
...
---------------
Use the format(1M) command or the fmthard(1M) command to partition the disk with the same slice information as the failed disk.
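For example, if the failed disk was partitioned identically to a surviving disk of the same type, you can copy the surviving disk's VTOC instead of retyping the slice table in format (a sketch assuming c0t0d0 is the surviving disk and c0t1d0 is the replacement, as in this example):
-----------------------------------------------------------------
# prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
-----------------------------------------------------------------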
In this example, /dev/dsk/c0t1d0s4 is used.
---------------------------
# metadb -a -c 3 c0t1d0s4
---------------------------
Table 7-2 SCSI Disk Replacement Decision Table
Type of Device | Do the Following ... |
---|---|
Slice | Use normal data recovery procedures. |
Unmirrored Stripe or Concatenation | If the stripe/concat is used for a file system, run newfs(1M), mount the file system, then restore data from backup. If the stripe/concat is used by an application that accesses the raw device, that application must have its own recovery procedures. |
Mirror (Submirror) | Run metattach(1M) to reattach a detached submirror. |
RAID5 metadevice | Run metareplace(1M) to re-enable the slice. This causes the resyncs to start. |
Trans metadevice | Run fsck(1M) to repair the trans metadevice. |
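For instance, continuing this example, reattaching the detached submirror and re-enabling a slice in a RAID5 metadevice would look something like the following (the RAID5 metadevice name d45 and slice c0t1d0s2 are hypothetical):
---------------------------------
# metattach d20 d10
# metareplace -e d45 c0t1d0s2
---------------------------------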
------------------------------
# metahs -a hsp000 c0t0d0s6
hsp000: Hotspare is added
------------------------------
Check the user/application data on all metadevices. You may have to run an application-level consistency checker or use some other method to check the data.
This section describes how to troubleshoot SPARCstorage Arrays using DiskSuite. The tasks in this section include:
The SPARCstorage Array should be installed according to the SPARCstorage Array Software instructions found with the SPARCstorage Array CD. The SPARCstorage Array Volume Manager need not be installed if you are only using DiskSuite.
DiskSuite accesses SPARCstorage Array disks exactly like any other disks, with one important exception: the disk names differ from non-SPARCstorage Array disks.
The SPARCstorage Array 100 disk naming convention is:
c[0-n]t[0-5]d[0-4]s[0-7]
In this name:
The SPARCstorage Array 200 disk naming convention is:
c[0-n]t[0-5]d[0-6]s[0-7]
In this name:
Note - Older trays hold up to six disks; newer trays can hold up to seven.
The main difference between the SSA100 and SSA200 is that the SSA100 arranges pairs of targets into a tray, whereas the SSA200 has a separate tray for each target.
The SPARCstorage Array components that can be replaced include the disks, fan tray, battery, tray, power supply, backplane, controller, optical module, and fibre channel cable.
Some of the SPARCstorage Array components can be replaced without powering down the SPARCstorage Array. Other components require the SPARCstorage Array to be powered off. Consult the SPARCstorage Array documentation for details.
To replace SPARCstorage Array components that require power off without interrupting services, you perform the steps necessary for tray removal for all trays in the SPARCstorage Array before turning off the power. This includes taking submirrors offline, deleting hot spares from hot spare pools, deleting state database replicas from drives, and spinning down the trays.
After these preparations, the SPARCstorage Array can be powered down and the components replaced.
Note - Because the SPARCstorage Array controller contains a unique World Wide Name, which identifies it to Solaris, special procedures apply for SPARCstorage Array controller replacement. Contact your service provider for assistance.
The steps to replace a SPARCstorage Array disk in a DiskSuite environment depend a great deal on how the slices on the disk are being used, and how the disks are cabled to the system. They also depend on whether the disk slices are being used as is, or by DiskSuite, or both.
Note - This procedure applies to a SPARCstorage Array 100. The steps to replace a disk in a SPARCstorage Array 200 are similar.
The high-level steps in this task are:
Note - You can use this procedure if a submirror is in the "Maintenance" state, replaced by a hot spare, or is generating intermittent errors.
To locate and replace the disk, perform the following steps:
----------------------------------------------------------------------------
# metastat
...
d50: Submirror of d40
     State: Needs Maintenance
...
# tail -f /var/adm/messages
...
Jun 1 16:15:26 host1 unix: WARNING: /io-
unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):
Jun 1 16:15:26 host1 unix: Error for command `write(I))' Err
Jun 1 16:15:27 host1 unix: or Level: Fatal
Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
Jun 1 16:15:27 host1 unix: Sense Key: Media Error
Jun 1 16:15:27 host1 unix: Vendor `CONNER':
Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15
...
----------------------------------------------------------------------------
The metastat command shows that a submirror is in the "Needs Maintenance" state. The /var/adm/messages file reports a disk drive that has an error. To locate the disk drive, use the ls command as follows, matching the symbolic link name to that from the /var/adm/messages output.
----------------------------------------------------------------------------
# ls -l /dev/rdsk/*
...
lrwxrwxrwx  1 root  root  90 Mar  4 13:26 /dev/rdsk/c3t3d4s0 ->
../../devices/io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49)
...
----------------------------------------------------------------------------
Based on the above information and metastat output, it is determined that drive c3t3d4 must be replaced.
To find the SPARCstorage Array tray where the problem disk resides, use the Disk View window.
The Disk View window shows the logical to physical device mappings by coloring the physical slices that make up the metadevice. You can see at a glance which tray contains the problem disk.
--------------------------------------------------------------------
host1# ssaadm display c3
                    SPARCstorage Array Configuration
Controller path:
/devices/io-unit@f,e1200000/sbi@0.0/SUNW,soc@0,0/SUNW,pln@a0000000,741022:ctlr

                          DEVICE STATUS
                TRAY1          TRAY2          TRAY3
Slot
1               Drive:0,0      Drive:2,0      Drive:4,0
2               Drive:0,1      Drive:2,1      Drive:4,1
3               Drive:0,2      Drive:2,2      Drive:4,2
4               Drive:0,3      Drive:2,3      Drive:4,3
5               Drive:0,4      Drive:2,4      Drive:4,4
6               Drive:1,0      Drive:3,0      Drive:5,0
7               Drive:1,1      Drive:3,1      Drive:5,1
8               Drive:1,2      Drive:3,2      Drive:5,2
9               Drive:1,3      Drive:3,3      Drive:5,3
10              Drive:1,4      Drive:3,4      Drive:5,4

                          CONTROLLER STATUS
Vendor:        SUNW
Product ID:    SSA100
Product Rev:   1.0
Firmware Rev:  2.3
Serial Num:    000000741022
Accumulate performance Statistics: Enabled
--------------------------------------------------------------------
The ssaadm output for controller (c3) shows that Drive 3,4 (c3t3d4) is the closest to you when you pull out the middle tray.
The following commands locate drive c3t3d4. Note that no output was displayed when the command was run with logicalhost2, but logicalhost1 reported that the name was present. In the reported output, the yes field indicates that the disk contains a state database replica.
-----------------------------------------------
host1# metaset -s logicalhost2 | grep c3t3d4
host1# metaset -s logicalhost1 | grep c3t3d4
        c3t3d4      yes
-----------------------------------------------
Note - If you are using Solstice HA servers, you'll need to switch ownership of both logical hosts to one Solstice HA server. Refer to the Solstice HA documentation.
Because you must pull the tray to replace the disk, determine what other objects will be affected in the process.
Record all the information about the hot spares so they can be added back to the hot spare pools following the replacement procedure.
There may be multiple replicas on the same disk. Make sure you record the number of replicas deleted from each slice.
This forces DiskSuite to stop using the submirror slices in the tray so that the drives can be spun down.
To remove objects, refer to Chapter 5, "Removing DiskSuite Objects." To detach and offline submirrors, refer to "Working With Mirrors."
Refer to "How to Stop a Disk (DiskSuite Tool)."
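From the command line, the equivalent of spinning down a tray in DiskSuite Tool is the ssaadm stop command (a sketch; tray 2 on controller c3 matches the earlier example):
-------------------------
# ssaadm stop -t 2 c3
-------------------------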
Note - The SPARCstorage Array tray should not be removed as long as the LED on the tray is illuminated. Also, you should not run any DiskSuite commands while the tray is spun down as this may have the side effect of spinning up some or all of the drives in the tray.
Instructions for the hardware procedure are found in the SPARCstorage Array Model 100 Series Service Manual and the SPARCcluster High Availability Server Service Manual.
The disks in the SPARCstorage Array tray should automatically spin up following the hardware replacement procedure. If the tray fails to spin up automatically within two minutes, force the action by using the following command.
-------------------------
# ssaadm start -t 2 c3
-------------------------
Saving the disk format information before problems occur is always a good idea.
Refer to "Working With Mirrors."
When the submirrors are brought back online, DiskSuite automatically resyncs all the submirrors, bringing the data up-to-date.
Refer to "Working With Mirrors."
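From the command line, each offlined submirror is brought back with the metaonline(1M) command. For example, using the mirror and submirror names from the earlier metastat output (d40 and d50), the command would be:
-------------------------
# metaonline d40 d50
-------------------------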
If a submirror had a hot spare replacement in use before you detached the submirror, this hot spare replacement will be in effect after the submirror is reattached. This step returns the hot spare to the "Available" status.
Use the information saved previously to replace the state database replicas.
Refer to the Solstice HA documentation.
Check the user/application data on all metadevices. You may have to run an application-level consistency checker or use some other method to check the data.
When setting up RAID5 metadevices for online repair, you will have to use a minimum RAID5 width of three slices. While this is not an optimal configuration for RAID5, it is still slightly less expensive than mirroring, in terms of the overhead of the redundant data. You should place each of the three slices of each RAID5 metadevice within a separate tray. If all disks in a SPARCstorage Array are configured this way (or in combination with mirrors as described above), the tray containing the failed disk may be removed without losing access to any of the data.
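As a sketch of such a layout, the following creates a three-slice RAID5 metadevice with one slice in each tray of the SPARCstorage Array shown earlier (on an SSA100, targets 0, 2, and 4 sit in trays 1, 2, and 3; the metadevice name d45 and the slice names are illustrative):
-----------------------------------------------
# metainit d45 -r c1t0d0s2 c1t2d0s2 c1t4d0s2
-----------------------------------------------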
You are going to locate the problem disk and tray, locate other affected DiskSuite objects, prepare the disk to be replaced, replace, then repartition the drive.
Before removing a SPARCstorage Array tray, halt all I/O and spin down all drives in the tray. The drives automatically spin up if I/O requests are made. Thus, it is necessary to stop all I/O before the drives are spun down.
Refer to the metaoffline(1M) command, which takes the submirror offline. When the submirrors on a tray are taken offline, the corresponding mirrors will only provide one-way mirroring (that is, there will be no data redundancy), unless the mirror uses three-way mirroring. When the submirror is brought back online, an automatic resync occurs.
Note - If you are replacing a drive that contains a submirror, use the metadetach(1M) command to detach the submirror.
With all affected submirrors offline, I/O to the tray will be stopped.
Either using DiskSuite Tool or the ssaadm command, spin down the tray. When the tray lock light is out the tray may be removed and the required task performed.
When you have completed work on a SPARCstorage Array tray, replace the tray in the chassis. The disks will automatically spin up.
However if the disks fail to spin up, you can use DiskSuite Tool (or the ssaadm command) to manually spin up the entire tray. There is a short delay (several seconds) between starting drives in the SPARCstorage Array.
After the disks have spun up, you must place online all the submirrors that were taken offline. When you bring a submirror online, an optimized resync operation automatically brings the submirrors up-to-date. The optimized resync copies only the regions of the disk that were modified while the submirror was offline. This is typically a very small fraction of the submirror capacity. You must also replace all state database replicas and add back hot spares.
Note - If you used metadetach(1M) to detach the submirror rather than metaoffline, the entire submirror must be resynced. This typically takes about 10 minutes per Gbyte of data.
When power is lost to one SPARCstorage Array, the following occurs:
You must monitor the configuration for these events using the metastat(1M) command as explained in "Checking Status of DiskSuite Objects."
You may need to perform the following after power is restored:
--------------------------------
# metastat
...
d10: Trans
    State: Okay
    Size: 11423440 blocks
    Master Device: d20
    Logging Device: d15

d20: Mirror
    Submirror 0: d30
      State: Needs maintenance
    Submirror 1: d40
      State: Okay
...

d30: Submirror of d20
    State: Needs maintenance
...
--------------------------------
-------------------------------------
# metareplace -e metadevice slice
-------------------------------------
The -e option transitions the state of the slice to the "Available" state and resyncs the failed slice.
Note - Slices that have been replaced by a hot spare should be the last devices replaced using the metareplace command. If the hot spare is replaced first, it could replace another errored slice in a submirror as soon as it becomes available.
A resync can be performed on only one slice of a submirror (metadevice) at a time. If all slices of a submirror were affected by the power outage, each slice must be replaced separately. It takes approximately 10 minutes for a resync to be performed on a 1.05-Gbyte disk.
Depending on the number of submirrors and the number of slices in these submirrors, the resync actions can require a considerable amount of time. At roughly 10 minutes per Gbyte, a single submirror made up of 30 1.05-Gbyte drives might take about five hours to complete, while a more realistic configuration of five-slice submirrors might take only about 50 minutes.
----------------------
# metadb -d slice
# metadb -a slice
----------------------
Note - Make sure you add back the same number of state database replicas that were deleted on each slice. A single metadb -d command can delete replicas from several slices at once, but adding them back may require multiple metadb -a invocations: when a slice needs more than one replica, all of its replicas must be added in a single metadb invocation using the -c flag, and -c applies to every slice named on that command line. Refer to the metadb(1M) man page for more information.
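For example, if one slice held three replicas and another held two, a single metadb -d can remove both sets, but restoring them takes a separate metadb -a for each slice because the -c count applies to every slice named on the command line (the slice names here are illustrative):
-------------------------------
# metadb -d c1t2d0s4 c1t3d0s4
# metadb -a -c 3 c1t2d0s4
# metadb -a -c 2 c1t3d0s4
-------------------------------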
Because state database replica recovery is not automatic, it is safest to manually perform the recovery immediately after the SPARCstorage Array returns to service. Otherwise, a new failure may cause a majority of state database replicas to be out of service and cause a kernel panic. This is the expected behavior of DiskSuite when too few state database replicas are available.
This procedure explains how to move disks containing DiskSuite objects from one SPARCstorage Array to another.
-------------------------------------
# metadb -a [-f] slice ...
-------------------------------------
Be sure to use the same slice names that contained the state database replicas as identified in Step 2. You might need to use the -f option to force the creation of the state database replicas.
-------------------
# metainit -a -n
-------------------
----------------
# metainit -a
----------------
This section contains information on making a SPARCstorage Array function as a system disk (boot device).
The minimum boot requirements for the SPARCstorage Array are:
To update or check the Fcode revision, use the fc_update program, which is supplied on the SPARCstorage Array CD, in its own subdirectory.
Consult the SPARCstorage Array documentation for more details.
Add the following forceload entries to /etc/system file to ensure that the SPARCstorage Array disks are made available early in the boot process. This is necessary to make the SPARCstorage Array function as a system disk (boot device).
-------------------------------------------------------------------------
*ident  "@(#)system     1.15    92/11/14 SMI" /* SVR4 1.5 */
*
* SYSTEM SPECIFICATION FILE
*
...
* forceload:
*
*       Cause these modules to be loaded at boot time, (just before mounting
*       the root filesystem) rather than at first reference. Note that
*       forceload expects a filename which includes the directory. Also
*       note that loading a module does not necessarily imply that it will
*       be installed.
*
forceload: drv/ssd
forceload: drv/pln
forceload: drv/soc
...
-------------------------------------------------------------------------
Note - When creating a root (/) mirror on a SPARCstorage Array disk, running the metaroot(1M) command puts the above entries in the /etc/system file automatically.