Solstice DiskSuite 4.1 is co-packaged with the Solaris Server Boxes, so it is included at no extra cost when you buy a server. You can also buy Solstice DiskSuite 4.1 separately as an add-on.
As long as you are not mirroring root (/), the 1.33 Fcode is not required.
At the point that you mirror root (/), the 1.33 Fcode is required if you:
This is because access to the drivers is through the OBP, and the support for such items is provided by the 1.33 Fcode. This is analogous to the fact that 1.33 Fcode is required to put root (/) on a SPARCstorage Array disk.
Yes and no. There is a limit imposed by the size of the metadevice unit structure. However, this limit is so large that it puts the maximum number of partitions into the thousands.
No. UFS does not usually need defragmentation.
Jun 13 14:01:13 uns-tr1 unix: WARNING: forceload of misc/md_trans failed
Jun 13 14:01:13 uns-tr1 unix: WARNING: forceload of misc/md_raid failed
Jun 13 14:01:13 uns-tr1 unix: WARNING: forceload of misc/md_hotspares failed
It is safe to ignore these messages. This is an artifact of the way drivers are loaded during the boot process.
Does the metareplace command have an option to reenable all disks in a configuration at the same time rather than individually?
No. Currently, they all must be enabled one at a time. You could write an awk or sed script to automate the process.
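For example, if you have already confirmed (with metastat) which slices are in the Maintenance state, a small shell loop can re-enable them one after another. The metadevice and slice names below are hypothetical, and all of the slices are assumed to belong to the mirror d11:
for slice in c1t4d0s7 c1t5d0s7 c2t4d0s7
do
    metareplace -e d11 $slice
done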
Yes, as long as you do not use UFS logging. Also, you must be sure not to load and enable Prestoserve until the DiskSuite metadisk is enabled.
No.
You do not have to change your configuration. However, Solstice DiskSuite 4.1 is not backwardly compatible with previous Online: DiskSuite 2.0.1 and 3.0 products. When upgrading to DiskSuite 4.1, you must convert the current configuration. A conversion script (metacvt) is provided on the installation CD-ROM in the /scripts directory to automate the upgrade. As with any upgrade, perform a backup prior to the conversion to be safe.
The maximum number of metadevices supported in Solstice DiskSuite 4.1 is 1024. The previous release officially supported 128, although you could raise that limit (and some sites did) by increasing the value of nmd in the /kernel/drv/md.conf file. The same technique applies to Solstice DiskSuite 4.1. Note, however, that only configurations of up to 1024 metadevices are officially supported.
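For example (the exact contents of the line may differ slightly from release to release), raising nmd means editing /kernel/drv/md.conf so that the md line reads something like:
name="md" parent="pseudo" nmd=256 md_nsets=4;
and then performing a reconfiguration reboot (boot -r).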
Yes, metacvt can upgrade a system running Online: DiskSuite 2.0.1.
The color mapping is in effect during that particular usage of DiskSuite Tool. It cannot be saved.
These values are autoscaled to avoid displaying values such as .0001 Gbyte, or 1,000,000 Kbytes.
DiskSuite Tool always clears metadevices with the -f force flag on. When the force flag is needed, DiskSuite Tool displays a message asking for confirmation of the requested deletion. Note that DiskSuite Tool does not have the ability to recursively delete a metadevice such as a mirror and all its submirrors as the command line does, for example, using the metaclear -r command.
The database replicas must be read in early, before root (/) is mounted. This requires that the SOC SBus card have 1.33 Fcode loaded, and that the system has the correct rev of ssd drivers and patches.
Otherwise, SPARCstorage Array disks cannot be used as system disks, and disks holding metadevice state database replicas are considered system disks.
The size of the replicas should not cause the creation of metadevices to slow down. The only impact from small replicas is that it would prevent large numbers of metadevices from being built.
You can create state database replicas either on dedicated slices, or on slices that will be used as part of a stripe, RAID5, or trans metadevice.
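For example (the slice names are hypothetical), the following creates two replicas on each of two dedicated slices; the -f flag is needed only when creating the very first replicas on a system:
metadb -a -f -c 2 c0t0d0s7
metadb -a -c 2 c1t0d0s7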
The md.tab File
The Solstice DiskSuite configuration is not stored in the md.tab file. Neither the CLI nor the GUI can update this file. This is true for all DiskSuite releases.
The configuration and state information is stored in the metadevice state databases, which are created by the metadb command. The information is replicated in the database, so that DiskSuite guarantees access to the correct data. This is most critical for mirrors and RAID devices because there is more than one copy of the data.
md.tab is used as an input (work) file to metadb and metainit. Think of md.tab as a script. (Typically, a script is not considered a storage medium for configuration data.) The best use for md.tab is to build your entire configuration in it. Then you can run metainit -n to see if the syntax in the md.tab is correct, followed by metainit -a to build the metadevices in the correct order. The GUI can do this too. In the GUI, you do not have to worry about syntax. Also, you can build a large configuration in the metadevice editor and commit all the devices at one time, and the GUI takes care of the ordering.
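As a sketch only (the device and slice names are hypothetical, and flag combinations can vary slightly between releases), an md.tab might contain entries such as:
d10 1 2 c1t0d0s0 c1t1d0s0 -i 32k
d21 1 1 c2t0d0s0
d20 -m d21
where d10 is a two-slice stripe with a 32 Kbyte interlace, d21 is a one-slice concatenation, and d20 is a one-way mirror built on d21. You could then check and build the configuration with:
metainit -a -n
metainit -a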
The md.cf File
DiskSuite Tool and the following command line utilities modify the md.cf file:
md.tab is a read-only file used to initialize metadevices all at once. Think of it as a script of metainit commands.
Note: The most current configuration information is obtained from the metadevice state database. The GUI uses and manipulates this information directly.
Yes. Solaris limits the total size of the stripe to 2 Gbytes.
The difference is in the ordering of the logical blocks. In a concatenation, all of the blocks of the first slice are read, followed by all of the blocks of the second slice, and so on. For striping, the first interlace's worth of blocks (defaulting to 16 Kbytes in Solstice DiskSuite 4.1) is read from the first slice, then the first interlace's worth of blocks is read from the second slice, and so on.
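For example, with hypothetical slice names, the same two slices can be set up either way:
metainit d12 2 1 c1t0d0s0 1 c1t1d0s0
metainit d13 1 2 c1t0d0s0 c1t1d0s0 -i 32k
where d12 is a concatenation (two stripes of one slice each) and d13 is a stripe of both slices with a 32 Kbyte interlace.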
When you drag slices to a concat/stripe object, a dialog box prompts you to select stripe or concat.
Most likely these slices contain the disk label and DiskSuite is hopping over the label (by a full cylinder) to keep from obliterating it. The first disk has its label exposed so that if it contained an existing file system it would still show up in the right place. Another explanation might be that the other disks contain state database replicas.
In almost all but the most obscure cases, it is better to create a striped metadevice rather than a concatenation. The most obvious exception is encapsulating an existing file system in a slice, which requires that at least the first part of a concat/stripe be a concatenation.
Note: With fast SCSI and improved hardware and software, this warning is no longer so important. This is especially true if the I/O is mostly random, because the bus has more than enough bandwidth.
When I choose a disk to boot from, I assume that the boot loader, kernel, and the initial set of drivers come from that disk. Then when the md drivers start up, the mirroring takes over and the "good" disk is used. Is it possible that the partner submirror is then exclusively used and not the original disk used in the boot?
Yes, this can happen, although only on rare occasions, such as when the boot disk is powered off after the system is up and running.
The boot prom information or eeprom command shows the boot disk. The metastat command shows the "good side" of the mirror, which is not necessarily the boot disk.
If a system problem occurs that causes a mirror write to fail, the submirrors will not be identical when the system restarts. It is important to note that at this point, neither submirror is "good" or "bad" or more "correct" than the other. Although one submirror might contain more recently written data, the write completion status has not been returned to the application and so the write cannot be considered complete. It is strictly a matter of chance which, if either, of the submirrors contains the more recently written data.
Because the data on the two submirrors might be different, the submirrors must be resynced before reuse. In this case, DiskSuite uses an optimized resync. This type of resync only updates those regions of the submirrors that might contain uncompleted writes at the time of the system failure.
DiskSuite does not use dirty region logs. Instead, it uses dirty region bitmaps. Each mirror has two bitmaps associated with it, stored in two state database replicas (created by the metadb command). The bitmaps cannot be turned off, to ensure that optimized resyncs are done for reboots. The bitmaps are dynamically sized at mirror creation to the appropriate size.
This is a generic problem with all raw devices, not just those used as metadevices. Currently, there is no way to register devices for use other than as file systems and swap. DiskSuite does not have a mechanism of its own for this purpose.
Yes. You can add slices to each submirror with the metattach command or DiskSuite Tool. The mirror will automatically expand when the last submirror is increased, growing to the size of the smallest submirror. This enables you to expand the size of the mirror while it is online.
If the mirror contains a UFS file system, the growfs command should be used to grow the file system.
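For example, with hypothetical names, to grow a two-way mirror d20 (submirrors d21 and d22) holding a UFS file system mounted on /export:
metattach d21 c2t1d0s0
metattach d22 c3t1d0s0
growfs -M /export /dev/md/rdsk/d20
The mirror itself grows when the last submirror is extended; growfs then grows the file system into the new space.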
This indicates that DiskSuite encountered an I/O error on the devices in the maintenance state. You should scan /var/adm/messages to see what kind of errors occurred. If the errors are of a transitory nature and the disks themselves do not have problems, use metareplace with the -e option, such as:
metareplace -e d11 c1t4d0s7
where d11 is a mirror and c1t4d0s7 is a slice in one of its submirrors.
If the disks are defective, you can either replace them with other available slices on the system using metareplace, or repair/replace them and run metareplace with -e as shown above.
DiskSuite protects the label of underlying disks, unlike straight sd/ssd, etc. Most likely, your mirror is composed of stripes whose first component is Slice 0 or 2 (or some other slice beginning at cylinder 0). It appears that this application is not skipping over disk labels, which are protected by DiskSuite. Note that this is the case for all DiskSuite metadevices, not just mirrors.
The result is that a write is never started to a submirror until the corresponding region(s) are marked dirty. If a region is dirty it means there might be different data on the submirrors. In most cases, Solstice DiskSuite writes the same data on top of itself during a resync.
DiskSuite provides Level 0 (stripes and concatenations), Level 1 (mirrors), and Level 5 (RAID) devices. The metainit command has flags that specify what type of device to create: no flag indicates a concat/stripe; the -m flag indicates a mirror; and the -r flag indicates a RAID5 device.
The metastat command shows the type (stripe/concat, mirror, or RAID) of each metadevice.
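For example, with hypothetical names (d21 must already exist as a concat/stripe before the mirror is created):
metainit d10 1 2 c1t0d0s0 c1t1d0s0
metainit d20 -m d21
metainit d30 -r c2t0d0s0 c2t1d0s0 c2t2d0s0
where d10 is a two-slice stripe, d20 is a one-way mirror built on submirror d21, and d30 is a three-column RAID5 device.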
A major reason for the lengthy initialization time for a RAID5 metadevice is that all of the stripes that constitute the RAID5 device must first be "zeroed" on disk. This is the only way to guarantee that the data on the disks and the parity are valid. (Parity is computed by XORing all the data bits, and the XOR of all zeros is zero.) For a stripe there is no zeroing because it doesn't matter what bits are on the disk when the device is initialized. That is the price you pay for having parity protection. There is also a cost for mirrors: when you add a second submirror to a mirror, all the data from the first submirror must be copied to the second (a resync). However, the user doesn't suffer, because the resync runs in the background and the data remains available from the first submirror. RAID5 is the only RAID device that has this type of initialization cost.
If you want write performance, or performance after an error, use mirroring. RAID5 is appropriate for read-mostly applications.
The metattach command can be used to concatenate new devices to a RAID5 device. The new devices are parity protected. The resulting RAID5 device continues to handle a single failure.
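For example, assuming an existing RAID5 device d30 and a hypothetical new slice:
metattach d30 c2t3d0s0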
RAID 0 (stripes and concatenations), RAID 1 (mirrors), and RAID 5 (RAID devices).
The parity areas are allocated when the initial RAID5 device is created. One column's worth of space is allocated to parity, although the actual parity blocks are distributed across all of the columns to avoid hot spots. When you concatenate additional devices to the RAID, the additional space is devoted to data--no new parity blocks are allocated. The data on the concatenated devices is, however, included in the parity calculations, so it is protected against single device failures.
The parity areas are allocated and initialized when the original RAID5 device is created, so it cannot be too "full" to store parity. When the concatenation is added, it is initialized by zeroing it, because adding additional blocks of 0 to the line does not change the calculated parity. If you do a metastat right after doing the metattach, you will see that the concatenation is in the "Initializing" state. After this initialization, the concatenation is available for use and it is included in the parity calculations. From the standpoint of parity calculation, the concatenation is treated as additional columns.
The parity is stored in the original RAID5 set. No parity is stored in the RAID concatenations if there are any. If you start with a 3 column RAID, the space is allocated like this (the size of each box is the RAID interlace size which can be seen in the metastat output):
---------   ---------   ---------
| data  |   | data  |   | par.  |
---------   ---------   ---------
| par.  |   | data  |   | data  |
---------   ---------   ---------
| data  |   | par.  |   | data  |
---------   ---------   ---------
| data  |   | data  |   | par.  |
---------   ---------   ---------
The parity is merely the exclusive OR of the data in the line.
It is best to view concatenation to an existing RAID5 device as a convenient way to add space, and defer reconfiguring the RAID device (including the dump/restore) until a more convenient time.
There is no RAID5 log. Instead, DiskSuite uses a prewrite area. If you view metastat output from a RAID5 device, you will see that the start block for each column in the RAID is greater than zero. This is because DiskSuite allocated the first part of each column to the prewrite area. When a write occurs on a RAID5 device, the data and parity are first written to the prewrite areas of the appropriate columns. Next, iodone is called for the operation so that the writing process can continue. The data is then written to the active area of the RAID device. When that operation is done, the temporary copy in the prewrite area is freed up. If the system crashes, DiskSuite rolls the completed prewrite transactions to the active area. Thus, the integrity of the RAID device is protected against crashes.
Yes. The actual maximum log size is 1 Gbyte. The minimum log size is 1 Mbyte. 1 Mbyte's worth of log per 1 Gbyte of file system is a recommended minimum. 1 Mbyte's worth of log per 100 Mbyte of file system is a recommended "average," but there are no hard and fast rules. The best log size varies with an individual system's load and configuration. Fortunately, log sizes can be changed without too much work. Refer to the metattach and metadetach commands.
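As a sketch with hypothetical device names, a trans metadevice is created by naming the master device (which holds the UFS file system) and the log device:
metainit d63 -t d33 d34
where d33 is the master and d34 is the log. To change the log size later, detach the existing log with metadetach and attach a larger log device with metattach.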
Logs can be shared across file systems. However, for reliability and availability reasons, you may prefer to give each file system its own log. It is also a good idea to mirror logs.
The file system structure data and the synchronously written user data are kept in the same log. Asynchronous user data is not logged.
The log data is rolled in oldest-transaction-first order. Some cancellation or reordering can occur, as long as it does not break UNIX semantics.
After a panic or power outage, UFS logging "rolls" the log file forward, discarding any unfinished system calls in the log. Thus, the file system should never be in an irreparable state, no matter how much activity there was at the time of the panic or outage. In addition, fsck can tell if a file system is a logging file system, and not check it during the reboot. This can save quite a bit of time on large file systems when rebooting after a failure.
When you concatenate to a metadevice, it becomes larger, but UFS does not automatically know about the increase in size. You must use the growfs command to grow the UFS file system to take advantage of the increase in size.
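For example, with hypothetical names, to add a slice to metadevice d7 whose UFS file system is mounted on /files:
metattach d7 c1t2d0s0
growfs -M /files /dev/md/rdsk/d7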
No. There is no "shrinkfs" command to dynamically reduce the size of a UFS file system.
To some extent, yes. The data needs to be part of a mirror. You can then add a new submirror of equivalent or larger capacity, perform a resync, then detach the original submirror.
Hot sparing can only be used for mirrors and RAID5 devices. A hot spare pool must be associated with a submirror or a RAID device.
When a slice in a submirror/RAID5 device goes "bad," a slice from the associated hot spare pool is used to replace it. DiskSuite selects the first hot spare (slice) that is large enough to replace the "bad" slice.
A hot spare pool can contain from 0 to n hot spares, and a single pool can be associated with multiple submirrors/RAID5 devices. You can define one hot spare pool containing a variety of different size slices, and associate it with all the submirrors/RAID5 devices.
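For example, with hypothetical names, you can create one pool and associate it with both a submirror and a RAID5 device:
metainit hsp001 c2t2d0s0 c2t3d0s0
metaparam -h hsp001 d11
metaparam -h hsp001 d30
where hsp001 is the hot spare pool, d11 is a submirror, and d30 is a RAID5 device.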
After replacing the failed disk, repartition it so that it is exactly like the original failed disk. Then run metareplace as follows:
metareplace -e metadevice component
where component is the slice name of the failed disk.
Yes. If necessary, you can hard lock and then unmount.
Yes. Be aware, however, that the host that had the metadevices mounted will undergo a failfast panic.
You need to run the metaset command with the -d option as follows:
metaset -s nets -d
You will also probably need the -f flag to delete the last items in a set.
The number of disksets is limited to 32 (including the "local" or default diskset). This can be changed by modifying the "md_nsets=4" variable in the /kernel/drv/md.conf file and performing a reconfiguration reboot. This variable should not be set any higher than is necessary, as it creates a number of device nodes in /devices and symlinks in /dev. Also, be sure to set the same value on both hosts in the diskset.
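For example, allowing eight disksets means editing the md line in /kernel/drv/md.conf (the surrounding fields may differ on your system) to read something like:
name="md" parent="pseudo" nmd=128 md_nsets=8;
followed by a reconfiguration reboot on both hosts.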
The metadb command can take the -s setname option to operate on the state database replicas of a particular diskset rather than on the local replicas.
This cannot be done. Only one host at a time in the diskset can take ownership of the whole diskset and access metadevices within the diskset.
The diskset information for each host and its drives is stored in the local databases.
The daemons are synchronizing the diskset information. The information is stored in binary formats in the local databases of each host.
Ownership is not important until there are drives in the diskset. There is no "real data" to protect yet. As soon as there is a drive, only one host has ownership of the diskset at any one time.
You either need root in group 14, or you need a /.rhosts file entry containing the other hostname (on each host).
Either c1t0d0 is a physically different disk on each host, or the disks have different device numbers on each host.
This is not possible. Disksets are the enablers for this type of configuration.
After installing Solstice DiskSuite 4.1, you need to perform a reconfiguration boot (boot -r). This should get all the device nodes and symlinks created.
Yes.
Yes, either with the force flag or without it.
The two sets of replicas are independent of each other.
The diskset feature is just one of the building blocks you need to build an HA system. Sun's HA framework coordinates take/release of disksets and many other items. Disksets are an enabler for Sun's (or an OEM's or VAR's) HA product. DiskSuite's diskset feature is not a complete HA product in itself.
Yes. If the minor numbers are not the same on both hosts, typically you see the message "drive c#t#d# is not common with host xxxx" when adding drives to the diskset.
Changing a host's IP address should not affect diskset operation.
The DiskSuite algorithm that determines this works as follows. First, an attempt is made to balance state database replicas across disks, controllers, and storage arrays. This algorithm works differently if there are more than two controllers or storage arrays. When there is only one storage array, the algorithm treats the array as six SCSI chains. Then, if Slice 7 starts on cylinder 0, is as large as or larger than what DiskSuite needs for the state database replica, has no overlapping slices, and has the correct tag and flag fields, the disk is not repartitioned.
No, this is not supported.
No. Solstice DiskSuite 4.1 requires symmetric host configurations.
metaset -s nets -d -h A
A simple, unmirrored stripe cannot sustain write errors like a mirror, and so can lose data. Reporting that this is the case is not deemed useful. To view a "snapshot" state of a stripe, refer to the /var/adm/messages file. syslog writes information to this file. You can also configure syslog to send mail when various errors occur.
The first submirror to have an error is put into the "Maintenance" state. No further reads or writes to this submirror occur.
The second submirror to have an error (assuming it represents the same logical blocks as the first submirror) is put into the "Last Erred" state. Reads and writes go to the "Last Erred" submirror. If an I/O error occurs now, the mirror has an I/O error. (At this point, the mirror behaves like an ordinary physical disk.)
You should run metareplace -e slice on the slice in the "Maintenance" state first. If the mirror resync completes successfully, run metareplace -e slice on the slice in the "Last Erred" state.
If the first metareplace fails, you will need to replace the slice (physical disk) and restore from backup tape.
If the first metareplace succeeds, it is a good idea to validate the data before continuing.
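For example, with hypothetical names, if c1t0d0s0 is in the "Maintenance" state and c2t0d0s0 is in the "Last Erred" state within mirror d20:
metareplace -e d20 c1t0d0s0
metastat d20
metareplace -e d20 c2t0d0s0
Run the second metareplace only after metastat shows that the resync started by the first has completed.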
DiskSuite 4.0 sends notification to syslog. However, DiskSuite Tool does not send any notification. DiskSuite 4.1 enables DiskSuite Tool to use SNMP notifications, which SunNet Manager will be able to understand.