This appendix describes some ways to set up your DiskSuite configuration.
When planning a configuration, the main point to keep in mind is that for any given application there are trade-offs in performance, availability, and hardware costs. Experimenting with the different variables is necessary to figure out what works best for your configuration.
Striping generally has the best performance, but it offers no data protection. For write-intensive applications, mirroring generally has better performance than RAID5.
Mirroring and RAID5 metadevices both increase data availability, but they both generally have lower performance, especially for write operations. Mirroring does improve random read performance.
RAID5 metadevices have a lower hardware cost than mirroring. Both striped metadevices and concatenated metadevices have no additional hardware cost.
This section provides a list of guidelines for working with concatenations, stripes, mirrors, RAID5 metadevices, state database replicas, and file systems constructed on metadevices.
Disk geometry refers to how sectors and tracks are organized for each cylinder in a disk drive. The UFS file system organizes itself to use the disk geometry efficiently. If the slices in a concatenated metadevice have different disk geometries, DiskSuite uses the geometry of the first slice, which can reduce the efficiency of the UFS file system.
Note - Disk geometry differences do not matter with disks that use Zone Bit Recording (ZBR), because the amount of data on any given cylinder varies with the distance from the spindle. Most disks now use ZBR.
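If you want to verify the geometry of the slices you plan to concatenate, you can compare the Dimensions section (sectors/track, tracks/cylinder) reported by prtvtoc(1M) for each disk. The device names below are examples only.
----------------------------------------
# prtvtoc /dev/rdsk/c0t2d0s2
# prtvtoc /dev/rdsk/c1t2d0s2
----------------------------------------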
Note that the UNIX operating system implements a file system cache. Since read requests frequently can be satisfied from this cache, the read/write ratio for physical I/O through the file system can be significantly biased toward writing.
For example, an application I/O mix might be 80 percent reads and 20 percent writes. But, if many read requests can be satisfied from the file system cache, the physical I/O mix might be quite different - perhaps only 60 percent reads and 40 percent writes. In fact, if there is a large amount of memory to be used as a buffer cache, the physical I/O mix can even go the other direction: 80 percent reads and 20 percent writes might turn out to be 40 percent reads and 60 percent writes.
Figure 7-1 Mirror Performance Matrix
A mirrored metadevice can withstand multiple device failures in some cases (for example, if the multiple failed devices are all on the same submirror). A RAID5 metadevice can only withstand a single device failure. Striped and concatenated metadevices cannot withstand any device failures.
When a device fails in a RAID5 metadevice, read performance suffers because multiple I/O operations are required to regenerate the data from the data and parity on the existing drives. Mirrored metadevices do not suffer the same degradation in performance when a device fails.
In a RAID5 metadevice, parity must be calculated and both data and parity must be stored for each write operation. Because of the multiple I/O operations required to do this, RAID5 write performance is generally reduced. In mirrored metadevices, the data must be written to multiple mirrors, but mirrored performance in write-intensive applications is still much better than in RAID5 metadevices.
RAID5 metadevices have a lower hardware cost than mirroring. Mirroring requires twice the disk storage (for a two-way mirror). In a RAID5 metadevice, the fraction of the total capacity required to store the parity is 1/(number of disks).
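For example, in a hypothetical five-disk RAID5 metadevice built from 1.05-Gbyte slices, parity consumes 1/5 (20 percent) of the capacity, leaving about 4.2 Gbytes of usable space out of 5.25 Gbytes; a two-way mirror with the same 4.2 Gbytes of usable space would require eight such slices (100 percent additional storage).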
You cannot encapsulate an existing file system in a RAID5 metadevice; you must back up the file system and restore it onto the metadevice.
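For example, a minimal sketch of that backup-and-restore sequence might look like the following. The mount point /export, the tape device /dev/rmt/0, the metadevice name d45, the slice names, and the 16-Kbyte interlace are all hypothetical; metainit -r creates the RAID5 metadevice, newfs builds a new file system on it, and ufsrestore reloads the backup into it.
----------------------------------------
# ufsdump 0f /dev/rmt/0 /export
# umount /export
# metainit d45 -r c1t1d0s2 c1t2d0s2 c1t3d0s2 -i 16k
# newfs /dev/md/rdsk/d45
# mount /dev/md/dsk/d45 /export
# cd /export; ufsrestore rf /dev/rmt/0
----------------------------------------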
Both of these settings (the minimum free space percentage and the number of bytes per inode) became the default mkfs(1M) parameters in Solaris 2.5. They reduce the overhead of creating a file system from more than 10 percent of the disk to about 3 percent, with no performance trade-off.
For example, to create a new file system with a minimum percentage of free space set to one percent, and the number of bytes per inode set to 8 Kbytes, use the following command.
----------------------------------------
# newfs -m 1 -i 8192 /dev/md/rdsk/d53
----------------------------------------
In this command:
----------------------------------------------------------
-m 1      Specifies that the minimum percentage of free space in the file system is one (1) percent.
-i 8192   Specifies that the number of bytes per inode is 8 Kbytes.
----------------------------------------------------------
For example, to create a new file system with 256 cylinders per cylinder group, use the following command.
-----------------------------------
# newfs -c 256 /dev/md/rdsk/d114
-----------------------------------
Note - The man page in Solaris 2.3 and 2.4 incorrectly states that the maximum number of cylinders per cylinder group is 32.
For example, try the following parameters for sequential I/O:
maxcontig = 16
(16 * 8 KB blocks = 128 KB clusters)
Using a four-way stripe with a 32 Kbyte interlace value results in a 128 Kbyte stripe width, which is a good performance match:
interlace size = 32 KB
(32 KB stripe unit size * 4 disks = 128 KB stripe width)
Performance may be improved if the file system I/O cluster size is an integral multiple of the stripe width. For example, setting the maxcontig parameter to 16 results in 128 KB clusters (16 blocks * 8 KB file system block size).
Note - The options to the mkfs(1M) command can be used to modify the default minfree, inode density, cylinders/cylinder group, and maxcontig settings. You can also use the tunefs(1M) command to modify the maxcontig and minfree settings.
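For example, to change maxcontig to 16 on an existing file system built on a metadevice, you might run tunefs(1M) against the raw metadevice; d53 here is just an illustrative name.
----------------------------------------
# tunefs -a 16 /dev/md/rdsk/d53
----------------------------------------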
See the man pages for mkfs(1M), tunefs(1M), and newfs(1M) for more information.
This section compares performance issues for RAID5 metadevices and striped metadevices.
This section explains the differences between random I/O and sequential I/O, and DiskSuite strategies for optimizing your particular configuration.
Databases and general-purpose file servers are examples of random I/O environments. In random I/O, the time spent waiting for disk seeks and rotational latency dominates I/O service time.
You can optimize the performance of your configuration to take advantage of a random I/O environment.
You want all disk spindles to be busy most of the time servicing I/O requests. Random I/O requests are small (typically 2-8 Kbytes), so it's not efficient to split an individual request across multiple disk drives. The interlace size doesn't matter much, because the goal is simply to spread the data across all the disks; any interlace value greater than the typical I/O request size will do.
For example, assume you have a 4.2-Gbyte DBMS table space. If you stripe across four 1.05-Gbyte disk spindles, and if the I/O load is truly random and evenly dispersed across the entire range of the table space, then each of the four spindles will tend to be equally busy.
The target for maximum random I/O performance is disk utilization of 35 percent or lower, as reported by DiskSuite Tool's performance monitor or by iostat(1M). Disk utilization consistently in excess of 65 percent is a problem; utilization in excess of 90 percent is a major problem.
If you have a disk running at 100 percent utilization and you stripe its data across four disks, you might expect each of the four disks to run at 25 percent (100/4 = 25 percent). However, you will probably see all four disks running at greater than 35 percent, because throughput is no longer artificially limited to that of a single disk.
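To watch disk utilization, you can use the %b (percent busy) column of the extended statistics reported by iostat(1M); for example, the following command samples all disks every five seconds.
----------------------------------------
# iostat -x 5
----------------------------------------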
While most people think of disk I/O in terms of sequential performance figures, only a few servers - DBMS servers dominated by full table scans and NFS servers in very data-intensive environments - will normally experience sequential I/O.
You can optimize the performance of your configuration to take advantage of a sequential I/O environment.
The goal in this case is to get greater sequential performance from an array than you can get from a single disk. To achieve this, set the interlace value small relative to the typical I/O request size. This ensures that a typical I/O request is spread across multiple disk spindles, thus increasing the sequential bandwidth.
Example:
Assume a typical I/O request size of 256 KB and striping across 4 spindles. A good choice for stripe unit size in this example would be:
256 KB / 4 = 64 KB, or smaller
Note - Seek and rotation time are practically non-existent in the sequential case. When optimizing sequential I/O, the internal transfer rate of a disk is most important.
The most useful recommendation is: max-io-size / #-disks. Note that for UFS file systems, the maxcontig parameter controls the file system cluster size, which defaults to 56 KB. It may be useful to configure this to larger sizes for some sequential applications. For example, using a maxcontig value of 12 results in 96 KB file system clusters (12 * 8 KB blocks = 96 KB clusters). Using a 4-wide stripe with a 24 KB interlace size results in a 96 KB stripe width (4 * 24 KB = 96 KB) which is a good performance match.
Example: In sequential applications, typical I/O size is usually large (greater than 128 KB, often greater than 1 MB). Assume an application with a typical I/O request size of 256 KB and assume striping across 4 disk spindles. Do the arithmetic: 256 KB / 4 = 64 KB. So, a good choice for the interlace size would be 32 to 64 KB.
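For instance, a four-way stripe with a 64-Kbyte interlace could be created as follows; the metadevice name d10 and the slice names are examples only.
---------------------------------------------------------------
# metainit d10 1 4 c1t1d0s2 c1t2d0s2 c1t3d0s2 c1t4d0s2 -i 64k
---------------------------------------------------------------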
Number of stripes: Another way of looking at striping is to first determine the performance requirements. For example, you may need 10.4 MB/sec performance for a selected application, and each disk may deliver approximately 4 MB/sec. Based on this, determine how many disk spindles you need to stripe across:
10.4 MB/sec / 4 MB/sec = 2.6
Therefore, 3 disks would be needed.
To summarize the trade-offs: Striping delivers good performance, particularly for large sequential I/O and for uneven I/O distributions, but it does not provide any redundancy of data.
Write-intensive applications: Because of the read-modify-write nature of RAID5, metadevices with greater than about 20 percent writes should probably not be RAID5. If data protection is required, consider mirroring.
RAID5 writes will never be as fast as mirrored writes, which in turn will never be as fast as unprotected writes. The NVRAM cache on the SPARCstorage Array closes the gap between RAID5 and mirrored configurations.
Full Stripe Writes: RAID5 read performance is always good (unless the metadevice has suffered a disk failure and is operating in degraded mode), but write performance suffers because of the read-modify-write nature of RAID5.
In particular, when writes are less than a full stripe width or don't align with a stripe, multiple I/Os (a read-modify-write sequence) are required. First, the old data and parity are read into buffers. Next, the new parity is calculated by XORing data and parity: the old data is logically subtracted from the parity, and then the new data is logically added to it. The new parity and data are stored to a log. Finally, the new parity and new data are written to the data stripe units.
Full stripe width writes have the advantage of not requiring the read-modify-write sequence, and thus performance is not degraded as much. With full stripe writes, all new data stripes are XORed together to generate parity, and the new data and parity are stored to a log. Then, the new parity and new data are written to the data stripe units in a single write.
Full stripe writes are used when the I/O request aligns with the stripe and the I/O size exactly matches:
interlace_size * (num_of_columns - 1)
For example, if a RAID5 configuration is striped over 4 columns, in any one stripe, 3 chunks are used to store data, and 1 chunk is used to store the corresponding parity. In this example, full stripe writes are used when the I/O request starts at the beginning of the stripe and the I/O size is equal to: stripe_unit_size * 3. For example, if the stripe unit size is 16 KB, full stripe writes would be used for aligned I/O requests of size 48 KB.
Performance in degraded mode: When a slice of a RAID5 metadevice fails, the parity is used to reconstruct the data; this requires reading from every column of the RAID5 metadevice. The more slices assigned to the RAID5 metadevice, the longer read and write operations (including resyncing the RAID5 metadevice) will take when I/O maps to the failed device.
Sharing logs: trans metadevices can share log devices. However, if a file system is heavily used, it should have a separate log. The disadvantage to sharing a logging device is that certain errors require that all file systems sharing the logging device must be checked with the fsck(1M) command.
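For example, two trans metadevices could share a single log device as follows; the metadevice names are hypothetical, with d10 and d20 as the masters and d100 as the shared log.
----------------------------------------
# metainit d1 -t d10 d100
# metainit d2 -t d20 d100
----------------------------------------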
Assume you have a 4 GB file system. What are the recommended log sizes?
Note - Replicas cannot be stored on the root (/), swap, or /usr slices, or on slices containing existing file systems or data.
Three or more replicas are required, and you want a majority of the replicas to survive a single component failure. Losing a replica (for example, due to a device failure) can cause problems when running DiskSuite or when rebooting the system.
The system will stay running with exactly half or more of the replicas available. To prevent data corruption, the system will panic when fewer than half of the replicas are available.
The system will not reboot unless one more than half the total replicas are available. If too few replicas are available, you must reboot single-user and delete the bad replicas (using the metadb(1M) command).
As an example, assume you have four replicas. The system will stay running as long as two replicas (half the total number) are available. However, in order for the system to reboot, three replicas (half the total plus 1) must be available.
In a two-disk configuration, you should always create two replicas on each disk. For example, assume you have a configuration with two disks and you only created three replicas (two on the first disk and one on the second disk). If the disk with two replicas fails, DiskSuite will stop functioning because the remaining disk only has one replica and this is less than half the total number of replicas.
Note - If you created two replicas on each disk in a two-disk configuration, DiskSuite will still function if one disk fails. But because you must have one more than half of the total replicas available in order for the system to reboot, you will be unable to reboot in this state.
If multiple controllers exist, replicas should be distributed as evenly as possible across all controllers. This provides redundancy in case a controller fails and also helps balance the load. If multiple disks exist on a controller, at least two of the disks on each controller should store a replica.
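As an illustration, the following command creates two replicas on each of two slices attached to different controllers; the slice names are examples, and the -f option is needed only when creating the initial replicas.
----------------------------------------
# metadb -a -f -c 2 c1t0d0s7 c2t0d0s7
----------------------------------------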
Replicated databases have an inherent problem in determining which database has valid and correct data. To solve this problem, DiskSuite uses a majority consensus algorithm. This algorithm requires that a majority of the database replicas agree with each other before any of them are declared valid. This algorithm requires the presence of at least three initial replicas which you create. A consensus can then be reached as long as at least two of the three replicas are available. If there is only one replica and the system crashes, it is possible that all metadevice configuration data may be lost.
The majority consensus algorithm is conservative in the sense that it will fail if a majority consensus cannot be reached, even if one replica actually does contain the most up-to-date data. This approach guarantees that stale data will not be accidentally used, regardless of the failure scenario. The majority consensus algorithm accounts for the following: the system will stay running with exactly half or more replicas; the system will panic when fewer than half the replicas are available; the system will not reboot without one more than half the total replicas.