dd if=/dev/random of=/dev/blog

5. March 2010

AMD RAID-on-Chip: A valid technology? Or is it just too late in the game?

Filed under: RAID, Storage, File Systems, SCSI — admin @ 14:33

Back in December I came across this article on an AMD RoC (RAID-on-Chip) that will be embedded into servers to provide uninterrupted RAID functionality. A quick question came to mind as I was reading it: “Considering today’s storage capabilities and low-cost equipment, who will be using this?” Honestly, I was not able to come up with an answer.

In an earlier blog post I mentioned the rise in usage of software RAID. Small-to-Medium sized Businesses (SMBs) have been running to these low-cost solutions. And why not? You get more bang for your buck. For instance, by running OpenSolaris, one is able to use the redundancy of the ZFS file system (with single/double parity or mirrored RAID), file system level snapshots, data deduplication, and more. On top of that, ZFS checksums its data so that corruption (both noisy and silent) can be detected and, where redundancy exists, repaired. Take these ZFS pools and share them via NFS/CIFS or FTP/HTTP, or even map them over the iSCSI, Fibre Channel, AoE or FCoE protocols. The operating system (with all bells and whistles) is freely distributed under the CDDL license. The only costs will be the hardware (a server or two and, if external storage is needed, a JBOD) and the storage administrator.

For years, servers have been equipped with LSI Logic (or other) RAID controllers that have proven to be just as efficient as anything else at handling local storage. Now when you look at larger enterprise-scale companies, they are not going to want a server to manage their RAID. Instead they will keep the external storage managed externally, with special-purpose RAID controllers managing hundreds of terabytes to petabytes of data storage, apart from all the nodes in a cluster accessing that equipment.

But going back to the server, how practical is it to have a RoC implemented? With today’s level of high-speed computing, does it make that much of a difference if the RAID is accomplished on the chipset as opposed to in the operating system? If so, how easy is it to recover from data corruption or any other error? Unless you are setting up a small home or small business server, what if you wanted additional functionality such as snapshots, data deduplication and checksum validation? You still have to go to the operating system and have some sort of volume manager on top of the RoC-grouped volumes. No offense to Dot Hill, even though they were a direct competitor to one of my previous employers (Xyratex). According to their numbers posted on Google Finance, they have been struggling financially for at least the past 5 to 6 years, and this is a great opportunity for them. Still, it is my opinion that this would have been a valid technology back in 2001, not in 2010.

18. September 2009

Enterprise Storage Forum Article: RAID’s Days May Be Numbered

Filed under: Storage, SCSI — admin @ 07:50

Earlier this morning I ran into an interesting article written by Henry Newman, “RAID’s Days May Be Numbered.” While Mr. Newman highlights different reasons for his prediction, I tend to feel the same way. It is worth the read.

26. August 2009

Some exciting updates expected for Linux kernel 2.6.31

Filed under: Storage, File Systems, SCSI, Linux — admin @ 11:03

Recently I came across this article on h-online.com discussing some of the new features and functionality expected in the 2.6.31 Linux kernel. As I am usually more interested in data storage technologies, it was the file system and other storage concepts that drew my attention. I will only cover a few of the listed topics. You can read the full list of patches in the h-online link I posted above.

Some updates include a large patch for the btrfs file system which tunes it for greater performance. It is also noted that in this release btrfs will be less memory hungry and that the SSD mode has been improved. Early benchmarks comparing the standard and SSD modes have shown the early implementation of SSD mode to be less than ideal. I am interested to see this improvement, especially as Flash-based SSDs increase in usage and popularity.

During the development of btrfs I have spent more time observing the development process than taking it for early test spins. So when I make the following comment(s), I am not speaking directly from experience, and if I make any errors in my statement(s), I hope the reader will correct me. As we are still early in the development stage and it is too soon to tell, I wonder if btrfs will offer tuning of on-line volumes (as can be seen with ZFS). Most (if not all) modern Linux file systems are only capable of processing file system options at mount time and, in some cases, through the remount option when invoking mount again. In ZFS, by contrast, a character device node is available for management applications, which are capable of pulling real-time data and altering file system options on the fly. Here is a document with the diagram (reference page 10; sorry, I could only find a German copy of it; explanation found in the last bullet point of the section). If I wanted to disable/enable atime, compression or checksums, or alter quotas, on the entire storage pool and/or on specific mounted volumes, I can do so on the fly with a simple zpool/zfs command. I am curious whether btrfs will implement a similar feature, which can be extremely advantageous in storage administration environments.
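As a rough illustration of what on-the-fly tuning looks like, here is a minimal sketch. The dataset name tank/home and the 10G quota are hypothetical, and the script only prints the zfs commands unless DRY_RUN=0 is set on a host that actually has ZFS installed.

```shell
# Dry-run sketch: prints the zfs commands instead of executing them.
# Set DRY_RUN=0 on a host with ZFS to run them for real.
# "tank/home" is a hypothetical dataset name used only for illustration.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

tune_dataset() {
    dataset="$1"
    run zfs set atime=off "$dataset"                 # stop updating access times
    run zfs set compression=on "$dataset"            # compress newly written blocks
    run zfs set quota=10G "$dataset"                 # cap the space the dataset may use
    run zfs get atime,compression,quota "$dataset"   # read back the current values
}

tune_dataset tank/home
```

None of these property changes requires unmounting or remounting the dataset, which is the point being made above.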

Other patches include support for ext4 online defragmentation. I am surprised to see that ext4 is really starting to gain some ground. Fedora currently implements it as the default file system in their latest release, while Ubuntu provides it as an option during installation. It usually takes a while for a new file system to gain public trust and support.

Some other exciting patches include Fibre Channel Pass-Through support. I am curious to learn more about this functionality and if there is any relation (in functionality) to the SCST project hosted on SourceForge.

15. July 2009

Opinion: On the Future of Data Storage and RAID Technologies

Filed under: File Systems, Solaris, SCSI, Linux, Microsoft — admin @ 13:44

Please note that this is only a personal opinion of mine as I have been observing the growth and various decline of storage concepts within the data storage industry. The views of the reader may differ from my own which is why I would invite you to please post your opinions as a comment to this post.

One of the most volatile and yet needed industries is the data storage industry. As computing technologies become more cloud-centric and rely upon the web for business, productivity, education and even recreation, there is a constant push to increase capacities and, even more so, to increase I/O throughput. As a result of recent demands, our approach to these technologies needs to be re-evaluated. The primary focus of this article is on the future of data storage concepts and the limited life and functionality of RAID.

Back in 1987 when the idea of RAID was first conceived, the goal or vision was to be able to scale multiple drives into a single volume which was represented to a host as such while also offering a form of redundancy with a more sensitive magnetic platter-based disk technology. Flash forward to the present and we are still reliant upon the same technologies. Is that because RAID is so perfect or have we just grown too comfortable and are too afraid of change?

Hardware Vs. Software RAID 

There was a time when processing power was limited and it became advantageous to utilize external methods for creating and managing arrays of data storage, but as time progressed, this approach became increasingly insignificant. At least that is the case for the Small-to-Medium sized Business (SMB). For the last decade, a lot of effort has gone toward increasing the reliability, stability and feature set of software-based RAID. This has slowly been eating away at the hardware vendors, although it has rarely been noticeable.

These software implementations are integrated with methods of Logical Volume Management (with built in redundancy via RAID 1-6), Load Balancing/Multipathing capabilities, data encryption, along with the abilities to utilize incremental snapshot(s) over designated volumes. These software implementations include dynamic resizing, quota/permission management, enhanced copy-on-write file systems that perform very well along with routine checksums to correct noisy and silent data corruption; almost all of which can be managed while volumes are on-line. Some of these volume managers have the capability to export iSCSI & FCoE targets and can also be tuned to support FC targets.

To name a few you have ZFS (an all-in-one solution), Btrfs (still in development and under test), device-mapper / LVM2 / multipath-tools, mdadm, DRBD, etc. The list goes on. What is to stop an SMB from setting up an array of JBODs and (if more redundancy is needed) clustering a couple of Solaris / OpenSolaris or Linux servers to manage their software RAID while also exporting it via a file server or into a SAN? Note that Lustre support for ZFS is still in development. Realistically, most entry-level modular external RAID solutions don’t run on the latest and greatest of hardware components (as they are intended for a limited purpose and not to provide other hosting services). You will most likely achieve much greater performance with the software approach while also utilizing a much more efficient virtual memory manager (for enhanced caching) alongside a finely tuned scheduler.

On the enterprise end of computing you will find some very impressive storage solutions that are intended to take the workload of the enterprise environment. Companies such as Hitachi Data Systems (HDS) have been doing an excellent job of providing high-quality and well-performing storage solutions that are also easily manageable. Other companies have resorted to being a little creative in order to gain some market share with the SMB and larger companies. Notable among them are NetApp, Data Domain and even Cleversafe.

Earlier I found an interesting link differentiating the positives and negatives of both hardware and software RAID implementations. It should be noted that times have changed and some of the key points highlighted are no longer an issue. For instance, under the category /boot partition, this seems to no longer be an issue with at least ZFS.

Enter the SSD

In more recent years, the Flash-based Solid State Drive (SSD) has been entering enterprise markets. This is a result of efforts from such notable providers as Sun Microsystems, etc. Currently the percentage of SSD usage in the enterprise is somewhat minimal as there is a limit on the maximum capacities of the drives. This may soon change as in Q3 of 2009, PureSilicon will release their Nitro 1TB SSD drive. The throughput and performance seem ideal in arenas where greater speeds are needed, but the technology introduces additional handicaps (in the form of write operations and a limited cell life) which most environments and some manufacturers have a difficult time accommodating. To combat the limited cell life, vendors have implemented their own methods of wear leveling, transparent to the host. With this concept, the same data cell, when accessed and written to multiple times, will not get written to the exact same location; instead, through “intelligent” built-in firmware, the data will get written to another cell on the drive. To the operating system, it is still the same “sector” location. While there is very little seek latency (sequential and random), write operations take a huge hit, especially with smaller I/O transfer sizes, since the flash medium typically erases and rewrites a 128K page at a time.
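To put a number on the small-write penalty, here is some back-of-the-envelope arithmetic. The 128K erase/rewrite unit comes from the paragraph above; the 4K host write size is an assumption chosen for illustration, and real drives will soften this worst case with caching and write coalescing.

```shell
# Worst-case write amplification when a small host write forces a full
# erase/rewrite of the surrounding flash unit.
erase_unit_kb=128   # erase/rewrite unit quoted in the text
host_write_kb=4     # assumed size of a small host write
amplification=$(( erase_unit_kb / host_write_kb ))
echo "A ${host_write_kb}K write can cost a ${erase_unit_kb}K rewrite: ${amplification}x in the worst case"
```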

SSD Tuning

With the recent hype of Flash-based SSDs, many vendors and UNIX/Linux distributions have been writing file systems tuned to perform extremely well on SSDs (and limit the impact of these handicaps). For example, Sun Microsystems’ ZFS (available on Solaris, OpenSolaris, MacOS X [read-only], FreeBSD and Linux [over FUSE]) recently added tunable support for SSDs in the release versions for Solaris & OpenSolaris, while the development of Btrfs for Linux has done the same. In contrast, the Microsoft-developed NTFS does not offer such features or functionality. In fact the file system has remained somewhat unchanged over the course of the years and is just as inferior now as it was when it was first released as a replacement for the FAT series of file systems. I wrote an entire post explaining why the NTFS file system is not well suited for today’s methods of computing here.

It should be noted that Microsoft’s Windows 7 has been tuned for the SSDs that are to be provided on netbooks. What this means, I do not know, and what “tuned” involves is still unclear. You can read some of that information here. The only reason for the lack of changes in NTFS is to preserve backwards compatibility. This approach limits the ability to update an existing server’s NTFS module (if it is not running Windows 7) should it need to serve backend storage utilizing SSD media.

The Impact on RAID Technologies

As SSDs become more popular the advantages to using RAID are reduced, where the only benefits are gained from a simple stripe in a RAID 0 or mirroring to a backup array within a SAN or other form of network using RAID 01 (not to be confused with a RAID 10); just in case access to the first fails for whatever reason. This is where DRBD would come in real handy. As I briefly mentioned earlier, the whole concept of this form of redundancy was dependent upon the problematic nature of a magnetic disk device; where failures were imminent. And for those who are concerned with a method of error detection for both silent and noisy data corruptions, the majority of RAID implementations (both hardware and software) do not validate the data like the ZFS or Btrfs checksum implementation.

Changes in Protocol Layers?

With the popularity of SSD technologies growing and their costs falling, the one drawback that is setting manufacturers and consumers back is the limitation imposed by the protocols that they are working with. Today, Fibre Channel, SAS and SATA are not capable of handling full SSD speeds and serve only as a bottleneck to the technology. There have been recent attempts from vendors such as Fusion-io and even PureSilicon to rely on other interfaces such as PCI Express (PCI-E). Capable of handling up to 1 GB per second, it only seems natural for these vendors to move in that direction. I anticipate that others will shortly follow. Fibre Channel and SAS may continue to serve the SAN (and with the appropriate load balancing mechanisms configured, they will perform well) but when it comes to the drive within the chassis, I expect to see more PCI Express in the near future. But who knows, with the recent drop in prices for 10Gb Ethernet or the high throughput offered by InfiniBand, things may be moving in another direction altogether.

In conclusion, I predict that in five years’ time we will start to see some huge and very interesting changes. I am looking forward to it.

26. June 2009

Hard Rectangular Drives (HRD)

Filed under: Storage, SCSI, Misc. — admin @ 13:55

I do not know all the details on this but I found the concept extremely interesting. It is a Hard Rectangular Drive (HRD), which is unique in terms of design and functionality. You can read more about it here and here. This technology is being developed by DataSlide. The first article states the following:

DataSlide says the new technology would find first use in a PCIe-based card format designed for use in Oracle database applications. The PCIe format is necessitated by the extremely high performance of HRD; like RAMDisks and high-end NAND SSDs, HRD would overwhelm a SATA or SAS interface. The cost of such a device is unknown, but its capacity would be comparable to that of a modern HDD.

29. April 2009

New Article: Linux Storage Management

Filed under: System, Virtual Memory, Storage, File Systems, SCSI, Linux — admin @ 09:22

To those who are interested in the topic of Linux Storage Management, planned for the 3/2009 issue and hitting the shelves July-September is my article of the same name in Linux+ magazine. I do not know how 3/2009 equates to July-September, but that is what I have been told. It is a 7 page article and gets into some great detail with storage management.

21. March 2009

device-mapper (dm): working with multipath-tools. Part 1

Filed under: Storage, SCSI, Linux — admin @ 10:33

Device-mapper (hereafter, dm) is one of the best collections of device drivers that I have ever worked with. It brings high availability, flexibility and more to the Linux 2.6 kernel. Device-mapper is a Linux 2.6 kernel infrastructure that provides a generic way to create virtual layers on top of a block device while supporting striping, mirroring, snapshots, concatenation, multipathing, etc. While many modules are built on top of device-mapper, the focus of this article is on multipath-tools. Note that I will be using the terms multipath, multipath-tools and dm-multipath interchangeably to signify the same package. Also note that dm-multipath is the name of the repackaged multipath-tools redistributed by Red Hat in their Advanced Server Linux distribution.

Device-mapper multipath provides the following features (taken from HP dm-multipath reference guide):

  • Allows the multivendor Storage RAID systems and host servers equipped with multivendor Host Bus Adapters (HBAs) redundant physical connectivity along the independent Fibre Channel fabric paths available
  • Monitors each path and automatically reroutes (failover) I/O to an available functioning alternate path if an existing connection fails
  • Provides an option to perform fail-back of the LUN to the repaired paths
  • Implements failover or failback actions transparently without disrupting applications
  • Monitors each path and notifies if there is a change in the path status
  • Facilitates the load balancing among the multiple paths
  • Provides CLI with display options to configure and manage Multipath features
  • Provides all Device Mapper Multipath features support for any LUN newly added to the host
  • Provides an option to have customized names for the Device Mapper Multipath devices
  • Provides persistency to the Device Mapper Multipath devices across reboots if there are any change in the Storage Area Network
  • Provides policy based path grouping for the user to customize the I/O flow through specific set of paths

Installing multipath-tools

Installing multipath-tools is usually as simple as going to your distribution’s repository, finding the package and selecting it to be installed. You can always download it and build it from source, but most likely your distribution has it in its repository. Again, note that multipath-tools runs on top of device-mapper, so you will need device-mapper installed in order to utilize multipath-tools.

Configuring multipath-tools

The two main or key components to manage and monitor in the multipath-tools package are the multipath.conf file and also the multipathd daemon. Both serve vital functions to help load a configuration and monitor it. Sometimes after the multipath-tools package has been installed, the multipath.conf file could be found in /etc. If not you can always run a search for an existing template, which in some distributions can exist in the following directories:

Redhat –

$ cd /usr/share/doc/device-mapper-multipath-<version no.>/
$ cp multipath.conf.defaults /etc/multipath.conf

SuSE –

$ cd /usr/share/doc/packages/multipath-tools/
$ cp multipath.conf.synthetic /etc/multipath.conf

To edit the multipath.conf file simply open it up in a text editor:

$ vim /etc/multipath.conf

This is just an example. Your multipath.conf file may be configured differently to accommodate certain features and limitations of the external data storage that you are working with.

# Blacklist all devices by default. Remove this to enable multipathing
# on the default devices.
#devnode_blacklist {
#       devnode "*"
#}
##
## This is a template multipath-tools configuration file
## Uncomment the lines relevent to your environment
##
defaults {
multipath_tool  "/sbin/multipath -v0"
udev_dir        /dev
polling_interval 10
default_selector        "round-robin 0"
#       default_path_grouping_policy    multibus
default_path_grouping_policy    failover
default_getuid_callout  "/sbin/scsi_id -g -u -s /block/%n"
default_prio_callout    "/bin/true"
#       default_features        "0"
rr_min_io               100
path_checker            tur
failback                3
no_path_retry      2
}
devnode_blacklist {
wwid 26353900f02796769
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z][0-9]*"
devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

In older versions of this multipath.conf file there is a known typo. In the blacklist section make sure that you correct it:
devnode "^hd[a-z][[0-9]*]" should read devnode "^hd[a-z][0-9]*"

Please understand what you set before activating the dm multipathed disk devices. For example, the default_path_grouping_policy is set to failover and not multibus. That means that regardless of the number of LUN paths I have accessing the same logical volume, only one remains active at a time. If the active path were to fail, there would be a failover to a secondary defined path. Multibus simply sends I/O requests across all paths which are marked as active unless failed. My path_checker is a Test Unit Ready (TUR), which is a low-level SCSI command (opcode 0x00) to validate that the SCSI unit is ready to accept I/O requests. Also supported as path checkers are readsector0 and directio. Here is a guide to some of these field definitions.
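Because the template carries both a commented-out and a live default_path_grouping_policy line, it is easy to misread which one is in effect. The following is a small sketch (the function name active_policy is my own) that prints only the uncommented value; it is fed a two-line sample here rather than a real file.

```shell
# Print the uncommented default_path_grouping_policy from a multipath.conf
# read on standard input. A commented line has "#" (or "#default_...") as
# its first field, so it never matches.
active_policy() {
    awk '$1 == "default_path_grouping_policy" { print $2 }'
}

# Sample input mirroring the two lines from the example configuration.
sample_conf='#       default_path_grouping_policy    multibus
default_path_grouping_policy    failover'

printf '%s\n' "$sample_conf" | active_policy
```

Against a real system it would be run as: active_policy < /etc/multipath.conf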

Another extremely important field to this multipath.conf file is the blacklist. This tells the multipath-tools module to omit any device with the following characteristics, when scanning and grouping devices into device-mapper for multipathing.
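To get a feel for what the devnode patterns catch, here is a sketch that runs a few device names through two of the expressions from the example above using grep -E. This only approximates how multipath-tools itself evaluates the blacklist, the cciss pattern is left out, and the device names in the loop are arbitrary samples.

```shell
# Return success if the device name matches one of the blacklist patterns.
is_blacklisted() {
    # Patterns copied from the devnode lines in the example configuration.
    printf '%s\n' "$1" | grep -Eq \
        -e '^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*' \
        -e '^hd[a-z][0-9]*'
}

for dev in sda dm-0 hda1 loop3 sdb; do
    if is_blacklisted "$dev"; then
        echo "$dev: skipped (blacklisted)"
    else
        echo "$dev: considered for multipathing"
    fi
done
```

The sd devices fall through because none of the listed prefixes match them, which is what lets the SAN LUNs be grouped while local IDE disks, loop devices and existing dm devices are ignored.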

There is so much more to this multipath.conf and I know I am only touching the surface of it but there is a wealth of information out there to help understand the vast amount of details buried within. I must admit though, that the coolest feature is that you can define specific settings for device specific environments. If you are working with a specific model of Compaq, Mylex or even Xyratex storage devices, these can be defined separately without interfering with any other connected storage device. Here is an example taken from the default multipath.conf file:

#       device {
#               vendor                  "COMPAQ  "
#               product                 "HSV110 (C)COMPAQ"
#               path_grouping_policy    multibus
#               getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
#               path_checker            readsector0
#               path_selector           "round-robin 0"
#               hardware_handler        "0"
#               failback                15
#               rr_weight               priorities
#               no_path_retry           queue
#       }

Obviously your definition would not be commented out.

Earlier I had mentioned the multipathd daemon. You can start or stop the daemon in the following ways:

Redhat -

$ service multipathd start
$ service multipathd stop

SuSE -

$ /etc/init.d/boot.multipath start
$ /etc/init.d/multipathd start
$ /etc/init.d/boot.multipath stop
$ /etc/init.d/multipathd stop

Note that multipathd will not function appropriately until you have all the appropriate modules loaded. In my case it is dm_round_robin, dm_mirror, dm_multipath and dm_mod.
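A quick way to check this before starting the daemon is to compare the required names against the loaded module list. The sketch below takes text in /proc/modules format (module name in the first column); the function name missing_modules and the simplified sample listing are my own, and drivers compiled directly into the kernel will not show up in /proc/modules at all.

```shell
# Print each required module that is absent from the given listing.
missing_modules() {
    loaded="$1"   # text in /proc/modules format
    for mod in dm_mod dm_multipath dm_round_robin dm_mirror; do
        if ! printf '%s\n' "$loaded" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            echo "$mod"
        fi
    done
}

# Simplified sample listing in which dm_mirror has not been loaded.
sample='dm_multipath 17164 1 dm_round_robin
dm_round_robin 2717 1
dm_mod 56537 3 dm_multipath'

missing_modules "$sample"
```

On a live host the call would be missing_modules "$(cat /proc/modules)", followed by a modprobe for each name it prints.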

Scanning the SCSI bus for multipath devices

To have the utility scan or update the nodes on the SCSI bus/channel(s), type the following command:

$ multipath -v2
create: 32000000bb55555cd
[size=27 GB][features="0"][hwhandler="0"]
\_ round-robin 0
\_ 30:0:0:0 sda 8:0   [ready]
\_ round-robin 0
\_ 31:0:0:0 sde 8:64  [ready]
create: 32001000bb55555cd
[size=27 GB][features="0"][hwhandler="0"]
\_ round-robin 0
\_ 30:0:0:1 sdb 8:16  [ready]
\_ round-robin 0
\_ 31:0:0:1 sdf 8:80  [ready]
create: 32002000bb55555cd
[size=92 GB][features="0"][hwhandler="0"]
\_ round-robin 0
\_ 30:0:0:2 sdc 8:32  [ready]
\_ round-robin 0
\_ 31:0:0:2 sdg 8:96  [ready]
create: 32003000bb55555cd
[size=92 GB][features="0"][hwhandler="0"]
\_ round-robin 0
\_ 30:0:0:3 sdd 8:48  [ready]
\_ round-robin 0
\_ 31:0:0:3 sdh 8:112 [ready]

Everything gets grouped according to WWID into a dm device.  Multiple LUN mappings of the same LD will be given an alias (read below for these aliases).
To kill this mapping table you can simply run:

$ dmsetup remove_all

or by removing each individual entry:

$ dmsetup remove /dev/mapper/32003000bb55555cd

Once the mapping table is created you should be able to look in the following 3 paths to find a list of all dm devices either written as a dm-x device or under its WWID (World Wide Identifier).

$ ls /dev/dm-*
/dev/dm-0  /dev/dm-1  /dev/dm-2  /dev/dm-3
$ ls /dev/mapper/
32000000bb55555cd  32001000bb55555cd  32002000bb55555cd  32003000bb55555cd  control
$ ls /dev/mpath/
32000000bb55555cd  32001000bb55555cd  32002000bb55555cd 32003000bb55555cd

Pick one of the paths and format the device as you normally would with any other raw Linux device to be mounted.

$ mke2fs -F /dev/dm-0 ......

Mount the device and verify that it is mounted by typing df at the command line. Now to see all your active paths and monitor them during the test procedure you can type either multipath -ll or multipath -l at the command line.

$ multipath -ll
32002000bb55555cd
[size=92 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [active]
\_ 30:0:0:2 sdc 8:32  [active][ready]
\_ round-robin 0 [enabled]
\_ 31:0:0:2 sdg 8:96  [active][ready]
32001000bb55555cd
[size=27 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [active]
\_ 30:0:0:1 sdb 8:16  [active][ready]
\_ round-robin 0 [enabled]
\_ 31:0:0:1 sdf 8:80  [active][ready]
32000000bb55555cd
[size=27 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [active]
\_ 30:0:0:0 sda 8:0   [active][ready]
\_ round-robin 0 [enabled]
\_ 31:0:0:0 sde 8:64  [active][ready]
32003000bb55555cd
[size=92 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [active]
\_ 30:0:0:3 sdd 8:48  [active][ready]
\_ round-robin 0 [enabled]
\_ 31:0:0:3 sdh 8:112 [active][ready]

The results show that there are two LUN paths representing a single logical volume. One is set to active while the other is enabled and ready until the path fails over.

Note that a lot of individuals make the mistake of formatting and mounting the sd devices. This is not allowed when using device-mapper. In the listing above, sdc and sdg together present dm device dm-0, or WWID 32002000bb55555cd. The dm devices are virtual devices labeled to represent multiple LUN mappings to the same LD, so you must use the dm labels as opposed to the sd ones.
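To see at a glance which sd nodes sit underneath which multipath device (and therefore which ones to leave alone), the multipath -ll listing can be reduced to WWID/sd pairs. This is a sketch that assumes the output format shown above, with a bare WWID on its own line and no alias; the function name map_paths is my own, and the sample is one map copied from the listing.

```shell
# Reduce "multipath -ll" style output (read on stdin) to "WWID sdX" pairs.
map_paths() {
    awk '
        # A line holding a single field is taken to be the WWID of a new map.
        NF == 1 { wwid = $1; next }
        # Path lines carry host:channel:id:lun in the second field.
        $2 ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/ { print wwid, $3 }
    '
}

sample='32002000bb55555cd
[size=92 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [active]
\_ 30:0:0:2 sdc 8:32  [active][ready]
\_ round-robin 0 [enabled]
\_ 31:0:0:2 sdg 8:96  [active][ready]'

printf '%s\n' "$sample" | map_paths
```

On a live host the same function would be fed with: multipath -ll | map_paths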

Stay tuned for Part 2. Whenever that is going to be.

31. January 2009

Updating SCSI targets while in a production environment.

Filed under: Storage, Solaris, SCSI, Linux, UNIX — admin @ 08:07

It still amazes me to see storage administrators bringing the same Microsoft Windows mentality to the UNIX and Linux environments. That is, after changes to a configuration are made “reboot the console to view all changes.” Now while Microsoft Windows does a fairly decent job of updating any changes made to the SCSI Subsystem, UNIX and GNU/Linux still handle it somewhat differently. Rebooting the console should be the LAST thing anybody does. These operating systems are so modular that in most cases there is absolutely no need to reboot; unless you have made changes to the kernel. If you are working in a production environment, down time can mean lost time which results in lost revenue. So when you update your SAN, how do you manage your storage configuration changes on your GNU/Linux 2.6 and Sun Solaris hosts?

GNU/Linux

There is a noteworthy variable in the SCSI Subsystem of the Linux 2.6 kernel which some may find problematic, although I believe that this is how it should be. That variable lies in the Lower-Layer of the Subsystem where the Host Bus Adapter (HBA) modules reside. While it is true that the Linux 2.6 kernel supports hotplugging, which includes SCSI devices, the HBA modules are designed in a way which would lead a novice storage administrator to believe otherwise.

For example, let us say I am working with Fibre Channel (FC) devices and I use a Qlogic, Emulex or LSI FC HBA. I have Logical Units (LU) mapped to the Fibre Channel Node Ports on the host’s HBA. So when I insert my modules for a Qlogic qla2340 (qla2300/qla2xxx), all Logical Unit Numbers (LUN) are recognized by the SCSI Subsystem and immediately udev assigns them the appropriate node names (/dev/sda, /dev/sdb, etc.). At least through the FC HBA, if the LUN mappings change and I add/remove devices, the HBA will not report any changes to the host and therefore the SCSI Subsystem is not updated. There are a few methods to getting the HBA to report changes to the GNU/Linux host, one of which is a reboot (which provides the same functions as the next method). The second is to reload the module and have it report the latest LUN mapping(s) to the SCSI Subsystem while the third, being the most appropriate of methods, does not require any downtime. It takes a simple command. To add a device:

echo "scsi add-single-device 1 2 3 4" > /proc/scsi/scsi

And to remove it:

echo "scsi remove-single-device 1 2 3 4" > /proc/scsi/scsi

Where 1=host; 2=channel; 3=id; 4=lun.
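Since a mistyped address here acts on the wrong target, a small wrapper that validates its arguments before anything reaches /proc/scsi/scsi can help. The sketch below (the function name scsi_proc_cmd is hypothetical) only builds and prints the command string; redirecting it into /proc/scsi/scsi is left to the administrator, as root, on a kernel that provides that file.

```shell
# Build the command string for /proc/scsi/scsi after checking that the
# action is known and that host, channel, id and lun are all numeric.
scsi_proc_cmd() {
    action="$1"
    [ "$#" -gt 0 ] && shift
    case "$action" in
        add|remove) ;;
        *) echo "usage: scsi_proc_cmd add|remove host channel id lun" >&2
           return 1 ;;
    esac
    [ "$#" -eq 4 ] || { echo "expected: host channel id lun" >&2; return 1; }
    for n in "$@"; do
        case "$n" in
            ''|*[!0-9]*) echo "not a number: $n" >&2; return 1 ;;
        esac
    done
    echo "scsi ${action}-single-device $1 $2 $3 $4"
}

scsi_proc_cmd add 1 2 3 4
# On a real host, as root:  scsi_proc_cmd add 1 2 3 4 > /proc/scsi/scsi
```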

In the SCSI portion of the kernel source, the file drivers/scsi/scsi_proc.c has a function routine that takes these inputs and after parsing them, it will eventually verify that the target being added/removed is a valid one and the action is then performed. That function routine is:

static ssize_t proc_scsi_write(struct file *file, const char __user *buf, size_t length, loff_t *ppos){ ... }

Sounds simple, right? It is. No reboot or re-insertion of the module is necessary. Now I mentioned earlier that I prefer this method over a more dynamic approach as is seen in Microsoft Windows. Allow me to explain. When you are hosting storage for an enterprise environment, anything outside of a static configuration can produce hazardous results. A great example was when I was working with LSI Logic to correct an unimplemented functionality which allowed for the disabling of hotplugging in their Serial Attached SCSI (SAS) HBA device drivers. To my knowledge those patches I submitted are still implemented to this day. Without the administrator’s knowledge, whenever the storage configuration gets updated, even from flaky symptoms (e.g. a drive drops offline and comes back online again), it can bring down an entire server.

Let us say I have 2 LUNs mapped which udev designated as /dev/sda and /dev/sdb. I mount these devices as /mnt/mnt1 and /mnt/mnt2. Now let us say that the hotplug feature incorporated into the HBA’s device driver is enabled and for some reason the LUN that has been allocated to /dev/sda drops offline for a few seconds. Who knows, the external storage controller may be acting up. It happens all too frequently, which is why Multipathing with Failover capabilities is a must. The mount path associated with that device (/mnt/mnt1) is still mounted and holds /dev/sda, preventing udev from removing the node. Meanwhile the SAS HBA realizes that a “new” drive (i.e. the drive that momentarily dropped offline) has been recognized and goes through the usual process to make the device accessible by the host. Now, wait a minute, /dev/sda is still taken and mounted to /mnt/mnt1. What happens now? A new device node is allocated, /dev/sdc, and the path to the drive changes. The /mnt/mnt1 mount must be removed and /dev/sdc would have to be mounted there instead. But the administrator still does not have any knowledge of the change, at least not until he/she reviews the kernel logs and notices nothing but SCSI disk I/O errors from attempts to write to the original node.

Now there are ways around this, namely by working with udev so that devices with specific attributes lock to a specific device node. When hotplugging is not a feature, or it is disabled, and a device drops offline for a quick moment, no changes to the configuration are reported to the SCSI layer, and when the device comes back online, it resumes its original role.
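
To make the renaming hazard concrete, here is a minimal Python sketch of the /dev/sda scenario. It is only a toy model, not how the kernel or udev is implemented: it assumes names are handed out as the first free /dev/sdX letter, and the WWID strings and helper names are made up for illustration.

```python
import string

def next_free_node(nodes):
    """Return the first /dev/sdX name that is not already taken (toy model)."""
    for letter in string.ascii_lowercase:
        node = f"/dev/sd{letter}"
        if node not in nodes:
            return node
    raise RuntimeError("ran out of single-letter sd names")

def drop_and_readd(wwid, nodes, mounts):
    """Model a LUN dropping offline and being rediscovered with hotplug enabled."""
    mounted = set(mounts.values())
    for node, owner in list(nodes.items()):
        if owner == wwid:
            if node in mounted:
                nodes[node] = None   # node is pinned by the mount, but now stale
            else:
                del nodes[node]      # unmounted node is released for reuse
    new_node = next_free_node(nodes)
    nodes[new_node] = wwid
    return new_node

# Hypothetical identifiers; a real setup would key on the LUN's WWID or serial.
nodes = {"/dev/sda": "wwid-lun0", "/dev/sdb": "wwid-lun1"}
mounts = {"/mnt/mnt1": "/dev/sda", "/mnt/mnt2": "/dev/sdb"}

new_node = drop_and_readd("wwid-lun0", nodes, mounts)
print(new_node)                     # the LUN now sits behind a different node
print(nodes[mounts["/mnt/mnt1"]])   # None: the mount still points at the stale node
```

The identity of the LUN never changed in this model, only the letter did, which is why pinning a node to device attributes sidesteps the problem.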

Sun Solaris

Now working with Solaris is a bit different. Let us now say that you changed your SAN configuration and whatever has been mapped to your Sun box’s FC HBAs has been modified. Sometimes it is as simple as running 1 command. At the command prompt you would first type:

devfsadm

This will update all changes in the SCSI layer. So now when you type format, your new devices will appear. And what if they don’t? The very handy utility luxadm comes into play. First list all of your HBA ports and their status:

luxadm -e port

The command traverses the /devices path (similar to the Linux sysfs /sys path) and produces a list of results that looks similar to this:

/devices/pci@7,600000/SUNW,qlc@2/fp@0,0:devctl        CONNECTED

Now what you would want to do is force a LIP (Loop Initialization Primitive, in FC terminology) through each FC node.

luxadm -e forcelip /devices/pci@7,600000/SUNW,qlc@2/fp@0,0:devctl

Type format again, and now you SHOULD see the added disk device(s).
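
If there are many HBA ports, typing each path by hand gets tedious. The following is a rough Python sketch that picks the connected ports out of a port listing and builds the matching forcelip commands. It assumes the listing looks like the single sample line shown above (path, whitespace, status); the second sample line and both helper names are made up for illustration, and nothing is executed here.

```python
def connected_ports(port_listing):
    """Return device paths whose status column reads exactly CONNECTED."""
    paths = []
    for line in port_listing.splitlines():
        fields = line.split(None, 1)   # path first, then the status text
        if len(fields) == 2 and fields[1].strip() == "CONNECTED":
            paths.append(fields[0])
    return paths

def forcelip_commands(paths):
    """Build one 'luxadm -e forcelip <path>' argument list per port."""
    return [["luxadm", "-e", "forcelip", path] for path in paths]

# The first line is the sample from the post; the second line is a
# hypothetical example of a port that is not connected.
sample = """\
/devices/pci@7,600000/SUNW,qlc@2/fp@0,0:devctl        CONNECTED
/devices/pci@7,600000/SUNW,qlc@2,1/fp@0,0:devctl      NOT CONNECTED
"""

for argv in forcelip_commands(connected_ports(sample)):
    print(" ".join(argv))   # on a real Solaris host, hand argv to subprocess.run()
```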

30. January 2009

The scsigen log: Making wishes come true.

Filed under: Storage, SCSI, Linux, Microsoft, UNIX — admin @ 09:50

Earlier today I came across a blog entry from an employee of Sun Microsystems. He addressed the frustrations of many people worldwide when it comes to running much-needed applications that were built only for a Microsoft Windows environment. As some of you may already know, the reality is that not everyone runs Microsoft Windows as their operating system, and some of those same people may have a dual-boot setup but rarely, if ever, boot into the Windows partition unless it is absolutely necessary.

The author spoke specifically about a topic I too have had many problems with: protocol analyzers. Working in the storage industry, a SCSI, Fibre Channel or SAS analyzer is a must-have tool. The problem lies with the fact that the providers of such solutions (LeCroy, Finisar, etc.) develop the hardware management and trace viewing interfaces for Windows ONLY! No Mac OS X. No BSD or Linux versions. No Solaris. Why!?! As I had mentioned in numerous earlier posts, IDC reports show that UNIX + Linux combined own more of the enterprise market than Windows. Are these companies just too lazy, or do they lack the skill set to provide something for anything other than Windows? Do they not see any money in it? If I work for a UNIX-only solutions provider, the last thing I would want to do is install a Microsoft Windows console to do any work.

Well, rants like this make me all the happier to be able to provide the solutions that people like this author are looking for; that is, SCSIGen v2.0, which is still in development and coming to a POSIX compliant platform near you!

26. January 2009

Tuning your Microsoft Windows host for your storage environment.

Filed under: Storage, SCSI, Microsoft — admin @ 20:13

It pains me to write this post, but being in the storage industry I have had significant exposure to Microsoft operating systems. If you have read any of my previous posts, you would know how I feel about Microsoft in the enterprise arena, but if you are in the unfortunate situation of having to work with Microsoft Windows on top of your storage equipment, here are some helpful pointers to aid in tuning that node. Also please reference my earlier post on tuning your Linux 2.6 host for detailed information on some storage concepts. The following Windows variables are accessed through the Windows registry.

The SCSI target timeout value can be viewed and modified at HKLM\System\CurrentControlSet\Services\Disk\TimeoutValue. It takes a value of 1 to 255 seconds. Off the top of my head I cannot recall what this value defaults to, but believe me when I say this: there are times when you may need to go as high as 255 to allow for the return of an I/O transfer. When I used to work for Xyratex, their latest series of storage RAID controllers (based on the nStor Wahoo technology) did not have an appropriately implemented cache. It was more of an intelligent buffer which temporarily stored data contiguously until the scheduler made its way to the scene. So when you were configured with multiple LUN devices mapped to your host and initiated I/O to those same targets, it could have taken as long as 255 seconds for the I/O to return to the SCSI layer and stop the countdown to the timeout. If it were to time out, as is the case with any OS, an Abort Sequence would be initiated to abort the I/O transfer.
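
For those who would rather script the change than click through regedit, here is a small Python sketch that validates the range and builds a reg add command line for the key above. It assumes the value is stored as a REG_DWORD; the helper name is my own, and the sketch only prints the command rather than running it.

```python
DISK_KEY = r"HKLM\System\CurrentControlSet\Services\Disk"

def timeout_reg_command(seconds):
    """Return the argv for a 'reg add' call that sets the disk TimeoutValue."""
    if not 1 <= seconds <= 255:
        raise ValueError("TimeoutValue must be between 1 and 255 seconds")
    return ["reg", "add", DISK_KEY, "/v", "TimeoutValue",
            "/t", "REG_DWORD",        # assumption: the value is a DWORD
            "/d", str(seconds), "/f"]

# Print the command for the 255 second worst case discussed above.
print(" ".join(timeout_reg_command(255)))
```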

I mentioned the MaximumSGList in a comment to an earlier post. This is found at HKLM\System\CurrentControlSet\Services\[HBA Name]\Parameters\Device\MaximumSGList. The supported values are 16 to 255. Setting it to 255 gives the maximum transfer size of 1MB, and if you go over 255 (like 256) it will default to 64KB.
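
To get a feel for what the values in between mean, here is a rough Python sketch of the arithmetic. It rests on an assumption that is not stated in this post: the commonly cited rule of thumb that each scatter/gather entry maps one 4KB page and one entry is held back, so the transfer size is roughly (MaximumSGList - 1) x 4096 bytes. Treat the formula and the function name as illustrative and check your HBA vendor's documentation.

```python
PAGE_SIZE = 4096             # assumed 4KB memory pages
DEFAULT_BYTES = 64 * 1024    # the 64KB fallback for values over 255

def max_transfer_bytes(maximum_sg_list):
    """Estimate the largest transfer for a given MaximumSGList (rule of thumb)."""
    if maximum_sg_list > 255:
        return DEFAULT_BYTES
    if maximum_sg_list < 16:
        raise ValueError("values below 16 are outside the supported range")
    return (maximum_sg_list - 1) * PAGE_SIZE

for value in (17, 129, 255, 256):
    print(value, max_transfer_bytes(value) // 1024, "KB")
```

Under this rule of thumb, 255 works out to 1016KB, which is the roughly 1MB ceiling, and 17 works out to exactly 64KB.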

If you are a developer and need more bytes allocated for sense data (on the return of the SCSI command), you would have to modify the following field: HKLM\System\CurrentControlSet\Enum\[Bus]\[DeviceID]\[Device]\DeviceParameters\ScsiPort\TotalSenseDataBytes. According to Microsoft, the supported values are: “Between 18 and 255 for SCSI Port. Storport always uses 255.”

Some other extremely important variables are actually tuned on the HBA itself. For example, the queue depth, which is sometimes referred to as throttling (by QLogic), controls the number of outstanding I/O transfers a LUN is limited to. Other variables include various retry counts and more. So please review the documentation for the HBA you are using.

Caution needs to be taken when modifying any of these values. You will need to understand the environment the host is being configured in, and that includes the I/O profile coming from that same host(s). If you set the queue depth too high and the host is sending out more I/O transfers than the storage controller can handle, the storage controller will either abort those commands or, depending on how it is configured, send a BUSY status back to the initiator.
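
A quick back-of-the-envelope check can help here. The Python sketch below assumes a simple worst case in which every host keeps every LUN's queue full at the same time; the host count, LUN count and controller limit are hypothetical numbers, and the real limit has to come from your storage controller's documentation.

```python
def worst_case_outstanding(hosts, luns_per_host, queue_depth):
    """Commands in flight if every LUN queue on every host is full."""
    return hosts * luns_per_host * queue_depth

def max_safe_queue_depth(hosts, luns_per_host, controller_limit):
    """Largest per-LUN queue depth that stays within the controller limit."""
    return controller_limit // (hosts * luns_per_host)

# Hypothetical environment: 4 hosts, 8 LUNs each, controller queues 512 commands.
hosts, luns, limit = 4, 8, 512
print(worst_case_outstanding(hosts, luns, 32))   # 1024, twice what the controller takes
print(max_safe_queue_depth(hosts, luns, limit))  # 16
```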

So there! I did it. <sigh>
