dd if=/dev/random of=/dev/blog

15. May 2010

Article ZFS data integrity testing and more random ZFS thoughts.

Filed under: OpenSolaris, File Systems, Solaris — admin @ 10:08

Earlier this week I came across a blog posting about data integrity testing on ZFS titled “ZFS data integrity tested.” It was a few months old, from Robin Harris’ blog Storage Bits. I guess the most exciting part was seeing Sun Microsystems’ claims validated: ZFS was able to correct data corruption even with errors injected into both the disk and memory. ZFS continues to prove its worth on enterprise class systems and applications.
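To make the idea of checksum-driven self-healing a little more concrete, here is a toy sketch in C. It is not ZFS source code: the two-way mirror structure, the block size and all function names are made up for illustration. The only parts borrowed from ZFS are the general idea of a Fletcher-4 style checksum (four running sums over 32-bit words) and the fact that the checksum is kept apart from the data it protects.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 128  /* toy block size: 128 x 32-bit words (assumption) */

struct cksum { uint64_t a, b, c, d; };

/* Fletcher-4 style checksum: four running sums over 32-bit words. */
struct cksum block_checksum(const uint32_t *w, size_t n)
{
    struct cksum s = { 0, 0, 0, 0 };
    for (size_t i = 0; i < n; i++) {
        s.a += w[i];
        s.b += s.a;
        s.c += s.b;
        s.d += s.c;
    }
    return s;
}

int cksum_equal(struct cksum x, struct cksum y)
{
    return x.a == y.a && x.b == y.b && x.c == y.c && x.d == y.d;
}

/* Hypothetical two-way mirror; the checksum lives apart from both copies. */
struct mirror {
    uint32_t copy[2][BLOCK_WORDS];
    struct cksum expected;
};

void mirror_write(struct mirror *m, const uint32_t *data)
{
    memcpy(m->copy[0], data, sizeof m->copy[0]);
    memcpy(m->copy[1], data, sizeof m->copy[1]);
    m->expected = block_checksum(data, BLOCK_WORDS);
}

/*
 * Returns 0 and fills 'out' from the first copy that verifies, rewriting
 * the other copy if it fails verification. Returns -1 if neither verifies.
 */
int mirror_read(struct mirror *m, uint32_t *out)
{
    for (int i = 0; i < 2; i++) {
        if (!cksum_equal(block_checksum(m->copy[i], BLOCK_WORDS), m->expected))
            continue;
        memcpy(out, m->copy[i], sizeof m->copy[i]);
        int other = 1 - i;
        if (!cksum_equal(block_checksum(m->copy[other], BLOCK_WORDS), m->expected))
            memcpy(m->copy[other], m->copy[i], sizeof m->copy[i]);
        return 0;
    }
    return -1;
}
```

Flipping a bit in one copy and then calling mirror_read() returns the good data and silently rewrites the bad copy, which is roughly the behaviour the error injection tests were exercising. If both copies are damaged, the read fails instead of returning bad data.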

My only frustrations with ZFS are that cluster support is currently not available, at least until Lustre 3.0 is out, whenever that will be, and that it is difficult to write an application that works directly with a zpool. For instance, there is no simple method to send a zpool a generic ioctl() such as DKIOCGGEOM to obtain the size of the volume. In most cases I don’t care about the number of cylinders, heads and sectors; I only multiply them together to calculate the total block and/or byte count of the volume. So those individual values could be generic and made up, as long as their product is correct.
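As a rough illustration of that calculation, the sketch below multiplies a disk geometry out into total blocks and bytes. The struct and function names are hypothetical stand-ins (on Solaris the values would come from the structure filled in by DKIOCGGEOM), and the 512-byte block size used in the note afterwards is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for the geometry a DKIOCGGEOM call reports. */
struct disk_geometry {
    uint32_t cylinders;          /* data cylinders */
    uint32_t heads;              /* tracks per cylinder */
    uint32_t sectors_per_track;  /* sectors per track */
};

/* Total addressable blocks: cylinders * heads * sectors per track. */
uint64_t geometry_total_blocks(const struct disk_geometry *g)
{
    return (uint64_t)g->cylinders * g->heads * g->sectors_per_track;
}

/* Total bytes for a given block size in bytes. */
uint64_t geometry_total_bytes(const struct disk_geometry *g, uint32_t block_size)
{
    return geometry_total_blocks(g) * block_size;
}
```

With a made-up geometry of 1024 cylinders, 255 heads and 63 sectors per track, this gives 16,450,560 blocks, or 8,422,686,720 bytes at 512 bytes per block. Any other combination with the same product would serve the application equally well.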

In the early stages of my discovering this, I posted a simple question on the OpenSolaris Forums:

“As I was navigating through the source code for the ZFS file system I saw that in zvol.c, where the ioctls are defined, if a program sends a DKIOCGGEOM or DKIOCGVTOC, an ENOTSUP (Error Not Supported) is returned.

You can view this here: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zvol.c

My question is, what is an appropriate method to obtain the zpool’s volume size from a C coded application?”

After posting my question, I immediately went to the source code of the zpool/zfs binaries to see how zpool reports the pool’s capacity back to user space. Unfortunately it uses a somewhat cryptic method, not as straightforward as sending a simple ioctl() to the desired volume. This was a bit frustrating, as it is such an ugly approach just to obtain the size of the volume.

I was grateful to have a response, even though it confirmed my fear that I would have to take the ugly route; it also made me realize the true value of open source. What if I simply patched in a new ioctl() to return the total accessible “block” count of a zpool? It would be similar to the Linux BLKGETSIZE/BLKGETSIZE64. This would be the most realistic and proper method: add a new ioctl() and then modify all storage modules to accommodate it. For instance, in the usr/src/uts/common/sys/dkio.h file we would need to define:

#define DKIOBLKGETSZ  (DKIOC|50)

And then go back to the zvol.c file and add the extra ioctl() to handle this:

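case DKIOBLKGETSZ: {
	/* zv_volsize holds the volume size in bytes */
	uint64_t vs = zv->zv_volsize;

	/* drop the state lock before copying out, as the other cases in zvol_ioctl() do */
	mutex_exit(&zvol_state_lock);
	if (ddi_copyout(&vs, (void *)arg, sizeof (uint64_t), flag))
		error = EFAULT;
	return (error);
}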

To give a level of consistency across all storage devices, the new ioctl() would also need to be handled in the following modules:

usr/src/uts/common/io/cmlb.c
usr/src/uts/common/io/ramdisk.c
usr/src/uts/common/io/fd.c
usr/src/uts/common/io/lofi.c

Those modules do not necessarily have to support it, though; they can simply recognize it and reject it:

case DKIOBLKGETSZ:
	return (ENOTSUP);
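For comparison, this is roughly how the Linux ioctl mentioned above is called from user space, and it is the kind of calling convention a DKIOBLKGETSZ-style ioctl would give a Solaris application. It is a minimal, Linux-only sketch: BLKGETSIZE64 is the real Linux request (it reports the device size in bytes), while the function name block_device_bytes() is made up for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKGETSIZE64 */

/*
 * Stores the size of the block device at 'path' in bytes.
 * Returns 0 on success, -1 on error with errno set.
 */
int block_device_bytes(const char *path, uint64_t *bytes)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = ioctl(fd, BLKGETSIZE64, bytes);

    /* keep the errno from ioctl() rather than from close() */
    int saved = errno;
    close(fd);
    errno = saved;

    return rc == 0 ? 0 : -1;
}
```

One open() and one ioctl() are all it takes. Opening a block device usually requires elevated privileges, and the call fails on anything that is not a block device.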

Who knows, one of these days I may get around to patching this myself, and if the OpenSolaris community doesn’t accept it I can always make it available on one of my websites. I will most definitely post about it.

20. March 2010

OpenSolaris and ZFS: The beauty of snapshots.

Filed under: OpenSolaris, Storage, File Systems, Solaris — admin @ 13:57

Two days ago, I ran a long-needed image update to the OpenSolaris 2010.03 preview, going from build 129 to build 134 through the pkg update manager. So when I say it was much needed, I am not kidding. After more than 1 GB of updates had completed, a new boot environment (BE) was created using the native ZFS snapshot feature, and I shut down the PC for the night.

The next day I booted the PC into the latest boot environment to find that my gnome-terminal was giving me problems. The obvious symptom was that certain characters were not being echoed and there was misalignment in every entry and output displayed within the terminal.

petros@opensolaris:~$ ls
            .    ..    Desktop Documents    [ ... more results ... ]
                   petros@opensolaris:~$

After some research I came across OpenSolaris bug 12380: image-update loses /dev/ptmx from /etc/minor_perm. The fix (workaround) was simple: boot into the previous boot environment, mount the newest boot environment and copy the clone: entries of /etc/minor_perm from the one to the other. The steps are as follows:

[reboot into previous BE]
$ pfexec beadm mount [newest BE] /mnt
$ pfexec sh -c "grep ^clone: /etc/minor_perm >> /mnt/etc/minor_perm"
$ pfexec touch /mnt/reconfigure
$ pfexec bootadm update-archive -R /mnt
$ pfexec beadm unmount [newest BE]
[reboot newest BE]

And the problem was fixed. It was quick and easy thanks to ZFS.

19. March 2010

Revisited: ZFS, Btrfs and Oracle.

Filed under: OpenSolaris, File Systems, Solaris, Linux — admin @ 13:36

This entry is a continuation of one published in May of 2009. In fact it relates to a comment made earlier today, to which I responded only briefly. I am now taking the time to offer my viewpoint on the licensing of ZFS under the CDDL and the reasoning for it.

It wasn’t until I started working with the OpenSolaris kernel (and by working I mean modifying code and going through the build process) that I finally realized why OpenSolaris was licensed under the Common Development and Distribution License (CDDL). A lot of other people and companies have claim to code used within Solaris. That includes copyrighted code which Sun does not have the authority to publish under an open source license. This is why Sun needed to start from a weak copyleft license such as the Mozilla Public License and modify it to fit its needs. The CDDL was eventually approved by the Open Source Initiative (OSI) as a valid open source license, and Sun Microsystems was then able to release code under its terms.

Now before I continue I wish to describe 3 different open source licensing models: (1) the strong copyleft license, (2) the weak copyleft license and (3) the non-copyleft license.

The strong copyleft license is a project-based license: it requires that any code derived from the original project remain under the original license. This makes it nearly impossible to link with code under a license that is not strong copyleft. As a result, strong copyleft licenses are often referred to as viral licenses. The most popular of these is the GNU General Public License (GPL), which exists in three versions. The Linux kernel is licensed under version 2, and its success and growth can be attributed to it.

The weak copyleft license is similar to the strong copyleft license except that it is file-based instead of project based. This means that if there are any modifications to a file, the original license must apply; but that file can be combined in a project with code under a different license. This method makes the type of licensing non-viral. The CDDL and the MPL are categorized as weak copyleft licenses.

The third type is the non-copyleft license, which places no requirement on derived works to stay under the original license. In fact, there is no requirement for derived code to be released under any open source license at all. This makes it simple for someone to take an open source project and use it as the basis for a proprietary product. The best-known example is the BSD license: Apple adopted FreeBSD kernel code in its XNU kernel, and NetApp uses FreeBSD in its customized storage appliances.

Continuing where I left off, it would not have been possible to open source the Solaris kernel for the OpenSolaris project if it weren’t for the CDDL. In turn, ZFS could not have been combined with the CDDL-licensed kernel if it had been licensed under the GPL, although the CDDL has no conflict with non-copyleft licenses such as the BSD license. Because of this, and now because of Oracle’s stated support for and commitment to Solaris, I doubt this licensing will change, especially not in order to merge ZFS into the Linux kernel. That is why we should be grateful that (a) ZFS is available under an open source license, making it impossible for it to disappear, and (b) Oracle has been committed to Btrfs and to bringing an enterprise class solution into the Linux kernel.

This is why we have choices. If you want ZFS functionality, use OpenSolaris or Solaris. If you don’t necessarily need ZFS and are more comfortable with Linux, you have a lot more distributions to choose from. Or if you want ZFS and a familiar Linux environment, there is also Nexenta.

5. March 2010

AMD RAID-on-Chip: A valid technology? Or is it just too late in the game?

Filed under: RAID, Storage, File Systems, SCSI — admin @ 14:33

Back in December I came across this article on an AMD RoC (RAID-on-Chip) that will be embedded into servers to provide uninterrupted RAID functionality. A quick question came to mind as I was reading it: “Considering today’s storage capabilities and low cost equipment, who will be using this?” And honestly I was not able to come up with an answer.

In an earlier blog post I mentioned the rise in usage of software RAID. Small to medium-sized businesses (SMBs) have been moving to these low cost solutions. And why not? You are able to get more bang for your buck. For instance, by running OpenSolaris, one is able to use the redundancy of the ZFS file system (single/double parity or mirrored RAID), file system level snapshots, data deduplication, and more. On top of that, every block is checksummed so that data corruption (noisy or silent) is detected instead of being passed along. Take these ZFS pools and share them via NFS/CIFS or over ftp/http, or even map them over the iSCSI, Fibre Channel, AoE or FCoE protocols. The operating system (with all bells and whistles) is freely distributed under the CDDL license. The only costs are the hardware (a server or two and, if external storage is needed, a JBOD) and the storage administrator.

For years, servers have been equipped with LSI Logic (or other) RAID controllers that have proven to be just as efficient as anything else at handling local storage. Now when you look at larger enterprise scale companies, they are not going to want a server to manage their RAID. Instead they will keep the external storage managed externally, with special purpose RAID controllers managing hundreds of terabytes to petabytes of data storage, separate from all the nodes in a cluster accessing that equipment.

But going back to the server, how practical is it to have an RoC implemented? With today’s level of high speed computing, does it make that much of a difference if the RAID is accomplished on the chipset as opposed to in the operating system? If so, how easy is it to recover from data corruption or any other error? Unless you are setting up a small home or small business server, what if you want additional functionality such as snapshots, data deduplication and checksum validation? You still have to go to the operating system and put some sort of volume manager on top of the volumes grouped by the RoC. No offense to Dot Hill, even though they were a direct competitor to one of my previous employers (Xyratex). According to the numbers posted on Google Finance, they have been struggling financially for at least the past 5 to 6 years, and this is a great opportunity for them. In my opinion, though, this would have been a valid technology back in 2001, not in 2010.

26. November 2009

Linux Magazine Article: Three Simple Tweaks for Better SSD Performance

Filed under: Storage, Red Hat, File Systems, Ubuntu, Linux — admin @ 13:23

Earlier today I came across this interesting article on tuning your SSD to achieve greater performance. It is worth noting that the article is written for Linux, but its advice to mount your file systems with the noatime option applies equally to file systems on other platforms that support such an option.

I would also take the time to read the comments. There are some distribution specific responses to the author’s notes.
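If you want to check from a program whether a file system is already mounted with noatime, a minimal sketch along the following lines should work on Linux. The statvfs() call and the ST_NOATIME flag are real (glibc exposes the flag only with _GNU_SOURCE defined); the function name mounted_noatime() and the fallback value for ST_NOATIME are my own assumptions for this example.

```c
#define _GNU_SOURCE          /* glibc exposes ST_NOATIME only with this */
#include <sys/statvfs.h>

#ifndef ST_NOATIME
#define ST_NOATIME 1024      /* assumption: value used by glibc on Linux */
#endif

/*
 * Returns 1 if the file system holding 'path' is mounted noatime,
 * 0 if it is not, and -1 if statvfs() fails.
 */
int mounted_noatime(const char *path)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0)
        return -1;
    return (vfs.f_flag & ST_NOATIME) ? 1 : 0;
}
```

Calling mounted_noatime("/") before and after remounting with noatime is a quick way to confirm that the tweak actually took effect.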

3. November 2009

Recently integrated into ZFS: Data Deduplication

Filed under: OpenSolaris, Storage, File Systems, Solaris, UNIX — admin @ 09:26

I just stumbled onto this blog entry on the implementation of data deduplication in Sun Microsystems’ ZFS file system. It is implemented in such a nice and clean way that I am looking forward to testing it. For instance, just like any other ZFS property, dedup can be enabled or disabled on any dataset below the root of the pool. Examples taken from Jeff Bonwick’s blog post cited above:

zfs set dedup=on tank
zfs set dedup=off tank/home
zfs set dedup=on tank/vm
zfs set dedup=on tank/src

It is that simple (see zfs(1M)).

27. October 2009

Apple discontinues port of Sun’s ZFS file system.

Filed under: OpenSolaris, BSD, File Systems, Solaris, UNIX — admin @ 14:22

On 23 October 2009 it was announced on MacOSForge that Apple had decided to discontinue any and all development on its port of the ZFS file system. I know that I am not the only one to say this, but I am not surprised. Supposedly there were legal reasons behind this action, but in the end, who cares? Apple is the one losing out by continuing with an outdated and still limiting file system.

Now Apple has recently been hiring file system developers to develop a next generation file system to replace the traditional HFS+, but (as Robin Harris has previously stated) how long will it take before it becomes stable and accepted by the general public? Traditionally it takes 5+ years before a file system is considered somewhat stable and ready for production use. It wasn’t until recently that ZFS started to make its impact on the enterprise scene. My question, though, is: to whom will this next generation file system cater? I assume that it will be for the general end user of Mac devices that “don’t require the weight of the ZFS features and functionality”, or so it has been said regarding Apple abandoning the ZFS project. If that is the case and is the primary focus of the new file system, how will this impact their server market share? We already know that there is no such thing as a perfect file system that will perform ideally in every arena it is thrown into. Some will excel more than others, and that depends entirely on implementation and workload.

In past posts, I have always stressed the importance of the file system and of what is integrated within the file system. I routinely point out the numerous drawbacks and limitations of the NTFS driver. Sure, Microsoft compensates for the “lack of features” with applications, services and additional APIs to fill in all those gaps. A good example is VSS (shadow copy). This can impact performance, as it takes file system concepts out of kernel mode and into user land, consuming user mode resources. All these features should be incorporated into the file system driver. That way we can ensure stability and consistency across all the functions the file system performs. Even the general on-disk layout is not ideal for traditional computing over large storage media, as the long seeks between the MFT and the fragmented file data can put a lot of stress on a magnetic device. Going back to HFS+, and sort of on the same topic (although the concept is a bit different), the same could be said about Apple’s Time Machine, which runs as an application on top of the driver.

One thing that I hold dear when it comes to file systems is the ability and flexibility to tune them without taking the mounted device(s) off-line. Most modern UNIX and Linux file systems offer a lot of tunable features (built into the driver!). For instance, through the ZFS character device node I can dynamically alter file system properties (see zfs(1M)). For this example I will focus on access times. Let us say I am using an SSD and decide that it would be more cell friendly and better performing to disable file access time updates on one of my datasets.

atime=on | off
    Controls whether the access time for files is updated
    when they are read.

To view current settings and disable this feature you would type the following in the command-line terminal:

petros@opensolaris:~$ pfexec zfs get atime rpool/export/home
NAME               PROPERTY  VALUE  SOURCE
rpool/export/home  atime     on     default
petros@opensolaris:~$ pfexec zfs set atime=off rpool/export/home
petros@opensolaris:~$ pfexec zfs get atime rpool/export/home
NAME               PROPERTY  VALUE  SOURCE
rpool/export/home  atime     off    local

I just hope that Apple is prepared for the journey they are about to embark on. They obviously have file system development experience, and I have no doubts that they have the talent. Do they have the patience and time to invest?

8. October 2009

FlexTk article: NAS Performance Comparison

Filed under: Red Hat, Storage, OpenSolaris, File Systems, Ubuntu, Microsoft, Linux, UNIX — admin @ 14:11

Linked from linuxtoday.com, I found an interesting article posted on FlexTk regarding NAS Performance Comparisons between Linux, Windows and OpenSolaris. The results are very interesting. Under each category, comparisons are drawn between:

  • Red Hat Enterprise Linux 5.3 (64-bit)
  • Ubuntu Server 9.04 (64-bit)
  • OpenSolaris 2009.06 (64-bit)
  • Windows Server 2003 (64-bit)
  • Windows Server 2008 (64-bit)
  • Windows Storage Server 2008 (64-bit)

I assume that each operating system is using its default file system with the default settings for that specific release. Red Hat and Ubuntu should be using ext3, Windows obviously uses NTFS, and OpenSolaris is built on top of ZFS. The CIFS/NFS exported share(s) in turn are running on top of these default file systems. Either way, in average overall performance OpenSolaris seemed to really shine. It also did well in some of the other categories, which makes sense given the design of the ZFS file system.

2. October 2009

LWN article: Log-structured file systems: There’s one in every SSD

Filed under: Storage, File Systems, Linux, UNIX — admin @ 08:39

Yesterday I came across this excellent article on log-structured file systems and their implementation on SSD technologies. It is worth the read.

Opinion: On pramfs and RAM based Linux file systems

Filed under: Storage, File Systems, Linux — admin @ 08:36

A few days ago I received the latest issue of Linux Journal Magazine. I must admit that one of the sections I look forward to reading is diff -u. This section summarizes the latest updates and discussions of the Linux kernel development community. It becomes much easier to read a summary as opposed to signing up for the mailing list because you will just get bombarded with e-mails which can be overwhelming the majority of the time.

While reading I came across a Montavista developed project called pramfs. In summary pramfs is a non-volatile RAM based file system, similar to your ramfs and tmpfs with a few differences to distinguish it from the others and in turn adapted for an embedded environment. Two obvious differences are that it is persistent like a traditional disk-based file system and does not reside in volatile DRAM. Pramfs is not new. It was originally announced back in 2004. It is designed to be a simplified file system that does not carry the same weight of the journal-based file systems.

Apparently there had been some problems getting the patch merged into the Linux kernel, for a number of reasons: (1) Montavista was attempting to patent some of the concepts and algorithms used in the file system (in 2004); (2) even after they dropped the idea of patenting their code, there was some discussion (in 2009) on the redundancy of having yet another file system implemented in the Linux kernel; and (3) it is not a full featured file system, in that it does not support symbolic links. The point of the second objection is that the Linux kernel already has two commonly used RAM file systems and a large number of other file systems. So why was there a need to write another one? Why couldn’t Montavista patch already existing code?

I agree with this logic. Please do not misunderstand me. Montavista is a very respectable company that has done an excellent job in supporting embedded Linux. I am also glad to see them contribute to the kernel and in turn the community. But truth be told, tmpfs was built on top of the ramfs code. Why couldn’t pramfs follow the same course of development? The GPL makes it easy to not have to re-invent the wheel.

The two most noteworthy goals achieved by pramfs are (1) to work with NVRAM and (2) to provide an interface that does not utilize the kernel page caching mechanism. By utilizing the direct I/O support (the O_DIRECT flag) available in the 2.6 kernel, Montavista claims that I/O performance is increased significantly on an already high performing interface. Pramfs also allows the user to specify regions of memory for file system usage.

mount -t pramfs -o physaddr=0x1e000000,init=0x2000000 none /mnt/pramfs

With it working in non-volatile memory, the data contents will remain intact even after an expected/unexpected power cycle.
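For readers who have not used direct I/O, the sketch below shows what bypassing the page cache looks like from user space on Linux. It is a generic O_DIRECT example and not pramfs code; the function name write_block_direct() and the 4096-byte alignment are assumptions, since the real alignment requirement depends on the file system and device.

```c
#define _GNU_SOURCE     /* glibc exposes O_DIRECT only with this */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0      /* platform without O_DIRECT: falls back to buffered I/O */
#endif

#define DIO_ALIGN 4096  /* assumed buffer/length alignment for direct I/O */

/*
 * Writes one aligned block filled with 'fill' to 'path', bypassing the
 * page cache. Returns the number of bytes written, or -1 on failure.
 */
ssize_t write_block_direct(const char *path, unsigned char fill)
{
    void *buf = NULL;

    /* direct I/O needs the user buffer aligned in memory */
    if (posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN) != 0)
        return -1;
    memset(buf, fill, DIO_ALIGN);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    ssize_t n = write(fd, buf, DIO_ALIGN);
    close(fd);
    free(buf);
    return n;
}
```

Not every file system accepts O_DIRECT, in which case open() or write() fails and the function returns -1. On a file system that performs all of its I/O directly, the application would not need to ask for it at all.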

This concept got me thinking a bit. How difficult would it be to add some of these features to ramfs? Ramfs already offers somewhat similar functionality, in that file data is never written out to a backing store, although it achieves this by keeping the data in the kernel’s page cache rather than bypassing it. Tmpfs was designed to offer the same functionality along with additional file system controls and limits. Ramfs also has a slightly similar general file system layout. Sure, a few structures and routines would need to be redefined, but that isn’t a big deal in the grand scheme of things.

I mention this in light of some of the latest headlines circulating through the internet regarding Linus Torvalds’ comments on the kernel being bloated. Does the kernel leave room for additional “bloat”, or would it be wiser to build on top of current features/functionality? I would love to read some of your opinions.

For more blog posts relating to RAM-based file systems and RAM Disk device drivers, you can find them posted here, here and here.
