dd if=/dev/random of=/dev/blog

31. December 2008

Why Microsoft is just not ready for the enterprise.

Filed under: Storage, File Systems, Linux, Microsoft, UNIX — admin @ 10:31

In my last post I made some comments about Microsoft Windows not being capable of enterprise high performance computing. In the comments (upon request) I posted some details on the SCSI subsystem of the operating system, discussing the scatter-gather lists used when sequential SCSI commands are coalesced just prior to being sent to the SCSI-based media. I want to continue on that topic and focus specifically on the NTFS file system and why it too is not intended for enterprise class usage.

First I wish to start with the NTFS layout on the volume. Note that I will be using the term volume to signify either a pool of disk devices (traditionally pooled together via an external controller and mapped as a Logical Unit) or a single disk device. In both cases, they are presented to the host as a single volume. I may be a little old-fashioned in following in the footsteps of our UNIX forefathers, but I have always felt that if you want better disk access performance, you keep your data and meta-data close to each other. Most traditional POSIX-compliant file systems used today on servers and clients divide a volume into Allocation Groups (or Block Groups), and within each AG the meta-data blocks are followed by the actual data blocks for the files.

Microsoft, on the other hand, continued down the path of its original FAT file systems and placed all meta-data at the VERY beginning of the volume, while the (often fragmented) file data is scattered throughout the rest of the volume. Placing the meta-data region close to the data regions decreases seek latency. For example, let us say you are writing to a 500 GB or even a 1 TB volume (especially now that a single SATA disk device goes even higher in capacity). Ignoring journaling for a moment, an NTFS file system requires constant seeks back to the MFT (Master File Table) located at the very beginning of the file system, no matter how far into the volume the data gets written. This layout is obviously not intended for high performance environments. When we look at file systems such as Ext2/3-fs, XFS and so on, we see that the volume (as mentioned above) is divided into multiple equal-sized groups, where each group contains all meta-data associated with its file data.
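To put rough numbers on the layout argument, here is a minimal Python sketch of a toy model. It assumes a volume is just a run of numbered blocks, that seek cost grows with the block distance between a file's data and its meta-data, and that a grouped layout keeps meta-data at the start of each group. The function names, the 32768-block group size and the sampling scheme are illustrative assumptions, not anything taken from NTFS, Ext3-fs or XFS; caching and real drive geometry are ignored.

```python
def distance_front_loaded(block):
    """Blocks between a data block and meta-data kept at block 0 (toy model)."""
    return block


def distance_grouped(block, group_blocks):
    """Blocks between a data block and the meta-data at the start of its group."""
    return block % group_blocks


def mean_distances(volume_blocks, group_blocks, samples=1000):
    """Average both distances over data blocks spread evenly across the volume."""
    step = max(1, volume_blocks // samples)
    offsets = range(0, volume_blocks, step)
    front = sum(distance_front_loaded(b) for b in offsets) / len(offsets)
    grouped = sum(distance_grouped(b, group_blocks) for b in offsets) / len(offsets)
    return front, grouped


if __name__ == "__main__":
    # Hypothetical volume of one million blocks with 32768-block groups.
    front, grouped = mean_distances(1_000_000, 32_768)
    print(f"front-loaded meta-data: {front:.0f} blocks on average")
    print(f"grouped meta-data:      {grouped:.0f} blocks on average")
```

In this model the front-loaded distance grows with the size of the volume, while the grouped distance is bounded by the group size no matter how large the volume gets.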

Moving on to the journal, what makes NTFS truly stand out from its FAT predecessors is the ability to journal changes made to the file system for quick recovery in the event of a failure. With that journal comes a performance loss. Most traditional UNIX/Linux file systems utilize a journal, but lately there has been a movement to adopt log-structured concepts. The function of the journal is to record every change to the file system before that change is committed to its final location on disk. If the system fails while operations are incomplete, the journal is played back to resolve any problems that may have been caused by the failure. The journal exists for speedy recoveries. There are two popular journaling methods: (1) meta-data only and (2) meta-data + file data. Ext3-fs supports both. Its full data journaling mode (data=journal) offers the most redundant solution in times of failure, but at an even greater performance cost, because all meta- and file data is written to the volume twice: once to the journal and again to the file system. Its default mode (data=ordered) journals meta-data only, with full data journaling available as a tunable. Unfortunately this is not well known. NTFS, as is also seen in XFS, ReiserFS, etc., does meta-data only journaling.

As I mentioned above, even meta-data only journaling takes a performance hit, and that is why a lot of the more recently developed file systems (e.g. ZFS, btrfs, Reiser4) have adopted a different method, where all new data gets written to a new location on the volume (ideally in sequence after the last written location) and, upon success, the meta-data is updated to point to the new region(s) for the file. In this scenario nothing gets written twice, and that is one of many reasons why ZFS and (hopefully soon) a stable btrfs are classified as enterprise class file systems. This log structure also allows for easy implementation of snapshot features. NTFS offers snapshots through the Volume Shadow Copy Service, also called Volume Snapshot Service (VSS), and unfortunately I do not know enough about it to comment. All that I do know is that VSS is not internal to the file system and must run on top of it as an additional service.
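The write-twice argument can be sketched as simple arithmetic. The Python below is a deliberately simplified model, assuming each journaled byte is written once to the journal and once to its final location, and that a copy-on-write update writes data and meta-data once each. The mode names and the `bytes_written` function are made up for illustration; real file systems add commit records, checksums and tree updates that this ignores.

```python
def bytes_written(mode, data_bytes, metadata_bytes):
    """Total bytes hitting the volume for one update, under a simplified model."""
    if mode == "metadata_journal":
        # Meta-data goes to the journal and then in place; data is written once.
        return data_bytes + 2 * metadata_bytes
    if mode == "full_journal":
        # Both data and meta-data go to the journal and then in place.
        return 2 * (data_bytes + metadata_bytes)
    if mode == "copy_on_write":
        # New data is written to a fresh location, then meta-data points to it.
        return data_bytes + metadata_bytes
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Hypothetical update: 1 MiB of file data and 4 KiB of meta-data.
    for mode in ("metadata_journal", "full_journal", "copy_on_write"):
        print(mode, bytes_written(mode, 1024 * 1024, 4096))
```

Under these assumptions full data journaling roughly doubles the write volume, meta-data only journaling adds a small overhead, and the copy-on-write path writes nothing twice.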

Another major drawback is that some key NTFS file system maintenance MUST be done offline! Volume resizing (increase/decrease) can be a frequent task at the enterprise level. Taking the volume (and node) offline to accomplish this task is not good at all! It costs time, which in turn costs money.

The allocation size (also known as block size and, in the Microsoft world, as cluster size) is the smallest unit of the volume to which the file system will write data. To clarify, let us say you are using a block size of 4 KB and you write 2 MB of data: you will be using 512 blocks of 4 KB each (2097152 / 4096) to store that data. On the other hand, if you write 579 bytes of data, that will use one 4 KB block and (unless the file system supports tail packing, which NTFS does not) the rest of that block will be wasted until the file grows, or is deleted and the region is eventually reused for something larger. As of NT 3.51, NTFS defaults to a cluster size of 512 bytes, 1 KB, 2 KB or 4 KB depending on the size of the volume; anything larger (up to 64 KB) has to be chosen explicitly at format time. For high performance computing, and especially when working with larger files, it is sometimes necessary to go larger. XFS starts at 512 bytes and can go as high as 64 KB. ZFS can go up to 128 KB. These larger sizes do increase performance when working with large files. They also let the volume address a higher capacity range, as less meta-data is needed.
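The block arithmetic above is easy to check with a short Python sketch. The function names are made up for illustration, and the model assumes no tail packing and no storing of small files inside meta-data, so every file occupies whole blocks.

```python
def blocks_needed(file_bytes, block_bytes):
    """Whole blocks required to hold a file (ceiling division)."""
    return -(-file_bytes // block_bytes)


def slack_bytes(file_bytes, block_bytes):
    """Bytes allocated to the file but not used by its contents."""
    return blocks_needed(file_bytes, block_bytes) * block_bytes - file_bytes


if __name__ == "__main__":
    # The two cases from the text, with a 4 KB block size.
    print(blocks_needed(2 * 1024 * 1024, 4096), slack_bytes(2 * 1024 * 1024, 4096))
    print(blocks_needed(579, 4096), slack_bytes(579, 4096))
```

Running the same functions with a 64 KB block size shows the trade-off: large files need far fewer blocks to track, while every small file wastes most of a block.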

If I really had the time, this list could go on, but I want to shed some light on one last point, and that is volume mount points. In a Microsoft environment you are limited to the number of letters in the English alphabet. This is not the case with POSIX-like platforms. Again, in an enterprise environment there may be a need to handle many volumes, and things can get a bit problematic if the Microsoft server is attempting to serve more than 26. In fact, if you count that the A, B, and C drives are by default taken by the floppy and root operating system devices, you have only 23 letters left for CD/DVD, tape and other disk media devices.

Don’t get me wrong: just like any other file system, NTFS can be tuned for better results. You can get better performance out of any file system if you search out the proper documentation for it, and that documentation is all over the internet.

With these limitations well known, why do we still try to deploy Microsoft Windows in environments it is not suited for? The answer is familiarity. Microsoft for the most part owns the client/end-user market, and with that the end-user has gotten too familiar and too comfortable with its platform. In turn, what was built for home (and to an extent small business) use has leaked into an environment it is not ready for. Please understand that I am not trying to preach against Microsoft or attack them. Like many others in the high performance server/storage industry, I have come to understand where certain problems originate, and that includes the limitations of the Windows platform. If you, the reader, feel differently about Microsoft and its role in enterprise class computing, please feel free to comment. I know that I may not always be correct in my viewpoints, and if you can shed any additional light I would be very grateful.

8 Comments »

  1. NTFS doesn’t support tail-packing but it does support packing the file data in with the metadata - if the file is small enough. So call that partial support.

    The built-in tools in Windows Vista/Server 2008 allow live volume resizing.

    Windows isn’t limited to 23 volumes - mount points have been supported since Windows 2000 (where you mount a drive as a folder). If you have more than 23 volumes then you are unlikely to find this complicated (it’s available in the GUI management tool).

    Also, if you want to list the flaws of NTFS, why don’t you list the good bits too? Like the fact that it supports a wide range of features, including transparent encryption and compression, snapshots, sparse files and quotas. Or the fact that it is extremely mature and well trusted technology (unlike BTRFS and even ZFS) which is especially important in the enterprise.

    Comment by Paul — 29. July 2009 @ 06:23

  2. Oh yeah, it also supports atomic transactions. Are there any other mainstream file systems that support this feature?

    Comment by Paul — 29. July 2009 @ 06:27

  3. Paul,

    Thank you very much for the feedback. It is very much appreciated. Note that I have never touched Windows Vista and have only briefly played with Server 2008. I was unaware that online volume resizing was implemented in these releases. Again, thank you for your comments.

    With regards to your question:

    >>Oh yeah, it also supports atomic transactions. Are there any other mainstream
    >>file systems that support this feature?

    The first ones that come up to mind are ZFS, XFS, NetApp’s WAFL, Reiser4 and I am sure there are more.

    Comment by admin — 29. July 2009 @ 06:44

  4. >>Oh yeah, it also supports atomic transactions. Are there any other mainstream
    >>file systems that support this feature?

    >The first ones that come up to mind are ZFS, XFS, NetApp’s WAFL, Reiser4 and I am sure there are more.

    I don’t mean single-file transaction support (as in journaling), I mean where you can encapsulate changes to multiple files and then commit all the changes or roll them back as a single unit (i.e. if any single file operation fails, everything is rolled back). See http://en.wikipedia.org/wiki/Transactional_NTFS for more info.

    Is that what you meant? If so, have you got a link?

    Comment by Paul — 29. July 2009 @ 07:17

  5. Paul,

    Here is the Reiser4 design documentation. It supports atomic transaction as in “all or nothing.” But note that this will always be the case with a file system that utilizes a Copy-On-Write mechanism for write operations. It will always have something to fall back to if a failure occurs.

    Comment by admin — 29. July 2009 @ 07:32

  6. Based on the design doc you provided (thanks!) it does seem like it has low-level support for rolling back changes to multiple files at once, but it seems to be focused on filesystem consistency in the event of crashes - a subtly different use case. Applications must be specially written to take advantage of transactional NTFS. As an example of what it is used for, Windows Installer uses it to ensure that an application is fully installed, or not at all (including registry operations as they are also transacted). Here is the API to open/create a file in transacted mode: http://msdn.microsoft.com/en-us/library/aa363859(VS.85,loband).aspx

    So yeah, while copy-on-write systems make roll-backs easy, that’s only a piece of the puzzle. Does Reiser4/Linux support all the necessary pieces?

    Comment by Paul — 29. July 2009 @ 07:54

  7. I understand now. This is more of an approach in which userland applications invoke the API to ensure that all operations were successful. Unfortunately, I cannot give you an answer without a more thorough investigation.

    Comment by admin — 29. July 2009 @ 08:07

  8. That’s it exactly. No worries if you don’t know; I was just curious.

    Comment by Paul — 29. July 2009 @ 08:18
