IIUG Conference 2017

http://iiug2017.org/wp-content/uploads/2016/12/logo5_s.png

Wednesday, July 28, 2010

New Journaled Filesystem Rant

There have been questions from multiple posters on the Informix Forums lately asking about Journaled File Systems (JFSes) like EXT3, EXT4, and ZFS among others.  Bottom line?  JFSes should NEVER be used for storing data for a database system.  ANY database system, whether it is Berkley DB, Oracle, Sybase, DB2, MySQL, PostGreSQL, MS SQL Server, Informix, whatever.  "But", you protest, "the journaling makes the filesystem safer.  It speeds recovery.  It is 'good thing'!"  No.  Not for databases.  Flat out - no!

First, your database is already performing its own logging (read journaling for the DB neophite).  That is sufficient to permit proper and secure recovery.  It is also fast - if it weren't the database product would have gone the way of DBase2 and DBase3 long ago.  The filesystem's journal is redundant at best and at worst will actually slow recovery (versus using RAW or COOKED - non-filesystem - space for storage) by requiring two sets of recovery operations to happen sequentially.  Note that all properly designed database systems use O_SYNC or O_DIRECT mode write operations to ensure that their data is safely on disk.  However, it has come to my attention that many journaling filesystems do not obey these directives when it comes to metadata changes.  On these filesystems metadata is ALWAYS cached.  Therefore there is neither a safety nor recovery speed gain from using JSFes for database storage.

Most JFSes use metadata only journaling.  Here is some insight into that process, and why JFSes should not be used for database storage:
  • This (logical metadata only journaling) is the method used by EXT3, EXT4, JFS2, and ZFS
  • All of these except AIX's JSFS2  use block relocation instead of physical block journaling (AIX's JFS2 - and the Open Source JFS filesystem derived from it - does not journal or relocate data blocks so it is safe).  This means that on write a block is always written to a new location rather than overwriting the existing block on disk.  A properly designed JFS will commit the new version of the disk block before updating the metadata or the logical journal (that's the problem with EXT4 - and EXT3 with write-back enabled - they write the metadata first, then the journal entry before actually committing the physical change to disk).  Once the write and journal are completed the FS metadata is updated and the write is acknowledged.  This means that, in a proper JFS, on a crash there are three possibilities:
    • The new block version was partially or completely written but the journal entry was not written.
    • The new block version and journal entry were written and committed.
    • The new block version, journal, and metadata were written and committed.
In the first case, after recovery, the file remains unchanged, however the changes are lost.  In the second case, after recovery, the FS makes the missed metadata entries and the file is modified during recovery and the original block version is freed for reuse.  In the third case all was well before the crash and the original version of the block was released for reuse. 
The problem with EXT4 (and EXT3 with write-back enabled) is that the application (meaning in this case Informix or other database system) thinks everything is hunky dory since the FS acknowledged the change as committed.  However, immediately after the acknowledgment the physically modified block is still ONLY in cache and only the metadata and journal entry have been saved to disk.  At this point if there is a crash, the file is actually unrecoverable!  The metadata and the journal entry say the block has been moved to a new location and rewritten, but the new location has garbage in it from some previous block.  This one made Linus Torvalds absolutely livid and he tore the EXT4 designers a new one over the design.  You can GOOGLE his rants on the subject yourself.  Last I heard you could not disable the write-back behavior of EXT4 - Linus was pushing to have that fixed, but I don't know if it ever was.  I use EXT3 default mode for filesystems and EXT2 (the original non-journaled Linux FS) for database storage that I care about.

JFS2 and the Open Source JFS filesystem have no serious problems.  EXT3 in default mode and ZFS at least are safe, but the problem with them is just the fact of the block relocations.  There is the performance problem of rewriting a whole block every time the database changes a single page within the block and so negating much of the gains of caching and there is the bigger problem that the file is no longer even as contiguous as a non-journaled filesystem would have it be.  Standard UNIX filesystems (EXT2 and UFS as examples) allocate blocks of contiguous space and try to leave free space that is contiguous with those allocated blocks unused when allocating space for other files so that as a file grows it remains mostly contiguous in multi-block chunks.  This fragments the free space in an FS making it difficult to write very large files (like Informix chunks) that are contiguous, but if you keep the chunks on an FS that's dedicated to Informix chunks that has not been a real problem up until recently since Informix did not extend existing chunks over time prior to the recent release of Informix v 11.70.  Informix 11.70 can, optionally, extend the size of an existing chunk.  JFS's break that rule keeping the level of contiguous bits of a file the same as the block level.  Even if a chunk were allocated as contiguous initially, over time the JFS will cause the file to become internally fragmented.  A two logically contiguous blocks that were originally also physically contiguous can become spread out within the file's allocated space over time when they are rewritten.  If you make the FS block size smaller to alleviate the costs of multiple block rewrites, you make the file fragmentation worse.

These problems don't affect filesystems and normal files as much as databases because the nature of the IO to files is different than IO to databases.  When you write to a flat file, you write mostly sequentially, you rarely rewrite a portion of the file (unless you rewrite the entire file) and you never sync the file to disk before you close the file.  That means that the cache will coalesce all writes until an entire block has been written out before the FS and OS cause a flush and sync of the cache to disk.  That means that the FS has the ability to try to keep the rewritten blocks contiguous by allocating the replacement blocks contiguously.  Essentially the file is relocated whole if it is rewritten. 

Databases don't work that way.  Informix, for example, writes every block to a COOKED device or filesystem chunk either under O_SYNC or O_DIRECT control both of which force the single write operation (and Informix only ever writes a single page or eight contiguous pages at a time) to be physically written and committed before the write() call returns.  That means that the coalescing features of the FS and OS cache management are bypassed in favor of data safety.  So, if the engine performs what it thinks is a sequential scan, it is actually performing a random read of the file swinging the read/write heads back and forth across the disk.  If the physical structure is shared with other applications and even other machines (can you say massive SAN?) that will also be competing with those other storage clients for head positioning.  In normal sequential scanning (ie RAW or COOKED device or non-JFS files) the disk, controller, filesystem, and database read ahead processing reduces the performance impact of this head contention somewhat.  In a JFS that uses block relocation read ahead cannot help at all.

All of this having been said, I guess I have to change my mantra:

NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!! 

Oh!  Also, PLEASE:  NO RAID6!!!!!!!!!!!  Yuck.