Wednesday, July 28, 2010

New Journaled Filesystem Rant

There have been questions from multiple posters on the Informix Forums lately asking about Journaled File Systems (JFSes) like EXT3, EXT4, and ZFS, among others.  Bottom line?  JFSes should NEVER be used for storing data for a database system.  ANY database system, whether it is Berkeley DB, Oracle, Sybase, DB2, MySQL, PostgreSQL, MS SQL Server, Informix, whatever.  "But", you protest, "the journaling makes the filesystem safer.  It speeds recovery.  It is a 'good thing'!"  No.  Not for databases.  Flat out - no!

First, your database is already performing its own logging (read: journaling, for the DB neophyte).  That is sufficient to permit proper and secure recovery.  It is also fast - if it weren't, the database product would have gone the way of dBase II and dBase III long ago.  The filesystem's journal is redundant at best, and at worst will actually slow recovery (versus using RAW or COOKED - non-filesystem - space for storage) by requiring two sets of recovery operations to happen sequentially.  Note that all properly designed database systems use O_SYNC or O_DIRECT mode write operations to ensure that their data is safely on disk.  However, it has come to my attention that many journaling filesystems do not obey these directives when it comes to metadata changes.  On these filesystems metadata is ALWAYS cached.  Therefore there is neither a safety nor a recovery-speed gain from using JFSes for database storage.
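To make the O_SYNC point concrete, here is a minimal Python sketch (Linux-only; the path is made up for illustration) of how an engine opens its storage so that each write is durable before the call returns - which is exactly why the filesystem's own journal buys the database nothing:

```python
import os

# Sketch: open a "chunk" the way a database engine does, with O_SYNC so
# that each write() is committed to stable storage before it returns.
# The path below is a hypothetical demo file, not a real Informix chunk.
path = "/tmp/demo_chunk"

fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_SYNC, 0o660)
page = b"\x00" * 4096          # one 4 KB "page"
written = os.write(fd, page)   # durable by the time this returns
os.close(fd)
print(written)                 # 4096
```

(os.O_DIRECT could be used instead on Linux to bypass the OS cache entirely, but it imposes alignment requirements that would clutter a short sketch.)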

Most JFSes use metadata only journaling.  Here is some insight into that process, and why JFSes should not be used for database storage:
  • This (logical metadata only journaling) is the method used by EXT3, EXT4, JFS2, and ZFS
  • All of these except AIX's JFS2 use block relocation instead of physical block journaling (JFS2 - and the Open Source JFS filesystem derived from it - does not journal or relocate data blocks, so it is safe).  Block relocation means that on write a block is always written to a new location rather than overwriting the existing block on disk.  A properly designed JFS will commit the new version of the disk block before updating the metadata or the logical journal (that's the problem with EXT4 - and EXT3 with write-back enabled - they write the metadata first, then the journal entry, before actually committing the physical change to disk).  Once the write and journal are completed, the FS metadata is updated and the write is acknowledged.  This means that, in a proper JFS, after a crash there are three possibilities:
    • The new block version was partially or completely written but the journal entry was not written.
    • The new block version and journal entry were written and committed.
    • The new block version, journal, and metadata were written and committed.
In the first case, after recovery, the file remains unchanged, but the changes are lost.  In the second case, the FS makes the missed metadata entries during recovery: the file is modified and the original block version is freed for reuse.  In the third case, all was well before the crash and the original version of the block had already been released for reuse.
The problem with EXT4 (and EXT3 with write-back enabled) is that the application (meaning, in this case, Informix or another database system) thinks everything is hunky dory since the FS acknowledged the change as committed.  However, immediately after the acknowledgment the physically modified block is still ONLY in cache; only the metadata and journal entry have been saved to disk.  If there is a crash at this point, the file is actually unrecoverable!  The metadata and the journal entry say the block has been moved to a new location and rewritten, but the new location has garbage in it from some previous block.  This one made Linus Torvalds absolutely livid, and he tore the EXT4 designers a new one over the design.  You can Google his rants on the subject yourself.  Last I heard you could not disable the write-back behavior of EXT4 - Linus was pushing to have that fixed, but I don't know if it ever was.  I use EXT3 in its default mode for filesystems, and EXT2 (the original non-journaled Linux FS) for database storage that I care about.

JFS2 and the Open Source JFS filesystem have no serious problems.  EXT3 in default mode and ZFS are at least safe, but the problem with them is the block relocation itself.  There is the performance problem of rewriting a whole block every time the database changes a single page within the block, negating much of the gain from caching, and there is the bigger problem that the file is no longer even as contiguous as a non-journaled filesystem would have it be.  Standard UNIX filesystems (EXT2 and UFS, for example) allocate blocks of contiguous space and try to leave the free space contiguous with those allocated blocks unused when allocating space for other files, so that as a file grows it remains mostly contiguous in multi-block chunks.  This fragments the free space in an FS, making it difficult to write very large files (like Informix chunks) that are contiguous, but if you keep the chunks on an FS that is dedicated to Informix chunks that has not been a real problem, since prior to the recent release of Informix v11.70 Informix did not extend existing chunks over time.  Informix 11.70 can, optionally, extend the size of an existing chunk.

JFSes break that rule, keeping the granularity of the contiguous bits of a file down at the block level.  Even if a chunk were allocated as contiguous initially, over time the JFS will cause the file to become internally fragmented.  Two logically contiguous blocks that were originally also physically contiguous can become spread out within the file's allocated space over time as they are rewritten.  If you make the FS block size smaller to alleviate the cost of multiple block rewrites, you make the file fragmentation worse.

These problems don't affect filesystems and normal files as much as databases because the nature of IO to files is different from IO to databases.  When you write to a flat file, you write mostly sequentially, you rarely rewrite a portion of the file (unless you rewrite the entire file), and you never sync the file to disk before you close it.  That means that the cache will coalesce all writes until an entire block has been written before the FS and OS flush and sync the cache to disk.  So the FS has the ability to keep the rewritten blocks contiguous by allocating the replacement blocks contiguously.  Essentially, the file is relocated whole if it is rewritten.

Databases don't work that way.  Informix, for example, writes every block to a COOKED device or filesystem chunk under either O_SYNC or O_DIRECT control, both of which force the single write operation (and Informix only ever writes a single page or eight contiguous pages at a time) to be physically written and committed before the write() call returns.  That means that the coalescing features of the FS and OS cache management are bypassed in favor of data safety.  So, if the engine performs what it thinks is a sequential scan of a relocated file, it is actually performing random reads, swinging the read/write heads back and forth across the disk.  If the physical structure is shared with other applications, and even other machines (can you say massive SAN?), it will also be competing with those other storage clients for head positioning.  In normal sequential scanning (i.e. RAW or COOKED devices or non-JFS files) the disk, controller, filesystem, and database read-ahead processing reduce the performance impact of this head contention somewhat.  In a JFS that uses block relocation, read-ahead cannot help at all.
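The "sequential scan turns into random IO" effect can be sketched in a few lines of Python.  The logical-to-physical map below is invented purely to stand in for what a relocating JFS does to a chunk over time; the path is hypothetical:

```python
import os
import random

# Hypothetical illustration: a "chunk" of 256 4KB pages whose physical
# placement has been scrambled, as block relocation would do over time.
# A logically sequential scan then issues physically scattered reads.
PAGE = 4096
NPAGES = 256
path = "/tmp/demo_relocated_chunk"

with open(path, "wb") as f:
    f.write(os.urandom(PAGE * NPAGES))

# Invented logical->physical map standing in for JFS block relocation.
physical = list(range(NPAGES))
random.shuffle(physical)

fd = os.open(path, os.O_RDONLY)
jumps = 0
prev = None
for logical in range(NPAGES):          # the engine's "sequential" scan...
    offset = physical[logical] * PAGE  # ...lands at a scattered offset
    os.pread(fd, PAGE, offset)
    if prev is not None and physical[logical] != prev + 1:
        jumps += 1                     # a seek a real disk would pay for
    prev = physical[logical]
os.close(fd)
print(f"{jumps} non-contiguous seeks out of {NPAGES - 1} reads")
```

On a freshly allocated, contiguous chunk the jump count would be zero and read-ahead would do its job; after relocation nearly every read is a seek.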

All of this having been said, I guess I have to change my mantra:

NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!!   NO JFS, NO RAID5!!! 

Oh!  Also, PLEASE:  NO RAID6!!!!!!!!!!!  Yuck.

24 comments:

  1. Is the same true for the filesystems listed in PSORT_DBTEMP or are those more traditional filesystem type files and would be OK on a journaled filesystem?

    ReplyDelete
  2. I agree that it's stupid to have journaling in both the database and the file system.

    But regarding ext2: don't you risk having very long periods of fsck after an accidental, brutal shutdown? (Not that I have seen that for a long while.)

    And: Would it make sense to have ext2 on the file system where the transaction log resides (potentially a rather small file system, relatively quick to fsck), and then store non-transaction-log data on large ext3 file system(s)?

    ReplyDelete
  3. Andrew, the purpose of using the PSORT_DBTEMP environment variable to redirect sort-work files to a filesystem is to take advantage of the characteristics of the OS's filesystem cache, which tries to minimize IOs by flushing data to disk only when it has to. Since sort-work files are by nature very short lived, the result of using PSORT_DBTEMP is that most of these NEVER get written to disk: they live in the cache until the engine merges two together into a third file and deletes the two older ones. Therefore, I don't think that it really matters what filesystem underlies the paths you include in PSORT_DBTEMP. I would go for faster ones just in case. Recovery is not an issue, since if there is a failure or crash, only the sort is lost.

    ReplyDelete
  4. Troels, one of the reasons that you would use RAW devices as your first choice for database chunks is that no FS recovery or fsck is required. If a filesystem is used exclusively for Informix chunks, it is very unlikely that it will be damaged during a hard crash. Why? Because what gets damaged in a filesystem during a hard crash is the metadata - i.e. the inodes or equivalent structures. In the case of a filesystem that is used exclusively by Informix, all chunks are allocated once, and so (unless the filesystem is journaled - hint, hint) the metadata is never changed (except the modified and access times, I guess, and Informix does not depend on these at all)!

    That means that the filesystem will ALWAYS fsck as clean! Any internal inconsistencies within the data stored in the chunks/files will not be corrected by fsck, and it's redundant to ask a journaled filesystem to repair them, since Informix can repair the internal damage more quickly and efficiently using its own fast recovery procedures, which kick in during an engine restart if the engine finds that the physical log was not emptied during a controlled shutdown.

    Actually, the logical log dbspace(s) on an Informix OLTP system are small only relative to the data chunks of the server taken as a whole. On most Informix systems the logical logs (equivalent to the transaction or redo logs on some other RDBMSes) are normally quite large, to permit the server to continue functioning for at least a day, often for several days, in case there is a problem with the mechanisms used to back up each logical log when it fills. If all of the logical logs are full and have not been backed up, the engine will not reuse those log files and will block new transactions from starting and existing transactions from continuing until the logs have been backed up or more logical log space has been added. So, to keep us sane, most Informix DBAs configure lots of logical log space, enough for at least a weekend's work, in case the automated logical log backups fail for some reason outside the control of the Informix engine (like someone deleting the filesystem to which the backups would be written, or the tape drive being taken offline for maintenance).

    So, the direct answer is: I prefer RAW disk devices for Informix chunks over ANY filesystem, but if I have to use a filesystem, I prefer one that is as simple and fast as possible, so EXT2 on Linux.

    ReplyDelete
  5. Troels, I entered a complete and detailed response to your comment, but the post process failed, and I don't have the patience to type it all in again right now. Email me if you want it all. For now, suffice it to say that Informix chunks are pre-allocated, so that the metadata of the filesystem does not change once the file supporting the chunk is created. That means that an fsck on a filesystem that ONLY has Informix chunk files will always come up clean. So this is not a problem. Informix, as already said, has its own fast recovery procedures that can repair any internal damage within the file(s), damage which fsck won't recognize anyway, so not a problem. EXT2 is perfect.
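    The pre-allocation point can be sketched in Python (os.posix_fallocate is POSIX/Linux; the path and size are made up for the demo, and a real chunk would of course be far larger):

```python
import os

# Sketch: pre-allocate a chunk file in full, the way Informix chunks are
# created, so the filesystem's metadata never changes afterward and an
# fsck of a chunks-only filesystem always comes up clean.
path = "/tmp/demo_prealloc_chunk"   # hypothetical path
size = 64 * 1024 * 1024             # a small 64 MB "chunk" for the demo

fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o660)
os.posix_fallocate(fd, 0, size)     # all blocks assigned up front
os.close(fd)

print(os.path.getsize(path) == size)   # True: space is fully reserved
```

    Every subsequent page write lands inside space the filesystem already accounted for, so the inode never changes again.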

    ReplyDelete
  6. Art, it looks like your detailed response made it, after all.

    About ext2: So if there is no metadata activity on a file system, e2fsck can check it very quickly? Interesting. I guess I will have to test this quite a bit before feeling comfortable with it.

    By the way: Most of my database activities revolve around DB2. There, I've grown rather tired of having to deal with raw devices; I think they have a tendency to breed quickly, and it becomes a mess. And it seems that DB2 may try to somehow balance I/O between raw devices, so it may be important that they are of the same size, etc. That's why I'm moving towards storing data on file systems. - But I would very much steer away from double-journalling and sub-optimal data placement.

    It would be a funny retro-step to go back to ext2 :-)

    ReplyDelete
  7. Troels, I don't know why I didn't notice your followup comment for 3 months. Sorry about that. I understand your concern about RAW and DB2. I don't know how DB2 uses RAW devices myself, I am an admitted Informix Bigot. Informix LOVES RAW devices and performs best using them. There is no need for the Chunks (the smallest unit of allocating disk to the server in Informix) to be the same size and load balancing across chunks is mostly a function of the physical design of a table or index.

    I don't know what you mean by " have a tendency to breed quickly ..." however. Please elaborate.

    Beyond performance, I have a naturally paranoid DBA reason for preferring RAW devices over filesystem files. Too many times I have seen overzealous system administrators delete the chunk files that a database server was accessing, simply because the files' access times had not been updated in days or weeks or months. That was because we had disabled access-time updates on the filesystems used for database chunk files to improve performance, and Informix servers just don't often crash. It is not at all unusual to find an Informix engine that has been online continuously for a year or more. I know of several that have not been offline for even a single second in over 10 years!

    System administrators don't tend to mess with the device driver files under /dev so I have only once seen a RAW chunk destroyed out from under an Informix engine. I have personally rescued five servers over the years and could not rescue two others that were attacked by "helpful" SAs deleting filesystem chunk files. If I never have to do that again, I'll be a happier person.

    ReplyDelete
  8. NO JFS, NO RAID5, NO RAID6
    but how about JFS2 with the cio mount option?

    ReplyDelete
  9. JFS2 is actually OK, based on my research. JFS2 only journals the metadata and since the metadata on filesystems dedicated to Informix chunks will only change when new chunks are created or an existing chunk is dropped or (in 11.70+) extended to make it larger, there should be no performance cost to using JFS2 for OS chunks.

    ReplyDelete
  10. Thanks for that interesting and very detailed description. Whenever I can I'm using raw devices. I saw a 4GB chunk on a Windows machine that consisted of more than 1000 fragments. :-(
    Anyway, sometimes I need some chunks I can delete later on and I'm not keen on spending the time for fiddling around with the storage system.
    I am using XFS and I'm perfectly happy with it - as a "normal" FS. Do you know any reason NOT to create chunks on XFS?

    There are other alternatives too. Maybe I shouldn't mention it ;-), but as far as I am informed, Oracle's FS was designed with DB servers in mind.

    ReplyDelete
  11. XFS, like JFS2, does metadata-only journaling. There are some recommendations to mount an XFS filesystem with an 'external' journal on a separate device on a separate channel when using it for databases or other high-random-write/high-concurrency applications, to avoid the performance impact of journal operations on FS storage operations. However, given that an Informix chunk is pre-allocated and fully populated (i.e. not sparse), there are not likely to be many journal operations anyway. I would say that XFS is probably as good as JFS2 for chunks. As far as using an Oracle FS - I have not much idea. I know that OCFS is non-journaled and that OCFS2 journals both metadata and data blocks (like EXT4), but I have no idea of its performance characteristics. However, I am skeptical of any data-block-journaled filesystem and would not use one myself.

    ReplyDelete
  12. Art, what would you recommend for chunks that reside on a SAN? I think I will run into trouble with the number of my LUNs and SAN replication if I have to create one LUN for each chunk.
    At the moment we have one big filesystem with cooked chunks.

    ReplyDelete
  13. Michael,
    What version of IDS are you running? If it is 10.00 or later, it supports O_DIRECT/DIRECT_IO, so using filesystem chunks is not as bad as it used to be. The performance hit is between 5% and 8%, as compared to 25-35% without O_DIRECT enabled.

    That said, for best performance you should still be using RAW chunks, and in order to avoid maintenance and restore-time headaches you should have one partition per chunk. You can do that using one LUN per chunk, or you can create larger LUNs on the SAN and subdivide each into multiple logical RAW devices using a volume manager, assigning one such partition per chunk. Remember that if you are using 9.40 or later, your chunks no longer have to be <= 2GB; they can each be up to 4TB in size.

    ReplyDelete
  14. Art, thank you for your thoughts. We have IDS 11.50 on a SuSE Linux Enterprise Server 10 SP3, running as a guest on VMware ESX 4.1, and we have been experiencing some performance issues since we virtualized our database server and moved the storage to a SAN.

    I am now trying to get the maximum out of my configuration and check for every single % of performance gain lying around on the floor :)

    ATM we have one cooked file for every chunk, with DIRECT_IO enabled. All of them reside on one big single LUN. The OS is on a separate LUN.

    I thought of doing what you recommend at first, but I did not, because at the end of the day all the IO has to go through my NIC to the SAN, so there won't be the usual performance gain from putting different chunks on different spindles, right?

    Thanks for your time!

    ReplyDelete
  15. Michael, you have several problems to deal with. Your biggest problem is that you are running Informix in a VM accessing a SAN. In my testing for a major Informix client we found that database IO performance with a SAN on a virtual machine is at least 50%, and as much as 80%, below the same SAN's performance on bare metal with the same OS and Informix versions. Part of the problem is in the VM software's IO routines, and part is in the SANs and how they work versus how the VM accesses them.

    I know that my client had been working with RedHat, VMWare, and their SAN vendor on the problem (actually they experienced similar performance issues on two different manufacturers' SAN systems - both major players with otherwise proven performance), but I don't know what the final resolution was, if any. Contact me directly and I can try to hook you up with the client so you can talk about it. I don't know if they resolved the issues they had and went ahead with the virtualization project or not, but either way, knowing what they were and were not able to do will help you out.

    That done, on to your direct question: will using more dbspaces and isolating your tables, indexes, logical logs, etc. make any difference, given that it all lives on a single structure on a single SAN accessed over a single network using a single NIC? Honestly, it depends on a lot of factors. Certainly spreading the load logically cannot hurt performance; it can only help or make no difference at all. Here are some of the issues: at checkpoint time IDS flushes dirty pages using one CLEANER thread per chunk. More chunks, more concurrent IOs. Will that improve server throughput? Maybe - up to the point where you swamp one of those singular resources I listed, yes. Next, if you can move indexes to dbspaces with wider pages, that will separate the indexes into a separate cache from the data, reducing buffer cache contention and LRU latch contention - a good thing.

    Bottom line? I've said it before, and I'll keep saying it: big disks and SANs were the two worst things to happen to databases! You would be far better off with 100 pairs of mirrored 300MB SCSI J.B.O.D. drives connected to 25 controllers (four spindles each) to make up your 30GB database than your single, probably RAID5, SAN LUN built from five 200GB drives. You would have 10 or 20 times as many spindles retrieving and storing your data in parallel, over more controller bandwidth than you could ever use, and you would have the ability to isolate your data, one table from another, as required to maximize your performance.

    ReplyDelete
  16. Just to be clear, I know that you mentioned using a filesystem, so you are likely to have multiple dbspaces and chunks in there. I want my comments on this BLOG to be as generally useful as possible and not misleading at all. So... In your case, assuming several appropriate chunks on a filesystem all living on a single LUN, changing to multiple LUNs is not going to make any significant difference in performance for you unless you can put those LUNs on multiple independent structures.

    The problem you are seeing is not related to the number of LUNs, but to using a VM.

    ReplyDelete
  17. Art, I thank you a lot for your elaborate explanations. I have just changed my ext3 chunk filesystem to ext2 and moved some chunks to raw devices, still residing on the same SAN, but at least it's a beginning. Furthermore, I have tuned some OS-related networking parameters, increased my BUFFERPOOL and LRUs, etc., and it seems that I still haven't reached the end of the pole. I am observing better NIC utilization, so maybe

    So I will at least try to get the most out of the non-VM-related stuff.

    I have read in some other of your postings (dated 07/2010) that for future versions of SuSE Enterprise Server, Linux is planning to stop raw-device support. Hopefully this is not the final word :)

    Concerning the contact offer, I will write a PM.

    ReplyDelete
  18. Hello Art, I'm looking for advice from you :) I've always used RAW devices (under Veritas)... Today I must install a new IFX 11.70 instance on Solaris 10 with Veritas. I have a SAN (25TB) configured with one LV in RAID50. The question I have: when I do a dd (bs=4k, size=64GB) on a VxFS file system I get a throughput of 735MB/s; when I do the same dd on a raw volume I only get 46MB/s... What do you think - should I use cooked or raw? Thank you

    ReplyDelete
  19. I responded to Benji privately, but I guess the response is relevant enough to repost to everyone:

    #1 - You missed that dd is writing to the OS cache, which is only flushing using Solaris's lazy cache algorithms. Informix has to open the cooked file with O_SYNC, which forces the OS cache to flush after each write, which slows it down considerably. If you enable DIRECT_IO the effect is reduced, because that bypasses the cache altogether. With O_DIRECT the difference is about 5% slower with COOKED; without O_DIRECT (so with O_SYNC) COOKED is about 25% slower than RAW. I have a test program I can send you if you like. I am traveling this week, so I cannot send it until I get back to my desktop machine. Let me know.
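    The dd-vs-engine discrepancy is easy to reproduce; this rough Python sketch is a stand-in for the test program (not the real one - the path is made up, and absolute timings vary by machine and filesystem):

```python
import os
import time

# Sketch of the dd-vs-database-write discrepancy: buffered writes land in
# the OS cache and return immediately; O_SYNC writes must reach stable
# storage first, so each write() pays the full cost of the flush.
PAGE = 4096
N = 200

def timed_writes(flags):
    fd = os.open("/tmp/demo_sync_test",
                 os.O_CREAT | os.O_WRONLY | flags, 0o660)
    start = time.perf_counter()
    for _ in range(N):
        os.write(fd, b"\x00" * PAGE)   # one "page" per write, like the engine
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed

buffered = timed_writes(0)           # what dd measures: cache speed
synced = timed_writes(os.O_SYNC)     # what the engine pays for safety
print(f"buffered: {buffered:.4f}s  O_SYNC: {synced:.4f}s")
```

    On real spinning disks the O_SYNC pass is typically orders of magnitude slower, which is the 735MB/s-vs-46MB/s gap in a nutshell.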

    #2 - NO RAID5! - RAID50 is a particularly pernicious version of RAID5 that wastes more disk than RAID10 while experiencing most of the safety and all of the performance problems of RAID5. Resist it with all of your being! RAID10 all the way! NO RAID5!!!!!!!!!!!!!!!!!!!

    Art

    ReplyDelete
  20. What if you're stuck with a filesystem like ZFS (company policy at a client site)? I imagine I will get burned running a TB-sized IDS instance over time, as the data is churned every six (6) months.

    So I am thinking I will get eventual fragmentation over time and a 3X to 7X drop in performance. I think the only way around this potential issue is to schedule downtime and copy the chunks into a clean ZFS partition as a 'painful' maintenance process.

    Do you have any insights or workarounds that might alleviate problems in such a ZFS environment?

    Jon

    ReplyDelete
  21. I would see if you can get permission to use RAW chunks, which are still best. If you cannot, then you just need to document the potential performance problem and the fact that the only way to fix it will require over 24 hours of downtime. Then prepare a quarterly performance review and maintenance plan document that includes the contingency of taking the server offline for the reorg each quarter and an estimate of the downtime for the next quarter. Finally, get the powers that be to sign off on your comprehensive plan.

    FYI, you can use a restore of your archives to refresh the database chunks to a contiguous state after dropping and rebuilding the filesystems that house them.

    ReplyDelete
  22. Art thanks, pretty much what I thought. I will look into getting a waiver for using RAW chunks but with this client I am not too hopeful.

    Like most COW (copy-on-write) file systems, ZFS seems to be ENJOY-FIRST/SUFFER-LATER, at least for databases.

    Note, ZFS had a planned feature called 'block pointer rewrite' (BPR), I believe; if this were ever implemented, background defragmenting of the ZFS file system would be possible. Alas, there hasn't been any progress (or much talk) on this ZFS feature for years.

    Jon

    ReplyDelete
  23. Art, let's say I convince my client to go with RAW devices. In this case, due to the physical environment, I would then have only six (6) 1TB SAS drives available to me.

    In my application I have three (3) dbspaces - 'rootdbs', 'indxdbs' and 'datadbs' - with 348 chunks; the names imply just what they are used for.

    rootdbs 2k psize, 1024000 pages/chk, 2 chks
    indxdbs 8k psize, 256000 pages/chk, 200 chks
    datadbs 8k psize, 256000 pages/chk, 700 chks

    A similar instance is almost full with the following data.

    rootdbs 4GB total space in bytes used now
    indxdbs 400GB total space in bytes used now
    datadbs 1400GB total space in bytes used now

    I think maybe I should 'chop' each of the six 1TB drives into three (3) slices: 1 slice for rootdbs, 1 slice for indxdbs, 1 slice for datadbs (if it helped I could do a few more slices for index and data).

    slice......s4......s5......s6
    drive #1 8GB, 216GB, 712GB
    drive #2 8GB, 216GB, 712GB
    drive #3 8GB, 216GB, 712GB

    mirror set

    slice......s4......s5......s6
    drive #4 8GB, 216GB, 712GB
    drive #5 8GB, 216GB, 712GB
    drive #6 8GB, 216GB, 712GB

    s4 for rootdbs (2 disks) and maybe separate logs (4 disks)
    s5 for indxdbs (placed on all 6 disks)
    s6 for datadbs (placed on all 6 disks)

    Should I just assign one massive chunk to each slice, or is there a benefit to making about 465 chunks, all 2GB in size?

    I imagine (on the same drive) that if I have multiple chunks - 353 chunks of size 2GB, or say 1/4 of the slice - 4 chunks of size 176GB - then since it is the same drive it will perform the same as a single 706GB chunk, except perhaps for page cleaners.

    Across all drives: monolithic massive chunk I would still have 6 chunks in indxdbs and 6 chunks in datadbs and 24 in indxdbs, if I went with 1/4 slice policy I would have 24 chunks in the datadbs, if I left the chunks at 2GB I would have 2,118 chunks in datadbs and 648 chunks in indxdbs. Of course these chunks are all part of a mirror.

    The bottom line is I only have six spindles to use in a raw Informix setup (3 primary, 3 mirror); it won't be much.

    I am curious whether Informix will do round-robin reads from both sides of the mirror, or even better, read in parallel to double the read speed over a single disk (I know it has to write both sides).

    BTW I feel like this is getting too technical for this blog, feel free to delete this post and email me directly at jon DOT strabala AT quantumsi DOT com

    Thanks in Advance

    Jon

    ReplyDelete
  24. I'm going to leave your post here because this is a fairly common concern but I will answer privately only because of the detail involved.

    Anyone else with similar concerns who wants advice can feel free to email me directly.

    ReplyDelete