Comments on "Informix - My view: New Journaled Filesystem Rant" (24 comments)
Blog author: art.kagel@gmail.com

art.kagel@gmail.com | 2012-10-18 09:56

I'm going to leave your post here because this is a fairly common concern, but I will answer privately because of the detail involved.

Anyone else with similar concerns who wants advice can feel free to email me directly.

Jon Strabala | 2012-10-17 13:46

Art, let's say I convince my client to go with RAW devices in this case; due to the physical environment I would then have only six (6) 1TB SAS drives available to me.

In my application I have three (3) dbspaces, 'rootdbs', 'indxdbs' and 'datadbs' (348 chunks); the names imply just what they are used for.
rootdbs: 2k page size, 1,024,000 pages/chunk, 2 chunks
indxdbs: 8k page size, 256,000 pages/chunk, 200 chunks
datadbs: 8k page size, 256,000 pages/chunk, 700 chunks

A similar instance is almost full, with the following space used:

rootdbs: 4GB
indxdbs: 400GB
datadbs: 1400GB

I think maybe I should 'chop' each of the six 1TB drives into three (3) slices: one slice for rootdbs, one for indxdbs, one for datadbs (if it helped I could do a few more slices for index and data).

slice       s4     s5      s6
drive #1    8GB    216GB   712GB
drive #2    8GB    216GB   712GB
drive #3    8GB    216GB   712GB

mirror set:

slice       s4     s5      s6
drive #4    8GB    216GB   712GB
drive #5    8GB    216GB   712GB
drive #6    8GB    216GB   712GB

s4 for rootdbs (2 disks) and maybe separate logs (4 disks)
s5 for indxdbs (placed on all 6 disks)
s6 for datadbs (placed on all 6 disks)

Should I just assign one massive chunk to each slice, or is there a benefit to making about 465 chunks, all 2GB in size?

I imagine (on the same drive) that if I have multiple chunks - 353 chunks of 2GB (or, say, 1/4 of the slice: 4 chunks of 176GB) - since it is the same drive it will perform the same as a single 706GB chunk, except perhaps for page cleaners.

Across all drives: with one monolithic chunk per slice I would still have 6 chunks in indxdbs and 6 chunks in datadbs; if I went with the 1/4-slice policy I would have 24 chunks in each; and if I left the chunks at 2GB I would have 2,118 chunks in datadbs and 648 chunks in indxdbs. Of course these chunks are all part of a mirror.

The bottom line is that I only have six spindles to use; in a raw Informix setup (3 primary, 3 mirror) it won't be much.
I am curious whether Informix will do round-robin reads from both sides of the mirror or, even better, read in parallel at double the read speed of a single disk (I know it has to write both sides).

BTW, I feel like this is getting too technical for this blog; feel free to delete this post and email me directly at jon DOT strabala AT quantumsi DOT com.

Thanks in advance,

Jon

Jon Strabala | 2012-10-17 11:41

Art, thanks - pretty much what I thought. I will look into getting a waiver for using RAW chunks, but with this client I am not too hopeful.

Like most COW (copy-on-write) file systems, ZFS seems to be ENJOY-FIRST/SUFFER-LATER, at least for databases.

Note, ZFS had a planned feature called 'block pointer rewriting' (BPR); I believe that if this is ever implemented, background defragmenting of the ZFS file system would become possible. Alas, there hasn't been any progress (or much talk) on this ZFS feature for years.

Jon

art.kagel@gmail.com | 2012-10-16 19:13

I would see if you can get permission to use RAW chunks, which are still best. If you cannot, then you just need to document the potential performance problem and the fact that the only way to fix it will require over 24 hours of downtime. Then prepare a quarterly performance review and maintenance plan document that includes the contingency to take the server offline for the reorg each quarter and an estimate of the downtime for the next quarter.
Finally, get the powers that be to sign off on your comprehensive plan.

FYI, you can use a restore of your archives to refresh the database chunks to a contiguous state after dropping and rebuilding the filesystems that house them.

Jon Strabala | 2012-10-15 23:49

What if you're stuck with a filesystem like ZFS (company policy at a client site)? I imagine I will get burned running a TB-sized IDS instance over time, as the data is churned every six (6) months.

So I am thinking I will see gradual fragmentation over time and a 3X to 7X drop in performance. I think the only way around this potential issue is to schedule downtime and copy the chunks onto a clean ZFS partition as a 'painful' maintenance process.

Do you have any insights or workarounds that might alleviate problems in such a ZFS environment?

Jon

art.kagel@gmail.com | 2011-12-01 23:06

I responded to Benji privately, but I guess the response is relevant enough to repost for everyone:

#1 - You missed that dd is writing to the OS cache, which only flushes using Solaris's lazy cache algorithms. Informix has to open a cooked file with O_SYNC, which forces the OS cache to flush after each write and slows it down considerably. If you enable DIRECT_IO the effect is reduced, because that bypasses the cache altogether. With O_DIRECT the difference is about 5% slower with COOKED; without O_DIRECT (so with O_SYNC) COOKED is about 25% slower than RAW.
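Art's #1 point can be seen with a small timing sketch (an illustration, not his test program; Linux/Unix only, and the file sizes are arbitrary). Buffered writes land in the OS cache, which is what a plain dd measures; O_SYNC forces every write to stable storage, which is what the engine requires for cooked chunks without DIRECT_IO:

```python
import os, tempfile, time

def timed_writes(extra_flags, n=200, size=8192):
    """Time n sequential writes of `size` bytes to a file opened with extra_flags."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd = os.open(path, os.O_WRONLY | extra_flags)
    buf = b"\0" * size
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, buf)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

buffered = timed_writes(0)           # like dd: the OS cache absorbs the writes
synced = timed_writes(os.O_SYNC)     # like a cooked chunk: flush on every write
print(f"buffered {buffered:.4f}s, O_SYNC {synced:.4f}s")
```

On spinning disks the O_SYNC run is typically many times slower; the exact gap depends on the storage and filesystem underneath.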
I have a test program I can send you if you like. I am traveling this week, so I cannot send it until I get back to my desktop machine. Let me know.

#2 - NO RAID5! RAID50 is a particularly pernicious version of RAID5 that wastes more disk than RAID10 while experiencing most of the safety problems and all of the performance problems of RAID5. Resist it with all of your being! RAID10 all the way! NO RAID5!

Art

benji | 2011-11-30 04:20

Hello Art, I'm looking for advice from you :) I've always used RAW devices (under Veritas). Today I must install a new IFX 11.70 instance on Solaris 10 with Veritas. I have a SAN (25TB) configured with one LV in RAID50. My question: when I do a dd (bs=4k, size=64GB) on a VxFS file system I get a throughput of 735MB/s, but when I do the same dd on a raw volume I only get 46MB/s. What do you think I should use, cooked or raw? Thank you

Michael | 2011-02-08 10:04

Art, thank you a lot for your elaborate explanations. I have just changed my ext3 chunk filesystem to ext2 and moved some chunks to raw devices, still residing on the same SAN, but it's at least a beginning. Furthermore I have tuned some OS-related networking parameters, increased my BUFFERPOOL and LRUs, etc., and it seems that I still haven't reached the end of the pole.
I am observing better NIC utilization, so maybe...

So I will at least try to get the most out of the non-VM-related stuff.

I have read in some of your other postings (dated 07/2010) that for future versions of SUSE Enterprise Server, Linux is planning to stop raw-device support. Hopefully this is not the final statement :)

Concerning the contact offer, I will write a PM.

art.kagel@gmail.com | 2011-02-05 19:46

Just to be clear: I know that you mentioned using a filesystem, so you are likely to have multiple dbspaces and chunks in there. I want my comments on this BLOG to be as generally useful as possible and not misleading at all. So... in your case, assuming several appropriate chunks on a filesystem all living on a single LUN, changing to multiple LUNs is not going to make any significant difference in performance for you unless you can put those LUNs on multiple independent structures.

The problem you are seeing is not related to the number of LUNs, but to using a VM.

art.kagel@gmail.com | 2011-02-05 19:42

Michael, you have several problems to deal with. Your biggest problem is that you are running Informix in a VM accessing a SAN. In my testing for a major Informix client, we found that database IO performance with SANs on a virtual machine is at least 50% and as much as 80% below the same SAN's performance on bare metal with the same OS and Informix versions.
Part of the problem is in the VM software's IO routines, and part is in the SANs and how they work versus how the VM is accessing them.

I know that my client had been working with RedHat, VMWare, and their SAN vendor on the problem (actually they experienced similar performance issues on two different manufacturers' SAN systems - both major players with proven performance otherwise), but I don't know what the final resolution was, if any. Contact me directly and I can try to hook you up with the client so you can talk about it. I don't know if they resolved the issues and went ahead with the virtualization project or not, but either way, knowing what they were and were not able to do will help you out.

That done, on to your direct question: will using more dbspaces and isolating your tables, indexes, logical logs, etc. make any difference, given that it all lives on a single structure on a single SAN accessed over a single network using a single NIC? Honestly, it depends on a lot of factors. Certainly spreading the load logically cannot hurt performance; it can only help or make no difference at all. Here are some of the issues: at checkpoint time IDS flushes dirty pages using one CLEANER thread per chunk. More chunks, more concurrent IOs. Will that improve server throughput? Maybe - up to the point where you are swamping one of those one-of-a-kind resources I listed, yes. Next, if you can move indexes to dbspaces with wider pages, that will separate the indexes into a separate cache from the data, reducing buffer cache contention and LRU latch contention - a good thing.

Bottom line? I've said it before, and I'll keep saying it: big disks and SANs were the two worst things to happen to databases! You would be far better off with 100 pairs of mirrored 300MB SCSI J.B.O.D. drives connected to 25 controllers (four spindles each) to make up your 30GB database than with your single, probably RAID5, SAN LUN built from five 200GB drives.
You would have 10 or 20 times as many spindles retrieving and storing your data in parallel, over more controller bandwidth than you could ever use, and you would have the ability to isolate your data, one table from another, as required to maximize your performance.

Michael | 2011-02-05 05:57

Art, thank you for your thoughts. We have IDS 11.50 on a SUSE Linux Enterprise Server 10 SP3, running as a guest on VMware ESX 4.1, and we have been experiencing some performance issues since we virtualized our database server and moved the storage to a SAN.

I am now trying to get the maximum out of my configuration and check for every single % of performance gain lying around on the floor :)

ATM we have one cooked file for every chunk, with DIRECT_IO enabled. All of them reside on one big single LUN. The OS is on a separate LUN.

I thought of doing what you recommend at first, but I did not, because at the end of the day all the IO has to go through my NIC to the SAN, so there won't be the usual performance gain from putting different chunks on different spindles, right?

Thanks for your time!

art.kagel@gmail.com | 2011-02-03 12:28

Michael,
What version of IDS are you running? If it is 10.00 or later it supports O_DIRECT/DIRECT_IO, so using filesystem chunks is not as bad as it used to be. The performance hit is between 5-8%, as compared to 25-35% without O_DIRECT enabled.

That said, for best performance you should still be using RAW chunks, and in order to avoid maintenance and restore-time headaches you should have one partition per chunk. You can do that using one LUN per chunk, or you can create larger LUNs on the SAN, subdivide each into multiple logical RAW devices using a volume manager, and assign one such partition per chunk. Remember that if you are using 9.40 or later your chunks no longer have to be <= 2GB; they can each be up to 4TB in size.

Michael | 2011-02-03 06:39

Art, what would you recommend for chunks that reside on a SAN? I think I will run into trouble with the number of my LUNs and with SAN replication if I have to create one LUN for each chunk. At the moment we have one big filesystem with cooked chunks.

art.kagel@gmail.com | 2011-01-05 13:46

XFS, like JFS2, does metadata-only journaling. There are some recommendations to be sure to mount an XFS filesystem with an 'external' journal on a separate device on a separate channel when using it for databases or other high-random-write/high-concurrency applications, to avoid the performance impact of journal operations on FS storage operations.
However, given that an Informix chunk is pre-allocated and fully populated (i.e. not sparse), there are likely to be few journal operations anyway. I would say that XFS is probably as good as JFS2 for chunks. As for using an Oracle FS, I don't have much of an idea. I know that OCFS is non-journaled and that OCFS2 does metadata and data-block journaling (like EXT4), but I have no idea of its performance characteristics. However, I am skeptical of any data-block-journaled filesystem and would not use one myself.

Unknown | 2011-01-04 04:57

Thanks for that interesting and very detailed description. Whenever I can, I'm using raw devices. I saw a 4GB chunk on a Windows machine that consisted of more than 1000 fragments. :-(

Anyway, sometimes I need some chunks I can delete later on, and I'm not keen on spending the time fiddling around with the storage system. I am using XFS and I'm perfectly happy with it - as a "normal" FS. Do you know any reasons for NOT creating chunks on XFS?

There are other alternatives too. Maybe I shouldn't mention it ;-), but as far as I am informed, Oracle FS was designed with DB servers in mind.

art.kagel@gmail.com | 2010-12-23 10:25

JFS2 is actually OK, based on my research.
JFS2 only journals the metadata, and since the metadata on filesystems dedicated to Informix chunks will only change when new chunks are created, or an existing chunk is dropped or (in 11.70+) extended to make it larger, there should be no performance cost to using JFS2 for OS chunks.

Unknown | 2010-12-21 11:27
NO JFS, NO RAID5, NO RAID6 - but how about JFS2 with the cio mount option?

art.kagel@gmail.com | 2010-11-11 00:29

Troels, I don't know why I didn't notice your followup comment for 3 months. Sorry about that. I understand your concern about RAW and DB2. I don't know how DB2 uses RAW devices myself; I am an admitted Informix bigot. Informix LOVES RAW devices and performs best using them. There is no need for the chunks (the smallest unit of allocating disk to the server in Informix) to be the same size, and load balancing across chunks is mostly a function of the physical design of a table or index.

I don't know what you mean by "have a tendency to breed quickly...", however. Please elaborate.

Beyond performance, I have a naturally paranoid DBA reason for preferring RAW devices over filesystem files. Too many times I have seen overzealous system administrators delete the chunk files that a database server was actively using, simply because the files' access times had not been updated in days or weeks or months. That was because we had disabled access-time updates on the filesystems used for database chunk files to improve performance, and Informix servers just don't crash often. It is not at all unusual to find an Informix engine that has been online continuously for a year or more. I know of several that have not been offline for even a single second in over 10 years!

System administrators don't tend to mess with the device driver files under /dev, so I have only once seen a RAW chunk destroyed out from under an Informix engine. Over the years I have personally rescued five servers, and could not rescue two others, that were attacked by "helpful" SAs deleting filesystem chunk files.
If I never have to do that again, I'll be a happier person.

Troels Arvin | 2010-08-04 16:06

Art, it looks like your detailed response made it, after all.

About ext2: so if there is no metadata activity on a file system, e2fsck can check it very quickly? Interesting. I guess I will have to test this quite a bit before feeling comfortable with it.

By the way: most of my database activities revolve around DB2. There, I've grown rather tired of having to deal with raw devices; I think they have a tendency to breed quickly, and it becomes a mess. And it seems that DB2 may try to somehow balance I/O between raw devices, so it may be important that they are of the same size, etc. That's why I'm moving towards storing data on file systems. But I would very much like to steer away from double-journalling and sub-optimal data placement.

It would be a funny retro-step to go back to ext2 :-)

art.kagel@gmail.com | 2010-08-03 14:18

Troels, I entered a complete and detailed response to your comment, but the post process failed, and I don't have the patience to type it all in again right now. Email me if you want it all. For now, suffice it to say that Informix chunks are pre-allocated, so the metadata of the filesystem does not change once the file supporting the chunk is created. That means that an fsck on a filesystem that ONLY has Informix chunk files will always come up clean. So this is not a problem.
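The pre-allocation point can be illustrated directly (a sketch; a tiny 4MB file stands in for a chunk). A fully written file has all of its blocks, and hence its filesystem metadata, fixed at creation time, unlike a sparse file whose block map keeps changing as data arrives:

```python
import os, tempfile

SIZE = 4 * 1024 * 1024  # a 4MB stand-in for a chunk

# Sparse: the size is set, but no data blocks are allocated yet, so the
# block map (metadata) will keep changing as pages are filled in later.
sparse_fd, sparse_path = tempfile.mkstemp()
os.ftruncate(sparse_fd, SIZE)

# Pre-allocated, Informix-style: every block written up front, so the
# filesystem metadata never changes again after creation.
full_fd, full_path = tempfile.mkstemp()
os.write(full_fd, b"\0" * SIZE)
os.fsync(full_fd)

sparse_blocks = os.fstat(sparse_fd).st_blocks
full_blocks = os.fstat(full_fd).st_blocks
print(sparse_blocks, full_blocks)  # the sparse file occupies (almost) no blocks

for fd, path in ((sparse_fd, sparse_path), (full_fd, full_path)):
    os.close(fd)
    os.unlink(path)
```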
Informix, as already said, has its own fast-recovery procedures that can repair any internal damage within the file(s) - damage which fsck won't recognize anyway - so that is not a problem either. EXT2 is perfect.

art.kagel@gmail.com | 2010-08-03 14:14

Troels, one of the reasons that you would use RAW devices as your first choice for database chunks is that no FS recovery or fsck is required. If a filesystem is used exclusively for Informix chunks, it is very unlikely that it will be damaged during a hard crash. Why? Because what gets damaged in a filesystem during a hard crash is the metadata - the inodes or equivalent structures. In the case of a filesystem that is used exclusively by Informix, all chunks are allocated once, and so (unless the filesystem is journaled - hint, hint) the metadata is never changed (except the modified and access times, I guess, and Informix does not depend on those at all)!

That means that the filesystem will ALWAYS fsck as clean! Any internal inconsistencies within the data stored in the chunks/files will not be corrected by fsck, and it is redundant to ask a journaled filesystem to repair them, since Informix can repair the internal damage more quickly and efficiently using its own fast-recovery procedures, which kick in during an engine restart if the engine finds that the physical log was not emptied during a controlled shutdown.

Actually, the logical-log dbspace(s) on an Informix OLTP system are only relatively smaller than the data chunks for the server taken as a whole.
On most Informix systems the logical logs (equivalent to the transaction or redo logs of some other RDBMSes) are normally quite large, to permit the server to continue functioning for at least a day, often for several days, in case there is a problem with the mechanisms used to back up each logical log as it fills. If all of the logical logs are full and have not been backed up, the engine will not reuse those log files and will block new transactions from starting and existing transactions from continuing until the logs have been backed up or more logical-log space has been added. So, to keep us sane, most Informix DBAs configure lots of logical-log space - enough for at least a weekend's work - in case the automated logical-log backups fail for some reason outside the control of the Informix engine (like someone deleting the filesystem to which the backups would be written, or the tape drive being taken offline for maintenance).

So, the direct answer is: I prefer RAW disk devices for Informix chunks over ANY filesystem, but if I have to use a filesystem, I prefer one that is as simple and fast as possible - so EXT2 on Linux.

art.kagel@gmail.com | 2010-08-03 14:02

Andrew, the purpose of using the PSORT_DBTEMP environment variable to redirect sort-work files to a filesystem is to take advantage of the OS's filesystem cache, which tries to minimize IOs by flushing data to disk only when it has to. Since sort-work files are by nature very short-lived, the result of using PSORT_DBTEMP is that most of them NEVER get written to disk; they live in the cache until the engine merges two together into a third file and deletes the two older ones.
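The merge pattern Art describes - two sorted runs combined into a third file, then the two older ones deleted - looks roughly like this (a sketch of the general external-sort pattern, not Informix internals):

```python
import heapq, os, tempfile

def write_run(values):
    """Write one sorted run to a short-lived work file, one value per line."""
    fd, path = tempfile.mkstemp(suffix=".srt")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in sorted(values))
    return path

def merge_runs(path_a, path_b):
    """Merge two runs into a third file and delete the two older ones."""
    fd, out_path = tempfile.mkstemp(suffix=".srt")
    with open(path_a) as a, open(path_b) as b, os.fdopen(fd, "w") as out:
        out.writelines(heapq.merge(a, b, key=int))
    os.unlink(path_a)  # short-lived: the originals may never reach disk at all
    os.unlink(path_b)
    return out_path

run1 = write_run([5, 1, 9])
run2 = write_run([4, 2, 8])
merged = merge_runs(run1, run2)
with open(merged) as f:
    result = f.read().split()
print(result)  # ['1', '2', '4', '5', '8', '9']
os.unlink(merged)
```

Because each work file is created, consumed, and unlinked so quickly, a lazily flushing filesystem cache can often satisfy the whole cycle without any physical disk writes, which is exactly the behavior PSORT_DBTEMP exploits.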
Therefore, I don't think it really matters what filesystem underlies the paths you include in PSORT_DBTEMP. I would go for the faster ones just in case. Recovery is not an issue, since if there is a failure or crash, only the sort is lost.

Troels Arvin | 2010-07-29 15:50

I agree that it's stupid to have journaling in both the database and the file system.

But regarding ext2: don't you risk having very long periods of fsck after an accidental, brutal shutdown? (Not that I have seen that for a long while.)

And: would it make sense to have ext2 on the file system where the transaction log resides (potentially a rather small file system, relatively quick to fsck), and then store the non-transaction-log data on large ext3 file system(s)?

Andrew Ford | 2010-07-29 14:00

Is the same true for the filesystems listed in PSORT_DBTEMP, or are those more traditional filesystem-type files that would be OK on a journaled filesystem?