Database Forum / Oracle / Oracle Server / May 2005
iostat - multiblock read count
utkanbir - 18 May 2005 09:47 GMT
Hi,
My system is a two-node RAC on Red Hat Linux 2.1 with 4 IA64 CPUs, 8 GB RAM, OCFS and EMC RAID 10.
The db_block_size is 16 KB and the current multiblock read count is 64 (which makes 1 MB per read).
Below are some samples for different multiblock read counts. It seems that when I decrease the multiblock read count, performance increases:
Sample query :
select /*+parallel(m,8)*/count(*) from taniadm.MERKEZ_CIKIS_34 m
The table is stored in a locally managed tablespace that uses uniform extents of 10 MB. (I created it just for this test and chose this large extent size so that the database server can use multiblock reads effectively.)
parallel 8, multiblock read count 64 takes: 1:07 min
Device:  rrqm/s    wrqm/s  r/s      w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await   svctm  %util
sdh1     43890.20  0.00    896.80   1.40  50860.00  1.40    56.63     119.07    131.83  1.09   100.00
parallel 8, multiblock read count 32 takes: 59 sec
Device:  rrqm/s    wrqm/s  r/s      w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await   svctm  %util
sdh1     56885.00  0.00    1051.60  1.20  58651.60  1.20    55.71     138.40    130.24  0.93   100.00
parallel 8, multiblock read count 16 takes: 45 sec
Device:  rrqm/s    wrqm/s  r/s      w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await   svctm  %util
sdh1     66734.20  0.00    1347.80  1.80  76781.60  1.80    56.89     48.62     36.00   0.72   100.00
parallel 8, multiblock read count 8 takes: 44 sec
Device:  rrqm/s    wrqm/s  r/s      w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await   svctm  %util
sdh1     66598.40  0.00    1371.40  2.00  77467.20  2.00    56.41     24.22     17.66   0.71   100.02
1. It seems that when I decrease the multiblock read count parameter, rsec/s increases, but I expected to see the opposite. What's wrong with this?
2. Are the r/s values normal, or do they indicate that my disks are saturated? I have read an article explaining that 100 or 200 reads per second is enough to saturate a disk, but here I see much larger values.
3. Related to my second question: since I use RAID 10 (striping + mirroring), is it possible to get higher I/O rates? Looking at the iostat values, (rsec/s) / (r/s) is about 56 sectors, which is roughly 28 KB read per request. That seems very low to me. On a Sun Solaris UFS file system, for instance, I can achieve 128 KB or even 1 MB per read by tuning the parameters; I don't understand why my Linux box is different.
Any help will be appreciated. Kind Regards, tolga
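The roughly 28 KB per read mentioned in question 3 can be reproduced from the four iostat samples in this post. A minimal Python sketch, assuming iostat counts 512-byte sectors; the function and variable names are illustrative only and do not come from any tool:

```python
SECTOR_BYTES = 512  # assumption: iostat's rsec/s counts 512-byte sectors


def avg_read_kb(rsec_per_s, reads_per_s):
    """Average size in KB of one read request issued to the device."""
    return rsec_per_s * SECTOR_BYTES / 1024 / reads_per_s


def throughput_mb(rsec_per_s):
    """Read throughput in MB/s."""
    return rsec_per_s * SECTOR_BYTES / (1024 * 1024)


# multiblock read count -> (rsec/s, r/s), copied from the iostat samples
samples = {
    64: (50860.00, 896.80),
    32: (58651.60, 1051.60),
    16: (76781.60, 1347.80),
    8: (77467.20, 1371.40),
}

for mbrc, (rsec, rps) in samples.items():
    print(f"mbrc={mbrc:2d}  {avg_read_kb(rsec, rps):5.1f} KB/read  "
          f"{throughput_mb(rsec):5.1f} MB/s")
```

Under that assumption every sample works out to about 28 KB per request whatever the multiblock read count is; what changes between the runs is the number of requests per second.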
DA Morgan - 18 May 2005 19:31 GMT
> Hi ,
> [quoted text clipped - 63 lines]
> Kind Regards,
> tolga
To answer your questions really requires a StatsPack or AWR report.
Upgrading to Red Hat 3 will likely improve things substantially.
Depending on your hardware you could look at Ethernet bonding to increase throughput. It depends on what the limiting factor is.
 Signature Daniel A. Morgan http://www.psoug.org damorgan@x.washington.edu (replace x with u to respond)
chao_ping - 19 May 2005 06:04 GMT
One possible reason could be your global cache management. If you shut down one node, the result may be different.
One question: is your OCFS doing direct I/O reads? Otherwise the file system cache can mask your result.
utkanbir - 23 May 2005 08:04 GMT
Hi Chao,
I have checked the Statspack output for this query. In the raw trace file I see lots of 'global cache cr request' wait events, but the time they take is very small compared to the disk read events:
select /*+NOPARALLEL(M) */count(*) from taniadm.MERKEZ_CIKIS_34 m
call     count      cpu    elapsed       disk      query    current       rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse        3     0.00       0.01          1          1          0          0
Execute      3     0.00       0.00          0          0          0          0
Fetch        3    61.10     767.28     307181     307209          0          3
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total        9    61.10     767.30     307182     307210          0          3
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 46
Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=102403 r=102399 w=0 time=262103077 us)
10133769  TABLE ACCESS FULL MERKEZ_CIKIS_34 (cr=102403 r=102399 w=0 time=257079273 us)
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       8        0.00          0.00
  SQL*Net message from client                     8       78.11        173.33
  global cache cr request                    154082        0.12          6.73
  db file scattered read                      23984        1.01        707.97
  latch free                                      6        0.03          0.05
  SQL*Net break/reset to client                   2        0.00          0.00
  library cache lock                              4        0.00          0.00
  db file sequential read                         3        0.01          0.02
*******************************************************************************
Here the "Times Waited" value for global cache cr request is large, but its "Total Waited" is very small. The majority of the query time is spent in disk I/O.
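To make that comparison concrete, the averages can be worked out from the trace figures. A small Python sketch using only the numbers shown in the trace; the helper name is illustrative:

```python
# event -> (times waited, total seconds waited), copied from the trace
waits = {
    "global cache cr request": (154082, 6.73),
    "db file scattered read": (23984, 707.97),
}


def avg_wait_ms(times_waited, total_seconds):
    """Average duration of one wait, in milliseconds."""
    return total_seconds / times_waited * 1000.0


for event, (count, total) in waits.items():
    print(f"{event:26s} {avg_wait_ms(count, total):8.3f} ms per wait")

# Blocks fetched per multiblock read: disk blocks read / scattered reads.
# The 3 single-block sequential reads are ignored, so this is approximate.
blocks_per_read = 307181 / 23984
print(f"blocks per scattered read: {blocks_per_read:.1f}")
```

On these figures a global cache wait averages well under a tenth of a millisecond, while each scattered read averages close to 30 ms and returns roughly 13 blocks.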
For direct I/O, I checked the Oracle executables, straced them (especially the open system calls) and saw the O_DIRECT flag. Also:
filesystemio_options is set to none. (For OCFS I was told it is not necessary to set it, since OCFS uses direct I/O without this parameter.)
Kind Regards,
> One possible reason could because of your global cache management. If
> you shutdown one node, maybe result will be diffirent.
>
> One question, is your OCFS doing directIO read? Else filesystem cache
> can mask your result.
Noons - 23 May 2005 08:44 GMT
> My system is a two node rac on redhat linux 2.1 , 4 ia64 cpus ,
> 8gb.ram , ocfs and emc raid 10
Time to move to RH3? ;)
> The db_block_size is 16kb. , current multi_block_read_Count is 64
> (which makes 1mb. of read)
That can mean nothing in Linux, read on...
> 1.It seems when i decrease the multiblock read count parameter , the
> rsec/s increases , but i expect to see the opposite.What's wrong with
> this?
I've got a funny feeling you just hit the 32K default Linux I/O limit. You see, until kernel release 2.6 (or patched 2.4), Linux will "secretly" transform any single I/O request for more than 32K bytes into as many 32K requests as needed. This takes time and physical overhead from the disk controller(s). When you reduce the dbfmr, you reduce this overhead and paradoxically (my my, what a long word for "D'uh!"...) you end up with a little more r/s. Read on.
> 3. Related to my second question , since i use raid 10 (striping +
> mirroring ) is it possible to get higher io rates ? Looking at the
[quoted text clipped - 3 lines]
> per read by playing with the parameters , i dont understand why my
> linux box is different.
There is a patch for RHAS at Oracle Metalink that gets rid of this 32K limitation. It applies AFAIK only to 2.4.21 onwards, until RHAS4, whereupon the 2.6 kernel takes over and it's not a problem anymore. However, I'm not sure if this patch is compatible with anything under Oracle 10g, so CHECK first with support.
If this is your problem you need first to upgrade to the adequate level of RHAS3, *then* apply the Oracle patch. You'll need to request it from Oracle support themselves.
Go here: http://www.oracle.com/technology/deploy/availability/pdf/ora_lcs.pdf for all the nasty details. HTH
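To put numbers on the splitting described in this post, here is a small Python sketch of how many device requests one multiblock read would become if each request really were capped at 32 KB. The 16 KB block size and the 32 KB cap come from the thread; the function name is illustrative, and whether the cap applies to a given system is exactly what is being debated.

```python
import math


def split_requests(db_block_kb, mbrc, max_io_kb):
    """Number of device requests needed to satisfy one multiblock read."""
    return math.ceil(db_block_kb * mbrc / max_io_kb)


for mbrc in (64, 32, 16, 8):
    read_kb = 16 * mbrc  # 16 KB is the db_block_size quoted in the thread
    print(f"mbrc={mbrc:2d}: {read_kb:4d} KB read -> "
          f"{split_requests(16, mbrc, 32)} requests of at most 32 KB")
```

With a multiblock read count of 64 a single 1 MB read would be issued as 32 separate requests under this model, and as 4 requests with a count of 8.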
chao_ping - 24 May 2005 10:28 GMT
Hi Noons,
This is the first time I have heard about this:
>>I've got a funny feeling you just hit the 32K default Linux I/O limit.
>>You see, until kernel release 2.6 (or patched 2.4), Linux will
>>"secretly" transform any single I/O request for more than 32K bytes
>>into as many 32K requests as needed
Can you provide some detail about this? For example, a Metalink ID, or a URL about the Linux kernel that covers it.
And Utkanbir, since this is a test system, can you shut down one node and run your test again? Thanks
Noons - 24 May 2005 12:53 GMT
chao_ping apparently said, on my timestamp of 24/05/2005 7:29 PM:
>>>You see, until kernel release 2.6 (or patched 2.4), Linux will
>>>"secretly" transform any single I/O request for more than 32K bytes
[quoted text clipped - 5 lines]
> And Utkanbir, since it is for test, can you shutdown one node and
> perform your test again?
That's funny: somehow Utkanbir's reply never made it to my server. I wonder why...
Anyways: see bug 4039598 (which is not really a bug...) and Red Hat's Bugzilla problem 148838. This doc also has some info on it: http://www.oracle.com/technology/deploy/availability/pdf/ora_lcs.pdf
And never hearing about it doesn't mean it's incorrect: it just means nobody told you before the simple facts. Now you know. ;)
 Signature Cheers Nuno Souto in sunny Sydney, Australia wizofoz2k@yahoo.com.au.nospam
Fabrizio - 24 May 2005 17:32 GMT
> I've got a funny feeling you just hit the 32K default Linux I/O limit.
> You see, until kernel release 2.6 (or patched 2.4), Linux will
[quoted text clipped - 3 lines]
> you reduce this overhead and paradoxically (my my, what a long
> word for "D'uh!"...) you end up with a little more r/s. Read on.
Is this a Red Hat limit?
None of my distributions seems to have a 32K I/O limit, even on the now old-fashioned 2.4 kernel.
Here is an example from SLES8:
# dd if=/dev/zero of=/boot/foo bs=4096k count=10
10+0 records in
10+0 records out
Device:    rrqm/s  wrqm/s    r/s   w/s     rsec/s  wsec/s    rkB/s  wkB/s     avgrq-sz  avgqu-sz  await   svctm  %util
/dev/sda1  0.00    40636.00  0.00  322.00  0.00    81920.00  0.00   40960.00  254.41    222.40    687.58  24.84  80.00
iostat shows 322 real calls to the device (while 40636 requests were merged into these 322 thanks to the async I/O).
40960.00 K / 322 = 127.20 K
almost 128 K which can be the limit of my physical device.
(by the way: redhat bug 148838 speaks about a 128K limit on qlogic).
Regards
 Signature Fabrizio Magni
fabrizio.magni@mycontinent.com
replace mycontinent with europe
Noons - 25 May 2005 03:13 GMT
> Is this a Redhat limit?
AFAIK, yes. That's why it is in the RedHat Bugzilla. It's not really a limit: just a default design decision of the 2.4 kernel people, I guess. And only for 2.4 kernel level. Ie, supposedly fixed in RHAS4 which is kernel 2.6 (or so I'm told...).
And of course in SLES 9 onwards? But then again, SLES always traditionally seemed to have less restrictions than RH. Looking at the compatibility tables in Metaclick for asynchIO and directio for different flavours of Linux, it comes out very clearly SLES paid attention to this disk stuff a lot more than RH. And it also supports a lot more file systems... (but I'm biased towards Suse, so take this with a grain of salt!)
> 40960.00 K / 322 = 127.20 K
>
> almost 128 K which can be the limit of my physical device.
Yeah. I think the patch lets it go to 1M which is what you need for sequential IO in devices like the low-cost RAID stuff Oracle is looking at now.
> (by the way: redhat bug 148838 speaks about a 128K limit on qlogic).
Yup, very much so. Which happens to involve all disk IO. See the best practices pdf link posted before for details.
Fabrizio - 25 May 2005 09:52 GMT
>>Is this a Redhat limit?
>
> AFAIK, yes. That's why it is in the RedHat Bugzilla.
> It's not really a limit: just a default design decision of
> the 2.4 kernel people, I guess. And only for 2.4 kernel level.
> Ie, supposedly fixed in RHAS4 which is kernel 2.6 (or so I'm told...).
I had a look at the associated bug on Metalink. It seems the 32K limit involves only I/O on raw devices and OCFS on a specific Red Hat version.
I cannot see any I/O-related limit in the 2.4 kernel (at least I couldn't spot any references in the source code or on the kernel mailing list).
As a test I took an old vanilla kernel (2.4.12) and performed the previous dd (always on ext3 and reiserfs). The limit is not there and the async I/O is doing its duty.
The interesting thing about the bug you pointed out is that it shows up only on OCFS and raw devices: those use synchronous OS I/O instead of AIO. I would conclude that it really is a Red Hat issue only.
May I ask for the sources of your statement about a 2.4 "design limit"? (I'd like to investigate this further: so far it seems a myth.)
> Yup, very much so. Which happens to involve all disk IO.
> See the best practices pdf link posted before for details.
Thank you. The document (and the related ones) is really useful.
 Signature Fabrizio Magni
fabrizio.magni@mycontinent.com
replace mycontinent with europe
Noons - 25 May 2005 10:23 GMT
> I had a look at the associated bug on metalink. It seems the limit of
> 32K involves only i/o on raw and OCFS of a specific redhat version.
Not sure. Its symptoms appear to show up in at least my RH test box and I'm not running raw or OCFS...
> As a test I took an old vanilla kernel (2.4.12) and performed the
> previous dd (always on ext3 and reiserfs). The limit is not there and
> the asynch i/o is doing its duty.
'sOK with SLES, AFAIK. (yay!)
> May I ask you the sources of your statement about a 2.4 "design limit"?
> (I'd like to invastigate this further: so far it smmes a myth).
The pdf document I pointed out plus the blurb in the RH bugzilla. I'd say it's similar to the old Unix limit on max IO size: used to be 64K, then some makers bumped it up to 1M.
> Thank you. The document (and the related ones) is really useful.
Pleasure. Please let me know of anything you find: running RH and this could have a big impact for me in the near future. Here or via email in the header. I'm going to find this particular module in the source and see if I can figger who/what runs when. That should show what's going on. Gotta load the RH sources first, across the bigpond: s-l-o-w...
hopehope_123 - 25 May 2005 15:42 GMT
Hi Friends,
Thank you very much for your replies. Fabrizio, it is good to hear from you again. I have run your tests here; these are the results:
1. Red Hat Linux Advanced Server IA64, OCFS file system on EMC RAID 10 + Fibre Channel, no async I/O, but direct I/O active
time dd if=/dev/zero of=/oracle/koccrm01/test bs=4096k count=100
100+0 records in
100+0 records out
real 0m5.964s
user 0m0.001s
sys 0m3.923s
Device:  rrqm/s   wrqm/s   r/s     w/s     rsec/s   wsec/s   avgrq-sz  avgqu-sz  await  svctm  %util
sdn      7699.00  7624.00  347.00  227.00  8046.00  7851.00  27.70     1.28      2.22   0.65   38.10
7851.00 wsec/s = 7851 * (512/1024) = 3925.5 KB/s (1 sector = 512 bytes)
3925.5 / 227.00 = 17.29 KB per write
2. Same server: Red Hat Linux Advanced Server IA64, this time on an ext3 file system
For AIO: /proc/sys/fs/aio_nr = 0, so it is not active.
time dd if=/dev/zero of=/oracle/stagetmp/tmp/test bs=4096k count=100
100+0 records in
100+0 records out
real 0m1.089s
user 0m0.002s
sys 0m1.071s
I can't get iostat output since the run finishes too fast, so I tried again with a larger file:
time dd if=/dev/zero of=/oracle/stagetmp/tmp/test bs=4096k count=1000
Device:  rrqm/s  wrqm/s    r/s   w/s     rsec/s  wsec/s     avgrq-sz  avgqu-sz    await   svctm  %util
sdl1     0.00    20951.00  1.00  958.00  8.00    175048.00  182.54    4190621.19  379.70  1.02   100.00
175048 wsec/s = 87524 KB/s; 87524 / 958 = 91 KB per write
3. Red Hat Linux x86, ext3 file system, AIO same as above (not active)
time dd if=/dev/zero of=/oracle/tolga/test bs=4096k count=100
100+0 records in
100+0 records out
3.97s real 0.00s user 2.23s system
Device:  rrqm/s  wrqm/s    r/s   w/s    rsec/s  wsec/s     avgrq-sz  avgqu-sz  await    svctm   %util
sdb5     0.00    22098.00  0.00  31.00  0.00    178408.00  5755.10   644.40    2183.87  116.13  36.00
178408 wsec/s = 178408 * (512/1024) = 89204 KB/s
89204 / 31 = 2878 KB per write (huh!)
4. Sun Solaris, UFS file system:
time dd if=/dev/zero of=/data/spss/test bs=4096k count=100
real 0m6.206s
user 0m0.000s
sys 0m2.830s
extended device statistics
device  r/s  w/s    kr/s  kw/s     wait  actv  svc_t  %w  %b
sd81    0.0  241.6  0.0   47195.2  29.7  11.9  172.2  24  74
47195.2 / 241.6 = 195 KB per write
These results are also interesting.
Chao, I tried the SQL statements with one of the nodes stopped; the duration did not change. I think this is also clear from the Statspack output, which shows lots of global cache cr request events but with a very low wait time.
Kind Regards, tolga
Noons - 26 May 2005 02:04 GMT
> 1. redhat linux advanced server ia64 , ocfs file system on emc
> raid10. + fibre channel ,
> no asyc , but direct_io active
I think I'm a bit in the blank here. How is direct_io active? By file system mount? If it is through Oracle, then a test of IO speed with "dd" in this case means you're writing to buffer cache, very very fast. Might as well use "hdparm -T <any raw device>" and see how fast you can write to buffer cache in one simple go?
Last time I looked, dd uses normal file system io if "of" is not a raw device. Which means we're merrily going through the buffer cache.
So what is the point of testing like this? Or am I missing something?
hopehope_123 - 26 May 2005 07:07 GMT
Hi Nuno,
Thank you very much for your correction. In fact, direct I/O is enabled by default for OCFS. But since dd uses normal file I/O, this test fails. (There are also versions of the dd and cp commands for Linux that use direct I/O.)
The point here is just to see whether the symptom you mention (the 32 KB I/O barrier) exists on my system.
Kind Regards, tolga
Noons - 26 May 2005 15:36 GMT
hopehope_123 apparently said, on my timestamp of 26/05/2005 4:07 PM:
> by default for the ocfs. But since dd uses normal file io, this test
> fails. ( There exists a version of dd , cp commands for linux which
> uses direct io also.)
No worries, now I get it. There is a howto on the Linux Documentation Project website that goes in detail to all the tools available for accurate testing. Worth a search for these bits of doco: use "IO Performance" and have a quiet read. The problem with raw disk IO (and hence any IO that uses raw disks as base, including f/s) for 2.4 kernels is described in detail in one of them. Fixed on 2.4.17 onwards with the vary-io patch which is also referred in the Oracle doco mentioned before.
I'm toying around with test suites based on dd, 1M-32K-8K-4K-2K IO size for both cooked (file system) and raw IO, similar to yours. Some very surprising results in my systems, with net-based raid as well as native disks, raw and ext3! Once I finish making sense of the results will pop the scripts here or on dizwell for others to try.
> the point here is just to see whether the symptom tou mentiones exists
> on my stsem . ( 32kb. io barrier)
Really hard to click into unless raw: the file system layer masks it all out for me. In fact, after 8k it makes bugger all difference what the IO size is if using ext3.
 Signature Cheers Nuno Souto in sunny Sydney, Australia wizofoz2k@yahoo.com.au.nospam
Fabrizio - 26 May 2005 20:32 GMT
> Hi Friends ,
>
> Thank you very much for your replies , Fabrizio it is good to hear from
> you again , i have made your tests here , these are the results:
Always around (lurking). I'm too busy fighting against clusters, fiber devices and SANs... and, of course, losing... :(
But I seem to be attracted by your post. ;)
Unfortunately I cannot add anything about your problem with the multiblock read count. Probably Noons is right and you are hitting an "I/O fragmentation" issue.
Instead I would be interested in the methodology you used for the results you posted. They appear... weird...
I'm going to describe the steps I followed to set up my test environment.
I chose a device where I was sure nobody was writing but me (you can check this by running iostat and querying /proc), so that my metrics would not be tainted.
Then, with two shells:
In the first I write from another device (a pseudo device, since it is /dev/zero).
In the other I probe the output device with the command:
iostat -x /dev/sda1 1
It gives me several lines at one-second intervals.
I check that all the lines are zeros (*do not use the first line of an iostat run, because it is the average since system boot*).
When I run the dd in the first shell (the parameters are chosen so that the dd lasts less than one second), I see only one line with non-zero values; that is the one I post and from which I calculate my I/O rate.
In my opinion iostat is not the right tool for precision measurements. You can always query /proc before and after the dd and calculate your own result.
> 3. redhat linux x86 , ext3 file system , aio is same with above (not
> active)
[quoted text clipped - 12 lines]
>
> 89240kb / 31 = 2878 kb. (huh!)
This appears too good to be true... :(
Could you try another set of measurements?
Thank you.
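The "query /proc before and after the dd" idea from this post might look like the following Python sketch. It assumes a kernel that exposes 2.6-style /proc/diskstats lines (the 2.4 kernels discussed in this thread keep similar counters in /proc/partitions with a different layout), and the two snapshot lines are made up purely for illustration.

```python
SECTOR_BYTES = 512  # assumption: the kernel counts 512-byte sectors


def parse_diskstats_line(line):
    """Extract the write counters from one 2.6-style /proc/diskstats line."""
    fields = line.split()
    return {
        "device": fields[2],
        "writes_completed": int(fields[7]),
        "writes_merged": int(fields[8]),
        "sectors_written": int(fields[9]),
    }


def avg_write_kb(before, after):
    """Average KB per write request between two counter snapshots."""
    writes = after["writes_completed"] - before["writes_completed"]
    sectors = after["sectors_written"] - before["sectors_written"]
    if writes == 0:
        return 0.0
    return sectors * SECTOR_BYTES / 1024 / writes


# Hypothetical snapshots taken just before and just after a dd run.
before_line = "   8    0 sda 1000 200 16000 500 2000 3000 40000 900 0 700 1400"
after_line = "   8    0 sda 1000 200 16000 500 2322 43636 121920 1500 0 900 2000"

before = parse_diskstats_line(before_line)
after = parse_diskstats_line(after_line)
print(f"{after['device']}: {avg_write_kb(before, after):.1f} KB per write")
```

Because the counters are cumulative, the difference between two snapshots covers exactly the interval of the test, which avoids the since-boot average and the one-second sampling windows of iostat.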
 Signature Fabrizio Magni
fabrizio.magni@mycontinent.com
replace mycontinent with europe
Noons - 27 May 2005 10:32 GMT
> In my opinion iostat is not the right tool for precision measurements.
> You can always go and query the /proc before and after the dd and
> calculate your own result.
Same problem here. I've quite taken to:
watch --interval 5 cat /proc/partitions
and then keep an eye on the relevant columns.
> This appears too good to be true... :(
Dunno: just got 66Mb/s sustained off the Xserve! ;)
hopehope_123 - 27 May 2005 12:38 GMT
Hi friends,
Fabrizio, thank you very much for your corrections. Here are more accurate results:
2. Same server: Red Hat Linux Advanced Server IA64, ext3 file system
[oracle@tanidw1 tmp]$ time dd if=/dev/zero of=/oracle/stagetmp/tmp/test bs=4096k count=5
5+0 records in
5+0 records out
real 0m0.063s
user 0m0.000s
sys 0m0.063s
Device:  rrqm/s  wrqm/s   r/s   w/s     rsec/s  wsec/s    avgrq-sz  avgqu-sz  await   svctm  %util
sdl1     0.00    4678.00  0.00  456.00  0.00    41072.00  90.07     91.11     199.79  0.89   41.60
41072.00 / 2 = 20536 KB/s; 20536 / 456 = 45 KB per write
4. Sun Solaris
device  r/s  w/s    kr/s  kw/s     wait  actv  svc_t  %w  %b
sd81    0.0  105.0  0.0   20483.3  4.8   3.2   76.7   11  30
195 KB per write
I have logged a TAR for the bug issue.
Kind Regards, tolga
hopehope_123 - 31 May 2005 15:23 GMT
Dear Friends,
I have logged a TAR for the bug issue, but Oracle says:
Oracle Bug 4039598 is an internal bug and it is for RH 3.0 x86 (2.4.21), while you are running RH 2.1 IA64 (2.4.18). The bug is not related to your system.
But I believe I have the symptoms. A little bit confused now.
tolga