Database Forum / General DB Topics / DB Theory / September 2008
sequential disk read speed
|
|
Thread rating:  |
Darren - 21 Aug 2008 01:36 GMT I am learning about database systems, and I am reading a book called "Physical Database Design".
It gets to a bit about a large sequential access (e.g. for a full table scan), and does the following:
It says "Since most disk systems use prefetch buffers to speed up table scans, we assume a 64 KB prefetch block"
So to calculate the time for a full table scan, it multiples the number of 64KB blocks by the time it takes to seek and read (2.02ms). In other words, it is seeking each 64KB block.
Why can a disk only read 64KB at a time? Is this a valid assumption? Is this a disk limitation or a file system limitation?
Thanks
Bob Badour - 21 Aug 2008 02:48 GMT > I am learning about database systems, and I am reading a book called > "Physical Database Design". [quoted text clipped - 14 lines] > > Thanks It's arbitrary. Some dbmses have a fixed block size; others allow one to configure it as a parameter. Other systems try to read an entire track or cylinder at a time. Whether the latter is feasible can change when available cylinder sizes grow at different rates from available memory.
I suspect the book is giving a hypothetical just to demonstrate the calculations involved.
David BL - 21 Aug 2008 08:08 GMT > I am learning about database systems, and I am reading a book called > "Physical Database Design". [quoted text clipped - 12 lines] > Why can a disk only read 64KB at a time? Is this a valid assumption? > Is this a disk limitation or a file system limitation? A high end modern HD with 4ms average seek will on average take about 7ms to access and an additional 0.5ms to read a randomly located 64k buffer. This mismatch shows that 64k blocks are too small for optimal read performance. 512k or 1Mb blocks would be more suitable.
Darren - 23 Aug 2008 18:26 GMT > > I am learning about database systems, and I am reading a book called > > "Physical Database Design". [quoted text clipped - 19 lines] > > - Show quoted text - But what dictates the block size? Is this defined by the physical disk, the file system, or the database code?
Brian Selzer - 24 Aug 2008 05:39 GMT > > > I am learning about database systems, and I am reading a book called > > > "Physical Database Design". [quoted text clipped - 23 lines] > But what dictates the block size? Is this defined by the physical > disk, the file system, or the database code? There can be a marginal reduction in cpu cycles and in some cases I/O by matching the block size of the disk subsystem to the block size specified in the database engine, but in any system where performance is critical, the disk subsystem will involve a caching disk array or multiple caching disk arrays, and the block size becomes moot as a result of the caching disk array controller reading an entire track at a time. Memory to memory transfers are negligible when compared to disk to memory transfers--let alone seeks. Moreover, caching controllers employ a number of technologies, such as elevator seeking, to minimize the impact of seek times, and by reading an entire track at a time, latency--that is, the time it takes for the data requested to arrive at the head for reading--is effectively eliminated as a factor. (There is still a negligible latency which is rougly half the time it takes for the head to pass over a sector, but as there are hundreds of sectors per track, the time involved is not worth considering.) Not to mention that a savvy dba will spread data access across as many heads as can be budgeted for to maximize throughput and minimize seek time. If you have a 100GB database and you put it on single 100GB disk drive, your best average seek time is the average seek time of the disk drive, but if you put the database on four 100GB disk drives, the the best average seek time will only be a fraction of the seek time of the single disk. Suppose that the full-stroke seek time on the 100GB disk is 7ms and the track-to-track seek time is 1ms. Well, with four disks, instead of an average 4ms seek time, the individual seek time of each disk is reduced to roughly 2.5ms, and since there are four disks, the average seek time for the disk subsystem is reduced to a quarter of that or roughly .625ms. Add a mirror (RAID 0+1 using 8 drives), and you introduce fault tolerance while at the same time halving again seek time for reads. Note that except in rare cases, the ratio of reads to writes in a database is better than 10:1, so any strategy that improves read time without significantly impeding write time, such as is the case with implementing RAID 0+1, should be vigorously pursued. Incidentally, RAID 0+1 is also good for transaction logs and temporary tables, which involve mostly writes or a roughly equal number of reads and writes, but it usually makes sense to segregate the logs from the database--for recovery if nothing else, and it also often makes sense to segregate the location where temporary tables are housed from both the database and the logs.
David BL - 28 Aug 2008 02:46 GMT > If you have a 100GB database and you put it on single > 100GB disk drive, your best average seek time is the average seek time of [quoted text clipped - 4 lines] > of an average 4ms seek time, the individual seek time of each disk is > reduced to roughly 2.5ms Is this because less of the disk is actually being used so on a given platter the head doesn't have such a large range of tracks to move over?
> , and since there are four disks, the average seek > time for the disk subsystem is reduced to a quarter of that or roughly > .625ms. In order for the effective seek time to be reduced to a quarter the seeking must be independent. To achieve that I think the striping would need to be very coarse (eg 512kb or 1Mb).
Brian Selzer - 28 Aug 2008 03:47 GMT >> If you have a 100GB database and you put it on single >> 100GB disk drive, your best average seek time is the average seek time of [quoted text clipped - 11 lines] > platter the head doesn't have such a large range of tracks to move > over? Yes. And the bit density is generally greater at the outside of the platter, so it generally takes fewer tracks to store the same information there as opposed to near the center; consequently, simply dividing the difference of the full-stroke seek and the track-to-track seek by four is a perhaps overly conservative method of estimation. I want to stress that this is not just a hair-brained theory of mine: I've had significant success using this mechanism to boost performance. In one application, by installing a disk that was seven times the size required and creating a partition on the outer edge of the disk, performance improved by over 6000%: batch processes that had been taking over 25 hours to complete were finishing in under 25 minutes.
>> , and since there are four disks, the average seek >> time for the disk subsystem is reduced to a quarter of that or roughly [quoted text clipped - 3 lines] > seeking must be independent. To achieve that I think the striping > would need to be very coarse (eg 512kb or 1Mb). Drives that support disconnection or some other command queueing mechanism are all that is needed for seeking to be independent.
I think using a coarse stripe is counterproductive. There would be a bigger chance that a seek in the middle of the read would be required. Consider: if 3.5 stripes fit on a track in one zone of the disk, then on average every fourth read would require an additional seek to get the remaining half stripe. If on the other hand, 28 stripes fit on a track, then no additional seeks would be necessary. Even if it were 28.5 stripes instead of 28, one additional seek for every 29 reads is a whole lot better than one for every 4.
David BL - 28 Aug 2008 05:32 GMT > >> If you have a 100GB database and you put it on single > >> 100GB disk drive, your best average seek time is the average seek time of [quoted text clipped - 23 lines] > batch processes that had been taking over 25 hours to complete were > finishing in under 25 minutes. How do you explain a 60 fold increase?
> >> , and since there are four disks, the average seek > >> time for the disk subsystem is reduced to a quarter of that or roughly [quoted text clipped - 6 lines] > Drives that support disconnection or some other command queueing mechanism > are all that is needed for seeking to be independent. If stripes are somewhat smaller than the DBMS block size, then every drive (in the RAID 0) will be involved in the reading of each and every DBMS block. No matter how you order those reads, each drive needs to read a large amount of scattered data and the head will seek around a lot. If that is the case then the only advantage arises from your previously mentioned reduction in the overall range of tracks over which the data resides on a given platter.
Alternatively if the stripe size is larger then each drive will read a somewhat independent set of the DBMS blocks, and the effective seek time can be reduced assuming the DBMS is able to issue overlapping read requests for the DBMS blocks.
> I think using a coarse stripe is counterproductive. There would be a bigger > chance that a seek in the middle of the read would be required. Consider: [quoted text clipped - 4 lines] > additional seek for every 29 reads is a whole lot better than one for every > 4. Firstly, hard-disks are quite good at stepping onto the next track in the manner normally used for very large "contiguous" reads or writes.
Secondly your analysis misses the point that coarser granularity stripes lead to fewer overall seeks, not more! Seeks per read is not a very useful stat.
The following
http://www.dba-oracle.com/oracle_tips_raid_usage.htm
discusses the ideal stripe size and the formulae indicate that ~1Mb would be appropriate for a modern disk.
Brian Selzer - 28 Aug 2008 14:34 GMT >> >> If you have a 100GB database and you put it on single >> >> 100GB disk drive, your best average seek time is the average seek time [quoted text clipped - 30 lines] > > How do you explain a 60 fold increase? Fewer and shorter seeks is my guess.
>> >> , and since there are four disks, the average seek >> >> time for the disk subsystem is reduced to a quarter of that or roughly [quoted text clipped - 20 lines] > time can be reduced assuming the DBMS is able to issue overlapping > read requests for the DBMS blocks. Your argument rests on the assumption that data is randomly distributed in the stripes on the disk and doesn't take into account the fact that a high-end caching controller eliminates latency by reading an entire track at once. Isn't it true that there is a physical affinity between related data? Isn't it more likely that an index will occupy contiguous stripes than some random set--regardless of stripe size? Can you show that the number of tracks accessed by say, 128 coarse stripe reads is any less than the number of tracks accessed by 1024 fine stripe reads?
>> I think using a coarse stripe is counterproductive. There would be a >> bigger [quoted text clipped - 13 lines] > Firstly, hard-disks are quite good at stepping onto the next track in > the manner normally used for very large "contiguous" reads or writes. The best track-to-track seek time I've seen is 0.2ms for reads, 0.4ms for writes. That's phenomenal but can still add up.
> Secondly your analysis misses the point that coarser granularity > stripes lead to fewer overall seeks, not more! Seeks per read is > not a very useful stat. Coarser granularity stripes lead to fewer overall reads, not necessarily fewer overall seeks--and not necessarily reduced overall seek time.
A finer granularity means more commands to be processed. More commands to be processed increases the likelyhood that the read of one track will satisfy more than one command. More commands to be processed increases the likelyhood that elevator seeking can be used to reduce overall seek time.
Coarser granularity stripes lead to fewer overall reads--not necessarily fewer overall seeks. In fact, it could lead to more overall seeks. Suppose, for example, that many of the stripes on disk are less than half populated with data--in much the same way that a FAT16 files system with a huge number of tiny files can fill up the disk even though the sum of the actual file sizes can be less than a quarter of the formatted capacity. Any seek that is needed in order to read the rest of a stripe when the rest of the stripe isn't populated with data would be unnecessary if a smaller stripe size were used. In much the same way, with a high-end processor, it is often possible to improve performance by setting the compressed attribute on a file. A compressed file typically occupies half the space of an uncompressed file, and with a high-end cpu, it can actually take less time to read and uncompress data than to read uncompressed data.
> The following > > http://www.dba-oracle.com/oracle_tips_raid_usage.htm > > discusses the ideal stripe size and the formulae indicate that ~1Mb > would be appropriate for a modern disk. I am not convinced, knowing what I know and have had experience with when it comes to storage subsystems. I would have to read the papers Mike Ault vaguely refers to.
David BL - 29 Aug 2008 04:43 GMT > >> "David BL" <davi...@iinet.net.au> wrote in message > [quoted text clipped - 33 lines] > tracks accessed by say, 128 coarse stripe reads is any less than the number > of tracks accessed by 1024 fine stripe reads? Yes, sometimes the DBMS manages to cluster all the necessary data so there is very little seeking required, and in that case it won’t matter what stripe size is used.
However, that is not always possible. For example consider a B+Tree on 1 billion records and in a short period of time the DBMS needs to read 100 records for given index values that are effectively at random with respect to the ordering on that data type. To keep it simple ignore the reading of the internal nodes of the B+Tree. Typically those 100 records will appear in roughly 100 different leaf nodes of the B+Tree. Furthermore due to the sheer size of the overall data those leaf nodes will tend to reside on different tracks. The unfortunate reality is that it isn’t possible to read these records without a lot of head seeking, even if the reads are ordered according to track position (ie elevator seeking). Now if RAID0 is used and the stripes are smaller that the B+Tree leaf nodes, then every drive will need to contribute to the reading of every leaf node. Each drive can read the stripes in any order it likes but it won’t avoid the fact that each drive performs ~100 seeks. If instead, each B+Tree leaf node resides in a single stripe (and therefore on a single drive) then with four drives in the RAID0, each drive will only need to perform ~25 seeks.
> >> I think using a coarse stripe is counterproductive. There would be a > >> bigger [quoted text clipped - 16 lines] > The best track-to-track seek time I've seen is 0.2ms for reads, 0.4ms for > writes. That's phenomenal but can still add up. It’s insignificant when reading or writing 1Mb at a time.
Brian Selzer - 29 Aug 2008 12:47 GMT >> >> "David BL" <davi...@iinet.net.au> wrote in message >> [quoted text clipped - 62 lines] >with four drives in the RAID0, each drive will only need to perform >~25 seeks. You're oversimplifying. With a stripe size of 64K, it is highly unlikely that a leaf node will span more than one stripe; therefore, it is highly unlikely for every drive to contribute to the reading of every leaf node. Also, you appear to be discounting concurrency, and environments where concurrency is important such as typical OLTP environments are where technologies such as elevator seeking are most effective.
By the way, Oracle documentation states that an 8K block size is optimal for most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 = 64K. Interestingly, Sql Server uses 8K pages organized into 64K extents, which happens to be the unit of physical storage allocation. Do you know something they don't?
>> >> I think using a coarse stripe is counterproductive. There would be a >> >> bigger [quoted text clipped - 18 lines] > >It’s insignificant when reading or writing 1Mb at a time. David BL - 29 Aug 2008 15:23 GMT > >> "David BL" <davi...@iinet.net.au> wrote in message > [quoted text clipped - 70 lines] > that a leaf node will span more than one stripe; therefore, it is highly > unlikely for every drive to contribute to the reading of every leaf node. I don't see how I'm oversimplifying.
My point is that stripes need to be at least as coarse as the DBMS block size. Do you agree?
The choice of DBMS block size is another question entirely.
> Also, you appear to be discounting concurrency, and environments where > concurrency is important such as typical OLTP environments are where > technologies such as elevator seeking are most effective. Concurrency has nothing to do with the fact that if the stripe size is too small the seeking of the drives won't be independent.
> By the way, Oracle documentation states that an 8K block size is optimal for > most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 = 64K. > Interestingly, Sql Server uses 8K pages organized into 64K extents, which > happens to be the unit of physical storage allocation. Do you know > something they don't? Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql Server 7.0 released in 1998. Do you expect that 64k extents are still optimal a decade later given that the product of transfer rate and seek time has been steadily increasing?
64k blocks are generally too small on modern disks. A 64k block can be transferred in a tenth of the time it takes to seek to it.
Brian Selzer - 30 Aug 2008 02:14 GMT [snip]
> > You're oversimplifying. With a stripe size of 64K, it is highly > > unlikely [quoted text clipped - 6 lines] > My point is that stripes need to be at least as coarse as the DBMS > block size. Do you agree? Yes, I think the stripe size should be a multiple of the DBMS block size.
> The choice of DBMS block size is another question entirely. > [quoted text clipped - 4 lines] > Concurrency has nothing to do with the fact that if the stripe size is > too small the seeking of the drives won't be independent. That wasn't why I brought up concurrency. You dismissed elevator seeking as an optimization mechanism with respect to the number of seeks required by focusing on what appeared to be a single query. Ultimately the same number of seeks will be required for a particular query, but when combined with ninety-nine other queries, some of those seeks can be shared with other queries, thus reducing the total number of seeks required to satisfy the hundred.
> > By the way, Oracle documentation states that an 8K block size is optimal > > for [quoted text clipped - 9 lines] > optimal a decade later given that the product of transfer rate and > seek time has been steadily increasing? Sql Server 2008 still uses 8K pages and 64K extents. Also, the Oracle documentation that cited an 8K block size as being optimal was for their latest version, 11g.
I think that 64K extents are still optimal because the technology employed for serializing updates is still locking, and in a concurrent environment with an escalating locking heirachy, a page size or extent size that is too large will cause transactions to block more often. Do you know of a sound and practicable alternative to an escalating locking heirarchy for serializing updates?
> 64k blocks are generally too small on modern disks. A 64k block can > be transferred in a tenth of the time it takes to seek to it. Why is that a problem? Isn't it more efficient if I can satisfy the same query by reading 100 64K blocks instead of 100 1M blocks?
David BL - 30 Aug 2008 06:43 GMT > [snip] > [quoted text clipped - 27 lines] > queries, thus reducing the total number of seeks required to satisfy the > hundred. I'm not sure how I implied that elevator seeking isn't worthwhile.
> > > By the way, Oracle documentation states that an 8K block size is optimal > > > for [quoted text clipped - 20 lines] > and practicable alternative to an escalating locking heirarchy for > serializing updates? I think these systems lock pages not extents, and anyway locking granularity can in principle be orthogonal to the unit of I/O or unit of allocation.
> > 64k blocks are generally too small on modern disks. A 64k block can > > be transferred in a tenth of the time it takes to seek to it. > > Why is that a problem? Isn't it more efficient if I can satisfy the same > query by reading 100 64K blocks instead of 100 1M blocks? Yes, but a DBMS will often be able to satisfy a given query by reading fewer blocks. For example full table scans are much more efficient with 1M blocks. Also, if you increase the block size by 10x then the height of a B+Tree will tend to smaller. For example a tree of height 3 may be able to index a trillion rather than only a billion records.
The advantages of a larger block size are more apparent in a database storing data where there is a greater tendency for locality based on affinity to be useful. For example, it would be rather silly to use 64k blocks to store multi-resolution terra pixel images.
Brian Selzer - 30 Aug 2008 20:32 GMT >> [snip] >> [quoted text clipped - 33 lines] > > I'm not sure how I implied that elevator seeking isn't worthwhile. You didn't, and I didn't say that you did. But your implication that it doesn't affect the number of seeks, while true, is an oversimplification in that it doesn't take into account that in a concurrent environment, many of those seeks can be shared by other queries.
>> > > By the way, Oracle documentation states that an 8K block size is >> > > optimal [quoted text clipped - 29 lines] > granularity can in principle be orthogonal to the unit of I/O or unit > of allocation. I don't see how since there is clearly a correlation between the size of each unit of I/O and the contention for what is on each unit of I/O.
>> > 64k blocks are generally too small on modern disks. A 64k block can >> > be transferred in a tenth of the time it takes to seek to it. [quoted text clipped - 7 lines] > height of a B+Tree will tend to smaller. For example a tree of height > 3 may be able to index a trillion rather than only a billion records. I think that if the disk subsystem is sophisticated enough, the performance benefit of an increased block size is lost. For example, if the controller caches entire tracks to eliminate latency, then a larger block size would not improve performance one bit--in fact, it would tend to reduce it because of the vastly increased volume of data that must be transferred from the cache to RAM. In the example above, you would have to move more than 16 times as much data from the cache to RAM to answer the same query.
> The advantages of a larger block size are more apparent in a database > storing data where there is a greater tendency for locality based on > affinity to be useful. For example, it would be rather silly to use > 64k blocks to store multi-resolution terra pixel images. It would be equally silly to physically store multi-resolution terra pixel images alongside scalar data that can be joined or restricted on. If necessary, place the image heap on a separate disk subsystem with a separate stripe size and depth, but store the scalar data on disk subsystem with a stripe size optimal for computing joins and restrictions on it.
David BL - 01 Sep 2008 04:09 GMT > >> "David BL" <davi...@iinet.net.au> wrote in message > [quoted text clipped - 42 lines] > that it doesn't take into account that in a concurrent environment, many of > those seeks can be shared by other queries. If seeks have a reasonable probability of being shared, doesn’t that necessarily mean the size of memory is approaching the size of secondary storage, and a large read cache will achieve the same effect?
I’ve heard of techniques like buffer trees that take buffering of reads and writes to an extreme level, allowing a system to get much closer to the theoretical optimum for I/O. However such techniques don’t seem compatible with OLTP where transactions are using strict 2PL and need to be completed quickly in order release locks. I can see that MVCC will provide far more opportunity for long running read only transactions to share reads (and seeks), but I doubt whether it would allow much sharing of reads when the size of memory is only a tiny fraction of the total size of the database.
> >> > > By the way, Oracle documentation states that an 8K block size is > >> > > optimal [quoted text clipped - 32 lines] > I don't see how since there is clearly a correlation between the size of > each unit of I/O and the contention for what is on each unit of I/O. Locking doesn’t even need to be page based. Are you aware of the distinction between latches and locks?
> >> > 64k blocks are generally too small on modern disks. A 64k block can > >> > be transferred in a tenth of the time it takes to seek to it. [quoted text clipped - 15 lines] > cache to RAM. In the example above, you would have to move more than 16 > times as much data from the cache to RAM to answer the same query. I agree that if the disk subsystem caches a track then a smaller unit of I/O can be desirable.
I think we’re talking past each other because you’re associating block with unit of I/O whereas I’m associating it more with the unit of allocation. As I see it, in the above you’re only comparing relatively small and uninteresting differences in read performance for a given allocation of data to sectors and tracks.
Changing the unit of allocation has an enormous impact on how data is allocated by the DBMS. This in turn can have a huge impact on the number of seeks.
With a very small unit of allocation, on a small contiguous area of the disk there can be allocations for many unrelated tables written by many unrelated transactions. In theory the sweet spot for the unit of allocation occurs when the time to seek is similar in magnitude to the time to transfer.
One can imagine more sophisticated allocators – eg that reserve large areas of the disk in ways that promote better clustering of related data. One can even imagine that multiple independent allocators (ie heaps) should be used within the DBMS. However the same effect can be achieved more simply and with less wastage of space by using a single allocator for the entire DBMS from which reasonably large blocks are always allocated. This can be compared to an in-memory programming environment that employs a single heap allocator for all threads, and wherever that would lead to high latency due to poor localisation of very small allocations, the programmer instead allocates a somewhat larger block and breaks it up into smaller pieces to meet the requirements of the smaller allocations.
I cannot see any good reason why a DBMS would skimp on the unit of allocation (eg 64k), particularly when it can be coarser than the unit of I/O (eg 8k). I wonder whether legacy of code base or backwards compatibility is rearing its ugly head?
> > The advantages of a larger block size are more apparent in a database > > storing data where there is a greater tendency for locality based on [quoted text clipped - 6 lines] > stripe size and depth, but store the scalar data on disk subsystem with a > stripe size optimal for computing joins and restrictions on it. I agree that that the optimal unit of I/O depends on the nature of the data and its usage patterns. I don’t think multiple heaps is generally necessary.
Brian Selzer - 01 Sep 2008 14:00 GMT [snip]
> > > I'm not sure how I implied that elevator seeking isn't worthwhile. > > [quoted text clipped - 19 lines] > would allow much sharing of reads when the size of memory is only a > tiny fraction of the total size of the database. I think you would be surprised. In a typical OLTP database, only a tiny fraction of the total size of the database is being manipulated. Most accesses involve what happened yesterday and what is happening today. Most of the remainder involve what is happening this month and possibly what happened last month. For example, in a manufacturing system, those manufacturing orders that are currently running in the plant are accessed most often. Those orders that ran yesterday are the next most often queried, followed by those that have run so far this month and then last month. Only an occasional query will look at older information--perhaps looking at the last time a particular part was produced, for example.
A bigger block size will cause a lot of information to be transferred to memory that isn't required to answer any queries.
[snip]
> > > I think these systems lock pages not extents, and anyway locking > > > granularity can in principle be orthogonal to the unit of I/O or unit [quoted text clipped - 5 lines] > Locking doesn’t even need to be page based. Are you aware of the > distinction between latches and locks? Yes, I am. A latch guarantees the integrity of a row during the reading of that row.
The overhead required for row-level locking can drastically reduce performance. For an update that touches a few rows, row-level locking is optimal. For an update that touches several thousand rows, locking pages may make more sense. For an update that touches millions of rows, extent locks or even table locks are more efficient. Another factor is the unit of I/O for the transaction logs. If the transaction log contains the pages that are different instead of just the rows that are different, then pages must be locked along with any rows, because if one transaction changes one row on a page, and another transaction changes another row on the same page, and if one of those transactions is rolled back, then the rollback could overwrite the change made by the transaction that will ultimately commit, or if both roll back, the database could end up having one of the changes recorded. It gets a lot more complicated when there are a lot of index pages recorded in the log as well as just rows, because indexes must also be locked, or at least invalidated until the updates propogate to them.
> > >> > 64k blocks are generally too small on modern disks. A 64k block > > >> > can [quoted text clipped - 40 lines] > allocation occurs when the time to seek is similar in magnitude to the > time to transfer. Interestingly, on a high-end 15k drive, revolutions take 4ms, and the average seek time is 2.9ms. Reading an entire track, therefore, should take at most 4ms. Based on comparisons of the specs for different size drives of the same series, I don't think more than one head is being read at the same time, but if they implemented that, it would be possible to read an entire cylinder in the time it takes to read a track, 4ms. But then, it might be cheaper to just add more drives.
> One can imagine more sophisticated allocators – eg that reserve large > areas of the disk in ways that promote better clustering of related [quoted text clipped - 8 lines] > allocates a somewhat larger block and breaks it up into smaller pieces > to meet the requirements of the smaller allocations. It's a trade off: bigger allocation units would increase the throughput for scans but would reduce the throughput for seeks.
> I cannot see any good reason why a DBMS would skimp on the unit of > allocation (eg 64k), particularly when it can be coarser than the unit > of I/O (eg 8k). I wonder whether legacy of code base or backwards > compatibility is rearing its ugly head? Possibly. I would think, however, that given the competitiveness between Oracle and Microsoft, that if more speed could be achieved by increasing the allocation unit, then one of them would have done it. They may also be anticipating the development of lower cost solid-state drives, where seek times and latency disappear, and there a larger block size would be a disadvantage.
> > > The advantages of a larger block size are more apparent in a database > > > storing data where there is a greater tendency for locality based on [quoted text clipped - 13 lines] > data and its usage patterns. I don’t think multiple heaps is > generally necessary. David BL - 25 Aug 2008 03:29 GMT > > > I am learning about database systems, and I am reading a book called > > > "Physical Database Design". [quoted text clipped - 20 lines] > But what dictates the block size? Is this defined by the physical > disk, the file system, or the database code? As far as a physical disk is concerned the term “block” was commonly used to refer to the intersection of a sector and a track, but now days “sector” tends to be used instead. It is the smallest unit of reading/writing and is often 512 bytes. Some disks use 1024 byte sectors.
A file system provides buffering, and that allows an application to seek within a file and read/write a single byte. However behind the scenes an entire sector must be read/written to disk.
Typically a DBMS will read/write the disk without file buffering provided by the OS. For example on Win32, the function CreateFile can take the parameter FILE_FLAG_NO_BUFFERING. This forces the DBMS to work at the granularity of sectors – and it’s fairly low level. Eg there is a requirement that memory buffers be aligned on 512 byte boundaries to comply with the DMA constraints.
To avoid excessive seeking the DBMS will tend to organise the store into much courser units that are typically called “blocks”. The block size is up to the DBMS, but in practise will always be a multiple of a sector. In some cases it may relate back to track or cylinder boundaries, but that constraint is not imposed by the disk controller (which will happily allow for random access to any sector).
Tim X - 21 Aug 2008 10:29 GMT > I am learning about database systems, and I am reading a book called > "Physical Database Design". [quoted text clipped - 12 lines] > Why can a disk only read 64KB at a time? Is this a valid assumption? > Is this a disk limitation or a file system limitation? This is a valid 'assumption'. It does not mean that this is all any disk can do. To make their point, the authors have to pick some value and 64k is as good as any other. In reality, it will depend on the hardware, the way data is written to the disk, the speed of the host, bus, type of data transfer technology, what the system is optimised for etc etc. Things actually get even more complex because most DBMS do a certain amount of their own caching as well. What the authors are attempting to do is provide a clear abstract explination that doesn't get overly complex. this is also the reason why whenever it comes to working out performance and tuning the system (at all levels) it is essential to use available tools and why most large enterprise level databases usually have tools to assist in calculating this type of thing.
I'm not familiar with the book in question, but I suspect they are about to explain how things like record sizes, available indexes etc can impact on performance and possibly show why often held belief that indexes always make things faster can be misleading or just plain wrong. If its a good book, it will emphasise the importance on gathering hard figures and stats in order to optimise performance and how dangerous 'rules' regarding optimisation and performance can be. It is partially due to the complexities and variations involved that you don't get many databases that can successfully tune for performance automatically.
Tim
 Signature tcross (at) rapttech dot com dot au
-CELKO- - 22 Aug 2008 15:13 GMT >> Why can a disk only read 64KB at a time? Is this a valid assumption? Is this a disk limitation or a file system limitation? << The author has to provide some numbers to show how to calculate an estimation for disk access. Frankly, 64KB seems a little small for a modern computer other than a desktop machine.
What you might consider is the rise of solid state storage, which will start replacing moving disk hardware in the next few years. This with multi-core processors will change database design radically. We work in a trade where everything you know is wrong in five years :)
David BL - 23 Aug 2008 07:04 GMT > >> Why can a disk only read 64KB at a time? Is this a valid assumption? Is this a disk limitation or a file system limitation? << > > The author has to provide some numbers to show how to calculate an > estimation for disk access. Frankly, 64KB seems a little small for a > modern computer other than a desktop machine. Also on a desktop, 64k is too small. A desktop HD has higher seek +rotational delay and lower transfer rate giving about the same product as for an enterprise HD.
> What you might consider is the rise of solid state storage, which will > start replacing moving disk hardware in the next few years. This with > multi-core processors will change database design radically. We work > in a trade where everything you know is wrong in five years :) I wonder whether it will be less radical than might at first be imagined. CPU caches lead to significant variation in memory access times.
I few years ago I wrote a transient B+Tree and compared the performance to the STL map (a red black tree) that ships with MS Visual C++. I ran tests involving inserting a million randomly generated keys on a map keyed by 32 bit integers. The B+Tree was twice as fast at insertions and deletions, 35% faster at look up, and 10 times faster at iteration through the elements.
DBMS_Plumber - 05 Sep 2008 21:41 GMT > >> Why can a disk only read 64KB at a time? Is this a valid assumption? Is this a disk limitation or a file system limitation? << ...
> What you might consider is the rise of solid state storage, which will > start replacing moving disk hardware in the next few years. This with > multi-core processors will change database design radically. We work > in a trade where everything you know is wrong in five years :) You know, though? I've been hearing about the immanent arrival of solid state storage for about 10 years now. As fast as they get their price/performance point down, the disk boys/girls drop theirs too.
Mebbe this time things will be different ....
|
|
|