Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General DB Topics / DB Theory / September 2008

Tip: Looking for answers? Try searching our database.

sequential disk read speed

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Darren - 21 Aug 2008 01:36 GMT
I am learning about database systems, and I am reading a book called
"Physical Database Design".

It gets to a bit about a large sequential access (e.g. for a full
table scan), and does the following:

It says "Since most disk systems use prefetch buffers to speed up
table scans, we
assume a 64 KB prefetch block"

So to calculate the time for a full table scan, it multiples the
number of 64KB blocks by the time it takes to seek and read (2.02ms).
In other words, it is seeking each 64KB block.

Why can a disk only read 64KB at a time? Is this a valid assumption?
Is this a disk limitation or a file system limitation?

Thanks
Bob Badour - 21 Aug 2008 02:48 GMT
> I am learning about database systems, and I am reading a book called
> "Physical Database Design".
[quoted text clipped - 14 lines]
>
> Thanks

It's arbitrary. Some dbmses have a fixed block size; others allow one to
configure it as a parameter. Other systems try to read an entire track
or cylinder at a time. Whether the latter is feasible can change when
available cylinder sizes grow at different rates from available memory.

I suspect the book is giving a hypothetical just to demonstrate the
calculations involved.
David BL - 21 Aug 2008 08:08 GMT
> I am learning about database systems, and I am reading a book called
> "Physical Database Design".
[quoted text clipped - 12 lines]
> Why can a disk only read 64KB at a time? Is this a valid assumption?
> Is this a disk limitation or a file system limitation?

A high end modern HD with 4ms average seek will on average take about
7ms to access and an additional 0.5ms to read a randomly located 64k
buffer.   This mismatch shows that 64k blocks are too small for
optimal read performance.   512k or 1Mb blocks would be more suitable.
Darren - 23 Aug 2008 18:26 GMT
> > I am learning about database systems, and I am reading a book called
> > "Physical Database Design".
[quoted text clipped - 19 lines]
>
> - Show quoted text -

But what dictates the block size? Is this defined by the physical
disk, the file system, or the database code?
Brian Selzer - 24 Aug 2008 05:39 GMT
> > > I am learning about database systems, and I am reading a book called
> > > "Physical Database Design".
[quoted text clipped - 23 lines]
> But what dictates the block size? Is this defined by the physical
> disk, the file system, or the database code?

There can be a marginal reduction in cpu cycles and in some cases I/O by
matching the block size of the disk subsystem to the block size specified in
the database engine, but in any system where performance is critical, the
disk subsystem will involve a caching disk array or multiple caching disk
arrays, and the block size becomes moot as a result of the caching disk
array controller reading an entire track at a time.  Memory to memory
transfers are negligible when compared to disk to memory transfers--let
alone seeks.  Moreover, caching controllers employ a number of technologies,
such as elevator seeking, to minimize the impact of seek times, and by
reading an entire track at a time, latency--that is, the time it takes for
the data requested to arrive at the head for reading--is effectively
eliminated as a factor.  (There is still a negligible latency which is
rougly half the time it takes for the head to pass over a sector, but as
there are hundreds of sectors per track, the time involved is not worth
considering.)  Not to mention that a savvy dba will spread data access
across as many heads as can be budgeted for to maximize throughput and
minimize seek time.  If you have a 100GB database and you put it on single
100GB disk drive, your best average seek time is the average seek time of
the disk drive, but if you put the database on four 100GB disk drives, the
the best average seek time will only be a fraction of the seek time of the
single disk.  Suppose that the full-stroke seek time on the 100GB disk is
7ms and the track-to-track seek time is 1ms.  Well, with four disks, instead
of an average 4ms seek time, the individual seek time of each disk is
reduced to roughly 2.5ms, and since there are four disks, the average seek
time for the disk subsystem is reduced to a quarter of that or roughly
.625ms.  Add a mirror (RAID 0+1 using 8 drives), and you introduce fault
tolerance while at the same time halving again seek time for reads.  Note
that except in rare cases, the ratio of reads to writes in a database is
better than 10:1, so any strategy that improves read time without
significantly impeding write time, such as is the case with implementing
RAID 0+1, should be vigorously pursued.  Incidentally, RAID 0+1 is also good
for transaction logs and temporary tables, which involve mostly writes or a
roughly equal number of reads and writes, but it usually makes sense to
segregate the logs from the database--for recovery if nothing else, and it
also often makes sense to segregate the location where temporary tables are
housed from both the database and the logs.
David BL - 28 Aug 2008 02:46 GMT
> If you have a 100GB database and you put it on single
> 100GB disk drive, your best average seek time is the average seek time of
[quoted text clipped - 4 lines]
> of an average 4ms seek time, the individual seek time of each disk is
> reduced to roughly 2.5ms

Is this because less of the disk is actually being used so on a given
platter the head doesn't have such a large range of tracks to move
over?

> , and since there are four disks, the average seek
> time for the disk subsystem is reduced to a quarter of that or roughly
> .625ms.

In order for the effective seek time to be reduced to a quarter the
seeking must be independent.  To achieve that I think the striping
would need to be very coarse (eg 512kb or 1Mb).
Brian Selzer - 28 Aug 2008 03:47 GMT
>> If you have a 100GB database and you put it on single
>> 100GB disk drive, your best average seek time is the average seek time of
[quoted text clipped - 11 lines]
> platter the head doesn't have such a large range of tracks to move
> over?

Yes.  And the bit density is generally greater at the outside of the
platter, so it generally takes fewer tracks to store the same information
there as opposed to near the center; consequently, simply dividing the
difference of the full-stroke seek and the track-to-track seek by four is a
perhaps overly conservative method of estimation.  I want to stress that
this is not just a hair-brained theory of mine: I've had significant success
using this mechanism to boost performance.  In one application, by
installing a disk that was seven times the size required and creating a
partition on the outer edge of the disk, performance improved by over 6000%:
batch processes that had been taking over 25 hours to complete were
finishing in under 25 minutes.

>> , and since there are four disks, the average seek
>> time for the disk subsystem is reduced to a quarter of that or roughly
[quoted text clipped - 3 lines]
> seeking must be independent.  To achieve that I think the striping
> would need to be very coarse (eg 512kb or 1Mb).

Drives that support disconnection or some other command queueing mechanism
are all that is needed for seeking to be independent.

I think using a coarse stripe is counterproductive.  There would be a bigger
chance that a seek in the middle of the read would be required.  Consider:
if 3.5 stripes fit on a track in one zone of the disk, then on average every
fourth read would require an additional seek to get the remaining half
stripe.  If on the other hand, 28 stripes fit on a track, then no additional
seeks would be necessary.  Even if it were 28.5 stripes instead of 28, one
additional seek for every 29 reads is a whole lot better than one for every
4.
David BL - 28 Aug 2008 05:32 GMT
> >> If you have a 100GB database and you put it on single
> >> 100GB disk drive, your best average seek time is the average seek time of
[quoted text clipped - 23 lines]
> batch processes that had been taking over 25 hours to complete were
> finishing in under 25 minutes.

How do you explain a 60 fold increase?

> >> , and since there are four disks, the average seek
> >> time for the disk subsystem is reduced to a quarter of that or roughly
[quoted text clipped - 6 lines]
> Drives that support disconnection or some other command queueing mechanism
> are all that is needed for seeking to be independent.

If stripes are somewhat smaller than the DBMS block size, then every
drive (in the RAID 0) will be involved in the reading of each and
every DBMS block.  No matter how you order those reads, each drive
needs to read a large amount of scattered data and the head will seek
around a lot.   If that is the case then the only advantage arises
from your previously mentioned reduction in the overall range of
tracks over which the data resides on a given platter.

Alternatively if the stripe size is larger then each drive will read a
somewhat independent set of the DBMS blocks, and the effective seek
time can be reduced assuming the DBMS is able to issue overlapping
read requests for the DBMS blocks.

> I think using a coarse stripe is counterproductive.  There would be a bigger
> chance that a seek in the middle of the read would be required.  Consider:
[quoted text clipped - 4 lines]
> additional seek for every 29 reads is a whole lot better than one for every
> 4.

Firstly, hard-disks are quite good at stepping onto the next track in
the manner normally used for very large "contiguous" reads or writes.

Secondly your analysis misses the point that coarser granularity
stripes lead to fewer overall seeks, not more!    Seeks per read is
not a very useful stat.

The following

   http://www.dba-oracle.com/oracle_tips_raid_usage.htm

discusses the ideal stripe size and the formulae indicate that ~1Mb
would be appropriate for a modern disk.
Brian Selzer - 28 Aug 2008 14:34 GMT
>> >> If you have a 100GB database and you put it on single
>> >> 100GB disk drive, your best average seek time is the average seek time
[quoted text clipped - 30 lines]
>
> How do you explain a 60 fold increase?

Fewer and shorter seeks is my guess.

>> >> , and since there are four disks, the average seek
>> >> time for the disk subsystem is reduced to a quarter of that or roughly
[quoted text clipped - 20 lines]
> time can be reduced assuming the DBMS is able to issue overlapping
> read requests for the DBMS blocks.

Your argument rests on the assumption that data is randomly distributed in
the stripes on the disk and doesn't take into account the fact that a
high-end caching controller eliminates latency by reading an entire track at
once.  Isn't it true that there is a physical affinity between related data?
Isn't it more likely that an index will occupy contiguous stripes than some
random set--regardless of stripe size?  Can you show that the number of
tracks accessed by say, 128 coarse stripe reads is any less than the number
of tracks accessed by 1024 fine stripe reads?

>> I think using a coarse stripe is counterproductive.  There would be a
>> bigger
[quoted text clipped - 13 lines]
> Firstly, hard-disks are quite good at stepping onto the next track in
> the manner normally used for very large "contiguous" reads or writes.

The best track-to-track seek time I've seen is 0.2ms for reads, 0.4ms for
writes.  That's phenomenal but can still add up.

> Secondly your analysis misses the point that coarser granularity
> stripes lead to fewer overall seeks, not more!    Seeks per read is
> not a very useful stat.

Coarser granularity stripes lead to fewer overall reads, not necessarily
fewer overall seeks--and not necessarily reduced overall seek time.

A finer granularity means more commands to be processed.
More commands to be processed increases the likelyhood that the read of one
track will satisfy more than one command.
More commands to be processed increases the likelyhood that elevator seeking
can be used to reduce overall seek time.

Coarser granularity stripes lead to fewer overall reads--not necessarily
fewer overall seeks.  In fact, it could lead to more overall seeks.
Suppose, for example, that many of the stripes on disk are less than half
populated with data--in much the same way that a FAT16 files system with a
huge number of tiny files can fill up the disk even though the sum of the
actual file sizes can be less than a quarter of the formatted capacity.  Any
seek that is needed in order to read the rest of a stripe when the rest of
the stripe isn't populated with data would be unnecessary if a smaller
stripe size were used.  In much the same way, with a high-end processor, it
is often possible to improve performance by setting the compressed attribute
on a file.  A compressed file typically occupies half the space of an
uncompressed file, and with a high-end cpu, it can actually take less time
to read and uncompress data than to read uncompressed data.

> The following
>
>    http://www.dba-oracle.com/oracle_tips_raid_usage.htm
>
> discusses the ideal stripe size and the formulae indicate that ~1Mb
> would be appropriate for a modern disk.

I am not convinced, knowing what I know and have had experience with when it
comes to storage subsystems.  I would have to read the papers Mike Ault
vaguely refers to.
David BL - 29 Aug 2008 04:43 GMT
> >> "David BL" <davi...@iinet.net.au> wrote in message
>
[quoted text clipped - 33 lines]
> tracks accessed by say, 128 coarse stripe reads is any less than the number
> of tracks accessed by 1024 fine stripe reads?

Yes, sometimes the DBMS manages to cluster all the necessary data so
there is very little seeking required, and in that case it won’t
matter what stripe size is used.

However, that is not always possible.  For example consider a B+Tree
on 1 billion records and in a short period of time the DBMS needs to
read 100 records for given index values that are effectively at random
with respect to the ordering on that data type.   To keep it simple
ignore the reading of the internal nodes of the B+Tree.  Typically
those 100 records will appear in roughly 100 different leaf nodes of
the B+Tree.  Furthermore due to the sheer size of the overall data
those leaf nodes will tend to reside on different tracks.   The
unfortunate reality is that it isn’t possible to read these records
without a lot of head seeking, even if the reads are ordered according
to track position (ie elevator seeking). Now if RAID0 is used and the
stripes are smaller that the B+Tree leaf nodes, then every drive will
need to contribute to the reading of every leaf node.  Each drive can
read the stripes in any order it likes but it won’t avoid the fact
that each drive performs ~100 seeks.   If instead, each B+Tree leaf
node resides in a single stripe (and therefore on a single drive) then
with four drives in the RAID0, each drive will only need to perform
~25 seeks.

> >> I think using a coarse stripe is counterproductive.  There would be a
> >> bigger
[quoted text clipped - 16 lines]
> The best track-to-track seek time I've seen is 0.2ms for reads, 0.4ms for
> writes.  That's phenomenal but can still add up.

It’s insignificant when reading or writing 1Mb at a time.
Brian Selzer - 29 Aug 2008 12:47 GMT
>> >> "David BL" <davi...@iinet.net.au> wrote in message
>>
[quoted text clipped - 62 lines]
>with four drives in the RAID0, each drive will only need to perform
>~25 seeks.

You're oversimplifying.  With a stripe size of 64K, it is highly unlikely
that a leaf node will span more than one stripe; therefore, it is highly
unlikely for every drive to contribute to the reading of every leaf node.
Also, you appear to be discounting concurrency, and environments where
concurrency is important such as typical OLTP environments are where
technologies such as elevator seeking are most effective.

By the way, Oracle documentation states that an 8K block size is optimal for
most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8.  8K * 8 = 64K.
Interestingly, Sql Server uses 8K pages organized into 64K extents, which
happens to be the unit of physical storage allocation.  Do you know
something they don't?

>> >> I think using a coarse stripe is counterproductive.  There would be a
>> >> bigger
[quoted text clipped - 18 lines]
>
>It’s insignificant when reading or writing 1Mb at a time.
David BL - 29 Aug 2008 15:23 GMT
> >> "David BL" <davi...@iinet.net.au> wrote in message
>
[quoted text clipped - 70 lines]
> that a leaf node will span more than one stripe; therefore, it is highly
> unlikely for every drive to contribute to the reading of every leaf node.

I don't see how I'm oversimplifying.

My point is that stripes need to be at least as coarse as the DBMS
block size.  Do you agree?

The choice of DBMS block size is another question entirely.

> Also, you appear to be discounting concurrency, and environments where
> concurrency is important such as typical OLTP environments are where
> technologies such as elevator seeking are most effective.

Concurrency has nothing to do with the fact that if the stripe size is
too small the seeking of the drives won't be independent.

> By the way, Oracle documentation states that an 8K block size is optimal for
> most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8.  8K * 8 = 64K.
> Interestingly, Sql Server uses 8K pages organized into 64K extents, which
> happens to be the unit of physical storage allocation.  Do you know
> something they don't?

Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql
Server 7.0 released in 1998.  Do you expect that 64k extents are still
optimal a decade later given that the product of transfer rate and
seek time has been steadily increasing?

64k blocks are generally too small on modern disks.  A 64k block can
be transferred in a tenth of the time it takes to seek to it.
Brian Selzer - 30 Aug 2008 02:14 GMT
[snip]

> > You're oversimplifying.  With a stripe size of 64K, it is highly
> > unlikely
[quoted text clipped - 6 lines]
> My point is that stripes need to be at least as coarse as the DBMS
> block size.  Do you agree?

Yes, I think the stripe size should be a multiple of the DBMS block size.

> The choice of DBMS block size is another question entirely.
>
[quoted text clipped - 4 lines]
> Concurrency has nothing to do with the fact that if the stripe size is
> too small the seeking of the drives won't be independent.

That wasn't why I brought up concurrency.  You dismissed elevator seeking as
an optimization mechanism with respect to the number of seeks required by
focusing on what appeared to be a single query.  Ultimately the same number
of seeks will be required for a particular query, but when combined with
ninety-nine other queries, some of those seeks can be shared with other
queries, thus reducing the total number of seeks required to satisfy the
hundred.

> > By the way, Oracle documentation states that an 8K block size is optimal
> > for
[quoted text clipped - 9 lines]
> optimal a decade later given that the product of transfer rate and
> seek time has been steadily increasing?

Sql Server 2008 still uses 8K pages and 64K extents.  Also, the Oracle
documentation that cited an 8K block size as being optimal was for their
latest version, 11g.

I think that 64K extents are still optimal because the technology employed
for serializing updates is still locking, and in a concurrent environment
with an escalating locking heirachy, a page size or extent size that is too
large will cause transactions to block more often.  Do you know of a sound
and practicable alternative to an escalating locking heirarchy for
serializing updates?

> 64k blocks are generally too small on modern disks.  A 64k block can
> be transferred in a tenth of the time it takes to seek to it.

Why is that a problem?  Isn't it more efficient if I can satisfy the same
query by reading 100 64K blocks instead of 100 1M blocks?
David BL - 30 Aug 2008 06:43 GMT
> [snip]
>
[quoted text clipped - 27 lines]
> queries, thus reducing the total number of seeks required to satisfy the
> hundred.

I'm not sure how I implied that elevator seeking isn't worthwhile.

> > > By the way, Oracle documentation states that an 8K block size is optimal
> > > for
[quoted text clipped - 20 lines]
> and practicable alternative to an escalating locking heirarchy for
> serializing updates?

I think these systems lock pages not extents, and anyway locking
granularity can in principle be orthogonal to the unit of I/O or unit
of allocation.

> > 64k blocks are generally too small on modern disks.  A 64k block can
> > be transferred in a tenth of the time it takes to seek to it.
>
> Why is that a problem?  Isn't it more efficient if I can satisfy the same
> query by reading 100 64K blocks instead of 100 1M blocks?

Yes, but a DBMS will often be able to satisfy a given query by reading
fewer blocks.   For example full table scans are much more efficient
with 1M blocks.  Also, if you increase the block size by 10x then the
height of a B+Tree will tend to smaller.  For example a tree of height
3 may be able to index a trillion rather than only a billion records.

The advantages of a larger block size are more apparent in a database
storing data where there is a greater tendency for locality based on
affinity to be useful.  For example, it would be rather silly to use
64k blocks to store multi-resolution terra pixel images.
Brian Selzer - 30 Aug 2008 20:32 GMT
>> [snip]
>>
[quoted text clipped - 33 lines]
>
> I'm not sure how I implied that elevator seeking isn't worthwhile.

You didn't, and I didn't say that you did.  But your implication that it
doesn't affect the number of seeks, while true, is an oversimplification in
that it doesn't take into account that in a concurrent environment, many of
those seeks can be shared by other queries.

>> > > By the way, Oracle documentation states that an 8K block size is
>> > > optimal
[quoted text clipped - 29 lines]
> granularity can in principle be orthogonal to the unit of I/O or unit
> of allocation.

I don't see how since there is clearly a correlation between the size of
each unit of I/O and the contention for what is on each unit of I/O.

>> > 64k blocks are generally too small on modern disks.  A 64k block can
>> > be transferred in a tenth of the time it takes to seek to it.
[quoted text clipped - 7 lines]
> height of a B+Tree will tend to smaller.  For example a tree of height
> 3 may be able to index a trillion rather than only a billion records.

I think that if the disk subsystem is sophisticated enough, the performance
benefit of an increased block size is lost.  For example, if the controller
caches entire tracks to eliminate latency, then a larger block size would
not improve performance one bit--in fact, it would tend to reduce it because
of the vastly increased volume of data that must be transferred from the
cache to RAM.  In the example above, you would have to move more than 16
times as much data from the cache to RAM to answer the same query.

> The advantages of a larger block size are more apparent in a database
> storing data where there is a greater tendency for locality based on
> affinity to be useful.  For example, it would be rather silly to use
> 64k blocks to store multi-resolution terra pixel images.

It would be equally silly to physically store multi-resolution terra pixel
images alongside scalar data that can be joined or restricted on.  If
necessary, place the image heap on a separate disk subsystem with a separate
stripe size and depth, but store the scalar data on disk subsystem with a
stripe size optimal for computing joins and restrictions on it.
David BL - 01 Sep 2008 04:09 GMT
> >> "David BL" <davi...@iinet.net.au> wrote in message
>
[quoted text clipped - 42 lines]
> that it doesn't take into account that in a concurrent environment, many of
> those seeks can be shared by other queries.

If seeks have a reasonable probability of being shared, doesn’t that
necessarily mean the size of memory is approaching the size of
secondary storage, and a large read cache will achieve the same
effect?

I’ve heard of techniques like buffer trees that take buffering of
reads and writes to an extreme level, allowing a system to get much
closer to the theoretical optimum for I/O.   However such techniques
don’t seem compatible with OLTP where transactions are using strict
2PL and need to be completed quickly in order release locks.   I can
see that MVCC will provide far more opportunity for long running read
only transactions to share reads (and seeks), but I doubt whether it
would allow much sharing of reads when the size of memory is only a
tiny fraction of the total size of the database.

> >> > > By the way, Oracle documentation states that an 8K block size is
> >> > > optimal
[quoted text clipped - 32 lines]
> I don't see how since there is clearly a correlation between the size of
> each unit of I/O and the contention for what is on each unit of I/O.

Locking doesn’t even need to be page based.   Are you aware of the
distinction between latches and locks?

> >> > 64k blocks are generally too small on modern disks.  A 64k block can
> >> > be transferred in a tenth of the time it takes to seek to it.
[quoted text clipped - 15 lines]
> cache to RAM.  In the example above, you would have to move more than 16
> times as much data from the cache to RAM to answer the same query.

I agree that if the disk subsystem caches a track then a smaller unit
of I/O can be desirable.

I think we’re talking past each other because you’re associating block
with unit of I/O whereas I’m associating it more with the unit of
allocation.   As I see it, in the above you’re only comparing
relatively small and uninteresting differences in read performance for
a given allocation of data to sectors and tracks.

Changing the unit of allocation has an enormous impact on how data is
allocated by the DBMS.  This in turn can have a huge impact on the
number of seeks.

With a very small unit of allocation, on a small contiguous area of
the disk there can be allocations for many unrelated tables written by
many unrelated transactions. In theory the sweet spot for the unit of
allocation occurs when the time to seek is similar in magnitude to the
time to transfer.

One can imagine more sophisticated allocators – eg that reserve large
areas of the disk in ways that promote better clustering of related
data.   One can even imagine that multiple independent allocators (ie
heaps) should be used within the DBMS.   However the same effect can
be achieved more simply and with less wastage of space by using a
single allocator for the entire DBMS from which reasonably large
blocks are always allocated.   This can be compared to an in-memory
programming environment that employs a single heap allocator for all
threads, and wherever that would lead to high latency due to poor
localisation of very small allocations, the programmer instead
allocates a somewhat larger block and breaks it up into smaller pieces
to meet the requirements of the smaller allocations.

I cannot see any good reason why a DBMS would skimp on the unit of
allocation (eg 64k), particularly when it can be coarser than the unit
of I/O (eg 8k).   I wonder whether legacy of code base or backwards
compatibility is rearing its ugly head?

> > The advantages of a larger block size are more apparent in a database
> > storing data where there is a greater tendency for locality based on
[quoted text clipped - 6 lines]
> stripe size and depth, but store the scalar data on disk subsystem with a
> stripe size optimal for computing joins and restrictions on it.

I agree that that the optimal unit of I/O depends on the nature of the
data and its usage patterns.  I don’t think multiple heaps is
generally necessary.
Brian Selzer - 01 Sep 2008 14:00 GMT
[snip]

> > > I'm not sure how I implied that elevator seeking isn't worthwhile.
> >
[quoted text clipped - 19 lines]
> would allow much sharing of reads when the size of memory is only a
> tiny fraction of the total size of the database.

I think you would be surprised.  In a typical OLTP database, only a tiny
fraction of the total size of the database is being manipulated.  Most
accesses involve what happened yesterday and what is happening today.  Most
of the remainder involve what is happening this month and possibly what
happened last month.  For example, in a manufacturing system, those
manufacturing orders that are currently running in the plant are accessed
most often.  Those orders that ran yesterday are the next most often
queried, followed by those that have run so far this month and then last
month.  Only an occasional query will look at older information--perhaps
looking at the last time a particular part was produced, for example.

A bigger block size will cause a lot of information to be transferred to
memory that isn't required to answer any queries.

[snip]

> > > I think these systems lock pages not extents, and anyway locking
> > > granularity can in principle be orthogonal to the unit of I/O or unit
[quoted text clipped - 5 lines]
> Locking doesn’t even need to be page based.   Are you aware of the
> distinction between latches and locks?

Yes, I am.  A latch guarantees the integrity of a row during the reading of
that row.

The overhead required for row-level locking can drastically reduce
performance.  For an update that touches a few rows, row-level locking is
optimal.  For an update that touches several thousand rows, locking pages
may make more sense.  For an update that touches millions of rows, extent
locks or even table locks are more efficient.  Another factor is the unit of
I/O for the transaction logs.  If the transaction log contains the pages
that are different instead of just the rows that are different, then pages
must be locked along with any rows, because if one transaction changes one
row on a page, and another transaction changes another row on the same page,
and if one of those transactions is rolled back, then the rollback could
overwrite the change made by the transaction that will ultimately commit, or
if both roll back, the database could end up having one of the changes
recorded.  It gets a lot more complicated when there are a lot of index
pages recorded in the log as well as just rows, because indexes must also be
locked, or at least invalidated until the updates propogate to them.

> > >> > 64k blocks are generally too small on modern disks.  A 64k block
> > >> > can
[quoted text clipped - 40 lines]
> allocation occurs when the time to seek is similar in magnitude to the
> time to transfer.

Interestingly, on a high-end 15k drive, revolutions take 4ms, and the
average seek time is 2.9ms.  Reading an entire track, therefore, should take
at most 4ms.  Based on comparisons of the specs for different size drives of
the same series, I don't think more than one head is being read at the same
time, but if they implemented that, it would be possible to read an entire
cylinder in the time it takes to read a track, 4ms.  But then, it might be
cheaper to just add more drives.

> One can imagine more sophisticated allocators – eg that reserve large
> areas of the disk in ways that promote better clustering of related
[quoted text clipped - 8 lines]
> allocates a somewhat larger block and breaks it up into smaller pieces
> to meet the requirements of the smaller allocations.

It's a trade off: bigger allocation units would increase the throughput for
scans but would reduce the throughput for seeks.

> I cannot see any good reason why a DBMS would skimp on the unit of
> allocation (eg 64k), particularly when it can be coarser than the unit
> of I/O (eg 8k).   I wonder whether legacy of code base or backwards
> compatibility is rearing its ugly head?

Possibly.  I would think, however, that given the competitiveness between
Oracle and Microsoft, that if more speed could be achieved by increasing the
allocation unit, then one of them would have done it.  They may also be
anticipating the development of lower cost solid-state drives, where seek
times and latency disappear, and there a larger block size would be a
disadvantage.

> > > The advantages of a larger block size are more apparent in a database
> > > storing data where there is a greater tendency for locality based on
[quoted text clipped - 13 lines]
> data and its usage patterns.  I don’t think multiple heaps is
> generally necessary.
David BL - 25 Aug 2008 03:29 GMT
> > > I am learning about database systems, and I am reading a book called
> > > "Physical Database Design".
[quoted text clipped - 20 lines]
> But what dictates the block size? Is this defined by the physical
> disk, the file system, or the database code?

As far as a physical disk is concerned the term “block” was commonly
used to refer to the intersection of a sector and a track, but now
days “sector” tends to be used instead.   It is the smallest unit of
reading/writing and is often 512 bytes.  Some disks use 1024 byte
sectors.

A file system provides buffering, and that allows an application to
seek within a file and read/write a single byte.  However behind the
scenes an entire sector must be read/written to disk.

Typically a DBMS will read/write the disk without file buffering
provided by the OS.  For example on Win32, the function CreateFile can
take the parameter FILE_FLAG_NO_BUFFERING. This forces the DBMS to
work at the granularity of sectors – and it’s fairly low level.  Eg
there is a requirement that memory buffers be aligned on 512 byte
boundaries to comply with the DMA constraints.

To avoid excessive seeking the DBMS will tend to organise the store
into much courser units that are typically called “blocks”.  The block
size is up to the DBMS, but in practise will always be a multiple of a
sector.  In some cases it may relate back to track or cylinder
boundaries, but that constraint is not imposed by the disk controller
(which will happily allow for random access to any sector).
Tim X - 21 Aug 2008 10:29 GMT
> I am learning about database systems, and I am reading a book called
> "Physical Database Design".
[quoted text clipped - 12 lines]
> Why can a disk only read 64KB at a time? Is this a valid assumption?
> Is this a disk limitation or a file system limitation?

This is a valid 'assumption'. It does not mean that this is all any disk
can do. To make their point, the authors have to pick some value and 64k
is as good as any other. In reality, it will depend on the hardware, the
way data is written to the disk, the speed of the host, bus, type of
data transfer technology, what the system is optimised for etc etc.
Things actually get even more complex because most DBMS do a certain
amount of their own caching as well. What the authors are attempting to
do is provide a clear abstract explination that doesn't get overly
complex. this is also the reason why whenever it comes to working out
performance and tuning the system (at all levels) it is essential to use
available tools and why most large enterprise level databases usually
have tools to assist in calculating this type of thing.

I'm not familiar with the book in question, but I suspect they are about
to explain how things like record sizes, available indexes etc can
impact on performance and possibly show why often held belief that
indexes always make things faster can be misleading or just plain
wrong. If its a good book, it will emphasise the importance on gathering
hard figures and stats in order to optimise performance and how
dangerous 'rules' regarding optimisation and performance can be. It is
partially due to the complexities and variations involved that you don't
get many databases that can successfully tune for performance
automatically.

Tim

Signature

tcross (at) rapttech dot com dot au

-CELKO- - 22 Aug 2008 15:13 GMT
>> Why can a disk only read 64KB at a time? Is this a valid assumption?  Is this a disk limitation or a file system limitation? <<

The author has to provide some numbers to show how to calculate an
estimation for disk access.  Frankly, 64KB  seems a little small for a
modern computer other than a desktop machine.

What you might consider is the rise of solid state storage, which will
start replacing moving disk hardware in the next few years.  This with
multi-core processors will change database design radically.   We work
in a trade where everything you know is wrong in five years :)
David BL - 23 Aug 2008 07:04 GMT
> >> Why can a disk only read 64KB at a time? Is this a valid assumption?  Is this a disk limitation or a file system limitation? <<
>
> The author has to provide some numbers to show how to calculate an
> estimation for disk access.  Frankly, 64KB  seems a little small for a
> modern computer other than a desktop machine.

Also on a desktop, 64k is too small.  A desktop HD has higher seek
+rotational delay and lower transfer rate giving about the same
product as for an enterprise HD.

> What you might consider is the rise of solid state storage, which will
> start replacing moving disk hardware in the next few years.  This with
> multi-core processors will change database design radically.   We work
> in a trade where everything you know is wrong in five years :)

I wonder whether it will be less radical than might at first be
imagined.   CPU caches lead to significant variation in memory access
times.

I few years ago I wrote a transient B+Tree and compared the
performance to the STL map (a red black tree) that ships with MS
Visual C++.  I ran tests involving inserting a million randomly
generated keys on a map keyed by 32 bit integers.  The B+Tree was
twice as fast at insertions and deletions, 35% faster at look up, and
10 times faster at iteration through the elements.
DBMS_Plumber - 05 Sep 2008 21:41 GMT
> >> Why can a disk only read 64KB at a time? Is this a valid assumption?  Is this a disk limitation or a file system limitation? <<
...
> What you might consider is the rise of solid state storage, which will
> start replacing moving disk hardware in the next few years.  This with
> multi-core processors will change database design radically.   We work
> in a trade where everything you know is wrong in five years :)

You know, though? I've been hearing about the immanent arrival of
solid state storage for about 10 years now. As fast as they get their
price/performance point down, the disk boys/girls drop theirs too.

Mebbe this time things will be different ....
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.