Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / DB2 Topics / October 2008

Tip: Looking for answers? Try searching our database.

MDC vs. cluster index: which to use?

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Henry J. - 19 Oct 2008 20:34 GMT
I have an existing table with 2 million rows which I'm tasked to
improve performance.  I am thinking of creating a two-column MDC, on
column A and B.  The total cells will be card(A) * card(B) =~ 12,000.
Although on average each cell will have about 90 rows, 25% of the
cells will only have 1 row, 60% will have between 2 and 19 rows, and
the rest 15% will have 20 or more rows.

My question is, are the columns A and B a good candidate for a MDC?
It seems to me I'm better off creating a clustering index on A and B
since a good number of the MDC cells will be sparsely populated.

This idea of a new MDC/cluster index just came to me and I like to see
if my understanding of MDC is correct before I ask our DBAs to make
the changes in order to test it out.  So any hint will be appreciated.

Thanks!
Serge Rielau - 20 Oct 2008 00:24 GMT
> I have an existing table with 2 million rows which I'm tasked to
> improve performance.  I am thinking of creating a two-column MDC, on
[quoted text clipped - 10 lines]
> if my understanding of MDC is correct before I ask our DBAs to make
> the changes in order to test it out.  So any hint will be appreciated.
Can you make it more coarse? It sounds like a lot of near empty extends.
I.e. you storage requirements will go up tremendously.

Cheers
Serge

Signature

Serge Rielau
DB2 Solutions Development
IBM Toronto Lab

Henry J. - 20 Oct 2008 18:39 GMT
> > I have an existing table with 2 million rows which I'm tasked to
> > improve performance.  I am thinking of creating a two-column MDC, on
[quoted text clipped - 21 lines]
> DB2 Solutions Development
> IBM Toronto Lab

The cardinality of column A (type_id) is small, about 6-7; that of
column B (product_id) is much larger.   I guess we have to use a
generated column on product_id to make it coarser, which I'm
relunctant to do since it requires change to the table.  Perhaps I'd
just go for cluster index.
Serge Rielau - 20 Oct 2008 20:07 GMT
>>> I have an existing table with 2 million rows which I'm tasked to
>>> improve performance.  I am thinking of creating a two-column MDC, on
[quoted text clipped - 24 lines]
> relunctant to do since it requires change to the table.  Perhaps I'd
> just go for cluster index.
You could MDC on only the one column. This of such an MDC as a self
organizing range partitioning.

Signature

Serge Rielau
DB2 Solutions Development
IBM Toronto Lab

Henry J. - 21 Oct 2008 05:26 GMT
> >>> I have an existing table with 2 million rows which I'm tasked to
> >>> improve performance.  I am thinking of creating a two-column MDC, on
[quoted text clipped - 32 lines]
> DB2 Solutions Development
> IBM Toronto Lab

MDC on type_id?  So cluster index is not a good idea?
Ian - 21 Oct 2008 06:24 GMT
> MDC on type_id?  So cluster index is not a good idea?

MDC *guarantees* clustering, whereas a table with a clustering index
will eventually require maintenance (a.k.a. reorg) to maintain the
cluster ratio.

That's not to say that a clustering index isn't still valuable
(especially for high cardinality columns that aren't a reasonable
candidate as an MDC dimension).
Henry J. - 21 Oct 2008 15:18 GMT
> > MDC on type_id?  So cluster index is not a good idea?
>
[quoted text clipped - 5 lines]
> (especially for high cardinality columns that aren't a reasonable
> candidate as an MDC dimension).

For my case, the table is purged and re-populated every day.  And the
other column, product_id, has high cardinality.
Perhaps a cluster index makes more sense?  I guess I'd choose type_id
if I have to pick one column for MDC.  Then it won't be as helpful as
a cluster index on (type_id, product_id)?
ChrisC - 21 Oct 2008 16:19 GMT
> For my case, the table is purged and re-populated every day.  And the
> other column, product_id, has high cardinality.
> Perhaps a cluster index makes more sense?  I guess I'd choose type_id
> if I have to pick one column for MDC.  Then it won't be as helpful as
> a cluster index on (type_id, product_id)?

MDC still might be faster - probably faster at loading and equivalent
for searches based on type_id.  For searches on type_id and
product_id, it could outperform the cluster index, if you also create
an index on product_id - and the combination (MDC + index on
product_id) will definitely outperform a cluster index on type_id,
product_id for product_id searches.

You'll want to verify these claims - but since you rebuild the table
every day, you are in a perfect position to try these various options
out.  Pick a new one each day until you've found the right
combination.

-Chris
Henry J. - 21 Oct 2008 18:05 GMT
> > For my case, the table is purged and re-populated every day.  And the
> > other column, product_id, has high cardinality.
[quoted text clipped - 15 lines]
>
> -Chris

Thanks Chris.

The primary key of the table is (type_id, product_id).  Now if we add
a MDC on type_id, will the MDC be actually used if we query like the
following, or the primary key will be used only?  I guess even if only
the primary key is used, the I/O will still be benefited.

           select client.client_id, sum( prod.quantity * prod.price *
client.ordersize )
            from prod, client
            where prod.type_id = 4 and prod.product_id =
client.product_id
            group by client.client_id

Also, would the following query benefit from both the MDC and the
primary key?

    select *
          from prod
            where prod.type_id = 4 and prod.product_id =  101

Or, instead of the primary key, just use an index on (product_id)
would cover both of the above queries?

I'll certainly test to find out.  But I'd like to gain better
understanding about indexing.  Also, the turnaround of making such
changes are not as fast as I want and I hope to get to the best
solution a little faster.  Thanks.
Henry J. - 28 Oct 2008 18:54 GMT
> > On Oct 21, 7:18 am, "HenryJ." <tank209...@yahoo.com> wrote:
>
[quoted text clipped - 48 lines]
>
> - Show quoted text -

Now I've changed the primary key of the table to (type_id, product_id)
while making (type_id) as MDC.

When examing the plan, I found that the following query:

             select client.client_id, sum( prod.quantity * prod.price
* client.ordersize )
             from prod, client
             where prod.type_id = 4 and prod.product_id =
client.product_id
             group by client.client_id

does use the new primary key on (type_id, product_id).  However, if I
use a bind variable for type_id, i.e.:

             select client.client_id, sum( prod.quantity * prod.price
* client.ordersize )
             from prod, client
             where prod.type_id = ? and prod.product_id =
client.product_id
             group by client.client_id

It only uses the MDC on (type_id).

Now, my questions are:

1) should I drop the bind variable on type_id so it can utilize the
primary key?  The cardinality of type_id is about 5 to 8.

2) Would an addition on (product_id) help?

Thanks.
Henry J. - 28 Oct 2008 20:53 GMT
> On Oct 21, 7:18 am, "HenryJ." <tank209...@yahoo.com> wrote:
>
[quoted text clipped - 17 lines]
>
> -Chris

I have the primary key changed to (type_id, product_id), and added a
MDC on type_id.  Now the following query:

      select client.client_id, sum( prod.quantity * prod.price *
client.ordersize )
            from prod, client
            where prod.type_id = 4 and prod.product_id =
client.product_id
            group by client.client_id

it will utilize the primary key.  However, if I use a host variable
for type_id, it will only pick up the MDC on type_id, which I think is
not optimal.

My questions are:

1) should I add an index on (product_id) which is what's really needed
rather than the primary key on (type_id, product_id).  Note that the
cardinality of type_id is pretty small, around 6.

2) should I drop the host variable on type_id?  Note that each type_id
has the same number of product_id's.  Perhaps this is not a good case
for host variable.  BTW, this SQL is embedded in a Java (as static SQL
I think?).

Thanks!

Thanks!
Henry J. - 28 Oct 2008 20:56 GMT
> On Oct 21, 7:18 am, "HenryJ." <tank209...@yahoo.com> wrote:
>
[quoted text clipped - 17 lines]
>
> -Chris

I have the primary key changed to (type_id, product_id), and added a
MDC on type_id.  Now the following query:

      select client.client_id, sum( prod.quantity * prod.price *
client.ordersize )
            from prod, client
            where prod.type_id = 4 and prod.product_id =
client.product_id
            group by client.client_id

will utilize the primary key.  However, if I use a host variable for
type_id (i.e., type_id = ?), it will only pick up the MDC on type_id,
which I think is not optimal.

My questions are:

1) should I add an index on (product_id) which is what's really needed
rather than the primary key on (type_id, product_id).  Note that the
cardinality of type_id is pretty small, around 6.

2) should I drop the host variable on type_id?  Note that each type_id
has the same number of product_id's.  Perhaps this is not a good case
for host variable.  BTW, this SQL is embedded in a Java (as static SQL
I think?).

Thanks!
Saurabh.Nr@gmail.com - 29 Oct 2008 14:55 GMT
> > On Oct 21, 7:18 am, "HenryJ." <tank209...@yahoo.com> wrote:
>
[quoted text clipped - 46 lines]
>
> - Show quoted text -

Hi,
there is a shell script know extent utilisation.sh which can be
downloaded from IBM website.it will answers all your queries regarding
which will be a better column and how much space will you save and how
will it improve the performance.
Henry J. - 29 Oct 2008 21:13 GMT
On Oct 29, 9:55 am, Saurabh...@gmail.com wrote:
> On Oct 29, 12:56 am, "HenryJ." <tank209...@yahoo.com> wrote:
>
[quoted text clipped - 56 lines]
>
> - Show quoted text -

Could you please give me the link?  I couldn't find it.  Thanks.
Ian - 21 Oct 2008 19:32 GMT
>>> MDC on type_id?  So cluster index is not a good idea?
>> MDC *guarantees* clustering, whereas a table with a clustering index
[quoted text clipped - 10 lines]
> if I have to pick one column for MDC.  Then it won't be as helpful as
> a cluster index on (type_id, product_id)?

By purging it every day, you're going to make it nearly impossible
for a clustering index to be useful.  Unless your data comes in
already sorted by type_id,product_id the cluster ratio is probably
going to be very low, regardless of what you set pctfree to.

Furthermore, will a clustering index really help with queries that
aggregating on the clustering key?  Maybe, but maybe not -- that
depends on whether the query is doing an index scan / fetch.  If
it's just doing a tablescan, you'll still need to do a sort.
Henry J. - 21 Oct 2008 23:54 GMT
> >>> MDC on type_id?  So cluster index is not a good idea?
> >> MDC *guarantees* clustering, whereas a table with a clustering index
[quoted text clipped - 22 lines]
>
> - Show quoted text -

I'm totally confused -- doesn't a clustering index make sure the rows
are inserted as sorted and hence with high cluster ratio?  And only
updates/deletes/inserts over time would reduce the cluster ratio?  And
what I understood is that a truly clustered index would avoid a sort.
Ian - 22 Oct 2008 06:21 GMT
> I'm totally confused -- doesn't a clustering index make sure the rows
> are inserted as sorted and hence with high cluster ratio?  And only
> updates/deletes/inserts over time would reduce the cluster ratio?  And
> what I understood is that a truly clustered index would avoid a sort.

When a table has a clustering index, DB2 will *try* to insert rows into
the table in order to keep the data in the same order as the clustering
index.  However, there is no guarantee that this will happen, because it
depends on there being enough free space on the particular page DB2
wants to write the row.

If there isn't enough free space on the "correct" page, DB2 will insert
the row elsewhere, thus reducing the cluster ratio.  [Incidentally, this
is why MDC is so nice:  It guarantees that the table remains clustered.
Not only that, it allows you to effectively have multiple clustering
indexes].

In your case, when you start with an empty table (i.e. after you
truncate it at the beginning of each day), the cluster ratio is going
to be a mess very quickly.  Remember, INSERT ignores the table's
PCTFREE setting.  Say you have inserted 5 rows, which fill up 1 entire
page.  If the 6th row you insert logically belongs on that first page,
but that page is already full, your cluster ratio is dropping quickly.

As I said before, unless you insert your data in sorted order, the
cluster ratio will not be good until you perform a reorg.  Until you
have enough pages in the table (each with sufficient free space to
accommodate newly inserted rows), your cluster ratio will drop quickly.

Finally, regarding sorts:  It should be obvious that because the rows in
the table are not guaranteed to be in order, a sort operation must take
place if your query uses only a table scan.  A query that scans the
clustering index will fetch the rows in sorted order, and depending on
the cluster ratio this may or may not be very efficient.
Henry J. - 22 Oct 2008 19:09 GMT
> > I'm totally confused -- doesn't a clustering index make sure the rows
> > are inserted as sorted and hence with high cluster ratio?  And only
[quoted text clipped - 30 lines]
> clustering index will fetch the rows in sorted order, and depending on
> the cluster ratio this may or may not be very efficient.

Ian, thanks a lot.  I'm clear now.
Mark A - 20 Oct 2008 04:27 GMT
> I have an existing table with 2 million rows which I'm tasked to
> improve performance.  I am thinking of creating a two-column MDC, on
[quoted text clipped - 12 lines]
>
> Thanks!

That is not enough information.

How often is the table inserted into?
What is the logical relationship between A and B?

I would probably help to post the DDL of the table (you can change the
column names, but a business description of the tables is needed). Then
please explain how the data in the table is typically accessed.

Clustering is often a trade-off. You improve performance for some SQL
statements, but make others a bit slower. The best scenario is when the
statements you make slower don't occur very often.
Henry J. - 20 Oct 2008 18:46 GMT
> > I have an existing table with 2 million rows which I'm tasked to
> > improve performance.  I am thinking of creating a two-column MDC, on
[quoted text clipped - 27 lines]
>
> - Show quoted text -

The table is inserted once, never updated, and queried a lot.  All in
an overnight batch.  Column A is type_id and B is product_id.  There
are 2 other fields on the table, i.e., the quantity and price for the
products.  We typically do aggregation on the table to get total
quantities per product for each given type_id and selected product_id
(selected by joining with some other products tables).  Of course all
this is simplified but should convey the main features.

Thanks.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2010 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.