Database Forum / DB2 Topics / October 2008
MDC vs. cluster index: which to use?
|
|
Thread rating:  |
Henry J. - 19 Oct 2008 20:34 GMT I have an existing table with 2 million rows which I'm tasked to improve performance. I am thinking of creating a two-column MDC, on column A and B. The total cells will be card(A) * card(B) =~ 12,000. Although on average each cell will have about 90 rows, 25% of the cells will only have 1 row, 60% will have between 2 and 19 rows, and the rest 15% will have 20 or more rows.
My question is, are the columns A and B a good candidate for a MDC? It seems to me I'm better off creating a clustering index on A and B since a good number of the MDC cells will be sparsely populated.
This idea of a new MDC/cluster index just came to me and I like to see if my understanding of MDC is correct before I ask our DBAs to make the changes in order to test it out. So any hint will be appreciated.
Thanks!
Serge Rielau - 20 Oct 2008 00:24 GMT > I have an existing table with 2 million rows which I'm tasked to > improve performance. I am thinking of creating a two-column MDC, on [quoted text clipped - 10 lines] > if my understanding of MDC is correct before I ask our DBAs to make > the changes in order to test it out. So any hint will be appreciated. Can you make it more coarse? It sounds like a lot of near empty extends. I.e. you storage requirements will go up tremendously.
Cheers Serge
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
Henry J. - 20 Oct 2008 18:39 GMT > > I have an existing table with 2 million rows which I'm tasked to > > improve performance. I am thinking of creating a two-column MDC, on [quoted text clipped - 21 lines] > DB2 Solutions Development > IBM Toronto Lab The cardinality of column A (type_id) is small, about 6-7; that of column B (product_id) is much larger. I guess we have to use a generated column on product_id to make it coarser, which I'm relunctant to do since it requires change to the table. Perhaps I'd just go for cluster index.
Serge Rielau - 20 Oct 2008 20:07 GMT >>> I have an existing table with 2 million rows which I'm tasked to >>> improve performance. I am thinking of creating a two-column MDC, on [quoted text clipped - 24 lines] > relunctant to do since it requires change to the table. Perhaps I'd > just go for cluster index. You could MDC on only the one column. This of such an MDC as a self organizing range partitioning.
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
Henry J. - 21 Oct 2008 05:26 GMT > >>> I have an existing table with 2 million rows which I'm tasked to > >>> improve performance. I am thinking of creating a two-column MDC, on [quoted text clipped - 32 lines] > DB2 Solutions Development > IBM Toronto Lab MDC on type_id? So cluster index is not a good idea?
Ian - 21 Oct 2008 06:24 GMT > MDC on type_id? So cluster index is not a good idea? MDC *guarantees* clustering, whereas a table with a clustering index will eventually require maintenance (a.k.a. reorg) to maintain the cluster ratio.
That's not to say that a clustering index isn't still valuable (especially for high cardinality columns that aren't a reasonable candidate as an MDC dimension).
Henry J. - 21 Oct 2008 15:18 GMT > > MDC on type_id? So cluster index is not a good idea? > [quoted text clipped - 5 lines] > (especially for high cardinality columns that aren't a reasonable > candidate as an MDC dimension). For my case, the table is purged and re-populated every day. And the other column, product_id, has high cardinality. Perhaps a cluster index makes more sense? I guess I'd choose type_id if I have to pick one column for MDC. Then it won't be as helpful as a cluster index on (type_id, product_id)?
ChrisC - 21 Oct 2008 16:19 GMT > For my case, the table is purged and re-populated every day. And the > other column, product_id, has high cardinality. > Perhaps a cluster index makes more sense? I guess I'd choose type_id > if I have to pick one column for MDC. Then it won't be as helpful as > a cluster index on (type_id, product_id)? MDC still might be faster - probably faster at loading and equivalent for searches based on type_id. For searches on type_id and product_id, it could outperform the cluster index, if you also create an index on product_id - and the combination (MDC + index on product_id) will definitely outperform a cluster index on type_id, product_id for product_id searches.
You'll want to verify these claims - but since you rebuild the table every day, you are in a perfect position to try these various options out. Pick a new one each day until you've found the right combination.
-Chris
Henry J. - 21 Oct 2008 18:05 GMT > > For my case, the table is purged and re-populated every day. And the > > other column, product_id, has high cardinality. [quoted text clipped - 15 lines] > > -Chris Thanks Chris.
The primary key of the table is (type_id, product_id). Now if we add a MDC on type_id, will the MDC be actually used if we query like the following, or the primary key will be used only? I guess even if only the primary key is used, the I/O will still be benefited.
select client.client_id, sum( prod.quantity * prod.price * client.ordersize ) from prod, client where prod.type_id = 4 and prod.product_id = client.product_id group by client.client_id
Also, would the following query benefit from both the MDC and the primary key?
select * from prod where prod.type_id = 4 and prod.product_id = 101
Or, instead of the primary key, just use an index on (product_id) would cover both of the above queries?
I'll certainly test to find out. But I'd like to gain better understanding about indexing. Also, the turnaround of making such changes are not as fast as I want and I hope to get to the best solution a little faster. Thanks.
Henry J. - 28 Oct 2008 18:54 GMT > > On Oct 21, 7:18 am, "HenryJ." <tank209...@yahoo.com> wrote: > [quoted text clipped - 48 lines] > > - Show quoted text - Now I've changed the primary key of the table to (type_id, product_id) while making (type_id) as MDC.
When examing the plan, I found that the following query:
select client.client_id, sum( prod.quantity * prod.price * client.ordersize ) from prod, client where prod.type_id = 4 and prod.product_id = client.product_id group by client.client_id
does use the new primary key on (type_id, product_id). However, if I use a bind variable for type_id, i.e.:
select client.client_id, sum( prod.quantity * prod.price * client.ordersize ) from prod, client where prod.type_id = ? and prod.product_id = client.product_id group by client.client_id
It only uses the MDC on (type_id).
Now, my questions are:
1) should I drop the bind variable on type_id so it can utilize the primary key? The cardinality of type_id is about 5 to 8.
2) Would an addition on (product_id) help?
Thanks.
Henry J. - 28 Oct 2008 20:53 GMT > On Oct 21, 7:18 am, "HenryJ." <tank209...@yahoo.com> wrote: > [quoted text clipped - 17 lines] > > -Chris I have the primary key changed to (type_id, product_id), and added a MDC on type_id. Now the following query:
select client.client_id, sum( prod.quantity * prod.price * client.ordersize ) from prod, client where prod.type_id = 4 and prod.product_id = client.product_id group by client.client_id
it will utilize the primary key. However, if I use a host variable for type_id, it will only pick up the MDC on type_id, which I think is not optimal.
My questions are:
1) should I add an index on (product_id) which is what's really needed rather than the primary key on (type_id, product_id). Note that the cardinality of type_id is pretty small, around 6.
2) should I drop the host variable on type_id? Note that each type_id has the same number of product_id's. Perhaps this is not a good case for host variable. BTW, this SQL is embedded in a Java (as static SQL I think?).
Thanks!
Thanks!
Henry J. - 28 Oct 2008 20:56 GMT > On Oct 21, 7:18 am, "HenryJ." <tank209...@yahoo.com> wrote: > [quoted text clipped - 17 lines] > > -Chris I have the primary key changed to (type_id, product_id), and added a MDC on type_id. Now the following query:
select client.client_id, sum( prod.quantity * prod.price * client.ordersize ) from prod, client where prod.type_id = 4 and prod.product_id = client.product_id group by client.client_id
will utilize the primary key. However, if I use a host variable for type_id (i.e., type_id = ?), it will only pick up the MDC on type_id, which I think is not optimal.
My questions are:
1) should I add an index on (product_id) which is what's really needed rather than the primary key on (type_id, product_id). Note that the cardinality of type_id is pretty small, around 6.
2) should I drop the host variable on type_id? Note that each type_id has the same number of product_id's. Perhaps this is not a good case for host variable. BTW, this SQL is embedded in a Java (as static SQL I think?).
Thanks!
Saurabh.Nr@gmail.com - 29 Oct 2008 14:55 GMT > > On Oct 21, 7:18 am, "HenryJ." <tank209...@yahoo.com> wrote: > [quoted text clipped - 46 lines] > > - Show quoted text - Hi, there is a shell script know extent utilisation.sh which can be downloaded from IBM website.it will answers all your queries regarding which will be a better column and how much space will you save and how will it improve the performance.
Henry J. - 29 Oct 2008 21:13 GMT On Oct 29, 9:55 am, Saurabh...@gmail.com wrote:
> On Oct 29, 12:56 am, "HenryJ." <tank209...@yahoo.com> wrote: > [quoted text clipped - 56 lines] > > - Show quoted text - Could you please give me the link? I couldn't find it. Thanks.
Ian - 21 Oct 2008 19:32 GMT >>> MDC on type_id? So cluster index is not a good idea? >> MDC *guarantees* clustering, whereas a table with a clustering index [quoted text clipped - 10 lines] > if I have to pick one column for MDC. Then it won't be as helpful as > a cluster index on (type_id, product_id)? By purging it every day, you're going to make it nearly impossible for a clustering index to be useful. Unless your data comes in already sorted by type_id,product_id the cluster ratio is probably going to be very low, regardless of what you set pctfree to.
Furthermore, will a clustering index really help with queries that aggregating on the clustering key? Maybe, but maybe not -- that depends on whether the query is doing an index scan / fetch. If it's just doing a tablescan, you'll still need to do a sort.
Henry J. - 21 Oct 2008 23:54 GMT > >>> MDC on type_id? So cluster index is not a good idea? > >> MDC *guarantees* clustering, whereas a table with a clustering index [quoted text clipped - 22 lines] > > - Show quoted text - I'm totally confused -- doesn't a clustering index make sure the rows are inserted as sorted and hence with high cluster ratio? And only updates/deletes/inserts over time would reduce the cluster ratio? And what I understood is that a truly clustered index would avoid a sort.
Ian - 22 Oct 2008 06:21 GMT > I'm totally confused -- doesn't a clustering index make sure the rows > are inserted as sorted and hence with high cluster ratio? And only > updates/deletes/inserts over time would reduce the cluster ratio? And > what I understood is that a truly clustered index would avoid a sort. When a table has a clustering index, DB2 will *try* to insert rows into the table in order to keep the data in the same order as the clustering index. However, there is no guarantee that this will happen, because it depends on there being enough free space on the particular page DB2 wants to write the row.
If there isn't enough free space on the "correct" page, DB2 will insert the row elsewhere, thus reducing the cluster ratio. [Incidentally, this is why MDC is so nice: It guarantees that the table remains clustered. Not only that, it allows you to effectively have multiple clustering indexes].
In your case, when you start with an empty table (i.e. after you truncate it at the beginning of each day), the cluster ratio is going to be a mess very quickly. Remember, INSERT ignores the table's PCTFREE setting. Say you have inserted 5 rows, which fill up 1 entire page. If the 6th row you insert logically belongs on that first page, but that page is already full, your cluster ratio is dropping quickly.
As I said before, unless you insert your data in sorted order, the cluster ratio will not be good until you perform a reorg. Until you have enough pages in the table (each with sufficient free space to accommodate newly inserted rows), your cluster ratio will drop quickly.
Finally, regarding sorts: It should be obvious that because the rows in the table are not guaranteed to be in order, a sort operation must take place if your query uses only a table scan. A query that scans the clustering index will fetch the rows in sorted order, and depending on the cluster ratio this may or may not be very efficient.
Henry J. - 22 Oct 2008 19:09 GMT > > I'm totally confused -- doesn't a clustering index make sure the rows > > are inserted as sorted and hence with high cluster ratio? And only [quoted text clipped - 30 lines] > clustering index will fetch the rows in sorted order, and depending on > the cluster ratio this may or may not be very efficient. Ian, thanks a lot. I'm clear now.
Mark A - 20 Oct 2008 04:27 GMT > I have an existing table with 2 million rows which I'm tasked to > improve performance. I am thinking of creating a two-column MDC, on [quoted text clipped - 12 lines] > > Thanks! That is not enough information.
How often is the table inserted into? What is the logical relationship between A and B?
I would probably help to post the DDL of the table (you can change the column names, but a business description of the tables is needed). Then please explain how the data in the table is typically accessed.
Clustering is often a trade-off. You improve performance for some SQL statements, but make others a bit slower. The best scenario is when the statements you make slower don't occur very often.
Henry J. - 20 Oct 2008 18:46 GMT > > I have an existing table with 2 million rows which I'm tasked to > > improve performance. I am thinking of creating a two-column MDC, on [quoted text clipped - 27 lines] > > - Show quoted text - The table is inserted once, never updated, and queried a lot. All in an overnight batch. Column A is type_id and B is product_id. There are 2 other fields on the table, i.e., the quantity and price for the products. We typically do aggregation on the table to get total quantities per product for each given type_id and selected product_id (selected by joining with some other products tables). Of course all this is simplified but should convey the main features.
Thanks.
|
|
|