Database Forum / DB2 Topics / July 2006
Best practice
|
|
Thread rating:  |
Bob Stearns - 22 Jul 2006 06:11 GMT I'm building a web based auction system for multiple clients. I have auctions, items, and invoices I need to assign generated keys for. Since each of these entities is represented by a table, the keys are assigned as GENERATED BY DEFAULT. While this works, and is easy to reference programmatically, I wonder if it is the best technique. All of the keys are drawn from one sequence, not depending on the customer or the specific auction. Would it be better to have an auction_id sequence for each client, an item_id sequence for each auction (the client never sees either one of these columns, but there is no natural key for these tables that is ready and known when they are instantiated) and invoice_no sequence for each client or auction? If so, how should I implement these to both minimize SQL calls and ensure no duplication? With sequences, I can do something like:
INSERT INTO auctions (client_id, auction_id, ...) VALUES ($client, (SELECT NEXTVAL FOR auction_is_seq_$client), ...)
If instead I use the MAX(auction_id), what is the syntax to select the MAX and INSERT the new row in one atomic operation?
If I use a column associated with each user for next_auction_id, the same question occurs.
The same question, only increased in difficulty occurs when I have many rows to add at once with one INSERT statement.
Mark A - 22 Jul 2006 08:34 GMT > I'm building a web based auction system for multiple clients. I have > auctions, items, and invoices I need to assign generated keys for. Since [quoted text clipped - 22 lines] > The same question, only increased in difficulty occurs when I have many > rows to add at once with one INSERT statement. This will generate a unique value for auction_id within a client:
INSERT INTO auctions (client_id, auction_id, ...) VALUES ($client, (SELECT MAX(auction_id) + 1), ...);
If you want to insert multiple rows with one insert statement, you will have to increment the +1 in your code::
INSERT INTO auctions (client_id, auction_id, ...) VALUES ($client, (SELECT MAX(auction_id) + 1), ...) ($client, (SELECT MAX(auction_id) + 2), ...) ($client, (SELECT MAX(auction_id) + 3), ...) ($client, (SELECT MAX(auction_id) + 4), ...) ;
If you need to know what the value of the auction_id you just inserted, you can use this statement to do the insert and find the value all in one statement:
SELECT auction_id FROM FINAL TABLE (INSERT INTO auctions (client_id, auction_id, ...) VALUES ($client, (SELECT MAX(auction_id) + 1), ...));
If the PK is (client_id, auction_id) then you probably want to make the PK index the clustering index so that all auctions for particular client will be physically close together. That means you will have to create the unique index on these columns before you define the PK.
IBM: When are we going to get ALTER INDEX to specify the clustering index? DB2 for z/OS has had this for at least 15 years!!!
Serge Rielau - 22 Jul 2006 13:08 GMT > IBM: When are we going to get ALTER INDEX to specify the clustering index? > DB2 for z/OS has had this for at least 15 years!!! My personal opinion: MDC is much more powerful than clustering indexes. Makes much more sense to invest into that area, IMHO.
Cheers Serge
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
IOD Conference http://www.ibm.com/software/data/ondemandbusiness/conf2006/
Mark A - 22 Jul 2006 16:03 GMT > My personal opinion: > MDC is much more powerful than clustering indexes. > Makes much more sense to invest into that area, IMHO. > > Cheers > Serge MDC is overkill in some situations, especially if the desired clustering column has very high cardinality.
For example, I have a hotel_property table with card of 75,000, row length 300, and I want a clustering index on (LONGITUDE, LATITUDE) with the following very high frequency search (both are defined a DECIMAL (9,6)):
SELECT hotel_nbr, hotel_name, ... FROM hotel_property WHERE longitude between ? and ? AND latitude between ? and ?;
MDC would not be good on (LONGITUDE), or (LONGITUDE, LATITUDE).
Serge Rielau - 22 Jul 2006 19:29 GMT >> My personal opinion: >> MDC is much more powerful than clustering indexes. [quoted text clipped - 16 lines] > > MDC would not be good on (LONGITUDE), or (LONGITUDE, LATITUDE). You roll it up on 2 dimensions, e.g.: LONGITUDE / 10 and LATITUDE / 10 Now the beats can serve up BOTH diemensions. clustering index required reorg to keep it clustered and is only single dimension.
Cheers Serge
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
IOD Conference http://www.ibm.com/software/data/ondemandbusiness/conf2006/
Mark A - 22 Jul 2006 19:42 GMT >> MDC would not be good on (LONGITUDE), or (LONGITUDE, LATITUDE). > You roll it up on 2 dimensions, e.g.: LONGITUDE / 10 and LATITUDE / 10 [quoted text clipped - 3 lines] > Cheers > Serge I don't quite understand the "/ 10". Every longitude and latitude (taken to 6 decimal places) is unique (except in rare cases).
My Hotel table has very few updates and does not need to be reorged often.
In the example of the Original Poster, I would think that a lot of client_id's have only one or a few auctions, so it seems that there would be a lot of wasted space, and a big penalty if a table scan was ever needed.
Serge Rielau - 22 Jul 2006 20:18 GMT >>> MDC would not be good on (LONGITUDE), or (LONGITUDE, LATITUDE). >> You roll it up on 2 dimensions, e.g.: LONGITUDE / 10 and LATITUDE / 10 [quoted text clipped - 12 lines] > client_id's have only one or a few auctions, so it seems that there would be > a lot of wasted space, and a big penalty if a table scan was ever needed. CREATE TABLE places(id INT NOT NULL PRIMARY KEY, name VARCHAR(100), latitude DECIMAL(8, 6), -- +/-90 Decrees longitude DECIMAL(9, 6), -- +/-180 Degrees latstripe INTEGER GENERATED ALWAYS AS (INTEGER(latitude)) longstripe INTEGER GENERATED ALWAYS AS (INTEGER(longitude) / 2)) ORGANIZE BY DIMENSIONS (latstripe, longstripe);
Now each row gets bucketized into a 180 * 180 grid. The block indexes will be a single page each. To find any place by longitude and/or latitude DB2 will index and teh block indices and then scan within the grid sector. Of course you can add regular indices as well. You never need to reorg the beast except for page overflow due to updates. Want to find all the places in Europe? No problem, box Europe in and go from there. If your table gets really big you may want to allow for finer granularity.
I love MDC :-)
Cheers Serge
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
IOD Conference http://www.ibm.com/software/data/ondemandbusiness/conf2006/
Mark A - 23 Jul 2006 00:03 GMT > CREATE TABLE places(id INT NOT NULL PRIMARY KEY, > name VARCHAR(100), [quoted text clipped - 20 lines] > Cheers > Serge I will try this and run some explains and maybe test execution time, but I am still skeptical. All searches for hotels are within a relatively small radius, usually about 25 miles (of course it is actually a rectangle and not a radius).
Serge Rielau - 23 Jul 2006 03:09 GMT >> CREATE TABLE places(id INT NOT NULL PRIMARY KEY, >> name VARCHAR(100), [quoted text clipped - 25 lines] > radius, usually about 25 miles (of course it is actually a rectangle and not > a radius). The biggest mistake done with MDC is to have too fine a granularity. Think of range-partitioning...squared and automatic.
Cheers Serge
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
IOD Conference http://www.ibm.com/software/data/ondemandbusiness/conf2006/
Raj - 24 Jul 2006 19:24 GMT MDC is bad with columns having high card... I have tried MDC on a table on item_key and cust_key, each has a card of 10,000. and about 80 million records. mdc { slice} [ of card = 1000] was made up of gen_item_key ( = item_key/1000), gen_cust_key( = cust_key/1000 )
The mdc table size was comparable to the original table, but the index was not being used because of non-uniform density of the generated columns? Also table scans take 1-2 min longer , and sorts after a table scan take almost twice the time compared to the original table.
> > I will try this and run some explains and maybe test execution time, but I > > am still skeptical. All searches for hotels are within a relatively small [quoted text clipped - 12 lines] > IOD Conference > http://www.ibm.com/software/data/ondemandbusiness/conf2006/ Serge Rielau - 24 Jul 2006 21:27 GMT > MDC is bad with columns having high card... > I have tried MDC on a table on item_key and cust_key, each has a card [quoted text clipped - 7 lines] > Also table scans take 1-2 min longer , and sorts after a table scan > take almost twice the time compared to the original table. It took more than a minute to scan 80M rows? What kind of system was that? I have worked on a Linux BCU (a dozen data nodes) a couple of weeks ago where we burned through 1.4B rows in the fact plus several dimensions, rolled up the intermediate result set from the join of 140M rows with some aggregation in some 20 seconds (no MQTs of course). MDC was used on the fact table.
I don't see a reason why sort would care for the tables structure at all. Could it be MDC was the scapegoat here?
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
IOD Conference http://www.ibm.com/software/data/ondemandbusiness/conf2006/
Raj - 25 Jul 2006 13:49 GMT No... thats not what i said, table scan on the original table takes around 11 mins for the non-MDC table & it takes about 13 mins on the MDC table. There is atleast 100,000 timerons increase in the access plan. I now have the table built on a non unique column, (card = 80). It works great with that
> It took more than a minute to scan 80M rows? What kind of system was > that? I have worked on a Linux BCU (a dozen data nodes) a couple of [quoted text clipped - 12 lines] > IOD Conference > http://www.ibm.com/software/data/ondemandbusiness/conf2006/ Knut Stolze - 24 Jul 2006 10:03 GMT > For example, I have a hotel_property table with card of 75,000, row length > 300, and I want a clustering index on (LONGITUDE, LATITUDE) with the [quoted text clipped - 6 lines] > > MDC would not be good on (LONGITUDE), or (LONGITUDE, LATITUDE). Agreed. But in such cases, a spatial index would be better in any case.
 Signature Knut Stolze DB2 Information Integration Development IBM Germany
Mark A - 24 Jul 2006 18:58 GMT > Agreed. But in such cases, a spatial index would be better in any case. In theory yes, but in practice no. Most of the time someone is searching for a hotel within 5 miles or less of reference point and that means a (for 5 mile radius) a square of 10 miles in each dimension (figuring out the true radius is not worth the effort in most cases).
On average, there are maybe 25 hotels or less that would returned in such a query. Taking in to account other reference points with the same longitude coordinates, but that are in a totally different latitude range (also 10 miles for a 5 mile radius) the number is usually only doubled, or tripled at worst. It is very fast to just filter those out by including latitude as the second column in the index (along with longitude).
Certainly, some geo applications require more sophisticated search algorithms for maximum efficiency and performance, but this one doesn't.
Serge Rielau - 24 Jul 2006 21:14 GMT >> Agreed. But in such cases, a spatial index would be better in any case. > [quoted text clipped - 12 lines] > Certainly, some geo applications require more sophisticated search > algorithms for maximum efficiency and performance, but this one doesn't. OK, back to the original debate on clustering indexes. The difference between a clustered index and a non clustered index (to me as an optimizer amateur) is whether DB2 bothers to throw in a RIDSORT/FETCH compared to a straight FETCH. Sorting 25 RIDS (4 Byte in V8, 6 bytes each in V9) is truly not anything that sounds worth troubling myself for with a clustering index :-)
Cheers Serge
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
IOD Conference http://www.ibm.com/software/data/ondemandbusiness/conf2006/
Mark A - 24 Jul 2006 21:51 GMT > OK, back to the original debate on clustering indexes. > The difference between a clustered index and a non clustered index (to [quoted text clipped - 9 lines] > DB2 Solutions Development > IBM Toronto Lab In the example of the Hotel table, after the hotels within a particular latitude and longitude range are determined via the index scan, I want to be able to retrieve the data rows on the table with as few data page fetches, and using as few bufferpool pages for those data pages, as possible.
Serge Rielau - 24 Jul 2006 22:27 GMT > In the example of the Hotel table, after the hotels within a particular > latitude and longitude range are determined via the index scan, I want > to be able to retrieve the data rows on the table with as few data page > fetches, and using as few bufferpool pages for those data pages, as > possible. OK, that sounds good, but how much slower is your system going to be if your buffer pool hit ratio drops by 1%? Won't you have the most popular locations in the buffer pool anyway? I confess I'm pushing you here a bit. The point I'm trying to make is that there is a point where the effort doesn't justify the return. When I look at troubled apps clustered index is one of the less popular arrows in my quiver. Whenever I see a post from a novice or casual user I try to KISS and assume that 85% of theoretical best throughput will make that customer quite happy as long as it's easy to get there.
Cheers serge
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
IOD Conference http://www.ibm.com/software/data/ondemandbusiness/conf2006/
Mark A - 25 Jul 2006 01:22 GMT > OK, that sounds good, but how much slower is your system going to be if > your buffer pool hit ratio drops by 1%? Won't you have the most popular [quoted text clipped - 12 lines] > DB2 Solutions Development > IBM Toronto Lab To me, it is easier to design things in the most optimal manner, than it is to design in a less optimal manner and then try to measure or speculate about whether there is any negative impact, or how much difference there would be.. I don't see any real cost of clustering tables that have relatively low insert or update activity. Once a person learns how to do it properly, it becomes second nature. I sleep better at night knowing that DB2 is accessing the least number of pages necessary to process the SQL statements.
If like the OP, one has web application to keep track of a relatively small number of auctions, then maybe the performance difference does not mater much or at all.
If I have a worldwide air, hotel, and rental car reservation system with hundreds of transactions per second, and hundreds of tables that I can make clustering decisions on, then it could make a difference. Especially if I am memory constrained with a 32-bit instance and can't allocate bufferpools over 2GB (I inherited this system with XML Extender, which is not supported on 64-bit).
Clustering is especially important on DB2 for z/OS because if you don't choose a clustering index, DB2 will choose one for you (the first index created). So it is not always a question of clustering or not clustering, but of bad clustering versus good clustering. I can guarantee you that when I started working with DB2 systems on MVS in the late 1980's, proper clustering of tables made a huge difference in performance in most cases. I still believe that it makes a difference on many systems.
When talking about natural keys versus dumb keys (the subject of the OP), clustering is a relevant consideration because natural keys provide a viable means of clustering dependent tables with composite PK's. It makes no sense to cluster a single column PK with a generated or random number. Using natural keys provides a candidate clustering index that theoretically improves performance (the degree to which varies) and therefore it is one advantage of natural keys that should be mentioned.
Serge Rielau - 22 Jul 2006 13:13 GMT > I'm building a web based auction system for multiple clients. I have > auctions, items, and invoices I need to assign generated keys for. Since [quoted text clipped - 22 lines] > The same question, only increased in difficulty occurs when I have many > rows to add at once with one INSERT statement. Many questions answered by Mark. I'll just add delta: Don't use MAX. Sequences and IDENTITY were invented to get around the limitations of MAX. My recommendation is to use a sequence such as: SELECT pk FROM NEW TABLE(INSERT INTO T INCLUDE (i INTEGER) SELECT NEXT VALUES FOR S1 AS PK, V.* FROM VALUES(CAST(? AS INT), CAST(? AS DOUBLE), 1), (?, ?, 2), ....) AS V(c1, c2, i) ORDER BY i;
Now you also know which input goes with which sequence.
Cheers Serge
 Signature Serge Rielau DB2 Solutions Development IBM Toronto Lab
IOD Conference http://www.ibm.com/software/data/ondemandbusiness/conf2006/
|
|
|