Database Forum / DB2 Topics / January 2007
deleting large numbers of records
|
|
Thread rating:  |
Frank Swarbrick - 24 Jan 2007 21:59 GMT We have a batch process that inserts large numbers (100,000 - 1,000,000) of records into a database each day. (DL/I database.) We're considering converting it to a DB2 table. Currently we have logic in place that, prior to inserting any data, reads the first input record and checks to see if it already exists in the table. If the record already exists there are two options: 1) Don't continue, because you already ran this job today! 2) This is a rerun - continue.
If number 2 is selected the first thing that happens is that it deletes every record that was inserted today prior to doing the regular insert process. (You may ask, why not just skip over the ones that are already there. It's because we may be rerunning with an updated input file, where the input records may be different than during the first run.)
Anyway, I figured with DB2 this would be a snap. I'll I'd need to do is: EXEC SQL DELETE FROM FILM.FILM_TRANSACTIONS WHERE UPDATE_DATE = FB_FUNC.TO_DATE(:CURR-DATE-JUL-PACKED) END-EXEC
The only problem is that my log file would end up running out of room. So now I've come up with the following:
DELETE-TODAY SECTION. DISPLAY 'DELETE PROCESS BEGINS' UPON CONSOLE PERFORM WITH TEST AFTER UNTIL SQLCODE = 100 DISPLAY 'COMMITTING...' UPON CONSOLE PERFORM COMMIT-UOW DISPLAY 'DELETING 10000' UPON CONSOLE PERFORM DB2-DELETE-TODAY END-PERFORM PERFORM COMMIT-UOW DISPLAY 'DELETE PROCESS ENDS' UPON CONSOLE . DB2-DELETE-TODAY SECTION. EXEC SQL DELETE FROM ( SELECT UPDATE_DATE FROM FILM.FILM_TRANSACTIONS WHERE UPDATE_DATE = FB_FUNC.TO_DATE(:CURR-DATE-JUL-PACKED) FETCH FIRST 10000 ROWS ONLY ) WHERE 1 = 1 END-EXEC CALL CHECKERR USING SQLCA ERRLOC .
My question is, is this the way to go or is there some better way?
I tried making the "10000" a host variable, but that didn't work. Any way around this?
You may wondering why I put the "WHERE 1 = 1" clause on the DELETE statement. This is because DB2 gives a warning if you pre-compile a DELETE or UPDATE statement without a WHERE clause. Still works, but I like to avoid warnings.
Thanks! Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
Mark A - 24 Jan 2007 23:36 GMT > We have a batch process that inserts large numbers (100,000 - 1,000,000) > of [quoted text clipped - 67 lines] > Senior Developer/Analyst - Mainframe Applications > FirstBank Data Corporation - Lakewood, CO USA Yes there is a better way that will avoid filling up your DB2 z/OS logs.
Declare a cursor with something like: SELECT UPDATE_DATE FROM FILM.FILM_TRANSACTIONS WHERE UPDATE_DATE = FB_FUNC.TO_DATE(:CURR-DATE-JUL-PACKED) FOR UPDATE WITH HOLD
OPEN cursor-name.
Then FETCH one row at a time (until RC = 100).
DELETE WHERE CURRENT OF C1 (cursor name).
COMMIT. You could commit every 100 - 10000 if you don't want to commit each row.
CLOSE cursor-name.
Note that the WITH HOLD will retain the cursor position even though you have committed (which usually closes a cursor).
Please consult the SQL Reference for details, because I am posting all the above syntax from (my) memory.
Also, you find that there is very little increase in performance in committing more often than every 100 updates. Committing more often will provide increased concurrency with other applications.
Mark A - 24 Jan 2007 23:44 GMT > Also, you find that there is very little increase in performance in > committing more often than every 100 updates. Committing more often will > provide increased concurrency with other applications. Correction, you find that there is very little increase in performance in committing LESS often than every 100 updates.
jefftyzzer - 25 Jan 2007 00:54 GMT I think both solutions have merit. Frank, your version is very much like the approach outlined in Serge's "SQL on Fire" seminars, albeit in COBOL as opposed to SQL/PL. Mark's solution will use less log space than yours, but I respectfully wonder about the time required to delete all of the target rows when they're being fetched and deleted one-at-a-time.
FWIW, here's how I've done it (in a stored procedure):
CREATE PROCEDURE CSE.PURGE_LRD() BEGIN
DECLARE V_NO_DATA SMALLINT DEFAULT 0;-- DECLARE V_DEADLOCK_OR_LTO SMALLINT DEFAULT 0;--
DECLARE C_DEADLOCK_OR_LTO CONDITION FOR SQLSTATE '40001';--
DECLARE CONTINUE HANDLER FOR NOT FOUND SET V_NO_DATA = 1;--
-- The V_DEADLOCK_OR_LTO attribute is throw-away, -- but a continue handler needs to do something, -- i.e., it's not enough to just declare a handler, -- it has to have an action in its body. DECLARE CONTINUE HANDLER FOR C_DEADLOCK_OR_LTO SET V_DEADLOCK_OR_LTO = 1;--
WHILE (V_NO_DATA = 0) DO DELETE FROM ( SELECT 1 FROM LRD FETCH FIRST 200000 ROWS ONLY ) AS LRD_D;--
COMMIT;-- END WHILE;-- END;
--Jeff
> > Also, you find that there is very little increase in performance in > > committing more often than every 100 updates. Committing more often will > > provide increased concurrency with other applications.Correction, you find that there is very little increase in performance in > committing LESS often than every 100 updates. Frank Swarbrick - 25 Jan 2007 01:46 GMT jefftyzzer<jefftyzzer@sbcglobal.net> 01/24/07 5:54 PM >>>
>I think both solutions have merit. Frank, your version is very much >like the approach outlined in Serge's "SQL on Fire" seminars, albeit in >COBOL as opposed to SQL/PL. Mark's solution will use less log space >than yours, but I respectfully wonder about the time required to delete >all of the target rows when they're being fetched and deleted >one-at-a-time. That was my concern as well. In any case, I will give both methods a shot and see which I prefer.
>FWIW, here's how I've done it (in a stored procedure): > [quoted text clipped - 29 lines] > END WHILE;-- >END; My probably very naive thought is that it would be nice to have some sort of DELETE statement that didn't even do logging. While often (usually?) you would want to be able to ROLLBACK a DELETE, in the case of what I'm doing there's no reason I would ever want to rollback. So why log? Just wondering...
Thanks!
Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
Knut Stolze - 25 Jan 2007 07:41 GMT > jefftyzzer<jefftyzzer@sbcglobal.net> 01/24/07 5:54 PM >>> >>I think both solutions have merit. Frank, your version is very much [quoted text clipped - 47 lines] > there's no reason I would ever want to rollback. So why log? Just > wondering... The logs are also used for crash recovery. Let's assume you run the unlogged DELETE. Now your application or the DB2 server crashes before you issued a COMMIT. Upon restart, DB2 has to make sure the transaction is properly rolled back and the database is in a consistent state. If you don't log the DELETE, you are out of luck there.
What would be nice to have in this respect is an option for the DELETE statement to explicitly turn off logging - which would have a certain amount of problems as I just mentioned. Truncating a whole table is supported that way already: you can use ALTER TABLE ... ACTIVATE NOT LOGGED INITIALLY WITH EMPTY TABLE for that. Maybe this, combined with range partitioning is an option for you?
 Signature Knut Stolze DB2 z/OS Utilities Development IBM Germany
Frank Swarbrick - 25 Jan 2007 16:39 GMT Knut Stolze<stolze@de.ibm.com> 01/25/07 12:41 AM >>>
>> My probably very naive thought is that it would be nice to have some sort >> of [quoted text clipped - 8 lines] >properly rolled back and the database is in a consistent state. If you >don't log the DELETE, you are out of luck there. Are you saying that DB2 occasionally crashes?
:-) (Just kidding.) As you can tell, I'm hopelessly naive about these things. I'm just a simple application programmer.
>What would be nice to have in this respect is an option for the DELETE >statement to explicitly turn off logging - which would have a certain >amount of problems as I just mentioned. Truncating a whole table is >supported that way already: you can use ALTER TABLE ... ACTIVATE NOT LOGGED >INITIALLY WITH EMPTY TABLE for that. Maybe this, combined with range >partitioning is an option for you? I don't know about range partitioning. Can you give me a pointer to some information on this?
Thanks!
Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
Knut Stolze - 25 Jan 2007 21:23 GMT >>The logs are also used for crash recovery. Let's assume you run the >>unlogged DELETE. Now your application or the DB2 server crashes before [quoted text clipped - 5 lines] > Are you saying that DB2 occasionally crashes? > :-) (Just kidding.) I can't really comment on that. DB2 crashes quite often in my environment - sometimes on purpose, sometimes not. If not, then it is usually due to my (wrong) code changes, of course. ;-)
Anyway, just think of someone tripping over the power cable or using Windows as OS...
>>What would be nice to have in this respect is an option for the DELETE >>statement to explicitly turn off logging - which would have a certain [quoted text clipped - 6 lines] > I don't know about range partitioning. Can you give me a pointer to some > information on this? I guess Serge is the most knowledgeable about this. In a nutshell: you have one logical table that is internally stored as multiple physical tables. DB2 will distribute your data across those physical tables. For that, it needs some criteria/algorithm for the distribution. With range partitioning, you define ranges and a value in a row that fits into one range goes into the physical table for that range. During query time, the DB2 optimizer will analyze the query and if it finds that the query searches on ranges, it can eliminate scanning some/most of the physical tables, for instance.
Another side effect is that you have now (V9) ALTER TABLE ... ATTACH PARTITION and ALTER TABLE ... DETACH PARTITION SQL statements. (http://publib.boulder.ibm.com/infocenter/db2luw/v9/topic/com.ibm.db2.udb.admin.d oc/doc/r0000888.htm) Essentially, those statements switch a regular base table to such a mentioned physical table and group it to the logical table - or vice versa. Thus, you can roll-in and roll-out ranges of a table with a single SQL statement.
If you can partition your table according to your deletion criteria, you can detach the internal, physical table holding the data you want to remove. It becomes a regular table, which you can drop.
p.s: I hope I didn't screw up too much on the terminology.
 Signature Knut Stolze DB2 z/OS Utilities Development IBM Germany
andyhe - 26 Jan 2007 16:09 GMT > What would be nice to have in this respect is an option for the DELETE > statement to explicitly turn off logging - which would have a certain > amount of problems as I just mentioned. Truncating a whole table is > supported that way already: you can use ALTER TABLE ... ACTIVATE NOT LOGGED > INITIALLY WITH EMPTY TABLE for that. Maybe this, combined with range > partitioning is an option for you? What 'activate not logged initially' does (without the 'with empty table' option) is that it suspends logging until the next commit. So you CAN use it without partitioning. Just activate not logged initially, do your delete and commit. Beware of any errors (already mentioned)! If for some reason your delete fails, db2 cannot roll back and the table becomes invalid. Also, if you have to rollforward after a restore, there is no way to rollforward over a 'not logged' transaction, your table will be invalid after rollforward. We have used this feature for a one-time extension of a very big table, but took precautions before doing it: backup AND snapshot. :-) By the way: emptying a complete table is very easy with an import from /dev/null (on unix).
ChrisC - 25 Jan 2007 16:27 GMT Another option might be to create the table as an MDC (MultiDimensional Cluster) table, clustered on UPDATE_DATE, and then turn on the option MDC ROLLOUT (not sure of exact syntax). This should then allow you to just delete all the rows for that date. According to the literature, it should just mark each block (set of pages) for that cluster as deleted, log each block as being deleted, and commit. If you think about this, it should delete the 1,000,000 or so rows very quickly and NOT fill up your log files.
Disclaimer - I haven't (yet) been able to use this myself, so no actual experience here. But if this is a new DB2 table/DB, it might be a great time to check this feature out.
-Chris
Frank Swarbrick - 25 Jan 2007 01:39 GMT Oh! So every 10,000 actually does make sense then, right?
>>> Mark A<nobody@nowhere.com> 01/24/07 4:44 PM >>> "Mark A" <nobody@nowhere.com> wrote in message news:pOSdnQU31-zgcCrYnZ2dnUVZ_sWdnZ2d@comcast.com...
> Also, you find that there is very little increase in performance in > committing more often than every 100 updates. Committing more often will > provide increased concurrency with other applications. Correction, you find that there is very little increase in performance in committing LESS often than every 100 updates.
Mark A - 25 Jan 2007 02:35 GMT > Oh! So every 10,000 actually does make sense then, right? No. I would commit every 100 - 1000 rows. A commit is not that expensive, unless you do it for every row. Doing it every 100 rows or 10,000 will provide about a 2% difference in performance at best.
The major expense of a commit is a synchronous write of the log buffer to disk. But the log buffer will be flushed to disk anyway whenever the log buffer is full, or about every 1 second. In any case, a commit is usually only as expensive as a few deletes, so do the math.
Knut Stolze - 25 Jan 2007 08:37 GMT > The major expense of a commit is a synchronous write of the log buffer to > disk. Not necessarily if group commit is used.
 Signature Knut Stolze DB2 z/OS Utilities Development IBM Germany
Frank Swarbrick - 25 Jan 2007 01:38 GMT Mark A<nobody@nowhere.com> 01/24/07 4:36 PM >>>
>Yes there is a better way that will avoid filling up your DB2 z/OS logs. Actually, DB2/LUW, but I'm guessing your advice still applies.
>Declare a cursor with something like: >SELECT UPDATE_DATE FROM FILM.FILM_TRANSACTIONS [quoted text clipped - 8 lines] > >COMMIT. You could commit every 100 - 10000 if you don't want to commit each
>row. > >CLOSE cursor-name. Interesting. I just figured that this would be much less efficient than doing just the delete with the fullselect, because in the latter case no data need be returned to the AR. Anyway, I will give it a shot.
>Note that the WITH HOLD will retain the cursor position even though you have >committed (which usually closes a cursor). > >Please consult the SQL Reference for details, because I am posting all the
>above syntax from (my) memory. > >Also, you find that there is very little increase in performance in >committing more often than every 100 updates. Committing more often will >provide increased concurrency with other applications. Will keep that in mind. Thanks!
Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
Mark A - 25 Jan 2007 02:40 GMT > Interesting. I just figured that this would be much less efficient than > doing just the delete with the fullselect, because in the latter case no > data need be returned to the AR. Anyway, I will give it a shot. It depends on where the program runs. If the program runs on server and the static SQL is bound into a package that runs on the server, then there is not that much difference in performance (unless performance is ultra-critical). If the program is running remotely, then there would be a big difference in performance.
I have written SQL stored procedures to do mass deletes with a cursor and it performs well.
I haven't seen to many COBOL programs running on DB2 LUW. What compiler are you using? I used MicroFocus COBOL against OS/2 Database Manager, but that was in 1991.
Frank Swarbrick - 25 Jan 2007 17:07 GMT Mark A<nobody@nowhere.com> 01/24/07 7:40 PM >>>
>> Interesting. I just figured that this would be much less efficient than >> doing just the delete with the fullselect, because in the latter case no >> data need be returned to the AR. Anyway, I will give it a shot. > >It depends on where the program runs. If the program runs on server and the
>static SQL is bound into a package that runs on the server, then there is >not that much difference in performance (unless performance is >ultra-critical). If the program is running remotely, then there would be a
>big difference in performance. > >I have written SQL stored procedures to do mass deletes with a cursor and it >performs well. > >I haven't seen to many COBOL programs running on DB2 LUW. What compiler are
>you using? I used MicroFocus COBOL against OS/2 Database Manager, but that
>was in 1991. We're doing it in kind of an odd way. And for now, we're only testing. We are using "DB2 Server for VSE" as the client, with the IBM COBOL for VSE/ESA compiler. But all of our databases are remote databases on DB2/LUW.
So we definitely fall in to the category of a remote client, not a client running on the server.
But to answer your question anyway, I have been successful using both Micro Focus Net Express (COBOL) 5.0 as well as OpenCobol 0.33 to access DB2/LUW databases.
Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
Mark A - 26 Jan 2007 00:45 GMT > We're doing it in kind of an odd way. And for now, we're only testing. > We [quoted text clipped - 11 lines] > > Frank Given the above, I would create an SQL stored procedure to do the deletes. It will run on the LUW server (you can call it from a remote client with any parms you want) and it should perform quite well.
Frank Swarbrick - 26 Jan 2007 01:29 GMT Mark A<nobody@nowhere.com> 01/25/07 5:45 PM >>>
>> We're doing it in kind of an odd way. And for now, we're only testing. >> We [quoted text clipped - 13 lines] > >Given the above, I would create an SQL stored procedure to do the deletes.
>It will run on the LUW server (you can call it from a remote client with any >parms you want) and it should perform quite well. Sounds right. To be honest, we probably would not do that for this particular case, simply because it's a situation that occurs just this side of never. But I will keep it in mind. Thanks, Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
Knut Stolze - 25 Jan 2007 08:38 GMT > Mark A<nobody@nowhere.com> 01/24/07 4:36 PM >>> >>Yes there is a better way that will avoid filling up your DB2 z/OS logs. > > Actually, DB2/LUW, but I'm guessing your advice still applies. In that case, have you considered the MERGE statement? Then you may not have to DELETE the rows at all - just UPDATE them.
 Signature Knut Stolze DB2 z/OS Utilities Development IBM Germany
Frank Swarbrick - 25 Jan 2007 17:09 GMT Knut Stolze<stolze@de.ibm.com> 01/25/07 1:38 AM >>>
>> Mark A<nobody@nowhere.com> 01/24/07 4:36 PM >>> >>>Yes there is a better way that will avoid filling up your DB2 z/OS logs. [quoted text clipped - 3 lines] >In that case, have you considered the MERGE statement? Then you may not >have to DELETE the rows at all - just UPDATE them. Yet another thing I am not familiar with. I will look into it. Thanks.
Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
Art S. Kagel - 29 Jan 2007 19:54 GMT > Knut Stolze<stolze@de.ibm.com> 01/25/07 1:38 AM >>> > [quoted text clipped - 8 lines] > > Yet another thing I am not familiar with. I will look into it. Thanks. Just updating the non-key columns in each existing row (and inserting the missing row if the record was not in the original change set) will be FAR more efficient. The delete-then-add operations must update each index key twice - once to remove the original record's keys and again to readd the row's keys. This is very expensive.
If you can determine whether all (or at least most) operations will be replacing an existing row, and apparently you can, then just do the updates in place. My testing has found that this is ALWAYS faster to update than to delete-then-add and is faster than trying to insert the row and updating instead if the update fails for a unique key violation (ie already exists) when fewer than about 30% of the rows will have already pre-existed. When more than about 70% of the rows do not exist already it is usually faster to do the insert and update if the insert fails (the exact cutoff depends on the number of indexes, whether the update modifies indexed columns, the order the database engine chooses to check the unique keys for violations, etc.).
Obviously YMMV so you need to test it yourself.
Art S. Kagel
aj - 25 Jan 2007 14:10 GMT Frank: Here's an OLAPy trick that I sometimes use:
Let's say you want to delete rows from a very large table based on a sysdate column. You *don't* want to overfill the transaction logs
The answer: Figure out how many rows you can safely delete w/ your logs, use row_number() to slap a number on each one, and delete based not only on your sysdate, but also that number. Let's say you can safely delete up to 200000 rows, and you only want to delete rows where sysdate = 5/1/2005:
lock table mytable in exclusive mode ; delete FROM (SELECT sysdate , row_number() OVER (ORDER BY sysdate) AS rn FROM mytable) AS tr WHERE rn BETWEEN 1 and 200000 and sysdate = '5/1/2005' ; COMMIT ;
Stick this in a loop and stop when no rows get deleted anymore.
HTH
aj
> We have a batch process that inserts large numbers (100,000 - 1,000,000) of > records into a database each day. (DL/I database.) We're considering [quoted text clipped - 63 lines] > Senior Developer/Analyst - Mainframe Applications > FirstBank Data Corporation - Lakewood, CO USA Frank Swarbrick - 25 Jan 2007 17:12 GMT aj<ronald@mcdonalds.com> 01/25/07 7:10 AM >>>
>Frank: >Here's an OLAPy trick that I sometimes use: [quoted text clipped - 17 lines] > >Stick this in a loop and stop when no rows get deleted anymore. Sounds interesting. And brings up another question. Is there any way to dynamically determine how many rows I can delete w/o filling up the logs?
Lots of good responses to this. Thanks all!
Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
Gregor Kovač - 26 Jan 2007 07:21 GMT Hi!
I'd propose another method that involves EXPORT and a LOAD operation: - first you EXPORT the rows you want to preserve - then do a LOAD with REPLACE Example: 1.) EXPORT FROM SCHEMA.TABLE1 INTO TABLE1.IXF OF IXF SELECT * FROM SCHEMA.TABLE1 WHERE DATE BETWEEN CURRENT DATE AND CURRENT DATE - 3 DAYS 2.) LOAD FROM TABLE1.IXF OF IXF REPLACE INTO SCHEMA.TABLE1 After running this you'll probably have to do some SET INTEGRITY statements if you have RI.
I only know DB2 for LUW, but if your ADMIN_CMD procedure supports EXPORT and LOAD oparations then you could even do this via SQL procedure.
Best regards, Kovi
> We have a batch process that inserts large numbers (100,000 - 1,000,000) > of [quoted text clipped - 66 lines] > Senior Developer/Analyst - Mainframe Applications > FirstBank Data Corporation - Lakewood, CO USA
 Signature -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
| In A World Without Fences Who Needs Gates? | | Experience Linux. | -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
Frank Swarbrick - 26 Jan 2007 16:19 GMT Good thought, but with two years of data, of 100,000 to 1,000,000 records a day I don't think this would be very efficient! :-) (This is a transaction history file with two years of transactions.)
Thanks,
Frank
--- Frank Swarbrick Senior Developer/Analyst - Mainframe Applications FirstBank Data Corporation - Lakewood, CO USA
>>> Gregor Kova(Bè<gregor.kovac@mikropis.si> 01/26/07 12:21 AM >>> Hi!
I'd propose another method that involves EXPORT and a LOAD operation: - first you EXPORT the rows you want to preserve - then do a LOAD with REPLACE Example: 1.) EXPORT FROM SCHEMA.TABLE1 INTO TABLE1.IXF OF IXF SELECT * FROM SCHEMA.TABLE1 WHERE DATE BETWEEN CURRENT DATE AND CURRENT DATE - 3 DAYS 2.) LOAD FROM TABLE1.IXF OF IXF REPLACE INTO SCHEMA.TABLE1 After running this you'll probably have to do some SET INTEGRITY statements if you have RI.
I only know DB2 for LUW, but if your ADMIN_CMD procedure supports EXPORT and LOAD oparations then you could even do this via SQL procedure.
Best regards, Kovi
|
|
|