Database Forum / General DB Topics / DB Theory / April 2008
implementing a database log
|
|
Thread rating:  |
Christoph Rupp - 21 Apr 2008 20:44 GMT Hi,
i'm the author of a simple open-source database engine. i'm currently adding recovery/logging, and i have some practical questions about the implementation.
I have 3 options.
1. a physical log based on modified pages - whenever a page is modified, a before/after image is stored in the log. pro: easy to implement and to test con: logfile will become huge!
2. a physical log based on modified bytes - whenever a database key or record is modified, only the modified bytes in the file are logged. pro: logfile is slim, very fine-grained control con: tedious to implement, error prone, difficult to test
3. a logical logging - information about the high-level operation is stored in the logfile. honestly, i don't know how i could implement this, since one high-level operation is often splitted into several physical operations, and if one of them fails i am not sure about the consequences...
currently, i'm going with 2), but as i wrote above, it's hard to implement and i'm not happy with the way i'm going.
what would you recommend?
thanks for your opinions, chris
Brian Selzer - 21 Apr 2008 22:00 GMT > Hi, > [quoted text clipped - 24 lines] > > what would you recommend? Why not go with #4:
4. a physical log based on modified rows. Whenever a row is modified, added or removed, it is logged. Then you could also implement row versioning--just add a row version field to the physical rows. I believe that this what snapshot isolation is built on.
> thanks for your opinions, > chris Christoph Rupp - 21 Apr 2008 22:58 GMT Brian,
> Why not go with #4: > > 4. a physical log based on modified rows. Whenever a row is modified, added > or removed, it is logged. Then you could also implement row > versioning--just add a row version field to the physical rows. I believe > that this what snapshot isolation is built on. It's not an SQL database, i don't even have the notion of "rows", but basically i think your #4 is the same as my #1 or #2.
My problem is that one single insert can be split in up to 3 or even more file operations, depending on the size of the B+Tree index (it has to be split, if a page overflows, etc). I don't have an atomic "insert(key, value)" operation.
And in that case, i have to log 3 (or more) different file events. And that brings me back to my question - do i log the whole modified page (usually a page is between 16kb and 64kb), or just the modified region in the page...
Brian Selzer - 21 Apr 2008 23:58 GMT > Brian, > [quoted text clipped - 8 lines] > It's not an SQL database, i don't even have the notion of "rows", but > basically i think your #4 is the same as my #1 or #2. No, it isn't. #1 requires the logging of additional records that may not have been affected by an update. #2 doesn't log the entire changed record, but only bits and pieces. I would think that limiting the units of change to individual records--entire records--would simplify the process of marking and isolating units of work while at the same time guaranteeing consistency.
> My problem is that one single insert can be split in up to 3 or even > more file operations, depending on the size of the B+Tree index (it > has to be split, if a page overflows, etc). I don't have an atomic > "insert(key, value)" operation. Perhaps you should focus on that, then.
> And in that case, i have to log 3 (or more) different file events. And > that brings me back to my question - do i log the whole modified page > (usually a page is between 16kb and 64kb), or just the modified region > in the page... Again, I say neither. You should log only those records that have changed: discard the notion of logging pages or segments of pages--instead, log records or sets of records. It is not enough in a concurrent environment to reconstruct a snapshot of the physical media at a particular point in time. An update may encompass many file operations; therefore, a reconstruction of just the snapshot will almost certainly be inconsistent and will require rolling back or rolling forward the result of any transactions that were outstanding at the point in time of the snapshot.
David BL - 22 Apr 2008 07:53 GMT > > Brian, > [quoted text clipped - 14 lines] > to individual records--entire records--would simplify the process of marking > and isolating units of work while at the same time guaranteeing consistency. I don't think an atomic unit of work is always associated with a change to an individual record. Are you suggesting transactions to define arbitrarily large units of work aren't needed?
Typically a log will contain special log records to mark transaction boundaries. Many systems allow interleaving of transaction log records by using a transaction id in each log record. Each transaction is normally expected to end with a commit or abort log record.
This is irrespective of whether physical or logical changes are logged.
Brian Selzer - 22 Apr 2008 16:39 GMT >> > Brian, >> [quoted text clipped - 23 lines] > change to an individual record. Are you suggesting transactions to > define arbitrarily large units of work aren't needed? No, that's not what I'm suggesting. What I'm suggesting is that the atomic unit of work should be a /set/ of /records/--either the old records in the case of a before image or the new records in the case of an after image.
> Typically a log will contain special log records to mark transaction > boundaries. Many systems allow interleaving of transaction log [quoted text clipped - 4 lines] > This is irrespective of whether physical or logical changes are > logged. David BL - 23 Apr 2008 06:54 GMT > >> "Christoph Rupp" <cruppst...@gmail.com> wrote in message > [quoted text clipped - 31 lines] > unit of work should be a /set/ of /records/--either the old records in the > case of a before image or the new records in the case of an after image. Ok but that sounds like the system snapshots an entire table in the before/after images.
For efficiency one would expect to only store the set of records that have been added and the set that have been removed by a given transaction. It is easy to see how we get the inverse which is required for roll back of an uncommitted transaction during recovery. However these logical operations aren't idempotent (at least for bags of records). How does recovery deal with non-idempotent redo()/ undo() changes in the log?
Brian Selzer - 24 Apr 2008 15:11 GMT >> >> "Christoph Rupp" <cruppst...@gmail.com> wrote in message >> [quoted text clipped - 49 lines] > of records). How does recovery deal with non-idempotent redo()/ > undo() changes in the log? Why would there be bags of records? At the physical level, each record has a specific offset in some file, and that offset would uniquely identify it. Why would you strip off that identification? Consequently, there wouldn't be bags of records, only sets of records.
If you start out with a set of records and you know what is to be inserted, updated, and deleted, you can compute the resulting set of records. If you start out with the resulting set of records, and you know what was inserted, updated and deleted in order to arrive at that result, then you can compute the original set of records. For simplicity, even if it isn't necessarily the most efficient, what is updated could be implemented as a set of ordered pairs of records, tying each original record to its replacement. So the log would consist of a sequence of triples (D, U, I) separated by transaction markers where D is a set of records that were deleted, U is a set of pairs of records that were updated, and I is a set of records that were inserted.
Now, provided that the log is written before the database--that is, (1) write the triple to the log, (2) write the database, (3) write the transaction marker in the log--, it should be possible to determine whether or not what was written to the log actually made it into the database, and thus it should be possible to roll back any uncommitted transaction.
Christoph Rupp - 25 Apr 2008 10:13 GMT > If you start out with a set of records and you know what is to be inserted, > updated, and deleted, you can compute the resulting set of records. If you [quoted text clipped - 12 lines] > or not what was written to the log actually made it into the database, and > thus it should be possible to roll back any uncommitted transaction. what about page splits in the index tree? and page merges? You would need special log entries for this, and therefore special rollback mechanisms to undo a page split/merge.
Brian Selzer - 25 Apr 2008 21:50 GMT >> If you start out with a set of records and you know what is to be >> inserted, [quoted text clipped - 27 lines] > need special log entries for this, and therefore special rollback > mechanisms to undo a page split/merge. Not necessarily. Once you've identified which records might have been affected by uncommitted transactions, you could determine which index pages could have been affected and check them for consistency, repairing or rebuilding them if necessary. Once you know which index pages might have been affected and thus which key values, a scan through the table could yield the current offsets of each record associated with an affected key value, making it possible to verify and if necessary reconstruct only the affected portion of the indexes. It would certainly take more processing power during recovery than a brute-force method such as recording all index pages affected by a transaction, but with much less log storage requirements and therefore better throughput for normal operations.
David BL - 28 Apr 2008 03:50 GMT > >> "David BL" <davi...@iinet.net.au> wrote in message > [quoted text clipped - 58 lines] > Why would you strip off that identification? Consequently, there wouldn't > be bags of records, only sets of records. One could say the log records are representing physical changes if they record the physical locations.
> If you start out with a set of records and you know what is to be inserted, > updated, and deleted, you can compute the resulting set of records. If you [quoted text clipped - 12 lines] > or not what was written to the log actually made it into the database, and > thus it should be possible to roll back any uncommitted transaction. A modern DBMS using WAL will tend to use a "lazy writer" that writes dirty pages to disk in the background. For performance this will tend to write dirty pages in a physical order so the disk head moves uniformly over the platter. This assumes the only constraint between the writing to the log and the dirty pages is the WAL constraint - ie dirty pages must be written to disk strictly after the associated log records have been written to disk. To meet this constraint the lazy writer simply needs to ignore dirty pages that haven't (yet) been logged.
Your suggestion brings far more onerous constraints on the disk writing policies. In fact to make a transaction durable it seems necessary to write all the dirty pages to disk as part of the transaction. This is certainly not normally the case in a modern DBMS. A database that avoids a "force" policy requires REDO during recovery.
Often a DBMS that supports long running transactions will allow dirty pages to be written to disk even though the associated transaction hasn't yet committed. This is the so called "steal" policy and leads to the need for UNDO during recovery.
AFAIK most databases have very flexible disk writing policies (steal/ no-force) and require both UNDO and REDO during recovery. See the ARIES algorithm.
Brian Selzer - 28 Apr 2008 05:25 GMT >> >> "David BL" <davi...@iinet.net.au> wrote in message >> [quoted text clipped - 111 lines] > Your suggestion brings far more onerous constraints on the disk > writing policies. No it doesn't. It only requires that the data be written to disk before the transaction marker is written to the log. Whether the disk writes are queued up to be written lazily is a separate issue.
> In fact to make a transaction durable it seems > necessary to write all the dirty pages to disk as part of the [quoted text clipped - 6 lines] > hasn't yet committed. This is the so called "steal" policy and leads > to the need for UNDO during recovery. The only way this can happen is if what was replaced as a result of the writes is cached while the transactions are outstanding. In order to improve performance in the presence of long-running transactions, it may be necessary to maintain that cache on disk. I should emphasize here that such caching is strictly a mechanism for improving performance.
> AFAIK most databases have very flexible disk writing policies (steal/ > no-force) and require both UNDO and REDO during recovery. See the > ARIES algorithm. David BL - 28 Apr 2008 07:07 GMT > >> "David BL" <davi...@iinet.net.au> wrote in message > [quoted text clipped - 119 lines] > transaction marker is written to the log. Whether the disk writes are > queued up to be written lazily is a separate issue. When is the transaction deemed to have committed? I was assuming your transaction marker was used to commit the transaction. If not what is it for? Check pointing? I think there are better ways to check point than that. AFAIK there is no need to check point *individual* transactions with special markers.
> > In fact to make a transaction durable it seems > > necessary to write all the dirty pages to disk as part of the [quoted text clipped - 12 lines] > necessary to maintain that cache on disk. I should emphasize here that such > caching is strictly a mechanism for improving performance. With a general purpose DBMS using strict 2PL there is inevitably a chance of dead-lock and the need to abort transactions. This in turn requires UNDO of all changes made by a particular transaction. You can either record before images of pages of uncommitted transactions, or else use the log directly to roll back a transaction. The latter is economical with caching of the tail of the log and backward chaining of log records for a given transaction.
> > AFAIK most databases have very flexible disk writing policies (steal/ > > no-force) and require both UNDO and REDO during recovery. See the > > ARIES algorithm. Brian Selzer - 28 Apr 2008 14:25 GMT >> >> "David BL" <davi...@iinet.net.au> wrote in message >> [quoted text clipped - 147 lines] > point than that. AFAIK there is no need to check point *individual* > transactions with special markers. The transaction marker is used to indicate that the transaction is complete. What that means is that the changes that are called for in the triple have been written to the database. It's not enough to simply write the log and then write the database, because without the marker following the triple in the log, there is no way to be sure that the changes called for in the triple actually made it into the database.
>> > In fact to make a transaction durable it seems >> > necessary to write all the dirty pages to disk as part of the [quoted text clipped - 22 lines] > is economical with caching of the tail of the log and backward > chaining of log records for a given transaction. Ideally, the dbms engine would serialize physical transactions. The before images of uncommitted transactions could be read by other transactions instead of the database during writes to the database. Since what is written to the log is just the changes required by a particular transaction, it should be possible to cache the changes required by other transactions while each is committed in turn. Note that by 'serialize physical transactions' I'm not referring to transaction isolation levels, which involve how information is read, but instead how information is written.
>> > AFAIK most databases have very flexible disk writing policies (steal/ >> > no-force) and require both UNDO and REDO during recovery. See the >> > ARIES algorithm. David BL - 29 Apr 2008 03:49 GMT > >> "David BL" <davi...@iinet.net.au> wrote in message > [quoted text clipped - 158 lines] > the log, there is no way to be sure that the changes called for in the > triple actually made it into the database. Ok so it relates to check pointing. I don't see how these markers are particularly useful - ie to determine where to start replaying the log during recovery. For example, when ARIES performs a check point it records 1) the active transaction list 2) last LSN (Log Sequence Number) for each active transaction 3) the set of dirty pages. 4) for each dirty page, the LSN of the earliest log record whose effect is not reflected in the page on disk.
This allows the recovery scan to determine where to start replaying the log (in the REDO pass).
> >> > In fact to make a transaction durable it seems > >> > necessary to write all the dirty pages to disk as part of the [quoted text clipped - 31 lines] > transactions' I'm not referring to transaction isolation levels, which > involve how information is read, but instead how information is written. Is this for MVCC?
Eric - 29 Apr 2008 10:08 GMT > .... > Ok so it relates to check pointing. I don't see how these markers are [quoted text clipped - 9 lines] > This allows the recovery scan to determine where to start replaying > the log (in the REDO pass). No, I think it's about being able to recover what happened between the last completed checkpoint and the end of the log.
E.
David BL - 29 Apr 2008 11:15 GMT > > .... > > Ok so it relates to check pointing. I don't see how these markers are [quoted text clipped - 12 lines] > No, I think it's about being able to recover what happened between the > last completed checkpoint and the end of the log. Actually with ARIES it's often necessary to recover from a position before the last completed check point!
ARIES recovery contains three passes: 1) analysis (forwards from last check point) 2) redo (forwards from a position calculated by the analysis) 3) undo (backwards from the end of the log).
For efficient check pointing ARIES doesn't force all dirty pages to disk as part of the check point. Therefore on recovery ARIES often needs to begin the redo pass from a log position *before* the last valid check point.
ARIES calculates the start point of the redo pass by taking the minimum of the "recovery LSNs" of the dirty pages recorded in the checkpoint.
Brian Selzer - 29 Apr 2008 17:31 GMT >> >> "David BL" <davi...@iinet.net.au> wrote in message >> [quoted text clipped - 195 lines] > This allows the recovery scan to determine where to start replaying > the log (in the REDO pass). That's not what the markers are for. In fact, during recovery only the last marker really matters.
>> >> > In fact to make a transaction durable it seems >> >> > necessary to write all the dirty pages to disk as part of the [quoted text clipped - 40 lines] > > Is this for MVCC? Not exactly. While it should be possible to have several outstanding transactions, it makes sense to me to allow only one to commit (at least physically) at a time. If you have two transactions, they can either (1) be completely independent of each other so that it shouldn't matter which transaction commits first, or (2) interact with each other so that the order in which they commit can result in different database states.
It should be possible to extend this to involve any number of outstanding transactions so that a sequence of writes to the log and the database can be determined.
Christoph Rupp - 22 Apr 2008 08:59 GMT Brian,
thanks for the clarification. I see your point now.
> Again, I say neither. You should log only those records that have changed: > discard the notion of logging pages or segments of pages--instead, log > records or sets of records. So your suggestion is that the log file would contain information like:
RECORD at page 13, offset 2: old key = "age", value = "30" -> modified, new key value = "40" etc
and during recovery, i repeat these operations (or undo them)?
> It is not enough in a concurrent environment to > reconstruct a snapshot of the physical media at a particular point in time. > An update may encompass many file operations; therefore, a reconstruction of > just the snapshot will almost certainly be inconsistent and will require > rolling back or rolling forward the result of any transactions that were > outstanding at the point in time of the snapshot. I actually always thought that this was how recovery works - i think i have a neat idea for a checkpoint algorithm; my logfiles will never become very large, i hope. Rolling back/forward was something i was sure i had to do...
Thanks Christoph
paul c - 22 Apr 2008 00:56 GMT > Hi, > [quoted text clipped - 27 lines] > thanks for your opinions, > chris Several advantages of logical logging, not the only ones, I'm sure:
1) Likely more insensitive to future changes in physical organization/operations.
2) Potential to reconstruct old db values with additional constraints applied (I don't know how many times this would have avoided much more awkward surprise requirements).
3) Ability to shadow via another physical db, eg., for read-only audit or various performance requirements (my favourite advantage, maybe most people would favour #1.)
David BL - 22 Apr 2008 08:04 GMT > Several advantages of logical logging, not the only ones, I'm sure: > [quoted text clipped - 8 lines] > or various performance requirements (my favourite advantage, maybe most > people would favour #1.) There are also disadvantages to logical logging
a) logical changes aren't always idempotent
b) there are many forms of logical changes, whereas physical changes to a page are always assumed to simply be assignments to a range of bytes within the page.
Both of these create a lot of complexity. I think physical logging is much easier to implement.
Christoph Rupp - 22 Apr 2008 09:12 GMT Paul,
thanks for your comments
> > Hi, > [quoted text clipped - 40 lines] > or various performance requirements (my favourite advantage, maybe most > people would favour #1.) I totally agree that logical logging is easy to implement (just dump some function calls to the log), and all your points are valid.
i.e. if there's an insert(a, b), just write the information to the logfile, and then during recovery repeat this operation. but that's also my problem: my code base is not able to repeat an operation if the database is corrupt. (and i don't think it has to work with a corrupted database).
so i don't know how to implement this. i'd be curious if any database uses logical logging - i think it's nice in theory, but just too difficult to implement...
paul c - 22 Apr 2008 16:45 GMT ...
> I totally agree that logical logging is easy to implement (just dump > some function calls to the log), and all your points are valid. > ... I'd say whether it is easy depends on the purpose - if that is the traditional recovery from physical errors it can be tricky.
Christoph Rupp - 22 Apr 2008 16:57 GMT > ... > [quoted text clipped - 4 lines] > I'd say whether it is easy depends on the purpose - if that is the > traditional recovery from physical errors it can be tricky. yes, my post was misleading. I actually wanted to write that it _sounds_ easy. But as i wrote afterwards, i don't know how i could implement it in my existing code base, for a traditional recovery.
|
|
|