Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General DB Topics / DB Theory / April 2008

Tip: Looking for answers? Try searching our database.

implementing a database log

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Christoph Rupp - 21 Apr 2008 20:44 GMT
Hi,

i'm the author of a simple open-source database engine. i'm currently
adding recovery/logging, and i have some practical questions about the
implementation.

I have 3 options.

1. a physical log based on modified pages - whenever a page is
modified, a before/after image is stored in the log.
pro: easy to implement and to test
con: logfile will become huge!

2. a physical log based on modified bytes - whenever a database key or
record is modified, only the modified bytes in the file are logged.
pro: logfile is slim, very fine-grained control
con: tedious to implement, error prone, difficult to test

3. a logical logging - information about the high-level operation is
stored in the logfile. honestly, i don't know how i could implement
this, since one high-level operation is often splitted into several
physical operations, and if one of them fails i am not sure about the
consequences...

currently, i'm going with 2), but as i wrote above, it's hard to
implement and i'm not happy with the way i'm going.

what would you recommend?

thanks for your opinions,
chris
Brian Selzer - 21 Apr 2008 22:00 GMT
> Hi,
>
[quoted text clipped - 24 lines]
>
> what would you recommend?

Why not go with #4:

4. a physical log based on modified rows.  Whenever a row is modified, added
or removed, it is logged.  Then you could also implement row
versioning--just add a row version field to the physical rows.  I believe
that this what snapshot isolation is built on.

> thanks for your opinions,
> chris
Christoph Rupp - 21 Apr 2008 22:58 GMT
Brian,

> Why not go with #4:
>
> 4. a physical log based on modified rows.  Whenever a row is modified, added
> or removed, it is logged.  Then you could also implement row
> versioning--just add a row version field to the physical rows.  I believe
> that this what snapshot isolation is built on.

It's not an SQL database, i don't even have the notion of "rows", but
basically i think your #4 is the same as my #1 or #2.

My problem is that one single insert can be split in up to 3 or even
more file operations, depending on the size of the B+Tree index (it
has to be split, if a page overflows, etc). I don't have an atomic
"insert(key, value)" operation.

And in that case, i have to log 3 (or more) different file events. And
that brings me back to my question - do i log the whole modified page
(usually a page is between 16kb and 64kb), or just the modified region
in the page...
Brian Selzer - 21 Apr 2008 23:58 GMT
> Brian,
>
[quoted text clipped - 8 lines]
> It's not an SQL database, i don't even have the notion of "rows", but
> basically i think your #4 is the same as my #1 or #2.

No, it isn't.  #1 requires the logging of additional records that may not
have been affected by an update.  #2 doesn't log the entire changed record,
but only bits and pieces.  I would think that limiting the units of change
to individual records--entire records--would simplify the process of marking
and isolating units of work while at the same time guaranteeing consistency.

> My problem is that one single insert can be split in up to 3 or even
> more file operations, depending on the size of the B+Tree index (it
> has to be split, if a page overflows, etc). I don't have an atomic
> "insert(key, value)" operation.

Perhaps you should focus on that, then.

> And in that case, i have to log 3 (or more) different file events. And
> that brings me back to my question - do i log the whole modified page
> (usually a page is between 16kb and 64kb), or just the modified region
> in the page...

Again, I say neither.  You should log only those records that have changed:
discard the notion of logging pages or segments of pages--instead, log
records or sets of records.  It is not enough in a concurrent environment to
reconstruct a snapshot of the physical media at a particular point in time.
An update may encompass many file operations; therefore, a reconstruction of
just the snapshot will almost certainly be inconsistent and will require
rolling back or rolling forward the result of any transactions that were
outstanding at the point in time of the snapshot.
David BL - 22 Apr 2008 07:53 GMT
> > Brian,
>
[quoted text clipped - 14 lines]
> to individual records--entire records--would simplify the process of marking
> and isolating units of work while at the same time guaranteeing consistency.

I don't think an atomic unit of work is always associated with a
change to an individual record.   Are you suggesting transactions to
define arbitrarily large units of work aren't needed?

Typically a log will contain special log records to mark transaction
boundaries.  Many systems allow interleaving of transaction log
records by using a transaction id in each log record.  Each
transaction is normally expected to end with a commit or abort log
record.

This is irrespective of whether physical or logical changes are
logged.
Brian Selzer - 22 Apr 2008 16:39 GMT
>> > Brian,
>>
[quoted text clipped - 23 lines]
> change to an individual record.   Are you suggesting transactions to
> define arbitrarily large units of work aren't needed?

No, that's not what I'm suggesting.  What I'm suggesting is that the atomic
unit of work should be a /set/ of /records/--either the old records in the
case of a before image or the new records in the case of an after image.

> Typically a log will contain special log records to mark transaction
> boundaries.  Many systems allow interleaving of transaction log
[quoted text clipped - 4 lines]
> This is irrespective of whether physical or logical changes are
> logged.
David BL - 23 Apr 2008 06:54 GMT
> >> "Christoph Rupp" <cruppst...@gmail.com> wrote in message
>
[quoted text clipped - 31 lines]
> unit of work should be a /set/ of /records/--either the old records in the
> case of a before image or the new records in the case of an after image.

Ok but that sounds like the system snapshots an entire table in the
before/after images.

For efficiency one would expect to only store the set of records that
have been added and the set that have been removed by a given
transaction.  It is easy to see how we get the inverse which is
required for roll back of an uncommitted transaction during recovery.
However these logical operations aren't idempotent (at least for bags
of records).   How does recovery deal with non-idempotent redo()/
undo() changes in the log?
Brian Selzer - 24 Apr 2008 15:11 GMT
>> >> "Christoph Rupp" <cruppst...@gmail.com> wrote in message
>>
[quoted text clipped - 49 lines]
> of records).   How does recovery deal with non-idempotent redo()/
> undo() changes in the log?

Why would there be bags of records?  At the physical level, each record has
a specific offset in some file, and that offset would uniquely identify it.
Why would you strip off that identification?  Consequently, there wouldn't
be bags of records, only sets of records.

If you start out with a set of records and you know what is to be inserted,
updated, and deleted, you can compute the resulting set of records.  If you
start out with the resulting set of records, and you know what was inserted,
updated and deleted in order to arrive at that result, then you can compute
the original set of records.  For simplicity, even if it isn't necessarily
the most efficient, what is updated could be implemented as a set of ordered
pairs of records, tying each original record to its replacement.  So the log
would consist of a sequence of triples (D, U, I) separated by transaction
markers where D is a set of records that were deleted, U is a set of pairs
of records that were updated, and I is a set of records that were inserted.

Now, provided that the log is written before the database--that is, (1)
write the triple to the log, (2) write the database, (3) write the
transaction marker in the log--, it should be possible to determine whether
or not what was written to the log actually made it into the database, and
thus it should be possible to roll back any uncommitted transaction.
Christoph Rupp - 25 Apr 2008 10:13 GMT
> If you start out with a set of records and you know what is to be inserted,
> updated, and deleted, you can compute the resulting set of records.  If you
[quoted text clipped - 12 lines]
> or not what was written to the log actually made it into the database, and
> thus it should be possible to roll back any uncommitted transaction.

what about page splits in the index tree? and page merges? You would
need special log entries for this, and therefore special rollback
mechanisms to undo a page split/merge.
Brian Selzer - 25 Apr 2008 21:50 GMT
>> If you start out with a set of records and you know what is to be
>> inserted,
[quoted text clipped - 27 lines]
> need special log entries for this, and therefore special rollback
> mechanisms to undo a page split/merge.

Not necessarily.  Once you've identified which records might have been
affected by uncommitted transactions, you could determine which index pages
could have been affected and check them for consistency, repairing or
rebuilding them if necessary.  Once you know which index pages might have
been affected and thus which key values, a scan through the table could
yield the current offsets of each record associated with an affected key
value, making it possible to verify and if necessary reconstruct only the
affected portion of the indexes.  It would certainly take more processing
power during recovery than a brute-force method such as recording all index
pages affected by a transaction, but with much less log storage requirements
and therefore better throughput for normal operations.
David BL - 28 Apr 2008 03:50 GMT
> >> "David BL" <davi...@iinet.net.au> wrote in message
>
[quoted text clipped - 58 lines]
> Why would you strip off that identification?  Consequently, there wouldn't
> be bags of records, only sets of records.

One could say the log records are representing physical changes if
they record the physical locations.

> If you start out with a set of records and you know what is to be inserted,
> updated, and deleted, you can compute the resulting set of records.  If you
[quoted text clipped - 12 lines]
> or not what was written to the log actually made it into the database, and
> thus it should be possible to roll back any uncommitted transaction.

A modern DBMS using WAL will tend to use a "lazy writer" that writes
dirty pages to disk in the background.   For performance this will
tend to write dirty pages in a physical order so the disk head moves
uniformly over the platter.  This assumes the only constraint between
the writing to the log and the dirty pages is the WAL constraint - ie
dirty pages must be written to disk strictly after the associated log
records have been written to disk.  To meet this constraint the lazy
writer simply needs to ignore dirty pages that haven't (yet) been
logged.

Your suggestion brings far more onerous constraints on the disk
writing policies.  In fact to make a transaction durable it seems
necessary to write all the dirty pages to disk as part of the
transaction.   This is certainly not normally the case in a modern
DBMS.  A database that avoids a "force" policy requires REDO during
recovery.

Often a DBMS that supports long running transactions will allow dirty
pages to be written to disk even though the associated transaction
hasn't yet committed.  This is the so called "steal" policy and leads
to the need for UNDO during recovery.

AFAIK most databases have very flexible disk writing policies (steal/
no-force) and require both UNDO and REDO during recovery.  See the
ARIES algorithm.
Brian Selzer - 28 Apr 2008 05:25 GMT
>> >> "David BL" <davi...@iinet.net.au> wrote in message
>>
[quoted text clipped - 111 lines]
> Your suggestion brings far more onerous constraints on the disk
> writing policies.

No it doesn't.  It only requires that the data be written to disk before the
transaction marker is written to the log.  Whether the disk writes are
queued up to be written lazily is a separate issue.

> In fact to make a transaction durable it seems
> necessary to write all the dirty pages to disk as part of the
[quoted text clipped - 6 lines]
> hasn't yet committed.  This is the so called "steal" policy and leads
> to the need for UNDO during recovery.

The only way this can happen is if what was replaced as a result of the
writes is cached while the transactions are outstanding.  In order to
improve performance in the presence of long-running transactions, it may be
necessary to maintain that cache on disk.  I should emphasize here that such
caching is strictly a mechanism for improving performance.

> AFAIK most databases have very flexible disk writing policies (steal/
> no-force) and require both UNDO and REDO during recovery.  See the
> ARIES algorithm.
David BL - 28 Apr 2008 07:07 GMT
> >> "David BL" <davi...@iinet.net.au> wrote in message
>
[quoted text clipped - 119 lines]
> transaction marker is written to the log.  Whether the disk writes are
> queued up to be written lazily is a separate issue.

When is the transaction deemed to have committed?  I was assuming your
transaction marker was used to commit the transaction.  If not what is
it for?  Check pointing?   I think there are better ways to check
point than that.   AFAIK there is no need to check point *individual*
transactions with special markers.

> > In fact to make a transaction durable it seems
> > necessary to write all the dirty pages to disk as part of the
[quoted text clipped - 12 lines]
> necessary to maintain that cache on disk.  I should emphasize here that such
> caching is strictly a mechanism for improving performance.

With a general purpose DBMS using strict 2PL there is inevitably a
chance of dead-lock and the need to abort transactions.  This in turn
requires UNDO of all changes made by a particular transaction.  You
can either record before images of pages of uncommitted transactions,
or else use the log directly to roll back a transaction.  The latter
is economical with caching of the tail of the log and backward
chaining of log records for a given transaction.

> > AFAIK most databases have very flexible disk writing policies (steal/
> > no-force) and require both UNDO and REDO during recovery.  See the
> > ARIES algorithm.
Brian Selzer - 28 Apr 2008 14:25 GMT
>> >> "David BL" <davi...@iinet.net.au> wrote in message
>>
[quoted text clipped - 147 lines]
> point than that.   AFAIK there is no need to check point *individual*
> transactions with special markers.

The transaction marker is used to indicate that the transaction is complete.
What that means is that the changes that are called for in the triple have
been written to the database.  It's not enough to simply write the log and
then write the database, because without the marker following the triple in
the log, there is no way to be sure that the changes called for in the
triple actually made it into the database.

>> > In fact to make a transaction durable it seems
>> > necessary to write all the dirty pages to disk as part of the
[quoted text clipped - 22 lines]
> is economical with caching of the tail of the log and backward
> chaining of log records for a given transaction.

Ideally, the dbms engine would serialize physical transactions.  The before
images of uncommitted transactions could be read by other transactions
instead of the database during writes to the database.  Since what is
written to the log is just the changes required by a particular transaction,
it should be possible to cache the changes required by other transactions
while each is committed in turn.  Note that by 'serialize physical
transactions' I'm not referring to transaction isolation levels, which
involve how information is read, but instead how information is written.

>> > AFAIK most databases have very flexible disk writing policies (steal/
>> > no-force) and require both UNDO and REDO during recovery.  See the
>> > ARIES algorithm.
David BL - 29 Apr 2008 03:49 GMT
> >> "David BL" <davi...@iinet.net.au> wrote in message
>
[quoted text clipped - 158 lines]
> the log, there is no way to be sure that the changes called for in the
> triple actually made it into the database.

Ok so it relates to check pointing.  I don't see how these markers are
particularly useful - ie to determine where to start replaying the log
during recovery.  For example, when ARIES performs a check point it
records
   1) the active transaction list
   2) last LSN (Log Sequence Number) for each active transaction
   3) the set of dirty pages.
   4) for each dirty page, the LSN of the earliest log
      record whose effect is not reflected in the page on disk.

This allows the recovery scan to determine where to start replaying
the log (in the REDO pass).

> >> > In fact to make a transaction durable it seems
> >> > necessary to write all the dirty pages to disk as part of the
[quoted text clipped - 31 lines]
> transactions' I'm not referring to transaction isolation levels, which
> involve how information is read, but instead how information is written.

Is this for MVCC?
Eric - 29 Apr 2008 10:08 GMT
> ....
> Ok so it relates to check pointing.  I don't see how these markers are
[quoted text clipped - 9 lines]
> This allows the recovery scan to determine where to start replaying
> the log (in the REDO pass).

No, I think it's about being able to recover what happened between the
last completed checkpoint and the end of the log.

E.
David BL - 29 Apr 2008 11:15 GMT
> > ....
> > Ok so it relates to check pointing.  I don't see how these markers are
[quoted text clipped - 12 lines]
> No, I think it's about being able to recover what happened between the
> last completed checkpoint and the end of the log.

Actually with ARIES it's often necessary to recover from a position
before the last completed check point!

ARIES recovery contains three passes:
 1) analysis (forwards from last check point)
 2) redo (forwards from a position calculated by the analysis)
 3) undo (backwards from the end of the log).

For efficient check pointing ARIES doesn't force all dirty pages to
disk as part of the check point.  Therefore on recovery ARIES often
needs to begin the redo pass from a log position *before* the last
valid check point.

ARIES calculates the start point of the redo pass by taking the
minimum of the "recovery LSNs" of the dirty pages recorded in the
checkpoint.
Brian Selzer - 29 Apr 2008 17:31 GMT
>> >> "David BL" <davi...@iinet.net.au> wrote in message
>>
[quoted text clipped - 195 lines]
> This allows the recovery scan to determine where to start replaying
> the log (in the REDO pass).

That's not what the markers are for.  In fact, during recovery only the last
marker really matters.

>> >> > In fact to make a transaction durable it seems
>> >> > necessary to write all the dirty pages to disk as part of the
[quoted text clipped - 40 lines]
>
> Is this for MVCC?

Not exactly. While it should be possible to have several outstanding
transactions, it makes sense to me to allow only one to commit (at least
physically) at a time.  If you have two transactions, they can either
(1) be completely independent of each other so that it shouldn't matter
which transaction commits first, or
(2) interact with each other so that the order in which they commit can
result in different database states.

It should be possible to extend this to involve any number of outstanding
transactions so that a sequence of writes to the log and the database can be
determined.
Christoph Rupp - 22 Apr 2008 08:59 GMT
Brian,

thanks for the clarification. I see your point now.

> Again, I say neither.  You should log only those records that have changed:
> discard the notion of logging pages or segments of pages--instead, log
> records or sets of records.

So your suggestion is that the log file would contain information
like:

RECORD at page 13, offset 2: old key = "age", value = "30" ->
modified, new key value = "40"
etc

and during recovery, i repeat these operations (or undo them)?

> It is not enough in a concurrent environment to
> reconstruct a snapshot of the physical media at a particular point in time.
> An update may encompass many file operations; therefore, a reconstruction of
> just the snapshot will almost certainly be inconsistent and will require
> rolling back or rolling forward the result of any transactions that were
> outstanding at the point in time of the snapshot.

I actually always thought that this was how recovery works - i think i
have a neat idea for a checkpoint algorithm; my logfiles will never
become very large, i hope. Rolling back/forward was something i was
sure i had to do...

Thanks
Christoph
paul c - 22 Apr 2008 00:56 GMT
> Hi,
>
[quoted text clipped - 27 lines]
> thanks for your opinions,
> chris

Several advantages of logical logging, not the only ones, I'm sure:

1) Likely more insensitive to future changes in physical
organization/operations.

2) Potential to reconstruct old db values with additional constraints
applied (I don't know how many times this would have avoided much more
awkward surprise requirements).

3) Ability to shadow via another physical db, eg., for read-only audit
or various performance requirements (my favourite advantage, maybe most
people would favour #1.)
David BL - 22 Apr 2008 08:04 GMT
> Several advantages of logical logging, not the only ones, I'm sure:
>
[quoted text clipped - 8 lines]
> or various performance requirements (my favourite advantage, maybe most
> people would favour #1.)

There are also disadvantages to logical logging

a)  logical changes aren't always idempotent

b)  there are many forms of logical changes, whereas physical changes
to a page are always assumed to simply be assignments to a range of
bytes within the page.

Both of these create a lot of complexity.   I think physical logging
is much easier to implement.
Christoph Rupp - 22 Apr 2008 09:12 GMT
Paul,

thanks for your comments

> > Hi,
>
[quoted text clipped - 40 lines]
> or various performance requirements (my favourite advantage, maybe most
> people would favour #1.)

I totally agree that logical logging is easy to implement (just dump
some function calls to the log), and all your points are valid.

i.e. if there's an insert(a, b), just write the information to the
logfile, and then during recovery repeat this operation. but that's
also my problem: my code base is not able to repeat an operation if
the database is corrupt. (and i don't think it has to work with a
corrupted database).

so i don't know how to implement this. i'd be curious if any database
uses logical logging - i think it's nice in theory, but just too
difficult to implement...
paul c - 22 Apr 2008 16:45 GMT
...

> I totally agree that logical logging is easy to implement (just dump
> some function calls to the log), and all your points are valid.
> ...

I'd say whether it is easy depends on the purpose - if that is the
traditional recovery from physical errors it can be tricky.
Christoph Rupp - 22 Apr 2008 16:57 GMT
> ...
>
[quoted text clipped - 4 lines]
> I'd say whether it is easy depends on the purpose - if that is the
> traditional recovery from physical errors it can be tricky.

yes, my post was misleading. I actually wanted to write that it
_sounds_ easy. But as i wrote afterwards, i don't know how i could
implement it in my existing code base, for a traditional recovery.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.