Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General DB Topics / DB Theory / February 2008

Tip: Looking for answers? Try searching our database.

Newbie question about db normalization theory: redundant keys OK?

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
raylopez99 - 12 Dec 2007 22:26 GMT
With a few hours of theory under my belt, I'd like to ask if there's
ever a time that you don't want a completely normalized dB, that is, a
normalized database being a dB that has no redundant information (my
understanding of what a normalized database is).

Or, is there ever a time that you want redundant keys (that is, the
same keys in many different tables, that obviously are not linked (in
a relationship) between two tables?).  Having redundant attributes and/
or keys seems to me a very lazy way of designing a database that
doesn't require lots of initial thought, but of course you have to pay
for it by meticulously "synching" all redundant keys to one another
everytime there is a change in one of the redundant keys, so the keys
don't drift and have different values.

But is there ever a time you want to do this?

THanks in advance

RL
David Cressey - 13 Dec 2007 12:14 GMT
> With a few hours of theory under my belt, I'd like to ask if there's
> ever a time that you don't want a completely normalized dB, that is, a
[quoted text clipped - 15 lines]
>
> RL

The answer is yes,  there are times when a  design is a good one, even if
less than fully normalized.  For each normalization form, there is a known
set of anomalies that come up when you insert, update, or delete data in
that form.  If you are willing and able to program around those anomalies,
and if the design yields benefits that justify that effort,  it can be the
right thing to do.  Learning when to normalize is more subtle than learning
how to normalize.

There is a particular form of database design, called "star schema" that
yields good results when used in a data mart or data warehouse situation.  A
star schema mimics a multidimensional database in relational (or SQL) form.
A star schema follows design rules of its own,  and those rules sometimes
contradict the rules of normalization.  The up side of star schema is that
it's very easy to use with report generators,  or with OLAP tools like
Cognos or Business Objects.  The down side of star schema is that the
process of keeping the data current involves some fairly intricate
programming,  and heavy use of computer resources.

Star schema,  and other unnormalized or denormalized designs almost always
cost more than they are worth when used in a high transaction operational
setting,  like OLTP.

Unfortunately, most deviations from normalization occur due to blunders, and
not due to well considered design decisions.  Many deviations from
normalization occur because the designer is unfamiliar with some of the
normal forms.  Back when I was building databases,  I only really knew 1NF,
2NF, and 3NF.  Update anomalies due to deviations from BCNF and beyond were
rare,  but my design process would not have obviated them.

Another major cause of deviations from normalization is failure to
understand the data.  In particular, the functional dependendencies inherent
in the data are not discovered during data analysis,  and the design
unknowingly violates normalization rules.  By the time this is discovered,
there is usually a large body of application code that is dependent of the
bad design.

Sometimes, denormalized design is the reult of sheer pigheadedness.
Bob Badour - 13 Dec 2007 17:30 GMT
>>With a few hours of theory under my belt, I'd like to ask if there's
>>ever a time that you don't want a completely normalized dB, that is, a
[quoted text clipped - 53 lines]
>
> Sometimes, denormalized design is the reult of sheer pigheadedness.

Don't listen to a work David says. Star schema was sold by Cognos and
Business Objects so their customers would have to do the work they
should have done in the first place.

I seldom see anyone 'denormalize' who is aware of the actual costs of
doing so. On the other hand, I have seen plenty of ignoramuses
'denormalize' when physical clustering for the same performance
characteristics was an available option.
Dr. Dweeb - 16 Dec 2007 20:54 GMT
>>> With a few hours of theory under my belt, I'd like to ask if there's
>>> ever a time that you don't want a completely normalized dB, that
[quoted text clipped - 61 lines]
> 'denormalize' when physical clustering for the same performance
> characteristics was an available option.

Me too. The inability of a large number of "dba"s to distinguish between
logical and physical and to know which is which is trult astonishing - and
quite sad, given the fundamental nature of this distinction and its
relevance to relational databases.

This forum is full of examples of people who just have not understood it,
and propose bizarre ideas as "performance" enhancing, essentially because
they haven't got a clue what an optimizer is, why it is there, and just what
forms of optimization are in fact possible within any specific product.
Throw in physical proximity and other physical placement solutions and one
quickly realises that most problems are solvable without dicking around with
the logical model.

Dr. Dweeb
(an occasional lurker and OracleRdb guru)
-CELKO- - 13 Dec 2007 14:12 GMT
CREATE TABLE Schedule
(teacher VARCHAR(15) NOT NULL,
class CHAR(15) NOT NULL,
room INTEGER NOT NULL,
period INTEGER NOT NULL,
PRIMARY KEY (teacher, class, room, period));

That choice of a primary key is the most obvious one -- use all the
columns.  Typical rows would look like this:

('Mr. Celko', 'Database 101', 222, 6)

The rules we want to enforce are:

1) A teacher is in only one room each period.
2) A teacher teaches only one class each period.
3) A room has only one class each period.
4) A room has only one teacher in it each period.

Stop reading and see what you come up with for an answer.  Okay, now
consider using one constraint for each rule in the list, thus.

CREATE TABLE Schedule_1 -- version one, wrong!
(teacher VARCHAR(15) NOT NULL,
class CHAR(15) NOT NULL,
room INTEGER NOT NULL,
period INTEGER NOT NULL,
UNIQUE (teacher, room, period), -- rule #1
UNIQUE (teacher, class, period), -- rule #2
UNIQUE (class, room, period),  -- rule #3
UNIQUE (teacher, room, period), -- rule #4
PRIMARY KEY (teacher, class, room, period));

We know that there are 4 ways to pick three things from a set of four
things permutation.  While column order is important in creating an
index, we can ignore it for now and then worry about index tuning
later in the book.

I could drop the PRIMARY KEY as redundant if I have all four of these
constraints in place.  But what happens if I drop the PRIMARY KEY and
then one of the constraints?

CREATE TABLE Schedule_2  -- still wrong
(teacher VARCHAR(15) NOT NULL,
class CHAR(15) NOT NULL,
room INTEGER NOT NULL,
period INTEGER NOT NULL,
UNIQUE (teacher, room, period), -- rule #1
UNIQUE (teacher, class, period), -- rule #2
UNIQUE (class, room, period));    -- rule #3

I can now insert these rows in the second version of the table:

('Mr. Celko', 'Database 101', 222, 6)
('Mr. Celko', 'Database 102', 223, 6)

This gives me a very tough sixth period class load since I have to be
in two different rooms at the same time.  Things can get even worse
when another teacher is added to the schedule:

('Mr. Celko', 'Database 101', 222, 6)
('Mr. Celko', 'Database 102', 223, 6)
('Ms. Shields', 'Database 101', 223, 6)

Ms. Shields and I are both in room 223, trying to teach different
classes at the same time.  Matthew Burr looked at the constraints and
the rules, came up with this analysis.

CREATE TABLE Schedule_3  -- corrected version
(teacher VARCHAR(15) NOT NULL,
class CHAR(15) NOT NULL,
room INTEGER NOT NULL,
period INTEGER NOT NULL,
UNIQUE (teacher, period), -- rules #1 and #2
UNIQUE (room, period));   -- rules #3 and #4

If a teacher is in only one room each period, then given a period and
a teacher I should be able to determine only one room, i.e.  room is
functionally dependent upon the combination of teacher and period.
Likewise, if a teacher teaches only one class each period, then class
is functionally dependent upon the combination of teacher and period.
The same thinking holds for the last two rules: class is functionally
dependent upon the combination of room and period, and teacher is
functionally dependent upon the combination of room and period.

With the constraints that were provided in the first version, you will
find that the rules are not enforced.  For example, I could enter the
following rows:

('Mr. Celko', 'Database 101', 222, 6)
('Mr. Celko', 'Database 102', 223, 6)

These rows violate two of your rules, rule #1 and rule #2.  However,
the unique constraints first provided in Schedule_2 do not capture
this violation and will allow the rows to be entered.

The constraint

 UNIQUE (teacher, room, period)

is checking the complete combination of teacher, room, and period, and
since ('Mr. Celko', 222, 6) is different from ('Mr. Celko', 223, 6),
the DDL does not find any problem with both rows being entered, even
though that means that Mr. Celko is in more than one room during the
same period.

 UNIQUE (teacher, class, period)

doesn't catch its associated rule either since ('Mr. Celko', 'Database
101', 6) is different from ('Mr. Celko', 'Database 102', 6), and so,
Mr. Celko is able to teach more than one class during the same period,
thus violating rule two.  It seems that we'd also be able to add the
following row:

('Ms.  Shields', 'Database 103', 222, 6)

which violates rules #3 and #4.
Tony Rogerson - 14 Dec 2007 08:01 GMT
> CREATE TABLE Schedule
> (teacher VARCHAR(15) NOT NULL,
[quoted text clipped - 7 lines]
>
> ('Mr. Celko', 'Database 101', 222, 6)

And half way through term the teacher changed their name to 'Mrs Bunting'
and now everybody is confused!

Great example of why you should use an artificial key, I'll remember your
example for a blog entry I'm writing.

Signature

Tony Rogerson, SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson
[Ramblings from the field from a SQL consultant]
http://sqlserverfaq.com
[UK SQL User Community]

Roy Hann - 14 Dec 2007 08:35 GMT
>> CREATE TABLE Schedule
>> (teacher VARCHAR(15) NOT NULL,
[quoted text clipped - 13 lines]
> Great example of why you should use an artificial key, I'll remember your
> example for a blog entry I'm writing.

I'll remember this.  It is a great example of why no one should bother
reading your blog.

Roy
Tony Rogerson - 14 Dec 2007 12:16 GMT
> I'll remember this.  It is a great example of why no one should bother
> reading your blog.

Oh, I see Roy; you agree with Celko's design.

Interesting....

Signature

Tony Rogerson, SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson
[Ramblings from the field from a SQL consultant]
http://sqlserverfaq.com
[UK SQL User Community]

David Portas - 14 Dec 2007 14:59 GMT
> > I'll remember this.  It is a great example of why no one should bother
> > reading your blog.
[quoted text clipped - 7 lines]
> [Ramblings from the field from a SQL consultant]http://sqlserverfaq.com
> [UK SQL User Community]

Tony,

You didn't specify any reason why you thought Joe's design was wrong.
There is no obvious reason for confusion just because a teacher's name
changes - at least not as far as the data model is concerned. Users of
the data need a way to identify their teacher but how would an
"artificial key" help? I never knew any of my teachers as 1234.

--
David Portas
Tony Rogerson - 14 Dec 2007 15:49 GMT
> You didn't specify any reason why you thought Joe's design was wrong.
> There is no obvious reason for confusion just because a teacher's name
> changes - at least not as far as the data model is concerned. Users of
> the data need a way to identify their teacher but how would an
> "artificial key" help? I never knew any of my teachers as 1234.

I thought I had.

On my printed off time table it states 'Ms Fred', now, half way through the
school year while on a term break the teacher gets married and is now called
'Mrs Sid' - how does my timetable now link back to the original entity - it
can't, now, I return back from my holiday and wander around campus looking
for a teached called 'Ms Fred' but to no avail.

Now, a new teacher starts towards the end of the school year, called 'Ms
Fred', this is a completely different person from the one at the start of
term, in fact they teach cooking rather than engineering; now, my timetable
correctly links up and I can find my teacher - 'Ms Fred', only, the problem
is I'm at the wrong class - I no longer recognise the subject content - what
breaking eggs and cooking cakes has to do with engineering?

Do you see my point yet? and no, don't start talking about it being
identified by classroom - I remember when I was at FE the room we where
taught in, in fact the building sometimes often changed from week to week

Another one is somebody I do business with in Microsoft for the community,
she recently got married - she cannot change her email address because if
she did when we emailed her it would bounce because we would be using the
old one.

These are all problems with using the natural key - the natural key changes!
You need to create an imutable key; never changes so these real problems can
not and do not occur.

I'm interested - how would you solve this problem in the real world? Would
you prevent the teacher from changing their name?

Signature

Tony Rogerson, SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson
[Ramblings from the field from a SQL consultant]
http://sqlserverfaq.com
[UK SQL User Community]

David Portas - 14 Dec 2007 16:13 GMT
> > You didn't specify any reason why you thought Joe's design was wrong.
> > There is no obvious reason for confusion just because a teacher's name
[quoted text clipped - 18 lines]
>
> Do you see my point yet?

If the database is correctly updated to reflect any name changes then
no problem arises. If there is a requirement to reflect the history of
previous names then that information can also be recorded. In either
case I don't see any call for anything like an "artificial key".

> Another one is somebody I do business with in Microsoft for the community,
> she recently got married - she cannot change her email address because if
[quoted text clipped - 4 lines]
> You need to create an imutable key; never changes so these real problems can
> not and do not occur.

I don't understand what would prevent you from updating the email
address in the database. The choice of key should have nothing to do
with it.

--
David Portas
Tony Rogerson - 14 Dec 2007 17:07 GMT
> If the database is correctly updated to reflect any name changes then
> no problem arises. If there is a requirement to reflect the history of
> previous names then that information can also be recorded. In either
> case I don't see any call for anything like an "artificial key".

CREATE TABLE Schedule
(teacher VARCHAR(15) NOT NULL,
class CHAR(15) NOT NULL,
room INTEGER NOT NULL,
period INTEGER NOT NULL,
PRIMARY KEY (teacher, class, room, period));

Does the above do that? No.

Part 2; in the real world how would the history work?

Perhaps this....

CREATE TABLE Schedule
(teacher VARCHAR(15) NOT NULL,
class CHAR(15) NOT NULL,
room INTEGER NOT NULL,
period INTEGER NOT NULL,
start_date datetime not null,
end_date datetime null check( end_date > start_date ),
PRIMARY KEY (teacher, class, room, period, start_end, end_date ));

Nope, I still cannot find my teacher (SELECT * FROM Schedule WHERE teacher =
'Ms Fred'); all I know now is that my teacher no longer is a teacher which
is wrong.

How would you store the history given that the natural key changes WITHOUT
using an artificial key?

> I don't understand what would prevent you from updating the email
> address in the database. The choice of key should have nothing to do
> with it.

The real world gets in the way.

Upto and including Monday the person has the email address
'julie.smith@fred.com'

On Tuesday she gets married and changes her email address to
'julie.jones@fred.com'

Do you not see a problem with that if the designer had followed the same
approach above.

I keep trying to email julie.smith@fred.com only that email address no
longer exists so I can no longer communicate with her.

Signature

Tony Rogerson, SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson
[Ramblings from the field from a SQL consultant]
http://sqlserverfaq.com
[UK SQL User Community]

David Portas - 14 Dec 2007 17:36 GMT
> > If the database is correctly updated to reflect any name changes then
> > no problem arises. If there is a requirement to reflect the history of
[quoted text clipped - 52 lines]
> [Ramblings from the field from a SQL consultant]http://sqlserverfaq.com
> [UK SQL User Community]

I genuinely am having trouble understanding what any of this has to do
with saying that Joe's design is "wrong". None of the problems you
have mentioned are the usual ones given for using "artificial" keys -
not that I've seen anyway. I don't doubt that you have some real
issues in mind but I don't think you are explaining them very
precisely: "The real world gets in the way" tells us nothing about why
it would be a problem to update email address X to become email
address Y. I just don't see what you are getting at.

For methods of recording the history of change Date, Darwen and
Lorentzos have a book "Temporal Databases and the Relational Model",
which discusses the issues and solutions at length.

--
David Portas
Tony Rogerson - 14 Dec 2007 17:51 GMT
First of all, let's go back to your statement about just store the history -
can you please show how you would do that with celko schema and still be
able to reach the rows as per my examples?

> precisely: "The real world gets in the way" tells us nothing about why
> it would be a problem to update email address X to become email
> address Y. I just don't see what you are getting at.

Here's another...

create table blah (
   emiladdress varchar(?) not null unique,
   who varchar(100) not null
)

In the year 2006 I have an email address called
tonyrogerson@sqlserver.eu.com everybody sends email to that address.

insert blah values( 'tonyrogerson@sqlserver.eu.com' )

On 1st Jan 2007 I change the email address to tonyrogerson@torver.net I
suddenly stop receiving emails because everybody is still emailing
tonyrogerson@sqlserver.eu.com.

update blah set emailaddress = 'tonyrogerson@torver.net' where emailaddress
= 'tonyrogerson@sqlserver.eu.com'

How do all the applications disconnected from the database now reach the
record? They can't - the natural key has changed.

Signature

Tony Rogerson, SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson
[Ramblings from the field from a SQL consultant]
http://sqlserverfaq.com
[UK SQL User Community]

Ross Presser - 14 Dec 2007 18:01 GMT
> create table blah (
>     emiladdress varchar(?) not null unique,
[quoted text clipped - 15 lines]
> How do all the applications disconnected from the database now reach the
> record? They can't - the natural key has changed.

There are two ways I would answer this. The first says that email
address isn't a natural key -- it doesn't uniquely define a person any
more than hair color does.  The second is that the natural key hasn't
changed, the entity -- a mailbox on a mail server -- has changed.
Tony Rogerson - 14 Dec 2007 18:14 GMT
> There are two ways I would answer this. The first says that email
> address isn't a natural key -- it doesn't uniquely define a person any
> more than hair color does.  The second is that the natural key hasn't
> changed, the entity -- a mailbox on a mail server -- has changed.

Absolutely - so we need to use something that won't change - an artificial
key.

Signature

Tony Rogerson, SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson
[Ramblings from the field from a SQL consultant]
http://sqlserverfaq.com
[UK SQL User Community]

David Portas - 14 Dec 2007 18:39 GMT
> First of all, let's go back to your statement about just store the history -
> can you please show how you would do that with celko schema and still be
[quoted text clipped - 25 lines]
> How do all the applications disconnected from the database now reach the
> record? They can't - the natural key has changed.

You are saying that this is a distributed database but that it lacks a
mechanism for accurately propagating changes out to all its nodes?
Well in my view such a DBMS would be broken. It surely violates Codd's
principle of "Distribution Independence". Let's follow your example to
its conclusion though. The solution is to replace whatever copy of the
Blah relation exists in the application with the new Blah relation
that superceded it. Now all emails reach the correct address and there
is no problem that requires a different key.

--
David Portas
Tony Rogerson - 14 Dec 2007 18:55 GMT
> You are saying that this is a distributed database but that it lacks a
> mechanism for accurately propagating changes out to all its nodes?

Nope - it's a normal database you have simply done a query against the
database - the result of SELECT and then return to the database to validate
your data against it - think about it; your insurance company posts you a
claim form and you then send it back in - you need to validate what's on the
form (our equiv of time table) against the database.

Also, please can you show me how you would simply put history in the table
with celko's example.

Signature

Tony Rogerson, SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson
[Ramblings from the field from a SQL consultant]
http://sqlserverfaq.com
[UK SQL User Community]

Brian Selzer - 15 Dec 2007 05:17 GMT
>> First of all, let's go back to your statement about just store the
>> history -
[quoted text clipped - 36 lines]
> that superceded it. Now all emails reach the correct address and there
> is no problem that requires a different key.

Forgive me for butting in, David, but where did you come up with the idea
that it is a distributed database?  What have disconnected
applications--that is, applications that use something akin to disconnected
ADO recordsets or ADO.NET datasets--to do with distributed databases?

The question is: for how long is the data that was just read out of the
database considered to be valid?  Until the next update?  Or is it stale as
soon as its read?  Does it have something to do with transaction control or
locking?  If several updates occur between the reading of one piece of
information and the reading of another, how can you be sure that any answer
that involves both pieces of information is correct?  How can you be sure
that you haven't read the same information twice?  If you use an artificial
key and a timestamp (or rowversion), then there can be no doubt as to
whether or not the information in question changed between the first reading
and the second.

> --
> David Portas
David Portas - 15 Dec 2007 09:08 GMT
>> You are saying that this is a distributed database but that it lacks a
>> mechanism for accurately propagating changes out to all its nodes?
[quoted text clipped - 18 lines]
> any answer that involves both pieces of information is correct?  How can
> you be sure that you haven't read the same information twice?

I agree that these are important issues of application design. I don't think
they need to affect the database logical design in this case.

> If you use an artificial key and a timestamp (or rowversion), then there
> can be no doubt as to whether or not the information in question changed
> between the first reading and the second.

A ROWVERSION (as defined by Microsoft SQL Server) does not tell you whether
any data has changed. It tells you whether some rows were possibly affected
by update operations. In other words it can give false positives -
indicating a change where there is none.

BTW I seriously doubt whether it would be possible or desirable to implement
anything like a ROWVERSION in a true RDBMS. The consequences of SQL Server's
implementation are serious because it attempts to identify row data based on
something other than keys. I have never been a fan of the ROWVERSION
feature.

The only way to tell whether the current state of the database equals some
previous state is to query again and compare that to a result previously
retrieved. That comparison is usually based on a key value. It makes no
difference what type of value is used for the key. The comparison is exactly
the same whether it is an "artificial" key or otherwise. (I'm not too
concerned about defining what an "artificial" key is because I don't think
it matters).

Consider a set of transformations:

T1:a -> T2:b -> T3:c -> T4:a

Where "T1:a" means "The value of the database at time T1 is a". When we
requery the database at time T4 all we will ever know is that the value is
the same as at time T1. If it is a requirement to know about the previous
updates at T2 and T3 then obviously we ought to preserve that information in
the database - but in many cases it is quite reasonable not to do that.

Signature

David Portas

David Cressey - 15 Dec 2007 13:13 GMT
> >> You are saying that this is a distributed database but that it lacks a
> >> mechanism for accurately propagating changes out to all its nodes?
[quoted text clipped - 21 lines]
> I agree that these are important issues of application design. I don't think
> they need to affect the database logical design in this case.

They need not afect database design for a single database built in
isolation.  However, if there is a data architecture that pervades the
entire enterprise,  and both databases and applications inherit the
semantics of the data and to some extent the form,  from that pervasive data
architecture,  then the issues raised become very relevant.
Bob Badour - 15 Dec 2007 13:49 GMT
>>>You are saying that this is a distributed database but that it lacks a
>>>mechanism for accurately propagating changes out to all its nodes?
[quoted text clipped - 10 lines]
>>disconnected ADO recordsets or ADO.NET datasets--to do with distributed
>>databases?

Managing data on a server... and managing data on a client... um, and
the idiot is too stupid to see what that has to do with distributed data
management?!? Yikes!

Just because the idiot chose a piss-poor way to manage data at one node
in the system doesn't change the fundamental nature of what's going on.

David, I don't know why you bother with self-aggrandizing ignorants like
Selzer.

[snip]
Brian Selzer - 15 Dec 2007 19:37 GMT
>>>>You are saying that this is a distributed database but that it lacks a
>>>>mechanism for accurately propagating changes out to all its nodes?
[quoted text clipped - 20 lines]
> David, I don't know why you bother with self-aggrandizing ignorants like
> Selzer.

I believe that with this post you have proven yourself to be ignorant.  You
appear to not have a clue about how relational data is used in applications,
nor does it appear that you can distinguish between a distributed database
and an application.  Perhaps you should consider keeping your mouth closed,
unless, of course, you enjoy the taste of your feet.

> [snip]
Brian Selzer - 15 Dec 2007 19:38 GMT
>>> You are saying that this is a distributed database but that it lacks a
>>> mechanism for accurately propagating changes out to all its nodes?
[quoted text clipped - 21 lines]
> I agree that these are important issues of application design. I don't
> think they need to affect the database logical design in this case.

I think they do.  For a single state, first order logic suffices; once more
than one state is involved, at a minimum some form of modal logic is
necessary.

>> If you use an artificial key and a timestamp (or rowversion), then there
>> can be no doubt as to whether or not the information in question changed
[quoted text clipped - 4 lines]
> affected by update operations. In other words it can give false
> positives - indicating a change where there is none.

You're right, of course: it only tells whether some rows were the target of
an update, not whether the information is actually different.

> BTW I seriously doubt whether it would be possible or desirable to
> implement anything like a ROWVERSION in a true RDBMS. The consequences of
> SQL Server's implementation are serious because it attempts to identify
> row data based on something other than keys. I have never been a fan of
> the ROWVERSION feature.

In what way does it attempt to identify row data based on something other
than keys?

> The only way to tell whether the current state of the database equals some
> previous state is to query again and compare that to a result previously
[quoted text clipped - 3 lines]
> too concerned about defining what an "artificial" key is because I don't
> think it matters).

I disagree.  Unless the identifier is a rigid designator or a rigid definite
description, then any comparison based upon that identifier is suspect.  It
may be the case that the key values are identical, but the individuals in
each state that are identified by that key value are different individuals.
For example, "the first person in line" can be different people at different
times.

> Consider a set of transformations:
>
[quoted text clipped - 5 lines]
> updates at T2 and T3 then obviously we ought to preserve that information
> in the database - but in many cases it is quite reasonable not to do that.
David Portas - 16 Dec 2007 14:14 GMT
>> The only way to tell whether the current state of the database equals
>> some previous state is to query again and compare that to a result
[quoted text clipped - 10 lines]
> different individuals. For example, "the first person in line" can be
> different people at different times.

Isn't that purely a question of what propositions we want to represent?

Assume that the tuple:

A: ('Celko', 'Database 101', 222, 6)

represents the proposition:

P: "Teacher named Celko teaches class 'Database 101' in room 222 during
period 6"

which we assume to be true at a certain point in time. At some later time we
replace tuple A with two different tuples:

B: ('Selzer', 'Database 101', 222, 6)
C: ('Celko', 'Zen Buddhism', 223, 6)

representing propositions of the same kind as P.

Those propositions assert absolutely nothing about whether Celko in A is the
same individual as Celko in C. They assert nothing about whether room 222 in
the first case is the same as in the second. Crucially, they DO NOT assert
whether any individual previously known as "Celko" changed his name to
"Selzer" or whether he was replaced by an entirely different person called
"Selzer".

The reason that you (and Tony I guess) think it is "suspect" to compare
tuple A with tuple B is that you are applying your own intended
interpretation, which the database was never designed to support. There are
other interpretations however, which are potentially valid (the
interpretations of the proposition P kind for example). If those valid
interpretations are the ones required and understood by the users of the
data then the database is quite sufficient as proposed.

>> BTW I seriously doubt whether it would be possible or desirable to
>> implement anything like a ROWVERSION in a true RDBMS. The consequences of
[quoted text clipped - 4 lines]
> In what way does it attempt to identify row data based on something other
> than keys?

A problem arises with relational assignment - something that SQL doesn't
support. In an assignment it is meaningless to talk of individual tuples
being updated or left unchanged. The only valid basis for anything like
ROWVERSION might be a comparison of the relation value before the assigment
and the relation value after. But the ROWVERSION feature has no means of
specifying a key on which to perform such a comparison (presumably the
comparison would be a join on one of the candidate keys). ROWVERSION just
assumes that such a comparison can be made with or without a key (this is
SQL after all!).

Every kind of update in an RDBMS must surely be equivalent to some
relational assignment. That's why I seriously doubt whether anything quite
like ROWVERSION is possible or desirable.

Signature

David Portas

Tony Rogerson - 16 Dec 2007 22:06 GMT
David,

Stop trying to divert; you got it wrong; just leave it there.

Here is the original ddl...

CREATE TABLE Schedule
(teacher VARCHAR(15) NOT NULL,
class CHAR(15) NOT NULL,
room INTEGER NOT NULL,
period INTEGER NOT NULL,
PRIMARY KEY (teacher, class, room, period));

ROWVERSION has nothing at all to do with this I really don't understand why
you are trying to deflect the thread to that.

The simple matter of the fact is that on revisting the database with the
natural key you obtained (which should not have changed) has changed and
points to completely different data.

That is why we need an immutable key.

And... I'm still waiting for you to show me how to put history into celko's
schema that would fix the problem where the natural key changes - why oh why
are you STILL ignoring that request?

Signature

Tony Rogerson, SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson
[Ramblings from the field from a SQL consultant]
http://sqlserverfaq.com
[UK SQL User Community]

David Portas - 16 Dec 2007 22:38 GMT
> David,
>
> Stop trying to divert; you got it wrong; just leave it there.

I'm surprised if you assume there is a single right or wrong answer. I
generally try to keep an open mind which is why I posed my responses as
questions rather than claim that anything you said was wrong.

> ROWVERSION has nothing at all to do with this I really don't understand
> why you are trying to deflect the thread to that.

I agree ROWVERSION has nothing to do with it. I didn't introduce the topic.
Brian did.

> The simple matter of the fact is that on revisting the database with the
> natural key you obtained (which should not have changed) has changed and
> points to completely different data.
>
> That is why we need an immutable key.

That's a straw man argument. If you design a solution on the assumption that
data will not change then you will certainly have problems. I agree with
you. I just don't advocate designing systems that way. Now if the rows
referenced by your "immutable" key have been deleted, what result do you
intend in that case? You don't assume that they will always exist do you? Of
course you don't - you determine the right response and design the
appropriate solution on the basis that the data may have changed.

> And... I'm still waiting for you to show me how to put history into
> celko's schema that would fix the problem where the natural key changes -
> why oh why are you STILL ignoring that request?

I already explained.

Signature

David Portas

Brian Selzer - 17 Dec 2007 03:55 GMT
>>> The only way to tell whether the current state of the database equals
>>> some previous state is to query again and compare that to a result
[quoted text clipped - 44 lines]
> interpretations are the ones required and understood by the users of the
> data then the database is quite sufficient as proposed.

Isn't it true that with FOL, there must be separate interpretations for
tuple A and tuple B, since they belong to different database values, whereas
with modal or temporal logic, the same interpretation can apply to both?
Since it is under an interpretation that values are assigned meaning, the
separate interpretations for tuple A and tuple B permit identical values to
mean totally different things or different values to mean the same thing.
Without a common interpretation, it cannot be determined whether or not the
individuals referenced by the key values in tuple A and tuple B are the same
individual or different individuals.   This is why I think it is suspect to
compare tuple A with tuple B unless the key is a rigid designator or a rigid
definite description.  Note that the presence of a rigid designator or rigid
definite description implies a common interpretation under which comparisons
are no longer suspect.

>>> BTW I seriously doubt whether it would be possible or desirable to
>>> implement anything like a ROWVERSION in a true RDBMS. The consequences
[quoted text clipped - 18 lines]
> relational assignment. That's why I seriously doubt whether anything quite
> like ROWVERSION is possible or desirable.

That assumes that relational assignment is primitive.  I would argue that
insert, update and delete are the primitive operations, and that assignment
is a shorthand for a combination of the primitives delete and insert.
Information is lost when an update is translated into an assignment, but not
so the reverse: an assignment can always be translated into a delete and an
insert.
David Portas - 18 Dec 2007 22:03 GMT
> Isn't it true that with FOL, there must be separate interpretations for
> tuple A and tuple B, since they belong to different database values,
> whereas with modal or temporal logic, the same interpretation can apply to
> both?

I don't know enough about modal logic to comment on that part. I don't see
how you conclude that for FOL there MUST be separate interpretations just
because the values have different valid times. It seems like there are
temporal database models and even logic programming that don't require
anything like modal logic to produce useful results.

> That assumes that relational assignment is primitive.  I would argue that
> insert, update and delete are the primitive operations, and that
> assignment is a shorthand for a combination of the primitives delete and
> insert. Information is lost when an update is translated into an
> assignment, but not so the reverse: an assignment can always be translated
> into a delete and an insert.

I think it is self-evident that assignment can preserve exactly as much
information as the user requires it to. If the information in question can
be represented as values within relations then, ipso facto, a relational
assignment is sufficient. Any form of DELETE or INSERT operator therefore
doesn't add any expressive power. At best it would be a syntax shortcut of
some kind.

Signature

David Portas

Brian Selzer - 19 Dec 2007 05:34 GMT
>> Isn't it true that with FOL, there must be separate interpretations for
>> tuple A and tuple B, since they belong to different database values,
[quoted text clipped - 6 lines]
> temporal database models and even logic programming that don't require
> anything like modal logic to produce useful results.

I believe one of the logicians out there can answer better than I, but the
way I understand it, an interpretation is an assignment.  For modal logic,
an interpretation assigns meaning to every symbol that appears in any
possible world, and then one of those possible worlds is determined to be
the actual world.  As a consequence, no matter how often a symbol appears in
however many possible worlds, the interpretation involves a single
assignment.  First order logic doesn't deal with possibility and necessity,
so as far as I know, the possible worlds interpretation doesn't apply.
That's why you need separate FOL interpretations for database values that
occur at different times: each distinct database value necessarily involves
a separate assignment.  How, for example, can you assign meaning for
something that doesn't exist?  If Baby Mike will be born on Halloween, 2008,
then how can you use the same interpretation for a database today as a
database on Christmas, 2008?  With modal logic, the possibility that Baby
Mike will be born can be represented, and therefore meaning can be assigned
to the symbol for Baby Mike.

>> That assumes that relational assignment is primitive.  I would argue that
>> insert, update and delete are the primitive operations, and that
[quoted text clipped - 9 lines]
> doesn't add any expressive power. At best it would be a syntax shortcut of
> some kind.

You're right about DELETE and INSERT, but not about UPDATE.  If you look at
UPDATE in TTM (pps. 112-113), you can see what happens:

UPDATE r ( Ai := X, Aj := Y)

where i <> j is supposedly equivalent to

( ( EXTEND r ADD ( X AS Bi, Y AS Bj ) ) { ALL BUT Ai, Aj } )
   RENAME ( Bi AS Bk, Bj AS Aj, Bk AS Ai )

where Bi, Bj, and Bk are arbitrary distinct attribute names that do not
appear in r.

Now if you look at the result of

EXTEND r ADD ( X AS Bi, Y AS Bj ),

you'll notice that each tuple has both the old values and the new values for
each affected attribute.  Clearly if Ai or Aj is prime, then that
information that ties each tuple in the result to its corresponding tuple in
r is projected away by "{ ALL BUT Ai, Aj }".  Information is lost.  The
result may be the same, but how that result was arrived at is lost in
translation, and therefore cannot be verified.
David Portas - 19 Dec 2007 11:41 GMT
> You're right about DELETE and INSERT, but not about UPDATE.  If you look
> at UPDATE in TTM (pps. 112-113), you can see what happens:
[quoted text clipped - 19 lines]
> result may be the same, but how that result was arrived at is lost in
> translation, and therefore cannot be verified.

"Information is lost" but no-one ever said that relational operators have to
be reversible. I don't know what you mean by "cannot be verified" - the
result is what it is. Assignment is still primitive because you haven't
shown how an UPDATE can give a result that cannot be obtained by assignment.

Signature

David Portas

Brian Selzer - 19 Dec 2007 19:32 GMT
>> You're right about DELETE and INSERT, but not about UPDATE.  If you look
>> at UPDATE in TTM (pps. 112-113), you can see what happens:
[quoted text clipped - 25 lines]
> shown how an UPDATE can give a result that cannot be obtained by
> assignment.

If assignment is primitive, then it cannot necessarily be determined whether
the properties of an individual that existed before an assignment are now
different or that an individual that did not exist before the assignment now
does.  If, on the other hand, update is primitive, then there is no doubt.
Since an update specifically targets information about what already exists,
and since the update includes information about those individuals both
before and after, there is no doubt as to whether the individuals in
question existed.  This is what I mean by "cannot be verified."

I think I indicated that the result may be the same.  If I want a new
Ferrari, I can buy one or I can steal one.  In both cases the result is the
same--I would possess a new Ferrari--but how I ultimately arrived at that
result is significant, because in the one case I would be in debt up to my
eyeballs, but in the other case I would be on the run from the law.
David Portas - 19 Dec 2007 20:03 GMT
> If assignment is primitive, then it cannot necessarily be determined
> whether the properties of an individual that existed before an assignment
> are now different or that an individual that did not exist before the
> assignment now does.

Both those questions could be answered by comparing the relation value
before the assignment to the one after it. For reasons that aren't clear to
me you think that is not allowed or invalid in some way. I think I'm not the
only one who will take some convincing.

> If, on the other hand, update is primitive, then there is no doubt. Since
> an update specifically targets information about what already exists, and
> since the update includes information about those individuals both before
> and after, there is no doubt as to whether the individuals in question
> existed.  This is what I mean by "cannot be verified."

To attribute any real world meaning to a humble operator is a bizarre leap
beyond RM. RM says absolutely nothing about semantics. One tuple could mean
anything in successive states of the database or even stand for multiple
things in the same state. I can't imagine saying to database users "Use
updates if you mean that this particular individual already existed. Use
other operators if you mean something else." I for one wouldn't want to have
anything to do with such a DBMS!

> I think I indicated that the result may be the same.  If I want a new
> Ferrari, I can buy one or I can steal one.  In both cases the result is
> the same--I would possess a new Ferrari--but how I ultimately arrived at
> that result is significant, because in the one case I would be in debt up
> to my eyeballs, but in the other case I would be on the run from the law.

Yes but you are totally failing to explain HOW or WHERE that information
could be exposed. If it does not exist as values within relations then it
does not exist in an RDBMS. If it does exist as values within relations then
it can be supported equivalently using an assignment (or at least using
multiple assignment).

Signature

David Portas

Anith Sen - 19 Dec 2007 21:04 GMT
>> I can't imagine saying to database users "Use updates if you mean that
>> this particular individual already existed. Use other operators if you
>> mean something else." I for one wouldn't want to have anything to do with
>> such a DBMS!

I am quite sure Brian had been told about the distinctions between external
and internal predicates and how the distinction matters significantly while
mapping a business representation to a logical model. For whatever reasons,
he chooses not to heed. Fabian has an article somewhere in his site namely,
"The desirable and the possible" that clarifies the distinction pretty well.

Signature

Anith

Brian Selzer - 20 Dec 2007 15:03 GMT
>> If assignment is primitive, then it cannot necessarily be determined
>> whether the properties of an individual that existed before an assignment
[quoted text clipped - 5 lines]
> to me you think that is not allowed or invalid in some way. I think I'm
> not the only one who will take some convincing.

The problem is that there are times when they can be answered and times when
they can't.  If that which identifies an individual can be different at
different times, then simply comparing the relation value before the
assingment with the one after will not work.  For example, the number you
pick at the pharmacy today may be different than the one you picked last
week.  It all boils down to key stability, and consequently whether or not
the comparison is valid depends upon an arbitrary choice made by the
database designer.  In my opinion, that is unacceptable.  Now if update is
primitive, then key stability is no longer a factor because both the old
values and the new values for each affected tuple are available, and
therfore a tuple-by-tuple comparison can be carried out.

>> If, on the other hand, update is primitive, then there is no doubt. Since
>> an update specifically targets information about what already exists, and
[quoted text clipped - 9 lines]
> existed. Use other operators if you mean something else." I for one
> wouldn't want to have anything to do with such a DBMS!

I don't see how it is attributing real world meaning to expect users to
issue a DELETE to remove information about something that no longer exists,
an UPDATE to adjust information about something that is different in
appearance, or an INSERT to supply information about something that just
came into existence.  It is the user that can see what exists, what doesn't
and what looks different; the system must be told.  Is existence part of the
external predicate?  The domain closure assumption states that the only
individuals that exist are those that are represented in the database.

Key values are supposed to map to individuals in the micro-world that the
database represents.  While it is true that a tuple can mean anything in
successive states or even stand for multiple things, it is clear that since
each tuple contains a key value, there is a mapping between that tuple and
at least one individual in the micro-world.  It is not the individual that
is important here, but the fact that every tuple exhibits the same behavior,
that is, under any interpretation, each key value exemplifies an individual.
It is that behavior that supports the use of DELETE, UPDATE and INSERT in
the manner described above.

>> I think I indicated that the result may be the same.  If I want a new
>> Ferrari, I can buy one or I can steal one.  In both cases the result is
[quoted text clipped - 7 lines]
> then it can be supported equivalently using an assignment (or at least
> using multiple assignment).

The information is exposed in a FOR EACH ROW trigger.  I think it could also
be exposed in a manner that does not involve iteration.
David Cressey - 19 Dec 2007 11:59 GMT
> You're right about DELETE and INSERT, but not about UPDATE.  If you look at
> UPDATE in TTM (pps. 112-113), you can see what happens:
[quoted text clipped - 19 lines]
> result may be the same, but how that result was arrived at is lost in
> translation, and therefore cannot be verified.

It seems to me that, in many discussions over the past few months, you have
been emphasizing the difference between DELETE followed by INSERT on the one
hand,  and UPDATE on the other,  even when the final result of both is a
database in the same state.

The question arises whether the database communicates any data other than
the data contained in its state.  I'm going to pull a quote from Codd's 1970
paper.  Even though this quote is about consistency,  rather than identity,
I want to draw your attention to the wording.

"It is important to note that consistency as defined above
is a property of the instantaneous state of a data bank, and
is independent of how that state came about. Thus, in
particular, there is no distinction made on the basis of
whether a user generated an inconsistency due to an act of
omission or an act of commission. "

It seems to me that all of us who have been building or working with
relational databases, regardless of whether we use SQL as an interface,  or
some tool that might be superior to SQL, have been treating database state
as independent of history in exactly the above sense.  Not only when we deal
with consistency,  but also with regard to entity identity.  I don't know of
any revision made by D&D or others to that perspective.

The languages described in this newsgroup differ from SQL in at least one
important respect.  SQL distinguishes between actions and transactions.  For
reasons that they themselves have outlined,  D&D have come up with a
language in which every transaction can be expressed as a single action.  I
can see the advantages of such a pattern, even without any prectical
experience in such a language.  But even with this difference,  the
difference between history and state is just about identical between SQL and
the relational language.

It seems to me that,  if you are going to assert that the history of a
database is part of the interpretation of its content,  it's going to be up
to you to explain how a database can serve up any relevant portions of its
own history as data.

If it isn't data,  we don't know it.

I think that until this matter gets resolved somehow,  there is going to be
a continuing disconnect between you and a lot of the regulars in c.d.t.
Frank Hamersley - 22 Dec 2007 11:07 GMT
> "Brian Selzer" <brian@selzer-software.com> wrote in message
[..]

> The question arises whether the database communicates any data other than
> the data contained in its state.  I'm going to pull a quote from Codd's 1970
[quoted text clipped - 7 lines]
> whether a user generated an inconsistency due to an act of
> omission or an act of commission. "

Interesting - I had always assumed that Codds use of "time varying" had
implied temporality but the above quote must weaken that view.

[..]

> It seems to me that,  if you are going to assert that the history of a
> database is part of the interpretation of its content,  it's going to be up
> to you to explain how a database can serve up any relevant portions of its
> own history as data.

Many have tried!  AFAIK all have failed.

Cheers Frank.
David Cressey - 22 Dec 2007 13:17 GMT
> > "Brian Selzer" <brian@selzer-software.com> wrote in message
> [..]
[quoted text clipped - 13 lines]
> Interesting - I had always assumed that Codds use of "time varying" had
> implied temporality but the above quote must weaken that view.

It's my understanding,  from reading in here,  that "time varying relation"
was basically the same thing as "relational variable",  which in turn gets
abbreviated to "relvar".

It's my understanding that a consistent view of a database involves
everything written by prior transactions and nothing written by concurrent
transactions, let alone future transactions.  Concurrency is, of course,
more complicated than this,  but  this is sufficient for this discussion.

I have preferred to think of the database data structures as "containers"
rather than "variables".   The contents of containers are presumably time
varying, subject to the above comment on transactions.  I suppose you could
have a WORM container that, once written, remains constant,  but I never
thought of that until just now.

The "container" terminology  makes for a nice distinction between container
based addressing and content based addressing.  This in turn leads to the
distinction between the graph model of data and the relational model of
data.  But, AFAIK,  it's just terminology.  If there's any deep thought
here, it isn't my thought.
paul c - 22 Dec 2007 16:12 GMT
...
> It's my understanding that a consistent view of a database involves
> everything written by prior transactions and nothing written by concurrent
> transactions, let alone future transactions.  Concurrency is, of course,
> more complicated than this,  but  this is sufficient for this discussion.
> ...

It certainly does get complicated, IMHO that's because of the
conventional locking implementations which for the average app
programmer, obscure the essential problem.  Often those implementations
force the changing of application requirements when I think concurrency
should always be part of the requirement.   Logically, I don't see why a
transaction need ever be concerned with other concurrent transactions
although I admit such an implementation might not be considered
"optimum" as far as performance is concerned.  For example, if "time" is
taken out of the picture by a suitable language, one could view, say, a
bank account balance update as involving only the data the transaction
has read plus the changes and these could all be expressed in the form
of relations.  A dbms could view the previously read data supplied by an
updating transaction merely as constraints or assertions if you like.
In this way, such a dbms would not have a distinguishable concurrency
component, eg., a conventional lock manager.  I'm not suggesting this
should be implemented but to me it seems a kind of canonical view of the
concurrency problem and possibly a good one for allowing people to
introduce concurrency notions into an application design as opposed to
leaving it to the vagaries of a particular implementation.

To some extent, I think the "optimistic" implementations mimic this, but
my complaint is that their mechanisms are usually out of the hands of
the app programmer.

I don't know enough about functional languages to say whether they could
support this "out-of-the-box", but the D&D "comma" operator is a step in
the direction of having a notation that allows "before" and "after"
versions of a relvar.  One might prefer to think of a time sequence when
dealing with such relvars, but logically a "timeless" engine would only
need to support the notion of replacement, not "before" and "after".

Just my twenty-five cents.
David Cressey - 23 Dec 2007 00:50 GMT
> ...
> > It's my understanding that a consistent view of a database involves
[quoted text clipped - 22 lines]
> introduce concurrency notions into an application design as opposed to
> leaving it to the vagaries of a particular implementation.

I was spoiled, back in the late 80s and early 90s,  by working with DEC
Rdb/VMS.
The locking mechanism was fairly conservative, but the app programmer was
very much freed from the responsibilities for figuring out the things you
describe,  at least in most circumstances.

What programmers did have to program for was deadlock errors,  and to be
able to rollback out of the tranaction and start it again.  At the time,
that seemed a small price to pay.

Rdb/VMS had two concurrency mechanisms,  one using lock manager, and the
other using multiple versions called "snapshots".  The snapshot mechanism
was mainly for providing a read only transaction with a consistent view of
the database while not blocking read write transactions.

I don't have that much experience with the other implementations,  but it
seems as though some of them place more of a burden on the programmer.

> To some extent, I think the "optimistic" implementations mimic this, but
> my complaint is that their mechanisms are usually out of the hands of
> the app programmer.

And that's a good thing,  except in exceptional circumstances.
paul c - 23 Dec 2007 01:41 GMT
>> ...
>>> It's my understanding that a consistent view of a database involves
[quoted text clipped - 47 lines]
>
> And that's a good thing,  except in exceptional circumstances.

In those days, most Cics and some IMS programmmers knew the four
necessary conditions for deadlock and avoided at least one of them in
their apps.  Nowadays that's considered high science.   Jim Gray was a
smart guy and I hope he's still alive but the fact remains that it seems
that he and other concurrency experts chose to ignore the IP.  I think
Bob B said something quite profound recently, something like "you can't
manage data without data".  A corollary might be that only mystics can
manage without data.  Normalization theory started with the IP and
caught a following, no coincidence if you ask me.  The concurrency
theorists carried on with the IMS and other physical ways of thinking
and never seemed to get the idea that there might be a logical
concurrency model based on the IP (at least as far as I know).
Bob Badour - 23 Dec 2007 03:25 GMT
>>> ...
>>>
[quoted text clipped - 68 lines]
> and never seemed to get the idea that there might be a logical
> concurrency model based on the IP (at least as far as I know).

IP? Intellectual property? Internet protocol? IMS programmer? ??
paul c - 23 Dec 2007 15:16 GMT
...
> IP? Intellectual property? Internet protocol? IMS programmer? ??

Information principle, although it wouldn't surprise me if somebody
usurped the term intellectual property to stand for a mystical attribute
or two!
Bob Badour - 23 Dec 2007 15:43 GMT
> ...
>
[quoted text clipped - 3 lines]
> usurped the term intellectual property to stand for a mystical attribute
> or two!

I am not sure Gray's stuff is incompatible with the information
principle. It seems orthogonal to me.
paul c - 23 Dec 2007 17:05 GMT
...
> I am not sure Gray's stuff is incompatible with the information
> principle. It seems orthogonal to me.

I wouldn't argue with that but the distinction I'd prefer to think of as
more useful is that concurrency theory viewed from the perspective of an
rdbms operates at a physical level, often, maybe always, using artifacts
such as "transaction id's, session id's" and so forth, that are hidden
from apps.

My complaint about conventional concurrency theory is that (as far as I
know) it has never tried to express how a purely logical view that is
visible to an application can achieve the same effects as a typical
concurrency or lock manager component. I think one likely reason for
this is that the concurrency theorists take physical persistence as
their starting point.  Given that, it's not surprising they would ignore
the possibility of a suitable app language using a relational algebra to
prototype the same effects.

While implementing this might seem to deny one of Codd's goals, that of
putting as much common logic in the rdbms rather than in app code, I
also note that for the larger apps I've seen, there never failed to be
some requirements compromise or other due to the concurrency strategy
imposed by various dbms's.

best wishes,
p
David BL - 24 Dec 2007 01:02 GMT
> ...
>
[quoted text clipped - 21 lines]
> some requirements compromise or other due to the concurrency strategy
> imposed by various dbms's.

Operational Transform (OT) tries to do exactly what you're alluding
to.  I have been investigating it for years now and can say that the
mathematics is complex and very interesting.  The basic idea is to
think of changes to data structures as operations that can be
generated and executed by different processes independently (and
concurrently) on the same state without any locking, and the
operations can be transformed as required to achieve the same logical
effect when they are post-applied in different orders by the
processes.  This holds the promise of allowing for replicated
databases which support local edits without being exposed to network
latency or outages, and for asynchronous merging of changes in a
similar fashion to CVS.  Unfortunately, and not surprisingly there is
a limited extent to which OT can preserve the original logical
intention of the operations due to conflicts.

I would be hesitant to talk about any alternatives to Gray's approach
with Badour.  He considers it sacrilegious.  I was plonked after only
a brief discussion when I first posted to cdt.
paul c - 24 Dec 2007 03:17 GMT
...
> Operational Transform (OT) tries to do exactly what you're alluding
> to.  I have been investigating it for years now and can say that the
[quoted text clipped - 11 lines]
> intention of the operations due to conflicts.
> ...

I didn't have replication in mind, but I will try to peruse whatever I
can find about "OT".  However, until simpler approaches have been
studied more thoroughly, I'll stick to my suspicion that the recognized
theorists all started out on the wrong foot.

My attitude about concurrency is likely much simpler in principle, if
not in practice.  Basically, an "ideal" rdbms would ignore concurrency!
 All a "transaction" can "know" is its data, aka the hand it's been
dealt and this is sufficient to eliminate what some other transactions
might be occupied with.  If it regurgitates the queries the designer
decides are pertinent to the transaction, at the "time" of update, along
with the expected query results, and those match, then the rdbms could
apply the transaction's expected updates, in perfect safety.

If a concurrent mechanism in a particular dbms allows a transaction to
behave according to data that is not "known" to its own app code,
obviously no other "transaction" should be allowed to see such data and
no update should be permitted in the first place. I vaguely recall that
Oracle was one of the first to have such a mechanism but the
implementation cost seemed excessive to me, ie., at least for
high-volume oltp-type transactions, the apps could have coded the same
things much cheaper, given a suitable "server".

(BTW, I think your deleted comment is unnecessary and irrelevant as far
as the topic is concerned.  Bob B has shown himself to be quite capable
of understanding such physical issues as replication which are in fact
much more elementary than his usual themes, which he is also quite
capable of dealing of.  Also, when it comes to synchronization, it is
amazing to me how hard it seems to be for experts to see how simple and
massive a problem it can be, eg., involving 2-phase commit on a large
scale, but I suspect not for an layman outsider once the jargon has been
explained.  Another thing concurrency theorists don't seem to have
considered much is the possibility of re-combining different relations
from different db's to accomplish a limited number of desired effects,
tactics as simple as storing details in one location and summaries in
another - if the two don't balance then the collective relations are
rejected, perhaps days later.  But that would require them to work at
the logical, not the physical level, which most of them would find
impossible, counter to their training and other biases.)
David BL - 24 Dec 2007 05:59 GMT
> ...
>
[quoted text clipped - 18 lines]
> studied more thoroughly, I'll stick to my suspicion that the recognized
> theorists all started out on the wrong foot.

The literature on OT focuses on the real time collaborative editing of
text.  IMO the published solutions tend to be incorrect or
impractical.  I'm not aware of anyone else researching the use of OT
for databases.

Wikipedia provides a reasonable overview

   http://en.wikipedia.org/wiki/Operational_transformation

> My attitude about concurrency is likely much simpler in principle, if
> not in practice.  Basically, an "ideal" rdbms would ignore concurrency!
[quoted text clipped - 4 lines]
> with the expected query results, and those match, then the rdbms could
> apply the transaction's expected updates, in perfect safety.

Isn't that just optimistic concurrency control?  But what happens when
the results don't match?

> If a concurrent mechanism in a particular dbms allows a transaction to
> behave according to data that is not "known" to its own app code,
[quoted text clipped - 4 lines]
> high-volume oltp-type transactions, the apps could have coded the same
> things much cheaper, given a suitable "server".

I don't think I follow you.  Are you saying that an application must
not perform an update following a dirty read?  The problem is not
detecting it, but what to do about it.  Depending on the application
rejection is not always practical, even if you can live with
application code that explicitly (or even implicitly) sits in a re-try
loop.

I can only see two ways to avoid rejection:

   1)  Locking - to serialise transactions.

   2)  Compensation (eg using OT).

Both have their pros and cons.

Is the comment about Oracle referring to Multi-Version Concurrency
Control (MVCC)?   Although very useful, MVCC only addresses the
important case of long running read only queries.

> (BTW, I think your deleted comment is unnecessary and irrelevant as far
> as the topic is concerned.  Bob B has shown himself to be quite capable
> of understanding such physical issues as replication which are in fact
> much more elementary than his usual themes, which he is also quite
> capable of dealing of.

My comment didn't concern Bob's intelligence or knowledge (which are
both impressive).

> Also, when it comes to synchronization, it is
> amazing to me how hard it seems to be for experts to see how simple and
> massive a problem it can be, eg., involving 2-phase commit on a large
> scale, but I suspect not for an layman outsider once the jargon has been
> explained.

I've never thought much of multiphase commit protocols.  Instead I'm a
fan of persistent message queues.  I was actually intending to post to
cdt to see if anyone could provide an example mandating a multiphase
commit protocol (ie where no solution exists using persistent message
queues).

> Another thing concurrency theorists don't seem to have
> considered much is the possibility of re-combining different relations
[quoted text clipped - 4 lines]
> the logical, not the physical level, which most of them would find
> impossible, counter to their training and other biases.)

I think storing asynchronous summaries in a different location is
rather straightforward because it wouldn't be an independent source of
updates.  A message queue can be used to bring the summaries DB up to
date.  A persistent message sequence number (stored in each DB) tells
you whether the summary DB is synchronised with the details DB, or
could allow a thread to block until the summaries have been
synchronised, preferably at the start of a transaction before it has
locked any resources.
Sampo Syreeni - 24 Dec 2007 15:30 GMT
> The literature on OT focuses on the real time collaborative editing of
> text. IMO the published solutions tend to be incorrect or impractical.
> I'm not aware of anyone else researching the use of OT for databases.

There is a long research tradition on semantic and multilevel
concurrency control. As I've said before, OT seems to be a special case
of those, in that all the operations on the underlying abstract datatype
are engineered to be either commutative or to have easily expressed
compensations. Transaction processing monitors routinely apply the more
general theory, which accounts for much of the performance increase such
middleware can offer.

> Is the comment about Oracle referring to Multi-Version Concurrency
> Control (MVCC)? Although very useful, MVCC only addresses the
> important case of long running read only queries.

Oracle's implementation has trouble even with those, because it's quite
difficult to guarantee that the version data necessary in the later
stages of the transaction do not accidentally get overwritten over the
course of execution.

> I've never thought much of multiphase commit protocols. Instead I'm a
> fan of persistent message queues.

I don't quite see them as solving the same problem. If your queue is
such that all of the updates posted in it are always applied, then in a
distributed environment it would seem that you need something like a
multiphase commit to get agreement on what to insert into the queue in
the first place. Or if some of the queued transactions can be rolled
back, then you need distributed agreement when the queued transaction is
finally executed. Finally, it obviously solves the problem if the
transactions are serialized before insertion into the queue, but then we
lose in concurrency and the queue easily becomes a central hot spot.

So, could you perhaps elaborate a bit?

> A message queue can be used to bring the summaries DB up to date. A
> persistent message sequence number (stored in each DB) tells you
> whether the summary DB is synchronised with the details DB, or could
> allow a thread to block until the summaries have been synchronised,
> preferably at the start of a transaction before it has locked any
> resources.

That's how most reasonable report databases/warehouses connect to their
respective OLTP sources, at least.
Signature

Sampo Syreeni, aka decoy - mailto:decoy@iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

David BL - 26 Dec 2007 00:28 GMT
> > I've never thought much of multiphase commit protocols. Instead I'm a
> > fan of persistent message queues.
[quoted text clipped - 10 lines]
>
> So, could you perhaps elaborate a bit?

Consider the following hypothetical example:  there are two
geographically separated databases storing account balances.  There is
no replication - each DB is recording independent accounts.

Suppose we want a transfer between accounts needing a distributed
transaction.   The two phase commit protocol will involve a "prepare
to commit" phase, and each DB will need to lock the relevant account
balances.  The locks aren't released until the participants
acknowledge the commit message (assuming the presumed rollback
protocol).  Network failures could make this take a very long time.
ie network failure can upset the autonomy of the databases.  How bad
is that!

Instead consider that we avoid TPC entirely, and use persistent
message queues...

Suppose we want to transfer from account balance Ax on machine X into
account balance Ay on machine Y.

Let X have a persistent message queue Qx that records one way messages
to be sent from X to Y.   Each message represents a deposit to be
performed on an account on Y.  Y records a persistent sequence number
Nx for the number of messages it has processed from Qx.  A middleware
layer allows Y to pump messages from Qx and it uses Nx to ensure it
processes each message exactly once.

Similarly Y has Qy for messages to be sent from Y to X and X records
sequence number Ny.

A successful transfer involves the following steps

1.   On X, a local transaction does the withdrawal on Ax and pushes
message onto Qx that represents a deposit that needs to be performed
on Ay.   Ax and Qx are updated atomically on X.

2.   Y reads message from Qx then opens a transaction that atomically
updates Ay and Nx.

Now consider that Y needs to cancel the transfer.  Eg this could occur
because by the time Y processes the deposit message the relevant
account has been closed.   Then we simply introduce a message to
cancel the transfer, which in a fashion can be compared to a presumed
commit TPC.  In fact a cancel of a transfer is nothing other than a
deposit message to undo the original withdrawal.

Note that when we account for the outstanding (ie not yet processed by
receiver) deposit messages, the total money in the system is conserved
by the local transactions.  Note as well that local transactions are
performed without any network I/O whatsoever.

The durability requirement on these transactions can be relaxed.  For
example when they commit they can release locks before flushing the
log, making the locks less sensitive to file I/O.  Accounts are locked
for extremely short periods.  It is a vastly superior approach to
TPC.

IMO the whole idea of a "distributed transaction" is disgusting.  It
seems tragic for a system to use them if they can be avoided.  I
wonder if they always can be (when one considers the "bigger picture"
and one isn't constrained by an existing legacy system).
Sampo Syreeni - 26 Dec 2007 17:20 GMT
> A middleware layer allows Y to pump messages from Qx and it uses Nx to
> ensure it processes each message exactly once.

Essentially what you're doing here is using a timestamping mechanism and
the acknowledgements from Y to implement a rudimentary, distributed
agreement protocol. It is used to assure that the message gets delivered
once, and only once. If the transaction touched more than two databases,
you would have to get agreement from all of them because failure to
apply the update against Y would imply that the effects against Z have
to be cancelled as well. That is, you haven't really gotten rid of the
agreement protocol, but only gone to an optimistic version of it, and
perhaps in the process even restricted it to a two database context
only.

> On X, a local transaction does the withdrawal on Ax and pushes message
> onto Qx that represents a deposit that needs to be performed on Ay.
> [...] Now consider that Y needs to cancel the transfer. Eg this could
> occur because by the time Y processes the deposit message the relevant
> account has been closed. [...] In fact a cancel of a transfer is
> nothing other than a deposit message to undo the original withdrawal.

In the literature this sort of thing is called an open nested
transaction, and the "cancel" is called a compensating
transaction/compensation. The construct is used precisely for the
reasons you cite, i.e. for less locking interference, higher concurrency
and general decoupling, but it also exacts a clear price. Under it
consistency cannot be guaranteed unless a) the only way to access the
database is through b) operations with known semantics, c) possessing
clearly defined undo/"compensation" operations, d) which commute
suitably amongst each other, and e) which have real world semantics
which allow us to conclude that this new, lower level of distributed
consistency is appropriate. If general (usually called "native")
transactions are allowed to proceed in parallel with the semantic
primitives, f) usual locking semantics still need to be available and to
be applied to control interactions between the native (lower level) and
the semantic (higher level) actions.

In fact, that you included account cancellations in the set of
permissible operations already break c) and d). Consider a situation
where some money is transferred from account X to account Y, Y gets
cancelled before the message is processed, and then X gets cancelled
before the reverse deposit is. Now the money is lost and no compensating
action is available. You could perhaps augment the system with zombie
accounts, cascading aborts and real world compensating actions (i.e.
trace the money back to where it came from, even across multiple dead
accounts, and mail it to the original owner), but that is complicated
and once you move outside of the banking field, in general you can't
guarantee that all of the necessary options are available.

As an example, consider a claims processing database. We want each claim
to always be allocated to a handler, and no more than say ten claims to
be allocated to a single handler at a time. If all handlers are fully
occupied, there might even be a legal obligation not to accept new
claims. If X now assigns all of his claims to Y and accepts ten new
claims for processing, no matter what we do from there on the real world
legal mistake of accepting too many claims for processing in toto cannot
be avoided, or rectified once it's occurred. Under global consistency in
the normal sense, that couldn't have happened because assignment would
only have been possible if Y were actually able to accept the extra load
then and there, atomically.

What allows you to get even as far as you did with message queues with
the account transaction example is the way money behaves in a real
economy, that is, semantics. For example, you can rely on money being
fungible and divisible, people not wanting throw to it away (unlike
burdens such as a claim to be processed), other people always wanting to
accept more of it, and for everybody to go to great pains (like keeping
useless accounts around for a time) in order not to lose any of it.
Obviously such constraints do not hold for all of the applications
databases can and do have.

> Note that when we account for the outstanding (ie not yet processed by
> receiver) deposit messages, the total money in the system is conserved
> by the local transactions.

The problem is that this sort of accounting involves hidden state, which
allows us to break global integrity constraints. The user visible state
of the system is no longer consistent; my money temporarily disappeared.
Of course this is what happens with current interbank transfers, and
there's nothing wrong with it as long as the real world semantics make
sense. But the usual notion of distributed consistency employed by
database researchers is much more ambitious than that. It also tries to
support all the rest of the applications which *don't* have anything to
do with accounting or its particular characteristics.

> The durability requirement on these transactions can be relaxed. For
> example when they commit they can release locks before flushing the
> log, making the locks less sensitive to file I/O.

Even this is not really true: the new state of the queue still needs to
be forced into stable storage before the transaction can safely return.
Otherwise a crash before the message insertion is made persistent would
cause an inconsistency between the states the client and the DBMS see.

> I wonder if they always can be (when one considers the "bigger
> picture" and one isn't constrained by an existing legacy system).

Hmm. I would not easily criticise the modern work on consistency and
concurrency based on a single example -- accounting -- that in itself is
several millennia of age.
Signature

Sampo Syreeni, aka decoy - mailto:decoy@iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

David BL - 27 Dec 2007 01:57 GMT
> > A middleware layer allows Y to pump messages from Qx and it uses Nx to
> > ensure it processes each message exactly once.
[quoted text clipped - 9 lines]
> perhaps in the process even restricted it to a two database context
> only.

But can you provide a realistic example, and offer a "proof" that
message queues are inadequate?   Does such a demonstration exist in
the literature?

> > On X, a local transaction does the withdrawal on Ax and pushes message
> > onto Qx that represents a deposit that needs to be performed on Ay.
[quoted text clipped - 25 lines]
> before the reverse deposit is. Now the money is lost and no compensating
> action is available.

When you say "Y is cancelled" do you mean an account on machine Y is
being closed?

Closing an account doesn't happen synchronously.  Consider that an
account on Y is being closed.

1.  A local transaction on Y marks the account as closed and "close
account" messages are posted to its peers (by pushing messages onto
the local queues)

2.  Given that peers only find out asynchronously that the account is
being closed there will be some period of time for which deposits
continue to be made into the account.  Y must continue pumping
messages from its peers.  Using a local transaction on Y, each deposit
into the account marked as closed must be cancelled by pushing a
compensating deposit onto the queue for the relevant peer.

3.  It is assumed that each peer (in a local transaction) eventually
reads the message that the account is being closed.  This stops it
from performing any further deposits into the account.  In the same
local transaction the peer posts an acknowledge message onto its local
queue that is pumped by Y.

4.  Eventually Y reads all the acknowledge messages from all its peers
and determines that the account is now properly closed, because by
ordered delivery of messages in the queues, it is safe to assume there
will be no outstanding deposit messages for that account in the
system.

> You could perhaps augment the system with zombie
> accounts, cascading aborts and real world compensating actions (i.e.
> trace the money back to where it came from, even across multiple dead
> accounts, and mail it to the original owner), but that is complicated
> and once you move outside of the banking field, in general you can't
> guarantee that all of the necessary options are available.

I think the above steps allow any number of accounts to be closed, and
no money will be lost.

> As an example, consider a claims processing database. We want each claim
> to always be allocated to a handler, and no more than say ten claims to
[quoted text clipped - 4 lines]
> legal mistake of accepting too many claims for processing in toto cannot
> be avoided, or rectified once it's occurred.

I don't agree.  There should be a concept of a claim that is not
currently allocated to a handler, and the claim is passed around the
network until it is assigned to a handler.  Allocation of a claim to a
handler is a local decision using a local transaction allowing the
legal obligation to be enforced.

> Under global consistency in
> the normal sense, that couldn't have happened because assignment would
> only have been possible if Y