Database Forum / General DB Topics / DB Theory / February 2008
Newbie question about db normalization theory: redundant keys OK?
|
|
Thread rating:  |
raylopez99 - 12 Dec 2007 22:26 GMT With a few hours of theory under my belt, I'd like to ask if there's ever a time that you don't want a completely normalized dB, that is, a normalized database being a dB that has no redundant information (my understanding of what a normalized database is).
Or, is there ever a time that you want redundant keys (that is, the same keys in many different tables, that obviously are not linked (in a relationship) between two tables?). Having redundant attributes and/ or keys seems to me a very lazy way of designing a database that doesn't require lots of initial thought, but of course you have to pay for it by meticulously "synching" all redundant keys to one another everytime there is a change in one of the redundant keys, so the keys don't drift and have different values.
But is there ever a time you want to do this?
THanks in advance
RL
David Cressey - 13 Dec 2007 12:14 GMT > With a few hours of theory under my belt, I'd like to ask if there's > ever a time that you don't want a completely normalized dB, that is, a [quoted text clipped - 15 lines] > > RL The answer is yes, there are times when a design is a good one, even if less than fully normalized. For each normalization form, there is a known set of anomalies that come up when you insert, update, or delete data in that form. If you are willing and able to program around those anomalies, and if the design yields benefits that justify that effort, it can be the right thing to do. Learning when to normalize is more subtle than learning how to normalize.
There is a particular form of database design, called "star schema" that yields good results when used in a data mart or data warehouse situation. A star schema mimics a multidimensional database in relational (or SQL) form. A star schema follows design rules of its own, and those rules sometimes contradict the rules of normalization. The up side of star schema is that it's very easy to use with report generators, or with OLAP tools like Cognos or Business Objects. The down side of star schema is that the process of keeping the data current involves some fairly intricate programming, and heavy use of computer resources.
Star schema, and other unnormalized or denormalized designs almost always cost more than they are worth when used in a high transaction operational setting, like OLTP.
Unfortunately, most deviations from normalization occur due to blunders, and not due to well considered design decisions. Many deviations from normalization occur because the designer is unfamiliar with some of the normal forms. Back when I was building databases, I only really knew 1NF, 2NF, and 3NF. Update anomalies due to deviations from BCNF and beyond were rare, but my design process would not have obviated them.
Another major cause of deviations from normalization is failure to understand the data. In particular, the functional dependendencies inherent in the data are not discovered during data analysis, and the design unknowingly violates normalization rules. By the time this is discovered, there is usually a large body of application code that is dependent of the bad design.
Sometimes, denormalized design is the reult of sheer pigheadedness.
Bob Badour - 13 Dec 2007 17:30 GMT >>With a few hours of theory under my belt, I'd like to ask if there's >>ever a time that you don't want a completely normalized dB, that is, a [quoted text clipped - 53 lines] > > Sometimes, denormalized design is the reult of sheer pigheadedness. Don't listen to a work David says. Star schema was sold by Cognos and Business Objects so their customers would have to do the work they should have done in the first place.
I seldom see anyone 'denormalize' who is aware of the actual costs of doing so. On the other hand, I have seen plenty of ignoramuses 'denormalize' when physical clustering for the same performance characteristics was an available option.
Dr. Dweeb - 16 Dec 2007 20:54 GMT >>> With a few hours of theory under my belt, I'd like to ask if there's >>> ever a time that you don't want a completely normalized dB, that [quoted text clipped - 61 lines] > 'denormalize' when physical clustering for the same performance > characteristics was an available option. Me too. The inability of a large number of "dba"s to distinguish between logical and physical and to know which is which is trult astonishing - and quite sad, given the fundamental nature of this distinction and its relevance to relational databases.
This forum is full of examples of people who just have not understood it, and propose bizarre ideas as "performance" enhancing, essentially because they haven't got a clue what an optimizer is, why it is there, and just what forms of optimization are in fact possible within any specific product. Throw in physical proximity and other physical placement solutions and one quickly realises that most problems are solvable without dicking around with the logical model.
Dr. Dweeb (an occasional lurker and OracleRdb guru)
-CELKO- - 13 Dec 2007 14:12 GMT CREATE TABLE Schedule (teacher VARCHAR(15) NOT NULL, class CHAR(15) NOT NULL, room INTEGER NOT NULL, period INTEGER NOT NULL, PRIMARY KEY (teacher, class, room, period));
That choice of a primary key is the most obvious one -- use all the columns. Typical rows would look like this:
('Mr. Celko', 'Database 101', 222, 6)
The rules we want to enforce are:
1) A teacher is in only one room each period. 2) A teacher teaches only one class each period. 3) A room has only one class each period. 4) A room has only one teacher in it each period.
Stop reading and see what you come up with for an answer. Okay, now consider using one constraint for each rule in the list, thus.
CREATE TABLE Schedule_1 -- version one, wrong! (teacher VARCHAR(15) NOT NULL, class CHAR(15) NOT NULL, room INTEGER NOT NULL, period INTEGER NOT NULL, UNIQUE (teacher, room, period), -- rule #1 UNIQUE (teacher, class, period), -- rule #2 UNIQUE (class, room, period), -- rule #3 UNIQUE (teacher, room, period), -- rule #4 PRIMARY KEY (teacher, class, room, period));
We know that there are 4 ways to pick three things from a set of four things permutation. While column order is important in creating an index, we can ignore it for now and then worry about index tuning later in the book.
I could drop the PRIMARY KEY as redundant if I have all four of these constraints in place. But what happens if I drop the PRIMARY KEY and then one of the constraints?
CREATE TABLE Schedule_2 -- still wrong (teacher VARCHAR(15) NOT NULL, class CHAR(15) NOT NULL, room INTEGER NOT NULL, period INTEGER NOT NULL, UNIQUE (teacher, room, period), -- rule #1 UNIQUE (teacher, class, period), -- rule #2 UNIQUE (class, room, period)); -- rule #3
I can now insert these rows in the second version of the table:
('Mr. Celko', 'Database 101', 222, 6) ('Mr. Celko', 'Database 102', 223, 6)
This gives me a very tough sixth period class load since I have to be in two different rooms at the same time. Things can get even worse when another teacher is added to the schedule:
('Mr. Celko', 'Database 101', 222, 6) ('Mr. Celko', 'Database 102', 223, 6) ('Ms. Shields', 'Database 101', 223, 6)
Ms. Shields and I are both in room 223, trying to teach different classes at the same time. Matthew Burr looked at the constraints and the rules, came up with this analysis.
CREATE TABLE Schedule_3 -- corrected version (teacher VARCHAR(15) NOT NULL, class CHAR(15) NOT NULL, room INTEGER NOT NULL, period INTEGER NOT NULL, UNIQUE (teacher, period), -- rules #1 and #2 UNIQUE (room, period)); -- rules #3 and #4
If a teacher is in only one room each period, then given a period and a teacher I should be able to determine only one room, i.e. room is functionally dependent upon the combination of teacher and period. Likewise, if a teacher teaches only one class each period, then class is functionally dependent upon the combination of teacher and period. The same thinking holds for the last two rules: class is functionally dependent upon the combination of room and period, and teacher is functionally dependent upon the combination of room and period.
With the constraints that were provided in the first version, you will find that the rules are not enforced. For example, I could enter the following rows:
('Mr. Celko', 'Database 101', 222, 6) ('Mr. Celko', 'Database 102', 223, 6)
These rows violate two of your rules, rule #1 and rule #2. However, the unique constraints first provided in Schedule_2 do not capture this violation and will allow the rows to be entered.
The constraint
UNIQUE (teacher, room, period)
is checking the complete combination of teacher, room, and period, and since ('Mr. Celko', 222, 6) is different from ('Mr. Celko', 223, 6), the DDL does not find any problem with both rows being entered, even though that means that Mr. Celko is in more than one room during the same period.
UNIQUE (teacher, class, period)
doesn't catch its associated rule either since ('Mr. Celko', 'Database 101', 6) is different from ('Mr. Celko', 'Database 102', 6), and so, Mr. Celko is able to teach more than one class during the same period, thus violating rule two. It seems that we'd also be able to add the following row:
('Ms. Shields', 'Database 103', 222, 6)
which violates rules #3 and #4.
Tony Rogerson - 14 Dec 2007 08:01 GMT > CREATE TABLE Schedule > (teacher VARCHAR(15) NOT NULL, [quoted text clipped - 7 lines] > > ('Mr. Celko', 'Database 101', 222, 6) And half way through term the teacher changed their name to 'Mrs Bunting' and now everybody is confused!
Great example of why you should use an artificial key, I'll remember your example for a blog entry I'm writing.
 Signature Tony Rogerson, SQL Server MVP http://sqlblogcasts.com/blogs/tonyrogerson [Ramblings from the field from a SQL consultant] http://sqlserverfaq.com [UK SQL User Community]
Roy Hann - 14 Dec 2007 08:35 GMT >> CREATE TABLE Schedule >> (teacher VARCHAR(15) NOT NULL, [quoted text clipped - 13 lines] > Great example of why you should use an artificial key, I'll remember your > example for a blog entry I'm writing. I'll remember this. It is a great example of why no one should bother reading your blog.
Roy
Tony Rogerson - 14 Dec 2007 12:16 GMT > I'll remember this. It is a great example of why no one should bother > reading your blog. Oh, I see Roy; you agree with Celko's design.
Interesting....
 Signature Tony Rogerson, SQL Server MVP http://sqlblogcasts.com/blogs/tonyrogerson [Ramblings from the field from a SQL consultant] http://sqlserverfaq.com [UK SQL User Community]
David Portas - 14 Dec 2007 14:59 GMT > > I'll remember this. It is a great example of why no one should bother > > reading your blog. [quoted text clipped - 7 lines] > [Ramblings from the field from a SQL consultant]http://sqlserverfaq.com > [UK SQL User Community] Tony,
You didn't specify any reason why you thought Joe's design was wrong. There is no obvious reason for confusion just because a teacher's name changes - at least not as far as the data model is concerned. Users of the data need a way to identify their teacher but how would an "artificial key" help? I never knew any of my teachers as 1234.
-- David Portas
Tony Rogerson - 14 Dec 2007 15:49 GMT > You didn't specify any reason why you thought Joe's design was wrong. > There is no obvious reason for confusion just because a teacher's name > changes - at least not as far as the data model is concerned. Users of > the data need a way to identify their teacher but how would an > "artificial key" help? I never knew any of my teachers as 1234. I thought I had.
On my printed off time table it states 'Ms Fred', now, half way through the school year while on a term break the teacher gets married and is now called 'Mrs Sid' - how does my timetable now link back to the original entity - it can't, now, I return back from my holiday and wander around campus looking for a teached called 'Ms Fred' but to no avail.
Now, a new teacher starts towards the end of the school year, called 'Ms Fred', this is a completely different person from the one at the start of term, in fact they teach cooking rather than engineering; now, my timetable correctly links up and I can find my teacher - 'Ms Fred', only, the problem is I'm at the wrong class - I no longer recognise the subject content - what breaking eggs and cooking cakes has to do with engineering?
Do you see my point yet? and no, don't start talking about it being identified by classroom - I remember when I was at FE the room we where taught in, in fact the building sometimes often changed from week to week
Another one is somebody I do business with in Microsoft for the community, she recently got married - she cannot change her email address because if she did when we emailed her it would bounce because we would be using the old one.
These are all problems with using the natural key - the natural key changes! You need to create an imutable key; never changes so these real problems can not and do not occur.
I'm interested - how would you solve this problem in the real world? Would you prevent the teacher from changing their name?
 Signature Tony Rogerson, SQL Server MVP http://sqlblogcasts.com/blogs/tonyrogerson [Ramblings from the field from a SQL consultant] http://sqlserverfaq.com [UK SQL User Community]
David Portas - 14 Dec 2007 16:13 GMT > > You didn't specify any reason why you thought Joe's design was wrong. > > There is no obvious reason for confusion just because a teacher's name [quoted text clipped - 18 lines] > > Do you see my point yet? If the database is correctly updated to reflect any name changes then no problem arises. If there is a requirement to reflect the history of previous names then that information can also be recorded. In either case I don't see any call for anything like an "artificial key".
> Another one is somebody I do business with in Microsoft for the community, > she recently got married - she cannot change her email address because if [quoted text clipped - 4 lines] > You need to create an imutable key; never changes so these real problems can > not and do not occur. I don't understand what would prevent you from updating the email address in the database. The choice of key should have nothing to do with it.
-- David Portas
Tony Rogerson - 14 Dec 2007 17:07 GMT > If the database is correctly updated to reflect any name changes then > no problem arises. If there is a requirement to reflect the history of > previous names then that information can also be recorded. In either > case I don't see any call for anything like an "artificial key". CREATE TABLE Schedule (teacher VARCHAR(15) NOT NULL, class CHAR(15) NOT NULL, room INTEGER NOT NULL, period INTEGER NOT NULL, PRIMARY KEY (teacher, class, room, period));
Does the above do that? No.
Part 2; in the real world how would the history work?
Perhaps this....
CREATE TABLE Schedule (teacher VARCHAR(15) NOT NULL, class CHAR(15) NOT NULL, room INTEGER NOT NULL, period INTEGER NOT NULL, start_date datetime not null, end_date datetime null check( end_date > start_date ), PRIMARY KEY (teacher, class, room, period, start_end, end_date ));
Nope, I still cannot find my teacher (SELECT * FROM Schedule WHERE teacher = 'Ms Fred'); all I know now is that my teacher no longer is a teacher which is wrong.
How would you store the history given that the natural key changes WITHOUT using an artificial key?
> I don't understand what would prevent you from updating the email > address in the database. The choice of key should have nothing to do > with it. The real world gets in the way.
Upto and including Monday the person has the email address 'julie.smith@fred.com'
On Tuesday she gets married and changes her email address to 'julie.jones@fred.com'
Do you not see a problem with that if the designer had followed the same approach above.
I keep trying to email julie.smith@fred.com only that email address no longer exists so I can no longer communicate with her.
 Signature Tony Rogerson, SQL Server MVP http://sqlblogcasts.com/blogs/tonyrogerson [Ramblings from the field from a SQL consultant] http://sqlserverfaq.com [UK SQL User Community]
David Portas - 14 Dec 2007 17:36 GMT > > If the database is correctly updated to reflect any name changes then > > no problem arises. If there is a requirement to reflect the history of [quoted text clipped - 52 lines] > [Ramblings from the field from a SQL consultant]http://sqlserverfaq.com > [UK SQL User Community] I genuinely am having trouble understanding what any of this has to do with saying that Joe's design is "wrong". None of the problems you have mentioned are the usual ones given for using "artificial" keys - not that I've seen anyway. I don't doubt that you have some real issues in mind but I don't think you are explaining them very precisely: "The real world gets in the way" tells us nothing about why it would be a problem to update email address X to become email address Y. I just don't see what you are getting at.
For methods of recording the history of change Date, Darwen and Lorentzos have a book "Temporal Databases and the Relational Model", which discusses the issues and solutions at length.
-- David Portas
Tony Rogerson - 14 Dec 2007 17:51 GMT First of all, let's go back to your statement about just store the history - can you please show how you would do that with celko schema and still be able to reach the rows as per my examples?
> precisely: "The real world gets in the way" tells us nothing about why > it would be a problem to update email address X to become email > address Y. I just don't see what you are getting at. Here's another...
create table blah ( emiladdress varchar(?) not null unique, who varchar(100) not null )
In the year 2006 I have an email address called tonyrogerson@sqlserver.eu.com everybody sends email to that address.
insert blah values( 'tonyrogerson@sqlserver.eu.com' )
On 1st Jan 2007 I change the email address to tonyrogerson@torver.net I suddenly stop receiving emails because everybody is still emailing tonyrogerson@sqlserver.eu.com.
update blah set emailaddress = 'tonyrogerson@torver.net' where emailaddress = 'tonyrogerson@sqlserver.eu.com'
How do all the applications disconnected from the database now reach the record? They can't - the natural key has changed.
 Signature Tony Rogerson, SQL Server MVP http://sqlblogcasts.com/blogs/tonyrogerson [Ramblings from the field from a SQL consultant] http://sqlserverfaq.com [UK SQL User Community]
Ross Presser - 14 Dec 2007 18:01 GMT > create table blah ( > emiladdress varchar(?) not null unique, [quoted text clipped - 15 lines] > How do all the applications disconnected from the database now reach the > record? They can't - the natural key has changed. There are two ways I would answer this. The first says that email address isn't a natural key -- it doesn't uniquely define a person any more than hair color does. The second is that the natural key hasn't changed, the entity -- a mailbox on a mail server -- has changed.
Tony Rogerson - 14 Dec 2007 18:14 GMT > There are two ways I would answer this. The first says that email > address isn't a natural key -- it doesn't uniquely define a person any > more than hair color does. The second is that the natural key hasn't > changed, the entity -- a mailbox on a mail server -- has changed. Absolutely - so we need to use something that won't change - an artificial key.
 Signature Tony Rogerson, SQL Server MVP http://sqlblogcasts.com/blogs/tonyrogerson [Ramblings from the field from a SQL consultant] http://sqlserverfaq.com [UK SQL User Community]
David Portas - 14 Dec 2007 18:39 GMT > First of all, let's go back to your statement about just store the history - > can you please show how you would do that with celko schema and still be [quoted text clipped - 25 lines] > How do all the applications disconnected from the database now reach the > record? They can't - the natural key has changed. You are saying that this is a distributed database but that it lacks a mechanism for accurately propagating changes out to all its nodes? Well in my view such a DBMS would be broken. It surely violates Codd's principle of "Distribution Independence". Let's follow your example to its conclusion though. The solution is to replace whatever copy of the Blah relation exists in the application with the new Blah relation that superceded it. Now all emails reach the correct address and there is no problem that requires a different key.
-- David Portas
Tony Rogerson - 14 Dec 2007 18:55 GMT > You are saying that this is a distributed database but that it lacks a > mechanism for accurately propagating changes out to all its nodes? Nope - it's a normal database you have simply done a query against the database - the result of SELECT and then return to the database to validate your data against it - think about it; your insurance company posts you a claim form and you then send it back in - you need to validate what's on the form (our equiv of time table) against the database.
Also, please can you show me how you would simply put history in the table with celko's example.
 Signature Tony Rogerson, SQL Server MVP http://sqlblogcasts.com/blogs/tonyrogerson [Ramblings from the field from a SQL consultant] http://sqlserverfaq.com [UK SQL User Community]
Brian Selzer - 15 Dec 2007 05:17 GMT >> First of all, let's go back to your statement about just store the >> history - [quoted text clipped - 36 lines] > that superceded it. Now all emails reach the correct address and there > is no problem that requires a different key. Forgive me for butting in, David, but where did you come up with the idea that it is a distributed database? What have disconnected applications--that is, applications that use something akin to disconnected ADO recordsets or ADO.NET datasets--to do with distributed databases?
The question is: for how long is the data that was just read out of the database considered to be valid? Until the next update? Or is it stale as soon as its read? Does it have something to do with transaction control or locking? If several updates occur between the reading of one piece of information and the reading of another, how can you be sure that any answer that involves both pieces of information is correct? How can you be sure that you haven't read the same information twice? If you use an artificial key and a timestamp (or rowversion), then there can be no doubt as to whether or not the information in question changed between the first reading and the second.
> -- > David Portas David Portas - 15 Dec 2007 09:08 GMT >> You are saying that this is a distributed database but that it lacks a >> mechanism for accurately propagating changes out to all its nodes? [quoted text clipped - 18 lines] > any answer that involves both pieces of information is correct? How can > you be sure that you haven't read the same information twice? I agree that these are important issues of application design. I don't think they need to affect the database logical design in this case.
> If you use an artificial key and a timestamp (or rowversion), then there > can be no doubt as to whether or not the information in question changed > between the first reading and the second. A ROWVERSION (as defined by Microsoft SQL Server) does not tell you whether any data has changed. It tells you whether some rows were possibly affected by update operations. In other words it can give false positives - indicating a change where there is none.
BTW I seriously doubt whether it would be possible or desirable to implement anything like a ROWVERSION in a true RDBMS. The consequences of SQL Server's implementation are serious because it attempts to identify row data based on something other than keys. I have never been a fan of the ROWVERSION feature.
The only way to tell whether the current state of the database equals some previous state is to query again and compare that to a result previously retrieved. That comparison is usually based on a key value. It makes no difference what type of value is used for the key. The comparison is exactly the same whether it is an "artificial" key or otherwise. (I'm not too concerned about defining what an "artificial" key is because I don't think it matters).
Consider a set of transformations:
T1:a -> T2:b -> T3:c -> T4:a
Where "T1:a" means "The value of the database at time T1 is a". When we requery the database at time T4 all we will ever know is that the value is the same as at time T1. If it is a requirement to know about the previous updates at T2 and T3 then obviously we ought to preserve that information in the database - but in many cases it is quite reasonable not to do that.
 Signature David Portas
David Cressey - 15 Dec 2007 13:13 GMT > >> You are saying that this is a distributed database but that it lacks a > >> mechanism for accurately propagating changes out to all its nodes? [quoted text clipped - 21 lines] > I agree that these are important issues of application design. I don't think > they need to affect the database logical design in this case. They need not afect database design for a single database built in isolation. However, if there is a data architecture that pervades the entire enterprise, and both databases and applications inherit the semantics of the data and to some extent the form, from that pervasive data architecture, then the issues raised become very relevant.
Bob Badour - 15 Dec 2007 13:49 GMT >>>You are saying that this is a distributed database but that it lacks a >>>mechanism for accurately propagating changes out to all its nodes? [quoted text clipped - 10 lines] >>disconnected ADO recordsets or ADO.NET datasets--to do with distributed >>databases? Managing data on a server... and managing data on a client... um, and the idiot is too stupid to see what that has to do with distributed data management?!? Yikes!
Just because the idiot chose a piss-poor way to manage data at one node in the system doesn't change the fundamental nature of what's going on.
David, I don't know why you bother with self-aggrandizing ignorants like Selzer.
[snip]
Brian Selzer - 15 Dec 2007 19:37 GMT >>>>You are saying that this is a distributed database but that it lacks a >>>>mechanism for accurately propagating changes out to all its nodes? [quoted text clipped - 20 lines] > David, I don't know why you bother with self-aggrandizing ignorants like > Selzer. I believe that with this post you have proven yourself to be ignorant. You appear to not have a clue about how relational data is used in applications, nor does it appear that you can distinguish between a distributed database and an application. Perhaps you should consider keeping your mouth closed, unless, of course, you enjoy the taste of your feet.
> [snip] Brian Selzer - 15 Dec 2007 19:38 GMT >>> You are saying that this is a distributed database but that it lacks a >>> mechanism for accurately propagating changes out to all its nodes? [quoted text clipped - 21 lines] > I agree that these are important issues of application design. I don't > think they need to affect the database logical design in this case. I think they do. For a single state, first order logic suffices; once more than one state is involved, at a minimum some form of modal logic is necessary.
>> If you use an artificial key and a timestamp (or rowversion), then there >> can be no doubt as to whether or not the information in question changed [quoted text clipped - 4 lines] > affected by update operations. In other words it can give false > positives - indicating a change where there is none. You're right, of course: it only tells whether some rows were the target of an update, not whether the information is actually different.
> BTW I seriously doubt whether it would be possible or desirable to > implement anything like a ROWVERSION in a true RDBMS. The consequences of > SQL Server's implementation are serious because it attempts to identify > row data based on something other than keys. I have never been a fan of > the ROWVERSION feature. In what way does it attempt to identify row data based on something other than keys?
> The only way to tell whether the current state of the database equals some > previous state is to query again and compare that to a result previously [quoted text clipped - 3 lines] > too concerned about defining what an "artificial" key is because I don't > think it matters). I disagree. Unless the identifier is a rigid designator or a rigid definite description, then any comparison based upon that identifier is suspect. It may be the case that the key values are identical, but the individuals in each state that are identified by that key value are different individuals. For example, "the first person in line" can be different people at different times.
> Consider a set of transformations: > [quoted text clipped - 5 lines] > updates at T2 and T3 then obviously we ought to preserve that information > in the database - but in many cases it is quite reasonable not to do that. David Portas - 16 Dec 2007 14:14 GMT >> The only way to tell whether the current state of the database equals >> some previous state is to query again and compare that to a result [quoted text clipped - 10 lines] > different individuals. For example, "the first person in line" can be > different people at different times. Isn't that purely a question of what propositions we want to represent?
Assume that the tuple:
A: ('Celko', 'Database 101', 222, 6)
represents the proposition:
P: "Teacher named Celko teaches class 'Database 101' in room 222 during period 6"
which we assume to be true at a certain point in time. At some later time we replace tuple A with two different tuples:
B: ('Selzer', 'Database 101', 222, 6) C: ('Celko', 'Zen Buddhism', 223, 6)
representing propositions of the same kind as P.
Those propositions assert absolutely nothing about whether Celko in A is the same individual as Celko in C. They assert nothing about whether room 222 in the first case is the same as in the second. Crucially, they DO NOT assert whether any individual previously known as "Celko" changed his name to "Selzer" or whether he was replaced by an entirely different person called "Selzer".
The reason that you (and Tony I guess) think it is "suspect" to compare tuple A with tuple B is that you are applying your own intended interpretation, which the database was never designed to support. There are other interpretations however, which are potentially valid (the interpretations of the proposition P kind for example). If those valid interpretations are the ones required and understood by the users of the data then the database is quite sufficient as proposed.
>> BTW I seriously doubt whether it would be possible or desirable to >> implement anything like a ROWVERSION in a true RDBMS. The consequences of [quoted text clipped - 4 lines] > In what way does it attempt to identify row data based on something other > than keys? A problem arises with relational assignment - something that SQL doesn't support. In an assignment it is meaningless to talk of individual tuples being updated or left unchanged. The only valid basis for anything like ROWVERSION might be a comparison of the relation value before the assigment and the relation value after. But the ROWVERSION feature has no means of specifying a key on which to perform such a comparison (presumably the comparison would be a join on one of the candidate keys). ROWVERSION just assumes that such a comparison can be made with or without a key (this is SQL after all!).
Every kind of update in an RDBMS must surely be equivalent to some relational assignment. That's why I seriously doubt whether anything quite like ROWVERSION is possible or desirable.
 Signature David Portas
Tony Rogerson - 16 Dec 2007 22:06 GMT David,
Stop trying to divert; you got it wrong; just leave it there.
Here is the original ddl...
CREATE TABLE Schedule (teacher VARCHAR(15) NOT NULL, class CHAR(15) NOT NULL, room INTEGER NOT NULL, period INTEGER NOT NULL, PRIMARY KEY (teacher, class, room, period));
ROWVERSION has nothing at all to do with this I really don't understand why you are trying to deflect the thread to that.
The simple matter of the fact is that on revisting the database with the natural key you obtained (which should not have changed) has changed and points to completely different data.
That is why we need an immutable key.
And... I'm still waiting for you to show me how to put history into celko's schema that would fix the problem where the natural key changes - why oh why are you STILL ignoring that request?
 Signature Tony Rogerson, SQL Server MVP http://sqlblogcasts.com/blogs/tonyrogerson [Ramblings from the field from a SQL consultant] http://sqlserverfaq.com [UK SQL User Community]
David Portas - 16 Dec 2007 22:38 GMT > David, > > Stop trying to divert; you got it wrong; just leave it there. I'm surprised if you assume there is a single right or wrong answer. I generally try to keep an open mind which is why I posed my responses as questions rather than claim that anything you said was wrong.
> ROWVERSION has nothing at all to do with this I really don't understand > why you are trying to deflect the thread to that. I agree ROWVERSION has nothing to do with it. I didn't introduce the topic. Brian did.
> The simple matter of the fact is that on revisting the database with the > natural key you obtained (which should not have changed) has changed and > points to completely different data. > > That is why we need an immutable key. That's a straw man argument. If you design a solution on the assumption that data will not change then you will certainly have problems. I agree with you. I just don't advocate designing systems that way. Now if the rows referenced by your "immutable" key have been deleted, what result do you intend in that case? You don't assume that they will always exist do you? Of course you don't - you determine the right response and design the appropriate solution on the basis that the data may have changed.
> And... I'm still waiting for you to show me how to put history into > celko's schema that would fix the problem where the natural key changes - > why oh why are you STILL ignoring that request? I already explained.
 Signature David Portas
Brian Selzer - 17 Dec 2007 03:55 GMT >>> The only way to tell whether the current state of the database equals >>> some previous state is to query again and compare that to a result [quoted text clipped - 44 lines] > interpretations are the ones required and understood by the users of the > data then the database is quite sufficient as proposed. Isn't it true that with FOL, there must be separate interpretations for tuple A and tuple B, since they belong to different database values, whereas with modal or temporal logic, the same interpretation can apply to both? Since it is under an interpretation that values are assigned meaning, the separate interpretations for tuple A and tuple B permit identical values to mean totally different things or different values to mean the same thing. Without a common interpretation, it cannot be determined whether or not the individuals referenced by the key values in tuple A and tuple B are the same individual or different individuals. This is why I think it is suspect to compare tuple A with tuple B unless the key is a rigid designator or a rigid definite description. Note that the presence of a rigid designator or rigid definite description implies a common interpretation under which comparisons are no longer suspect.
>>> BTW I seriously doubt whether it would be possible or desirable to >>> implement anything like a ROWVERSION in a true RDBMS. The consequences [quoted text clipped - 18 lines] > relational assignment. That's why I seriously doubt whether anything quite > like ROWVERSION is possible or desirable. That assumes that relational assignment is primitive. I would argue that insert, update and delete are the primitive operations, and that assignment is a shorthand for a combination of the primitives delete and insert. Information is lost when an update is translated into an assignment, but not so the reverse: an assignment can always be translated into a delete and an insert.
David Portas - 18 Dec 2007 22:03 GMT > Isn't it true that with FOL, there must be separate interpretations for > tuple A and tuple B, since they belong to different database values, > whereas with modal or temporal logic, the same interpretation can apply to > both? I don't know enough about modal logic to comment on that part. I don't see how you conclude that for FOL there MUST be separate interpretations just because the values have different valid times. It seems like there are temporal database models and even logic programming that don't require anything like modal logic to produce useful results.
> That assumes that relational assignment is primitive. I would argue that > insert, update and delete are the primitive operations, and that > assignment is a shorthand for a combination of the primitives delete and > insert. Information is lost when an update is translated into an > assignment, but not so the reverse: an assignment can always be translated > into a delete and an insert. I think it is self-evident that assignment can preserve exactly as much information as the user requires it to. If the information in question can be represented as values within relations then, ipso facto, a relational assignment is sufficient. Any form of DELETE or INSERT operator therefore doesn't add any expressive power. At best it would be a syntax shortcut of some kind.
 Signature David Portas
Brian Selzer - 19 Dec 2007 05:34 GMT >> Isn't it true that with FOL, there must be separate interpretations for >> tuple A and tuple B, since they belong to different database values, [quoted text clipped - 6 lines] > temporal database models and even logic programming that don't require > anything like modal logic to produce useful results. I believe one of the logicians out there can answer better than I, but the way I understand it, an interpretation is an assignment. For modal logic, an interpretation assigns meaning to every symbol that appears in any possible world, and then one of those possible worlds is determined to be the actual world. As a consequence, no matter how often a symbol appears in however many possible worlds, the interpretation involves a single assignment. First order logic doesn't deal with possibility and necessity, so as far as I know, the possible worlds interpretation doesn't apply. That's why you need separate FOL interpretations for database values that occur at different times: each distinct database value necessarily involves a separate assignment. How, for example, can you assign meaning for something that doesn't exist? If Baby Mike will be born on Halloween, 2008, then how can you use the same interpretation for a database today as a database on Christmas, 2008? With modal logic, the possibility that Baby Mike will be born can be represented, and therefore meaning can be assigned to the symbol for Baby Mike.
>> That assumes that relational assignment is primitive. I would argue that >> insert, update and delete are the primitive operations, and that [quoted text clipped - 9 lines] > doesn't add any expressive power. At best it would be a syntax shortcut of > some kind. You're right about DELETE and INSERT, but not about UPDATE. If you look at UPDATE in TTM (pps. 112-113), you can see what happens:
UPDATE r ( Ai := X, Aj := Y)
where i <> j is supposedly equivalent to
( ( EXTEND r ADD ( X AS Bi, Y AS Bj ) ) { ALL BUT Ai, Aj } ) RENAME ( Bi AS Bk, Bj AS Aj, Bk AS Ai )
where Bi, Bj, and Bk are arbitrary distinct attribute names that do not appear in r.
Now if you look at the result of
EXTEND r ADD ( X AS Bi, Y AS Bj ),
you'll notice that each tuple has both the old values and the new values for each affected attribute. Clearly if Ai or Aj is prime, then that information that ties each tuple in the result to its corresponding tuple in r is projected away by "{ ALL BUT Ai, Aj }". Information is lost. The result may be the same, but how that result was arrived at is lost in translation, and therefore cannot be verified.
David Portas - 19 Dec 2007 11:41 GMT > You're right about DELETE and INSERT, but not about UPDATE. If you look > at UPDATE in TTM (pps. 112-113), you can see what happens: [quoted text clipped - 19 lines] > result may be the same, but how that result was arrived at is lost in > translation, and therefore cannot be verified. "Information is lost" but no-one ever said that relational operators have to be reversible. I don't know what you mean by "cannot be verified" - the result is what it is. Assignment is still primitive because you haven't shown how an UPDATE can give a result that cannot be obtained by assignment.
 Signature David Portas
Brian Selzer - 19 Dec 2007 19:32 GMT >> You're right about DELETE and INSERT, but not about UPDATE. If you look >> at UPDATE in TTM (pps. 112-113), you can see what happens: [quoted text clipped - 25 lines] > shown how an UPDATE can give a result that cannot be obtained by > assignment. If assignment is primitive, then it cannot necessarily be determined whether the properties of an individual that existed before an assignment are now different or that an individual that did not exist before the assignment now does. If, on the other hand, update is primitive, then there is no doubt. Since an update specifically targets information about what already exists, and since the update includes information about those individuals both before and after, there is no doubt as to whether the individuals in question existed. This is what I mean by "cannot be verified."
I think I indicated that the result may be the same. If I want a new Ferrari, I can buy one or I can steal one. In both cases the result is the same--I would possess a new Ferrari--but how I ultimately arrived at that result is significant, because in the one case I would be in debt up to my eyeballs, but in the other case I would be on the run from the law.
David Portas - 19 Dec 2007 20:03 GMT > If assignment is primitive, then it cannot necessarily be determined > whether the properties of an individual that existed before an assignment > are now different or that an individual that did not exist before the > assignment now does. Both those questions could be answered by comparing the relation value before the assignment to the one after it. For reasons that aren't clear to me you think that is not allowed or invalid in some way. I think I'm not the only one who will take some convincing.
> If, on the other hand, update is primitive, then there is no doubt. Since > an update specifically targets information about what already exists, and > since the update includes information about those individuals both before > and after, there is no doubt as to whether the individuals in question > existed. This is what I mean by "cannot be verified." To attribute any real world meaning to a humble operator is a bizarre leap beyond RM. RM says absolutely nothing about semantics. One tuple could mean anything in successive states of the database or even stand for multiple things in the same state. I can't imagine saying to database users "Use updates if you mean that this particular individual already existed. Use other operators if you mean something else." I for one wouldn't want to have anything to do with such a DBMS!
> I think I indicated that the result may be the same. If I want a new > Ferrari, I can buy one or I can steal one. In both cases the result is > the same--I would possess a new Ferrari--but how I ultimately arrived at > that result is significant, because in the one case I would be in debt up > to my eyeballs, but in the other case I would be on the run from the law. Yes but you are totally failing to explain HOW or WHERE that information could be exposed. If it does not exist as values within relations then it does not exist in an RDBMS. If it does exist as values within relations then it can be supported equivalently using an assignment (or at least using multiple assignment).
 Signature David Portas
Anith Sen - 19 Dec 2007 21:04 GMT >> I can't imagine saying to database users "Use updates if you mean that >> this particular individual already existed. Use other operators if you >> mean something else." I for one wouldn't want to have anything to do with >> such a DBMS! I am quite sure Brian had been told about the distinctions between external and internal predicates and how the distinction matters significantly while mapping a business representation to a logical model. For whatever reasons, he chooses not to heed. Fabian has an article somewhere in his site namely, "The desirable and the possible" that clarifies the distinction pretty well.
 Signature Anith
Brian Selzer - 20 Dec 2007 15:03 GMT >> If assignment is primitive, then it cannot necessarily be determined >> whether the properties of an individual that existed before an assignment [quoted text clipped - 5 lines] > to me you think that is not allowed or invalid in some way. I think I'm > not the only one who will take some convincing. The problem is that there are times when they can be answered and times when they can't. If that which identifies an individual can be different at different times, then simply comparing the relation value before the assingment with the one after will not work. For example, the number you pick at the pharmacy today may be different than the one you picked last week. It all boils down to key stability, and consequently whether or not the comparison is valid depends upon an arbitrary choice made by the database designer. In my opinion, that is unacceptable. Now if update is primitive, then key stability is no longer a factor because both the old values and the new values for each affected tuple are available, and therfore a tuple-by-tuple comparison can be carried out.
>> If, on the other hand, update is primitive, then there is no doubt. Since >> an update specifically targets information about what already exists, and [quoted text clipped - 9 lines] > existed. Use other operators if you mean something else." I for one > wouldn't want to have anything to do with such a DBMS! I don't see how it is attributing real world meaning to expect users to issue a DELETE to remove information about something that no longer exists, an UPDATE to adjust information about something that is different in appearance, or an INSERT to supply information about something that just came into existence. It is the user that can see what exists, what doesn't and what looks different; the system must be told. Is existence part of the external predicate? The domain closure assumption states that the only individuals that exist are those that are represented in the database.
Key values are supposed to map to individuals in the micro-world that the database represents. While it is true that a tuple can mean anything in successive states or even stand for multiple things, it is clear that since each tuple contains a key value, there is a mapping between that tuple and at least one individual in the micro-world. It is not the individual that is important here, but the fact that every tuple exhibits the same behavior, that is, under any interpretation, each key value exemplifies an individual. It is that behavior that supports the use of DELETE, UPDATE and INSERT in the manner described above.
>> I think I indicated that the result may be the same. If I want a new >> Ferrari, I can buy one or I can steal one. In both cases the result is [quoted text clipped - 7 lines] > then it can be supported equivalently using an assignment (or at least > using multiple assignment). The information is exposed in a FOR EACH ROW trigger. I think it could also be exposed in a manner that does not involve iteration.
David Cressey - 19 Dec 2007 11:59 GMT > You're right about DELETE and INSERT, but not about UPDATE. If you look at > UPDATE in TTM (pps. 112-113), you can see what happens: [quoted text clipped - 19 lines] > result may be the same, but how that result was arrived at is lost in > translation, and therefore cannot be verified. It seems to me that, in many discussions over the past few months, you have been emphasizing the difference between DELETE followed by INSERT on the one hand, and UPDATE on the other, even when the final result of both is a database in the same state.
The question arises whether the database communicates any data other than the data contained in its state. I'm going to pull a quote from Codd's 1970 paper. Even though this quote is about consistency, rather than identity, I want to draw your attention to the wording.
"It is important to note that consistency as defined above is a property of the instantaneous state of a data bank, and is independent of how that state came about. Thus, in particular, there is no distinction made on the basis of whether a user generated an inconsistency due to an act of omission or an act of commission. "
It seems to me that all of us who have been building or working with relational databases, regardless of whether we use SQL as an interface, or some tool that might be superior to SQL, have been treating database state as independent of history in exactly the above sense. Not only when we deal with consistency, but also with regard to entity identity. I don't know of any revision made by D&D or others to that perspective.
The languages described in this newsgroup differ from SQL in at least one important respect. SQL distinguishes between actions and transactions. For reasons that they themselves have outlined, D&D have come up with a language in which every transaction can be expressed as a single action. I can see the advantages of such a pattern, even without any prectical experience in such a language. But even with this difference, the difference between history and state is just about identical between SQL and the relational language.
It seems to me that, if you are going to assert that the history of a database is part of the interpretation of its content, it's going to be up to you to explain how a database can serve up any relevant portions of its own history as data.
If it isn't data, we don't know it.
I think that until this matter gets resolved somehow, there is going to be a continuing disconnect between you and a lot of the regulars in c.d.t.
Frank Hamersley - 22 Dec 2007 11:07 GMT > "Brian Selzer" <brian@selzer-software.com> wrote in message [..]
> The question arises whether the database communicates any data other than > the data contained in its state. I'm going to pull a quote from Codd's 1970 [quoted text clipped - 7 lines] > whether a user generated an inconsistency due to an act of > omission or an act of commission. " Interesting - I had always assumed that Codds use of "time varying" had implied temporality but the above quote must weaken that view.
[..]
> It seems to me that, if you are going to assert that the history of a > database is part of the interpretation of its content, it's going to be up > to you to explain how a database can serve up any relevant portions of its > own history as data. Many have tried! AFAIK all have failed.
Cheers Frank.
David Cressey - 22 Dec 2007 13:17 GMT > > "Brian Selzer" <brian@selzer-software.com> wrote in message > [..] [quoted text clipped - 13 lines] > Interesting - I had always assumed that Codds use of "time varying" had > implied temporality but the above quote must weaken that view. It's my understanding, from reading in here, that "time varying relation" was basically the same thing as "relational variable", which in turn gets abbreviated to "relvar".
It's my understanding that a consistent view of a database involves everything written by prior transactions and nothing written by concurrent transactions, let alone future transactions. Concurrency is, of course, more complicated than this, but this is sufficient for this discussion.
I have preferred to think of the database data structures as "containers" rather than "variables". The contents of containers are presumably time varying, subject to the above comment on transactions. I suppose you could have a WORM container that, once written, remains constant, but I never thought of that until just now.
The "container" terminology makes for a nice distinction between container based addressing and content based addressing. This in turn leads to the distinction between the graph model of data and the relational model of data. But, AFAIK, it's just terminology. If there's any deep thought here, it isn't my thought.
paul c - 22 Dec 2007 16:12 GMT ...
> It's my understanding that a consistent view of a database involves > everything written by prior transactions and nothing written by concurrent > transactions, let alone future transactions. Concurrency is, of course, > more complicated than this, but this is sufficient for this discussion. > ... It certainly does get complicated, IMHO that's because of the conventional locking implementations which for the average app programmer, obscure the essential problem. Often those implementations force the changing of application requirements when I think concurrency should always be part of the requirement. Logically, I don't see why a transaction need ever be concerned with other concurrent transactions although I admit such an implementation might not be considered "optimum" as far as performance is concerned. For example, if "time" is taken out of the picture by a suitable language, one could view, say, a bank account balance update as involving only the data the transaction has read plus the changes and these could all be expressed in the form of relations. A dbms could view the previously read data supplied by an updating transaction merely as constraints or assertions if you like. In this way, such a dbms would not have a distinguishable concurrency component, eg., a conventional lock manager. I'm not suggesting this should be implemented but to me it seems a kind of canonical view of the concurrency problem and possibly a good one for allowing people to introduce concurrency notions into an application design as opposed to leaving it to the vagaries of a particular implementation.
To some extent, I think the "optimistic" implementations mimic this, but my complaint is that their mechanisms are usually out of the hands of the app programmer.
I don't know enough about functional languages to say whether they could support this "out-of-the-box", but the D&D "comma" operator is a step in the direction of having a notation that allows "before" and "after" versions of a relvar. One might prefer to think of a time sequence when dealing with such relvars, but logically a "timeless" engine would only need to support the notion of replacement, not "before" and "after".
Just my twenty-five cents.
David Cressey - 23 Dec 2007 00:50 GMT > ... > > It's my understanding that a consistent view of a database involves [quoted text clipped - 22 lines] > introduce concurrency notions into an application design as opposed to > leaving it to the vagaries of a particular implementation. I was spoiled, back in the late 80s and early 90s, by working with DEC Rdb/VMS. The locking mechanism was fairly conservative, but the app programmer was very much freed from the responsibilities for figuring out the things you describe, at least in most circumstances.
What programmers did have to program for was deadlock errors, and to be able to rollback out of the tranaction and start it again. At the time, that seemed a small price to pay.
Rdb/VMS had two concurrency mechanisms, one using lock manager, and the other using multiple versions called "snapshots". The snapshot mechanism was mainly for providing a read only transaction with a consistent view of the database while not blocking read write transactions.
I don't have that much experience with the other implementations, but it seems as though some of them place more of a burden on the programmer.
> To some extent, I think the "optimistic" implementations mimic this, but > my complaint is that their mechanisms are usually out of the hands of > the app programmer. And that's a good thing, except in exceptional circumstances.
paul c - 23 Dec 2007 01:41 GMT >> ... >>> It's my understanding that a consistent view of a database involves [quoted text clipped - 47 lines] > > And that's a good thing, except in exceptional circumstances. In those days, most Cics and some IMS programmmers knew the four necessary conditions for deadlock and avoided at least one of them in their apps. Nowadays that's considered high science. Jim Gray was a smart guy and I hope he's still alive but the fact remains that it seems that he and other concurrency experts chose to ignore the IP. I think Bob B said something quite profound recently, something like "you can't manage data without data". A corollary might be that only mystics can manage without data. Normalization theory started with the IP and caught a following, no coincidence if you ask me. The concurrency theorists carried on with the IMS and other physical ways of thinking and never seemed to get the idea that there might be a logical concurrency model based on the IP (at least as far as I know).
Bob Badour - 23 Dec 2007 03:25 GMT >>> ... >>> [quoted text clipped - 68 lines] > and never seemed to get the idea that there might be a logical > concurrency model based on the IP (at least as far as I know). IP? Intellectual property? Internet protocol? IMS programmer? ??
paul c - 23 Dec 2007 15:16 GMT ...
> IP? Intellectual property? Internet protocol? IMS programmer? ?? Information principle, although it wouldn't surprise me if somebody usurped the term intellectual property to stand for a mystical attribute or two!
Bob Badour - 23 Dec 2007 15:43 GMT > ... > [quoted text clipped - 3 lines] > usurped the term intellectual property to stand for a mystical attribute > or two! I am not sure Gray's stuff is incompatible with the information principle. It seems orthogonal to me.
paul c - 23 Dec 2007 17:05 GMT ...
> I am not sure Gray's stuff is incompatible with the information > principle. It seems orthogonal to me. I wouldn't argue with that but the distinction I'd prefer to think of as more useful is that concurrency theory viewed from the perspective of an rdbms operates at a physical level, often, maybe always, using artifacts such as "transaction id's, session id's" and so forth, that are hidden from apps.
My complaint about conventional concurrency theory is that (as far as I know) it has never tried to express how a purely logical view that is visible to an application can achieve the same effects as a typical concurrency or lock manager component. I think one likely reason for this is that the concurrency theorists take physical persistence as their starting point. Given that, it's not surprising they would ignore the possibility of a suitable app language using a relational algebra to prototype the same effects.
While implementing this might seem to deny one of Codd's goals, that of putting as much common logic in the rdbms rather than in app code, I also note that for the larger apps I've seen, there never failed to be some requirements compromise or other due to the concurrency strategy imposed by various dbms's.
best wishes, p
David BL - 24 Dec 2007 01:02 GMT > ... > [quoted text clipped - 21 lines] > some requirements compromise or other due to the concurrency strategy > imposed by various dbms's. Operational Transform (OT) tries to do exactly what you're alluding to. I have been investigating it for years now and can say that the mathematics is complex and very interesting. The basic idea is to think of changes to data structures as operations that can be generated and executed by different processes independently (and concurrently) on the same state without any locking, and the operations can be transformed as required to achieve the same logical effect when they are post-applied in different orders by the processes. This holds the promise of allowing for replicated databases which support local edits without being exposed to network latency or outages, and for asynchronous merging of changes in a similar fashion to CVS. Unfortunately, and not surprisingly there is a limited extent to which OT can preserve the original logical intention of the operations due to conflicts.
I would be hesitant to talk about any alternatives to Gray's approach with Badour. He considers it sacrilegious. I was plonked after only a brief discussion when I first posted to cdt.
paul c - 24 Dec 2007 03:17 GMT ...
> Operational Transform (OT) tries to do exactly what you're alluding > to. I have been investigating it for years now and can say that the [quoted text clipped - 11 lines] > intention of the operations due to conflicts. > ... I didn't have replication in mind, but I will try to peruse whatever I can find about "OT". However, until simpler approaches have been studied more thoroughly, I'll stick to my suspicion that the recognized theorists all started out on the wrong foot.
My attitude about concurrency is likely much simpler in principle, if not in practice. Basically, an "ideal" rdbms would ignore concurrency! All a "transaction" can "know" is its data, aka the hand it's been dealt and this is sufficient to eliminate what some other transactions might be occupied with. If it regurgitates the queries the designer decides are pertinent to the transaction, at the "time" of update, along with the expected query results, and those match, then the rdbms could apply the transaction's expected updates, in perfect safety.
If a concurrent mechanism in a particular dbms allows a transaction to behave according to data that is not "known" to its own app code, obviously no other "transaction" should be allowed to see such data and no update should be permitted in the first place. I vaguely recall that Oracle was one of the first to have such a mechanism but the implementation cost seemed excessive to me, ie., at least for high-volume oltp-type transactions, the apps could have coded the same things much cheaper, given a suitable "server".
(BTW, I think your deleted comment is unnecessary and irrelevant as far as the topic is concerned. Bob B has shown himself to be quite capable of understanding such physical issues as replication which are in fact much more elementary than his usual themes, which he is also quite capable of dealing of. Also, when it comes to synchronization, it is amazing to me how hard it seems to be for experts to see how simple and massive a problem it can be, eg., involving 2-phase commit on a large scale, but I suspect not for an layman outsider once the jargon has been explained. Another thing concurrency theorists don't seem to have considered much is the possibility of re-combining different relations from different db's to accomplish a limited number of desired effects, tactics as simple as storing details in one location and summaries in another - if the two don't balance then the collective relations are rejected, perhaps days later. But that would require them to work at the logical, not the physical level, which most of them would find impossible, counter to their training and other biases.)
David BL - 24 Dec 2007 05:59 GMT > ... > [quoted text clipped - 18 lines] > studied more thoroughly, I'll stick to my suspicion that the recognized > theorists all started out on the wrong foot. The literature on OT focuses on the real time collaborative editing of text. IMO the published solutions tend to be incorrect or impractical. I'm not aware of anyone else researching the use of OT for databases.
Wikipedia provides a reasonable overview
http://en.wikipedia.org/wiki/Operational_transformation
> My attitude about concurrency is likely much simpler in principle, if > not in practice. Basically, an "ideal" rdbms would ignore concurrency! [quoted text clipped - 4 lines] > with the expected query results, and those match, then the rdbms could > apply the transaction's expected updates, in perfect safety. Isn't that just optimistic concurrency control? But what happens when the results don't match?
> If a concurrent mechanism in a particular dbms allows a transaction to > behave according to data that is not "known" to its own app code, [quoted text clipped - 4 lines] > high-volume oltp-type transactions, the apps could have coded the same > things much cheaper, given a suitable "server". I don't think I follow you. Are you saying that an application must not perform an update following a dirty read? The problem is not detecting it, but what to do about it. Depending on the application rejection is not always practical, even if you can live with application code that explicitly (or even implicitly) sits in a re-try loop.
I can only see two ways to avoid rejection:
1) Locking - to serialise transactions.
2) Compensation (eg using OT).
Both have their pros and cons.
Is the comment about Oracle referring to Multi-Version Concurrency Control (MVCC)? Although very useful, MVCC only addresses the important case of long running read only queries.
> (BTW, I think your deleted comment is unnecessary and irrelevant as far > as the topic is concerned. Bob B has shown himself to be quite capable > of understanding such physical issues as replication which are in fact > much more elementary than his usual themes, which he is also quite > capable of dealing of. My comment didn't concern Bob's intelligence or knowledge (which are both impressive).
> Also, when it comes to synchronization, it is > amazing to me how hard it seems to be for experts to see how simple and > massive a problem it can be, eg., involving 2-phase commit on a large > scale, but I suspect not for an layman outsider once the jargon has been > explained. I've never thought much of multiphase commit protocols. Instead I'm a fan of persistent message queues. I was actually intending to post to cdt to see if anyone could provide an example mandating a multiphase commit protocol (ie where no solution exists using persistent message queues).
> Another thing concurrency theorists don't seem to have > considered much is the possibility of re-combining different relations [quoted text clipped - 4 lines] > the logical, not the physical level, which most of them would find > impossible, counter to their training and other biases.) I think storing asynchronous summaries in a different location is rather straightforward because it wouldn't be an independent source of updates. A message queue can be used to bring the summaries DB up to date. A persistent message sequence number (stored in each DB) tells you whether the summary DB is synchronised with the details DB, or could allow a thread to block until the summaries have been synchronised, preferably at the start of a transaction before it has locked any resources.
Sampo Syreeni - 24 Dec 2007 15:30 GMT > The literature on OT focuses on the real time collaborative editing of > text. IMO the published solutions tend to be incorrect or impractical. > I'm not aware of anyone else researching the use of OT for databases. There is a long research tradition on semantic and multilevel concurrency control. As I've said before, OT seems to be a special case of those, in that all the operations on the underlying abstract datatype are engineered to be either commutative or to have easily expressed compensations. Transaction processing monitors routinely apply the more general theory, which accounts for much of the performance increase such middleware can offer.
> Is the comment about Oracle referring to Multi-Version Concurrency > Control (MVCC)? Although very useful, MVCC only addresses the > important case of long running read only queries. Oracle's implementation has trouble even with those, because it's quite difficult to guarantee that the version data necessary in the later stages of the transaction do not accidentally get overwritten over the course of execution.
> I've never thought much of multiphase commit protocols. Instead I'm a > fan of persistent message queues. I don't quite see them as solving the same problem. If your queue is such that all of the updates posted in it are always applied, then in a distributed environment it would seem that you need something like a multiphase commit to get agreement on what to insert into the queue in the first place. Or if some of the queued transactions can be rolled back, then you need distributed agreement when the queued transaction is finally executed. Finally, it obviously solves the problem if the transactions are serialized before insertion into the queue, but then we lose in concurrency and the queue easily becomes a central hot spot.
So, could you perhaps elaborate a bit?
> A message queue can be used to bring the summaries DB up to date. A > persistent message sequence number (stored in each DB) tells you > whether the summary DB is synchronised with the details DB, or could > allow a thread to block until the summaries have been synchronised, > preferably at the start of a transaction before it has locked any > resources. That's how most reasonable report databases/warehouses connect to their respective OLTP sources, at least.
 Signature Sampo Syreeni, aka decoy - mailto:decoy@iki.fi, tel:+358-50-5756111 student/math+cs/helsinki university, http://www.iki.fi/~decoy/front openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
David BL - 26 Dec 2007 00:28 GMT > > I've never thought much of multiphase commit protocols. Instead I'm a > > fan of persistent message queues. [quoted text clipped - 10 lines] > > So, could you perhaps elaborate a bit? Consider the following hypothetical example: there are two geographically separated databases storing account balances. There is no replication - each DB is recording independent accounts.
Suppose we want a transfer between accounts needing a distributed transaction. The two phase commit protocol will involve a "prepare to commit" phase, and each DB will need to lock the relevant account balances. The locks aren't released until the participants acknowledge the commit message (assuming the presumed rollback protocol). Network failures could make this take a very long time. ie network failure can upset the autonomy of the databases. How bad is that!
Instead consider that we avoid TPC entirely, and use persistent message queues...
Suppose we want to transfer from account balance Ax on machine X into account balance Ay on machine Y.
Let X have a persistent message queue Qx that records one way messages to be sent from X to Y. Each message represents a deposit to be performed on an account on Y. Y records a persistent sequence number Nx for the number of messages it has processed from Qx. A middleware layer allows Y to pump messages from Qx and it uses Nx to ensure it processes each message exactly once.
Similarly Y has Qy for messages to be sent from Y to X and X records sequence number Ny.
A successful transfer involves the following steps
1. On X, a local transaction does the withdrawal on Ax and pushes message onto Qx that represents a deposit that needs to be performed on Ay. Ax and Qx are updated atomically on X.
2. Y reads message from Qx then opens a transaction that atomically updates Ay and Nx.
Now consider that Y needs to cancel the transfer. Eg this could occur because by the time Y processes the deposit message the relevant account has been closed. Then we simply introduce a message to cancel the transfer, which in a fashion can be compared to a presumed commit TPC. In fact a cancel of a transfer is nothing other than a deposit message to undo the original withdrawal.
Note that when we account for the outstanding (ie not yet processed by receiver) deposit messages, the total money in the system is conserved by the local transactions. Note as well that local transactions are performed without any network I/O whatsoever.
The durability requirement on these transactions can be relaxed. For example when they commit they can release locks before flushing the log, making the locks less sensitive to file I/O. Accounts are locked for extremely short periods. It is a vastly superior approach to TPC.
IMO the whole idea of a "distributed transaction" is disgusting. It seems tragic for a system to use them if they can be avoided. I wonder if they always can be (when one considers the "bigger picture" and one isn't constrained by an existing legacy system).
Sampo Syreeni - 26 Dec 2007 17:20 GMT > A middleware layer allows Y to pump messages from Qx and it uses Nx to > ensure it processes each message exactly once. Essentially what you're doing here is using a timestamping mechanism and the acknowledgements from Y to implement a rudimentary, distributed agreement protocol. It is used to assure that the message gets delivered once, and only once. If the transaction touched more than two databases, you would have to get agreement from all of them because failure to apply the update against Y would imply that the effects against Z have to be cancelled as well. That is, you haven't really gotten rid of the agreement protocol, but only gone to an optimistic version of it, and perhaps in the process even restricted it to a two database context only.
> On X, a local transaction does the withdrawal on Ax and pushes message > onto Qx that represents a deposit that needs to be performed on Ay. > [...] Now consider that Y needs to cancel the transfer. Eg this could > occur because by the time Y processes the deposit message the relevant > account has been closed. [...] In fact a cancel of a transfer is > nothing other than a deposit message to undo the original withdrawal. In the literature this sort of thing is called an open nested transaction, and the "cancel" is called a compensating transaction/compensation. The construct is used precisely for the reasons you cite, i.e. for less locking interference, higher concurrency and general decoupling, but it also exacts a clear price. Under it consistency cannot be guaranteed unless a) the only way to access the database is through b) operations with known semantics, c) possessing clearly defined undo/"compensation" operations, d) which commute suitably amongst each other, and e) which have real world semantics which allow us to conclude that this new, lower level of distributed consistency is appropriate. If general (usually called "native") transactions are allowed to proceed in parallel with the semantic primitives, f) usual locking semantics still need to be available and to be applied to control interactions between the native (lower level) and the semantic (higher level) actions.
In fact, that you included account cancellations in the set of permissible operations already break c) and d). Consider a situation where some money is transferred from account X to account Y, Y gets cancelled before the message is processed, and then X gets cancelled before the reverse deposit is. Now the money is lost and no compensating action is available. You could perhaps augment the system with zombie accounts, cascading aborts and real world compensating actions (i.e. trace the money back to where it came from, even across multiple dead accounts, and mail it to the original owner), but that is complicated and once you move outside of the banking field, in general you can't guarantee that all of the necessary options are available.
As an example, consider a claims processing database. We want each claim to always be allocated to a handler, and no more than say ten claims to be allocated to a single handler at a time. If all handlers are fully occupied, there might even be a legal obligation not to accept new claims. If X now assigns all of his claims to Y and accepts ten new claims for processing, no matter what we do from there on the real world legal mistake of accepting too many claims for processing in toto cannot be avoided, or rectified once it's occurred. Under global consistency in the normal sense, that couldn't have happened because assignment would only have been possible if Y were actually able to accept the extra load then and there, atomically.
What allows you to get even as far as you did with message queues with the account transaction example is the way money behaves in a real economy, that is, semantics. For example, you can rely on money being fungible and divisible, people not wanting throw to it away (unlike burdens such as a claim to be processed), other people always wanting to accept more of it, and for everybody to go to great pains (like keeping useless accounts around for a time) in order not to lose any of it. Obviously such constraints do not hold for all of the applications databases can and do have.
> Note that when we account for the outstanding (ie not yet processed by > receiver) deposit messages, the total money in the system is conserved > by the local transactions. The problem is that this sort of accounting involves hidden state, which allows us to break global integrity constraints. The user visible state of the system is no longer consistent; my money temporarily disappeared. Of course this is what happens with current interbank transfers, and there's nothing wrong with it as long as the real world semantics make sense. But the usual notion of distributed consistency employed by database researchers is much more ambitious than that. It also tries to support all the rest of the applications which *don't* have anything to do with accounting or its particular characteristics.
> The durability requirement on these transactions can be relaxed. For > example when they commit they can release locks before flushing the > log, making the locks less sensitive to file I/O. Even this is not really true: the new state of the queue still needs to be forced into stable storage before the transaction can safely return. Otherwise a crash before the message insertion is made persistent would cause an inconsistency between the states the client and the DBMS see.
> I wonder if they always can be (when one considers the "bigger > picture" and one isn't constrained by an existing legacy system). Hmm. I would not easily criticise the modern work on consistency and concurrency based on a single example -- accounting -- that in itself is several millennia of age.
 Signature Sampo Syreeni, aka decoy - mailto:decoy@iki.fi, tel:+358-50-5756111 student/math+cs/helsinki university, http://www.iki.fi/~decoy/front openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
David BL - 27 Dec 2007 01:57 GMT > > A middleware layer allows Y to pump messages from Qx and it uses Nx to > > ensure it processes each message exactly once. [quoted text clipped - 9 lines] > perhaps in the process even restricted it to a two database context > only. But can you provide a realistic example, and offer a "proof" that message queues are inadequate? Does such a demonstration exist in the literature?
> > On X, a local transaction does the withdrawal on Ax and pushes message > > onto Qx that represents a deposit that needs to be performed on Ay. [quoted text clipped - 25 lines] > before the reverse deposit is. Now the money is lost and no compensating > action is available. When you say "Y is cancelled" do you mean an account on machine Y is being closed?
Closing an account doesn't happen synchronously. Consider that an account on Y is being closed.
1. A local transaction on Y marks the account as closed and "close account" messages are posted to its peers (by pushing messages onto the local queues)
2. Given that peers only find out asynchronously that the account is being closed there will be some period of time for which deposits continue to be made into the account. Y must continue pumping messages from its peers. Using a local transaction on Y, each deposit into the account marked as closed must be cancelled by pushing a compensating deposit onto the queue for the relevant peer.
3. It is assumed that each peer (in a local transaction) eventually reads the message that the account is being closed. This stops it from performing any further deposits into the account. In the same local transaction the peer posts an acknowledge message onto its local queue that is pumped by Y.
4. Eventually Y reads all the acknowledge messages from all its peers and determines that the account is now properly closed, because by ordered delivery of messages in the queues, it is safe to assume there will be no outstanding deposit messages for that account in the system.
> You could perhaps augment the system with zombie > accounts, cascading aborts and real world compensating actions (i.e. > trace the money back to where it came from, even across multiple dead > accounts, and mail it to the original owner), but that is complicated > and once you move outside of the banking field, in general you can't > guarantee that all of the necessary options are available. I think the above steps allow any number of accounts to be closed, and no money will be lost.
> As an example, consider a claims processing database. We want each claim > to always be allocated to a handler, and no more than say ten claims to [quoted text clipped - 4 lines] > legal mistake of accepting too many claims for processing in toto cannot > be avoided, or rectified once it's occurred. I don't agree. There should be a concept of a claim that is not currently allocated to a handler, and the claim is passed around the network until it is assigned to a handler. Allocation of a claim to a handler is a local decision using a local transaction allowing the legal obligation to be enforced.
> Under global consistency in > the normal sense, that couldn't have happened because assignment would > only have been possible if Y |
|