Database Forum / General DB Topics / General DB Topics / July 2004
database schema (PRI, UNI keys)
|
|
Thread rating:  |
Titus Barik - 05 Jul 2004 22:22 GMT Hi all,
Looking at the Wordpress database schema I notice that their wp_users table is defined as follows (snippet):
Field Type Key Default Extra ID int(10) unsigned PRI NULL auto_increment user_login varchar(20) UNI
I don't really understand why would you want to do something like this. If the user_login is unique, why not just make the user_login the primary key? Is this for efficiency reasons or am I missing something more fundamental?
Thanks in advance,
 Signature titus barik (titus@barik.net) | www.barik.net | aim: tbarik godfather Armand concierges Presbyterian fevered physical thrower's accomplishment Manuela Uriel's Meighen's versification
--CELKO-- - 06 Jul 2004 04:33 GMT >> I don't really understand why would you want to do something like this. If the user_login is unique, why not just make the user_login the primary key? <<
This is a common design mistake. The use of any non-relational,proprietary auto-increment as a key is an attempt to imitate a sequential file system and it destroys data integrity.
A long key is not always a bad thing for performance. For example, if I use (city, state) as my key, I get a free index on just (city). I can also add extra columns to the key to make it a super-key when such a super-key gives me a covering index (i.e. an index which contains all of the columns required for a query, so that the base table does not have to be accessed at all).
Faking a sequential file's positional record number, so I can reference the physical storage location? Sure, if I want to lose all the advantages of an abstract data model, SQL set oriented programming, carry extra data and destroy the portability of code!
More and more programmers who have absolutely no database training are being told to design a database. They are using GUIDs, IDENTITY, ROWID and other proprietary auto-numbering "features" in SQL products to imitate either a record number (sequential file system mindset) or OID (OO mindset) since they don't know anything else.
Experienced database designers tend toward intelligent keys they find in industry standard codes, such as UPC, VIN, GTIN, ISBN, etc. They know that they need to verify the data against the reality they are modeling. A trusted external source is a good thing to have.
The auto-increment column is a holdover from the early programming languages which were very close to the hardware. For example, the fields (not columns; big difference) in a COBOL or FORTRAN program were assumed to be physically located in main storage in the order they were declared in the program. The languages have constructs using that model -- logical and physical implementations are practically one! The data has meaning BECAUSE of the program reading it (i.e. the same bits could be a character in one program and be an integer in another)
The early SQLs were based on existing file systems. The data was kept in physically contiguous disk pages, in physically contiguous rows, made up of physically contiguous columns. In short, just like a deck of punch cards or a magnetic tape. Most programmer still carry that mental model, which is why I keep doing that rant about file vs. table, row vs. record and column vs. field.
But physically contiguous storage is only one way of building a relational database and it is not the best one. The basic idea of a relational database is that user is not supposed to know *how* or *where* things are stored at all, much less write code that depends on the particular physical representation in a particular release of a particular product on particular hardware at a particular time.
One of the biggest errors is the auto-increment column (actually property, not a column at all) in the Sybase/SQL Server family. People actually program with this "feature" and even use it as the primary key for the table! Now, let's go into painful details as to why this thing is bad.
The first practical consideration is that auto-increment is proprietary and non-portable, so you know that you will have maintenance problems when you change releases or port your system to other products. Newbies actually think they will never port code! Perhaps they only work for companies that are failing and will be gone. Perhaps their code is such crap nobody else want their application.
But let's look at the logical problems. First try to create a table with two columns and try to make them both auto-increment. If you cannot declare more than one column to be of a certain data type, then that thing is not a datatype at all, by definition. It is a property which belongs to the PHYSICAL table, not the LOGICAL data in the table.
Next, create a table with one column and make it an auto-increment Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition.
Finally create a simple table with one auto-increment column and a few other columns. Use a few statements like
INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1'); INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2'); INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
To put a few rows into the table and notice that the auto-increment sequentially numbered them in the order they were presented. If you delete a row, the gap in the sequence is not filled in and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in pre-allocated sequential files in the 1950's, by the way. A utility program would then "pack" or "compress" the records that were flagged as deleted or unused to move the empty space to the physical end of the physical file.
But now use a statement with a query expression in it, like this:
INSERT INTO Foobar (a, b, c) SELECT x, y, z FROM Floob;
Since a query result is a table, and a table is a set which has no ordering, what should the auto-increment numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one do you pick? The answer has been to use whatever the *physical* order of the result set happened to be. That non-relational phrase "physical order" again!
But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order.
Can you explain from a logical model why the same rows in the second query get different auto-increment numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical.
Using auto-increment as a primary key is a sign that there is no data model, only an imitation of a sequential file system. Since this "magic, all-purpose, one-size-fits-all" pseudo-identifier exists only as a result of the physical state of a particular piece of hardware at a particular time as read by the current release of a particular database product, how do you verify that an entity has such a number in the reality you are modeling?
You will see newbies who design tables like this:
CREATE Drivers (driver_id auto-increment NOT NULL PRIMARY KEY, ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn), vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));
Now input data and submit the same row a thousand times, a million times. Your data integrity is trashed. The natural key was this:
CREATE Drivers (ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn), vin CHAR(17) NOT NULL REFERENCES Motorpool(vin), PRIMARY KEY (ssn, vin));
To demonstrate, here is a typical idiot newbie schema -- you will them all over the news groups. I call them "idiots" because they always name the auto-increment property column "id" in EVERY table. They don't understand basic data modeling -- one and only name for an attribute. About half the time they don't use any DRI, but let's show it.
CREATE TABLE MotorPool (id auto-increment NOT NULL PRIMARY KEY, personnel_id INTEGER NOT NULL REFERENCES Personnel(id), vehicle_id INTEGER NOT NULL REFERENCES Vehicle(id));
CREATE TABLE Personnel (id auto-increment NOT NULL PRIMARY KEY, ssn CHAR(9) NOT NULL UNIQUE, ..);
CREATE TABLE Vehicles (id auto-increment NOT NULL PRIMARY KEY, vin CHAR(17) NOT NULL UNIQUE, ..);
Now change the natural key in Personnel:
UPDATE Personnel SET ssn = '666666666' WHERE ssn = '000000000';
Nothing happened in Motorpool, did it? You can do the same thing with a VIN.
Now you are REALLY thinking about relations and keys instead of 1950's sequential record numbering. Adding an IDENTITY column to either of these tables as a candidate key would be dangerously redundant; one query uses the auto-increment and another uses the real key, and like a man with two watches, you are never sure what time it is.
Finally, an appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.
This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.
Codd also wrote the following:
"There are three difficulties in employing user-controlled keys as permanent surrogates for entities.
(1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).
(2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.
(3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree).
These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).
References
Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434
Nick Landsberg - 06 Jul 2004 22:55 GMT A very impassioned and eloquent post, Mr. Celko!
I have a few things to add from my experiences when I was in the database world. Most of which reinforce your statements.
>>>I don't really understand why would you want to do something like > [quoted text clipped - 7 lines] > A long key is not always a bad thing for performance. For example, if > I use (city, state) as my key, I get a free index on just (city). Absolutely, if you assume B-tree (or other tree-like indices). It does not work with hash indices, but they tend to be vendor-specific implementations. A study of the most common access patterns can help determine whether it should be a hash index or a tree. The answer is "sometimes both (and let the DBMS optimizer decide which to use for any given query.)" It wastes space, sure. Disk space no longer seems to be a problem nowadays, but I will probably get rebuttals about that. (Some of them from myself. I like to argue with myself. Sometimes I even win the arguments :)
> ... I can also add extra columns to the key to make it a super-key when such > a super-key gives me a covering index (i.e. an index which contains > all of the columns required for a query, so that the base table does > not have to be accessed at all). That particular technique also may waste space, but it should be etched on the inner side of the eyelids of any physical (as opposed to logical) DBMS designer if performance is the primary consideration! Again, it works quite nicely with tree-like indices.
[ Minor Snip ]
> More and more programmers who have absolutely no database training are > being told to design a database. They are using GUIDs, IDENTITY, > ROWID and other proprietary auto-numbering "features" in SQL products > to imitate either a record number (sequential file system mindset) or > OID (OO mindset) since they don't know anything else. That first sentence of yours in the above probably sums it up in a nutshell. I have seen way too many database applications designed by people who were not trained properly. The result usually was that something suffered. (Most often performance, sometimes data integrity or consistenty.)
The typical reaction was to "blame the database." Switching from Oracle to Sybase or vice versa (often presented as a proposed solution), did absolutely nothing to solve the problem.
A discussion about logical schema vs. physical schema I will leave to another time, if it ever comes up :)
[REST SNIPPED]
NPL
 Signature "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch
Laconic2 - 07 Jul 2004 16:01 GMT I agree mostly with you, Nick and Joe.
It is certainly true that a large number of databases have been designed by persons who just barely understand how to fetch data from a database, and lack the minimal background necessary to distinguish between a good design, and one that is going to fail in practice.
Having said that, I'm going to take issue with the two of you on one aspect of this discussion.
The decision to use internal keys (surrogate keys) as identifiers rather than trusted external keys (natural keys) is often based on sloppy analysis, rather than sloppy design. That is, the decision that a given external key cannot be trusted is sometimes made in relatively complete ignorance of the environment into which the database and the application(s) are going to be placed. (There are cases where surrogate keys are appropriate, but that's another story).
The ignorance that result from sloppy analysis seems to be a badge of honor among the people who pride themselves on "rapid development" these days. Back in the stone age (my era), many projects bogged down in excessive analysis, and the rapid development reaction was perhaps inevitable. But superficial analysis can cost as much time as excessive analysis. The wasted time shows up as a consequence of bad design, but the bad design is really a consequence of sloppy analysis.
Again, this only one aspect. Even with adequate analysis, if the design is done by someone that doesn't know how to design, bed choices are almost certain.
AK - 07 Jul 2004 21:12 GMT > placed. (There are cases where surrogate keys are appropriate, but that's > another story). such as? could you please elaborate?
-P- - 22 Jul 2004 14:29 GMT > > placed. (There are cases where surrogate keys are appropriate, but that's > > another story). > > > such as? could you please elaborate? Here's one: Our software manages TV commercials for TV stations and cable networks. For example, Miller Brewing calls up WGN in Chicago and wants to run 6 Lite Beer commercials during the next Cubs game. We have a requirement (actually an FCC requirement) to uniquely identify each of the six "instances" of this single order - the spots must be serialized for accounting purposes. We use an autoincrement system-generated surrogate key to serialize them. While there are industry standard codes for the actual copy that will air (it's called an ISCI number) - each of the six instances carries the same ISCI. We can't use a date/timestamp either, because the scheduled airtime of each spot is not assigned at the time the order is placed. There IS NO industry standard identifier for this. ISBN's, SSNs, and other industry-standard identifiers are great, and should be used whenever available, but to eschew surrogate keys for ALL entities in EVERY datamodel shows a limited scope of knowledge.
 Signature Paul Horan Sr. Architect VCI Springfield, MA www.vcisolutions.com
Laconic2 - 22 Jul 2004 18:29 GMT While I agree with your general philosophy regarding surrogate keys, I don't think the example given below really illustrates the point.
A "surrogate" key, by definition is an artificial key that is used in place of a natural key. In the example you have given us, there is no natural key for the entity you need to identify (the "instance" of the ad). Therefore the creation of an artificial key, while correct, is not really a case of generating a surrogate for the natural key.
Rather it's generating a key, natural or artificial, where one is needed and none exists.
I hope I'm not stirring the pot too much here.
> > > placed. (There are cases where surrogate keys are appropriate, but that's > > > another story). [quoted text clipped - 11 lines] > other industry-standard identifiers are great, and should be used whenever available, but to eschew surrogate keys for > ALL entities in EVERY datamodel shows a limited scope of knowledge. -P- - 23 Jul 2004 00:03 GMT > While I agree with your general philosophy regarding surrogate keys, I > don't think the example given below really illustrates the point. [quoted text clipped - 9 lines] > > I hope I'm not stirring the pot too much here. No - that's a point well taken. With that definition in mind, we do actually use one surrogate key. It's the entity that describes the "structure" of a television program.
The natural key is: network smallint , // the channel the program is airing on schedule_date date , // self-explanatory schedule_type smallint , // the regular schedule, or an alternate log in case of a rainout? start_time integer , // the number of seconds past the station's "start of day" time
That combination of columns uniquely identifies every program segment and commercial break on a station log. There are child tables that have foreign keys back to this table. Rather than continue to propagate the concatenated key, we introduce a surrogate key "record_id" (yeah, yeah, I know... "Rows are not records, tables are not files...") that becomes the foreign key to the child tables. To illustrate, a 2-minute break in Seinfeld is one row in this table. However, we may place four :30 second commercials into that break (or two :60s, or one :60, one :30, and two :15s...) The associative entity between the "spots" (from the previous thread) and the "break" shown above is simply (spot_id, record_id) rather than (spot_id, {four more columns}).
 Signature Paul Horan Sr. Architect VCI Springfield, MA www.vcisolutions.com
Ed prochak - 23 Jul 2004 13:12 GMT > > While I agree with your general philosophy regarding surrogate keys, I > > don't think the example given below really illustrates the point. [quoted text clipped - 26 lines] > The associative entity between the "spots" (from the previous thread) and the "break" shown above is simply (spot_id, > record_id) rather than (spot_id, {four more columns}). I'm curiout about this question: What benefit do you see in using the Surrogate versus the concatenated primary key?
Back when disc space and CPU cycles were expensive, I could see a justification for having a smaller key. What you are doing is basically moving the index out of a hidden place in the DBMS and into full view as part of your tables. Queries aren't any faster. Only the time it takes to type the query is faster (one column versus four columns).
Then the risk of corrupt data now increases. And your example of the "spots" seems to bear that out. What stops you from "overbooking" 3minutes of spots in the 2minute time slot? (Yes you may argue that that's why you want the surrogate, you plan on occasionally breaking the rules! And if that meets the business requirement, then you may be going down the right path.)
Like I said, I'm just curious. Ed
-P- - 23 Jul 2004 19:18 GMT > I'm curiout about this question: What benefit do you see in using the > Surrogate versus the concatenated primary key? I only touched on it briefly, but there are several child tables "under" this table - including some at two more levels down. Propagating ALL that data (and the associated indices) felt "unwieldy". Plus, a typical operation in our system is the "sliding" of a program to a later timeframe. Seinfeld isn't running at 10pm anymore, it's running at 10:30. If the start_time column is part of the primary key, then the update must be cascaded down to all those dependent tables. With the current structure, the update only happens at this "root node".
> Back when disc space and CPU cycles were expensive, I could see a > justification for having a smaller key. What you are doing is > basically moving the index out of a hidden place in the DBMS and into > full view as part of your tables. Queries aren't any faster. Only the > time it takes to type the query is faster (one column versus four > columns). Queries on a 5-column concatenated index (made up of integers and date columns) aren't any faster than two integers? We have hundreds of thousands of these in a typical station database. Larger properties will easily exceed a million rows.
> Then the risk of corrupt data now increases. How? What types of corruption are we talking about?
> And your example of the > "spots" seems to bear that out. What stops you from "overbooking" > 3minutes of spots in the 2minute time slot? (Yes you may argue that > that's why you want the surrogate, you plan on occasionally breaking > the rules! And if that meets the business requirement, then you may be > going down the right path.) We do allow breaks to be oversold - it's on the users to ensure that the log is "airable" before its shipped off to Master Control.
> Like I said, I'm just curious. > Ed
 Signature Paul Horan Sr. Architect VCI Springfield, MA www.vcisolutions.com
Ed prochak - 26 Jul 2004 17:53 GMT > > I'm curiout about this question: What benefit do you see in using the > > Surrogate versus the concatenated primary key? [quoted text clipped - 4 lines] > the start_time column is part of the primary key, then the update must be cascaded down to all those dependent tables. > With the current structure, the update only happens at this "root node". The "unweildy" justification is weak, but I do see your advantage in the sliding of a program.
This propogation problem might also indicate that the normalization of the tables was somewhat incomplete. The program and it's timeslots are really different entities so that was an option at design time that may have been skipped. (ie instead of a program table with time as part of the PK, there might have been a program/show table and a scehdule table with a show/timeslot relation joining them. note this is just off the top of my head, it's not a detailed/deeply analyzed plan) Given your current design there are some advantages.
> > Back when disc space and CPU cycles were expensive, I could see a > > justification for having a smaller key. What you are doing is [quoted text clipped - 6 lines] > We have hundreds of thousands of these in a typical station database. Larger properties will easily exceed a million > rows. I'm speaking primarily from an ORACLE background, but I'd expect any modern DBMS to have fast lookups on the PK index, whether the index consists of 5 columns of mixed datatypes or 2 columns of both integers, even on millions of rows. If your DBMS has significant differences in those two cases, it might be time to consider a different DBMS.
> > Then the risk of corrupt data now increases. > > How? What types of corruption are we talking about? I mentioned in another post (in may have been on another thread) about a system that had addresses consisting of the house number and a street ID, storing the street names in another table. sometime along the way the ID's in the street table were corrupted. Needless to say, fixing that case was a real nightmare.
> > And your example of the > > "spots" seems to bear that out. What stops you from "overbooking" [quoted text clipped - 5 lines] > We do allow breaks to be oversold - it's on the users to ensure that the log is "airable" before its shipped off to > Master Control. So I guessed right about the business need. Hopefully, you provide the users some tools to deal with the process of resolving overbooking (maybe at least a report that identifies overlaps). Sounds like the system works, which is the bottom line.
Thanks for the feedback!
Ed
-P- - 26 Jul 2004 19:35 GMT "Ed prochak" <ed.prochak@magicinterface.com> wrote in message
> So I guessed right about the business need. Hopefully, you provide the > users some tools to deal with the process of resolving overbooking > (maybe at least a report that identifies overlaps). Sounds like the > system works, which is the bottom line. You should see this report... In addition to oversold/undersold breaks, it has to check for things like: - Spots with the same inventory category in the same break (two car dealership ads); - Spots from the same advertiser in the same break (Wrigley's doublemint vs. juicy fruit); - Spots placed in "inappropriate" programming (Bud lite ads in kid's cartoons); - Was there enough "Canadian" content per hour for our stations north of the border;
...and a host of other potential rules violations. But I digress...
 Signature Paul Horan Sr. Architect VCI Springfield, MA www.vcisolutions.com
Laconic2 - 26 Jul 2004 21:47 GMT > You should see this report... > In addition to oversold/undersold breaks, it has to check for things like: [quoted text clipped - 4 lines] > > ...and a host of other potential rules violations. But I digress... Dare I say that it might be time for someone who is good at report analysis to look at this report and see if its is being made to do more work than a single report ought to do?
It sounds like a "kitchen sink" report to me.
-P- - 26 Jul 2004 21:54 GMT <snip>
>someone who is good at report analysis You're talking to one. Thanks for your concern, though...
It's actually pretty well-designed. Each "section" of the report is a different query, fed into a PowerBuilder composite datawindow. The report runs top to bottom in about 20 seconds for a typical station.
 Signature Paul Horan Sr. Architect VCI Springfield, Ma www.vcisolutions.com
Laconic2 - 27 Jul 2004 14:23 GMT > It's actually pretty well-designed. Each "section" of the report is a different query, fed into a PowerBuilder > composite datawindow. The report runs top to bottom in about 20 seconds for a typical station. Fine. My concern was not how long it takes to run, but whether the information the report provides "belongs" in a single report. It sounds like you've made a conscious decision that it does. I won't second guess you.
From a database persepctive, the fact that each section is driven by a separate query means that, for all practical purposes, we could treat them as if they were a separate report.
Given that, could you restate your original question concerning database design? I don't think that databases designed for multiple queries are particularly unusual.
-P- - 27 Jul 2004 18:09 GMT > > It's actually pretty well-designed. Each "section" of the report is a > different query, fed into a PowerBuilder [quoted text clipped - 13 lines] > design? I don't think that databases designed for multiple queries are > particularly unusual. I really didn't have an original question... The debate was on the usefullness/appropriateness of surrogate keys over natural, concatenated keys. In defense of surrogate keys, I brought up an example from our software where used one, and found it effective. I'm certainly against their use on ALL tables, but this one example (replacing a 4 column natural key with a surrogate IDENTITY column, that the users never see or manipulate) demonstrated one possible use for them.
 Signature Paul Horan Sr. Architect VCI Springfield, MA www.vcisolutions.com
AK - 27 Jul 2004 15:13 GMT > I'm speaking primarily from an ORACLE background, but I'd expect any > modern DBMS to have fast lookups on the PK index, whether the index > consists of 5 columns of mixed datatypes or 2 columns of both > integers, even on millions of rows. If your DBMS has significant > differences in those two cases, it might be time to consider a > different DBMS. Over the past 10 years I've worked with DB2, Oracle, Sybase, and MS SQL Server. For all these RDBMS the difference between joining on 5 columns of mixed datatypes and joining on 2 integers will definitely be significant.
Why don't you set up some benchmarks and see for yourself.
David Best - 22 Jul 2004 22:44 GMT > > > placed. (There are cases where surrogate keys are appropriate, but that's > > > another story). [quoted text clipped - 11 lines] > other industry-standard identifiers are great, and should be used whenever available, but to eschew surrogate keys for > ALL entities in EVERY datamodel shows a limited scope of knowledge. So then it's not a surrogate; you are actually defining a natural key and acting in the role of the industry standards group for your company.
AK - 07 Jul 2004 03:18 GMT from where I sit, natural keys aren't perfect either. Primary keys aren't supposed to change, never, ever, are they? Regarding your own example, (city, state), once upon a time I lived in a city that got renamed. Having to deal with changing PK is a very common problem in database programming, correct? Regarding using SS# as a good natural primary key, it used to be true several years ago, but not anymore. There is a growing concern over exposing people's SSN's unless it is absolutely necessary. My current health insurance card does not have my SSN on it, because my employer has demanded that SSN be replaced with a meaningless generated indentifier, definitely a surrogate key. Speaking SQL, on customers' requests, some people had to run UPDATE SOME_TABLE SET SSN=0 against some of their tables and were forced to use surrogate keys. Those who assumed that SSN is guaranteed to be a good natural primary key forever got in really big trouble. Those "idiots" who weren't assuming that they know absolutely everything and can predict the future got away much much easier.
I'm sure lots of people could add to my list.
Finally, there are good decisions and quick decisions. Sometimes it is profitable to invest time and effort and make a good decision (choose a natural key), because good decisions usually last longer (though still not forever, I've given some examples above). however, sometimes it is profitable to make quick decisions (surrogate keys) and deliver working solutions ASAP. Identities and sequences provide a very convenient shortcut and help us in making quick decisions. In database programming, it always depends and there are no hard and fast rules.
Troels Arvin - 07 Jul 2004 12:08 GMT > Primary keys aren't > supposed to change, never, ever, are they? Who says that a primary key can never change?
> Regarding your own example, > (city, state), once upon a time I lived in a city that got renamed. Probably even more common, at least in Denmark, at the moment: cities fuse. And then, well, you have to deal with it one way or the other.
I think that Tom has a good point in that it may pay off to try to use natural or - at least - intelligent keys: In some DBMSes, it may require a bit of work when keys change, but it's of great importance that you are able to match real-World entities with data in your DBMS.
 Signature Greetings from Troels Arvin, Copenhagen, Denmark
AK - 08 Jul 2004 02:45 GMT Troels, let me be the Devil's advocate today. Obviously surrogate keys have their disadvantages, nobody argues with that. However, natural keys also have very pronounced disadvantages either. Let me give just one example.
Elaborating on (City, state) natural primary key, let's suppose we have dealt with cities getting renamed and mergeing with other cities. Let's discuss storage considerations. Let's create PAYMENTS(CITY, STATE, COUNTRY, ADDRESS, PMT_DATE, PMT_AMOUNT, PRIMARY KEY(CITY, STATE, COUNTRY, ADDRESS, PMT_DATE)). Let's just populate it with some data and see how much space the table and the index implementing PK use up. Then let's create another table PAYMENTS_SUROGATE(ADDRESS_ID, PMT_DATE, PMT_AMOUNT, PRIMARY KEY(ADDRESS_ID, PMT_DATE)), populate it and compare storage requirements. Most probably we'll need at least 50% less storage, with many queries running 50% faster. All right, that's the price we are ready to pay for the logical purity of our design. Let's suppose the top management are unaware of potential cost savings and the database goes into production as is.
Some time later the sales team wins a contract in Russia. Unfortunately, the system won't work in Russia, and the reason is simple: (CITY, STATE) is not supposed to be unique in Russia. (CITY, COUNTY, STATE) is unique. We don't have COUNTY in our database. now we need to choose between: 1. Creating a separate version for Russia 2. Adding COUNTY NOT NULL column to almost every table in the database. As a PK column, it has to be NOT NULL. MOst queries joining tables need to be changed. 3. Introducing a surrogate key, which would take about as much time as option 2
Going for option 2 does not guarantee that the situation won't repeat. We might later learn that some other combination of columns should be used as natural primary key in India, or Brazilia, or elsewhere.
I hope my point is clear enough: in many cases we have to develop applications without enough knowledge on the subject area. Errors in choosing a natural primary key may be far more expensive than the disadvantages of using surrogate keys. Performancewise, natural primary keys may also be not the best alternative.
What do you think?
Ed prochak - 08 Jul 2004 16:23 GMT > Troels, > let me be the Devil's advocate today. [quoted text clipped - 16 lines] > design. Let's suppose the top management are unaware of potential cost > savings and the database goes into production as is. One: I think your space savings estimates are exagerated. Two: I think your speed savings estimates are clearly inflated. Using the surrogate key means a join on every payment query. A good DBMS will reduce the overhead of accessing the second table, but it cannot reduce it to zero.
I am not one that advocates for bloated software and data, but in the absence of hard numbers and direct business requirements (e.g., the DB must run on a portable device like a GPS) this is a poor argument for surrogates. Disc prices continue to drop year by year so the storage costs are seldom the real issue. There is a constant struggle between higher levels of normalization (saves storage space) and higher performance (flatter, denormalized tables). Surrogate keys can be a false economy (making you think you saved a little on both).
> Some time later the sales team wins a contract in Russia. > Unfortunately, the system won't work in Russia, and the reason is [quoted text clipped - 7 lines] > 3. Introducing a surrogate key, which would take about as much time as > option 2 Hmm, doing it wrong (3) takes about as much time as doing it right (2), so what do we save again???
> Going for option 2 does not guarantee that the situation won't repeat. But the surrogate doesn't fair any better in this example so why do it wrong?
> We might later learn that some other combination of columns should be > used as natural primary key in India, or Brazilia, or elsewhere. > > I hope my point is clear enough: in many cases we have to develop > applications without enough knowledge on the subject area. That's the point and the ROOT CAUSE of the problem. Get someone on your team that knows the subject area and do it right the first time. (Well at least as close to right as possible, then include the possibility of change in the design.)
> ..... Errors in > choosing a natural primary key may be far more expensive than the > disadvantages of using surrogate keys. Performancewise, natural > primary keys may also be not the best alternative. > > What do you think? It is this conclusion that I find wrong. Having done database conversions, I found surrogate keys can lead to conversion disaters. One company tried to save space by replacing street names with a surrogate key, but somewhere back in time, the link table was corrupted. The net result was at conversion time, we got a feed with street addresses like 456 ZZXY. Sure saved a lot of space in the dump file, but made the client's life a living hell! We programmatically cleaned up as much as we could, but ultimately the client had to start calling customers to verify their address.
You might argue that you won't let that happen on YOUR system, but I'm sure that's what the programmers of that old system thought.
Are surrogates ALWAYS a bad idea? IMHO, they should be used, like dynamite, with EXTERME CAUTION! There are times when you have to blow up the mountain, but when you do, it is always messy.
Ed
AK - 22 Jul 2004 04:09 GMT > One: I think your space savings estimates are exagerated. > Two: I think your speed savings estimates are clearly inflated. Using "inflated" or not they are real. i created 2 schemas on the same tablespaces and loaded same production data into both
> the surrogate key means a join on every payment query. just plain not true
> surrogates. Disc prices continue to drop year by year so the storage > costs are seldom the real issue. There is a constant struggle between if you are speaking about adding 1 more 100Gb drive into a cheap intel box, I would completely agree. Think big - going from 30 to 50 Tb is major undertaking, and an expensive one.
> That's the point and the ROOT CAUSE of the problem. Get someone on > your team that knows the subject area and do it right the first time. > (Well at least as close to right as possible, then include the > possibility of change in the design.) OK, ""include the possibility of change in the design"! I've heard this kind of argument before. Be specific: let's suppose the very best expert in the industry has told you to use (city, state) PK. How are you gonna "include the possibility of change in the design" in this particular case?
Be specific!
Ed prochak - 27 Jul 2004 19:03 GMT > > One: I think your space savings estimates are exagerated. > > Two: I think your speed savings estimates are clearly inflated. Using > > "inflated" or not they are real. i created 2 schemas on the same > tablespaces and loaded same production data into both ... and what results? Your earlier post seemed to have figures that seemed too vague (50% IIRC).
> > the surrogate key means a join on every payment query. > just plain not true Before changing table you said:
>> Let's create PAYMENTS(CITY, >> STATE, COUNTRY, ADDRESS, PMT_DATE, PMT_AMOUNT, PRIMARY KEY(CITY, >> STATE, COUNTRY, ADDRESS, PMT_DATE)). After changing it you said
> use up. Then let's create another table PAYMENTS_SUROGATE(ADDRESS_ID, > PMT_DATE, PMT_AMOUNT, PRIMARY KEY(ADDRESS_ID, PMT_DATE)), but you leave out the needed address information table SUROGATE_ADDRESS(ADDRESS_ID, CITY, STATE, COUNTRY, ADDRESS)
so to get all the info in the original, natural key table, you MUST do a join in the surogate case.
> > surrogates. Disc prices continue to drop year by year so the storage > > costs are seldom the real issue. There is a constant struggle between > > if you are speaking about adding 1 more 100Gb drive into a cheap intel > box, I would completely agree. Think big - going from 30 to 50 Tb is > major undertaking, and an expensive one. I must admit I haven't worked with tables or databases in the terabyte range. Those systems can have a different kind of budget than most applications (like the difference between running a 5k and a marathon, very different energy budgets in running those).
> > That's the point and the ROOT CAUSE of the problem. Get someone on > > your team that knows the subject area and do it right the first time. [quoted text clipped - 8 lines] > > Be specific! Doesn't sound like much of an expert. It's not like the issues of international addresses are unknown.
Your address example is somewhat contrived since it almost purposely left out known possibilities. There are different kinds of addresses and each can be structured in different ways. Most of my experience is in American addressing issues.
There's a postal address (likely doesn't include county in the USA, and maybe not even anything equivalent to state/province in some small countries). Then there's a legal address which is bound to include county along with city and state (at least in the USA), but might use map/page/plot number instead of the street address. Then there's you physical address for things like the 911 system. In rural areas that address might be nearly anything including just directions (turn right of Rt 9, 5miles to farmhouse on left). And the geographic address given by latitude and longtitude. Some systems may have to include multiples of these forms and therefore be required to have multiple address tables. A surogate (or three) becomes very useful in that case (surprised to hear me say that?) but can be very tricky to maintain.
as to flexibility, make it so that adding additional columns should not break existing applications, even if you are expanding primary keys. Your city,state to city,county,state example can be done. In the USA version of the application, which in the example is the existing application, can ignore county. Inserting new rows would just require a trigger to populate a default value for the county. Then the new Russian version can use the county attribute as needed. So is that specific enough? It allows the old application to continue working while the new version is developed. Note the trigger might be smart in that it could try to supply the correct county for each entry instead of some fixed value (It could do this correctly for a large percentage of USA city/states, but not all).
One final note: the combination city,state is not unique in the USA either. At least if you are using unincorporated cities. I thought there were two Springfield's in at least one state. Using full names might be needed. For example I live in the City of Twinsburg, right next to Twinsburg Township. Postal addresses look the same, but legal addresses are VERY different. Oh, and watch out with including county. The city of New York is in more than one county (5 of them?).
Just pointing out that it all goes back to your application requirements. Surogates should not be banned, but neither should they be the first choice (which happens far too often). The database serves up data to the application. The possibility of change comment refers to the combination, not just the datamodel.
Later, ed
Gene Wirchenko - 28 Jul 2004 00:30 GMT [snip]
>There's a postal address (likely doesn't include county in the USA, There might not be a postal address. I was going through properties in an MLS computer system. I found a lot in downtown Vancouver, British Columbia, Canada that did not have a street address.
>and maybe not even anything equivalent to state/province in some small >countries). Then there's a legal address which is bound to include Singapore IIRC.
>county along with city and state (at least in the USA), but might use >map/page/plot number instead of the street address. Then there's you That is *will* use in my experience. I have never seen a legal address that was a street address. My experience is in British Columbia, Canada.
>physical address for things like the 911 system. In rural areas that >address might be nearly anything including just directions (turn right >of Rt 9, 5miles to farmhouse on left). And the geographic address Nit: that is not an address but navigational instructions.
>given by latitude and longtitude. Some systems may have to include >multiples of these forms and therefore be required to have multiple >address tables. A surogate (or three) becomes very useful in that case >(surprised to hear me say that?) but can be very tricky to maintain. [snip]
>One final note: the combination city,state is not unique in the USA >either. At least if you are using unincorporated cities. I thought I have found a few exceptions in Canada, too.
>there were two Springfield's in at least one state. Using full names >might be needed. For example I live in the City of Twinsburg, right [quoted text clipped - 5 lines] >requirements. Surogates should not be banned, but neither should they >be the first choice (which happens far too often). The database serves Agreed on that.
>up data to the application. The possibility of change comment refers >to the combination, not just the datamodel. Sincerely,
Gene Wirchenko
Computerese Irregular Verb Conjugation: I have preferences. You have biases. He/She has prejudices.
AK - 31 Jul 2004 23:54 GMT > > "inflated" or not they are real. i created 2 schemas on the same > > tablespaces and loaded same production data into both > > ... and what results? Your earlier post seemed to have figures that > seemed too vague (50% IIRC). that's an executive summary, the average over several dozen frequently used queries
> but you leave out the needed address information table > SUROGATE_ADDRESS(ADDRESS_ID, CITY, STATE, COUNTRY, ADDRESS) > so to get all the info in the original, natural key table, you MUST do > a join in the surogate case. not nesessarily. there are 3 cases: 1. SELECT AMOUNT, PMT_DATE for one customer - no need for a join, faster 2. SELECT AMOUNT, PMT_DATE, ADDRESS, CITY - joinn is nesessary only for surrogate keys, slower 3. SELECT AMOUNT, PMT_DATE, CUSTOMER.SOME_OTHER_ATTRIBUTE - join is neede in both cases, faster
so it's on average slower with surrogate keys only if situation 2 occurs more often
> I must admit I haven't worked with tables or databases in the terabyte > range. Those systems can have a different kind of budget than most > applications (like the difference between running a 5k and a marathon, > very different energy budgets in running those). one of the diffrences is you don't want to change database structure. surrogate keys let you limit changes to 1 table only
> Doesn't sound like much of an expert. It's not like the issues of > international addresses are unknown. (snip)
> Just pointing out that it all goes back to your application > requirements. Surogates should not be banned, but neither should they > be the first choice (which happens far too often). The database serves > up data to the application. The possibility of change comment refers > to the combination, not just the datamodel. thanks for a detailed answer. Still, if I have a big huge table PAYMENT referring to ADDRESS, and if I later need to add a column to the natural PK on ADRESS table, that's gonna be quite a problem. I might want to avoid adding a column to a very large table and rebuilding a foreign key, as I have other plans for my weekends. I could choose to utilize surrogate PK just for that purpose. What do you think?
AK - 07 Jul 2004 07:22 GMT > I don't really understand why would you want to do something like this. > If the user_login is unique, why not just make the user_login the > primary key? Is this for efficiency reasons or am I missing something > more fundamental? Once upon a time the company I work for decided to change everybody's logins. We used to login as SMITHJ, which was changed to JSMITH. How are you going to implement changing user_login if is the primary key? If the primary key is surrogate, it is a snap.
|
|
|