Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General DB Topics / General DB Topics / July 2004

Tip: Looking for answers? Try searching our database.

database schema (PRI, UNI keys)

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Titus Barik - 05 Jul 2004 22:22 GMT
Hi all,

Looking at the Wordpress database schema I notice that their wp_users
table is defined as follows (snippet):

Field       Type               Key      Default     Extra
ID          int(10) unsigned   PRI      NULL        auto_increment
user_login  varchar(20)        UNI

I don't really understand why would you want to do something like this.
If the user_login is unique, why not just make the user_login the
primary key? Is this for efficiency reasons or am I missing something
more fundamental?

Thanks in advance,

Signature

titus barik (titus@barik.net) | www.barik.net | aim: tbarik
godfather Armand concierges Presbyterian fevered physical
thrower's accomplishment Manuela Uriel's Meighen's versification

--CELKO-- - 06 Jul 2004 04:33 GMT
>> I don't really understand why would you want to do something like
this. If the user_login is unique, why not just make the user_login
the primary key? <<

This is a common design mistake.  The use of any
non-relational,proprietary auto-increment as a key is an attempt to
imitate a sequential file system and it destroys data integrity.

A long key is not always a bad thing for performance.  For example, if
I use (city, state) as my key, I get a free index on just (city).  I
can also add extra columns to the key to make it a super-key when such
a super-key gives me a covering index (i.e. an index which contains
all of the columns required for a query, so that the base table does
not have to be accessed at all).

Faking a sequential file's positional record number, so I can
reference the physical storage location?  Sure, if I want to lose all
the advantages of an abstract data model, SQL set oriented
programming, carry extra data and destroy the portability of code!

More and more programmers who have absolutely no database training are
being told to design a database.  They are using GUIDs, IDENTITY,
ROWID and other proprietary auto-numbering "features" in SQL products
to imitate either a record number (sequential file system mindset) or
OID (OO mindset) since they don't know anything else.

Experienced database designers tend toward intelligent keys they find
in industry standard codes, such as UPC, VIN, GTIN, ISBN, etc.  They
know that they need to verify the data against the reality they are
modeling.  A trusted external source is a good thing to have.

The auto-increment column is a holdover from the early programming
languages which were very close to the hardware.  For example, the
fields (not columns; big difference) in a COBOL or FORTRAN program
were assumed to be physically located in main storage in the order
they were declared in the program.  The languages have constructs
using that model -- logical and physical implementations are
practically one!  The data has meaning BECAUSE of the program reading
it (i.e. the same bits could be a character in one program and be an
integer in another)

The early SQLs were based on existing file systems.  The data was kept
in physically contiguous disk pages, in physically contiguous rows,
made up of physically contiguous columns.  In short, just like a deck
of punch cards or a magnetic tape.  Most programmer still carry that
mental model, which is why I keep doing that rant about file vs.
table, row vs. record and column vs. field.

But physically contiguous storage is only one way of building a
relational database and it is not the best one. The basic idea of a
relational database is that user is not supposed to know *how* or
*where* things are stored at all, much less write code that depends on
the particular physical representation in a particular release of a
particular product on particular hardware at a particular time.

One of the biggest errors is the auto-increment column (actually
property, not a column at all) in the Sybase/SQL Server family.
People actually program with this "feature" and even use it as the
primary key for the table!  Now, let's go into painful details as to
why this thing is bad.

The first practical consideration is that auto-increment is
proprietary and non-portable, so you know that you will have
maintenance problems when you change releases or port your system to
other products.  Newbies actually think they will never port code!
Perhaps they only work for companies that are failing and will be
gone.  Perhaps their code is such crap nobody else want their
application.

But let's look at the logical problems.  First try to create a table
with two columns and try to make them both auto-increment.  If you
cannot declare more than one column to be of a certain data type, then
that thing is not a datatype at all, by definition.  It is a property
which belongs to the PHYSICAL table, not the LOGICAL data in the
table.

Next, create a table with one column and make it an auto-increment  
Now try to insert, update and delete different numbers from it.  If
you cannot insert, update and delete rows from a table, then it is not
a table by definition.

Finally create a simple table with one auto-increment column and a few
other columns.  Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the auto-increment
sequentially numbered them in the order they were presented.  If you
delete a row, the gap in the sequence is not filled in and the
sequence continues from the highest number that has ever been used in
that column in that particular table.  This is how we did record
numbers in pre-allocated sequential files in the 1950's, by the way. A
utility program would then "pack" or "compress" the records that were
flagged as deleted or unused to move the empty space to the physical
end of the physical file.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
 FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the auto-increment numbers be?  The entire,
whole, completed set is presented to Foobar all at once, not a row at
a time.  There are (n!) ways to number (n) rows, so which one do you
pick?  The answer has been to use whatever the *physical* order of the
result set happened to be.  That non-relational phrase "physical
order" again!

But it is actually worse than that.  If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order.

Can you explain from a logical model why the same rows in the second
query get different auto-increment numbers?  In the relational model,
they should be treated the same if all the values of all the
attributes are identical.

Using auto-increment as a primary key is a sign that there is no data
model, only an imitation of a sequential file system.  Since this
"magic, all-purpose, one-size-fits-all" pseudo-identifier exists only
as a result of the physical state of a particular piece of hardware at
a particular time as read by the current release of a particular
database product, how do you verify that an entity has such a number
in the reality you are modeling?

You will see newbies who design tables like this:

CREATE Drivers
(driver_id auto-increment NOT NULL PRIMARY KEY,  
ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

Now input data and submit the same row a thousand times, a million
times.  Your data integrity is trashed.  The natural key was this:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
PRIMARY KEY (ssn, vin));

To demonstrate, here is a typical idiot newbie schema -- you will them
all over the news groups.  I call them "idiots" because they always
name the auto-increment property column "id" in EVERY table.  They
don't understand basic data modeling -- one and only name for an
attribute.  About half the time they don't use any DRI, but let's show
it.

CREATE TABLE MotorPool
(id auto-increment  NOT NULL PRIMARY KEY,
personnel_id INTEGER NOT NULL REFERENCES Personnel(id),
vehicle_id INTEGER NOT NULL REFERENCES Vehicle(id));

CREATE TABLE Personnel
(id auto-increment NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL UNIQUE,
..);

CREATE TABLE Vehicles
(id auto-increment NOT NULL PRIMARY KEY,
vin CHAR(17) NOT NULL UNIQUE,
..);

Now change the natural key in Personnel:

UPDATE Personnel
  SET ssn = '666666666'
WHERE ssn = '000000000';

Nothing happened in Motorpool, did it?  You can do the same thing with
a VIN.

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering.  Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the auto-increment and another uses the real key, and like
a man with two watches, you are never sure what time it is.

Finally, an appeal to authority, with a quote from Dr. Codd:
"..Database users may cause the system to generate or delete a
surrogate, but they have no control over its value, nor is its value
ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and
Codd, E. (1979), Extending the database relational model to capture
more meaning.  ACM Transactions on Database Systems, 4(4).  pp.
397-434.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user.  That means
never used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1)  The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result
that some or all of the serial numbers might be changed.).

(2)  Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3)  It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it
has ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities.  A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates.  Database users may cause the system to generate or delete
a surrogate, but they have no control over its value, nor is its value
ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture
more meaning.  ACM Transactions on Database Systems, 4(4).  pp.
397-434
Nick Landsberg - 06 Jul 2004 22:55 GMT
A very impassioned and eloquent post, Mr. Celko!

I have a few things to add from my experiences when
I was in the database world.  Most of which reinforce
your statements.

>>>I don't really understand why would you want to do something like
>
[quoted text clipped - 7 lines]
> A long key is not always a bad thing for performance.  For example, if
> I use (city, state) as my key, I get a free index on just (city).

Absolutely, if you assume B-tree (or other tree-like indices).
It does not work with hash indices, but they tend to be
vendor-specific implementations.  A study of the most
common access patterns can help determine whether it
should be a hash index or a tree.  The answer is "sometimes
both (and let the DBMS optimizer decide which to use for
any given query.)"  It wastes space, sure.  Disk space
no longer seems to be a problem nowadays, but I will probably
get rebuttals about that. (Some of them from myself.  I like
to argue with myself. Sometimes I even win the arguments :)

> ... I can also add extra columns to the key to make it a super-key when such
> a super-key gives me a covering index (i.e. an index which contains
> all of the columns required for a query, so that the base table does
> not have to be accessed at all).

That particular technique also may waste space, but it
should be etched on the inner side of the eyelids of
any physical (as opposed to logical) DBMS designer
if performance is the primary consideration!  Again,
it works quite nicely with tree-like indices.

[ Minor Snip ]

> More and more programmers who have absolutely no database training are
> being told to design a database.  They are using GUIDs, IDENTITY,
> ROWID and other proprietary auto-numbering "features" in SQL products
> to imitate either a record number (sequential file system mindset) or
> OID (OO mindset) since they don't know anything else.

That first sentence of yours in the above probably sums it up
in a nutshell. I have seen way too many database applications
designed by people who were not trained properly.
The result usually was that something suffered.
(Most often performance, sometimes data integrity
or consistenty.)

The typical reaction was to "blame the database."
Switching from Oracle to Sybase or vice versa (often
presented as a proposed solution), did absolutely
nothing to solve the problem.

A discussion about logical schema vs. physical
schema I will leave to another time, if it ever
comes up :)

[REST SNIPPED]

NPL
Signature

"It is impossible to make anything foolproof
because fools are so ingenious"
 - A. Bloch

Laconic2 - 07 Jul 2004 16:01 GMT
I agree mostly with you, Nick and Joe.

It is certainly true that a large number of databases have been designed by
persons who just barely understand how to fetch data from a database,  and
lack the minimal background necessary to distinguish between a good design,
and one that is going to fail in practice.

Having said that, I'm going to take issue with the two of you on one aspect
of this discussion.

The decision to use internal keys  (surrogate keys) as identifiers rather
than trusted external keys (natural keys) is often based on sloppy analysis,
rather than sloppy design.  That is, the decision that a given external key
cannot be trusted is sometimes made in relatively complete ignorance of the
environment into which the database and the application(s) are going to be
placed.   (There are cases where surrogate keys are appropriate, but that's
another story).

The ignorance that result from sloppy analysis seems to be a badge of honor
among the people who pride themselves on "rapid development"  these days.
Back in the stone age (my era), many projects bogged down in excessive
analysis,  and the rapid development reaction was perhaps inevitable.  But
superficial analysis can cost as much time as excessive analysis.  The
wasted time shows up as a consequence of bad design,  but the bad design  is
really a consequence of sloppy analysis.

Again,  this only one aspect.  Even with adequate analysis,  if the design
is done by someone that doesn't know how to design,  bed choices are almost
certain.
AK - 07 Jul 2004 21:12 GMT
> placed.   (There are cases where surrogate keys are appropriate, but that's
> another story).

such as? could you please elaborate?
-P- - 22 Jul 2004 14:29 GMT
> > placed.   (There are cases where surrogate keys are appropriate, but that's
> > another story).
> >
>  such as? could you please elaborate?

Here's one: Our software manages TV commercials for TV stations and cable networks.
For example, Miller Brewing calls up WGN in Chicago and wants to run 6 Lite Beer commercials during the next Cubs game.
We have a requirement (actually an FCC requirement) to uniquely identify each of the six "instances" of this single
order - the spots must be serialized for accounting purposes.  We use an autoincrement system-generated surrogate key to
serialize them.
While there are industry standard codes for the actual copy that will air (it's called an ISCI number) - each of the six
instances carries the same ISCI.  We can't use a date/timestamp either, because the scheduled airtime of each spot is
not assigned at the time the order is placed.  There IS NO industry standard identifier for this.  ISBN's, SSNs, and
other industry-standard identifiers are great, and should be used whenever available, but to eschew surrogate keys for
ALL entities in EVERY datamodel shows a limited scope of knowledge.

Signature

Paul Horan
Sr. Architect
VCI  Springfield, MA
www.vcisolutions.com

Laconic2 - 22 Jul 2004 18:29 GMT
While I agree with your general philosophy regarding surrogate keys,  I
don't think the example given below really illustrates the point.

A "surrogate" key,  by definition is an artificial key that is used in place
of a natural key.  In the example you have given us,  there is no natural
key for the entity you need to identify (the "instance" of the ad).
Therefore the creation of an artificial key,  while correct,  is not really
a case of generating a surrogate for the natural key.

Rather it's generating a key,  natural or artificial, where one is needed
and none exists.

I hope I'm not stirring the pot too much here.

> > > placed.   (There are cases where surrogate keys are appropriate, but that's
> > > another story).
[quoted text clipped - 11 lines]
> other industry-standard identifiers are great, and should be used whenever available, but to eschew surrogate keys for
> ALL entities in EVERY datamodel shows a limited scope of knowledge.
-P- - 23 Jul 2004 00:03 GMT
> While I agree with your general philosophy regarding surrogate keys,  I
> don't think the example given below really illustrates the point.
[quoted text clipped - 9 lines]
>
> I hope I'm not stirring the pot too much here.

No - that's a point well taken.   With that definition in mind, we do actually use one surrogate key.  It's the entity
that describes the "structure" of a television program.

The natural key is:
network           smallint ,  // the channel the program is airing on
schedule_date  date ,      // self-explanatory
schedule_type  smallint ,  //  the regular schedule, or an alternate log in case of a rainout?
start_time        integer  ,  // the number of seconds past the station's "start of day" time

That combination of columns uniquely identifies every program segment and commercial break on a station log.  There are
child tables that have foreign keys back to this table.   Rather than continue to propagate the concatenated key, we
introduce a surrogate key "record_id" (yeah, yeah, I know...   "Rows are not records, tables are not files...") that
becomes the foreign key to the child tables.  To illustrate, a 2-minute break in Seinfeld is one row in this table.
However, we may place four :30 second commercials into that break (or two :60s, or one :60, one :30, and two :15s...)
The associative entity between the "spots" (from the previous thread) and the "break" shown above is simply (spot_id,
record_id) rather than (spot_id, {four more columns}).

Signature

Paul Horan
Sr. Architect
VCI   Springfield, MA
www.vcisolutions.com

Ed prochak - 23 Jul 2004 13:12 GMT
> > While I agree with your general philosophy regarding surrogate keys,  I
> > don't think the example given below really illustrates the point.
[quoted text clipped - 26 lines]
> The associative entity between the "spots" (from the previous thread) and the "break" shown above is simply (spot_id,
> record_id) rather than (spot_id, {four more columns}).

I'm curiout about this question: What benefit do you see in using the
Surrogate versus the concatenated primary key?

Back when disc space and CPU cycles were expensive, I could see a
justification for having a smaller key. What you are doing is
basically moving the index out of a hidden place in the DBMS and into
full view as part of your tables. Queries aren't any faster. Only the
time it takes to type the query is faster (one column versus four
columns).

Then the risk of corrupt data now increases. And your example of the
"spots" seems to bear that out. What stops you from "overbooking"
3minutes of spots in the 2minute time slot? (Yes you may argue that
that's why you want the surrogate, you plan on occasionally breaking
the rules! And if that meets the business requirement, then you may be
going down the right path.)

Like I said, I'm just curious.
  Ed
-P- - 23 Jul 2004 19:18 GMT
> I'm curiout about this question: What benefit do you see in using the
> Surrogate versus the concatenated primary key?

I only touched on it briefly, but there are several child tables "under" this table - including some at two more levels
down.  Propagating ALL that data (and the associated indices) felt "unwieldy".  Plus, a typical operation in our system
is the "sliding" of a program to a later timeframe.  Seinfeld isn't running at 10pm anymore, it's running at 10:30.  If
the start_time column is part of the primary key, then the update must be cascaded down to all those dependent tables.
With the current structure, the update only happens at this "root node".

> Back when disc space and CPU cycles were expensive, I could see a
> justification for having a smaller key. What you are doing is
> basically moving the index out of a hidden place in the DBMS and into
> full view as part of your tables. Queries aren't any faster. Only the
> time it takes to type the query is faster (one column versus four
> columns).

Queries on a 5-column concatenated index (made up of integers and date columns) aren't any faster than two integers?
We have hundreds of thousands of these in a typical station database.  Larger properties will easily exceed a million
rows.

> Then the risk of corrupt data now increases.

How?  What types of corruption are we talking about?

> And your example of the
> "spots" seems to bear that out. What stops you from "overbooking"
> 3minutes of spots in the 2minute time slot? (Yes you may argue that
> that's why you want the surrogate, you plan on occasionally breaking
> the rules! And if that meets the business requirement, then you may be
> going down the right path.)

We do allow breaks to be oversold - it's on the users to ensure that the log is "airable" before its shipped off to
Master Control.

> Like I said, I'm just curious.
>    Ed

Signature

Paul Horan
Sr. Architect
VCI   Springfield, MA
www.vcisolutions.com

Ed prochak - 26 Jul 2004 17:53 GMT
> > I'm curiout about this question: What benefit do you see in using the
> > Surrogate versus the concatenated primary key?
[quoted text clipped - 4 lines]
> the start_time column is part of the primary key, then the update must be cascaded down to all those dependent tables.
> With the current structure, the update only happens at this "root node".

The "unweildy" justification is weak, but I do see your advantage in
the sliding of a program.

This propogation problem might also indicate that the normalization of
the tables was somewhat incomplete. The program and it's timeslots are
really different entities so that was an option at design time that
may have been skipped. (ie instead of a program table with time as
part of the PK, there might have been a program/show table and a
scehdule table with a show/timeslot relation joining them. note this
is just off the top of my head, it's not a detailed/deeply analyzed
plan) Given your current design there are some advantages.

> > Back when disc space and CPU cycles were expensive, I could see a
> > justification for having a smaller key. What you are doing is
[quoted text clipped - 6 lines]
> We have hundreds of thousands of these in a typical station database.  Larger properties will easily exceed a million
> rows.

I'm speaking primarily from an ORACLE background, but I'd expect any
modern DBMS to have fast lookups on the PK index, whether the index
consists of 5 columns of mixed datatypes or 2 columns of both
integers, even on millions of rows. If your DBMS has significant
differences in those two cases, it might be time to consider a
different DBMS.

> > Then the risk of corrupt data now increases.
>
> How?  What types of corruption are we talking about?

I mentioned in another post (in may have been on another thread) about
a system that had addresses consisting of the house number and a
street ID, storing the street names in another table. sometime along
the way the ID's in the street table were corrupted. Needless to say,
fixing that case was a real nightmare.

> > And your example of the
> > "spots" seems to bear that out. What stops you from "overbooking"
[quoted text clipped - 5 lines]
> We do allow breaks to be oversold - it's on the users to ensure that the log is "airable" before its shipped off to
> Master Control.

So I guessed right about the business need. Hopefully, you provide the
users some tools to deal with the process of resolving overbooking
(maybe at least a report that identifies overlaps). Sounds like the
system works, which is the bottom line.

Thanks for the feedback!

 Ed
-P- - 26 Jul 2004 19:35 GMT
"Ed prochak" <ed.prochak@magicinterface.com> wrote in message

> So I guessed right about the business need. Hopefully, you provide the
> users some tools to deal with the process of resolving overbooking
> (maybe at least a report that identifies overlaps). Sounds like the
> system works, which is the bottom line.

You should see this report...
In addition to oversold/undersold breaks, it has to check for things like:
- Spots with the same inventory category in the same break (two car dealership ads);
- Spots from the same advertiser in the same break (Wrigley's doublemint vs. juicy fruit);
- Spots placed in "inappropriate" programming (Bud lite ads in kid's cartoons);
- Was there enough "Canadian" content per hour for our stations north of the border;

...and a host of other potential rules violations.  But I digress...

Signature

Paul Horan
Sr. Architect  VCI
Springfield, MA
www.vcisolutions.com

Laconic2 - 26 Jul 2004 21:47 GMT
> You should see this report...
> In addition to oversold/undersold breaks, it has to check for things like:
[quoted text clipped - 4 lines]
>
> ...and a host of other potential rules violations.  But I digress...

Dare I say that it might be time for someone who is good at report analysis
to look at this report and see if its is being made to do more work than a
single report ought to do?

It sounds like a "kitchen sink" report to me.
-P- - 26 Jul 2004 21:54 GMT
<snip>

>someone who is good at report analysis

You're talking to one.  Thanks for your concern, though...

It's actually pretty well-designed.  Each "section" of the report is a different query, fed into a PowerBuilder
composite datawindow.  The report runs top to bottom in about 20 seconds for a typical station.

Signature

Paul Horan
Sr. Architect
VCI   Springfield, Ma
www.vcisolutions.com

Laconic2 - 27 Jul 2004 14:23 GMT
> It's actually pretty well-designed.  Each "section" of the report is a different query, fed into a PowerBuilder
> composite datawindow.  The report runs top to bottom in about 20 seconds for a typical station.

Fine.  My concern was not how long it takes to run, but whether the
information the report provides "belongs" in a single report.  It sounds
like you've made a conscious decision that it does.  I won't second guess
you.

From a database persepctive,  the fact that each section is driven by a
separate query means that, for all practical purposes, we could treat them
as if they were a separate report.

Given that,  could you restate your original question concerning database
design?  I don't think that databases designed for multiple queries are
particularly unusual.
-P- - 27 Jul 2004 18:09 GMT
> > It's actually pretty well-designed.  Each "section" of the report is a
> different query, fed into a PowerBuilder
[quoted text clipped - 13 lines]
> design?  I don't think that databases designed for multiple queries are
> particularly unusual.

I really didn't have an original question...  The debate was on the usefullness/appropriateness of surrogate keys over
natural, concatenated keys.  In defense of surrogate keys, I brought up an example from our software where used one, and
found it effective.  I'm certainly against their use on ALL tables, but this one example (replacing a 4 column natural
key with a surrogate IDENTITY column, that the users never see or manipulate) demonstrated one possible use for them.

Signature

Paul Horan
Sr. Architect
VCI  Springfield, MA
www.vcisolutions.com

AK - 27 Jul 2004 15:13 GMT
> I'm speaking primarily from an ORACLE background, but I'd expect any
> modern DBMS to have fast lookups on the PK index, whether the index
> consists of 5 columns of mixed datatypes or 2 columns of both
> integers, even on millions of rows. If your DBMS has significant
> differences in those two cases, it might be time to consider a
> different DBMS.

Over the past 10 years I've worked with DB2, Oracle, Sybase, and MS
SQL Server.
For all these RDBMS the difference between joining on 5 columns of
mixed datatypes and joining on 2 integers will definitely be
significant.

Why don't you set up some benchmarks and see for yourself.
David Best - 22 Jul 2004 22:44 GMT
> > > placed.   (There are cases where surrogate keys are appropriate, but that's
> > > another story).
[quoted text clipped - 11 lines]
> other industry-standard identifiers are great, and should be used whenever available, but to eschew surrogate keys for
> ALL entities in EVERY datamodel shows a limited scope of knowledge.

So then it's not a surrogate;  you are actually defining a natural key and
acting in the role of the industry standards group for your company.
AK - 07 Jul 2004 03:18 GMT
from where I sit, natural keys aren't perfect either.
Primary keys aren't supposed to change, never, ever, are they?
Regarding your own example, (city, state), once upon a time I lived in
a city that got renamed.
Having to deal with changing PK is a very common problem in database
programming, correct?
Regarding using SS# as a good natural primary key, it used to be true
several years ago, but not anymore. There is a growing concern over
exposing people's SSN's unless it is absolutely necessary.
My current health insurance card does not have my SSN on it, because
my employer has demanded that SSN be replaced with a meaningless
generated indentifier, definitely a surrogate key. Speaking SQL, on
customers' requests, some people had to run UPDATE SOME_TABLE SET
SSN=0 against some of their tables and were forced to use surrogate
keys. Those who assumed that SSN is guaranteed to be a good natural
primary key forever got in really big trouble. Those "idiots" who
weren't assuming that they know absolutely everything and can predict
the future got away much much easier.

I'm sure lots of people could add to my list.

Finally, there are good decisions and quick decisions. Sometimes it is
profitable to invest time and effort and make a good decision (choose
a natural key), because good decisions usually last longer (though
still not forever, I've given some examples above). however, sometimes
it is profitable to make quick decisions (surrogate keys) and deliver
working solutions ASAP. Identities and sequences provide a very
convenient shortcut and help us in making quick decisions. In database
programming, it always depends and there are no hard and fast rules.
Troels Arvin - 07 Jul 2004 12:08 GMT
> Primary keys aren't
> supposed to change, never, ever, are they?

Who says that a primary key can never change?

> Regarding your own example,
> (city, state), once upon a time I lived in a city that got renamed.

Probably even more common, at least in Denmark, at the moment: cities
fuse. And then, well, you have to deal with it one way or the other.

I think that Tom has a good point in that it may pay off to try to use
natural or - at least - intelligent keys: In some DBMSes, it may
require a bit of work when keys change, but it's of great importance that
you are able to match real-World entities with data in your DBMS.

Signature

Greetings from Troels Arvin, Copenhagen, Denmark

AK - 08 Jul 2004 02:45 GMT
Troels,
let me be the Devil's advocate today.
Obviously surrogate keys have their disadvantages, nobody argues with
that.
However, natural keys also have very pronounced disadvantages either.
Let me give just one example.

Elaborating on (City, state) natural primary key, let's suppose we
have dealt with cities getting renamed and mergeing with other cities.
Let's discuss storage considerations. Let's create PAYMENTS(CITY,
STATE, COUNTRY, ADDRESS, PMT_DATE, PMT_AMOUNT, PRIMARY KEY(CITY,
STATE, COUNTRY, ADDRESS, PMT_DATE)). Let's just populate it with some
data and see how much space the table and the index implementing PK
use up. Then let's create another table PAYMENTS_SUROGATE(ADDRESS_ID,
PMT_DATE, PMT_AMOUNT, PRIMARY KEY(ADDRESS_ID, PMT_DATE)), populate it
and compare storage requirements. Most probably we'll need at least
50% less storage, with many queries running 50% faster. All right,
that's the price we are ready to pay for the logical purity of our
design. Let's suppose the top management are unaware of potential cost
savings and the database goes into production as is.

Some time later the sales team wins a contract in Russia.
Unfortunately, the system won't work in Russia, and the reason is
simple: (CITY, STATE) is not supposed to be unique in Russia. (CITY,
COUNTY, STATE) is unique. We don't have COUNTY in our database. now we
need to choose between:
1. Creating a separate version for Russia
2. Adding COUNTY NOT NULL column to almost every table in the
database. As a PK column, it has to be NOT NULL. MOst queries joining
tables need to be changed.
3. Introducing a surrogate key, which would take about as much time as
option 2

Going for option 2 does not guarantee that the situation won't repeat.
We might later learn that some other combination of columns should be
used as natural primary key in India, or Brazilia, or elsewhere.

I hope my point is clear enough: in many cases we have to develop
applications without enough knowledge on the subject area. Errors in
choosing a natural primary key may be far more expensive than the
disadvantages of using surrogate keys. Performancewise, natural
primary keys may also be not the best alternative.

What do you think?
Ed prochak - 08 Jul 2004 16:23 GMT
> Troels,
> let me be the Devil's advocate today.
[quoted text clipped - 16 lines]
> design. Let's suppose the top management are unaware of potential cost
> savings and the database goes into production as is.

One: I think your space savings estimates are exagerated.
Two: I think your speed savings estimates are clearly inflated. Using
the surrogate key means a join on every payment query. A good DBMS
will reduce the overhead of accessing the second table, but it cannot
reduce it to zero.

I am not one that advocates for bloated software and data, but in the
absence of hard numbers and direct business requirements (e.g., the DB
must run on a portable device like a GPS) this is a poor argument for
surrogates. Disc prices continue to drop year by year so the storage
costs are seldom the real issue. There is a constant struggle between
higher levels of normalization (saves storage space) and higher
performance (flatter, denormalized tables). Surrogate keys can be a
false economy (making you think you saved a little on both).

> Some time later the sales team wins a contract in Russia.
> Unfortunately, the system won't work in Russia, and the reason is
[quoted text clipped - 7 lines]
> 3. Introducing a surrogate key, which would take about as much time as
> option 2

Hmm, doing it wrong (3) takes about as much time as doing it right
(2), so what do we save again???

> Going for option 2 does not guarantee that the situation won't repeat.

But the surrogate doesn't fair any better in this example so why do it
wrong?

> We might later learn that some other combination of columns should be
> used as natural primary key in India, or Brazilia, or elsewhere.
>
> I hope my point is clear enough: in many cases we have to develop
> applications without enough knowledge on the subject area.

That's the point and the ROOT CAUSE of the problem. Get someone on
your team that knows the subject area and do it right the first time.
(Well at least as close to right as possible, then include the
possibility of change in the design.)

> ..... Errors in
> choosing a natural primary key may be far more expensive than the
> disadvantages of using surrogate keys. Performancewise, natural
> primary keys may also be not the best alternative.
>
> What do you think?

It is this conclusion that I find wrong. Having done database
conversions, I found surrogate keys can lead to conversion disaters.
One company tried to save space by replacing street names with a
surrogate key, but somewhere back in time, the link table was
corrupted. The net result was at conversion time, we got a feed with
street addresses like 456 ZZXY. Sure saved a lot of space in the dump
file, but made the client's life a living hell! We programmatically
cleaned up as much as we could, but ultimately the client had to start
calling customers to verify their address.

You might argue that you won't let that happen on YOUR system, but I'm
sure that's what the programmers of that old system thought.

Are surrogates ALWAYS a bad idea? IMHO, they should be used, like
dynamite, with EXTERME CAUTION! There are times when you have to blow
up the mountain, but when you do, it is always messy.

 Ed
AK - 22 Jul 2004 04:09 GMT
> One: I think your space savings estimates are exagerated.
> Two: I think your speed savings estimates are clearly inflated. Using

"inflated" or not they are real. i created 2 schemas on the same
tablespaces and loaded same production data into both

> the surrogate key means a join on every payment query.
just plain not true

> surrogates. Disc prices continue to drop year by year so the storage
> costs are seldom the real issue. There is a constant struggle between

if you are speaking about adding 1 more 100Gb drive into a cheap intel
box, I would completely agree. Think big - going from 30 to 50 Tb is
major undertaking, and an expensive one.

> That's the point and the ROOT CAUSE of the problem. Get someone on
> your team that knows the subject area and do it right the first time.
> (Well at least as close to right as possible, then include the
> possibility of change in the design.)

OK, ""include the possibility of change in the design"!
I've heard this kind of argument before. Be specific: let's suppose
the very best expert in the industry has told you to use (city, state)
PK. How are you gonna "include the possibility of change in the
design" in this particular case?

Be specific!
Ed prochak - 27 Jul 2004 19:03 GMT
> > One: I think your space savings estimates are exagerated.
> > Two: I think your speed savings estimates are clearly inflated. Using
>
> "inflated" or not they are real. i created 2 schemas on the same
> tablespaces and loaded same production data into both

... and what results? Your earlier post seemed to have figures that
seemed too vague (50% IIRC).

> > the surrogate key means a join on every payment query.
> just plain not true

Before changing table you said:
>> Let's create PAYMENTS(CITY,
>> STATE, COUNTRY, ADDRESS, PMT_DATE, PMT_AMOUNT, PRIMARY KEY(CITY,
>> STATE, COUNTRY, ADDRESS, PMT_DATE)).

After changing it you said
> use up. Then let's create another table PAYMENTS_SUROGATE(ADDRESS_ID,
> PMT_DATE, PMT_AMOUNT, PRIMARY KEY(ADDRESS_ID, PMT_DATE)),

but you leave out the needed address information table
SUROGATE_ADDRESS(ADDRESS_ID, CITY, STATE, COUNTRY, ADDRESS)

so to get all the info in the original, natural key table, you MUST do
a join in the surogate case.

> > surrogates. Disc prices continue to drop year by year so the storage
> > costs are seldom the real issue. There is a constant struggle between
>
> if you are speaking about adding 1 more 100Gb drive into a cheap intel
> box, I would completely agree. Think big - going from 30 to 50 Tb is
> major undertaking, and an expensive one.

I must admit I haven't worked with tables or databases in the terabyte
range. Those systems can have a different kind of budget than most
applications (like the difference between running a 5k and a marathon,
very different energy budgets in running those).

> > That's the point and the ROOT CAUSE of the problem. Get someone on
> > your team that knows the subject area and do it right the first time.
[quoted text clipped - 8 lines]
>
> Be specific!

Doesn't sound like much of an expert. It's not like the issues of
international addresses are unknown.

Your address example is somewhat contrived since it almost purposely
left out known possibilities. There are different kinds of addresses
and each can be structured in different ways. Most of my experience is
in American addressing issues.

There's a postal address (likely doesn't include county in the USA,
and maybe not even anything equivalent to state/province in some small
countries). Then there's a legal address which is bound to include
county along with city and state (at least in the USA), but might use
map/page/plot number instead of the street address. Then there's you
physical address for things like the 911 system. In rural areas that
address might be nearly anything including just directions (turn right
of Rt 9, 5miles to farmhouse on left). And the geographic address
given by latitude and longtitude. Some systems may have to include
multiples of these forms and therefore be required to have multiple
address tables. A surogate (or three) becomes very useful in that case
(surprised to hear me say that?) but can be very tricky to maintain.

as to flexibility, make it so that adding additional columns should
not break existing applications, even if you are expanding primary
keys. Your city,state to city,county,state example can be done. In the
USA version of the application, which in the example is the existing
application, can ignore county. Inserting new rows would just require
a trigger to populate a default value for the county. Then the new
Russian version can use the county attribute as needed. So is that
specific enough? It allows the old application to continue working
while the new version is developed. Note the trigger might be smart in
that it could try to supply the correct county for each entry instead
of some fixed value (It could do this correctly for a large percentage
of USA city/states, but not all).

One final note: the combination city,state is not unique in the USA
either. At least if you are using unincorporated cities. I thought
there were two Springfield's in at least one state. Using full names
might be needed. For example I live in the City of Twinsburg, right
next to Twinsburg Township. Postal addresses look the same, but legal
addresses are VERY different. Oh, and watch out with including county.
The city of New York is in more than one county (5 of them?).

Just pointing out that it all goes back to your application
requirements. Surogates should not be banned, but neither should they
be the first choice (which happens far too often). The database serves
up data to the application. The possibility of change comment refers
to the combination, not just the datamodel.

Later,
 ed
Gene Wirchenko - 28 Jul 2004 00:30 GMT
[snip]

>There's a postal address (likely doesn't include county in the USA,

    There might not be a postal address.  I was going through
properties in an MLS computer system.  I found a lot in downtown
Vancouver, British Columbia, Canada that did not have a street
address.

>and maybe not even anything equivalent to state/province in some small
>countries). Then there's a legal address which is bound to include

    Singapore IIRC.

>county along with city and state (at least in the USA), but might use
>map/page/plot number instead of the street address. Then there's you

    That is *will* use in my experience.  I have never seen a legal
address that was a street address.  My experience is in British
Columbia, Canada.

>physical address for things like the 911 system. In rural areas that
>address might be nearly anything including just directions (turn right
>of Rt 9, 5miles to farmhouse on left). And the geographic address

    Nit: that is not an address but navigational instructions.

>given by latitude and longtitude. Some systems may have to include
>multiples of these forms and therefore be required to have multiple
>address tables. A surogate (or three) becomes very useful in that case
>(surprised to hear me say that?) but can be very tricky to maintain.

[snip]

>One final note: the combination city,state is not unique in the USA
>either. At least if you are using unincorporated cities. I thought

    I have found a few exceptions in Canada, too.

>there were two Springfield's in at least one state. Using full names
>might be needed. For example I live in the City of Twinsburg, right
[quoted text clipped - 5 lines]
>requirements. Surogates should not be banned, but neither should they
>be the first choice (which happens far too often). The database serves

    Agreed on that.

>up data to the application. The possibility of change comment refers
>to the combination, not just the datamodel.

Sincerely,

Gene Wirchenko

Computerese Irregular Verb Conjugation:
    I have preferences.
    You have biases.
    He/She has prejudices.
AK - 31 Jul 2004 23:54 GMT
> > "inflated" or not they are real. i created 2 schemas on the same
> > tablespaces and loaded same production data into both
>
> ... and what results? Your earlier post seemed to have figures that
> seemed too vague (50% IIRC).

that's an executive summary, the average over several dozen frequently
used queries

> but you leave out the needed address information table
> SUROGATE_ADDRESS(ADDRESS_ID, CITY, STATE, COUNTRY, ADDRESS)
> so to get all the info in the original, natural key table, you MUST do
> a join in the surogate case.

not nesessarily. there are 3 cases:
1. SELECT AMOUNT, PMT_DATE for one customer - no need for a join,
faster
2. SELECT AMOUNT, PMT_DATE, ADDRESS, CITY - joinn is nesessary only
for surrogate keys, slower
3. SELECT AMOUNT, PMT_DATE, CUSTOMER.SOME_OTHER_ATTRIBUTE - join is
neede in both cases, faster

so it's on average slower with surrogate keys only if situation 2
occurs more often

> I must admit I haven't worked with tables or databases in the terabyte
> range. Those systems can have a different kind of budget than most
> applications (like the difference between running a 5k and a marathon,
> very different energy budgets in running those).

one of the diffrences is you don't want to change database structure.
surrogate keys let you limit changes to 1 table only

> Doesn't sound like much of an expert. It's not like the issues of
> international addresses are unknown.

(snip)

> Just pointing out that it all goes back to your application
> requirements. Surogates should not be banned, but neither should they
> be the first choice (which happens far too often). The database serves
> up data to the application. The possibility of change comment refers
> to the combination, not just the datamodel.

thanks for a detailed answer. Still, if I have a big huge table
PAYMENT referring to ADDRESS, and if I later need to add a column to
the natural PK on ADRESS table, that's gonna be quite a problem. I
might want to avoid adding a column to a very large table and
rebuilding a foreign key, as I have other plans for my weekends. I
could choose to utilize surrogate PK just for that purpose. What do
you think?
AK - 07 Jul 2004 07:22 GMT
> I don't really understand why would you want to do something like this.
> If the user_login is unique, why not just make the user_login the
> primary key? Is this for efficiency reasons or am I missing something
> more fundamental?

Once upon a time the company I work for decided to change everybody's
logins.
We used to login as SMITHJ, which was changed to JSMITH. How are you
going to implement changing user_login if is the primary key? If the
primary key is surrogate, it is a snap.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.