to pierce its unholy heart.
I keep getting baited into this thing in CubeLand, and my colleagues do
have something of a point; to wit, if Users are to be empowered to take
responsibility for their data, then they have to be able to maintain the
Allowable Values (the term du jour in CubeLand) for their Fields (old
VSAM folk, my colleagues). I've read a bit here and there that the
"industry" is taken to questioning letting User do this, but my corner
still buys the notion.
I take the position that these are ultimately (check) constraints on
columns somewhere in the tables. So, I've been looking around for some
utilities that do the two operations necessary to quiet my colleagues:
1) a language processor (or language neutral one) that will spit out
html/javascript for screen generation from the catalog on the read side
of things.
and
2) a utility which will allow constraints to be updated by civilians in
a putatively secure way on the write side of things.
I've found nothing on 2), and for 1); Oracle Forms, Rapid SQL (maybe,
not sure), and Hibernated Validator (ditto).
OF is not feasible, given that most of the databases in question live in
z/OS v8, the rest in LUW. Likely ditto for the other two, if they do
what OF does.
Which leaves the question: has anyone seen such? I didn't find any
responses from ibm.com when I went spelunking.
This is from my book THINKING IN SETS"
========================
04.04. OTLT or MUCK Table Problems
I think that Paul Keister was the first person to coin the phrase
"OTLT" (One True Look-up Table) for a common SQL programming technique
that is popular with Newbies. Don Peterson (www.SQLServerCentral.com)
gave the same technique the name "Massively Unified Code-Key" or MUCK
tables in one of his articles.
The technique crops up time and time again, but I will give Paul
Keister credit as the first writer to give it a name. Simply put, the
idea is to have one table to do all of the code look-ups in the
schema. It usually looks like this:
CREATE TABLE Look-ups
(code_type CHAR(10) NOT NULL,
code_value VARCHAR(255) NOT NULL, -- notice size!
code_description VARCHAR(255) NOT NULL, -- notice size!
PRIMARY KEY (code_value, code_type));
So if we have Dewey Decimal Classification (library codes), ICD
(International Classification of Diseases), and two-letter ISO-3166
Country Codes in the schema, we have them all in one, honking big
table.
Let us start with the problems in the DDL and then look at the awful
queries you have to write (or hide in VIEWs). So we need to go back
to the original DDL and add a CHECK() constraint on the "code_type"
column. Otherwise, we might "invent" a new encoding system by
typographical error.
Notice that we are already in trouble because a data element cannot be
both a "<something>_code" and a "<something>_type"; it must be one or
the other and it must be the code or type of some specific attribute.
One of the nice features of the ISO-11179 rules is that they prevent
this mixing of data and meta-data in a way that can be checked
mechanically. Ignore this comment and continue heading for the edge
of the cliff.
The Dewey Decimal and ICD codes are digits and have the same format --
three digits, a decimal point and more digits (usually three); the
ISO-3166 Country Codes are alphabetic. Oops, we now need another
CHECK constraint that will look at the "code_type" and make sure that
the string is in the right format. Now the table looks something like
this, if anyone attempted to do it right, which is not usually the
case:
CREATE TABLE OTLT
(code_type CHAR(10) NOT NULL
CHECK("code_type" IN ('DDC','ICD','ISO3166', ..),
code_value VARCHAR(255) NOT NULL,
CHECK
(CASE code_type
WHEN 'DDC'
AND code_value
SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
THEN 1
WHEN 'ICD'
AND code_value
SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
THEN 1
WHEN 'ISO3166'
AND code_value SIMILAR TO '[A-Z][A-Z]'
THEN 1 ELSE 0 END = 1),
code_description VARCHAR(255) NOT NULL,
PRIMARY KEY (code_value, code_type));
The "SIMILAR TO" predicate is the SQL-92 version of a regular
expression parser based on the POSIX Standards, if you are not
familiar with it. Since the typical application database can have
dozens and dozens of codes in it, just keep extending this pattern for
as long as required. Not very pretty is it? In fact, there is a good
chance that you might exceed the number of WHEN clauses allowed in a
CASE expression in a major corporation. That is why most OTLT
programmers do not bother with this absolutely vital constraint.
Now let us consider adding new rows to the OTLT.
INSERT INTO OTLT (code_type, code_value, code_description)
VALUES
('ICD', 259.0, 'Inadequate Genitalia after Puberty');
and also
INSERT INTO OTLT (code_type, code_value, code_description)
VALUES
('DDC', 259.0, 'Christian Pastoral Practices & Religious Orders');
If you make an error in the "code_type" during insert, update or
delete, you have screwed up a totally unrelated value. If you make an
error in the "code_type" during a query, the results could be
interesting.
This can be really hard to find when one of the similarly structured
schemes had unused codes in it.
The next thing you notice about this table is that the columns are
pretty wide VARCHAR(n), or even worse, that they are NVARCHAR(n) which
can store characters from a strange language. The value of (n) is
most often the largest one allowed in that particular SQL product.
Since you have no idea what is going to be shoved into the table,
there is no way to predict and design with a safe, reasonable maximum
size. The size constraint has to be put into the WHEN clause of that
second CHECK() constraint between "code_type" and "code_value". Or
you can live with fixed length codes that are longer (or fatally
shorter) than what they should be.
These large sizes tend to invite bad data. You give someone a
VARCHAR(n) column, and you eventually get a string with a lot of white
space and a small odd character sitting at the end of it. You give
someone an NVARCHAR(255) column and eventually it will get a Buddhist
sutra in Chinese Unicode.
Now let's consider the problems with actually using the OTLT in a
query. It is always necessary to add the "code_type" as well as the
value which you are trying to look-up.
SELECT P1.ssn, P1.lastname, .., L1.code_description
FROM OTLT AS L1, Personnel AS P1
WHERE L1."code_type" = 'ICD'
AND L1.code_value = P1.disease_code
AND ..;
In this sample query, you need to know the "code_type" of the
Personnel table disease_code column and of every other encoded column
in the table. If you got a "code_type" wrong, you can still get a
result.
You also need to allow for some overhead for data type conversions.
It might be more natural to use numeric values instead of VARCHAR(n)
for some encodings to ensure a proper sorting order. Padding a string
of digits with leading zeros adds overhead and can be risky if
programmers do not agree on how many zeros to use.
When you execute a query, the SQL engine has to pull in the entire
look-up table, even if it only uses a few codes. If one code is at
the start of the physical storage, and another is at the end of
physical storage, I can do a lot of caching and paging. When I update
the OTLT table, I have to lock out everyone until I am finished. It
is like having to carry an encyclopedia set with you when all you
needed was a magazine article.
Now consider the overhead with a two-part FOREIGN KEY in a table:
CREATE TABLE EmployeeAbsences
(..
"code_type" CHAR(3) -- min length needed
DEFAULT 'ICD' NOT NULL
CHECK ("code_type" = 'ICD'),
code_value CHAR(7) NOT NULL, -- min length needed
FOREIGN KEY ("code_type", code_value)
REFERENCES OTLT ("code_type", code_value),
..);
Now I have to convert the character types for more overhead. Even
worse, ICD has a natural DEFAULT value (000.000 means "undiagnosed"),
while Dewey Decimal does not. Older encoding schemes often used all
9's for "miscellaneous" so they would sort to the end of the reports
in COBOL programs. Just as there is no Magical Universal "id", there
is no Magical Universal DEFAULT value. I just lost one of the most
important features of SQL!
I am going to venture a guess that this idea came from OO programmers
who think of it as some kind of polymorphism done in SQL. They say to
themselves that a table is a class, which it is not, and therefore it
ought to have polymorphic behaviors, which it does not.
04.05. Definition of a Proper Table
There are good reasons for the data modeling principle that a well-
designed table is a set of things of the same kind instead of a pile
of unrelated items.
At one extreme, we have the "attribute split" tables and on the other
the extreme conglomerated tables. When I post in Newsgroups I refer
to a "Britney Spears, Squids and Automobiles" procedure and table for
years -- attempts to make one table or procedures serve as many
purposes as possible. What is funny about this that there is a
"Britney Spears or Squid" website (http://scienceblogs.com/deepseanews/
2007/02/weekend_foolishness.php) posted after she shaved her head in
2007.
MeBuggyYouJane - 04 Apr 2008 02:08 GMT
> This is from my book THINKING IN SETS"
>
[quoted text clipped - 182 lines]
> 2007/02/weekend_foolishness.php) posted after she shaved her head in
> 2007.
I was the first on my block to own the book. Rather a good one,
actually. But I didn't see any references to the utilities I'm seeking.
Not that I don't want to write them; I just have no patience with
string parsing. The OF approach is about what I would expect. It
puzzles me that there hasn't been even a commercial product. Oh well.
--CELKO-- - 04 Apr 2008 22:05 GMT
>>I was the first on my block to own the book. Rather a good one,actually. <<
My mortgage and my ego both thank you.
>> But I didn't see any references to the utilities I'm seeking. Not that I don't want to write them; I just have no patience with string parsing. The OF approach is about what I would expect. It puzzles me that there hasn't been even a commercial product. <<
Since every encoding scheme will have a different set of rules for its
validation, that would be pretty hard. I tell people to go to a
Regular Expression code website and Google around for whatever they
are using -- VIN, ISBN, ISAN, etc. and then cut& past it into the
local dialect version of SIMILAR TO.
Doing a check digit is pretty easy until you get to the dihedral
five. Substring the digits, cast as INTEGER, multiple by the weights,
do the MOD() and see if this expression matches the last digit cast to
INTEGER. A bit long, but easy to code and maintain.