...
On Sep 19, 1:21 pm, DBMS_Plumber <paul_geoffrey_br...@yahoo.com>
wrote:
> In the first place, this does not answer the question the OP asked.
Upon further review, I am incorrect. Joe's post does attempt to
answer the OP's question. I believed the OP wanted to select a
different set of columns for each query.
My criticism of Joe's response stands, however.
>> I suggest the OP read "Massive Stochastic Testing of SQL" by Don Slutz of Microsoft. <<
I need to get a copy, too.
You might want to look at a current article by Ben Gan on the use of
RAND() and NEWID() and differences in SQL Server 2000 and 2005. He
covers the problems with how a CASE expression uses RAND(),
deterministic functions, duplicate values, etc.
>> I am VERY interested in understanding why these 'proprietary kludges' (either to select random samples, or to generate random numbers) 'have problems' with skewed distributions. (Disclosure - I worked on one.) <<
The SQL products I have worked with use traditional Linear
Congruential algorithms, which they inherited from C and UNIX. As you
get larger and larger samples, you get skewing and duplicate values.
Knuth Vol #2, Chapter 3 has a good history and some remarks about the
history.
The best (worst?) horror story in the 1970's was the discovery that an
IBM FORTRAN routine was not valid. It trashed quite a few PhD
projects. That was the most popular tool in those days for
research.
>> Repeated use of the same set of random numbers will generate an identical sample each time it's used. <<
Not if the population changes each time. But I understand your point.
However, consider that one of the advantages of RNG is that you can
repeat an experiment. I worked with some lab equipment on an old DEC
PDP-11 more decades ago than I really like to remember with a special
circuit card. This thing had a speck of radioactive material and a
simple Geiger counter tube to create **quantum level** random digits.
Of course people did not seeing that black & yellow radiation symbol
on equipment in those days (Cold War Era) so I am not sure if it is
still available. You can probably use background radiation or radio
noise today with sensitive equipment.
But if I want a fixed table, I think most statisticians would agree
that the RAND corporation "Table of One Million Random Digits" has
been tested in every possible way for mathematical correctness. That
is a fixed table available on diskette!
>> From a practical point of view, you would be required to generate a new set of random values for each query. I am also curious to know how you use this table to restrict the rows being returned in the general case to any kind of uniform random sample. <<
I start at the front of the RAND Corporation table and pull digits
until I get to the end. I can go to some university sites and get
bigger validated tables, too.
DBMS_Plumber - 21 Sep 2007 00:49 GMT
> >> I suggest the OP read "Massive Stochastic Testing of SQL" by Don Slutz of Microsoft. <<
>
> I need to get a copy, too.
It doesn't talk about this subject. Only about generating lots of
independent SQL queries to test the robustness of the SQL processor.
It's fascinating - but not apropos.
> You might want to look at a current article by Ben Gan on the use of
> RAND() and NEWID() and differences in SQL Server 2000 and 2005. He
> covers the problems with how a CASE expression uses RAND(),
> deterministic functions, duplicate values, etc.
Familiar with it.
As an aside, Postgres had random number generation functions in 1993.
In fact, that system developed an entire class of optimizer
modifications to cope with the general class of 'non deterministic'
functions, which includes 'rand()', but just as easily might apply to
'has_the_missile_been_fired_yet()'. It was pleasing to see SQL Server
arrive at the same conclusions 15 years later. <snark off>
> >> I am VERY interested in understanding why these 'proprietary kludges' (either to select random samples, or to generate random numbers) 'have problems' with skewed distributions. (Disclosure - I worked on one.) <<
>
[quoted text clipped - 8 lines]
> projects. That was the most popular tool in those days for
> research.
I'm intimately familiar with this literature. There's also an
excellent discussion in the more recent _Numerical_Recipes_ books.
But how do you know what RNG these products use? Did you see their
source code?
Writing a very good RNG is not hard. It's 10% scholarship, 10% code
effort, and 80% testing.
[ snip ]
> However, consider that one of the advantages of RNG is that you can
> repeat an experiment.
There are a large number of RNG algorithms which, when handed the
same seed, generate the same sequence of values, making it possible to
use of the same list of random values as often as one would like.
[ snip ]
> But if I want a fixed table, I think most statisticians would agree
> that the RAND corporation "Table of One Million Random Digits" has
> been tested in every possible way for mathematical correctness. That
> is a fixed table available on diskette!
And is the same set of numbers used repeatedly. Which means
subsequent samples derived using the same table of values -- or
samples derived from other sources using the same table of values --
are no longer independent.
[ snip ]
A raft of other practical problems present themselves with the use
of a static table. I'll describe one of them.
To use that table of random numbers, you need to plug it into a
query somehow. This means either a) modifying the table to add a
column containing a random value, or b) writing a join. Now the
problem with either approach is that a query processor--not being
aware of the random nature of what is going on--is going to move the
restriction or the join operation to wherever the physical planner
sees fit to place it.
Confronted with these problems -- customers wanting bernoulli
samples, for instance, but getting instead samples whose size
distribution was all over the shop -- DBMS vendors added random number
generators and sampling algorithms to their products in very deep
ways. It's not just the quality of the RNG that counts. The whole
query processor needs to change.
And even then, users wonder why the sampling query is no faster than
the original. So they added yet another layer of complexity to make it
go fast, as well as generate "statistically rigorous" samples. (Note:
They're not. But they're usually good enough for government work.)
In a nutshell, DBMS vendors (well - the one or two I am most
familiar with) added random() functions and sampling with enormous
care. They deal with a small number of very, very sophisticated
customers who subject these products to tremendous amounts of
validation. If these customers are disappointed, they get upset. (If
they're happy they never say so. Such is life.)
>> You can write some proprietary kludges, but frankly they have problems
>> with skewed distributions.
[quoted text clipped - 7 lines]
> My advice to the OP would be to write a Java app that generated these
> queries.
Thanks for the advice.
I think that the best one can do is (and its what I'm currently doing):
(1) Devise a query which will return all of the keys for rows from which
you wish to select randomly (let's call them the target rows)
(2) Permute the returned keys (in Java, say)
(3) Iteratively select, key by key, the target rows corresponding to the
first N keys in the permutation. (or generate one query with N
disjuncts in the where clause).
When the target rows are the result of a fairly complicated join, this
is pretty cumbersome (and wildly slow, from my experience). But it seems
that this is what I'm stuck with. Not to worry.
[snip]
Cheers,
Joe