Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General DB Topics / DB Theory / September 2007

Tip: Looking for answers? Try searching our database.

Stochastic Queries

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Joe Thurbon - 10 Sep 2007 07:21 GMT
As a part of some experimental work I've been doing, I need to write
what I think is best described as a stochastic quota query.

I would like to write a query like

SELECT RANDOM 5 COLUMN_FOO FROM TABLE_BAR

where RANDOM is like TOP, except that for successive queries, I get a
(possibly) different set of 5 COLUMN_FOO values from table TABLE_BAR.

Has anyone seen such a beast?

Thanks in advance,
Joe
Jason Lepack - 10 Sep 2007 13:26 GMT
It depends... I don't know of any "good" methods using ANSI SQL, but
there are methods that can be used depending on what type of SQL you
are using.

Try google... I found all of these solutions by typing into Google:
"language-name" random records

In T-SQL there's the newid() method:
http://www.sqlteam.com/article/using-newid-to-randomly-sort-records

Oracle has sample():
http://www.adp-gmbh.ch/ora/sql/examples/select_sample.html

MySQL has rand():
http://webxadmin.free.fr/article/mysql-random-records-170.php

Cheers,
Jason Lepack

> As a part of some experimental work I've been doing, I need to write
> what I think is best described as a stochastic quota query.
[quoted text clipped - 10 lines]
> Thanks in advance,
> Joe
Joe Thurbon - 10 Sep 2007 21:58 GMT
> It depends... I don't know of any "good" methods using ANSI SQL, but
> there are methods that can be used depending on what type of SQL you
[quoted text clipped - 14 lines]
> Cheers,
> Jason Lepack

Thanks Jason,

I'll look into it.

Cheers,
Joe
-CELKO- - 10 Sep 2007 17:45 GMT
>> As a part of some experimental work I've been doing, I need to write  what I think is best described as a stochastic quota query. <<

You can write some proprietary kludges, but frankly they have problems
with skewed distributions.  Better to stick to a stats package to draw
your samples (with or without replacements? Tiered? etc.?)  There was
a good book years ago with a title somethign like a SAMPLER OF
SAMPLING that went into the details.

One trick I have used is to build a table:

CREATE TABLE Samples
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
random_nbr INTEGER NOT NULL);  -- or whatever data type you need

then load it with the particular random numbers I wanrted from a stat
package.  More work, but MUCH better results.
Joe Thurbon - 10 Sep 2007 22:01 GMT
>>> As a part of some experimental work I've been doing, I need to write  what I think is best described as a stochastic quota query. <<
>
[quoted text clipped - 12 lines]
> then load it with the particular random numbers I wanrted from a stat
> package.  More work, but MUCH better results.

Thanks -CELKO-,

I'll have a think about it.

Cheers,
Joe
DBMS_Plumber - 19 Sep 2007 21:21 GMT
> You can write some proprietary kludges, but frankly they have problems
> with skewed distributions.
...

> then load it with the particular random numbers I wanrted from a stat
> package.  More work, but MUCH better results.

In the first place, this does not answer the question the OP asked.

My advice to the OP would be to write a Java app that generated these
queries. It's been done before. I suggest the OP read "Massive
Stochastic Testing of SQL" by Don Slutz of Microsoft. To my certain
knowledge at least two of the major DBMS vendors have a similar
testing facility. They're not a panacea, but they're useful. For extra
points you might want to include joins, unions and what-not in your
range of query features.

But the the question Joe answers, I am VERY interested in
understanding why these 'proprietery kludges' (either to seelct random
samples, or to generate random numbers) 'have problems' with skewed
distributions. (Disclosure - I worked on one.)

These 'proprietary kludges' sweat a lot of details. Joe's solution
here is wrong. Repeated use of the same set of random numbers will
generate an identical sample each time it's used. From a practical
point of view, you would be required to generate a new set of random
values for each query. I am also curious to know how you use this
table to restrict the rows being returned in the general case to any
kind of uniform random sample.
DBMS_Plumber - 19 Sep 2007 21:24 GMT
On Sep 19, 1:21 pm, DBMS_Plumber <paul_geoffrey_br...@yahoo.com>
wrote:
> In the first place, this does not answer the question the OP asked.

Upon further review, I am incorrect. Joe's post does attempt to
answer the OP's question. I believed the OP wanted to select a
different set of columns for each query.

My criticism of Joe's response stands, however.
-CELKO- - 21 Sep 2007 00:01 GMT
>> I suggest the OP read "Massive Stochastic Testing of SQL" by Don Slutz of Microsoft. <<

I need to get a copy, too.

You might want to look at a current article by Ben Gan on the use of
RAND() and NEWID() and differences in SQL Server 2000 and 2005. He
covers the problems with how a CASE expression uses RAND(),
deterministic functions, duplicate values, etc.

>>  I am VERY interested in understanding why these 'proprietary kludges' (either to select random samples, or to generate random numbers) 'have problems' with skewed distributions. (Disclosure - I worked on one.) <<

The SQL products I have worked with use traditional Linear
Congruential algorithms, which they inherited from C and UNIX.  As you
get larger and larger samples, you get skewing and duplicate values.
Knuth Vol #2, Chapter 3 has a good history and some remarks about the
history.

The best (worst?) horror story in the 1970's was the discovery that an
IBM FORTRAN  routine was not valid.  It trashed quite a few PhD
projects.  That was the most popular tool in those days for
research.

>> Repeated use of the same set of random numbers will generate an identical sample each time it's used. <<

Not if the population changes each time. But I understand your point.

However, consider that one of the advantages of RNG is that you can
repeat  an experiment.  I worked with some lab equipment on an old DEC
PDP-11 more decades ago than I really like to remember with a special
circuit card.  This thing had a speck of radioactive material and a
simple Geiger counter tube to create **quantum level** random digits.
Of course people did not seeing that black & yellow radiation symbol
on equipment in those days (Cold War Era) so I am not sure if it is
still available. You can probably use background radiation or radio
noise today with sensitive equipment.

But if I want a fixed table, I think most statisticians would agree
that the RAND corporation "Table of One Million Random Digits" has
been tested in every possible way for mathematical correctness.  That
is a fixed table available on diskette!

>> From a practical point of view, you would be required to generate a new set of random values for each query. I am also curious to know how you use this table to restrict the rows being returned in the general case to any kind of uniform random sample. <<

I start at the front of the RAND Corporation table and pull digits
until I get to the end.  I can go to some university sites and get
bigger validated tables, too.
DBMS_Plumber - 21 Sep 2007 00:49 GMT
> >> I suggest the OP read "Massive Stochastic Testing of SQL" by Don Slutz of Microsoft. <<
>
> I need to get a copy, too.

It doesn't talk about this subject. Only about generating lots of
independent SQL queries to test the robustness of the SQL processor.
It's fascinating - but not apropos.

> You might want to look at a current article by Ben Gan on the use of
> RAND() and NEWID() and differences in SQL Server 2000 and 2005. He
> covers the problems with how a CASE expression uses RAND(),
> deterministic functions, duplicate values, etc.

Familiar with it.

As an aside, Postgres had random number generation functions in 1993.
In fact, that system developed an entire class of optimizer
modifications to cope with the general class of 'non deterministic'
functions, which includes 'rand()', but just as easily might apply to
'has_the_missile_been_fired_yet()'. It was pleasing to see SQL Server
arrive at the same conclusions 15 years later. <snark off>

> >>  I am VERY interested in understanding why these 'proprietary kludges' (either to select random samples, or to generate random numbers) 'have problems' with skewed distributions. (Disclosure - I worked on one.) <<
>
[quoted text clipped - 8 lines]
> projects.  That was the most popular tool in those days for
> research.

  I'm intimately familiar with this literature. There's also an
excellent discussion in the more recent _Numerical_Recipes_ books.

  But how do you know what RNG these products use? Did you see their
source code?

  Writing a very good RNG is not hard. It's 10% scholarship, 10% code
effort, and 80% testing.

[ snip ]

> However, consider that one of the advantages of RNG is that you can
> repeat  an experiment.

  There are a large number of RNG algorithms which, when handed the
same seed, generate the same sequence of values, making it possible to
use of the same list of random values as often as one would like.

[ snip ]

> But if I want a fixed table, I think most statisticians would agree
> that the RAND corporation "Table of One Million Random Digits" has
> been tested in every possible way for mathematical correctness.  That
> is a fixed table available on diskette!

And is the same set of numbers used repeatedly. Which means
subsequent samples derived using the same table of values -- or
samples derived from other sources using the same table of values --
are no longer independent.

[ snip ]

 A raft of other practical problems present themselves with the use
of a static table. I'll describe one of them.

 To use that table of random numbers, you need to plug it into a
query somehow. This means either a) modifying the table to add a
column containing a random value, or b) writing a join. Now the
problem with either approach is that a query processor--not being
aware of the random nature of what is going on--is going to move the
restriction or the join operation to wherever the physical planner
sees fit to place it.

 Confronted with these problems -- customers wanting bernoulli
samples, for instance, but getting instead samples whose size
distribution was all over the shop -- DBMS vendors added random number
generators and sampling algorithms to their products in very deep
ways. It's not just the quality of the RNG that counts. The whole
query processor needs to change.

 And even then, users wonder why the sampling query is no faster than
the original. So they added yet another layer of complexity to make it
go fast, as well as generate "statistically rigorous" samples. (Note:
They're not. But they're usually good enough for government work.)

 In a nutshell, DBMS vendors (well - the one or two I am most
familiar with) added random() functions and sampling with enormous
care. They deal with a small number of very, very sophisticated
customers who subject these products to tremendous amounts of
validation. If these customers are disappointed, they get upset. (If
they're happy they never say so. Such is life.)
Joe Thurbon - 26 Sep 2007 00:11 GMT
>> You can write some proprietary kludges, but frankly they have problems
>> with skewed distributions.
[quoted text clipped - 7 lines]
> My advice to the OP would be to write a Java app that generated these
> queries.

Thanks for the advice.

I think that the best one can do is (and its what I'm currently doing):

(1) Devise a query which will return all of the keys for rows from which
    you wish to select randomly (let's call them the target rows)
(2) Permute the returned keys (in Java, say)
(3) Iteratively select, key by key, the target rows corresponding to the
    first N keys in the permutation. (or generate one query with N
    disjuncts in the where clause).

When the target rows are the result of a fairly complicated join, this
is pretty cumbersome (and wildly slow, from my experience). But it seems
that this is what I'm stuck with. Not to worry.

[snip]

Cheers,
Joe
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.