Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General DB Topics / General DB Topics / October 2004

Tip: Looking for answers? Try searching our database.

[newbie Q] Mongo id resolution DB

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
KKramsch - 19 Oct 2004 19:40 GMT
 (I apologize beforehand for the cluelessness of this question.
Except for one grad-level course on relational databases a couple
of years ago, I have little experience with database design, and
certainly zero experience with *real world* database design.)

 In my field (computational genomics) one recurrent headache is
the multiplicity of ids widely used to identify genes.  For each
model organism (human, mouse, yeast, etc.) there are at least two,
but often 4 or 5 different identifier namespaces in widespread use.

 For our research group's internal work, we want to set up a
relational database to store all these gene "synonyms".  My question
has to do with the best design for such a database.  The two
considerations are 1) growth; and 2) performance.  (By "growth" I
mean that both the number of organisms represented in the database
as well as the number of namespaces supported for each organisms
are expected to grow over time.  Needless to say, the total number
of genes listed is also expected to grow.)

 There are two basic designs I can think of.  The first one, Design
#1, consists primarily of one HUGE table with the following fields:

 namespace_1_id : gene_id_1 : namespace_2_id : gene_id_2

 This design would require two auxiliary tables of the form

 namespace_id : organism_id : human-readable name (e.g. "Saccharomyces Genome Database accession number")

and

 organism_id : human-readable name (e.g. "Saccharomyces cerevisiae")

 The other basic design, Design #2, consists of *many* tables of the
form

 gene_id_1 gene_id_2

one such table for each *ordered* pair of distinct namespaces for
each organism.

 Design #1 is very simple, in that the structure of the database
would not need to change no matter how many organisms and namespaces
are added to it in the future.  On the downside, it is not normalized.

 The opposite is true, on both counts, for Design #2.

 Which approach would be better?  Assuming that they are both bad,
what would be a better design for such a database?

 Thanks!

    Karl

P.S. In case it matters, I should add that in most cases there is
NOT a 1-to-1 correspondence between any two namespaces for any
given organism.  For a variety of reason, it is not uncommon for
one id in one namespace to correspond to 0 or >1 ids in another
namespace.
Signature

Sent from a spam-bucket account; I check it once in a blue moon.  If
you still want to e-mail me, cut out the extension from my address,
and make the obvious substitutions on what's left.

Dennis Lee Bieber - 20 Oct 2004 07:30 GMT
On Tue, 19 Oct 2004 18:40:39 +0000 (UTC), KKramsch
<karlUNDERSCOREkramsch@yahooPERIODcom.invalid> declaimed the following
in comp.databases:

>   (I apologize beforehand for the cluelessness of this question.

    That's okay -- I'm probably clueless WRT the genetic data...

>   In my field (computational genomics) one recurrent headache is
> the multiplicity of ids widely used to identify genes.  For each
[quoted text clipped - 22 lines]
>
>   organism_id : human-readable name (e.g. "Saccharomyces cerevisiae")

    From what I can see so far, the hang-up is at the "gene ID"
level... (oh, and /what/ is that sucker? Some sugar eating critter? <G>)

    An initial cut...

ORGANISM
    ID    auto increment primary key
    Name    char

NAMESPACE
    ID    auto increment primary key
    org_ID    number foreign key references ORGANISM (ID)

GENOME
    ID    auto increment primary key
    Description    char    #something not specific to namespace --
                #maybe nucleotide <?> sequence letters?
                #something that can be used to
                #reidentify the entity -- if possible,
                #this should be capable of making a
                #UNIQUE index (no duplicates)

DICTIONARY
    ID    auto increment primary key    #yes, I tend to do these
                        #for all stuff
    Namespace_ID    number foreign key
                references NAMESPACE (ID)
    Genome_ID    number foreign key references GENOME (ID)
    Name    char    #the representation in the namespace
   

    You'll notice there is no "namespace:gene"<>"namespace:gene"
table. For any given gene sequence, there should be only one GENOME
record. The automatically assigned ID number for that record is used in
all "Dictionary" records -- doing something like (obviously untested):

select NAMESPACE.Name, DICTIONARY.Name, GENOME.Description from
DICTIONARY join NAMESPACE on DICTIONARY.Namespace_ID = NAMESPACE.ID join
GENOME on DICTIONARY.Genome_ID = GENOME.ID group by GENOME.Description;

could generate a listing of the synonyms.

> ============================================================== <
>   wlfraed@ix.netcom.com  | Wulfraed  Dennis Lee Bieber  KD6MOG <
>      wulfraed@dm.net     |       Bestiaria Support Staff       <
> ============================================================== <
>           Home Page: <http://www.dm.net/~wulfraed/>            <
>        Overflow Page: <http://wlfraed.home.netcom.com/>        <
Theo Peterbroers - 20 Oct 2004 11:43 GMT
<.....>
>   In my field (computational genomics) one recurrent headache is
> the multiplicity of ids widely used to identify genes.  For each
[quoted text clipped - 5 lines]
> has to do with the best design for such a database.  The two
> considerations are 1) growth; and 2) performance.

First consideration ought to be: can I extract from my database, in
a reasonably straightforward way, the information I need. I.e. you
should start not with the data you (seemingly) want to store, but with
its intended use.
Second consideration could be: if I'm going to add new data, what's the
most efficient way to make sure it is complete. You don't want to add
all possible synonyms within N namespaces, M genes and O organisms by
hand.

>  one HUGE table with the following fields:
>  namespace_1_id : gene_id_1 : namespace_2_id : gene_id_2

I.e. explicitly storing each and every synonym relationship. This
involves far too much redundancy
(NS1, GENE1) <-> (NS2, GENE1)
(NS1, GENE1) <-> (NS3, GENE1), and then you would have to add
(NS2, GENE1) <-> (NS3, GENE1), which is redundant.

> *many* tables of the form
>  gene_id_1 gene_id_2
>  one such table for each *ordered* pair of distinct namespaces for
>  each organism.

No, no, no. Searching for synonyms in N*(N-1)/2 tables is no joy.

> Which approach would be better?  Assuming that they are both bad,
> what would be a better design for such a database?

What do you consider synonyms?
- Two "different" genes coding for what is structurally the same
 protein, where different can be
 - "different gene in same namespace",
 - "same gene in different organism or namespace",
 - "same gene under a superseded name, same namespace",
 ... some more convolutions of namespace, gene and organism ...
- Two "different" genes coding for proteins having the same function
 ... some convolutions of namespace, gene and organism ...
- Two "different" genes having a partial overlap in their nucleotid
 order,
 ... some convolutions of namespace, gene and organism ...

> P.S. In case it matters, I should add that in most cases there is
> NOT a 1-to-1 correspondence between any two namespaces for any
> given organism.  For a variety of reason, it is not uncommon for
> one id in one namespace to correspond to 0 or >1 ids in another
> namespace.
Are you sure that the information you are interested in is "equivalence,
or partial equivalence, of identifiers"?
KKramsch - 20 Oct 2004 14:11 GMT
><.....>
>>   In my field (computational genomics) one recurrent headache is
[quoted text clipped - 6 lines]
>> has to do with the best design for such a database.  The two
>> considerations are 1) growth; and 2) performance.

>First consideration ought to be: can I extract from my database, in
>a reasonably straightforward way, the information I need. I.e. you
[quoted text clipped - 4 lines]
>all possible synonyms within N namespaces, M genes and O organisms by
>hand.

>>  one HUGE table with the following fields:
>>  namespace_1_id : gene_id_1 : namespace_2_id : gene_id_2

>I.e. explicitly storing each and every synonym relationship. This
>involves far too much redundancy
>(NS1, GENE1) <-> (NS2, GENE1)
>(NS1, GENE1) <-> (NS3, GENE1), and then you would have to add
>(NS2, GENE1) <-> (NS3, GENE1), which is redundant.

>> *many* tables of the form
>>  gene_id_1 gene_id_2
>>  one such table for each *ordered* pair of distinct namespaces for
>>  each organism.

>No, no, no. Searching for synonyms in N*(N-1)/2 tables is no joy.

>> Which approach would be better?  Assuming that they are both bad,
>> what would be a better design for such a database?
[quoted text clipped - 11 lines]
>  order,
>  ... some convolutions of namespace, gene and organism ...

 The short answer is: synonyms are whatever the existing (incomplete)
dicitionaries say are synonyms.  A slightly better answer, synonyms
*should* be different names for the *same* stretch of DNA in a
given chromosome.  However, deviations from this ideal are relatively
common, for a host of reasons.

 BTW, as I alluded to above, there are many published lists of
synonyms out there, but they invariably cover only one or a few
organisms, and for the organisms they cover they invariably miss
at least one important namespace.  Moreover, all these dictionaries
are designed to operate via a Web interface, in single query mode,
which makes them unsuitable for high-performance applications.

    Karl

Signature

Sent from a spam-bucket account; I check it once in a blue moon.  If
you still want to e-mail me, cut out the extension from my address,
and make the obvious substitutions on what's left.

Theo Peterbroers - 21 Oct 2004 15:31 GMT
> ><.....>
> >>   In my field (computational genomics) one recurrent headache is
[quoted text clipped - 62 lines]
>
>     Karl
Your problem is partly of an organizational nature. As you know
(should
know) in more oldfashioned fields there exist a Code of Botanical
Nomenclature and a Code of Zoological Nomenclature. Scientific
Congresses decide on those Codes and outstanding issues.
Ultimately, genomics will probably clean up its act, but this does not
help
you in the short run.

Given that I think
1) explicitly storing each and every individual synonym relationship
between
  genes/DNA stretches identified in different namespaces leads to
redundancy
  and is rather unmaintainable
2) creating a lot of tables for storing such relationships will make
your
  database very unfriendly to users,

I believe that we should relate identifiers in different namespaces
through
some sort of 'Least Common Multiple'. Pretty much like having a
'standard interface'. Rather than creating a specific link between
each synonym pair,
each identifier needs just one relationship to its 'standard
interface'.

Basically, I would say just two tables are needed.

Identifiers (Namespace, Gene)
Both columns make up the primary key. You fill this table from the
published dictionaries.

Synonyms (Namespace, Gene, XXX)
All three columns form the primary key, Namespace and Gene are a
foreign key
to Identifiers. XXX is the 'Least Common Multiple' or 'Standard
Interface',
a system generated item identifying synonyms (groups of identifiers
"Namespace, Gene" that you consider to be the same).

Two possible improvements to this design are adding a source or
authority
column to both tables, and introducing a shorthand abbreviation GeneID
for each Identifier.

On filling table Synonyms, each gene belongs, preferably, to just one
group
of synonyms XXX. As you add synonyms from several sources then, you
should
use the XXX which already exists. On the other hand, synonymy may be a
matter
of opinion and scientific debate, in which case a gene may end up
belonging
to this group or that, according to the authority consulted.
KKramsch - 21 Oct 2004 17:02 GMT
>> ><.....>
>> >>   In my field (computational genomics) one recurrent headache is
[quoted text clipped - 49 lines]
>> are designed to operate via a Web interface, in single query mode,
>> which makes them unsuitable for high-performance applications.

>I believe that we should relate identifiers in different namespaces
>through
[quoted text clipped - 3 lines]
>each identifier needs just one relationship to its 'standard
>interface'.

>Basically, I would say just two tables are needed.

>Identifiers (Namespace, Gene)
>Both columns make up the primary key. You fill this table from the
>published dictionaries.

>Synonyms (Namespace, Gene, XXX)
>All three columns form the primary key, Namespace and Gene are a
[quoted text clipped - 3 lines]
>a system generated item identifying synonyms (groups of identifiers
>"Namespace, Gene" that you consider to be the same).

 What exactly is the role/utility of the Identifiers table?  It
looks to me like that all the information in Identifiers is already
contained in Synonyms.

 Thanks,

    Karl

Signature

Sent from a spam-bucket account; I check it once in a blue moon.  If
you still want to e-mail me, cut out the extension from my address,
and make the obvious substitutions on what's left.

Theo Peterbroers - 22 Oct 2004 10:30 GMT
>  
> >I believe that we should relate identifiers in different namespaces
[quoted text clipped - 24 lines]
>
>   Thanks,
Identifiers contains all genes, regardless if they are also known under
a different name in another namespace.
Synonyms contains only those genes for which one or more synonyms exist.
KKramsch - 22 Oct 2004 13:03 GMT
>Identifiers contains all genes, regardless if they are also known under
>a different name in another namespace.
>Synonyms contains only those genes for which one or more synonyms exist.

 OK, I see, but since 99% of genes have synonyms (I'm sorry I
didn't make this point clearer), Synonyms contains practically all
the information in Identifiers.

    Karl
Signature

Sent from a spam-bucket account; I check it once in a blue moon.  If
you still want to e-mail me, cut out the extension from my address,
and make the obvious substitutions on what's left.

Theo Peterbroers - 25 Oct 2004 08:21 GMT
> >Identifiers contains all genes, regardless if they are also known under
> >a different name in another namespace.
[quoted text clipped - 5 lines]
>
>     Karl

And now I see your point.
It seems you can get by with just one table:
(Namespace, Gene, Synonym)
- All three columns are primary key.
- Genes that are related share the same value in Synonym,
- I now tend to think of synonym as 'belonging to one family',
 since there are also 1:n and inexact synonyms.
- Some genes do not have any 'family' in other namespaces <->
 their synonym (a system assigned meaningless ID code) occurs
 just once.
- If you want to express something of a hierarchy between genes,
 like with the 1:n cases where the n-side may have synonyms on
 their level, this can be done by encoding it in one or two
 synonym columns. Please search this newsgroup for 'Trees', 'SQL'
 and 'Celko'.
KKramsch - 26 Oct 2004 18:52 GMT
>> >Identifiers contains all genes, regardless if they are also known under
>> >a different name in another namespace.
[quoted text clipped - 3 lines]
>> didn't make this point clearer), Synonyms contains practically all
>> the information in Identifiers.

>And now I see your point.
>It seems you can get by with just one table:
[quoted text clipped - 6 lines]
>  their synonym (a system assigned meaningless ID code) occurs
>  just once.

 I was leaning towards this solution, but this post confirms my
understanding.  Many thanks!  (And many thanks to all who replied!)

    Karl
Signature

Sent from a spam-bucket account; I check it once in a blue moon.  If
you still want to e-mail me, cut out the extension from my address,
and make the obvious substitutions on what's left.

KKramsch - 22 Oct 2004 13:05 GMT
 Please ignore my last post.  I think I see your point.

    Karl
Signature

Sent from a spam-bucket account; I check it once in a blue moon.  If
you still want to e-mail me, cut out the extension from my address,
and make the obvious substitutions on what's left.

Dennis Lee Bieber - 22 Oct 2004 16:39 GMT
> Identifiers contains all genes, regardless if they are also known under
> a different name in another namespace.
> Synonyms contains only those genes for which one or more synonyms exist.

    Still looks redundant to me -- since a singleton "synonym" can
be told apart by the mere fact that a count of the synonym "ID" column
value would come back as "1" for entries that don't have synonyms.

> ============================================================== <
>   wlfraed@ix.netcom.com  | Wulfraed  Dennis Lee Bieber  KD6MOG <
>      wulfraed@dm.net     |       Bestiaria Support Staff       <
> ============================================================== <
>           Home Page: <http://www.dm.net/~wulfraed/>            <
>        Overflow Page: <http://wlfraed.home.netcom.com/>        <
Laconic2 - 20 Oct 2004 14:41 GMT
>   (I apologize beforehand for the cluelessness of this question.
> Except for one grad-level course on relational databases a couple
> of years ago, I have little experience with database design, and
> certainly zero experience with *real world* database design.)

I apologize beforehand for the vagueness of this answer.  I know as little
about genomes as you know about databases.

>   In my field (computational genomics) one recurrent headache is
> the multiplicity of ids widely used to identify genes.  For each
> model organism (human, mouse, yeast, etc.) there are at least two,
> but often 4 or 5 different identifier namespaces in widespread use.

There are two problems which pervade the field of data management:  the
synonym problem and the homonym problem.  It doesn't matter whether you are
tallying votes,  keeping an inventory of parts for sale, booking airline
reservations,  or cataloguing the genome,  the same problems come up.  In
addition to which, a great many problems come up that are unique to each
field of study or each enterprise.  Data management includes database
management and a whole lot more.

The size of the synonym/homonym problem varies with the size of the
namespace being managed.  I don't know,  but I imagine the genome namespace
to be mind numbingly large.

Just for comparison purposes,  imagine that the FBI receives an image of a
fingerprint,  and is asked whether it matches any fingerprint on file.
Unless they have a smooth way of eliminating 99.9% of their filed
fingerprints from consideration,  the task will be too monumental to even
consider.  They do have such a way.

What the heck do fingerprints have to do with synonyms?  Well, two
fingerprints might be considered "synonyms" if they look very different,
superficially,  but they were prints made by the same finger.

>   There are two basic designs I can think of.  The first one, Design
> #1, consists primarily of one HUGE table with the following fields:

I'm not sure this applies here,  but I  recommend you do a web search on
"the one true lookup table".  This topic has to do with  whether it's better
to store codes of different types in the same column of the same great big
table,  with a second column to disambiguate homonyms,  or whether it's
better to have a separate table for each code type.

With regard to performance,  my epxerience tells me that OTLT is a terrible
idea.  With regard to growth,  it permits new code types to be invented
without performing any DDL.  This is good from the point of view of the
programmers, who just regard data mangement as bureaucracy anyway.  It's a
nightmare from the point of view of the future "data archeologist".

> one such table for each *ordered* pair of distinct namespaces for
> each organism.

I'd like to find out what makes distinct namespaces distinct.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.