Database Forum / General DB Topics / General DB Topics / October 2004
[newbie Q] Mongo id resolution DB
|
|
Thread rating:  |
KKramsch - 19 Oct 2004 19:40 GMT (I apologize beforehand for the cluelessness of this question. Except for one grad-level course on relational databases a couple of years ago, I have little experience with database design, and certainly zero experience with *real world* database design.)
In my field (computational genomics) one recurrent headache is the multiplicity of ids widely used to identify genes. For each model organism (human, mouse, yeast, etc.) there are at least two, but often 4 or 5 different identifier namespaces in widespread use.
For our research group's internal work, we want to set up a relational database to store all these gene "synonyms". My question has to do with the best design for such a database. The two considerations are 1) growth; and 2) performance. (By "growth" I mean that both the number of organisms represented in the database as well as the number of namespaces supported for each organisms are expected to grow over time. Needless to say, the total number of genes listed is also expected to grow.)
There are two basic designs I can think of. The first one, Design #1, consists primarily of one HUGE table with the following fields:
namespace_1_id : gene_id_1 : namespace_2_id : gene_id_2
This design would require two auxiliary tables of the form
namespace_id : organism_id : human-readable name (e.g. "Saccharomyces Genome Database accession number")
and
organism_id : human-readable name (e.g. "Saccharomyces cerevisiae")
The other basic design, Design #2, consists of *many* tables of the form
gene_id_1 gene_id_2
one such table for each *ordered* pair of distinct namespaces for each organism.
Design #1 is very simple, in that the structure of the database would not need to change no matter how many organisms and namespaces are added to it in the future. On the downside, it is not normalized.
The opposite is true, on both counts, for Design #2.
Which approach would be better? Assuming that they are both bad, what would be a better design for such a database?
Thanks!
Karl
P.S. In case it matters, I should add that in most cases there is NOT a 1-to-1 correspondence between any two namespaces for any given organism. For a variety of reason, it is not uncommon for one id in one namespace to correspond to 0 or >1 ids in another namespace.
 Signature Sent from a spam-bucket account; I check it once in a blue moon. If you still want to e-mail me, cut out the extension from my address, and make the obvious substitutions on what's left.
Dennis Lee Bieber - 20 Oct 2004 07:30 GMT On Tue, 19 Oct 2004 18:40:39 +0000 (UTC), KKramsch <karlUNDERSCOREkramsch@yahooPERIODcom.invalid> declaimed the following in comp.databases:
> (I apologize beforehand for the cluelessness of this question. That's okay -- I'm probably clueless WRT the genetic data...
> In my field (computational genomics) one recurrent headache is > the multiplicity of ids widely used to identify genes. For each [quoted text clipped - 22 lines] > > organism_id : human-readable name (e.g. "Saccharomyces cerevisiae") From what I can see so far, the hang-up is at the "gene ID" level... (oh, and /what/ is that sucker? Some sugar eating critter? <G>)
An initial cut...
ORGANISM ID auto increment primary key Name char
NAMESPACE ID auto increment primary key org_ID number foreign key references ORGANISM (ID)
GENOME ID auto increment primary key Description char #something not specific to namespace -- #maybe nucleotide <?> sequence letters? #something that can be used to #reidentify the entity -- if possible, #this should be capable of making a #UNIQUE index (no duplicates)
DICTIONARY ID auto increment primary key #yes, I tend to do these #for all stuff Namespace_ID number foreign key references NAMESPACE (ID) Genome_ID number foreign key references GENOME (ID) Name char #the representation in the namespace
You'll notice there is no "namespace:gene"<>"namespace:gene" table. For any given gene sequence, there should be only one GENOME record. The automatically assigned ID number for that record is used in all "Dictionary" records -- doing something like (obviously untested):
select NAMESPACE.Name, DICTIONARY.Name, GENOME.Description from DICTIONARY join NAMESPACE on DICTIONARY.Namespace_ID = NAMESPACE.ID join GENOME on DICTIONARY.Genome_ID = GENOME.ID group by GENOME.Description;
could generate a listing of the synonyms.
> ============================================================== < > wlfraed@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG < > wulfraed@dm.net | Bestiaria Support Staff < > ============================================================== < > Home Page: <http://www.dm.net/~wulfraed/> < > Overflow Page: <http://wlfraed.home.netcom.com/> < Theo Peterbroers - 20 Oct 2004 11:43 GMT <.....>
> In my field (computational genomics) one recurrent headache is > the multiplicity of ids widely used to identify genes. For each [quoted text clipped - 5 lines] > has to do with the best design for such a database. The two > considerations are 1) growth; and 2) performance. First consideration ought to be: can I extract from my database, in a reasonably straightforward way, the information I need. I.e. you should start not with the data you (seemingly) want to store, but with its intended use. Second consideration could be: if I'm going to add new data, what's the most efficient way to make sure it is complete. You don't want to add all possible synonyms within N namespaces, M genes and O organisms by hand.
> one HUGE table with the following fields: > namespace_1_id : gene_id_1 : namespace_2_id : gene_id_2 I.e. explicitly storing each and every synonym relationship. This involves far too much redundancy (NS1, GENE1) <-> (NS2, GENE1) (NS1, GENE1) <-> (NS3, GENE1), and then you would have to add (NS2, GENE1) <-> (NS3, GENE1), which is redundant.
> *many* tables of the form > gene_id_1 gene_id_2 > one such table for each *ordered* pair of distinct namespaces for > each organism. No, no, no. Searching for synonyms in N*(N-1)/2 tables is no joy.
> Which approach would be better? Assuming that they are both bad, > what would be a better design for such a database? What do you consider synonyms? - Two "different" genes coding for what is structurally the same protein, where different can be - "different gene in same namespace", - "same gene in different organism or namespace", - "same gene under a superseded name, same namespace", ... some more convolutions of namespace, gene and organism ... - Two "different" genes coding for proteins having the same function ... some convolutions of namespace, gene and organism ... - Two "different" genes having a partial overlap in their nucleotid order, ... some convolutions of namespace, gene and organism ...
> P.S. In case it matters, I should add that in most cases there is > NOT a 1-to-1 correspondence between any two namespaces for any > given organism. For a variety of reason, it is not uncommon for > one id in one namespace to correspond to 0 or >1 ids in another > namespace. Are you sure that the information you are interested in is "equivalence, or partial equivalence, of identifiers"?
KKramsch - 20 Oct 2004 14:11 GMT ><.....> >> In my field (computational genomics) one recurrent headache is [quoted text clipped - 6 lines] >> has to do with the best design for such a database. The two >> considerations are 1) growth; and 2) performance.
>First consideration ought to be: can I extract from my database, in >a reasonably straightforward way, the information I need. I.e. you [quoted text clipped - 4 lines] >all possible synonyms within N namespaces, M genes and O organisms by >hand.
>> one HUGE table with the following fields: >> namespace_1_id : gene_id_1 : namespace_2_id : gene_id_2
>I.e. explicitly storing each and every synonym relationship. This >involves far too much redundancy >(NS1, GENE1) <-> (NS2, GENE1) >(NS1, GENE1) <-> (NS3, GENE1), and then you would have to add >(NS2, GENE1) <-> (NS3, GENE1), which is redundant.
>> *many* tables of the form >> gene_id_1 gene_id_2 >> one such table for each *ordered* pair of distinct namespaces for >> each organism.
>No, no, no. Searching for synonyms in N*(N-1)/2 tables is no joy.
>> Which approach would be better? Assuming that they are both bad, >> what would be a better design for such a database? [quoted text clipped - 11 lines] > order, > ... some convolutions of namespace, gene and organism ... The short answer is: synonyms are whatever the existing (incomplete) dicitionaries say are synonyms. A slightly better answer, synonyms *should* be different names for the *same* stretch of DNA in a given chromosome. However, deviations from this ideal are relatively common, for a host of reasons.
BTW, as I alluded to above, there are many published lists of synonyms out there, but they invariably cover only one or a few organisms, and for the organisms they cover they invariably miss at least one important namespace. Moreover, all these dictionaries are designed to operate via a Web interface, in single query mode, which makes them unsuitable for high-performance applications.
Karl
 Signature Sent from a spam-bucket account; I check it once in a blue moon. If you still want to e-mail me, cut out the extension from my address, and make the obvious substitutions on what's left.
Theo Peterbroers - 21 Oct 2004 15:31 GMT > ><.....> > >> In my field (computational genomics) one recurrent headache is [quoted text clipped - 62 lines] > > Karl Your problem is partly of an organizational nature. As you know (should know) in more oldfashioned fields there exist a Code of Botanical Nomenclature and a Code of Zoological Nomenclature. Scientific Congresses decide on those Codes and outstanding issues. Ultimately, genomics will probably clean up its act, but this does not help you in the short run.
Given that I think 1) explicitly storing each and every individual synonym relationship between genes/DNA stretches identified in different namespaces leads to redundancy and is rather unmaintainable 2) creating a lot of tables for storing such relationships will make your database very unfriendly to users,
I believe that we should relate identifiers in different namespaces through some sort of 'Least Common Multiple'. Pretty much like having a 'standard interface'. Rather than creating a specific link between each synonym pair, each identifier needs just one relationship to its 'standard interface'.
Basically, I would say just two tables are needed.
Identifiers (Namespace, Gene) Both columns make up the primary key. You fill this table from the published dictionaries.
Synonyms (Namespace, Gene, XXX) All three columns form the primary key, Namespace and Gene are a foreign key to Identifiers. XXX is the 'Least Common Multiple' or 'Standard Interface', a system generated item identifying synonyms (groups of identifiers "Namespace, Gene" that you consider to be the same).
Two possible improvements to this design are adding a source or authority column to both tables, and introducing a shorthand abbreviation GeneID for each Identifier.
On filling table Synonyms, each gene belongs, preferably, to just one group of synonyms XXX. As you add synonyms from several sources then, you should use the XXX which already exists. On the other hand, synonymy may be a matter of opinion and scientific debate, in which case a gene may end up belonging to this group or that, according to the authority consulted.
KKramsch - 21 Oct 2004 17:02 GMT >> ><.....> >> >> In my field (computational genomics) one recurrent headache is [quoted text clipped - 49 lines] >> are designed to operate via a Web interface, in single query mode, >> which makes them unsuitable for high-performance applications.
>I believe that we should relate identifiers in different namespaces >through [quoted text clipped - 3 lines] >each identifier needs just one relationship to its 'standard >interface'.
>Basically, I would say just two tables are needed.
>Identifiers (Namespace, Gene) >Both columns make up the primary key. You fill this table from the >published dictionaries.
>Synonyms (Namespace, Gene, XXX) >All three columns form the primary key, Namespace and Gene are a [quoted text clipped - 3 lines] >a system generated item identifying synonyms (groups of identifiers >"Namespace, Gene" that you consider to be the same). What exactly is the role/utility of the Identifiers table? It looks to me like that all the information in Identifiers is already contained in Synonyms.
Thanks,
Karl
 Signature Sent from a spam-bucket account; I check it once in a blue moon. If you still want to e-mail me, cut out the extension from my address, and make the obvious substitutions on what's left.
Theo Peterbroers - 22 Oct 2004 10:30 GMT > > >I believe that we should relate identifiers in different namespaces [quoted text clipped - 24 lines] > > Thanks, Identifiers contains all genes, regardless if they are also known under a different name in another namespace. Synonyms contains only those genes for which one or more synonyms exist.
KKramsch - 22 Oct 2004 13:03 GMT >Identifiers contains all genes, regardless if they are also known under >a different name in another namespace. >Synonyms contains only those genes for which one or more synonyms exist. OK, I see, but since 99% of genes have synonyms (I'm sorry I didn't make this point clearer), Synonyms contains practically all the information in Identifiers.
Karl
 Signature Sent from a spam-bucket account; I check it once in a blue moon. If you still want to e-mail me, cut out the extension from my address, and make the obvious substitutions on what's left.
Theo Peterbroers - 25 Oct 2004 08:21 GMT > >Identifiers contains all genes, regardless if they are also known under > >a different name in another namespace. [quoted text clipped - 5 lines] > > Karl And now I see your point. It seems you can get by with just one table: (Namespace, Gene, Synonym) - All three columns are primary key. - Genes that are related share the same value in Synonym, - I now tend to think of synonym as 'belonging to one family', since there are also 1:n and inexact synonyms. - Some genes do not have any 'family' in other namespaces <-> their synonym (a system assigned meaningless ID code) occurs just once. - If you want to express something of a hierarchy between genes, like with the 1:n cases where the n-side may have synonyms on their level, this can be done by encoding it in one or two synonym columns. Please search this newsgroup for 'Trees', 'SQL' and 'Celko'.
KKramsch - 26 Oct 2004 18:52 GMT >> >Identifiers contains all genes, regardless if they are also known under >> >a different name in another namespace. [quoted text clipped - 3 lines] >> didn't make this point clearer), Synonyms contains practically all >> the information in Identifiers.
>And now I see your point. >It seems you can get by with just one table: [quoted text clipped - 6 lines] > their synonym (a system assigned meaningless ID code) occurs > just once. I was leaning towards this solution, but this post confirms my understanding. Many thanks! (And many thanks to all who replied!)
Karl
 Signature Sent from a spam-bucket account; I check it once in a blue moon. If you still want to e-mail me, cut out the extension from my address, and make the obvious substitutions on what's left.
KKramsch - 22 Oct 2004 13:05 GMT Please ignore my last post. I think I see your point.
Karl
 Signature Sent from a spam-bucket account; I check it once in a blue moon. If you still want to e-mail me, cut out the extension from my address, and make the obvious substitutions on what's left.
Dennis Lee Bieber - 22 Oct 2004 16:39 GMT > Identifiers contains all genes, regardless if they are also known under > a different name in another namespace. > Synonyms contains only those genes for which one or more synonyms exist. Still looks redundant to me -- since a singleton "synonym" can be told apart by the mere fact that a count of the synonym "ID" column value would come back as "1" for entries that don't have synonyms.
> ============================================================== < > wlfraed@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG < > wulfraed@dm.net | Bestiaria Support Staff < > ============================================================== < > Home Page: <http://www.dm.net/~wulfraed/> < > Overflow Page: <http://wlfraed.home.netcom.com/> < Laconic2 - 20 Oct 2004 14:41 GMT > (I apologize beforehand for the cluelessness of this question. > Except for one grad-level course on relational databases a couple > of years ago, I have little experience with database design, and > certainly zero experience with *real world* database design.) I apologize beforehand for the vagueness of this answer. I know as little about genomes as you know about databases.
> In my field (computational genomics) one recurrent headache is > the multiplicity of ids widely used to identify genes. For each > model organism (human, mouse, yeast, etc.) there are at least two, > but often 4 or 5 different identifier namespaces in widespread use. There are two problems which pervade the field of data management: the synonym problem and the homonym problem. It doesn't matter whether you are tallying votes, keeping an inventory of parts for sale, booking airline reservations, or cataloguing the genome, the same problems come up. In addition to which, a great many problems come up that are unique to each field of study or each enterprise. Data management includes database management and a whole lot more.
The size of the synonym/homonym problem varies with the size of the namespace being managed. I don't know, but I imagine the genome namespace to be mind numbingly large.
Just for comparison purposes, imagine that the FBI receives an image of a fingerprint, and is asked whether it matches any fingerprint on file. Unless they have a smooth way of eliminating 99.9% of their filed fingerprints from consideration, the task will be too monumental to even consider. They do have such a way.
What the heck do fingerprints have to do with synonyms? Well, two fingerprints might be considered "synonyms" if they look very different, superficially, but they were prints made by the same finger.
> There are two basic designs I can think of. The first one, Design > #1, consists primarily of one HUGE table with the following fields: I'm not sure this applies here, but I recommend you do a web search on "the one true lookup table". This topic has to do with whether it's better to store codes of different types in the same column of the same great big table, with a second column to disambiguate homonyms, or whether it's better to have a separate table for each code type.
With regard to performance, my epxerience tells me that OTLT is a terrible idea. With regard to growth, it permits new code types to be invented without performing any DDL. This is good from the point of view of the programmers, who just regard data mangement as bureaucracy anyway. It's a nightmare from the point of view of the future "data archeologist".
> one such table for each *ordered* pair of distinct namespaces for > each organism. I'd like to find out what makes distinct namespaces distinct.
|
|
|