Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General DB Topics / DB Theory / September 2007

Tip: Looking for answers? Try searching our database.

Searching Google n-gram corpus

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
bobterwillinger@gmail.com - 08 Sep 2007 15:36 GMT
(also posted in sql group but got no replies, apolgies if that's bad
etiquette)

Hi,

Google released a corpus of n-grams collected from the Web.

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-...

It contains all 1..5grams that occur more than 40 times in their web
crawl. It comes as 5 folders, each folder containing around 120 files.
Each file contains 10,000,000 (10^7) lines. A line looks like:

"this is a four gram 65"

where the last number is the frequency of that exact phrase.
The total unzipped size of the 3 grams alone is 19GB, each individual
file around 200MB.
All the unzipped data is around 100GB.

I would like to be able to search through all this and return all
lines that contain a particular word or phrase.
I have no idea where to start with this, but I was wondering would an
SQL database be feasible. For the 5-grams i would need a billion rows
and of 6 columns. What sort of hard disk space would I need, and what
kind of time would i be looking at per search on on ordinary mahcine?,

I would like to be able to find every line where a particular word
occurs, no matter which position it occurs in, and ideally I would
like to be able to find particular bigrams as well.

thanks.
Bob Stearns - 08 Sep 2007 19:59 GMT
> (also posted in sql group but got no replies, apolgies if that's bad
> etiquette)
[quoted text clipped - 28 lines]
>
> thanks.

I believe that your approach is probably inappropriate for this data. If
you want to remain with SQL consider a design like:

create table ngrams( ngramid longint generated always, ngram
varchar(300), primary key ngramid)
create table words ( word varchar(25), pos shortint, ngramid longint,
primary key word,pos,ngramid)

insert all the ngrams in the first table, then for each word in each
ngram insert a record into words.

Search words to find the relevant ngrams containing any word and a self
join of words for the bigrams on t1.word = 'firstword' and t2.word =
'secondword' and t1.pos = t2.pos-1 and t1.ngramid=t2.ngramid.

In actuality, this much better handled by a custom search engine
designed along the same lines but with a lot of compression. If you are
interested in the latter, I will be willing to explain further.
bobterwillinger@gmail.com - 11 Sep 2007 13:48 GMT
> In actuality, this much better handled by a custom search engine
> designed along the same lines but with a lot of compression. If you are
> interested in the latter, I will be willing to explain further.

thanks, that makes sense.. what kind of compression do you mean?
Bob Stearns - 12 Sep 2007 07:37 GMT
>>In actuality, this much better handled by a custom search engine
>>designed along the same lines but with a lot of compression. If you are
>>interested in the latter, I will be willing to explain further.
>
> thanks, that makes sense.. what kind of compression do you mean?

Well there are at least two different directions of compression: words
in the n-grams and the word lists. since all of the words in the 2-grams
through 5-grams are guaranteed to be in the 1-grams, consider sorting
the 1-grams in frequency order and assigning each word a compression id;
the first 127 get 1 byte ids x'01'-x'7f', the next 16328 get 2 byte ids
x'8000'-x'bfff', and the remainder 3 byte ids from x'c00000'-x'dfffff'
(that is 2 million words, much more than in english -- if foreign
languages are included the 4 byte ids go from x'e0000000'-x'efffffff',
an additional 256 million words, which should be enough). Now record the
2-5 grams with the corresponding 1-gram ids which should shorten them to
an average of around two bytes per word. If that is not enough we could
look at prefix trees to reduce the storage even further: every n-gram's
leading (n-1)-gram is guaranteed to exist in the corpus, so the
representation of an n-gram is (n-1)-gram id and the word id of the last
word.

The word lists can be stored one record per word:

wordid as described above, position in ngram, number of ngrams, list of
ngrams

The list of ngrams starts with a marker indicating whether the list of
ngramids or a bit list with each bit position representing an ngram. If
it a bit list, several types of RLE can be used to shorten it.

I have used these techniques on everything from a TRS-80 to a 16-way
3090, currently called (I think) X machines in IBM parlance, to great
effect to speed up (and just make possible) searches over corpuses that
would otherwise not even be possible (like a 4 million volume library,
searchable on every title, author, publisher, etc.).
Shield - 16 Sep 2007 02:53 GMT
compressing the data would take allot of time. time taken away from
the actual experiment.

what would be the fastest way to using the dataset, using the same
conditions of searching for occurances?
Bob Stearns - 16 Sep 2007 06:45 GMT
>  compressing the data would take allot of time. time taken away from
> the actual experiment.
>
> what would be the fastest way to using the dataset, using the same
> conditions of searching for occurances?

The payoff on the compression & reindexing is less than 100 straight
searches on the original corpus. If your experiment is that small (or
even 1000 searches) just do the in the brute force way.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.