>>In actuality, this much better handled by a custom search engine
>>designed along the same lines but with a lot of compression. If you are
>>interested in the latter, I will be willing to explain further.
>
> thanks, that makes sense.. what kind of compression do you mean?
Well there are at least two different directions of compression: words
in the n-grams and the word lists. since all of the words in the 2-grams
through 5-grams are guaranteed to be in the 1-grams, consider sorting
the 1-grams in frequency order and assigning each word a compression id;
the first 127 get 1 byte ids x'01'-x'7f', the next 16328 get 2 byte ids
x'8000'-x'bfff', and the remainder 3 byte ids from x'c00000'-x'dfffff'
(that is 2 million words, much more than in english -- if foreign
languages are included the 4 byte ids go from x'e0000000'-x'efffffff',
an additional 256 million words, which should be enough). Now record the
2-5 grams with the corresponding 1-gram ids which should shorten them to
an average of around two bytes per word. If that is not enough we could
look at prefix trees to reduce the storage even further: every n-gram's
leading (n-1)-gram is guaranteed to exist in the corpus, so the
representation of an n-gram is (n-1)-gram id and the word id of the last
word.
The word lists can be stored one record per word:
wordid as described above, position in ngram, number of ngrams, list of
ngrams
The list of ngrams starts with a marker indicating whether the list of
ngramids or a bit list with each bit position representing an ngram. If
it a bit list, several types of RLE can be used to shorten it.
I have used these techniques on everything from a TRS-80 to a 16-way
3090, currently called (I think) X machines in IBM parlance, to great
effect to speed up (and just make possible) searches over corpuses that
would otherwise not even be possible (like a 4 million volume library,
searchable on every title, author, publisher, etc.).
Shield - 16 Sep 2007 02:53 GMT
compressing the data would take allot of time. time taken away from
the actual experiment.
what would be the fastest way to using the dataset, using the same
conditions of searching for occurances?
Bob Stearns - 16 Sep 2007 06:45 GMT
> compressing the data would take allot of time. time taken away from
> the actual experiment.
>
> what would be the fastest way to using the dataset, using the same
> conditions of searching for occurances?
The payoff on the compression & reindexing is less than 100 straight
searches on the original corpus. If your experiment is that small (or
even 1000 searches) just do the in the brute force way.