Hi all,
Let's suppose I'm writing a website where users search for movie
titles, and suppose there are 200,000 movies. The site is in PHP/
MySQL.
I'd like to implement a "fuzzy" text search so that similar movie
titles come up in a list no matter if you include all the words or
even misspell them.
MySQL's fulltext-search is a first step, but it doesn't cover mis-
spellings.
My question is this: suppose I broke down every movie title into
bigrams (i.e. "Jaws" becomes *j, ja, aw, ws, s*), and store a
relationship between every bigram and every movie title that includes
it in a separate table, BIGRAMS. (Will have two fields, "bigram" and
"movie_id").
I break down every search string into bigrams. Then, for every search,
I retrieve all the rows from BIGRAMS that include any of the bigrams
in the search string. I then calculate COUNT(movie_id) ORDER BY DESC
and take the top matches.
Comments? Anybody have experience with this? Does it work? Or would it
be far too slow? Is there another better solution I'm missing? We're
talking 200,000 movies, which means BIGRAMS might have 4 million rows
or so.
Thanks
Michael
Bob Badour - 25 Oct 2007 00:48 GMT
> Hi all,
>
[quoted text clipped - 27 lines]
> Thanks
> Michael
http://en.wikipedia.org/wiki/Stemming
crazyhorse - 25 Oct 2007 05:53 GMT
> > Hi all,
>
[quoted text clipped - 29 lines]
>
> http://en.wikipedia.org/wiki/Stemming
That doesn't really help... I'm trying to compensate for spelling
errors, not different grammatical forms of words...
David Cressey - 25 Oct 2007 09:12 GMT
> Hi all,
>
[quoted text clipped - 27 lines]
> Thanks
> Michael
If you do a search on "Soundex" you will come up with some tools for
detecting near misses that sound right.
The most common typo is swapping two letters, and this kind of error would
probably defeat a pure soundex implementation. But I wouldn't be surprised
if sites that offer a soundex tool might offer what I'll call a "typex
tool", for lack of a better name.
diogomartins - 25 Oct 2007 13:05 GMT
Hi,
One common approach to deal with query terms mispellings in
Information Retrieval systems is to employ some sort of string
distance technique, whose most common approach is described in [http://
en.wikipedia.org/wiki/Levenshtein_distance]. Techniques like that can
identify, by syntax, the nearest terms, similarly to suggestions given
by spell checkers of word processors. I think that using stemming in
conjunction with string distance can satisfy your requirements. The
really hard part would be to put that functionalities inside MySQL....
[]'s
Diogo
> Hi all,
>
[quoted text clipped - 27 lines]
> Thanks
> Michael
crazyhorse - 25 Oct 2007 15:53 GMT
> Hi,
>
[quoted text clipped - 42 lines]
> > Thanks
> > Michael
I think the string distance is too expensive to compute in a database,
and again, stemming is not really what I need.
For this application, it's also not just misspelled words -- it's
skipping a word in the movie title, or using an alternate form (i.e.
"Stephen" or "Steven"), or specifying a longer version when we only
have a shorter title in our database. I'm looking for a fuzzy,
flexible search in general that can be implemented in a database.
No real strong ideas for this, huh?
Patricia Shanahan - 25 Oct 2007 16:06 GMT
...
> I think the string distance is too expensive to compute in a database,
> and again, stemming is not really what I need.
[quoted text clipped - 6 lines]
>
> No real strong ideas for this, huh?
How about a multi-layer approach?
1. Look up the actual words against the actual movie titles. That will
pick up deliberately and distinctively unusual names and spellings. For
example, it will match "shrek".
2. Apply one or more layers of normalization, and check against a
normalized version of the titles.
For example, reduce all forms of names with multiple common forms to a
single form. Do spelling correction with a US English dictionary. Delete
articles ("the","a" ...).
Patricia
paul c - 26 Oct 2007 00:58 GMT
...
> I think the string distance is too expensive to compute in a database,
> and again, stemming is not really what I need.
[quoted text clipped - 6 lines]
>
> No real strong ideas for this, huh?
I tend to think this problem lays more in the area of type theory than
relational theory and that (conventional) relational db theory doesn't
offer much help other than to label such support as what Codd called
'theta' or what I might call 'semi-equal' operations. Besides, the
conventional rm theory at least doesn't have much to say about indexing.
It seems untoward to me at least to use a conventional db for type
support when a second conventional db would still be needed for the
immediate application!
Still, to give my two cents on the actual question, I know that Google
publishes some of their fairly gargantuan word combination "tables"
(sorry, I've forgotten the term they use, maybe they recognize
permuations too) where they rank the frequencies of various phrases.
Maybe that would be a better starting point. I remember setting up
industry-specific text db's at one time and finding that I could express
each industry's lingo in word codes that were 16 or fewer bits long.
Soundex reduced the size even further (this was about twenty years ago
and we used to pay big bucks for service bureau storage). I'd be
interested to know the number of unique words in the 200,000 movie
titles but perhaps what's more important is the possibility which seems
likely to me that most titles have five or fewer significant words, even
if proper names are included. Five words have 120 permutations and that
times the number of titles is something more than 20,000,000 which is
not a big number as far as conventional indexes are concerned. If
looking at it this way seems feasible, then I wouldn't rule out Soundex
either.