Simple utility to intersect lists of names

hesiod - 24 Aug 2003 09:38 GMT
Is there a simple utility that you can take two lists of any kind of arbitrary
elements, such as a list of names, or even a flat list of words, and which will
output any elements held in common between the two lists?  

If a commonly known piece of software does this, could you explain the
process so that a dummy could do it?

The output need not be pretty, as long as it speeds up the difficult and
tedious process of manually looking at one list of names and searching
through another list of names to see if they're there.

The lists of names aren't in a real "database" format, just flat lists or
lists separated by commas.  I would rather get false positives than
miss actual duplicates, since the goal is not to weed the duplicates
out but to end up with a list of them.

Some data might be in actual databases, some might be arbitrary lists
of proper names, some might have misspellings, have middle initials or
lack middle initials, or otherwise vary slightly.  I would like the duplicate
detection to be fuzzy.

Using Google, I have found a number of theoretical discussions of such things
and even assignments to create such things on comp sci syllabus sites,
but nothing that seems to match what I'm looking for.  Sorting lists with the
sort function in Unix and then using join almost works for this purpose, but
not well enough and there's no way to make it fuzzy.  I've also found some
rather shitty suggestions from spammers of various sorts on merging
mailing lists, but they're generally more concerned with having large lists and
don't really care if 99% of it is utter rubbish.

It seems like the sort of thing that someone must do somewhere, and I'd
rather not recreate it if they have.  I don't care about the output's appearance
and a command line utility, python or perl script, or whatever will do.  
(It seems it should be possible to do this in MySQL too but I'm rather clueless
how to go about that.)

I am going to have to manually verify positives anyway, so this is just
something to speed up a tedious and time-consuming part of the process.
Patrick Schaaf - 24 Aug 2003 10:31 GMT
>Is there a simple utility that you can take two lists of any kind of arbitrary
>elements, such as a list of names, or even a flat list of words, and which will
>output any elements held in common between the two lists?  

Assuming two flat ASCII input files, A and B, with one word on each line,
first sort each file and remove duplicate lines within it:

for f in A B; do
    sort "$f" | uniq >"unique.$f"
done

Then merge the sorted files, and use 'uniq -d' to select only the lines
that appear in both:

sort -m unique.A unique.B | uniq -d >both.AB

For 3 input files, run the same sort loop over A B C, and then:

sort -m unique.A unique.B unique.C | uniq -c | awk '$1 == 3 {print $2}' >allthree.ABC

I hope this approach fits your definition of "simple". It does fit mine,
for lots of everyday stuff.
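
The pipeline above matches whole lines exactly. For the comma-separated
lists and optional middle initials mentioned in the question, entries can
be normalized before they are compared. The following Python is a minimal
sketch under stated assumptions, not a finished tool: the function names
read_items, normalize, and common_items are made up for illustration, and
dropping single-letter tokens is only one guess at how to treat initials.

```python
import re

def read_items(text):
    """Split text on commas and newlines; return stripped, non-empty items.

    Assumption: a comma always separates two entries, so names written
    as "Last, First" would be split in two and need different handling.
    """
    return [item.strip() for item in re.split(r"[,\n]", text) if item.strip()]

def normalize(name):
    """Lowercase, drop punctuation, and drop single-letter tokens (initials)."""
    tokens = re.findall(r"\w+", name.lower())
    kept = [t for t in tokens if len(t) > 1]
    # Fall back to all tokens if the name consisted only of initials.
    return " ".join(kept or tokens)

def common_items(list_a, list_b):
    """Return (item_from_a, item_from_b) pairs whose normalized forms agree."""
    index_b = {}
    for item in list_b:
        index_b.setdefault(normalize(item), []).append(item)
    pairs = []
    for item in list_a:
        for match in index_b.get(normalize(item), []):
            pairs.append((item, match))
    return pairs

if __name__ == "__main__":
    a = read_items("John Q. Smith, Mary Jones\nAlan Turing")
    b = read_items("john smith\nGrace Hopper")
    for left, right in common_items(a, b):
        print(left, "<->", right)
```

With the sample data, "John Q. Smith" is paired with "john smith" because
both normalize to the same string. Misspellings are still missed, since
the comparison after normalization is exact.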

best regards
 Patrick
Paul Skaife - 26 Aug 2003 15:25 GMT
The simple utility you are looking for is called ClueOffice. You can read in
2 lists of arbitrary elements that are then compared and a report produced
of not only the exact matches, but also matches that are nearly the same.

Try http://www.itcg.nl/d_software.html to download a demo and contact me to
get a key for higher amounts of data if you like the product.

Good luck and I hope this helps
Paul Skaife

> Is there a simple utility that you can take two lists of any kind of arbitrary
> elements, such as a list of names, or even a flat list of words, and which will
[quoted text clipped - 34 lines]
> I am going to have to manually verify positives anyway, so this is just
> something to speed up a tedious and time-consuming part of the process.
Tony Douglas - 26 Aug 2003 18:46 GMT
(Just a quick thought) - if you're in Unix-land, wouldn't comm be of
assistance here ? Something like

sort fileA | uniq > s.u.fileA
sort fileB | uniq > s.u.fileB
comm -12 s.u.fileA s.u.fileB > common.to.both
comm -23 s.u.fileA s.u.fileB > in.fileA.only
comm -13 s.u.fileA s.u.fileB > in.fileB.only

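"man comm" gives the lowdown on the combinations of 1, 2 & 3 in the
flag: each digit suppresses one output column (lines only in the first
file, lines only in the second, and lines in both). So handy once I had
found it!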
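
Like sort and uniq, comm only finds exact matches. For the fuzzy matching
asked for in the question, one option is the difflib module in Python's
standard library, whose SequenceMatcher.ratio() scores the similarity of
two strings between 0 and 1. The sketch below is an illustration only:
the function name fuzzy_common and the 0.8 cutoff are assumptions, and
the cutoff would need tuning against real data.

```python
import difflib

def fuzzy_common(list_a, list_b, cutoff=0.8):
    """Return (a_item, b_item, score) for pairs with similarity >= cutoff.

    Compares every item of list_a with every item of list_b, so the cost
    grows with len(list_a) * len(list_b). Fine for lists that would
    otherwise be checked by hand; slow for very large ones.
    """
    lowered_b = [(b, b.lower()) for b in list_b]
    matches = []
    for a in list_a:
        a_low = a.lower()
        for b, b_low in lowered_b:
            score = difflib.SequenceMatcher(None, a_low, b_low).ratio()
            if score >= cutoff:
                matches.append((a, b, round(score, 2)))
    # Best candidates first, to make manual verification quicker.
    return sorted(matches, key=lambda m: -m[2])

if __name__ == "__main__":
    a = ["Jon Smith", "Mary Jones"]
    b = ["John Smith", "Grace Hopper"]
    for left, right, score in fuzzy_common(a, b):
        print(score, left, "<->", right)
```

Lowering the cutoff produces more false positives and fewer misses, which
fits the stated preference for over-reporting candidates that are then
verified by hand.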

- Tony