I've been struggling with a project for a while now, and now that I've finally realized that it basically boils down to a user interface design problem, I figured I'd ask for help again (it worked out pretty well the last time).
There are a set of objects; each object has a canonical name. I have a list of strings that are meant to refer to the objects, but many of them don't exactly match the canonical name, either because of typos, abbreviations, alternate forms of the name, or just someone was typing in the name from memory and got it wrong. The task is to go through the list and find all duplicates, i.e. strings that map to the same object, and record that mapping. There are on the order of 1000 strings, and roughly a third of them will be duplicates; many of them are exact matches, but many are not.
First of all, is there some sort of data-entry tool or database editor that will already handle something like this? Or should I implement either a web servlet or a standalone GUI? Or is munging a flat text file by hand in Emacs really the best way to go about it? (I've done it this way several times before, but it's really tedious.)
I have access to a database that has canonical names for some subset of the objects, and can do fuzzy matching, returning a list of candidates for a given search string along with a similarity rating (from 1 to 100%). The problem is that since not all the objects are in the database, just picking the most similar object for a string will often give the wrong object. In these cases it's usually easy to see that it's the wrong object, but it still needs to be checked by hand. Is there any point to using this database, or will it just complicate the task without really saving any time?
Anyway, sorry if this is all a bit vague. I'm trying to abstract away as much of the details as possible, because those have been just making me more confused about how to go about this problem. Feel free to ask clarifying questions.
There are a set of objects; each object has a canonical name. I have a list of strings that are meant to refer to the objects, but many of them don't exactly match the canonical name, either because of typos, abbreviations, alternate forms of the name, or just someone was typing in the name from memory and got it wrong. The task is to go through the list and find all duplicates, i.e. strings that map to the same object, and record that mapping. There are on the order of 1000 strings, and roughly a third of them will be duplicates; many of them are exact matches, but many are not.
First of all, is there some sort of data-entry tool or database editor that will already handle something like this? Or should I implement either a web servlet or a standalone GUI? Or is munging a flat text file by hand in Emacs really the best way to go about it? (I've done it this way several times before, but it's really tedious.)
I have access to a database that has canonical names for some subset of the objects, and can do fuzzy matching, returning a list of candidates for a given search string along with a similarity rating (from 1 to 100%). The problem is that since not all the objects are in the database, just picking the most similar object for a string will often give the wrong object. In these cases it's usually easy to see that it's the wrong object, but it still needs to be checked by hand. Is there any point to using this database, or will it just complicate the task without really saving any time?
Anyway, sorry if this is all a bit vague. I'm trying to abstract away as much of the details as possible, because those have been just making me more confused about how to go about this problem. Feel free to ask clarifying questions.
From:
no subject
[I tried to do a mockup here but LJ wouldn't let me. It looked a text box with "The Trouble with Charlie" in it, followed by a select box that contained "91% - The Trouble with Harry", "85% - The Truth about Charlie", "new name", and "no match". Something like that.]
What you have then is a system that starts with the strings that (a) are going to require manual checking anyway and (b) are most likely to generate/confirm canonical names that are not yet in the database. When the user submits the form, for each string, if one of the candidates is selected, note the mapping. If "no match" is selected, do nothing. If "new name" is selected, add the new name to the database, note the mapping, and redo the fuzzy matching for all unmapped strings.
That way as you add new names to the database, the other misspellings ("Trouble with Charlie, The") get matched with them. As the user goes through the list, the similarity ratings will get higher and higher (because the low similarity ones will be done already, and because the system will have more canonical names to match to). Eventually, if the fuzzy matching is any good, it will get to a sufficiently high similarity rating, and sufficiently correct results, that the user will just be clicking the submit button on every page.
This is probably more work than it's worth for a thousand strings, but if this is, say, something that you do every year, it might be useful.
(The "no match" idea is probably not worth implementing; the idea was that if there were several close matches to the same new name on the same page, the user might want to enter the new name for one match, and then submit the form to get the fuzzy matching redone. But you'd still have to pick "no match" for each string you wanted to leave unmatched, and it's probably not worth the added complexity for the marginal hypothetical convenience.)
From:
no subject
There's still the problem of not finding duplicates, though, when the fuzzy match fails even after one of the strings is hand-canonicalized. But maybe this will be a rare enough issue that I can do those merges by hand.
From:
no subject
From:
no subject
From:
no subject
From:
no subject
Hm, I'm still a little wary of redoing the fuzzy match that many times, though. That's a quadratic amount of net traffic for the remote database. Maybe I should just insert the new name into the old matches where appropriate. Assuming that the similarity rating is solely based on the two strings and not a relative comparison of the choices.
From:
no subject
For implementing this puppy, an Emacs plug-in might be easiest. Assuming you know elisp, which I don't (I can't believe it actually uses dynamic scope; ugh).
From:
no subject
I'm not really comfortable enough with Elisp coding (especially the GUI) to do this there. It'll probably just be a PLT web server servlet, but I think I may have to upgrade from text files to a real DBMS, i.e. PostgreSQL, and talk to it with SchemeQL. But I know practically nothing about database stuff, so it may take me a while to get up to speed.