Not long after I decided to bail on the Tsongas Arena Pixies shows and commented that I hoped they would someday play somewhere smaller and closer, they announced a show at Avalon next Thursday. Well, then I went and forgot to try to get a ticket when they went on sale today at noon, and now it's sold out. Dur! Still, I'm thinking I have better ways to spend the money, like maybe a handful of smaller club shows, or eggs.
I've been struggling with a project for a while now, and now that I've finally realized that it basically boils down to a user interface design problem, I figured I'd ask for help again (it worked out pretty well the last time).
There is a set of objects; each object has a canonical name. I have a list of strings that are meant to refer to the objects, but many of them don't exactly match the canonical name, because of typos, abbreviations, alternate forms of the name, or someone typing in the name from memory and getting it wrong. The task is to go through the list and find all duplicates, i.e. strings that map to the same object, and record that mapping. There are on the order of 1000 strings, and roughly a third of them will be duplicates; many of the duplicates are exact matches, but many are not.
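To make the easy part concrete, here's a minimal Python sketch of how the exact (or nearly exact) duplicates could be grouped automatically before any hand-checking. The `normalize` rule (lowercase, strip punctuation, collapse whitespace) and both function names are just assumptions for illustration, and the sample strings in the usage note are made up; the real data might need a different rule.

```python
from collections import defaultdict


def normalize(name):
    """Hypothetical normalization: lowercase, drop punctuation, collapse spaces."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(cleaned.split())


def group_exact(strings):
    """Group strings that become identical after normalization."""
    groups = defaultdict(list)
    for s in strings:
        groups[normalize(s)].append(s)
    return dict(groups)
```

Something like `group_exact(["The Pixies", "the pixies!", "Frank Black"])` would put the first two strings in one group, leaving only the groups that still look different to be compared by eye.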
First of all, is there some sort of data-entry tool or database editor that will already handle something like this? Or should I implement either a web servlet or a standalone GUI? Or is munging a flat text file by hand in Emacs really the best way to go about it? (I've done it this way several times before, but it's really tedious.)
I have access to a database that has canonical names for some subset of the objects, and can do fuzzy matching, returning a list of candidates for a given search string along with a similarity rating (from 1 to 100%). The problem is that since not all the objects are in the database, just picking the most similar object for a string will often give the wrong object. In these cases it's usually easy to see that it's the wrong object, but it still needs to be checked by hand. Is there any point to using this database, or will it just complicate the task without really saving any time?
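One way the database might still earn its keep is as a triage step rather than an oracle: accept only matches at or above some cutoff, and queue everything else for hand review with the candidate list attached. Here's a rough sketch of that idea. It uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in for the database's fuzzy matcher, and the function names and the `accept_at` cutoff are assumptions, not anything the real database provides.

```python
from difflib import SequenceMatcher


def candidates(query, canonical_names, limit=3):
    """Return up to `limit` (score, name) pairs, best first, scores 0 to 100."""
    scored = [
        (round(100 * SequenceMatcher(None, query.lower(), name.lower()).ratio()), name)
        for name in canonical_names
    ]
    # Highest score first; ties broken alphabetically so output is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:limit]


def triage(strings, canonical_names, accept_at=100):
    """Split strings into auto-accepted matches and a hand-review queue."""
    accepted, review = {}, {}
    for s in strings:
        ranked = candidates(s, canonical_names)
        if ranked and ranked[0][0] >= accept_at:
            accepted[s] = ranked[0][1]
        else:
            # Keep the ranked list so the reviewer can pick one or reject all,
            # since the right object may not be in the database at all.
            review[s] = ranked
    return accepted, review
```

With the cutoff at 100, only case-insensitive exact matches are accepted automatically; lowering it trades hand-checking time for a higher risk of wrong matches, which is exactly the judgment call in question.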
Anyway, sorry if this is all a bit vague. I'm trying to abstract away as many of the details as possible, because those have just been making me more confused about how to go about this problem. Feel free to ask clarifying questions.