I've been struggling with a project for a while now, and now that I've finally realized that it basically boils down to a user interface design problem, I figured I'd ask for help again (it worked out pretty well the last time).

There are a set of objects; each object has a canonical name. I have a list of strings that are meant to refer to the objects, but many of them don't exactly match the canonical name, because of typos, abbreviations, alternate forms of the name, or just because someone typed the name in from memory and got it wrong. The task is to go through the list and find all duplicates, i.e. strings that map to the same object, and record that mapping. There are on the order of 1000 strings, and roughly a third of them will be duplicates; many of those are exact matches, but many are not.

First of all, is there some sort of data-entry tool or database editor that will already handle something like this? Or should I implement either a web servlet or a standalone GUI? Or is munging a flat text file by hand in Emacs really the best way to go about it? (I've done it this way several times before, but it's really tedious.)

I have access to a database that has canonical names for some subset of the objects, and can do fuzzy matching, returning a list of candidates for a given search string along with a similarity rating (from 1% to 100%). The problem is that since not all the objects are in the database, just picking the most similar object for a string will often give the wrong object. In these cases it's usually easy to see that it's the wrong object, but it still needs to be checked by hand. Is there any point to using this database, or will it just complicate the task without really saving any time?

Anyway, sorry if this is all a bit vague. I'm trying to abstract away as much of the details as possible, because those have been just making me more confused about how to go about this problem. Feel free to ask clarifying questions.

From: [identity profile] jfb.livejournal.com


My first thoughts: You're going to have to do some manual checking, but you can make it easier. Do the fuzzy match for all the (distinct) strings. Order them by ascending similarity rating of the best match (i.e., the first string in this order is the one whose best match is weakest, the one with no close matches). Present them to the user in this order (in batches of, say, 20, if that's easy to do). For each, show the user the closest match and its similarity rating (to clue the user in on how good this batch of matches will be), along with the other candidates and their ratings, and allow them to choose a candidate, accept the string verbatim, enter a new canonical name, or decline to answer.
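In outline it might look like this; a rough sketch in Scheme, since that's the language in play further down. Here fuzzy-match is a stand-in for whatever the remote service provides, assumed to return a list of (candidate-name . rating) pairs with the best match first.

    (define (best-similarity matches)
      ;; rating of the closest candidate; 0 if the service found nothing
      (if (null? matches) 0 (cdr (car matches))))

    (define (order-for-review strings fuzzy-match)
      ;; pair each distinct string with its candidate list, then sort by
      ;; ascending best-match rating, so the weakest matches come first
      ;; (sort here is PLT Scheme's (sort lst less-than?))
      (sort (map (lambda (s) (cons s (fuzzy-match s))) strings)
            (lambda (a b)
              (< (best-similarity (cdr a)) (best-similarity (cdr b))))))

    (define (batches lst n)
      ;; cut the ordered list into pages of n strings for review
      (if (null? lst)
          '()
          (let loop ((page '()) (rest lst) (k n))
            (if (or (null? rest) (zero? k))
                (cons (reverse page) (batches rest n))
                (loop (cons (car rest) page) (cdr rest) (- k 1))))))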

[I tried to do a mockup here but LJ wouldn't let me. It looked like a text box with "The Trouble with Charlie" in it, followed by a select box that contained "91% - The Trouble with Harry", "85% - The Truth about Charlie", "new name", and "no match". Something like that.]
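For concreteness, the select-box labels for one row could be built from the candidate list with a small helper (option-labels is hypothetical; the ratings are whatever the service reports):

    (define (option-labels candidates)
      ;; candidates is a list of (name . rating) pairs; yields labels
      ;; like "91% - The Trouble with Harry", plus the two special
      ;; choices at the end
      (append
       (map (lambda (c)
              (string-append (number->string (cdr c)) "% - " (car c)))
            candidates)
       (list "new name" "no match")))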

What you have then is a system that starts with the strings that (a) are going to require manual checking anyway and (b) are most likely to generate/confirm canonical names that are not yet in the database. When the user submits the form, for each string, if one of the candidates is selected, note the mapping. If "no match" is selected, do nothing. If "new name" is selected, add the new name to the database, note the mapping, and redo the fuzzy matching for all unmapped strings.
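Sketched out, with note-mapping!, add-canonical-name!, and refresh-matches! as hypothetical stand-ins for whatever storage ends up behind the form:

    ;; Each answer is (string . choice), where choice is either a
    ;; candidate name (a string), a pair ('new-name . typed-in-text),
    ;; or the symbol 'no-match.
    (define (handle-submission answers)
      (for-each
       (lambda (answer)
         (let ((str (car answer))
               (choice (cdr answer)))
           (cond ((eq? choice 'no-match)
                  ;; leave the string unmapped for a later batch
                  (void))
                 ((and (pair? choice) (eq? (car choice) 'new-name))
                  ;; user typed in a new canonical name
                  (add-canonical-name! (cdr choice))
                  (note-mapping! str (cdr choice)))
                 (else
                  ;; user picked one of the fuzzy-match candidates
                  (note-mapping! str choice)))))
       answers)
      ;; the new names may now match strings still in the queue,
      ;; so redo the fuzzy matching for everything unmapped
      (refresh-matches!))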

That way as you add new names to the database, the other misspellings ("Trouble with Charlie, The") get matched with them. As the user goes through the list, the similarity ratings will get higher and higher (because the low similarity ones will be done already, and because the system will have more canonical names to match to). Eventually, if the fuzzy matching is any good, it will get to a sufficiently high similarity rating, and sufficiently correct results, that the user will just be clicking the submit button on every page.

This is probably more work than it's worth for a thousand strings, but if this is, say, something that you do every year, it might be useful.

(The "no match" idea is probably not worth implementing; the idea was that if there were several close matches to the same new name on the same page, the user might want to enter the new name for one match, and then submit the form to get the fuzzy matching redone. But you'd still have to pick "no match" for each string you wanted to leave unmatched, and it's probably not worth the added complexity for the marginal hypothetical convenience.)

From: [identity profile] dougo.livejournal.com


Thanks, that's actually pretty close to what I was thinking. The problem is that the database with the fuzzy match interface is actually a remote web service, and I can't really just add objects to it as I go. But I can probably steal their fuzzy match code and apply it to my local data. I hadn't thought about incrementally redoing the fuzzy match, that's a nice idea; I was just thinking of doing it in one batch and then figuring out how to clean up the results.
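If borrowing their code turns out to be awkward, a case-folded, normalized Levenshtein distance is one common stand-in for the local data (this is just an assumption for the sketch, not what MusicBrainz actually uses):

    (define (levenshtein a b)
      ;; edit distance via dynamic programming over two rows;
      ;; case is folded so "the trouble WITH charlie" still matches
      (let* ((a (string-downcase a))
             (b (string-downcase b))
             (la (string-length a))
             (lb (string-length b))
             (prev (make-vector (+ lb 1) 0))
             (curr (make-vector (+ lb 1) 0)))
        (do ((j 0 (+ j 1))) ((> j lb)) (vector-set! prev j j))
        (do ((i 1 (+ i 1))) ((> i la))
          (vector-set! curr 0 i)
          (do ((j 1 (+ j 1))) ((> j lb))
            (let ((cost (if (char=? (string-ref a (- i 1))
                                    (string-ref b (- j 1)))
                            0 1)))
              (vector-set! curr j
                           (min (+ (vector-ref curr (- j 1)) 1)
                                (+ (vector-ref prev j) 1)
                                (+ (vector-ref prev (- j 1)) cost)))))
          (let ((tmp prev)) (set! prev curr) (set! curr tmp)))
        (vector-ref prev lb)))

    (define (similarity a b)
      ;; scaled to the same 0-100 range as the remote service's rating
      (let ((longest (max (string-length a) (string-length b))))
        (if (zero? longest)
            100
            (round (* 100 (- 1 (/ (levenshtein a b) longest)))))))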

There's still the problem of not finding duplicates, though, when the fuzzy match fails even after one of the strings is hand-canonicalized. But maybe this will be a rare enough issue that I can do those merges by hand.

From: [identity profile] dougo.livejournal.com


Well, suppose one string doesn't match anything in the database, so I accept it as the canonical name. Then later I find another string that does match an object in the database, and it turns out the first string should also be mapped to this object. Somehow I have to go back and fix the first mapping.

From: [identity profile] jfb.livejournal.com


Ah. Yeah, I'd expect that to be rare enough that you're better off handling it manually than trying to design and implement a fix for it.

From: [identity profile] dougo.livejournal.com


I think maybe the thing to do is to redo the fuzzy match on the new name just entered, before assuming it's a brand new object. And then stick it onto the queue again if it's not an exact match.

Hm, I'm still a little wary of redoing the fuzzy match that many times, though; that's a quadratic amount of net traffic to the remote database. Maybe I should just insert the new name into the old matches where appropriate, assuming that the similarity rating is based solely on the two strings and not on a relative comparison of the candidates.
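Under that assumption, splicing a newly accepted name into the existing candidate lists costs one local similarity computation per unmapped string instead of another remote query each. Something like this, reusing the similarity procedure sketched above:

    (define (add-new-name queue new-name)
      ;; queue is a list of (string . candidates), each candidates list
      ;; being (name . rating) pairs sorted by descending rating.
      ;; Score new-name against every unmapped string locally and
      ;; splice it into that string's candidate list.
      (map (lambda (entry)
             (let* ((str (car entry))
                    (candidates (cdr entry))
                    (scored (cons new-name (similarity str new-name))))
               (cons str
                     (sort (cons scored candidates)
                           (lambda (a b) (> (cdr a) (cdr b)))))))
           queue))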

From: [identity profile] mshonle.livejournal.com


Cataloging your MP3s, eh? Try to interface with the CDDB, which has similar information. (I.e., you query it with a CD, and it will sometimes provide you a list from which to choose.)

For implementing this puppy, an Emacs plug-in might be easiest. Assuming you know elisp, which I don't (I can't believe it actually uses dynamic scope; ugh).

From: [identity profile] dougo.livejournal.com


I'm not cataloging my MP3s, but that's pretty close to what I'm doing. I'm using MusicBrainz rather than CDDB; it's an open database, and is better organized (and better moderated) than FreeDB. I've written some Scheme code to talk to their RDF query interface, which I will publish at some point (probably via PLaneT) when I have time to clean it up.

I'm not really comfortable enough with Elisp coding (especially the GUI) to do this there. It'll probably just be a PLT web server servlet, but I think I may have to upgrade from text files to a real DBMS, i.e. PostgreSQL, and talk to it with SchemeQL. But I know practically nothing about database stuff, so it may take me a while to get up to speed.