DH2012: Free Your Metadata tutorial (pre-conference)
I enjoyed this tutorial from the Free Your Metadata group. It was a genuinely hands-on, valuable workshop on using Google Refine to clean and refine metadata, and it was very well run (apparently because this team has had plenty of practice running these workshops for libraries). They started with a use case based on real data from the Sydney Powerhouse Museum and then demonstrated some of the power of Google Refine for cleaning messy data, working with a subset of the freely available museum data (which you can download and work with, following the steps they have documented in a nice tutorial). They showed how to identify and remove blank rows, identify and remove duplicate rows, split multi-value cells, facet the results to identify variant terms or outliers (e.g., categories used for almost everything, or categories used on only one or two items), and then use Google Refine's cluster function to group and collapse variant or inconsistent terms (e.g., upper/lower case variants of a single subject term). Once you have cleaned the data, you can reconcile terms against a controlled vocabulary by using an RDF extension to Google Refine and exposing your vocabulary of choice (e.g., LCSH) as a SPARQL endpoint. Their suggested workflow is to import your data, clean, refine, and reconcile it, work with it for a week or so to check everything, and then export it and (presumably) overwrite your original data with the new, refined version. The point of all of this is that you should be able to take care of large chunks of records with similar problems that can be handled systematically, so that only the outliers, which should be a small minority, need to be cleaned manually.
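To make the cleaning steps above concrete, here is a minimal sketch of the same sequence in plain Python. It is not how Google Refine implements any of this. The column name `categories`, the `|` separator, and the sample rows are assumptions made up for illustration, and the `fingerprint` function is only loosely modeled on the idea of key-collision clustering (normalize each term to a key, then group terms that share a key).

```python
import re
from collections import Counter, defaultdict


def drop_blank_rows(rows):
    """Keep only rows that have at least one non-empty cell."""
    return [r for r in rows if any(str(v).strip() for v in r.values())]


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def split_values(cell, sep="|"):
    """Split a multi-value cell into trimmed, non-empty values."""
    return [v.strip() for v in cell.split(sep) if v.strip()]


def facet(rows, column, sep="|"):
    """Count how often each distinct value appears in a column."""
    counts = Counter()
    for r in rows:
        counts.update(split_values(r.get(column, ""), sep))
    return counts


def fingerprint(term):
    """Normalize a term: lowercase, strip punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", term.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster(terms):
    """Group terms that share a fingerprint; return groups with variants."""
    groups = defaultdict(set)
    for t in terms:
        groups[fingerprint(t)].add(t)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical sample data (not taken from the museum export).
rows = [
    {"id": "1", "categories": "Glass plate negatives|Photographs"},
    {"id": "", "categories": ""},
    {"id": "1", "categories": "Glass plate negatives|Photographs"},
    {"id": "2", "categories": "photographs | Numismatics"},
]
rows = drop_duplicate_rows(drop_blank_rows(rows))
counts = facet(rows, "categories")
clusters = cluster(counts)
print(counts.most_common())
print(clusters)
```

The facet counts are what let you spot outliers at either end (terms used nearly everywhere, or only once or twice), and the clusters are the candidates for collapsing into a single preferred term.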
Google Refine seems like a pretty cool tool, but I'm not sure how much it gives me, or what it does that I, as a developer, couldn't script for myself. Most of the metadata we work with in the library is in various XML formats, so to really take advantage of what Google Refine offers, I think I would have to script some kind of export to load the data into Google Refine in semi-tabular form, and then figure out a way to get the results back out and merge them into the original data. I should say that Google Refine does support XML to some extent, but it's not clear to me how far that goes, and if we want to work with data on a large scale (as Refine is intended to), importing XML files individually is not going to be very practical. There are APIs for some parts of Google Refine, but I feel like I could probably script the features I care about directly (e.g., SPARQL queries against known vocabularies to reconcile local terms with authoritative terms). Google Refine is more likely to be something a metadata librarian or content expert could use (it looks enough like a spreadsheet that most users would probably feel comfortable with it and be able to learn the functions they care about), so it might be worth the trouble in some cases, but I'm not entirely convinced, or at least not sure where we could immediately take advantage of what it offers.
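For a sense of what that export-and-merge scripting might involve, here is a rough sketch using Python's standard `xml.etree.ElementTree`. The `<records>`, `<record>`, and `<subject>` element names, the `id` attribute, and the `preferred` mapping are all hypothetical stand-ins; real library XML would need namespace handling and a more careful way of addressing each value.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE = """<records>
  <record id="r1"><subject>photographs</subject><subject>Numismatics</subject></record>
  <record id="r2"><subject>Photographs</subject></record>
</records>"""


def to_rows(root, record_tag="record", field_tag="subject"):
    """Flatten each field element into a row: record id, position, text."""
    rows = []
    for rec in root.iter(record_tag):
        for pos, el in enumerate(rec.findall(field_tag)):
            rows.append({
                "record": rec.get("id"),
                "pos": pos,
                "value": (el.text or "").strip(),
            })
    return rows


def apply_rows(root, rows, record_tag="record", field_tag="subject"):
    """Write cleaned values back into the elements they were read from."""
    by_id = {rec.get("id"): rec for rec in root.iter(record_tag)}
    for row in rows:
        el = by_id[row["record"]].findall(field_tag)[row["pos"]]
        el.text = row["value"]


root = ET.fromstring(SAMPLE)
rows = to_rows(root)

# Stand-in for the cleaning or reconciliation step: map variants to a
# preferred form (this mapping is invented for the example).
preferred = {"photographs": "Photographs"}
for row in rows:
    row["value"] = preferred.get(row["value"], row["value"])

apply_rows(root, rows)
print(ET.tostring(root, encoding="unicode"))
```

Carrying the record id and position in every row is what makes the merge back possible; if rows were added, removed, or reordered during cleaning, a positional key like this would no longer be enough.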