Making Messy Data Useful


I’m currently working with a data set from the Adjunct Project that began from a simple, brilliant, idea – use social media to ask adjunct faculty input their pay and benefits into a collective Google Spreadsheet. Over a couple of weeks the spreadsheet was populated with nearly 2000 responses, providing up to date information from a wide range of schools and regions, but users created data that mixed text and numbers, offered caveats and stories, and, even for simple questions like “state” used wildly different naming conventions. When respondents can upload information to a form or an open spreadsheet, they may not always respond in terms that allow for comparison.

My goal is to create a few data visualizations that will help people make sense of the data, and in that process I stumbled upon Google Refine, which is a new and in my opinion, invaluable tool for cleaning up messy data.

Refine allows you to identify inconsistencies of language and conflicting units of measurement and to easily batch edit cells by offering text and numeric facets to group cells and clusters similar groups for cells for comparison. For example, it quickly identifies “University of Michigan” “UofM” “Univ. Mich” as similar, and even shows me a group of all of University of Michigan’s campuses that appear in the spreadsheet.

Refine is also useful for recoding data into the format you need. For example, one column of the spreadsheet asks about “pay per 3 credit course.” Using the clustering feature, I can group, or recode the amounts into more meaningful categories that allow for comparison (e.g., less than $2000, $2001-4000, etc.).

If you want to check it out for yourself, Google offers a few introductory videos here.