We should have been paying attention during Perl classes

A joint post by Keith Resendes (histogramma.com) and Gary Davis

Image courtesy of http://translationbiz.wordpress.com


1. We found unemployment data by metropolitan area, in text format, at a Bureau of Labor Statistics site. It covers several years, month by month.
2. Great! we thought: let’s read this into R using the read.table function.
3. Uh oh! There’s not actually a header row in the data.
4. Download the data as a text file and open in a text editor.
5. Remove the offending comment that pretends to be a header, and reformat the column headers as single words.
6. Set path to file and use read.table to read in text file as a data frame.
7. Uh oh! R doesn’t want to do that – there’s a problem with one of the columns.
8. Well duh! The 4th column has entries like “Anniston-Oxford, AL MSA”, which contain the very spaces and commas that read.table could take for separators.
9. The table has no consistent separator from column to column. Mostly a run of several spaces is used, but not always the same number of them, and sometimes commas are used instead.
10. The numbers contain commas as thousands separators, as in 34,562.
11. What a mess!
12. Who is using this data? Anyone?
13. Why hasn’t the Bureau of Labor Statistics cleaned up this data? Because no one’s using it?
14. We sort of know we could clean this up with Perl, but … as the title says … we weren’t paying attention during Perl classes.
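
For what it's worth, the kind of cleanup described in the steps above might look something like the sketch below. It is written in Python rather than Perl or R, and it rests on an assumption we have not verified against the whole file: that columns are always separated by at least two spaces, while a single space or a comma inside an area name is part of the value. The sample rows, the column layout and the function names (parse_line, parse_text) are invented for illustration and are not real BLS figures.

```python
import re

# Hypothetical sample rows, invented for illustration; not real BLS data.
# Note the uneven runs of spaces between columns and the commas in numbers.
SAMPLE = """\
01   11500   Anniston-Oxford, AL MSA    2010   01     52,345     6,333   12.1
01   12220   Auburn-Opelika, AL MSA   2010   01   65,870   5,969   9.1
"""

def parse_line(line):
    """Split one row on runs of two or more spaces, then tidy each field."""
    fields = re.split(r"\s{2,}", line.strip())
    cleaned = []
    for field in fields:
        # Drop thousands separators only from purely numeric fields like 34,562,
        # so the comma in "Anniston-Oxford, AL MSA" is left alone.
        if re.fullmatch(r"\d{1,3}(,\d{3})+", field):
            cleaned.append(field.replace(",", ""))
        else:
            cleaned.append(field)
    return cleaned

def parse_text(text):
    """Parse every non-blank line of the text into a list of fields."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]

rows = parse_text(SAMPLE)
for row in rows:
    print(row)
```

If the two-space assumption holds, the rows could then be written out with Python's csv module (which quotes fields containing commas) and read into R with read.csv. If some columns turn out to be separated by a single space or a comma, the split pattern would need more thought.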