Republic of Mathematics blog

We should have been paying attention during Perl classes

Posted by: Gary Ernest Davis on: March 22, 2013

A joint post by Keith Resendes (histogramma.com) and Gary Davis

Image (tearing out hair) courtesy of http://translationbiz.wordpress.com

1. We found unemployment data in text format by metropolitan area, for several years (by month, in fact), at a Bureau of Labor Statistics site.
2. Great! we thought: let’s read this into R using the read.table function.
3. Uh, oh! there’s not actually a header row in the data.
4. Download the data as a text file and open in a text editor.
5. Remove the offending comment, which was pretending to be a header, and re-format the headers as single words.
6. Set path to file and use read.table to read in text file as a data frame.
7. Uh, oh! R doesn’t want to do that – there’s a problem with one of the columns.
8. Well duh! The 4th column has entries like “Anniston-Oxford, AL MSA”, which themselves contain the spaces and commas that act as separators elsewhere.
9. There is no consistency in the table, column to column, as to what constitutes a separator. Usually several spaces are inserted, but not always the same number, and sometimes commas are used instead.

10. The numbers contain commas as thousands separators, as in 34,562.
11. What a mess!
12. Who is using this data? Anyone?
13. Why hasn’t the Bureau of Labor Statistics cleaned up this data? Because no one’s using it?

14. We sort of know we could clean this up with Perl, but … as the title says … we weren’t paying attention during Perl classes.
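
Steps 8 to 10 are the kind of thing a short script can handle. Below is a minimal sketch in Python (standing in for the Perl we never learned), assuming that fields are separated by runs of two or more spaces while area names contain only single spaces. That assumption may not hold for every line of the real file. The sample lines, their values, and the function names (`clean_field`, `parse_line`, `to_csv`) are made up for illustration.

```python
import csv
import io
import re

# Hypothetical sample lines: the layout and values are invented for
# illustration, not copied from the Bureau of Labor Statistics file.
SAMPLE = """\
01   11500    Anniston-Oxford, AL MSA      2012   01     52,310    4,512
01   12220  Auburn-Opelika, AL MSA   2012   01   64,875      4,901
"""

# Assumption: two or more whitespace characters separate fields.
FIELD_GAP = re.compile(r"\s{2,}")
# Matches numbers written with thousands separators, such as 34,562.
NUMBER_WITH_COMMAS = re.compile(r"^\d{1,3}(,\d{3})+$")


def clean_field(field):
    """Strip whitespace and drop thousands separators from numeric fields."""
    field = field.strip()
    if NUMBER_WITH_COMMAS.match(field):
        return field.replace(",", "")
    return field


def parse_line(line):
    """Split one raw line into a list of cleaned fields."""
    return [clean_field(f) for f in FIELD_GAP.split(line.strip())]


def to_csv(text):
    """Convert raw text to CSV, quoting any field that contains a comma."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(parse_line(line))
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(SAMPLE), end="")
```

Because the area names end up quoted, R's `read.csv` should be able to read the result even though those names still contain commas. Any line where a single space separates two fields would break this sketch and need separate handling.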

2 Responses to "We should have been paying attention during Perl classes"

The ‘column’ command is used to manipulate column-delimited data, including properly formatting it for csv tools.

It’ll do cool things like, just give me columns 4, 2 and 3-7.
It’s not standard on all unixes, but is on most Linuxes.
http://stackoverflow.com/questions/12133062/formatting-lists-with-the-column-command-in-nix

Thanks so much. Our problem is getting the poorly formatted data into R in the first place. There are no “columns” on the Bureau of Labor Statistics website – just entries separated, haphazardly it seems, with spaces and commas, mingled with numerical text containing commas. We need to munge the data before getting it into R, and that’s where Perl comes in. We didn’t think of Unix because we didn’t even go to Unix classes!
