Ad Hoc Data Analysis From The Unix Command Line/Rewriting The Data With Inline perl
“ | I'm reminded of the day my daughter came in, looked over my shoulder at some Perl 4 code, and said, 'What is that, swearing?' | ” |
—Larry Wall |
Command Line perl
[edit | edit source]A tutorial on perl is beyond the scope of this document; if you don't know perl, you should learn at least a little bit. If you invoke perl like perl -n -e '#a perl statement' the -n option causes perl to wrap your -e argument in a implicit while loop like this:
while (<>) { # a perl statement }
This loop reads standard input a line at a time into the variable $_, and then executes the statement(s) give by the -e argument. Given -p instead of -n, perl to adds a print statement to the loop as well:
while (<>) { # a perl statement print $_; }
Example - Using perl to create an indicator variable
[edit | edit source]Education level is recorded in columns 53-54 as ordered set of categories, where 11 and above indicates a college degree. Let's condense this to a single indicator variable for completed college or not. The raw data:
$ cat pums_53.dat | grep "^P" | cut -c53-54 | head -5 12 11 06 03 08
And once passed through the perl script:
$ cat pums_53.dat | grep "^P" | cut -c53-54 | perl -ne 'print $_>=11?1:0,"\n"' | head -5 1 1 0 0 0
And the final result:
~/census_data>cat pums_53.dat | grep "^P" | cut -c53-54 | perl -ne 'print $_>=11?1:0,"\n"' | sort | uniq -c 37507 0 21643 1
About 36% of Washingtonians have a college degree.
Example - computing conditional probability of membership in two sets
[edit | edit source]Let's look at the relationship between education level and whether or not people ride their bikes to work. People's mode of transportation to work is encoded as a series of categories in columns 191-192, where category 9 indicates a bicycle. We'll use an inline perl script to rewrite both education level and mode of transportation:
$ cat pums_53.dat | grep "^P" | cut -c53-54,191-192 | perl -ne 'print substr($_,0,2)>=11?1:0,substr($_,2,2)==9?1:0,"\n";' | sort | uniq -c 37452 00 55 01 21532 10 111 11
55/(55+36447) = 0.15% of non college educated people ride their bike to work. 111/(111+20219) = 0.56% of college educated people ride their bike to work.
Sociological interpretation is left as an exercise for the reader.
Example - A histogram with custom bucket size
[edit | edit source]Suppose we wanted to take a look at distribution of personal incomes. The normal trick of sort and uniq would work, but the personal income in the census data has resolution down to the $10 level, so the output would be very long and it would be hard to quickly see the pattern. We can use perl to round the income data down to the nearest $10,000 on the fly. Before the inline perl script:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | head -4 0018000 0004100 0004300 0005300
And after:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | perl -pe '$_=10000*int($_/10000)."\n"' | head -4 10000 0 0 0
And finally, the distribution (up to $100,000). The extra grep [0-9] ensures that blank records are not considered in the distribution.
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | perl -pe '$_=10000*int($_/10000)."\n"' | sort -n | uniq -c | head -12 20 -10000 15193 0 8038 10000 6776 20000 5436 30000 3685 40000 2370 50000 1536 60000 899 70000 521 80000 326 90000 283 100000
Example - Finding the median (or any percentile) of a distribution
[edit | edit source]If we sort all the incomes in order and had a way to pluck out the middle number, we could easily get the median. I'll give two ways to do this. The first uses cat -n. If given the -n option, cat prepends line numbers to each line. We see that there are 46,359 non blank records, so the 23179th one in sorted order is the median.
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | wc -l 46359 $ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort | cat -n | grep "^ *23179" 23179 0019900
An even simpler method, using head and tail:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort | head -23179| tail -1 0019900
The median income in Washington state in 2000 was $19,900.
Example - Finding the average of a distribution
[edit | edit source]What about the average? One way to compute the average is to accumulate a running sum with perl, and do the division by hand at the end:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | perl -ne 'print $sum+=$_,"\n";' | cat -n | tail -1 46359 1314603988
$1314603988/ 46359 = $28357.0393666818
You could also get perl to do this division with an END block which perl will execute only after it has exhausted standard input:
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | perl -ne '$sum += $_; $count++; END {print $sum/$count,"\n";}' 28357.0393666818