Ad Hoc Data Analysis From The Unix Command Line/Printable version



Ad Hoc Data Analysis From The Unix Command Line

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_Unix_Command_Line

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.


Preliminaries

Formatting

These typesetting conventions will be used when presenting example interactions at the command line:

$ command argument1 argument2 argument3 
output line 1 
output line 2 
output line 3 
[...]

The "$ " is the shell prompt. What you type is shown in boldface; command output is in regular type.

Example data

I will use the following sample files in the examples.

The Unix password file

The password file can be found in /etc/passwd. Every user on the system has one line (record) in the file. Each record has seven fields separated by colon (':') characters. The fields are username, encrypted password, userid, default group, comment (usually the user's full name), home directory and default shell. We can look at the first few lines with the head command, which prints just the first few lines of a file. Correspondingly, the tail command prints just the last few lines.

$ head -5 /etc/passwd 
root:x:0:0:root:/:/bin/bash 
bin:x:1:1:bin:/bin:/sbin/nologin 
daemon:x:2:2:daemon:/sbin:/sbin/nologin 
adm:x:3:4:adm:/var/adm:/sbin/nologin 
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin

Census data

The US Census releases Public Use Microdata Samples (PUMS) on its website. We will use the 1% sample of Washington state's data, the file pums_53.dat, which can be downloaded here

$ head -2 pums_53.dat 
H000011715349 53010 99979997 70 15872 639800 120020103814700280300000300409 
02040201010103020 0 0 014000000100001000 0100650020 0 0 0 0 0000 0 0 0 0 0 
05000000000004400000000010 76703521100000002640000000000
P00001170100001401000010420010110000010147030400100012005003202200000 005301000
000300530 53079 53 7602 76002020202020202200000400000000000000010005 30 53010
70 9997 99970101006100200000001047904431M 701049-20116010 520460000000001800000
00000000000000000000000000000000000000001800000018000208

Important note: The format of this data file is described in an Excel spreadsheet that can be downloaded here.

Developer efficiency vs. computer efficiency

The techniques discussed here are usually extremely efficient in terms of developer time, but generally less efficient in terms of compute resources (CPU, I/O, memory). This kind of brute force and ignorance may be inelegant, but when you don't yet understand the scope of your problem, it is usually better to spend 30 seconds writing a program that will run for 3 hours than vice versa.

The online manual

The "man" command displays information about a given command (colloquially referred to as the command's "man page"). The online man pages are an extremely valuable resource; if you do any serious work with the commands presented here, you'll eventually read all their man pages top to bottom. In Unix literature the man page for a command (or function, or file) is typically referred to as command(n). The number "n" specifies a section of the manual to disambiguate entries which exist in multiple sections. So, passwd(1) is the man page for the passwd command, and passwd(5) is the man page for the passwd file. On a Linux system you ask for a certain section of the manual by giving the section number as the first argument as in "man 5 passwd". Here's what the man command has to say about itself:

$ man man 
man(1)                                                        man(1) 
NAME
       man - format and display the on-line manual pages 
       manpath - determine user's search path for man pages 

SYNOPSIS 
       man [-acdfFhkKtwW] [--path] [-m system] [-p string] [-C 
       config_file] [-M pathlist] [-P pager] [-S section_list] 
       [section] name ... 

DESCRIPTION 
       man formats and displays the on-line manual pages. If you 
       specify section, man only looks in that section of the 
       manual. name is normally the name of the manual page, 
       which is typically the name of a command, function, or 
       file. [...]



Standard Input, Standard Output, Redirection and Pipes

"This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."—Doug McIlroy, the inventor of Unix pipes

The commands I'm going to talk about here are called filters. Data passes through them and they modify it a bit on the way. These commands read data from their "standard input" and write data to their "standard output." By default, standard input is your keyboard and standard output is your screen. For example, the tr command is a filter that translates one set of characters to another. This invocation of tr turns all lower case characters to upper case characters:

$ tr "[:lower:]" "[:upper:]"
hello
HELLO
i feel like shouting
I FEEL LIKE SHOUTING 
[ctrl-d]

Ctrl-d is how you tell the command from the keyboard that you're done entering input.

You can tell your shell to connect standard output to a file instead of your screen using the ">" operator. The term for this is "redirection". One would talk about "redirecting" tr's output to a file. Later you can use the cat command to write the file to your screen.

$ tr a-z A-Z > tr_output
this is a test
[ctrl-d]
$ cat tr_output
THIS IS A TEST

Many Unix commands that take a file as an argument will read from standard input if not given a file. For example, the grep command searches a file for a string and prints the lines that match. If I wanted to find my entry in the password file I might say:

$ grep jrauser /etc/passwd
jrauser:x:7777:100:John Rauser:/home/jrauser:/bin/bash

But I could also redirect a file to grep's standard input using the "<" operator. You can see that the "<" and ">" operators are like little arrows that indicate the flow of data.

$ grep jrauser < /etc/passwd
jrauser:x:7777:100:John Rauser:/home/jrauser:/bin/bash

You can use the pipe "|" operator to connect the standard output of one command to the standard input of the next. The cat command reads a file and writes it to its standard output, so yet another way to find my entry in the password file is:

$ cat /etc/passwd | grep jrauser
jrauser:x:7777:100:John Rauser:/home/jrauser:/bin/bash

For a slightly more interesting example, the mail command will send a message it reads from standard input. Let's send my entry in the password file to me in an email.

$ cat /etc/passwd | grep jrauser | mail -s "passwd entry" jrauser@example.com

Using output with headers

In many situations, you end up with output that has a first line that is a header describing the data, and subsequent lines that are the data. An example is ps:

$ ps | head -5
  PID TTY           TIME CMD
22313 ttys000    0:00.86 -bash
31537 ttys001    0:00.06 -bash
22341 ttys002    0:00.28 -bash
70093 ttys002    0:00.00 head -5

If you wish to manipulate the data but not the header, use tail with the -n switch to start at line 2. For example:

$ ps | tail -n +2 | grep bash | head -5
22313 ttys000    0:00.86 -bash
31537 ttys001    0:00.06 -bash
22341 ttys002    0:00.28 -bash
70120 ttys002    0:00.00 -bash

This output shows only "bash" processes (because of grep), with the header line skipped (because of tail).
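If you would rather keep the header and filter only the data lines, awk can pass line 1 through untouched. This is a minimal sketch that uses a made-up process listing in place of live ps output; the variable names listing and filtered are illustrative only.

```shell
# Made-up listing standing in for the output of ps.
listing='  PID TTY           TIME CMD
22313 ttys000    0:00.86 -bash
31537 ttys001    0:00.06 vim notes.txt
22341 ttys002    0:00.28 -bash'

# NR == 1 is true for the header line; /bash/ is true for lines containing "bash".
# awk prints every line for which either condition holds.
filtered=$(printf '%s\n' "$listing" | awk 'NR == 1 || /bash/')
printf '%s\n' "$filtered"
```

With real data the same filter would be written as ps | awk 'NR == 1 || /bash/'.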


Counting Part 1 - grep and wc

"90% of data analysis is counting" - John Rauser

...well, at least once you've figured out the right question to ask, which is, perhaps, the other 90%.

Example - Counting the size of a population

The simplest command for counting things is wc, which stands for word count. By default, wc prints the number of lines, words, and bytes in a file.

$ wc pums_53.dat
85025 1219861 25659175 pums_53.dat

Nearly always we just want to count the number of lines (records), which can be done by giving the -l option to wc.

$ wc -l pums_53.dat
85025 pums_53.dat

Example - Using grep to select a subset

So, recalling that this is a 1% sample, there were 8.5 million people in Washington as of the 2000 census? Nope, the census data has two kinds of records, one for households and one for persons. The first character of a record, an H or P, indicates which kind of record it is. We can grep for and count just person records like this:

$ grep -c "^P" pums_53.dat
59150

The caret '^' means that the 'P' must occur at the beginning of the line. So there were about 5.9 million people in Washington State in 2000. Also interesting, the average household had 59,150/(85,025-59,150) = 2.3 people.
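The division can also be left to the machine. This is a sketch only: it assumes a tiny made-up file in which, as in pums_53.dat, every line begins with H or P, and it uses awk purely for the floating-point division. The variable names are illustrative.

```shell
# Made-up sample: 2 household records and 5 person records.
sample=$(mktemp)
printf '%s\n' H001 P001a P001b H002 P002a P002b P002c > "$sample"

persons=$(grep -c '^P' "$sample")
households=$(grep -c '^H' "$sample")

# Shell arithmetic is integer-only, so hand the division to awk.
avg=$(awk -v p="$persons" -v h="$households" 'BEGIN { printf "%.1f\n", p / h }')
echo "$persons persons / $households households = $avg per household"

rm -f "$sample"
```

Counting household records directly with grep -c "^H" gives the same figure as subtracting the person count from the total line count.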




Picking The Data Apart With cut

Fixed width data

How many households had just 1 person? Referring to the file layout, we see that the 106th and 107th characters of a household record indicate the number of people in the household. We can use the cut command to pull out just that bit of data from each record. The argument -c106-107 instructs cut to print the 106th through 107th characters of each line. The head command prints just the first few lines of a file (or its standard input).

$ grep "^H" pums_53.dat | cut -c106-107 | head -5
03 
02 
03 
02 
02

You can give cut a comma separated list to pull out multiple ranges. To see the household id along with the number of occupants of the household:

$ grep "^H" pums_53.dat | cut -c2-8,106-107 | head -5
000011703 
000024602 
000231203 
000242102 
000250202

The -c argument is used for working with so-called "fixed-width" data, that is, data in which the columns of a record are found at fixed offsets from the beginning of the record. Fixed-width data abounds on a Unix system. ls -l writes its output in a fixed-width format:

$ ls -l /etc | head -5
total 6548 
-rw-r--r--    1 root     root          46 Dec  4 12:23 adjtime 
drwxr-xr-x    4 root     root        4096 Oct  8  2003 alchemist 
-rw-r--r--    1 root     root        1048 Aug 31  2001 aliases 
-rw-r--r--    1 root     root       12288 Oct  8  2003 aliases.db

As does ps:

$ ps -u
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND 
jrauser  26870  0.0  0.1  2576 1388 pts/0    S    09:45   0:00 /bin/bash 
jrauser   8943  0.0  0.0  2820  880 pts/0    R    12:58   0:00 ps -u

Returning to the question of how many one-person households there are in Washington:

$ grep "^H" pums_53.dat  | cut -c106-107 | grep -c 01
7192

7,192, or about 28% of households have only one occupant.

Delimited data

In delimited data, elements of a record are separated by a special 'delimiter' character. In the password file, fields are delimited by colons:

$ head -5 /etc/passwd
root:x:0:0:root:/:/bin/bash 
bin:x:1:1:bin:/bin:/sbin/nologin 
daemon:x:2:2:daemon:/sbin:/sbin/nologin 
adm:x:3:4:adm:/var/adm:/sbin/nologin 
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin

The 7th column of the password file is the user's login shell. How many people use bash as their shell?

$ cut -d: -f7 /etc/passwd | grep -c /bin/bash 
170

You can give either -c or -f a comma separated list, so to see a few users that use tcsh as their shell:

$ cut -d: -f1,7 /etc/passwd | grep /bin/tcsh | head -5
iglass:/bin/tcsh
svowell:/bin/tcsh
dsedaris:/bin/tcsh
skine:/bin/tcsh
jhitt:/bin/tcsh

Tricky delimiters

The space character is a common delimiter. Unfortunately, your shell probably discards all extra whitespace on the command line. You can sneak a space character past your shell by wrapping it in quotes, like: cut -d" " -f 5

The tab character is another common delimiter. It can be hard to spot, because on the screen it just looks like any other white space. The od (octal dump) command can give you insight into the precise formatting of a file. For instance I have a file which maps first names to genders (with 95% probability). When casually inspected, it looks like fixed width data:

$ head -5 gender.txt
AARON           M 
ABBEY           F 
ABBIE           F 
ABBY            F 
ABDUL           M

But on closer inspection there are tab characters delimiting the columns:

$ od -bc gender.txt | head
0000000 101 101 122 117 116 040 040 040 040 040 040 011 115 012 101 102 
          A   A   R   O   N                          \t   M   \n  A   B 
0000020 102 105 131 040 040 040 040 040 040 011 106 012 101 102 102 111 
          B   E   Y                          \t   F  \n   A   B   B   I 
0000040 105 040 040 040 040 040 040 011 106 012 101 102 102 131 040 040 
          E                          \t   F  \n   A   B   B   Y 
0000060 040 040 040 040 040 011 106 012 101 102 104 125 114 040 040 040 
                             \t   F  \n   A   B   D   U   L 
0000100 040 040 040 011 115 012 101 102 105 040 040 040 040 040 040 040 
                     \t   M  \n   A   B   E

The first thing to do is read your system's man page for cut: tab is normally the default delimiter, in which case no -d option is needed. If not, it requires a bit of trickery to get a tab character past your shell to the cut command. First, many shells have a feature called tab completion; when you hit tab they don't actually insert a tab, instead they attempt to figure out which file, directory or command you want and type that instead. In many shells you can overcome this special functionality by typing a control-v first. Whatever character you type after the control-v is literally inserted. Like a space character, you need to protect the tab character with quotes or the shell will discard it like any other white space separating pieces of the command line.

So to get the ratio of male first names to female first names I might run the following commands. Between the double quotes I typed control-v and then hit tab.

$ wc -l gender.txt
5017 gender.txt 
$ cut -d" " -f2 gender.txt | grep M | wc -l 
1051 
$ cut -d" " -f2 gender.txt | grep F | wc -l 
3966

Apparently there's much more variation in female names than male names.

If your system's cut command delimits on tab, the above command becomes simply cut -f2 gender.txt.
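Another way to hand cut a tab, without relying on control-v, is to let printf produce it and keep it in a variable. This is a sketch that uses a three-line made-up stand-in for gender.txt; the variable names tab and names are illustrative.

```shell
# Made-up stand-in for gender.txt: name, padding spaces, a tab, then M or F.
names=$(mktemp)
printf 'AARON      \tM\nABBEY      \tF\nABBIE      \tF\n' > "$names"

# Command substitution strips only trailing newlines, so the tab survives.
tab=$(printf '\t')

# Quoting "$tab" stops the shell from discarding it as ordinary white space.
females=$(cut -d"$tab" -f2 "$names" | grep -c F)
males=$(cut -d"$tab" -f2 "$names" | grep -c M)
echo "F=$females M=$males"

rm -f "$names"
```

The variable also makes the command line readable afterwards, since a literal tab typed between quotes looks just like a space.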




Joining The Data with join

Please note: join assumes that the input data is sorted on the key on which the join is going to take place.

Delimited data

In delimited data, elements of a record are separated by a special 'delimiter' character. In CSV files, fields are delimited by commas:

$ cat j1
1,a
1,b
2,c
2,d
2,e
3,f
3,g
4,h
4,i
5,j
$ cat j2
1,A
1,B
1,C
2,D
2,E
4,F
4,G
5,H
6,I
6,J
$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2
1,a,A
1,a,B
1,a,C
1,b,A
1,b,B
1,b,C
2,c,D
2,c,E
2,d,D
2,d,E
2,e,D
2,e,E
3,f,
3,g,
4,h,F
4,h,G
4,i,F
4,i,G
5,j,H
6,,I
6,,J

Explanation of options:

"-t ,"          Input and output field separator is "," (for CSV)
"-a 1"          Output a line for every line of j1 not matched in j2
"-a 2"          Output a line for every line of j2 not matched in j1
"-o 0,1.2,2.2"  Output field format specification:

                0    denotes the match (join) field (needed when using "-a")
                1.2  denotes field 2 from file 1 ("j1")
                2.2  denotes field 2 from file 2 ("j2")

Using both "-a 1" and "-a 2" creates a full outer join as in SQL.

This command must be given two and only two input files.

Multi-file Joins

To join several files you can loop through them.

$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2 > J

File "J" is now the full outer join of "j1", "j2".

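$ join -t , -a 1 -a 2 -o 0,1.2,1.3,2.2 J j3 > J.tmp && mv J.tmp J

and so on through j4, j5, ... Two details matter here. The output goes to a temporary file first, because redirecting straight to "J" would truncate it before join had read it. And the "-o" list has to name every field of "J" that should be kept, so it grows by one field with each file.

For many files this is best done with a loop. With GNU join, "-o auto" sizes the output from the first line of each file, so the field list need not be spelled out:

 $ for i in j3 j4 j5 ; do join -t , -a 1 -a 2 -o auto J "$i" > J.tmp && mv J.tmp J ; done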

Sorted Data Note

join assumes that the input data has been sorted by the field to be joined. See the section on sort, Counting Part 2 - sort and uniq, for details.
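A sketch of sorting two unsorted files on the join field before joining them. The file contents are made up, and LC_ALL=C is set on the assumption that sort and join should both use plain byte-order collation.

```shell
work=$(mktemp -d)
printf '3,f\n1,a\n2,c\n' > "$work/left"    # not sorted on the key
printf '2,D\n3,F\n1,A\n' > "$work/right"   # not sorted on the key

# Use the same collation for sort and join so they agree on the order.
export LC_ALL=C

# -t , sets the delimiter; -k 1,1 sorts on the first field only.
sort -t , -k 1,1 "$work/left"  > "$work/left.sorted"
sort -t , -k 1,1 "$work/right" > "$work/right.sorted"

joined=$(join -t , "$work/left.sorted" "$work/right.sorted")
printf '%s\n' "$joined"

rm -rf "$work"
```

Note that the sort is deliberately not numeric: join compares keys as strings, so a file sorted with sort -n (where 2 comes before 10) may be rejected as unsorted.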

Credits: Some text adapted from Ted Harding's email to the R mailing list.




Counting Part 2 - sort and uniq

So far we've seen how to use cut, grep and wc to select and count records with certain qualities. But each set of records we'd like to count requires a separate command, as with counting the numbers of male and female names in the most recent example. Combining the uniq and sort commands allows us to count many groups at once.

uniq and sort

The uniq command squashes out contiguous duplicate lines. That is, it copies from its standard input to its standard output, but if a line is identical to the immediately preceding line, the duplicate line is not written. For example:

$ cat foo
a 
a
a
b
b
a
a
a
c
$ uniq foo
a
b
a
c

Note that 'a' is written twice because uniq compares only to the immediately preceding line. If the data is sorted first, we get each distinct record just once:

$ sort foo | uniq
a
b
c

Finally, giving the -c option causes uniq to write counts associated with each distinct entry:

$ sort foo | uniq -c
6 a
2 b
1 c

Sorting a CSV file by an arbitrary column is easy as well:

$ cat file.csv
a, 10, 0.5
b, 20, 0.1
c, 14, 0.01
d, 55, 0.23
e, 94, 0.78
f, 1,  0.34
g, 75, 1.0
h, 3,  2.0
i, 12, 1.5
$ sort -n -t"," -k 2 file.csv
f, 1,  0.34
h, 3,  2.0
a, 10, 0.5
i, 12, 1.5
c, 14, 0.01
b, 20, 0.1
d, 55, 0.23
g, 75, 1.0
e, 94, 0.78
$ sort -n -t"," -k 3 file.csv
c, 14, 0.01
b, 20, 0.1
d, 55, 0.23
f, 1,  0.34
a, 10, 0.5
e, 94, 0.78
g, 75, 1.0
i, 12, 1.5
h, 3,  2.0

Example - Creating a frequency table

The combination of sort and uniq -c is extremely powerful. It allows one to create frequency tables from virtually any record oriented text data. Returning to the name to gender mapping of the previous chapter, we could have gotten the count of male and female names in one command like this:

$ cut -d" " -f2 gender.txt | sort | uniq -c
3966 F
1051 M

Example - Creating another frequency table

And returning to the census data, we can now easily compute the complete distribution of occupants per household:

$ grep "^H" pums_53.dat  | cut -c106-107 | sort | uniq -c
1796 00
7192 01
7890 02
3551 03
3195 04
1391 05
 518 06
 190 07
  79 08
  39 09
  14 10
  14 11
   3 12
   3 13

Example - Verifying a primary key

This is a good opportunity to point out a big benefit of being able to play with data in this fashion. It allows you to quickly spot potential problems in a dataset. In the above example, why are there 1,796 households with 0 occupants? As another example of quickly verifying the integrity of data, let's make sure that household id is truly a unique identifier:

$ grep "^H" pums_53.dat | cut -c2-8 | sort | uniq -c | grep -v "^ *1 " | wc -l
0

This grep invocation will print only lines that do not (because of the -v flag) begin with a series of spaces followed by a 1 (the count from uniq -c) followed by a space. Since the number of lines written is zero, we know that each household id occurs once and only once in the file.

The technique of grepping uniq's output for lines with a certain count is generally useful. One other common application is finding the set of overlapping (duplicated) keys in a pair of files by grepping the output of uniq -c for lines that begin with a 2.
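A sketch of that overlap check. It assumes two made-up key files in which no key repeats within a single file; otherwise a count of 2 could come from one file alone.

```shell
work=$(mktemp -d)
printf '%s\n' 101 102 103 104 > "$work/keys_a"
printf '%s\n' 103 104 105     > "$work/keys_b"

# Concatenate both files, count each key, and keep the keys counted twice.
# The awk step prints just the key (field 2 of uniq -c's "count key" output).
overlap=$(cat "$work/keys_a" "$work/keys_b" | sort | uniq -c | grep '^ *2 ' | awk '{ print $2 }')
printf '%s\n' "$overlap"

rm -rf "$work"
```

For this particular case uniq -d, which prints only repeated lines, would give the same keys; grepping the counts is the more general habit because it works for any count.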

Example - A frequency table sorted by most common category

Throwing an extra sort on the end of the pipeline will sort the frequency table so that the most common class is at the top (or bottom). This is useful when data is categorical and does not have a natural order. You'll want to give sort the -n option so that it sorts the counts numerically instead of lexically, and I like to give the -r option to reverse the sort so that the output is sorted in descending order, but this is just a stylistic issue. For example, here is the distribution of household heating fuel from most common to least common:

$ grep "^H" pums_53.dat | cut -c132 | sort | uniq -c | sort -rn
12074 3
 7007 1
 3161 
 1372 6 
 1281 4 
  757 2
  170 8
   43 9
    6 5
    4 7

Type 3, electricity, is most common, followed by type 1, gas. Type 7 is solar power.

Converting the frequency table to proper CSV

The output of uniq -c is not in proper CSV form. This makes it necessary to convert the output if further operations on it are wanted. Here we use a bit of inline perl to rewrite the lines and reverse the order of the fields.

$ cut -d" " -f2 gender.txt | sort | uniq -c | perl -pe 's/^\s*([0-9]+) (\S+).*/$2, $1/' 
F, 3966
M, 1051



Rewriting The Data With Inline perl

I'm reminded of the day my daughter came in, looked over my shoulder at some Perl 4 code, and said, 'What is that, swearing?'

—Larry Wall

Command Line perl

A tutorial on perl is beyond the scope of this document; if you don't know perl, you should learn at least a little bit. If you invoke perl like perl -n -e '#a perl statement' the -n option causes perl to wrap your -e argument in an implicit while loop like this:

while (<>) {
   # a perl statement
}

This loop reads standard input a line at a time into the variable $_, and then executes the statement(s) given by the -e argument. Given -p instead of -n, perl adds a print statement to the loop as well:

while (<>) {
   # a perl statement
   print $_;
}

Example - Using perl to create an indicator variable

Education level is recorded in columns 53-54 as an ordered set of categories, where 11 and above indicates a college degree. Let's condense this to a single indicator variable for completed college or not. The raw data:

$ cat pums_53.dat | grep "^P" | cut -c53-54 | head -5
12
11
06
03
08

And once passed through the perl script:

$ cat pums_53.dat | grep "^P" | cut -c53-54 | 
perl -ne 'print $_>=11?1:0,"\n"' | head -5
1
1
0
0
0

And the final result:

$ cat pums_53.dat | grep "^P" | cut -c53-54 |
perl -ne 'print $_>=11?1:0,"\n"' | sort | uniq -c
37507 0
21643 1

About 37% of Washingtonians (21,643 of 59,150 person records) have a college degree.

Example - computing conditional probability of membership in two sets

Let's look at the relationship between education level and whether or not people ride their bikes to work. People's mode of transportation to work is encoded as a series of categories in columns 191-192, where category 9 indicates a bicycle. We'll use an inline perl script to rewrite both education level and mode of transportation:

$ cat pums_53.dat | grep "^P" | cut -c53-54,191-192 | 
perl -ne 'print substr($_,0,2)>=11?1:0,substr($_,2,2)==9?1:0,"\n";' | 
sort | uniq -c 
37452 00
   55 01
21532 10
  111 11

55/(55+37452) = 0.15% of non college educated people ride their bike to work. 111/(111+21532) = 0.51% of college educated people ride their bike to work.

Sociological interpretation is left as an exercise for the reader.

Example - A histogram with custom bucket size

Suppose we wanted to take a look at the distribution of personal incomes. The normal trick of sort and uniq would work, but the personal income in the census data has resolution down to the $10 level, so the output would be very long and it would be hard to quickly see the pattern. We can use perl to round the income data down to the nearest $10,000 on the fly. Before the inline perl script:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | head -4
0018000
0004100
0004300
0005300

And after:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | 
perl -pe '$_=10000*int($_/10000)."\n"' | head -4
10000
0
0
0

And finally, the distribution (up to $100,000). The extra grep [0-9] ensures that blank records are not considered in the distribution.

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] |
perl -pe '$_=10000*int($_/10000)."\n"' | sort -n | uniq -c | head -12
   20 -10000
15193      0
 8038  10000
 6776  20000
 5436  30000
 3685  40000
 2370  50000
 1536  60000
  899  70000
  521  80000
  326  90000
  283 100000

Example - Finding the median (or any percentile) of a distribution

If we sort all the incomes in order and had a way to pluck out the middle number, we could easily get the median. I'll give two ways to do this. The first uses cat -n. If given the -n option, cat prepends line numbers to each line. We see that there are 46,359 non blank records, so the 23179th one in sorted order is the median.

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | wc -l
46359 
$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort | 
cat -n | grep "^ *23179"
23179 0019900

An even simpler method, using head and tail:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | sort |
head -23179| tail -1
0019900

The median income in Washington state in 2000 was $19,900.
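The same head and tail trick extends to any percentile. This is a sketch that assumes a made-up list of ten incomes and a hypothetical helper function named percentile. The rank is computed in integer arithmetic, rounding up, which is only one of several common conventions.

```shell
# percentile P FILE: print the value at the P-th percentile of the numbers in FILE.
percentile() {
  p=$1
  file=$2
  n=$(wc -l < "$file")
  # Rank = ceil(n * p / 100), computed in integer arithmetic; at least 1.
  k=$(( (n * p + 99) / 100 ))
  [ "$k" -lt 1 ] && k=1
  sort -n "$file" | head -n "$k" | tail -n 1
}

# Made-up incomes, one per line.
incomes=$(mktemp)
printf '%s\n' 52000 18000 4100 4300 5300 71000 23000 9800 36000 15000 > "$incomes"

median=$(percentile 50 "$incomes")
p90=$(percentile 90 "$incomes")
echo "median=$median p90=$p90"

rm -f "$incomes"
```

On the census data the function would be fed the output of the cut and grep pipeline, saved to a file first.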

Example - Finding the average of a distribution

What about the average? One way to compute the average is to accumulate a running sum with perl, and do the division by hand at the end:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | 
perl -ne 'print  $sum+=$_,"\n";' | cat -n | tail -1
46359 1314603988

$1314603988/ 46359 = $28357.0393666818

You could also get perl to do this division with an END block which perl will execute only after it has exhausted standard input:

$ cat pums_53.dat | grep "^P" | cut -c297-303 | grep [0-9] | 
perl -ne '$sum += $_; $count++; END {print $sum/$count,"\n";}' 
28357.0393666818



Quick Plotting With gnuplot

Example - creating a scatter plot

Does the early bird get the worm? Let's look at the relationship between the time a person leaves for work and their income. Income is recorded in columns 297-303, and the time a person leaves for work is recorded in columns 196-198, encoded in ten minute intervals. This pipeline extracts, cleans and formats the data:

$ cat pums_53.dat | grep "^P" | cut -c196-198,297-303 | grep -v "^000" | 
grep -v " $" | perl -pe 'substr($_,3,0)=" ";' > time_vs_income

The greps knock out records for which either field is null, and the perl script inserts a space between the two columns so gnuplot can parse the columns apart. Plotting in gnuplot is simple:

$ gnuplot

        G N U P L O T
        Linux version 3.7 patchlevel 1
        last modified Fri Oct 22 18:00:00 BST 1999

Terminal type set to 'x11'
gnuplot> plot 'time_vs_income' with points

And the resulting plot:

[Figure: Time vs income.gif, a scatter plot of departure time (x-axis) against income (y-axis)]

Recall that 0 on the x-axis is midnight, and 20 is 200 minutes after midnight or about 3:20am. Increased density in the beginning of the traditional 1st and 2nd shift periods is apparent. Folks who work regular business hours clearly have higher incomes. It would be interesting to compute the average income in each time bucket, but that makes a pretty hairy command line perl script. Here it is in all its gruesome glory:

$ cat pums_53.dat | grep "^P" | cut -c196-198,297-303 | grep -v "^000" | 
grep -v " $" | perl -ne '/(\d{3})(\d{7})/; $sum{$1}+=$2; $count{$1}++; END { foreach $k
(keys(%count)) {print $k," ",$sum{$k}/$count{$k},"\n"}}' | sort -n > time_vs_avgincome

You can plot the result for yourself if you're curious.

Example - Creating a bar chart with gnuplot

Let's look at historic immigration rates among Washingtonians. Year of immigration is recorded in columns 78-81, and 0000 means the person is a native born citizen. We can apply the usual tricks with cut, grep, sort, and uniq, but it's a bit hard to see the patterns when scrolling back and forth in text output; it would be nicer if we could see a plot.

$ cat pums_53.dat | grep "^P" | cut -c78-81 | grep -v 0000 | sort | uniq -c | head -10
 2 1910
 7 1914
12 1919
 7 1920
 6 1921
 5 1922
 7 1923
 5 1924
 8 1925

Gnuplot is a fine graphing tool for this purpose, but it wants the category label to come first, and the count to come second, so we need to write a perl script to reverse uniq's output and stick the result in a file. See perlrun(1) for details on the -a and -F options to perl.

$ cat pums_53.dat | grep "^P" | cut -c78-81 | grep -v 0000 | sort | uniq -c |
perl -lape 'chomp $F[-1]; $_ = join " ", reverse @F' > year_of_immigration

Now we can make a bar chart from the contents of the file with gnuplot.

gnuplot> plot 'year_of_immigration' with impulses

Here's the graph gnuplot creates:

[Figure: Year of immigration.gif, an impulse plot of person counts by year of immigration]

Be a bit careful interpreting this plot: only people who are still alive can be counted, so it naturally goes up and to the right (people who immigrated more recently have a better chance of still being alive). That said, there seems to have been an increase in immigration after the end of World War II, and also a spike after the end of the Vietnam War. I remain at a loss to explain the spike around 1980; consult your local historian.

External links

set term win; set grid;A1=-300;A2=250;n=360;z=2500;splot(x**2)/(A1**2) + (y**2)/(A2**2) - ((2*x*y)/(A1*A2))*cos(n-z) - (sin(n-z))**2




Appendices

Appendix A: pcalc source code

A perl read-eval-print loop. This makes a very handy calculator on the command line. Example usage:

$ pcalc 1+2
3
$ pcalc "2*2"
4
$ pcalc 2*3
6

Source:

#!/opt/third-party/bin/perl
use strict;
if ($#ARGV >= 0) {
  eval_print(join(" ",@ARGV))
} else { 
  use Term::ReadLine;
  my $term = new Term::ReadLine 'pcalc';
  while ( defined ($_ = $term->readline("")) ) {
    s/[\r\n]//g;
    eval_print($_);
    $term->addhistory($_) if /\S/;
  }
}

sub eval_print {
  my ($str) = @_;
  my $result = eval $str;
  if (!defined($result)) {
    print "Error evaluating '$str'\n";
  } else {
    print $result,"\n";
  }
}

Appendix B: Random unfinished ideas

Ideas too good to delete, but that aren't fleshed out.

Micro shell scripts from the command line

Example - which .so has the object I want?

Using backticks

Example - killing processes by name

kill `ps auxww | grep httpd | grep -v grep | awk '{print $2}'`

Example - tailing the most recent log file in one easy step

tail -f `ls -rt *log | tail -1`
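
Backticks can also be written as $(...), which nests without extra escaping. This is a sketch of the log-file example in that form. It uses made-up log files with explicitly set timestamps, and cat stands in for tail -f so that the example terminates. All file and variable names are illustrative.

```shell
work=$(mktemp -d)
echo 'old entry' > "$work/a.log"
echo 'new entry' > "$work/b.log"
# Give the files distinct modification times (format: CCYYMMDDhhmm).
touch -t 202001010000 "$work/a.log"
touch -t 202101010000 "$work/b.log"

# ls -rt lists oldest first, so tail -n 1 picks the most recently modified file.
newest=$(ls -rt "$work"/*log | tail -n 1)
latest_line=$(cat "$newest")
echo "$latest_line"

rm -rf "$work"
```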

James' xargs trick

James uses echo with xargs and feeds one xargs' output into another xargs in clever ways to build up complex command lines.

tee(1)

perl + $/ == agrep

Example - Finding duplicate keys in two files
