Ad Hoc Data Analysis From The Unix Command Line/Joining The Data with join

From Wikibooks, open books for an open world
Jump to: navigation, search

Please note - Join assumes that that input data is sorted based on the key on which the join is going to take place.

Delimited data[edit]

In delimited data, elements of a record are separated by a special 'delimiter' character. In the CSV files, fields are delimited by commas or tabs:

$ cat j1
1,a
1,b
2,c
2,d
2,e
3,f
3,g
4,h
4,i
5,j
$ cat j2
1,A
1,B
1,C
2,D
2,E
4,F
4,G
5,H
6,I
6,J
$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2
1,a,A
1,a,B
1,a,C
1,b,A
1,b,B
1,b,C
2,c,D
2,c,E
2,d,D
2,d,E
2,e,D
2,e,E
3,f,
3,g,
4,h,F
4,h,G
4,i,F
4,i,G
5,j,H
6,,I
6,,J

Explanation of options:

"-t ,"          Input and output field separator is "," (for CSV)
"-a 1"          Output a line for every line of j1 not matched in j2
"-a 2"          Output a line for every line of j2 not matched in j1
"-o 0,1.2,2.2"  Output field format specification:

0 denotes the match (join) field (needed when using "-a") 1.2 denotes field 2 from file 1 ("j1") 2.2 denotes field 2 from file 2 ("j2").

Using the "-a" option creates a full outer join as in SQL.

This command must be given two and only two input files.

Multi-file Joins[edit]

To join several files you can loop through them.

$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2 > J

File "J" is now the full outer join of "j1", "j2".

$ join -t , -a 1 -a 2 -o 0,1.2,2.2 J j3 > J

and so on through j4, j5, ...

For many files this is best done with a loop

 $ for i in * ; do join -t , -a 1 -a 2 -o 0,1.2,2.2 J $i > J ; done

Sorted Data Note[edit]

join assumes that the input data has been sorted by the field to be joined. See section on sort for details. • Counting Part 2 - sort and uniq

Credits: Some text adapted from Ted Harding's email to the R mailing list.

Picking The Data Apart With cut · Counting Part 2 - sort and uniq