Ad Hoc Data Analysis From The Unix Command Line/Joining The Data with join
Please note - Join assumes that that input data is sorted based on the key on which the join is going to take place.
Delimited data
[edit | edit source]In delimited data, elements of a record are separated by a special 'delimiter' character. In the CSV files, fields are delimited by commas or tabs:
$ cat j1 1,a 1,b 2,c 2,d 2,e 3,f 3,g 4,h 4,i 5,j
$ cat j2 1,A 1,B 1,C 2,D 2,E 4,F 4,G 5,H 6,I 6,J
$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2 1,a,A 1,a,B 1,a,C 1,b,A 1,b,B 1,b,C 2,c,D 2,c,E 2,d,D 2,d,E 2,e,D 2,e,E 3,f, 3,g, 4,h,F 4,h,G 4,i,F 4,i,G 5,j,H 6,,I 6,,J
Explanation of options:
"-t ," Input and output field separator is "," (for CSV) "-a 1" Output a line for every line of j1 not matched in j2 "-a 2" Output a line for every line of j2 not matched in j1 "-o 0,1.2,2.2" Output field format specification:
0 denotes the match (join) field (needed when using "-a") 1.2 denotes field 2 from file 1 ("j1") 2.2 denotes field 2 from file 2 ("j2").
Using the "-a" option creates a full outer join as in SQL.
This command must be given two and only two input files.
Multi-file Joins
[edit | edit source]To join several files you can loop through them.
$ join -t , -a 1 -a 2 -o 0,1.2,2.2 j1 j2 > J
File "J" is now the full outer join of "j1", "j2".
$ join -t , -a 1 -a 2 -o 0,1.2,2.2 J j3 > J
and so on through j4, j5, ...
For many files this is best done with a loop
$ for i in * ; do join -t , -a 1 -a 2 -o 0,1.2,2.2 J $i > J ; done
Sorted Data Note
[edit | edit source]join assumes that the input data has been sorted by the field to be joined. See section on sort for details. • Counting Part 2 - sort and uniq
Credits: Some text adapted from Ted Harding's email to the R mailing list.