Guide to Unix/Commands/Text Processing
Unix supports multiple text processing commands.
comm identifies lines common to two sorted files or unique to one of them. Options control which of these are output, e.g. only the common lines.
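As a rough illustration of the above, the following sketch assumes the command in question is comm and uses two small hypothetical files (a.txt and b.txt, created in a temporary directory). Both inputs are already sorted, which comm expects.

```shell
# Hypothetical sample data; comm expects both inputs to be sorted.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/a.txt"
printf 'banana\ncherry\ndate\n' > "$tmpdir/b.txt"

# -1 and -2 suppress the columns of lines unique to each file,
# leaving only the lines common to both.
common=$(comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")
echo "$common"

# -2 and -3 leave only the lines unique to the first file.
only_a=$(comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")
echo "$only_a"

rm -r "$tmpdir"
```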
csplit splits input into output files. The split points can be given as line numbers or as regex matches.
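A minimal sketch of a regex-driven split, assuming the command described is csplit. The file name input.txt, the prefix part and the dash separator line are hypothetical choices for the example.

```shell
# Hypothetical input: two sections separated by a line of dashes.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\ntwo\n---\nthree\n' > input.txt

# -s suppresses the byte counts csplit normally prints;
# -f sets the output file name prefix (the default is "xx").
csplit -s -f part input.txt '/^---$/'

first=$(cat part00)    # lines before the first match
second=$(cat part01)   # the matching line and everything after it
echo "$first"
echo "$second"

cd - > /dev/null
rm -r "$workdir"
```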
cut can select columns ("fields") from lines in text files, with specifiable column separator.
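For illustration, a short sketch of field selection with cut. The colon-separated record below is a hypothetical passwd-style line, not taken from a real system.

```shell
# Hypothetical passwd-style record with colon-separated fields.
record='alice:x:1000:1000:Alice:/home/alice:/bin/sh'

# -d sets the field separator, -f lists the fields to keep.
name_and_shell=$(printf '%s\n' "$record" | cut -d: -f1,7)
echo "$name_and_shell"

# -c selects by character position instead of by field.
first_three=$(printf '%s\n' "$record" | cut -c1-3)
echo "$first_three"
```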
expand converts tabs to spaces, with tab stops every 8 columns by default. See also #unexpand.
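A small sketch of the tab-to-space conversion, assuming the command is expand. The input string is made up for the example.

```shell
# With the default tab stops (every 8 columns), the tab after "a"
# is replaced by 7 spaces so that "b" starts at column 8.
default_stops=$(printf 'a\tb\n' | expand)
echo "$default_stops"

# -t 4 places tab stops every 4 columns, so 3 spaces are inserted.
narrow_stops=$(printf 'a\tb\n' | expand -t 4)
echo "$narrow_stops"
```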
fmt formats text, including reflowing paragraphs to a specified maximum number of characters per line. It does not appear to be covered by POSIX.
fold limits the maximum length of a line by breaking long lines at a fixed width, unlike #fmt, which reflows paragraphs.
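A minimal sketch of fixed-width line breaking, assuming the command is fold. The ten-letter input is hypothetical.

```shell
# -w sets the maximum line width; longer lines are cut at that width,
# regardless of word boundaries.
folded=$(printf 'abcdefghij\n' | fold -w 4)
echo "$folded"
```

Here the single input line is expected to come out as three lines: abcd, efgh and ij.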
iconv converts between character encodings.
- iconv, opengroup.org
join combines lines from files based on their fields, assuming the files are sorted on the fields used for joining.
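The field-based combination can be sketched as follows, assuming the command is join. The files names.txt and roles.txt are hypothetical and are already sorted on their first field, which is the default join field.

```shell
# Hypothetical sample data, sorted on the first (join) field.
tmpdir=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$tmpdir/names.txt"
printf '1 admin\n3 user\n' > "$tmpdir/roles.txt"

# Lines whose first field appears in both files are merged;
# the line for key 2 has no partner and is dropped.
joined=$(join "$tmpdir/names.txt" "$tmpdir/roles.txt")
echo "$joined"

rm -r "$tmpdir"
```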
nl adds line numbers.
paste, for multiple files, joins lines corresponding by line number as if each file were a column of a table and each file line a row of the table.
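A short sketch of the column-wise merge, assuming the command is paste. The two input files are hypothetical.

```shell
# Hypothetical sample data: one column per file.
tmpdir=$(mktemp -d)
printf 'alice\nbob\n' > "$tmpdir/names.txt"
printf '30\n25\n' > "$tmpdir/ages.txt"

# -d sets the delimiter placed between columns (the default is a tab).
pasted=$(paste -d, "$tmpdir/names.txt" "$tmpdir/ages.txt")
echo "$pasted"

rm -r "$tmpdir"
```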
pr formats input for printing, including pagination with header and footer.
sed, a stream editor, is noted for its text replacement capability with regular expression support, but can do more. You can learn more in the Sed Wikibook.
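A few common sed uses can be sketched as follows. The input strings are made up for the example, and only widely supported features are used.

```shell
# s/pattern/replacement/g replaces every match on each line.
replaced=$(echo 'hello world' | sed 's/o/0/g')
echo "$replaced"

# -n suppresses automatic printing; 2p prints only the second line.
second_line=$(printf 'one\ntwo\nthree\n' | sed -n '2p')
echo "$second_line"

# /regex/d deletes the lines matching the regex.
no_comments=$(printf 'a\n# note\nb\n' | sed '/^#/d')
echo "$no_comments"
```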
sort orders the lines of files, outputting the sorted lines and leaving the input intact.
- sort file.txt
- Sorts the file alphabetically.
- sort file.txt file2.txt
- Sorts the lines of two files alphabetically, outputting a single sorted stream of lines from the two files.
- cat file.txt | sort
- Sorts the input stream created by cat. Thus, equivalent to sort file.txt.
- sort -n file.txt
- Sorts the file numerically. Thus, 12 comes after 2, which it does not alphabetically.
- sort -r file.txt
- Sorts the file in the reverse order. Thus, b comes before a.
- sort -k5,5 file.txt
- Sorts the file by the 5th field (column) via -k.
- sort -t, -k5,5 file.txt
- As above, using comma (,) as the field separator via -t.
- sort -k5,5 -k3,3 file.txt
- Sorts the file first by the 5th field, then by the 3rd field.
- sort -k5,5 -k3,3n file.txt
- As above, but when sorting by the 3rd field, do so numerically via appended "n".
- sort -k5 file.txt
- Sorts the file by a key that starts at the 5th field and extends to the end of the line, ignoring fields 1 to 4 for sorting purposes.
- sort -u file.txt
- Sorts the file, removing duplicate lines, thereby ensuring each output line is unique.
- sort -u -k5,5 file.txt
- Sorts the file by the 5th field, keeping only one line from each set of lines having the same key, where the key is the 5th field.
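To make the -t and -k options above concrete, here is a small sketch on a hypothetical two-column CSV file (scores.csv). Setting LC_ALL=C is an assumption made only to keep the ordering reproducible across locales.

```shell
# Hypothetical sample data: name,score
tmpdir=$(mktemp -d)
printf 'carol,7\nalice,12\nbob,2\n' > "$tmpdir/scores.csv"

# Whole-line alphabetical sort.
by_name=$(LC_ALL=C sort "$tmpdir/scores.csv")
echo "$by_name"

# Numeric sort on the 2nd comma-separated field only.
by_score=$(LC_ALL=C sort -t, -k2,2n "$tmpdir/scores.csv")
echo "$by_score"

# Same key, reversed via an appended "r".
by_score_desc=$(LC_ALL=C sort -t, -k2,2nr "$tmpdir/scores.csv")
echo "$by_score_desc"

rm -r "$tmpdir"
```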
spell performs spell checking. It seems to be absent from POSIX.
tr performs a character-by-character mapping or "translation", and more. Yields greater brevity than sed for some tasks.
- echo "a:b:c:d" | tr : \\n
- Splits into multiple lines by colon (:). The colon will not be in the output.
- echo "a b c d" | tr " " \\n
- Splits into multiple lines by space.
- echo "abba" | tr ab cd
- Replaces a with c and b with d. Thus, yields cddc.
- echo "a,b:c,d:e" | tr ,: :,
- Swaps commas with colons. Thus, yields a:b,c:d,e.
- echo "a b c d" | tr -d " "
- Removes spaces from the input, outputting abcd. -d stands for delete.
- echo "a,b,c:d:e" | tr -dc ,:
- Keeps only the commas and colons. -c stands for complement. Thus, yields ,,::.
- echo "a,,,b,c::d" | tr -s ,:
- Replaces sequences of commas with a single comma, and sequences of colon with a single colon. -s stands for squeeze. Thus, yields a,b,c:d.
unexpand converts spaces to tabs, with tab stops every 8 columns by default.
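A minimal sketch of the space-to-tab conversion, assuming the command is unexpand. The input line, with eight leading spaces, is hypothetical.

```shell
# Eight leading spaces reach the first default tab stop,
# so they are replaced by a single tab character.
tabbed=$(printf '        x\n' | unexpand)

# od -c makes the tab visible as \t.
printf '%s\n' "$tabbed" | od -c
```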
uniq collapses each run of adjacent identical lines into a single line, and more. Because only adjacent lines are compared, it is best used on sorted input. You can learn more in the Uniq Wikibook.
- sort file.txt | uniq
- Sorts the file and then removes duplicate lines. Thus, equivalent to sort -u file.txt.
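The effect of sorting before uniq can be sketched as follows. The four-line input is hypothetical, and the awk step is only there to normalize the leading whitespace that uniq -c puts before each count.

```shell
# Without sorting, only adjacent duplicates are collapsed.
unsorted=$(printf 'b\na\nb\nb\n' | uniq)
echo "$unsorted"

# After sorting, all duplicates are adjacent and collapse fully.
deduped=$(printf 'b\na\nb\nb\n' | sort | uniq)
echo "$deduped"

# -d prints only the lines that occur more than once.
repeated=$(printf 'b\na\nb\nb\n' | sort | uniq -d)
echo "$repeated"

# -c prefixes each line with its number of occurrences.
counts=$(printf 'b\na\nb\nb\n' | sort | uniq -c | awk '{print $1, $2}')
echo "$counts"
```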