Guide to Unix/Commands/Text Processing

From Wikibooks, open books for an open world

Unix supports multiple text processing commands.

awk

awk is a powerful text-processing language built around regular expressions and field splitting, providing capabilities beyond those of cut and sed. You can learn more in the AWK and An Awk Primer Wikibooks.
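As a minimal sketch of what field splitting and regular expressions look like in practice, the following sums one column and filters on a pattern. The sample data and the variable names total and names are made up for illustration.

```shell
# Sample input: a name and a count per line (hypothetical data).
data='apple 3
pear 5
plum 4'

# Sum the second whitespace-separated field over all lines.
total=$(printf '%s\n' "$data" | awk '{ sum += $2 } END { print sum }')
echo "$total"

# Print the first field of lines whose first field starts with "p".
names=$(printf '%s\n' "$data" | awk '$1 ~ /^p/ { print $1 }')
echo "$names"
```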

comm

Identifies lines common to two sorted files or unique to one of them. Options control which of the three groups are output, e.g. outputting only the common lines.
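A small sketch of selecting the output columns, assuming mktemp -d is available for scratch files; the file names and contents are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/one.txt"
printf 'b\nc\nd\n' > "$tmp/two.txt"

# -1 and -2 suppress the lines unique to each file, leaving the common lines.
common=$(comm -12 "$tmp/one.txt" "$tmp/two.txt")
echo "$common"

# -2 and -3 leave only the lines unique to the first file.
only_first=$(comm -23 "$tmp/one.txt" "$tmp/two.txt")
echo "$only_first"

rm -rf "$tmp"
```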

csplit

Splits input into output files. The split points can be given as line numbers or as regex matches.
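A minimal sketch of a regex-driven split, assuming mktemp -d is available; the input text and the "part" prefix are chosen for illustration only.

```shell
tmp=$(mktemp -d)
printf 'chapter 1\ntext one\nchapter 2\ntext two\n' > "$tmp/in.txt"

# Split before the first line matching the regex.
# -s suppresses the byte counts, -f sets the output file name prefix.
csplit -s -f "$tmp/part" "$tmp/in.txt" '/^chapter 2/'

first=$(cat "$tmp/part00")
second=$(cat "$tmp/part01")
echo "$first"
echo "$second"

rm -rf "$tmp"
```

Replacing the regex operand with the line number 3 should produce the same two files for this input, since the split then happens before line 3.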

cut

cut selects columns ("fields") from lines in text files, with a specifiable column separator. It can also select by character or byte position.
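A short sketch of field selection on colon-separated input; the sample lines are made up and only loosely resemble a passwd file.

```shell
# -d sets the field separator, -f lists the fields to keep.
fields=$(printf 'root:x:0:0\nuser:x:1000:1000\n' | cut -d: -f1,3)
echo "$fields"

# -c selects by character position instead of by field.
chars=$(printf 'abcdef\n' | cut -c2-4)
echo "$chars"
```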

expand

Converts tabs to spaces, with tab stops every 8 columns by default. See also unexpand.
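The effect of the tab stop setting can be sketched as follows; the input is a made-up line with a single tab.

```shell
# With tab stops every 4 columns, the tab after "a" becomes 3 spaces.
expanded=$(printf 'a\tb\n' | expand -t 4)
echo "$expanded"

# With the default stops every 8 columns, the same tab becomes 7 spaces,
# so the resulting line is 9 characters long.
default_stops=$(printf 'a\tb\n' | expand)
echo "${#default_stops}"
```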

fmt

Formats text, including reflowing paragraphs to a specified maximum number of characters per line. It does not appear to be covered by POSIX.
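A rough sketch of reflowing, assuming an fmt that accepts -w for the maximum width (the GNU and BSD versions do). Where exactly the lines break can differ between implementations, so no particular output is shown.

```shell
# Two input lines forming one paragraph (sample text).
wrapped=$(printf 'the quick brown\nfox jumps over the lazy dog\n' | fmt -w 20)

# The words are redistributed so that no output line exceeds 20 characters.
echo "$wrapped"
```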

fold

Breaks lines longer than a given width (80 columns by default) at that width. Unlike fmt, it does not join short lines or reflow paragraphs.
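A minimal sketch of the hard line break, using a made-up ten-character line.

```shell
# -w sets the width; the line is cut every 4 characters.
folded=$(printf 'abcdefghij\n' | fold -w 4)
echo "$folded"
```

Adding -s makes fold break after the last blank within the width instead of in the middle of a word.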

Links:

  • fold, opengroup.org
  • 4.3 fold in GNU Coreutils manual, gnu.org

iconv

Converts between character encodings.
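A small sketch of a conversion, assuming the local iconv knows the encoding names UTF-8 and ISO-8859-1 (iconv -l lists the supported names). The result is shown as a hex byte to avoid depending on the terminal encoding.

```shell
# "é" is the two bytes C3 A9 in UTF-8 and the single byte E9 in ISO-8859-1.
# -f names the source encoding, -t the target encoding.
hex=$(printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')
echo "$hex"
```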

join

Combines lines from two files based on a common field, assuming the files are sorted on the fields used for joining.
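A minimal sketch of joining on the first field, which is the default join field. It assumes mktemp -d is available; the file names and contents are hypothetical.

```shell
tmp=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$tmp/names.txt"
printf '1 admin\n3 guest\n' > "$tmp/roles.txt"

# Lines are paired where the first field matches; unpaired lines are dropped.
joined=$(join "$tmp/names.txt" "$tmp/roles.txt")
echo "$joined"

rm -rf "$tmp"
```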

nl

Adds line numbers.
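A short sketch of numbering with a custom format; the width and separator values are chosen only to keep the output compact.

```shell
# -w sets the width of the number column, -s the separator after the number.
numbered=$(printf 'first\nsecond\n' | nl -w 1 -s ': ')
echo "$numbered"
```

By default nl leaves empty lines unnumbered; -b a numbers all lines.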

Links:

  • nl, opengroup.org
  • 3.3 nl in GNU Coreutils manual, gnu.org

paste

Given multiple files, joins the lines that share the same line number, as if each file were a column of a table and each line a row of that table.
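The table analogy can be sketched with two small files. This assumes mktemp -d is available; the file names and contents are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/letters.txt"
printf '1\n2\n' > "$tmp/numbers.txt"

# -d sets the delimiter placed between columns (a tab by default).
pasted=$(paste -d, "$tmp/letters.txt" "$tmp/numbers.txt")
echo "$pasted"

rm -rf "$tmp"
```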

pr

Formats input for printing, including pagination with header and footer.
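A rough sketch of the page header, with a made-up title. The header also carries a date and the exact spacing varies between implementations, so only the first lines are printed here.

```shell
# -h replaces the file name in the page header with the given text.
page=$(printf 'line one\nline two\n' | pr -h 'Sample report')

# Show the top of the first page: blank lines, the header, then the text.
printf '%s\n' "$page" | head -n 7
```

The -t option omits the header and trailer entirely.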

sed

sed, a stream editor, is noted for its text replacement capability with regular expression support, but can do more. You can learn more in the Sed Wikibook.
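A minimal sketch of replacement and of one thing sed can do beyond it, namely printing only matching lines; the sample text is made up.

```shell
text='hello world
hello there'

# s/old/new/ replaces the first match of the regex on each line.
replaced=$(printf '%s\n' "$text" | sed 's/hello/goodbye/')
echo "$replaced"

# -n suppresses the default output; p prints only the lines matching /there/.
matched=$(printf '%s\n' "$text" | sed -n '/there/p')
echo "$matched"
```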

sort

Sorts lines in files, outputting the sorted lines and leaving the input intact.

Examples:

  • sort file.txt
    • Sorts the file alphabetically.
  • sort file.txt file2.txt
    • Sorts the lines of two files alphabetically, outputting a single sorted stream of lines from the two files.
  • cat file.txt | sort
    • Sorts the input stream created by cat. Thus, equivalent to sort file.txt.
  • sort -n file.txt
    • Sorts the file numerically. Thus, 12 comes after 2, which it does not alphabetically.
  • sort -r file.txt
    • Sorts the file in the reverse order. Thus, b comes before a.
  • sort -k5,5 file.txt
    • Sorts the file by the 5th field (column) via -k.
  • sort -t, -k5,5 file.txt
    • As above, using comma (,) as the field separator via -t.
  • sort -k5,5 -k3,3 file.txt
    • Sorts the file first by the 5th field, then by the 3rd field.
  • sort -k5,5 -k3,3n file.txt
    • As above, but when sorting by the 3rd field, do so numerically via appended "n".
  • sort -k5 file.txt
    • Sorts the file by a key that starts at the 5th field and extends to the end of the line, ignoring fields 1 to 4 for sorting purposes.
  • sort -u file.txt
    • Sorts the file, removing duplicate lines, thereby ensuring each output line is unique.
  • sort -u -k5,5 file.txt
    • Sorts the file by the 5th field, keeping only one line from each set of lines having the same key, where the key is the 5th field.
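The difference between alphabetical and numeric ordering, and the combined use of -t and -k, can be sketched on input piped in with printf; the sample values are made up.

```shell
# Alphabetically, 12 sorts before 2 because "1" precedes "2".
alpha=$(printf '12\n2\n1\n' | sort)
echo "$alpha"

# Numerically, 12 sorts after 2.
numeric=$(printf '12\n2\n1\n' | sort -n)
echo "$numeric"

# Sort comma-separated lines numerically by their 2nd field.
by_field=$(printf 'b,2\na,10\nc,1\n' | sort -t, -k2,2n)
echo "$by_field"
```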

spell

Performs spell checking. It appears to be absent from POSIX.

tr

Performs a character-by-character mapping or "translation", and can also delete characters or squeeze repeated ones. Yields greater brevity than sed for some tasks.

Examples:

  • echo "a:b:c:d" | tr : \\n
    • Splits into multiple lines by colon (:). The colon will not be in the output.
  • echo "a b c d" | tr " " \\n
    • Splits into multiple lines by space.
  • echo "abba" | tr ab cd
    • Replaces a with c and b with d. Thus, yields cddc.
  • echo "a,b:c,d:e" | tr ,: :,
    • Swaps commas with colons. Thus, yields a:b,c:d,e.
  • echo "a b c d" | tr -d " "
    • Removes spaces from the input, outputting abcd. -d stands for delete.
  • echo "a,b,c:d:e" | tr -dc ,:
    • Keeps only the commas and colons. -c stands for complement. Thus, yields ,,::.
  • echo "a,,,b,c::d" | tr -s ,:
    • Replaces sequences of commas with a single comma, and sequences of colons with a single colon. -s stands for squeeze. Thus, yields a,b,c:d.

unexpand

Converts spaces to tabs, with tab stops every 8 columns by default. Unless -a is given, only leading blanks are converted.
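A minimal sketch of the default behaviour on a made-up line that starts with 8 spaces.

```shell
# The 8 leading spaces reach the first tab stop, so they become one tab.
tabbed=$(printf '        x\n' | unexpand)

# Show the result with od so that the tab is visible as \t.
printf '%s\n' "$tabbed" | od -c
```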

uniq

Outputs a single line for each block of adjacent identical lines, and more. Because only adjacent duplicates are collapsed, it is usually used on sorted input. You can learn more in the Uniq Wikibook.

Examples:

  • sort file.txt | uniq
    • Outputs each distinct line of the file once. Thus, equivalent to sort -u file.txt.
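Why sorting matters can be sketched on a made-up input in which a duplicate line is not adjacent to its twin.

```shell
# Without sorting, only the adjacent pair of "a" lines is collapsed.
unsorted=$(printf 'a\na\nb\na\n' | uniq)
echo "$unsorted"

# After sorting, all identical lines are adjacent and each appears once.
sorted=$(printf 'a\na\nb\na\n' | sort | uniq)
echo "$sorted"

# -d outputs only the lines that occur more than once.
repeated=$(printf 'a\na\nb\na\n' | sort | uniq -d)
echo "$repeated"
```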
