An Awk Primer/Search Patterns (2)

From Wikibooks, open books for an open world

Fields and Blocks

There is more to Awk's string-searching capabilities. The search can be constrained to a single field within the input line. For example:

$1 ~ /^France$/

This matches lines whose first field ($1; more on "field variables" later) is exactly the word "France". By contrast:

$1 !~ /^Norway$/

This matches lines whose first field is anything other than the word "Norway".
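The two field tests above can be tried with a minimal sketch like the following. The sample data and the shell variable names are hypothetical, chosen only for illustration.

```shell
# Hypothetical sample data: a name in field 1, a number in the last field.
data='France 10
Norway 20
FranceFour 30
Visit France 40'

# Lines whose first field is exactly "France".
is_france=$(printf '%s\n' "$data" | awk '$1 ~ /^France$/')

# Lines whose first field is anything other than "Norway".
not_norway=$(printf '%s\n' "$data" | awk '$1 !~ /^Norway$/')

echo "$is_france"
echo "$not_norway"
```

Note that "Visit France 40" is not selected by the first test, because "France" is its second field, not its first.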

It is possible to search for an entire series or "block" of consecutive lines in the text, using one search pattern to match the first line in the block and another search pattern to match the last line in the block. For example:

/^Ireland/,/^Summary/

This matches a block of text whose first line begins with "Ireland" and whose last line begins with "Summary".

Here's how it works: once a line matches /^Ireland/, that line and every following line are matched, up to and including the first line that matches /^Summary/. At that point, the matching stops until another line matches /^Ireland/. If no line beginning with "Summary" is found, everything from "Ireland" through the end of the file is matched.
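As a rough sketch of the block behavior, the following runs the range pattern over a small made-up report. The report text is an assumption for illustration; it contains one complete block and one block that is never closed.

```shell
# Hypothetical report: one Ireland...Summary block, then an unterminated one.
report='Header
Ireland 1
detail a
Summary first
Footer
Ireland 2
detail b'

# Select every line that falls inside an Ireland...Summary block.
block=$(printf '%s\n' "$report" | awk '/^Ireland/,/^Summary/')
echo "$block"
```

The "Header" and "Footer" lines are skipped. The second "Ireland" line starts a new block, and since no "Summary" line follows it, that block runs to the end of the input.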

Beyond Regular Expressions

There is no need for the search pattern to be a regular expression. It can be a wide variety of other expressions as well. For example:

NR == 10

This matches line 10. Lines are numbered beginning with 1. NR is, as explained in the overview, a count of the lines searched by Awk, and == is the "equality" operator. Similarly:

NR == 10,NR == 20

This matches lines 10 through 20 in the input file.
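To see the line-number range in action without preparing an input file, one option is to generate numbered lines with a BEGIN-only Awk program and pipe them into the pattern. This is only a sketch; the "line N" text is made up.

```shell
# Generate 25 numbered lines, then keep only lines 10 through 20.
picked=$(awk 'BEGIN { for (i = 1; i <= 25; i++) print "line " i }' |
         awk 'NR == 10,NR == 20')
echo "$picked"
```

The result should start with "line 10" and end with "line 20", eleven lines in all.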

Comparison Operators

Awk supports search patterns using a full range of comparison operations:

  • < Less than
  • <= Less than or equal to
  • == Equal
  • != Not equal
  • >= Greater than or equal to
  • > Greater than
  • ~ Matches
  • !~ Does not match

For example,

NF == 0

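This matches all blank lines, that is, lines whose number of fields is zero. With the default field separator, a line containing only spaces or tabs also has zero fields and so also matches.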
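A small sketch of the blank-line test follows. The input is made up: its second line is empty and its third line holds only spaces. The action `{ print NR }` is added here so the matching line numbers are visible.

```shell
# Print the line numbers of all lines that have no fields.
blank_lines=$(printf 'first\n\n   \nlast\n' | awk 'NF == 0 { print NR }')
echo "$blank_lines"
```

Both line 2 and line 3 should be reported, since neither contains any fields.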

$1 == "France"

This is a string comparison that matches any line whose first field is the string "France". The astute reader may notice that this example seems to do the same thing as an earlier example:

$1 ~ /^France$/

In fact, both examples do the same thing, but in the example immediately above the ^ and $ meta-characters had to be used in the regular expression to specify a match with the entire first field; without them, it would also match such strings as "FranceFour", "NewFrance", and so on. The string comparison matches only "France".
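The difference between the string comparison, the anchored regular expression, and an unanchored one can be checked with a short sketch. The three sample names are taken from the text above; the shell variable names are hypothetical.

```shell
names='France
FranceFour
NewFrance'

# String comparison: the whole field must equal "France".
exact=$(printf '%s\n' "$names" | awk '$1 == "France"')

# Anchored regular expression: same result as the string comparison.
anchored=$(printf '%s\n' "$names" | awk '$1 ~ /^France$/')

# Unanchored regular expression: matches "France" anywhere in the field.
loose=$(printf '%s\n' "$names" | awk '$1 ~ /France/')

echo "$exact"
echo "$anchored"
echo "$loose"
```

The first two commands select only "France", while the unanchored pattern selects all three lines.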

Logic Operators

It is also possible to combine several search patterns with the && (AND) and || (OR) operators. For example:

((NR >= 30) && ($1 == "France")) || ($1 == "Norway")

This matches any line from the 30th onward whose first field is "France", or any line whose first field is "Norway". If a line's first field is "France" but the line comes before the 30th, it will not match. All lines whose first field is "Norway" will match, however.
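The combined pattern can be explored with a short sketch. To keep the sample small, the threshold here is line 3 instead of line 30; that change and the sample data are assumptions made purely for illustration.

```shell
# Hypothetical data; the cutoff is line 3 here rather than line 30.
data='France early
Norway early
Sweden
France late
Norway late'

matched=$(printf '%s\n' "$data" |
          awk '((NR >= 3) && ($1 == "France")) || ($1 == "Norway")')
echo "$matched"
```

"France early" is rejected because it comes before the cutoff, while both "Norway" lines are selected regardless of position.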

One class of pattern-matching that wasn't listed above is performing a numeric comparison on a field variable. It can be done, of course; for example:

$1 == 100

This matches any line whose first field has a numeric value equal to 100. This is a simple thing to do and it will work fine. However, suppose we want to perform:

$1 < 100

This will generally work fine, but there's a nasty catch to it, which requires some explanation: if the first field of the input can be either a number or a text string, this sort of numeric comparison can give crazy results, matching on some text strings that aren't equivalent to a numeric value.

This is because Awk is a weakly-typed language. Its variables can store a number or a string, with Awk performing operations on each appropriately. In the case of the numeric comparison above, if $1 contains a numeric value, Awk will perform a numeric comparison on it, as expected; but if $1 contains a text string, Awk will perform a string comparison between the text in $1 and the three-character string "100". This works fine for a simple test of equality or inequality, since a non-numeric string never equals "100", but it gives unexpected results for a "less than" or "greater than" comparison. Essentially, when comparing strings Awk compares them character by character using each character's code (its ASCII value, in the default C locale). This is roughly equivalent to an alphabetical ("phone book style") sort. Even so, it's not perfectly alphabetical: every uppercase letter sorts before every lowercase letter, and digits and punctuation compare in a somewhat arbitrary way.
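The surprise can be reproduced with a small sketch. The sample values are made up, and LC_ALL=C is set as an assumption so that string comparison follows plain ASCII order whichever Awk implementation is installed.

```shell
# Hypothetical first-field values: two numbers and two text strings.
data='5
250
apple
(none)'

# "Less than 100": the text "(none)" slips through, since "(" sorts before "1".
below=$(printf '%s\n' "$data" | LC_ALL=C awk '$1 < 100')

# "Greater than 100": the text "apple" slips through, since "a" sorts after "1".
above=$(printf '%s\n' "$data" | LC_ALL=C awk '$1 > 100')

echo "$below"
echo "$above"
```

The numbers 5 and 250 are compared numerically, as intended, but each test also picks up one text string that has no numeric value at all.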

More about Types

Awk is not broken; it is doing what it is told to do in this case. If this problem comes up, it is possible to add a second test to the comparison to determine if the field contains a numeric value or a text string. This second test has the form:

(( $1 + 0 ) == $1 )

If $1 contains a numeric value, the left-hand side of this expression will add 0 to it, and Awk will perform a numeric comparison that will always be true.

If $1 contains a text string that doesn't look like a number, for want of anything better to do Awk will interpret its value as 0. This means the left-hand side of the expression will evaluate to zero; because there is a non-numeric text string in $1, Awk will perform a string comparison that will always be false. This leads to a more workable comparison:

((( $1 + 0 ) == $1 ) && ( $1 > 100 ))

The same test could be modified to check for a text string instead of a numeric value:

(( $1 + 0 ) != $1 )
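A minimal sketch of both guards, run over made-up sample values, might look like this. The data and shell variable names are hypothetical.

```shell
# Hypothetical first-field values: two numbers and two text strings.
data='5
250
apple
(none)'

# Numeric values greater than 100 only; text strings are filtered out first.
big_numbers=$(printf '%s\n' "$data" |
              awk '((( $1 + 0 ) == $1 ) && ( $1 > 100 ))')

# The inverse guard: lines whose first field is not a number.
text_only=$(printf '%s\n' "$data" | awk '(( $1 + 0 ) != $1 )')

echo "$big_numbers"
echo "$text_only"
```

With the guard in place, only 250 passes the "greater than 100" test, and the second command lists the two text strings.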

It is worthwhile to remember this trickery for the rare occasions it is needed. Weakly-typed languages are convenient, but in some unusual cases they can turn around and bite.

Test It Out

Incidentally, if there's some uncertainty as to how Awk is handling a particular sort of data, it is simple to run tests to find out for sure. For example, I wanted to see if my version of Awk could handle a hexadecimal value as would be specified in C—for example, "0xA8"—and so I simply typed in the following at the command prompt:

awk 'BEGIN {tv="0xA8"; print tv,tv+0}'

This printed "0xA8 0", which meant Awk thought that the data was strictly a string. This little example consists only of a BEGIN clause, allowing an Awk program to be run without specifying an input file. Such "one-liners" are convenient when playing with examples. If you are uncertain about what Awk may be doing, just try a test; it won't break anything.
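The same kind of quick check works for other questions. As one more sketch, this asks what Awk does with a string that starts with digits but continues with letters; the value "12abc" is just an illustrative choice.

```shell
# What is the numeric value of a string with a leading number?
probe=$(awk 'BEGIN { tv = "12abc"; print tv, tv + 0 }')
echo "$probe"
```

This should print "12abc 12": in arithmetic, Awk uses the leading numeric part of the string and ignores the rest.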

Practice

  1. Write an Awk program that prints any line with fewer than 5 words, unless it starts with an asterisk.
  2. Write an Awk program that prints every line beginning with a number.
  3. Write an Awk program that scans a line-numbered text file for errors. It should print out any line that is missing a line number and any line that is numbered incorrectly, along with the actual line number.

On the next page, you'll learn some of the finer points of strings and numbers.