Fortran/strings

From Wikibooks, open books for an open world

Modern Fortran has a wide range of facilities for handling string or text data but some of these language-defined facilities have not been widely implemented by the compiler developers. It should be remembered that Fortran is designed for scientific computing and is probably not a good choice for writing a new word processor.

Character type

The main feature in Fortran that supports strings is the intrinsic data type character. A character literal constant can be delimited by either single or double quotes, and, where necessary, these can be escaped by using two consecutive single or double quotes. The concatenation operator is // (but this cannot be used to concatenate character entities of different KIND). Character scalar variables and arrays are allowed. Character variables have a sub-string notation to refer to and extract sub-strings.

Example

program string_1
    implicit none
    ! Declarations
    character (len=6) :: word1
    character (len=2) :: word2

    word1 = "abcdef" ! Assignment
    word2 = word1(5:6) ! Substring
    word1 = 'Don''t ' ! Escape a single quote by doubling it
    write (*,*) word2//word1 ! Concatenation
end program string_1

In the above example, the two character variables word1 and word2 are declared to have length 6 and 2 characters respectively.

In character assignment operations, if the right hand side of the assignment is shorter than the left hand side, the remaining characters on the left hand side are filled with blanks. If the right hand side is longer than the left hand side, then the right hand side is truncated. In neither case is an error raised either by the compiler or at run time.
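
The padding and truncation rules can be checked with a short program. This is a minimal sketch; the program and variable names are illustrative only. Some compilers can be asked to warn about the truncation, but it is not an error.

```fortran
program assign_demo
    implicit none
    character (len=6) :: short_rhs, long_rhs

    short_rhs = 'abc'         ! Padded on the right with blanks: 'abc   '
    long_rhs  = 'abcdefghij'  ! Truncated on the right: 'abcdef'

    ! Brackets make the trailing blanks visible
    write (*,'(3a)') '[', short_rhs, ']'   ! [abc   ]
    write (*,'(3a)') '[', long_rhs, ']'    ! [abcdef]
end program assign_demo
```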

character arrays and coarrays are permitted and can be declared and accessed in the same way as any other Fortran array. Where the array index and substring notations are to be combined, the array indices appear first and the substring expression appears second as illustrated in the final line of the following example:

character (len=120), dimension (10) :: text
text(1) = 'This is the first element of the array "text"'
text(2:3) = ' '       ! Elements 2 and 3 are blank.
text(4)(20:20) = '!'  ! Character 20 of element 4.

Unlike some programming languages, Fortran character data and variables do not require an explicit character to terminate a string. Also, unlike C-type languages, Fortran character data do not accommodate embedded and escaped control characters (e.g. \n) and all processing of output control is done via an extensive format sub-system.

Character collating sequence

Internally, Fortran maintains a collating sequence for all the permitted characters. Non-printing characters may be included in the collating sequence. The collating sequence is not specified by the language standard but most vendors support either ASCII or EBCDIC. This collating sequence means that lexical comparisons can be performed to ascertain whether e.g. 'a'<'b', but the outcome is essentially vendor specific. Hence there is a difference between functions such as ichar and iachar that is described below.

Character kind

character can also have a kind, but the available kinds are vendor-specific. Kinds can allow compilers to support Unicode, the Russian alphabet, Japanese characters and so on. It is not necessary to specify the length or kind of a character variable: if a character variable is declared with neither, the result is a variable of default kind that is one character long. A single number in the declaration indicates the length, and two numbers indicate length and kind, in that order. It is generally much clearer, if slightly more verbose, to be explicit, as shown in lines 6-8 of the following example. The compiler vendor controls which kinds of character are supported and the integer values assigned to access the corresponding character sets.

program string_2
    implicit none
    character :: one
    character (5) :: english_name
    character (5,2) :: japanese_name
    character (len=80) :: line
    character (len=120, kind=3) :: unicode_line
    character (kind=4, len=256) :: ebcdic_string
    ! The kind values 2, 3 and 4 above are illustrative only; valid values are vendor-specific
    !...
end program string_2

The intrinsic function selected_char_kind(name) returns the positive integer kind value of the character set with the corresponding name (e.g. default, ascii, kanji, iso_10646 etc.), but the only character set that must be supported is default. If the name is not supported then -1 is returned. Disappointingly, vendors generally have been slow to implement more than the default kind but gfortran, for instance, is a notable exception.
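
A portable way to find out what a given compiler offers is to query the kinds rather than hard-code the integers. The following is a minimal sketch; the names k_default, k_ascii and k_ucs4 are illustrative, and the values printed depend on the compiler.

```fortran
program char_kinds
    implicit none
    ! Each query returns -1 if the named character set is not supported
    integer, parameter :: k_default = selected_char_kind('DEFAULT')
    integer, parameter :: k_ascii   = selected_char_kind('ASCII')
    integer, parameter :: k_ucs4    = selected_char_kind('ISO_10646')

    write (*,*) 'DEFAULT   kind: ', k_default
    write (*,*) 'ASCII     kind: ', k_ascii
    write (*,*) 'ISO_10646 kind: ', k_ucs4
    if (k_ucs4 < 0) write (*,*) 'ISO_10646 is not supported by this compiler'
end program char_kinds
```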

Language-defined Intrinsic Functions and Subprograms

Fortran has a fairly limited set of intrinsic functions to support character manipulation, searching and conversion. But the basic set is enough to construct some powerful features as required. There are some strange absences such as the ability to convert from lower-case to upper-case but this can be understood and forgiven since these concepts may not exist in many of the languages or character sets that may be represented by different character kinds. Functions such as size, lbound and ubound which apply to arrays of any data type, including character type, are not described here.

achar

achar(i, kind) returns the character at position i in the ASCII collating sequence, as a character of the specified kind. The integer i must be in the range 0 <= i <= 127. kind is an optional integer. achar(72) has the value 'H'. One really useful feature of achar is that it permits access to the non-printing ASCII characters such as carriage return (achar(13)). achar will always return the ASCII character even if the processor's collating sequence is not ASCII. If kind is present, the kind parameter of the result is that specified by kind; otherwise, the kind parameter of the result is that of default character. If the processor cannot represent the result value in the kind of the result, the result is undefined. Using achar is highly recommended in preference to char, described below, because it is portable from one processor to another.

adjustl

adjustl(string) left justifies by removing leading (left) blanks from string and filling the right of string with blanks so that the result has the same length as the input string.

adjustr

adjustr(string) right justifies by removing trailing (right) blanks from string and filling the left of the string with blanks so that the result has the same length as the input string.
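
As a small illustration of both functions (a sketch with an illustrative variable s), note that the result always keeps the length of the input; only the position of the blanks changes.

```fortran
program adjust_demo
    implicit none
    character (len=10) :: s

    s = '  abc'   ! Stored as two blanks, 'abc', then five trailing blanks

    ! Brackets make the blanks visible
    write (*,'(3a)') '[', adjustl(s), ']'   ! [abc       ]
    write (*,'(3a)') '[', adjustr(s), ']'   ! [       abc]
end program adjust_demo
```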

char

char(i, kind) returns the character at position i in the processor collating sequence for characters of the specified kind. Unlike achar, i is not restricted to the range 0 <= i <= 127; it must lie between 0 and n-1, where n is the number of characters in that collating sequence. kind is an optional integer. If kind is not specified the default kind is assumed. If the processor cannot represent the result value in the kind of the result, the result is undefined.

iachar

iachar(c, kind) is the inverse of achar described above. c is a single input character and iachar(c) returns the position of c in the ASCII character set as a default integer. Kind is an optional input integer and if kind is specified, it specifies the kind of the integer returned by iachar.

ichar

ichar(c, kind) is the inverse of char described above. c is a single input character and ichar(c) returns the position of c in the processor collating sequence for the kind of c, as a default integer. kind is an optional input integer and, if specified, it sets the kind of the integer returned by ichar.
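
The four conversion functions can be compared in a short program. This is a sketch with illustrative names; the final comparison is true only on processors whose collating sequence is ASCII-based, which is an assumption and not something the standard guarantees.

```fortran
program ascii_codes
    implicit none
    character (len=1) :: c
    integer :: code

    c = achar(72)        ! 'H', regardless of the processor collating sequence
    code = iachar('a')   ! 97, the ASCII position of 'a'
    write (*,*) c, code

    ! Round trip through the ASCII codes: 'a' (97) minus 32 is 'A' (65)
    write (*,*) achar(iachar('a') - 32)

    ! char and ichar use the processor collating sequence instead;
    ! this prints T only where that sequence agrees with ASCII
    write (*,*) ichar('a') == iachar('a')
end program ascii_codes
```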

index

index(string, substring) returns a default integer representing the position of the first instance of substring in string searching from left to right. There are two optional arguments: back and kind. If the logical back is set true the search is conducted from right to left, and if the integer kind is specified, then the integer returned by index will be of that kind. If substring does not appear in string the result is 0.

len

len(c, kind) returns an integer representing the declared length of character c. This can be extremely useful in subprograms which receive character dummy arguments. c can be a character array. Kind is an optional integer which controls the kind of the integer returned by len.

len_trim

len_trim(c, kind) returns the length of c excluding any trailing blanks (but including leading blanks). If c is only blanks the result is 0. Hence expressions like len_trim(adjustl(c)) can be used to count the number of characters in c between the first and last non-blank characters. kind is an optional integer which controls the kind of the integer returned by len_trim.
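
The difference between len, len_trim and the len_trim(adjustl(c)) idiom is easiest to see side by side. A minimal sketch, with an illustrative variable c:

```fortran
program length_demo
    implicit none
    character (len=12) :: c

    c = '  hello'   ! Two leading blanks, 'hello', five trailing blanks

    write (*,*) len(c)                 ! 12: the declared length
    write (*,*) len_trim(c)            ! 7: trailing blanks ignored
    write (*,*) len_trim(adjustl(c))   ! 5: leading blanks ignored as well
end program length_demo
```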

new_line

new_line(c) is a character function that returns the new line character for the current processor. The kind of the returned character will be the same as the kind of c. A blank character may be returned if the character kind from which c is drawn does not contain a relevant newline character. This function is not likely to be used except in some very specific circumstances.

repeat

repeat(string, ncopies) concatenates integer ncopies of the string. Hence repeat('=',72) is a string of 72 equals signs. String must be scalar but can be of any length. Trailing blanks in string are included in the result.
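
For instance, the following sketch draws a rule and shows that a trailing blank in the argument is repeated along with the other characters:

```fortran
program repeat_demo
    implicit none

    write (*,'(a)') repeat('=', 20)           ! A rule of 20 equals signs
    write (*,'(a)') repeat('ab ', 3) // '|'   ! ab ab ab |
    write (*,*) len(repeat('ab ', 3))         ! 9: the blanks are counted
end program repeat_demo
```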

scan

scan(string, set, back, kind) returns a default integer (or an integer of the optional kind) giving the first position in string at which any character of set appears, or 0 if none of them appears. To search from right to left, the optional logical back must be set true. string can be an array, in which case the result is an integer array. If string is an array then set can be an array of the same size and shape as string, and each element of set is scanned for in the corresponding element of string. index, described above, differs from scan in that every character of substring must be found, contiguously and in order, whereas scan matches any single character of set.

selected_char_kind

selected_char_kind(name) is an integer function that returns the kind value of the character set named. The only set that must be supported by the language standard is name='DEFAULT'. If name is not supported the result is -1.

trim

trim(string) is a character valued function that returns a string with the trailing blanks removed. If string is all blanks the result has zero length.

verify

verify(string, set, back, kind) is an integer function that returns the position of the first character in string that is not in set. So verify is roughly the obverse of scan. In verify back and kind are both optional and have the same role as described in scan above. If every character in string is also in set (or string has zero length), then the function returns 0.
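
The three search functions index, scan and verify can be compared on one string. This is a minimal sketch; the sample text and the name s are illustrative.

```fortran
program search_demo
    implicit none
    ! Positions:                         12345678901
    character (len=*), parameter :: s = 'x = 12.5 mm'

    write (*,*) index(s, '12')               ! 5: start of the substring '12'
    write (*,*) index(s, 'm', back=.true.)   ! 11: last 'm', searching from the right
    write (*,*) index(s, 'cm')               ! 0: substring not present
    write (*,*) scan(s, '0123456789')        ! 5: first character that is a digit
    write (*,*) verify(s, 'x= ')             ! 5: first character not in the set
end program search_demo
```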

Regular expressions

Fortran does not have any language-defined regex or sorting capability for character data. Fortran does not have a language-defined text tokenizer but, with a little ingenuity, list directed input can provide a partial solution. However, there are Fortran libraries that wrap C regex libraries.
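
One way the list-directed approach might look is sketched below. It assumes the number of tokens is known in advance (three here), that tokens are separated by blanks or commas, and that they contain no slashes or quote characters, since list-directed input treats those specially. All names are illustrative.

```fortran
program tokenize_demo
    implicit none
    character (len=40) :: line
    character (len=16) :: tokens(3)
    integer :: i, ios

    line = 'alpha, beta  gamma'

    ! A list-directed read from the character variable splits on blanks and commas
    read (line, *, iostat=ios) tokens

    if (ios /= 0) then
        write (*,*) 'Could not read 3 tokens from the line'
    else
        do i = 1, size(tokens)
            write (*,'(3a)') '[', trim(tokens(i)), ']'
        end do
    end if
end program tokenize_demo
```

This is only a partial solution, as the text says: a line with fewer tokens than expected sets a non-zero iostat value, and tokens longer than 16 characters are truncated.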

I/O of character data

read formatting

read for character data can be list-directed or formatted using the "a" or "an" forms of the a edit descriptor. In the "a" form, the width is taken from the length of the corresponding item in the list. In the "an" form, the integer n specifies the number of characters to transfer. The general edit descriptor "gn" can also be used.

Example

character (120) :: line
open (unit=10, file="test.dat")  ! Only the unit may be positional; the file name needs file=
read (10,'(a)') line        ! Read up to 120 characters into line
read (10,'(a5)') line(116:) ! Read 5 characters and put them at the end of line

write formatting

The a and g edit descriptors exist for write as described above. The "a" form will write the whole character variable including all the trailing blanks so it is common to use trim or adjustl or both.

Example

character (len=512) :: line
!...
write (10,'(a)') trim(adjustl(line))

Internal Read and Write

Fortran has many hidden secrets and one of the most useful is that read and write statements can be used on character variables as if they were files. This explains the otherwise mystifying lack of functions to convert numbers to strings and vice versa. The character variable is treated as an 'internal file'.

Example

character (120) :: text_in, text_out
integer :: i
real :: x
!...
write (text_in,'(A,I0)') 'i = ', i  ! Formatted
!...
read (text_out,*) x  ! List-directed

In addition to type conversion, this internal read/write can be used as a very flexible and robust method of reading files where the contents may be of uncertain format. The external file is read line by line into a character variable, scan and verify can be used on the line to determine what is present, and then an internal file read is done on the character variable to convert to real, integer, complex etc. as appropriate.
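
A sketch of both directions of conversion follows. The names and the character set passed to verify are illustrative assumptions; passing the verify check does not guarantee a valid number, so the iostat value of the internal read is still tested.

```fortran
program convert_demo
    implicit none
    character (len=20) :: text, field
    real :: x
    integer :: ios

    ! Number to string: write into the character variable
    write (text, '(i0)') 42
    write (*,'(3a)') '[', trim(text), ']'   ! [42]

    ! String to number: screen the field first, then read from it
    field = '3.75'
    if (verify(trim(field), '0123456789+-.eEdD') == 0) then
        read (field, *, iostat=ios) x
        if (ios == 0) then
            write (*,*) 'value = ', x
        else
            write (*,*) 'Conversion failed for: ', trim(field)
        end if
    else
        write (*,*) 'Not a number: ', trim(field)
    end if
end program convert_demo
```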

Recent Extensions

character(:), allocatable

The length of a character scalar can be deferred by declaring it with a length of : and the allocatable attribute, so that it need not be declared with a specific length. The resulting scalar can then be explicitly allocated, or it can be allocated automatically on assignment, as shown in the following example.

Example

character (:), allocatable :: string
!...
string = 'abcdef'
!...
string = '1234567890'
!...
string = trim(line)
!...

It is even possible to declare an array of deferred-length elements, as illustrated below.

Example

character (:), dimension (:), allocatable :: strings

However, this feature should be used carefully and some restrictions apply; in particular, all elements of the array have the same length.
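
As a sketch of how such an array might be allocated (the names and sizes are illustrative), the element length is supplied as a type specification in the allocate statement and is then shared by every element:

```fortran
program deferred_array
    implicit none
    character (len=:), dimension (:), allocatable :: strings

    ! Three elements, each 8 characters long
    allocate (character (len=8) :: strings(3))

    strings(1) = 'one'
    strings(2) = 'two'
    strings(3) = 'three'   ! Shorter values are padded with blanks

    write (*,*) len(strings), size(strings)   ! 8 3
end program deferred_array
```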

Actual/Dummy arguments of type character

It is frequently the case that a procedure is written with a character dummy argument whose length is not known in advance. Modern Fortran allows dummy arguments to be declared with assumed length using len=*. Functions of type character can be written so that the result has a length related to the lengths of the dummy arguments.

Example

call this('Hello')
call this('Goodbye')
!...
subroutine this(string)
    implicit none
    character (len=*), intent (in) :: string
    character (len=len(string)+5)  :: temp
    !...
end subroutine

In the above example, the character variable temp is declared to have 5 more characters than string, no matter how long the actual argument is. In the next example, a function returns a string, the length of which is related to the length of one or more arguments.

Example

string = that('thing', 7)
!...
function that(in_string, n) result (out_string)
    implicit none
    character (len=*), intent (in)    :: in_string
    integer, intent(in)               :: n
    character (len=len(in_string)*n)  :: out_string
    !...
end function
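
A complete, runnable version of this pattern might look as follows. The body of the function is hypothetical, since the fragment above elides it: here it is assumed that the function simply copies in_string n times into the result.

```fortran
program assumed_length_demo
    implicit none

    write (*,'(a)') that('ab', 3)     ! ababab
    write (*,*) len(that('thing', 7)) ! 35: len('thing') * 7

contains

    function that(in_string, n) result (out_string)
        character (len=*), intent (in)    :: in_string
        integer, intent (in)              :: n
        character (len=len(in_string)*n)  :: out_string
        integer :: i, w

        ! Hypothetical body: fill each slot of width w with a copy of in_string
        w = len(in_string)
        do i = 1, n
            out_string((i-1)*w+1 : i*w) = in_string
        end do
    end function that

end program assumed_length_demo
```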

In circumstances where the character function has to return a string and the length of this string is not simply related to the inputs, the deferred-length, allocatable form described above can be used, and is illustrated in the case conversion examples below.

character parameters

character parameters can be declared without explicitly stating the length, for example:

character (*), parameter :: place = 'COEFF_LIST_initialise'

Approaches to Case Conversion

Here are some further examples of the ideas above, but directed to the case conversion for languages where case conversion as a concept exists. In the first example, the ASCII character set functions iachar and achar are used to check each character in a string consecutively.

Example

function up_case(in) result (out)
    implicit none
    character (*), intent (in) :: in
    character (:), allocatable :: out
    integer                    :: i, j

    out = in                           ! Copy the whole string
    do i = 1, len_trim(out)            ! Each character
        j = iachar(out(i:i))           ! Get the ASCII position
        select case (j)
            case (97:122)              ! The lower case characters
                out(i:i) = achar(j-32) ! Offset to the upper case
        end select
    end do
end function up_case

An alternative approach that does not rely on the ASCII representation function could be as follows:

Example

function to_upper(in) result (out)
    implicit none
    character (*), intent (in) :: in
    character (:), allocatable :: out
    integer                    :: i, j
    character (*), parameter   :: upp = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    character (*), parameter   :: low = 'abcdefghijklmnopqrstuvwxyz'

    out = in                          ! Transfer all characters
    do i = 1, len_trim(out)           ! Up to the last non-blank
        j = index(low, out(i:i))      ! Is ith character in low
        if (j>0) out(i:i) = upp(j:j)  ! Yes, then subst with upp
    end do
end function to_upper

Which routine is quicker will depend on the relative speed of the index and iachar intrinsics. In one informal test, the first method above was slightly more than twice as fast as the second, but this will vary from vendor to vendor.