R Programming/Text Processing

From Wikibooks, open books for an open world

This page covers the functions you need to work with strings in R. The section on regular expressions helps in understanding the rest of the page, but you can skip it if you only need to perform simple tasks.

This page may be useful if you want to:

  • perform statistical text analysis.
  • collect data from an unformatted text file.
  • deal with character variables.

In this page, we learn how to read a text file and how to use R functions for character data. These functions fall into two kinds: simple functions that work on literal strings, and functions that use regular expressions. Many of them are part of the standard R base package.

help.search(keyword = "character", package = "base")

However, their names and syntax are not intuitive to all users. Hadley Wickham has developed the stringr package, which defines functions with similar behaviour but with names that are easier to remember and a much more systematic syntax[1].

  • Keywords: text mining, natural language processing
  • See CRAN Task view on Natural Language Processing[2]
  • See also the following packages tm, tau, languageR, scrapeR.


Reading and writing text files

R can read any text file using readLines() or scan(). readLines() reads the file line by line into a character vector, and its encoding argument lets you declare the encoding of the imported text. scan() is more flexible: the kind of data expected is specified in its second argument (e.g., character(0) for strings), and the sep argument controls where the input is split.

text <- readLines("file.txt",encoding="UTF-8")
scan("file.txt", character(0)) # separate each word
scan("file.txt", character(0), quote = NULL) # do not treat quote characters specially
scan("file.txt", character(0), sep = ".") # separate each sentence
scan("file.txt", character(0), sep = "\n") # separate each line

We can write the content of an R object into a text file using cat() or writeLines(). By default, cat() separates the elements of a vector with a single space when writing them to the file. You can change this with the options sep="\n" or fill=TRUE. The default encoding depends on your computer.

cat(text,file="file.txt",sep="\n")
writeLines(text, con = "file.txt", sep = "\n", useBytes = FALSE)
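As a minimal sketch of the round trip described above, the following writes a character vector to a temporary file and reads it back with both readLines() and scan(). The file path and the sample text are assumptions made for illustration only.

```r
# Write two lines to a scratch file (hypothetical path created by tempfile()).
path <- tempfile(fileext = ".txt")
lines_out <- c("First line.", "Second line, with more words.")
writeLines(lines_out, con = path)

# readLines() returns one element per line.
lines_in <- readLines(path, encoding = "UTF-8")

# scan() with character(0) returns one element per whitespace-separated word.
words <- scan(path, what = character(0), quiet = TRUE)

unlink(path)  # remove the scratch file
print(lines_in)
print(words)
```

Using tempfile() keeps the example from overwriting a real file.txt in the working directory.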

Before reading a text file, you can look at its properties. nlines() (parser package) and countLines() (R.utils package) count the number of lines in the file. count.chars() (parser package) counts the number of bytes and characters in each line of a file. You can also display a text file using file.show().

Character encoding

R provides functions to deal with a variety of encoding schemes. This is useful if you work with text files created on another operating system, especially if the language is not English and uses many accented or special characters. For instance, the standard encoding on Linux is "UTF-8", whereas Windows has traditionally used a locale-specific encoding such as "latin1" for Western European languages. The Encoding() function returns the declared encoding of a string. iconv() is similar to the Unix command iconv and converts a string from one encoding to another.

  • iconvlist() lists the encoding schemes available on your computer.
  • readLines(), scan() and file.show() also have an encoding option.
  • is.utf8() (tau) tests if the encoding is "utf8".
  • is.locale() (tau) tests if the encoding is the same as the default encoding on your computer.
  • translate() (tau) translates the encoding into the current locale.
  • fromUTF8() (descr) is less general than iconv().
  • utf8ToInt() (base) converts a UTF-8 encoded string to a vector of integer code points.

Example

The following example was run under Windows. Thus, the default encoding is "latin1".

> texte <- "Hé hé"
> Encoding(texte)
[1] "latin1"
> texte2 <-  iconv(texte,"latin1","UTF-8")
> Encoding(texte2)
[1] "UTF-8"
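The sketch below shows what a conversion changes at the byte level. It is written with \u escapes so that it does not depend on the encoding of the script file; the variable names are illustrative assumptions.

```r
# "Hé hé" written with Unicode escapes; R marks such strings as UTF-8.
x <- "H\u00e9 h\u00e9"
Encoding(x)

# Convert to latin1 and compare the storage size.
x_latin1 <- iconv(x, from = "UTF-8", to = "latin1")
bytes_utf8   <- nchar(x, type = "bytes")         # each é takes two bytes in UTF-8
bytes_latin1 <- nchar(x_latin1, type = "bytes")  # one byte per character in latin1

# Converting back recovers the original string.
x_back <- iconv(x_latin1, from = "latin1", to = "UTF-8")

# Code point of é.
code_point <- utf8ToInt("\u00e9")
```

The number of characters is the same in both encodings; only the number of bytes differs.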

Regular Expressions

A regular expression describes a pattern that a set of strings may match. For instance, one could have the following pattern: 2 digits, 2 letters and 4 digits. R provides powerful functions to deal with regular expressions. Two types of regular expressions are used in R[3]:

  • extended regular expressions, used by ‘perl = FALSE’ (the default),
  • Perl-like regular expressions used by ‘perl = TRUE’.

There is also an option called ‘fixed = TRUE’, which treats the pattern as a literal string rather than as a regular expression. fixed() (stringr) is equivalent to fixed=TRUE in the standard regex functions. These functions are case sensitive by default. This can be changed by specifying the option ignore.case = TRUE.

If you are not a specialist in regular expressions, you may find glob2rx() useful. This function translates a "glob" (or "wildcard") pattern into the corresponding regular expression:

> glob2rx("abc.*")
[1] "^abc\\."
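As a short illustration, the pattern returned by glob2rx() can be passed directly to grepl() to filter a vector of file names. The file names below are made up for the example.

```r
# Hypothetical file names.
files <- c("report.txt", "data.csv", "notes.txt", "txt_summary.doc")

# Translate the wildcard pattern "*.txt" into a regular expression.
pattern <- utils::glob2rx("*.txt")
print(pattern)

# Keep only the names that match.
is_txt <- grepl(pattern, files)
txt_files <- files[is_txt]
print(txt_files)
```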

Functions which use regular expressions in R

  • sub(), gsub(), str_replace() (stringr) make substitutions in a string.
  • grep(), str_extract() (stringr) extract matching values.
  • grepl(), str_detect() (stringr) detect the presence of a pattern.
  • See also splitByPattern() (R.utils).
  • See also gsubfn() in the gsubfn package.

Extended regular expressions (The default)

  • "." stands for any character.
  • "[ABC]" means A,B or C.
  • "[A-Z]" means any upper-case letter between A and Z.
  • "[0-9]" means any digit between 0 and 9.

Here is the list of metacharacters: ‘$ * + . ? [ ] ^ { } | ( ) \’. To match one of these characters literally, precede it with a doubled backslash in the R string (for instance, "\\." matches a literal dot).
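The effect of escaping a metacharacter can be checked with a small test vector. This is only a sketch; the two strings are chosen for illustration.

```r
x <- c("a.b", "axb")

# Unescaped, "." matches any character, so both strings match.
any_char <- grepl("a.b", x)

# Escaped with a doubled backslash, only the literal dot matches.
escaped <- grepl("a\\.b", x)

# fixed = TRUE treats the whole pattern literally, with no escaping needed.
literal <- grepl("a.b", x, fixed = TRUE)
```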

Here are some predefined character classes. They must be used inside a bracket expression, for instance "[[:digit:]]".

For numbers:

  • ‘[:digit:]’ Digits: ‘0 1 2 3 4 5 6 7 8 9’.

For letters:

  • ‘[:alpha:]’ Alphabetic characters: ‘[:lower:]’ and ‘[:upper:]’.
  • ‘[:upper:]’ Upper-case letters.
  • ‘[:lower:]’ Lower-case letters.

Note that the set of alphabetic characters includes accented letters such as é è ê, which are very common in some languages like French. Therefore, it is more general than "[A-Za-z]", which does not include accented letters.

For other characters:

  • ‘[:punct:]’ Punctuation characters: ‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~’.
  • ‘[:space:]’ Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
  • ‘[:blank:]’ Blank characters: space and tab.
  • ‘[:cntrl:]’ Control characters.

For combinations of other classes:

  • ‘[:alnum:]’ Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’.
  • ‘[:graph:]’ Graphical characters: ‘[:alnum:]’ and ‘[:punct:]’.
  • ‘[:print:]’ Printable characters: ‘[:alnum:]’, ‘[:punct:]’ and space.
  • ‘[:xdigit:]’ Hexadecimal digits: ‘0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f’.

You can specify the number of repetitions by adding one of the following quantifiers after the regular expression:

  • ‘?’ The preceding item is optional and will be matched at most once.
  • ‘*’ The preceding item will be matched zero or more times.
  • ‘+’ The preceding item will be matched one or more times.
  • ‘{n}’ The preceding item is matched exactly ‘n’ times.
  • ‘{n,}’ The preceding item is matched ‘n’ or more times.
  • ‘{n,m}’ The preceding item is matched at least ‘n’ times, but not more than ‘m’ times.

Anchors fix the position of the match:

  • ‘^’ forces the match to start at the beginning of the string.
  • ‘$’ forces the match to end at the end of the string.
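As a sketch, the pattern mentioned at the start of this section (2 digits, 2 letters and 4 digits) can be built from the classes and quantifiers above. The test strings are invented for the example.

```r
# Hypothetical codes to test.
codes <- c("12AB3456", "1AB3456", "x12AB3456", "12ab34567")

# Without anchors, the pattern may match anywhere inside the string.
loose <- "[[:digit:]]{2}[[:alpha:]]{2}[[:digit:]]{4}"
loose_match <- grepl(loose, codes)

# With ^ and $, the whole string must consist of the pattern.
strict <- paste0("^", loose, "$")
strict_match <- grepl(strict, codes)
```

Only the first code satisfies the anchored pattern; the third and fourth contain a valid code but have extra characters around it.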

If you want to know more, have a look at the following two help files:

> ?regexp # gives some general explanations
> ?grep # help file for grep(), regexpr(), sub(), gsub(), etc.

Perl-like regular expressions

It is also possible to use "perl-like" regular expressions. You just need to use the option perl=TRUE.

Examples

If you want to remove all space characters in a string, you can use the \\s Perl macro with gsub(). Note that sub() would remove only the first one.

gsub('\\s', '', x, perl = TRUE)
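The following sketch compares sub() and gsub() with the \\s macro, and adds one construct (a look-behind) that, as far as we know, is only available with perl = TRUE. The input strings are illustrative assumptions.

```r
x <- "a  b \t c"   # two spaces, then space-tab-space

# sub() removes only the first whitespace character.
first_only <- sub("\\s", "", x, perl = TRUE)

# gsub() removes every whitespace character.
all_removed <- gsub("\\s", "", x, perl = TRUE)

# Look-behind: replace "b" only when it is preceded by "a".
lookbehind <- gsub("(?<=a)b", "X", "abab", perl = TRUE)
```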

See also

Concatenating strings

  • paste() concatenates strings.
  • str_c() (stringr) does a similar job.
  • cat() prints and concatenates strings.

Examples

> paste("toto","tata",sep=' ')
[1] "toto tata"
> paste("toto","tata",sep=",")
[1] "toto,tata"
> str_c("toto","tata",sep=",")
[1] "toto,tata"
> x <- c("a","b","c")
> paste(x,collapse=" ")
[1] "a b c"
> str_c(x, collapse = " ")
[1] "a b c"
> cat(c("a","b","c"), sep = "+")
a+b+c
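paste() is vectorised, which the examples above do not show. A brief sketch with base R only (the identifiers are illustrative):

```r
# paste0() is paste() with sep = "".
ids <- paste0("id_", 1:3)

# Element-wise concatenation of two vectors.
pairs <- paste(c("a", "b"), c("x", "y"), sep = "-")

# sep joins the arguments, collapse then joins the resulting elements.
joined <- paste(c("a", "b"), c("x", "y"), sep = "-", collapse = "+")
```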

Splitting a string

  • strsplit() : Split the elements of a character vector ‘x’ into substrings according to the matches to substring ‘split’ within them.
  • See also str_split() (stringr).
> unlist(strsplit("a.b.c", "\\."))
[1] "a" "b" "c"
  • tokenize() (tau) splits a string into tokens.
> tokenize("abc defghk")
[1] "abc"    " "      "defghk"
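strsplit() returns a list with one element per input string, which is why the example above calls unlist(). A short sketch of handling that list (the input values are made up):

```r
x <- c("a.b.c", "d.e")

# fixed = TRUE splits on the literal dot, so no escaping is needed.
parts <- strsplit(x, ".", fixed = TRUE)
n_parts <- lengths(parts)          # number of pieces per input string
first_piece <- sapply(parts, `[`, 1)

# The split argument is a regular expression by default:
# here, split on a hyphen, a space or a colon.
fields <- strsplit("2000-01-31 12:30", "[- :]")[[1]]
```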

Counting the number of characters in a string

  • nchar() gives the length of a string. Note that for non-ASCII encodings, there is more than one way to measure such a length (for instance in characters or in bytes).
  • See also str_length() (stringr)
> nchar("abcdef")
[1] 6
> nchar(NA)
[1] NA
> nchar("René")
[1] 4
> nchar("René", type = "bytes")
[1] 5

Detecting the presence of a substring

Detecting a pattern in a string?

  • grepl() returns a logical value (TRUE or FALSE) for each element of its input.
  • str_detect() (stringr) does a similar job.
> string <- "23 mai 2000"
> string2 <- "1 mai 2000"
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> grepl(pattern = regexp, x = string)
[1] TRUE
> str_detect(string, regexp)
[1] TRUE
> grepl(pattern = regexp, x = string2)
[1] FALSE

The first call returns TRUE. The second returns FALSE because string2 starts with a single digit, whereas the pattern requires two.
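Because grepl() is vectorised, it can be used to filter a character vector. The sketch below also shows the effect of ignore.case; the vector of dates is an assumption made for the example.

```r
dates <- c("23 mai 2000", "1 mai 2000", "blabla 18 MAI 2004")
regexp <- "[[:digit:]]{2} [[:alpha:]]+ [[:digit:]]{4}"

# One logical value per element.
has_date <- grepl(regexp, dates)
kept <- dates[has_date]

# Matching is case sensitive unless ignore.case = TRUE.
case_sensitive   <- grepl("mai", dates)
case_insensitive <- grepl("mai", dates, ignore.case = TRUE)
```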

Counting the occurrence of each pattern in a string?

  • textcnt() (tau) counts the occurrence of each pattern or each term in a text.
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> textcnt(string,n=1L,method="string")
blabla    mai 
     2      2 
attr(,"class")
[1] "textcnt"
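If the tau package is not available, a similar count can be obtained with base R. This is a different technique from textcnt(), shown only as a sketch: it counts every space-separated token, including the numbers.

```r
string <- "blabla 23 mai 2000 blabla 18 mai 2004"

# Count every token with table().
words <- strsplit(string, " ", fixed = TRUE)[[1]]
counts <- table(words)
print(counts)

# Count the occurrences of one given substring with gregexpr().
m <- gregexpr("mai", string, fixed = TRUE)[[1]]
n_mai <- if (m[1] == -1) 0L else length(m)   # gregexpr() returns -1 when there is no match
```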

Extracting the position of a substring or a pattern in a string

Extracting the position of a substring?

  • cpos() (cwhmisc) returns the position of a substring in a string.
  • substring.location() (cwhmisc) does the same job but returns the first and the last position.
 
> cpos("abcdefghijklmnopqrstuvwxyz","p",start=1)
[1] 16
> substring.location("abcdefghijklmnopqrstuvwxyz","def")
$first
[1] 4

$last
[1] 6

Extracting the position of a pattern in a string?

  • regexpr() returns the starting position and the length of the first match of the regular expression.
  • gregexpr() is similar to regexpr() but returns the starting position of every match.
  • str_locate() (stringr) returns the start and end positions of the first match, and str_locate_all() (stringr) those of every match.
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> regexpr(pattern = regexp, text = string)
[1] 8
attr(,"match.length")
[1] 11
> gregexpr(pattern = regexp, text = string)
[[1]]
[1]  8 27
attr(,"match.length")
[1] 11 11
> str_locate(string,regexp)
     start end
[1,]     8  18
> str_locate_all(string,regexp)
[[1]]
     start end
[1,]     8  18
[2,]    27  37
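The start and end positions reported by str_locate_all() can also be derived from the output of gregexpr(), since the end is the start plus the match length minus one. A minimal base-R sketch:

```r
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
string <- "blabla 23 mai 2000 blabla 18 mai 2004"

m <- gregexpr(pattern = regexp, text = string)[[1]]
start <- as.integer(m)
end <- start + attr(m, "match.length") - 1L
positions <- cbind(start, end)
print(positions)

# The positions can be fed to substring() to recover the matched text.
matched <- substring(string, start, end)
```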

Extracting a substring from a string

Extracting a fixed width substring?

  • substr() extracts the substring between a given start and stop position.
  • str_sub() (stringr) is similar.
> substr("simple text",1,3)
[1] "sim"
> str_sub("simple text",1,3)
[1] "sim"
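A few further uses of substr() and its vectorised companion substring(), as a sketch with base R only:

```r
x <- "simple text"

# Last four characters, using nchar() to find the end.
tail4 <- substr(x, nchar(x) - 3, nchar(x))

# substring() accepts vectors of start and stop positions.
windows <- substring(x, 1:3, 3:5)

# substr() can also be used on the left-hand side to replace characters in place.
y <- x
substr(y, 1, 1) <- "S"
```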

Extracting the first word in a string?

  • first.word() (Hmisc) returns the first word of a string or expression.
> first.word("abc def ghk")
[1] "abc"

Extracting a pattern in a string?

  • grep() returns the elements of the vector that match the regular expression if value=TRUE and their indices if value=FALSE.
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "23 mai 2000"
> string2 <- "1 mai 2000"
> grep(pattern = regexp, x = string, value = TRUE)
[1] "23 mai 2000"
> grep(pattern = regexp, x = string2, value = TRUE)
character(0)
> grep(pattern = regexp, x = string, value = FALSE)
[1] 1
> grep(pattern = regexp, x = string2, value = FALSE)
integer(0)
  • str_extract(), str_extract_all(), str_match(), str_match_all() (stringr) and m() (caroline package) are similar to grep(). str_extract() and str_extract_all() return a vector. str_match() and str_match_all() return a matrix and m() a dataframe.
> library("stringr")
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> str_extract(string,regexp)
[1] "23 mai 2000"
> str_extract_all(string,regexp)
[[1]]
[1] "23 mai 2000" "18 mai 2004"

> str_match(string,regexp)
     [,1]          [,2] [,3]  [,4]  
[1,] "23 mai 2000" "23" "mai" "2000"
> str_match_all(string,regexp)
[[1]]
     [,1]          [,2] [,3]  [,4]  
[1,] "23 mai 2000" "23" "mai" "2000"
[2,] "18 mai 2004" "18" "mai" "2004"
> library("caroline")
> m(pattern = regexp, vect = string, names = c("day","month","year"), types = rep("character",3))
  day month year
1  18   mai 2004
  • Named capture regular expressions can be used to define column names in the regular expression (this also serves to document the regular expression). Install the namedCapture package via devtools::install_github("tdhock/namedCapture") to use str_match_all_named(). It uses the base function gregexpr(perl=TRUE) to parse a Perl-Compatible Regular Expression, and returns a list of match matrices with column names:
> named.regexp <- paste0(
+   "(?<day>[[:digit:]]{2})",
+   " ",
+   "(?<month>[[:alpha:]]+)",
+   " ",
+   "(?<year>[[:digit:]]{4})")
> namedCapture::str_match_all_named(string, named.regexp)
[[1]]
     day  month year  
[1,] "23" "mai" "2000"
[2,] "18" "mai" "2004"
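Base R can perform similar extractions without additional packages by combining regmatches() with gregexpr() (all matches) or regexec() (first match with its capture groups). This is a sketch of an alternative, not a replacement for the functions above.

```r
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
string <- "blabla 23 mai 2000 blabla 18 mai 2004"

# Every match, comparable to str_extract_all().
all_matches <- regmatches(string, gregexpr(regexp, string))[[1]]

# First match followed by its capture groups, comparable to str_match().
first_groups <- regmatches(string, regexec(regexp, string))[[1]]
```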

Making some substitution inside a string

Substituting a pattern in a string

  • sub() replaces the first occurrence of a pattern.
  • gsub() is similar to sub() but replaces all occurrences of the pattern.
  • str_replace() (stringr) is similar to sub(), and str_replace_all() (stringr) is similar to gsub().

In the following example, we have a French date. The pattern is the following: 2 digits, a blank, some letters, a blank, 4 digits. We capture the 2 digits with the [[:digit:]]{2} expression, the letters with [[:alpha:]]+ and the 4 digits with [[:digit:]]{4}. Each of these three sub-expressions is surrounded by parentheses. In the replacement, the first captured substring is referred to as "\\1", the second one as "\\2" and the third one as "\\3".

string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
sub(pattern = regexp, replacement = "\\1", x = string) # returns the first part of the regular expression
sub(pattern = regexp, replacement = "\\2", x = string) # returns the second part
sub(pattern = regexp, replacement = "\\3", x = string) # returns the third part
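Back-references can also be combined in a single replacement, for instance to reorder the parts of the date. A minimal sketch using the same pattern:

```r
string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"

# Extract a single part.
day <- sub(pattern = regexp, replacement = "\\1", x = string)

# Reorder the three parts as year-month-day.
reordered <- sub(pattern = regexp, replacement = "\\3-\\2-\\1", x = string)

# Wrap every run of letters in angle brackets.
tagged <- gsub("([[:alpha:]]+)", "<\\1>", string)
```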

In the following example, we compare the outcome of sub() and gsub(). The first one removes the first space whereas the second one removes all spaces in the text.

> text <- "abc def ghk"
> sub(pattern = " ", replacement = "",  x = text)
[1] "abcdef ghk"
> gsub(pattern = " ", replacement = "",  x = text)
[1] "abcdefghk"

Substituting characters in a string?

  • chartr() substitutes characters in an expression. It stands for "character translation".
  • replacechar() (cwhmisc) does the same job ...
  • as well as str_replace_all() (stringr).
> chartr(old="a",new="o",x="baba")
[1] "bobo"
> chartr(old="ab",new="ot",x="baba")
[1] "toto"
> replacechar("abc.def.ghi.jkl",".","_")
[1] "abc_def_ghi_jkl"
> str_replace_all("abc.def.ghi.jkl","\\.","_")
[1] "abc_def_ghi_jkl"

Converting letters to lower or upper-case

  • tolower() converts upper-case characters to lower-case.
  • toupper() converts lower-case characters to upper-case.
  • capitalize() (Hmisc) capitalizes the first letter of a string.
  • See also cap(), capitalize(), lower(), lowerize() and CapLeading() in the cwhmisc package.
> tolower("ABCdef")
[1] "abcdef"
> toupper("ABCdef")
[1] "ABCDEF"
> capitalize("abcdef")
[1] "Abcdef"

Filling a string with some character

  • padding() (cwhmisc) fills a string with some characters to fit a given length. See also str_pad() (stringr).
> library("cwhmisc")
> padding("abc",10," ","center") # adds blanks such that the length of the string is 10.
[1] "   abc    "
> str_pad("abc",width=10,side="center", pad = "+")
[1] "+++abc++++"
> str_pad(c("1","11","111","1111"),3,side="left",pad="0") 
[1] "001"  "011"  "111"  "1111"

Note that str_pad() was very slow in the version of stringr used for the timings below: for a vector of length 10,000, the computing time is very long. padding() does not seem to handle character vectors, but a workable solution is to combine the sapply() and padding() functions.

> library("stringr")
> library("cwhmisc")
> a <- rep(1,10^4)
> system.time(b <- str_pad(a,3,side="left",pad="0"))
   user  system elapsed 
 50.968   0.208  73.322 
> system.time(c <- sapply(a, padding, space = 3, with = "0", to = "left"))
   user  system elapsed 
  7.700   0.020  12.206
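For the common case of padding numbers with leading zeros, base R offers sprintf() and formatC(), which are vectorised and need no extra package. This is a sketch of an alternative technique, not a benchmark against the functions above.

```r
n <- c(1L, 11L, 111L, 1111L)

# "%03d": integer, padded with zeros to a minimum width of 3.
padded_sprintf <- sprintf("%03d", n)

# formatC() gives the same result with width and flag arguments.
padded_formatC <- formatC(n, width = 3, flag = "0")

# Strings can be padded with blanks: right-justified by default,
# left-justified with flag = "-".
right <- formatC("abc", width = 6)
left  <- formatC("abc", width = 6, flag = "-")
```

As with str_pad(), values longer than the requested width are left unchanged.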

Removing leading and trailing spaces

  • trimws() (memisc package) trims leading and trailing white spaces. A function of the same name is also part of base R since version 3.2.0.
  • trim() (gdata package) does the same job.
  • See also str_trim() (stringr).
> library("memisc")
> trimws("  abc def   ")
[1] "abc def" 
> library("gdata")
> trim(" abc def ")
[1] "abc def"
> str_trim("  abd def  ")
[1] "abd def"
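With a recent version of R, the base function trimws() can be used without loading any package; its which argument selects the side to trim. The sketch below also shows an equivalent gsub() call.

```r
x <- "  abc def   "

both  <- trimws(x)                   # default: which = "both"
left  <- trimws(x, which = "left")   # leading spaces only
right <- trimws(x, which = "right")  # trailing spaces only

# Same result as trimws(x) with a regular expression.
with_gsub <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)
```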

Comparing two strings

Assessing if they are identical

  • == returns TRUE if both strings are the same and FALSE otherwise.
> "abc"=="abc"
[1] TRUE
> "abc"=="abd"
[1] FALSE

Computing distance between strings

A few packages implement the Levenshtein distance between two strings:

  • adist() in base package utils
  • stringMatch() in MiscPsycho
  • stringdist() in stringdist
  • levenshteinDist() in RecordLinkage

A benchmark comparing the speed of levenshteinDist() and stringdist() is available here: [1].

Example with utils

> adist("test","tester")
     [,1]
[1,]    2
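adist() returns a matrix, so it can compare all pairs of strings in a vector at once. A short sketch with made-up words:

```r
words <- c("test", "tester", "text")

# Pairwise distance matrix: d[i, j] is the distance between words[i] and words[j].
d <- utils::adist(words)
print(d)

# Case can be ignored when computing the distance.
d_case <- utils::adist("Test", "test", ignore.case = TRUE)
```

Here "test" and "text" differ by one substitution, while turning "tester" into "text" takes one substitution and two deletions.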

Example with MiscPsycho

stringMatch() (MiscPsycho) computes the Levenshtein distance between two strings. If normalize="YES", the distance is divided by the length of the longer string.

> library("MiscPsycho")
> stringMatch("test","tester",normalize="NO",penalty=1,case.sensitive = TRUE)
[1] 2

Approximate matching

agrep() searches for approximate matches using the Levenshtein distance.

  • If 'value = TRUE', this returns the matching strings.
  • If 'value = FALSE', this returns the positions of the matching strings.
  • max (short for max.distance) sets the maximal Levenshtein distance allowed for a match.
>  agrep(pattern = "laysy", x = c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 lazy"
>  agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 3, value = TRUE)
[1] "1 lazy"

Miscellaneous

  • deparse() : Turn unevaluated expressions into character strings.
  • char.expand() (base) expands a string with respect to a target.
  • pmatch() (base) and charmatch() (base) seek matches for the elements of their first argument among those of their second.
> pmatch(c("a","b","c","d"),table = c("b","c"), nomatch = 0)
[1] 0 1 2 0
  • make.unique() makes the elements of a character vector unique by appending a sequence number to duplicates. This is useful if you want to use a string as an identifier in your data.
> make.unique(c("a", "a", "a"))
[1] "a"   "a.1" "a.2"

References

  1. Hadley Wickham "stringr: modern, consistent string processing" The R Journal, December 2010, Vol 2/2, http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
  2. http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
  3. In former versions (< 2.10) we had also basic regular expressions in R :
    • extended regular expressions, used by extended = TRUE (the default),
    • basic regular expressions, as used by extended = FALSE (obsolete in R 2.10).
    Since basic regular expressions (‘extended = FALSE’) are now obsolete, the extended option is obsolete in version 2.11.