Introducing Julia/Strings and characters

From Wikibooks, open books for an open world

Strings and characters

Strings

A string is a sequence of zero or more characters, usually written enclosed in double quotes:

"this is a string"

There are two important things you need to know about strings.

One is that they're immutable. You can't change them once they're created. But it's easy to make new strings from parts of existing ones.

The second is that you have to be careful when using two specific characters: double quotes (") and dollar signs ($). If you want to include a double quote character in the string, precede it with a backslash; otherwise the string ends at that quote and the rest is interpreted as Julia code, with potentially interesting results. And if you want to include a dollar sign ($) in a string, that should also be prefaced by a backslash, because it's used for string interpolation.

julia> demand = "You owe me \$50!"
"You owe me \$50!"

julia> println(demand)
You owe me $50!

julia> demandquote = "He said, \"You owe me \$50!\""
"He said, \"You owe me \$50!\""

Strings can also be enclosed in triple double quotes. This is useful because you can use ordinary double quotes inside the string without having to put backslashes before them:

julia> """this is "a" string"""
"this is \"a\" string"

You'll encounter a few specialized types of string too, which consist of one or more characters immediately followed by the opening double quote:

  • r" " indicates a regular expression
  • v" " indicates a version string
  • b" " indicates a byte array literal
  • raw" " indicates a raw string that doesn't do interpolation
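
As a small illustration of the prefixes listed above, the following sketch builds one literal of each kind. The variable names and sample values are made up for this example.

```julia
# Each prefix changes how the text between the quotes is interpreted.
path  = raw"C:\new\$HOME"   # backslashes and $ are kept literally
bytes = b"abc"              # the UInt8 codes for 'a', 'b', 'c'
ver   = v"1.6.2"            # a VersionNumber, compared component by component
re    = r"^\d+$"            # a Regex matching a string made only of digits

println(path)                               # C:\new\$HOME
println(bytes == UInt8[0x61, 0x62, 0x63])   # true
println(ver < v"1.10.0")                    # true
println(occursin(re, "2024"))               # true
```

Version strings are handy because they compare numerically: as plain strings, "1.6.2" would sort after "1.10.0".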

String interpolation

You often want to use the results of Julia expressions inside strings. For example, suppose you want to say:

"The value of x is n."

where n is the current value of x. Any Julia expression can be inserted into a string with the $() construction:

julia> x = 42
42

julia> "The value of x is $(x)."
"The value of x is 42."

You don't have to use the parentheses if you're just using the name of a variable:

julia> "The value of x is $x."
"The value of x is 42."

For anything more complicated than a variable name, such as an arithmetic expression, the parentheses are required:

julia> "The value of 2 + 2 is $(2 + 2)."
"The value of 2 + 2 is 4."

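Interpolated expressions aren't limited to arithmetic. Here is a brief sketch (the names and values are invented for illustration) that interpolates function calls, including one that itself contains a string literal:

```julia
names = ["Holmes", "Watson"]
radius = 2.5

# Any expression can go inside $( ... ), including function calls.
msg1 = "There are $(length(names)) names: $(join(names, " and "))."
msg2 = "Area: $(round(pi * radius^2, digits=2))"

println(msg1)   # There are 2 names: Holmes and Watson.
println(msg2)   # Area: 19.63
```
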
Substrings

To extract a smaller string from a string, use getindex(s, range) or s[range] syntax. For basic ASCII strings, you can use the same techniques that you use to extract elements from arrays:

julia> s = String("a load of characters")
"a load of characters"

julia> s[1:end]
"a load of characters"

julia> s[3:6]
"load"

julia> s[3:end-6]
"load of char"

You can easily iterate through a string:

for char in s
    print(char, "_")
end
a_ _l_o_a_d_ _o_f_ _c_h_a_r_a_c_t_e_r_s_

Watch out if you take a single element from the string, rather than a string of length 1 (i.e. with the same start and end positions):

julia> s[1:1]
"a" 

julia> s[1]
'a'

The second result isn't a string, but a character (inside single quotes).

Unicode strings

Not all strings are ASCII. To access individual characters in Unicode strings, you can't always use simple indexing, because some characters occupy more than one index position. Don't be fooled just because some of the index numbers appear to work:

julia> su = String("AéB𐅍CD")
"AéB𐅍CD"

julia> su[1]
'A'

julia> su[2]
'é'

julia> su[3]
ERROR: StringIndexError: invalid index [3], valid nearby indices [2]=>'é', [4]=>'B'

Note that length(str) counts characters, not index positions. To find the last valid index of a string, use lastindex(str):

julia> length(su)
6

julia> lastindex(su)
10

The isascii() function tests whether a string is pure ASCII or contains non-ASCII characters:

julia> isascii(su)
false

In this string, the 'second' character, é, occupies 2 bytes, and the 'fourth' character, 𐅍, occupies 4 bytes.

There are some useful functions for working with strings like this, including thisind(), nextind(), and prevind():

for i in eachindex(su)
    println(thisind(su, i), " -> ", su[i])
end
1 -> A
2 -> é
4 -> B
5 -> 𐅍
9 -> C
10 -> D

The 'third' character, B, starts at index 4 of the string, because é occupies indices 2 and 3.

The eachindex iterator visits only the valid indices, so indexing with its values is always safe:

for charindex in eachindex(su)
    @show su[charindex]
end
su[charindex] = 'A'
su[charindex] = 'é'
su[charindex] = 'B'
su[charindex] = '𐅍'
su[charindex] = 'C'
su[charindex] = 'D'
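
The nextind() and prevind() functions mentioned earlier can be sketched like this, reusing the same example string. This is only an illustration of stepping between valid indices; collecting into an array is one possible way to get character-position indexing, at the cost of a copy.

```julia
su = "AéB𐅍CD"

# Step between valid indices without guessing byte widths.
i = nextind(su, 2)      # index of the character after 'é'
j = prevind(su, 9)      # index of the character before 'C'
println(i, " ", j)      # 4 5

# ncodeunits counts bytes (code units); length counts characters.
println(ncodeunits(su), " ", length(su))   # 10 6

# To index by character position, collect the characters into an array first.
chars = collect(su)
println(chars[4])       # 𐅍
```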

Splitting and joining strings

You can stick strings together (a process often called concatenation) using the multiply (*) operator:

julia> "s" * "t"
"st"

If you've used other programming languages, you might expect to use the addition (+) operator:

julia> "s" + "t"
ERROR: MethodError: no method matching +(::String, ::String)

- so use *.

If you can 'multiply' strings, you can also raise them to a power:

julia> "s" ^ 18
"ssssssssssssssssss"

You can also use string():

julia> string("s", "t")
"st"

but if you want to do a lot of concatenation, inside a loop, perhaps, it might be better to use the string buffer approach (see below).

To split a string, use the split() function. Given this simple string:

julia> s = "You know my methods, Watson."
"You know my methods, Watson."

a simple call to the split() function divides the string at the spaces, returning a five-piece array:

julia> split(s)
5-element Array{SubString{String},1}:
"You"
"know"
"my"
"methods,"
"Watson."

Or you can specify the string of 1 or more characters to split at:

julia> split(s, "e")
2-element Array{SubString{String},1}:
"You know my m"
"thods, Watson."

julia> split(s, " m")
3-element Array{SubString{String},1}:
"You know"    
"y"       
"ethods, Watson."

The characters you use to do the splitting don't appear in the final result:

julia> split(s, "hod")
2-element Array{SubString{String},1}:
"You know my met"
"s, Watson."

If you want to split a string into separate single-character strings, use the empty string ("") which splits the string between the characters:

julia> split(s,"")
28-element Array{SubString{String},1}:
"Y"
"o"
"u"
" "
"k"
"n"
"o"
"w"
" "
"m"
"y"
" "
"m"
"e"
"t"
"h"
"o"
"d"
"s"
","
" "
"W"
"a"
"t"
"s"
"o"
"n"
"."

You can also split strings using a regular expression to define the splitting points. Use the special regex string construction r" ". Inside this, you can use regular expression characters with special meanings:

julia> split(s, r"a|e|i|o|u")
8-element Array{SubString{String},1}:
"Y"
""
" kn"
"w my m"
"th"
"ds, W"
"ts"
"n."

Here, r"a|e|i|o|u" is a regular expression string which, as you'll know if you love regular expressions, matches any of the vowels. So the resulting array consists of the string split at every vowel. Notice the empty string in the results: if you don't want those, add the keyword argument keepempty=false:

julia> split(s, r"a|e|i|o|u", keepempty=false)
7-element Array{SubString{String},1}:
"Y"   
" kn"  
"w my m"
"th"  
"ds, W" 
"ts"  
"n."  

If you wanted to keep the vowels, rather than use them for splitting work, you have to delve deeper into the world of regex literal strings. Read on.

You can join the elements of a split string in array form using join():

julia> join(split(s, r"a|e|i|o|u", keepempty=false), "aiou")
"Yaiou knaiouw my maiouthaiouds, Waioutsaioun."

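The following sketch shows a few more options of split() and join() on a made-up comma-separated string: the keepempty and limit keywords of split(), and the optional final delimiter of join().

```julia
csv = "red,green,,blue"

# keepempty=false drops the empty field between the two commas.
fields = split(csv, ","; keepempty=false)

# limit caps the number of pieces; the remainder stays in the last one.
head = split(csv, ","; limit=2)

# join accepts a separate delimiter for the final pair.
println(join(fields, ", ", " and "))   # red, green and blue
println(head[2])                       # green,,blue
```
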
Character objects

Above we extracted smaller strings from larger strings:

julia> s[1:1]
"a"

But when we extracted a single element from a string:

julia> s[1]
'a'

Notice the single quotes. In Julia, these are used to mark character objects, so 'a' is a character object, but "a" is a string with length 1. These are not equivalent.

You can convert character objects to strings easily enough:

julia> string('s') * string('d')
"sd"

or

julia> string('s', 'd')
"sd"

It's easy to input Unicode characters using the \U escape sequence, which accepts up to eight hexadecimal digits (32 bits). The lowercase escape sequence \u accepts up to four hexadecimal digits, so it can be used for 16-bit and 8-bit code points:

julia> ('\U1014d', '\u2640', '\u26')
('𐅍','♀','&')

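In strings, the \U and \u escapes consume as many hexadecimal digits as they can (up to eight and four respectively), so write out all the digits, padded with leading zeros, when the next character in the string could be read as a hex digit: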

julia> "\U0001014d2\U000026402\u26402\U000000a52\u00a52\U000000352\u00352\x352"
"𐅍2♀2♀2¥2¥2525252"

Converting between numbers and strings

Turning integers into strings is the job of the string() function. The keyword base lets you specify the number base for the conversion, which you can use to convert an integer to a binary, octal, or hexadecimal string:

julia> string(11, base=2)
"1011"
julia> string(11, base=8)
"13"

julia> string(11, base=16)
"b"

julia> string(11)
"11"
julia> a = BigInt(2)^200
1606938044258990275541962092341162602522202993782792835301376
julia> string(a)
"1606938044258990275541962092341162602522202993782792835301376"
julia> string(a, base=16)
"1000000000000000000000000000000000000000000000000"

To convert strings to numbers, use parse(). You can also specify the number base (such as binary or hex) in which the string should be interpreted:

julia> parse(Int, "100")
100

julia> parse(Int, "100", base=2)
4

julia> parse(Int, "100", base=16)
256

julia> parse(Float64, "100.32")
100.32

julia> parse(Complex{Float64}, "0 + 1im")
0.0 + 1.0im
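
When the input might not be a valid number, tryparse() can be used in place of parse(); and string() accepts a pad keyword alongside base. A minimal sketch with made-up inputs:

```julia
# tryparse returns nothing instead of throwing when the text isn't a number.
good = tryparse(Int, "42")
bad = tryparse(Int, "forty-two")
println(good, " ", bad === nothing)    # 42 true

# pad left-fills with zeros to a minimum number of digits.
println(string(255, base=16, pad=4))   # 00ff
println(string(5, base=2, pad=8))      # 00000101
```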

Converting characters to integers and back again

Int() converts a character into an integer, and Char() turns an integer into a character.

julia> Char(8253)
'‽': Unicode U+203d (category Po: Punctuation, other)

julia> Char(0x203d) # the Interrobang is Unicode U+203d in hexadecimal
'‽': Unicode U+203d (category Po: Punctuation, other)

julia> Int('‽')
8253

julia> string(Int('‽'), base=16)
"203d"

To go from a single character string to the code number (such as its ASCII or Unicode code point), try this:

julia> Int("S"[1])
83
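
Because characters map to integer code points, simple arithmetic on them works too. The sketch below is illustrative only and uses plain ASCII letters:

```julia
c = 'S'
n = Int(c)              # the code point as an integer
next = c + 1            # adding an integer moves along the code points
back = Char(n)          # and Char() goes back from the integer
distance = 'z' - 'a'    # subtracting two characters gives an integer

println(n, " ", next, " ", back, " ", distance)   # 83 T S 25
```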

printf formatting

If you're deeply attached to C-style printf() functionality, you'll be able to use a Julia macro (you call macros by prefacing them with the @ sign). The macro is provided in the Printf package, which you'll need to load first:

julia> using Printf
julia> @printf("pi = %0.20f", float(pi))
pi = 3.14159265358979311600

or you can create another string using the @sprintf macro, also to be found in the Printf package:

julia> @sprintf("pi = %0.20f", float(pi))
"pi = 3.14159265358979311600"

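The usual C-style width, precision, and padding specifiers are available. A small sketch with arbitrary values, formatting a float, a left-aligned string, and a zero-padded integer:

```julia
using Printf

# %5.2f: width 5, 2 decimals; %-6s: left-aligned in 6 columns; %03d: zero-padded to 3 digits
row = @sprintf("%5.2f|%-6s|%03d", 3.14159, "ab", 7)
println(row)    #  3.14|ab    |007
```
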
Convert a string to an array

To read from a string into an array, you can wrap it with the IOBuffer() function, which makes the string look like a stream. Many Julia functions that read from or write to streams accept an IOBuffer (including @printf). Here's a string of data (it could have been read from a file):

data="1 2 3 4
5 6 7 8
9 0 1 2"

"1 2 3 4\n5 6 7 8\n9 0 1 2"

Now you can "read" this string using functions such as readdlm(), the "read with delimiters" function. This can be found in the package DelimitedFiles.

julia> using DelimitedFiles
julia> readdlm(IOBuffer(data))
3x4 Array{Float64,2}:
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 0.0 1.0 2.0

You can add an optional type specification:

julia> readdlm(IOBuffer(data), Int)
3x4 Array{Int64,2}:
1 2 3 4
5 6 7 8
9 0 1 2
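
If the data uses a delimiter other than whitespace, readdlm() also accepts the delimiter character as an argument. A minimal sketch, assuming comma-separated integers (the sample data is made up):

```julia
using DelimitedFiles

csvdata = "1,2,3\n4,5,6"

# Pass the delimiter as a character, followed by the element type.
m = readdlm(IOBuffer(csvdata), ',', Int)
println(size(m))    # (2, 3)
println(m[2, 3])    # 6
```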

Sometimes you want to do things to strings that you can do better with arrays. Here's an example.

julia> s = "/Users/me/Music/iTunes/iTunes Media/Mobile Applications";

You can explode the pathname string into an array of character objects, using collect(), which gathers the items in a collection or string into an array:

julia> collect(s)
55-element Array{Char,1}:
'/'
'U'
's'
'e'
'r'
's'
'/'
...

Similarly, you can use split() with an empty string to split the string into one-character strings (note the double quotes) and count the results:

julia> split(s, "")
55-element Array{SubString{String},1}:
"/"
"U"
"s"
"e"
"r"
"s"
"/"
...

To count the occurrences of a particular character object, you can use an anonymous function:

julia> count(c -> c == '/', collect(s))
6

although here converting to an array is unnecessary and inefficient. Here's a better way:

julia> count(c -> c == '/', s)
6
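
For this particular pathname, splitting at the separator is often more useful than splitting into characters. A short sketch using the same string (keepempty=false discards the empty piece before the leading slash):

```julia
s = "/Users/me/Music/iTunes/iTunes Media/Mobile Applications"

parts = split(s, "/"; keepempty=false)
println(length(parts))             # 6
println(last(parts))               # Mobile Applications

# isequal('/') builds the same predicate as c -> c == '/'
println(count(isequal('/'), s))    # 6
```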

Finding and replacing things inside strings

If you want to know whether a string contains a specific character, use the general-purpose in() function.

julia> s = "Elementary, my dear Watson";
julia> in('m', s)
true

But the occursin() function, which accepts two strings, is more generally useful, because you can use substrings with one or more characters. Notice that you place the search term first, then the string you're looking in, as in occursin(needle, haystack):

julia> occursin("Wat", s)
true
julia> occursin("m", s)
true
julia> occursin("mi", s)
false
julia> occursin("me", s)
true

You can get the location of the first occurrence of a substring using findfirst(needle, haystack). The first argument can be a single character, a string, or a regular expression:

julia> s ="You know my methods, Watson.";

julia> findfirst("meth", s)
13:16
julia> findfirst(r"[aeiou]", s)  # first vowel
2:2
julia> findfirst(isequal('a'), s) # first occurrence of character 'a'
23

In each case, the result contains the index or range of indices of the match if present, or nothing if there is no match.
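
The related functions findnext(), findlast(), and findall() follow the same pattern. A brief sketch on the same sentence, which also shows how a failed search can be tested:

```julia
s = "You know my methods, Watson."

# A failed search returns nothing, so test for it before using the result.
println(findfirst("xyz", s) === nothing)    # true

println(findnext("o", s, 3))                # 7:7  (search starts at index 3)
println(findlast("o", s))                   # 26:26
println(findall(isequal('o'), s))           # [2, 7, 17, 26]
```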

Replacing

The replace() function returns a new string with a substring of characters replaced with something else:

julia> replace("Sherlock Holmes", "e" => "ee")
"Sheerlock Holmees"

You use the => operator to specify the pattern you're looking for, and its replacement. Usually the replacement is another string, as here. But you can also supply a function that processes the matched text:

julia> replace("Sherlock Holmes", "e" => uppercase)
"ShErlock HolmEs"

where the function (here, the built-in uppercase() function) is applied to the matching substring.

There's no replace! function for strings, where the "!" indicates a function that changes its argument. That's because you can't change a string: they're immutable.
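
Two further variations, sketched here on the same name: the count keyword limits how many replacements are made, and a pair of characters can be used instead of a pair of strings. The original string is left untouched in both cases.

```julia
s = "Sherlock Holmes"

once = replace(s, "e" => "E", count=1)   # only the first match is replaced
zeros = replace(s, 'o' => '0')           # a Char => Char pair also works

println(once)    # ShErlock Holmes
println(zeros)   # Sherl0ck H0lmes
println(s)       # Sherlock Holmes
```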

Regular expressions

You can use regular expressions to find matches for substrings. Some functions that accept a regular expression are:

  • replace() changes occurrences of regular expressions
  • match() returns the first match or nothing
  • eachmatch() returns an iterator that lets you search through all matches
  • split() splits a string at every match

Use replace() to replace every character that isn't a lowercase vowel (consonants, capitals, spaces, and punctuation) with an underscore:

julia> replace("Elementary, my dear Watson!", r"[^aeiou]" => "_")
"__e_e__a________ea___a__o__"

and the following code replaces each vowel with the results of running a function on each match:

julia> replace("Elementary, my dear Watson!", r"[aeiou]" => uppercase)
"ElEmEntAry, my dEAr WAtsOn!"

With replace() you can access the matches if you provide a special substitution string s"", where \1 refers to the first captured group, \2 to the second, and so on. With this regex operation, each lowercase letter preceded by a space is repeated three times:

julia> replace("Elementary, my dear Watson!", r"(\s)([a-z])" => s"\1\2\2\2")
"Elementary, mmmy dddear Watson!"

For more regular expression fun, there are the match() and eachmatch() functions.

Here I've loaded the complete text of "The Adventures of Sherlock Holmes" from a file into the string called text:

julia> f = "/tmp/adventures-of-sherlock-holmes.txt"
julia> text = read(f, String);

To use the possibility of a match as a Boolean condition, suitable for use in an if statement for example, use occursin().

julia> occursin(r"Opium", text)
false

That's odd. We were expecting to find evidence of the great detective's peculiar pharmacological recreations. In fact, the word "opium" does appear in the text, but only in lower-case, hence this false result: regular expressions are case-sensitive by default.

julia> occursin(r"(?i)Opium", text)
true

This is a case-insensitive search, set by the flag (?i), and it returns true.

You could check every line for the word using a simple loop:

for l in split(text, "\n")
    occursin(r"opium", l) && println(l)
end
opium. The habit grew upon him, as I understand, from some
he had, when the fit was on him, made use of an opium den in the
brown opium smoke, and terraced with wooden berths, like the
wrinkled, bent with age, an opium pipe dangling down from between
very short time a decrepit figure had emerged from the opium den,
opium-smoking to cocaine injections, and all the other little
steps - for the house was none other than the opium den in which
lives upon the second floor of the opium den, and who was
learn to have been the lodger at the opium den, and to have been
doing in the opium den, what happened to him when there, where is
"Had he ever showed any signs of having taken opium?"
room above the opium den when I looked out of my window and saw,

For more useable output (in the REPL), add enumerate() and some highlighting:

red = Base.text_colors[:red]; default = Base.text_colors[:default];
for (n, l) in enumerate(split(text, "\n"))
    occursin(r"opium", l) && println("$n $(replace(l, "opium" => "$(red)opium$(default)"))")
end
5087 opium. The habit grew upon him, as I understand, from some
5140 he had, when the fit was on him, made use of an opium den in the
5173 brown opium smoke, and terraced with wooden berths, like the
5237 wrinkled, bent with age, an opium pipe dangling down from between
5273 very short time a decrepit figure had emerged from the opium den,
5280 opium-smoking to cocaine injections, and all the other little
5429 steps - for the house was none other than the opium den in which
5486 lives upon the second floor of the opium den, and who was
5510 learn to have been the lodger at the opium den, and to have been
5593 doing in the opium den, what happened to him when there, where is
5846 "Had he ever showed any signs of having taken opium?"
6129 room above the opium den when I looked out of my window and saw,

There's an alternative syntax for adding regex modifiers, such as case-insensitive matches. Notice the "i" immediately following the regex string in the second example:

julia> occursin(r"Opium", text)
false

julia> occursin(r"Opium"i, text)
true

With the eachmatch() function, you apply the regex to the string to produce an iterator. For example, to look for substrings in our text matching the letter "L", followed by some other characters, ending with "ed":

julia> lmatch = eachmatch(r"L.*?ed", text)

The result in lmatch is an iterable object containing all the matches, as RegexMatch objects:

julia> collect(lmatch)[1:10]
10-element Array{RegexMatch,1}:
RegexMatch("London, and proceed")         
RegexMatch("London is a pleasant thing indeed")  
RegexMatch("Looking for lodgings,\" I answered") 
RegexMatch("London he had received")       
RegexMatch("Lied")                
RegexMatch("Life,\" and it attempted")      
RegexMatch("Lauriston Gardens wore an ill-omened")
RegexMatch("Let\" card had developed")      
RegexMatch("Lestrade, is here. I had relied")   
RegexMatch("Lestrade grabbed")         

We can step through the iterator and look at each match in turn. You can access a number of fields of a RegexMatch, to extract information about the match. These include captures, match, offset, offsets, and regex. For example, the match field contains the matched substring:

for i in lmatch
    println(i.match)
end
London - quite so! Your Majesty, as I understand, became entangled
Lodge. As it pulled
Lord, Mr. Wilson, that I was a red
League of the Red
League was founded
London when he was young, and he wanted
LSON" in white letters, upon a corner house, announced
League, and the copying of the 'Encyclopaed
Leadenhall Street Post Office, to be left till called
Let the whole incident be a sealed
Lestrade, being rather puzzled
Lestrade would have noted
...
Lestrade," drawled
Lestrade looked
Lord St. Simon has not already arrived
Lord St. Simon sank into a chair and passed
Lord St. Simon had by no means relaxed
Lordship. "I may be forced
London. What could have happened
London, and I had placed

Other fields include captures, the captured substrings as an array of strings, offset, the offset into the string at which the whole match begins, and offsets, the offsets of the captured substrings.
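
To make these fields concrete, here is a small self-contained sketch on a made-up sentence with two capture groups. Since match() returns nothing when there is no match, the result is checked before use.

```julia
m = match(r"(\d+):(\d+)", "Train leaves at 11:45")

if m !== nothing
    println(m.match)      # 11:45   (the whole matched substring)
    println(m.offset)     # 17      (where the whole match begins)
    println(m.offsets)    # [17, 20] (where each capture begins)
    println(m[2])         # 45      (indexing gives a capture directly)
end
```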

To get an array of matching strings, use something like this:

julia> collect(m.match for m in eachmatch(r"L.*?ed", text))
58-element Array{SubString{String},1}:
"London - quite so! Your Majesty, as I understand, became entangled"
"Lodge. As it pulled"                        
"Lord, Mr. Wilson, that I was a red"                
"League of the Red"                         
"League was founded"                        
"London when he was young, and he wanted"              
"Leadenhall Street Post Office, to be left till called"       
"Let the whole incident be a sealed"                
"Lestrade, being rather puzzled"                  
"Lestrade would have noted"                     
"Lestrade looked"                          
"Lestrade laughed"                         
"Lestrade shrugged"                         
"Lestrade called"                          
... 
"Lord St. Simon shrugged"                      
"Lady St. Simon was decoyed"                    
"Lestrade,\" drawled"                        
"Lestrade looked"                          
"Lord St. Simon has not already arrived"              
"Lord St. Simon sank into a chair and passed"            
"Lord St. Simon had by no means relaxed"              
"Lordship. \"I may be forced"                    
"London. What could have happened"                 
"London, and I had placed" 

The basic match() function looks for the first match for your regex. Use the match field to extract the information from the RegexMatch object:

julia> match(r"She.*",text).match
"Sherlock Holmes she is always THE woman. I have seldom heard\r"

A more streamlined way of obtaining matching lines from a file is this:

julia> f = "adventures of sherlock holmes.txt"

julia> filter(s -> occursin(r"(?i)Opium", s), map(chomp, readlines(f)))
12-element Array{SubString{String},1}:
"opium. The habit grew upon him, as I understand, from some"    
"he had, when the fit was on him, made use of an opium den in the" 
"brown opium smoke, and terraced with wooden berths, like the"   
"wrinkled, bent with age, an opium pipe dangling down from between"
"very short time a decrepit figure had emerged from the opium den,"
"opium-smoking to cocaine injections, and all the other little"  
"steps - for the house was none other than the opium den in which" 
"lives upon the second floor of the opium den, and who was"    
"learn to have been the lodger at the opium den, and to have been" 
"doing in the opium den, what happened to him when there, where is"
"\"Had he ever showed any signs of having taken opium?\""     
"room above the opium den when I looked out of my window and saw,"

Making a Regex

Sometimes you want to make a regular expression from within your code. You can do this by making a Regex object. Here is one way you could count how often each vowel occurs in the text:

# read the whole file into a string
text = read("sherlock-holmes.txt", String)

for vowel in "aeiou"
    r = Regex(string(vowel))                    # build the regex at run time
    l = [m.match for m in eachmatch(r, text)]   # collect every match
    println("there are $(length(l)) letter \"$vowel\"s in the text.")
end
there are 219626 letter "a"s in the text.
there are 337212 letter "e"s in the text.
there are 167552 letter "i"s in the text.
there are 212834 letter "o"s in the text.
there are 82924 letter "u"s in the text.
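
Regex() also accepts a second argument holding flags, and the pattern can be assembled with ordinary string interpolation. The sketch below uses a made-up sample string and assumes the search word contains no regex metacharacters:

```julia
sample = "Opium den; opium-smoking; opiumless"   # made-up sample text
word = "opium"

# \b marks a word boundary; "i" makes the match case-insensitive.
r = Regex("\\b$(word)\\b", "i")
n = length(collect(eachmatch(r, sample)))
println(n)   # 2  ("opiumless" isn't counted: no boundary after "opium")
```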

Testing and changing strings

There are lots of functions for testing and changing strings:

  • length(str) number of characters in str
  • sizeof(str) size of str in bytes
  • startswith(strA, strB) does strA start with strB?
  • endswith(strA, strB) does strA end with strB?
  • occursin(strA, strB) does strA occur in strB?
  • all(isletter, str) is str entirely letters?
  • all(isnumeric, str) is str entirely number characters?
  • isascii(str) is str ASCII?
  • all(iscntrl, str) is str entirely control characters?
  • all(isdigit, str) is str 0-9?
  • all(ispunct, str) does str consist of punctuation?
  • all(isspace, str) is str whitespace characters?
  • all(isuppercase, str) is str uppercase?
  • all(islowercase, str) is str entirely lowercase?
  • all(isxdigit, str) is str entirely hexadecimal digits?
  • uppercase(str) return a copy of str converted to uppercase
  • lowercase(str) return a copy of str converted to lowercase
  • titlecase(str) return copy of str with the first character of each word converted to uppercase
  • uppercasefirst(str) return copy of str with first character converted to uppercase
  • lowercasefirst(str) return copy of str with first character converted to lowercase
  • chop(str) return a copy with the last character removed
  • chomp(str) return a copy with a single trailing newline removed, if there is one
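
A few of the functions in this list, tried on short made-up strings:

```julia
println(startswith("Watson", "Wat"))          # true
println(all(isdigit, "221"))                  # true
println(all(isletter, "221B"))                # false
println(titlecase("a scandal in bohemia"))    # A Scandal In Bohemia
println(uppercasefirst("watson"))             # Watson
println(chop("Holmes"))                       # Holme
println(chomp("line\n") == "line")            # true

# length counts characters, sizeof counts bytes.
println(length("é"), " ", sizeof("é"))        # 1 2
```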

Streams

To write to a string, you can use a Julia stream. The sprint() (String Print) function lets you use a function as the first argument, and uses the function and the rest of the arguments to send information to a stream, returning the result as a string.

For example, consider the following function, f. The body of the function maps an anonymous 'print' function over the arguments, enclosing them with angle brackets. When used by sprint, the function f processes the remaining arguments and sends them to the stream.

function f(io::IO, args...)
    map((a) -> print(io,"<",a, ">"), args)
end
f (generic function with 1 method)
julia> sprint(f, "fred", "jim", "bill", "fred blogs")
"<fred><jim><bill><fred blogs>"

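sprint() is also convenient with a do-block when you want to build up a string piece by piece, for instance inside a loop. A minimal sketch (the comma-separated output is just an example):

```julia
# Everything printed to io inside the block is returned as one string.
csv = sprint() do io
    for i in 1:5
        print(io, i)
        i < 5 && print(io, ",")   # separator between items, none after the last
    end
end

println(csv)    # 1,2,3,4,5
```
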
Functions like println() can take an IOBuffer or stream as their first argument. This lets you print to streams instead of printing to the standard output device:

julia> iobuffer = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)
julia> for i in 1:100
           println(iobuffer, string(i))
       end

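After this, the in-memory stream called iobuffer is full of numbers and newlines, even though nothing was printed on the terminal. To copy the contents of iobuffer from the stream to a string or array, you can use take!(), which returns the contents as an array of bytes and empties the buffer. Wrap the result in String() to get a string: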

julia> String(take!(iobuffer))
"1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14 ... \n98\n99\n100\n"