Introducing Julia/Strings and characters

Strings and characters

Strings

A string is a sequence of zero or more characters, usually written enclosed in double quotes:

"this is a string"

There are two important things you need to know about strings.

The first is that they're immutable. You can't change them once they're created, but it's easy to make new strings from parts of existing ones.

The second is that you have to be careful when using two specific characters: double quotes ("), and dollar signs ($). If you want to include a double quote character in the string, it has to be preceded with a backslash, otherwise the rest of the string would be interpreted as Julia code, with potentially interesting results. And if you want to include a dollar sign ($) in a string, that should also be prefaced by a backslash, because it's used for string interpolation.

julia> demand = "You owe me \$50!"
"You owe me \$50!"

julia> println(demand)
You owe me $50!

julia> demandquote = "He said, \"You owe me \$50!\""
"He said, \"You owe me \$50!\""

Strings can also be enclosed in triple double quotes. This is useful because you can use ordinary double quotes inside the string without having to put backslashes before them:

julia> """this is "a" string"""
"this is \"a\" string"

You'll encounter a few specialized types of string too, which consist of one or more characters immediately followed by the opening double quote:

  • r" " indicates a regular expression
  • v" " indicates a version string
  • b" " indicates a byte literal
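
As a small sketch of what these prefixed literals produce (assuming Julia 1.x; the variable names are only illustrative), each prefix yields a different kind of object rather than an ordinary String:

```julia
# Each prefixed literal builds a different kind of object (Julia 1.x assumed).
re        = r"[aeiou]"   # a Regex
ver       = v"1.10.2"    # a VersionNumber, compared component by component
raw_bytes = b"abc"       # a read-only view of the bytes of "abc"

println(typeof(re))           # Regex
println(typeof(ver))          # VersionNumber
println(ver > v"1.9.0")       # true: 10 > 9 numerically, unlike plain string comparison
println(collect(raw_bytes))   # UInt8[0x61, 0x62, 0x63]
```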

String interpolation

You often want to use the results of Julia expressions inside strings. For example, suppose you want to say:

"The value of x is n."

where n is the current value of x.

Any Julia expression can be inserted into a string with the $() construction:

julia> x = 42
42
julia> "The value of x is $(x)."

displays:

"The value of x is 42."

You don't have to use the parentheses if you're just using the name of a variable:

julia> "The value of x is $x."
"The value of x is 42."

To include the result of a Julia expression in a string, enclose the expression in parentheses first, then precede it with a dollar sign:

julia> "The value of 2 + 2 is $(2 + 2)."
"The value of 2 + 2 is 4."

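The three forms described above (a bare variable, a parenthesized expression, and an escaped dollar sign) can be combined in one string. The following is a minimal sketch; the variable names and the sentence are made up for illustration:

```julia
name = "Watson"
cases = 3

# $name interpolates a variable, $(...) an arbitrary expression,
# and \$ produces a literal dollar sign.
msg = "Dr. $name has $(cases + 1) cases and is owed \$$(cases * 50)."
println(msg)   # Dr. Watson has 4 cases and is owed $150.

# Interpolation converts the value with string(), so arrays work too.
squares = "Squares: $([i^2 for i in 1:3])"
println(squares)   # Squares: [1, 4, 9]
```
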
Substrings

To extract a smaller string from a string, use getindex(s, range) or s[range] syntax. For basic ASCII strings, you can use the same techniques that you use to extract elements from arrays:

julia> s = String("a load of characters")
"a load of characters"

julia> s[1:end]
"a load of characters"

julia> s[3:6]
"load"

julia> s[3:end-6]
"load of char"

You can easily iterate through a string:

julia> for char in s
           print(char, "_")
       end
a_ _l_o_a_d_ _o_f_ _c_h_a_r_a_c_t_e_r_s_

Watch out if you take a single element from the string, rather than a string of length 1 (i.e. with the same start and end positions):

julia> s[1:1]
"a" 

julia> s[1]
'a'

The second result isn't a string, but a character (single quotes).

Unicode strings

Not all strings are ASCII. To access individual characters in Unicode strings, you can't always use simple indexing, because some characters occupy more than one index position. Don't be fooled just because some of the index numbers appear to work:

julia> su = String("AéB𐅍CD")
"AéB𐅍CD"

julia> su[1]
'A'

julia> su[2]
'é'

julia> su[3]
ERROR: UnicodeError: invalid character index
 in slow_utf8_next(::Array{UInt8,1}, ::UInt8, ::Int64) at ./strings/string.jl:67
 in next at ./strings/string.jl:92 [inlined]
 in getindex(::String, ::Int64) at ./strings/basic.jl:70

Note that length(str) counts characters, not index positions. To find the last valid index of a string, use endof(str):

julia> length(su)
6

julia> endof(su)
10

The isascii() function tests whether a string is pure ASCII or contains non-ASCII characters:

julia> isascii(su)
false

In this string, the 'second' character, é, has 2 bytes, the 'fourth' character, 𐅍, has 4 bytes. If you can't use an iteration loop to step through the entire string, use nextind() to find the next valid element. Here's a short example of how to examine a string character by character:

julia> c = 1; 
julia> while c <= endof(su)
           println(c, " -> ", su[c])
           c = nextind(su, c)
       end
1 -> A
2 -> é
4 -> B
5 -> 𐅍
9 -> C
10 -> D

The 'third' character, B, starts at index 4 in the string.

As an alternative, use the eachindex iterator:

julia> for charindex in eachindex(su)
           @show su[charindex]
       end
su[charindex] = 'A'
su[charindex] = 'é'
su[charindex] = 'B'
su[charindex] = '𐅍'
su[charindex] = 'C'
su[charindex] = 'D'
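
The transcripts above come from an older Julia release. As a hedged sketch of the same ideas in Julia 1.x, where endof() was renamed lastindex(), the difference between character count, last index, and byte size can be checked directly:

```julia
su = "AéB𐅍CD"

println(length(su))               # 6 characters
println(lastindex(su))            # 10, the index of the last character
println(sizeof(su))               # 10 bytes in UTF-8
println(collect(eachindex(su)))   # [1, 2, 4, 5, 9, 10]
println(isvalid(su, 3))           # false: index 3 falls inside 'é'

# collect() gives an array of Char that can be indexed by character position.
chars = collect(su)
println(chars[3])                 # B
```

Collecting into an array costs memory, so for long strings iterating with a for loop or eachindex() is usually the better choice.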

Splitting and joining strings

You can stick strings together (a process often called concatenation) using the multiply (*) operator:

julia> "s" * "t"
"st"

If you've used other programming languages, you might expect to use the addition (+) operator, but that produces an error in Julia:

julia> "s" + "t"
LoadError: MethodError: `+` has no method matching +(::String, ::String)

If you can 'multiply' strings, you can also raise them to a power:

julia> "s" ^ 18
"ssssssssssssssssss"

You can also use string():

julia> string("s", "t")
"st"

but if you want to do a lot of concatenation, inside a loop, perhaps, it might be better to use the string buffer approach (see below).
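
To put the concatenation options side by side, here is a short sketch (the values are arbitrary examples). It also shows that string() accepts arguments that aren't strings, and that repeat() is an alternative spelling of the ^ operator:

```julia
a = "s" * "t"                 # concatenation
b = "ab" ^ 3                  # repetition
c = string("n = ", 42, '!')   # string() converts non-string arguments too
d = repeat("-", 10)           # same result as "-" ^ 10

println(a)   # st
println(b)   # ababab
println(c)   # n = 42!
println(d)   # ----------
```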

To split a string, use the split() function. Given this simple string:

julia> s = "You know my methods, Watson."
"You know my methods, Watson."

a simple call to the split() function divides the string at the spaces, returning a five-piece array:

julia> split(s)
5-element Array{SubString{String},1}:
 "You"
 "know"
 "my"
 "methods,"
 "Watson."

Or you can specify the string of 1 or more characters to split at:

julia> split(s, "e")
2-element Array{SubString{String},1}:
 "You know my m"
 "thods, Watson."

julia> split(s, " m")
3-element Array{SubString{String},1}:
 "You know"       
 "y"              
 "ethods, Watson."

The characters you use to do the splitting don't appear in the final result:

julia> split(s, "hod")
2-element Array{SubString{String},1}:
 "You know my met"
 "s, Watson."

If you want to split a string into separate single-character strings, use the empty string ("") which splits the string between the characters:

julia> split(s,"")
28-element Array{SubString{String},1}:
 "Y"
 "o"
 "u"
 " "
 "k"
 "n"
 "o"
 "w"
 " "
 "m"
 "y"
 " "
 "m"
 "e"
 "t"
 "h"
 "o"
 "d"
 "s"
 ","
 " "
 "W"
 "a"
 "t"
 "s"
 "o"
 "n"
 "."

You can also split strings using a regular expression to define the splitting points. Use the special regex string construction r" ":

julia> split(s, r"a|e|i|o|u")
8-element Array{SubString{String},1}:
 "Y"
 ""
 " kn"
 "w my m"
 "th"
 "ds, W"
 "ts"
 "n."

Here, r"a|e|i|o|u" is a regular expression string that matches any one of the lowercase vowels, so the resulting array consists of the string split at every vowel. Notice the empty string in the result, produced where two vowels are adjacent ("ou"). If you don't want empty strings, add a false flag at the end:

julia> split(s, r"a|e|i|o|u", false)
7-element Array{SubString{String},1}:
 "Y"     
 " kn"   
 "w my m"
 "th"    
 "ds, W" 
 "ts"    
 "n."

If you wanted to keep the vowels, rather than use them for splitting work, you have to delve deeper into the world of regex literal strings. Read on.

You can join the elements of a split string in array form using join():

julia> join(split(s, r"a|e|i|o|u", false), "aiou")
"Yaiou knaiouw my maiouthaiouds, Waioutsaioun."

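In Julia 1.x the positional false flag is no longer accepted; dropping empty pieces is done with the keepempty keyword instead. The following sketch assumes Julia 1.x, and the list of names passed to join() is invented for illustration:

```julia
s = "You know my methods, Watson."

# keepempty=false drops the empty string produced between "o" and "u".
parts = split(s, r"a|e|i|o|u"; keepempty=false)
println(length(parts))   # 7

# limit caps the number of pieces; the last piece holds the remainder.
head_tail = split(s, " "; limit=2)
println(head_tail[2])    # know my methods, Watson.

# join() accepts an optional final delimiter.
names = join(["Holmes", "Watson", "Lestrade"], ", ", " and ")
println(names)           # Holmes, Watson and Lestrade
```
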
Character objects

Above we extracted smaller strings from larger strings:

julia> s[1:1]
"a"

But when we extracted a single element from a string:

julia> s[1]
'a'

Notice the single quotes. In Julia, these mark character objects, so 'a' is a character object, but "a" is a string of length 1. These are not equivalent.

You can convert character objects to strings easily enough:

julia> string('s') * string('d')
"sd"

or

julia> string('s', 'd')
"sd"

It's easy to input 32-bit Unicode characters using the \U escape sequence. The escape sequences \u and \x can be used for 16-bit and 8-bit characters:

julia> ('\U1014d','\u2640','\xa5')
('𐅍','♀','¥')

For strings, the \Uxxxxxxxx and \uxxxx syntaxes are stricter, and \x cannot be used for non-ASCII characters:

julia> "\U0001014d2\U000026402\u26402\U000000a52\u00a52\U000000352\u00352\x352"
"𐅍2♀2♀2¥2¥2525252"

Converting to and from strings

The bin(), oct(), dec(), and hex() functions turn an integer into a binary, octal, decimal, or hexadecimal string:

julia> bin(11),oct(11),dec(11),hex(11)
("1011","13","11","b")
julia> a = BigInt(2)^200
1606938044258990275541962092341162602522202993782792835301376
julia> dec(a)
"1606938044258990275541962092341162602522202993782792835301376"
julia> hex(a)
"1000000000000000000000000000000000000000000000000"

The function string() can be used instead of dec():

julia> string(123)
"123"

Use parse to convert numbers in string form to actual numbers.

julia> parse(Int, "100")
100
julia> parse(Int, "100", 2)
4
julia> parse(Int, "100", 16)
256

The Int() function turns a character into an integer, and the Char() function turns an integer into a character.

julia> Char(0x203d) # the interrobang is Unicode U+203D in hexadecimal
'‽'
julia> Int('‽')
8253
julia> hex('‽')
"203d"

To go from a single-character string to its code number (such as its ASCII or Unicode code point), try this:

julia> Int("S"[1])
83
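
The bin(), oct(), dec(), and hex() functions shown above were removed in Julia 1.0. As a sketch of the equivalent conversions, assuming Julia 1.x, the base becomes a keyword argument of string() and parse():

```julia
# Integer -> string in a given base (replaces bin/oct/dec/hex in Julia 1.x).
println(string(11, base=2))             # 1011
println(string(11, base=8))             # 13
println(string(11, base=16))            # b
println(string(255, base=16, pad=4))    # 00ff

# String -> integer; the base is a keyword argument.
println(parse(Int, "100", base=2))      # 4
println(parse(Int, "100", base=16))     # 256

# tryparse returns nothing instead of throwing on bad input.
println(tryparse(Int, "12x"))           # nothing

# Char <-> integer.
println(Int('S'))                       # 83
println(Char(0x203d))                   # ‽
println(string(UInt32('‽'), base=16))   # 203d
```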

If you're deeply attached to C-style printf() functionality, you can use a Julia macro (macros are called by prefacing their names with the @ sign):

julia> @printf("pi = %0.20f", float(pi))
pi = 3.14159265358979311600

or you can create another string using the @sprintf macro:

julia> @sprintf("pi = %0.20f", float(pi))
"pi = 3.14159265358979311600"

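In Julia 1.x these macros live in the Printf standard library, so they need a using line first. A brief sketch with a few common format specifiers follows; the format strings are ordinary C-style ones chosen for illustration:

```julia
using Printf   # @printf and @sprintf live in the Printf standard library in Julia 1.x

@printf("pi = %0.20f\n", float(pi))            # same call as above, plus a newline

s1 = @sprintf("%.3f", float(pi))               # fixed number of decimals
s2 = @sprintf("%5d|%-5d|", 42, 42)             # right- and left-aligned in width 5
s3 = @sprintf("%s has %d items", "cart", 3)    # strings and integers together

println(s1)   # 3.142
println(s2)   #    42|42   |
println(s3)   # cart has 3 items
```
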
Convert a string to an array

To read from a string into an array, you can use the IOBuffer() function, which wraps the string in an in-memory stream. An IOBuffer can be passed to a number of Julia functions (including @printf). Here's a string of data (it could have been read from a file):

julia> data="1 2 3 4
       5 6 7 8
       9 0 1 2"

"1 2 3 4\n5 6 7 8\n9 0 1 2"

Now you can "read" this string using functions such as readdlm(), the "read with delimiters" function:

julia> readdlm(IOBuffer(data))

3x4 Array{Float64,2}:
 1.0  2.0  3.0  4.0
 5.0  6.0  7.0  8.0
 9.0  0.0  1.0  2.0

You can add an optional type specification:

julia> readdlm(IOBuffer(data), Int)

3x4 Array{Int32,2}:
 1  2  3  4
 5  6  7  8
 9  0  1  2
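
In Julia 1.x, readdlm() is provided by the DelimitedFiles standard library rather than being available by default. The sketch below assumes that library is installed (it ships with standard Julia distributions); the comma-separated string is an invented second example:

```julia
using DelimitedFiles   # home of readdlm in Julia 1.x

data = "1 2 3 4\n5 6 7 8\n9 0 1 2"

# Wrap the string in an IOBuffer so readdlm can treat it like a file.
m = readdlm(IOBuffer(data), Int)
println(size(m))        # (3, 4)
println(sum(m[2, :]))   # 26

# A comma-separated variant: pass the delimiter as a Char.
csv = readdlm(IOBuffer("1,2\n3,4"), ',', Int)
println(csv[2, 1])      # 3
```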

Sometimes you want to do things to strings that you can do better with arrays. Here's an example.

julia> s = "/Users/me/Music/iTunes/iTunes Media/Mobile Applications";

You can explode the pathname string into an array of character objects, using collect(), which gathers the items in a collection or string into an array:

julia> collect(s)
55-element Array{Char,1}:
 '/'
 'U'
 's'
 'e'
 'r'
 's'
 '/'
 ...

Similarly, you can use split() to split the string into single-character strings and count the results:

julia> split(s, "")
55-element Array{SubString{String},1}:
 "/"
 "U"
 "s"
 "e"
 "r"
 "s"
 "/"
 ...

To count the occurrences of a particular character object, you can use an anonymous function:

julia> count(c -> c == '/', collect(s))
6

although here converting to an array is unnecessary and inefficient. Here's a better way:

julia> count(c -> c == '/', s)
6

Finding and replacing things inside strings

If you want to know whether a string contains a specific character, use the general-purpose in() function.

julia> s = "Elementary, my dear Watson";
julia> in('m', s)
true

The contains() function, which accepts two strings, is more generally useful, because you can use substrings with one or more characters. Notice that you place the container first, then the string you're looking for:

julia> contains(s, "Wat")
true
julia> contains(s, "m")
true
julia> contains(s, "mi")
false
julia> contains(s, "me")
true

You can get the location of the first occurrence of a substring using search(). The second argument can be a single character, a vector or a set of characters, a string, or a regular expression:

julia> s ="You know my methods, Watson.";
julia> search(s, "h")
16:16

julia> search(s, ['a', 'e', 'i', 'o', 'u'])
2

This search is for the first occurrence of any of the set of characters, and 'o' was in the second position.

julia> search(s, "meth")
13:16

julia> search(s, r"m.*")
10:28

In each case, the result contains the indices of the matching characters, if present. If there's no match, the result is an empty range:

julia> search(s, "mo")
0:-1

julia> s[0:-1]
""

For some tasks, you might prefer to use searchindex(), which returns either the start index or 0:

julia> searchindex(s, "m")
10

julia> searchindex(s, "mu")
0

You can also use a regular expression string (r" ") with search(). This one looks for "my" or "me":

julia> search(s, r"m(y|e)")
10:11

julia> s[search(s, r"m(y|e)")]
"my"

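The contains(), search(), and searchindex() functions belong to older Julia releases. As a sketch of the Julia 1.x equivalents, occursin() takes the string to look for first, and findfirst() returns nothing (rather than 0 or an empty range) when there is no match:

```julia
s = "You know my methods, Watson."

println('m' in s)                          # true
println(occursin("Wat", s))                # true; the string to look for comes first
println(findfirst("meth", s))              # 13:16
println(findfirst(isequal('h'), s))        # 16
println(findfirst(c -> c in "aeiou", s))   # 2
println(findfirst(r"m(y|e)", s))           # 10:11
println(findfirst("mo", s))                # nothing
println(findnext("m", s, 11))              # 13:13, searching from index 11 onward
```
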
The replace() function returns a new string with a substring of characters replaced with something else:

julia> replace("Sherlock Holmes", "e", "ee")
"Sheerlock Holmees"

Usually the third argument is another string, as here. But you can also supply a function that processes the result:

julia> replace("Sherlock Holmes", "e", uppercase)
"ShErlock HolmEs"

where the function (here, the built-in uppercase() function) is applied to the matching substring.

There's no replace! function, where the "!" indicates a function that changes its argument. That's because you can't change a string — they're immutable.

Regular expressions

You can use regular expressions to find matches for substrings. Some functions that accept a regular expression are:

  • replace() changes occurrences of regular expressions
  • ismatch() returns true if there's a match for a regular expression, false otherwise
  • match() returns the first match or nothing
  • matchall() returns an array of matches
  • eachmatch() returns an iterator that lets you go through all the matches
  • search() searches a string for a match
  • split() splits a string at every match

Use replace() to replace every character that isn't a lowercase vowel (consonants, but also spaces, punctuation, and the capital E) with an underscore:

julia> replace("Elementary, my dear Watson!", r"[^aeiou]", "_")
"__e_e__a________ea___a__o__"

and the following code replaces each vowel with the results of running a function on each match:

julia> replace("Elementary, my dear Watson!", r"[aeiou]", uppercase)
"ElEmEntAry, my dEAr WAtsOn!"

With replace() you can access matches if you provide a special substitution string s"", where \1 refers to the first captured group, \2 to the second, and so on. Here, each lowercase letter preceded by a whitespace character is repeated three times:

julia> replace("Elementary, my dear Watson!", r"(\s)([a-z])", s"\1\2\2\2") 
"Elementary, mmmy dddear Watson!"

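In Julia 1.x, replace() takes the pattern and its replacement as a single pair written with =>, instead of two separate arguments. The calls above can be sketched in that form as follows (the count keyword in the last line is an extra illustration, not something used earlier):

```julia
str = "Elementary, my dear Watson!"

println(replace("Sherlock Holmes", "e" => "ee"))            # Sheerlock Holmees
println(replace("Sherlock Holmes", "e" => uppercase))       # ShErlock HolmEs
println(replace(str, r"[aeiou]" => uppercase))              # ElEmEntAry, my dEAr WAtsOn!
println(replace(str, r"(\s)([a-z])" => s"\1\2\2\2"))        # Elementary, mmmy dddear Watson!

# count limits how many occurrences are replaced.
println(replace("Sherlock Holmes", "e" => "E", count=1))    # ShErlock Holmes
```
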
For more regular expression fun, there are the match functions: ismatch(), match(), matchall(), and eachmatch().

Here I've loaded the complete text of "The Adventures of Sherlock Holmes" from a file into the string called text:

julia> f = "/tmp/adventures-of-sherlock-holmes.txt"
julia> text = readstring(f);

To use the possibility of a match as a Boolean condition, suitable for use in an if statement for example, use ismatch().

julia> ismatch(r"Opium", text)
false

julia> ismatch(r"(?i)Opium", text)
true

The word "opium" does appear in the text, but only in lower-case, hence the first false result: regular expressions are case-sensitive by default. The second search, a case-insensitive search (set by the flag (?i)) for "Opium", returns true.

You could check every line for the word using ismatch() in a simple loop:

for l in split(text, "\n")
    ismatch(r"opium", l) && println(l)
end

opium. The habit grew upon him, as I understand, from some
he had, when the fit was on him, made use of an opium den in the
brown opium smoke, and terraced with wooden berths, like the
wrinkled, bent with age, an opium pipe dangling down from between
very short time a decrepit figure had emerged from the opium den,
opium-smoking to cocaine injections, and all the other little
steps - for the house was none other than the opium den in which
lives upon the second floor of the opium den, and who was
learn to have been the lodger at the opium den, and to have been
doing in the opium den, what happened to him when there, where is
"Had he ever showed any signs of having taken opium?"
room above the opium den when I looked out of my window and saw,

For more useable output (in the REPL), add enumerate() and some highlighting:

julia> bold = "\x1b[1m"; default = "\x1b[0m";
julia> for (n,l) in enumerate(split(text, "\n"))
           ismatch(r"opium", l) && println("$n $(replace(l, "opium", "$(bold)opium$(default)"))")
       end
5087 opium. The habit grew upon him, as I understand, from some
5140 he had, when the fit was on him, made use of an opium den in the
5173 brown opium smoke, and terraced with wooden berths, like the
5237 wrinkled, bent with age, an opium pipe dangling down from between
5273 very short time a decrepit figure had emerged from the opium den,
5280 opium-smoking to cocaine injections, and all the other little
5429 steps - for the house was none other than the opium den in which
5486 lives upon the second floor of the opium den, and who was
5510 learn to have been the lodger at the opium den, and to have been
5593 doing in the opium den, what happened to him when there, where is
5846 "Had he ever showed any signs of having taken opium?"
6129 room above the opium den when I looked out of my window and saw,

There's an alternative syntax for adding regex modifiers, such as case-insensitive matches. Notice the "i" following the regex string:

julia> ismatch(r"m"i, s)
true

With the eachmatch() function, you apply the regex to the string to produce an iterator. For example, to look for substrings in our text matching the letter "L", followed by some other characters, ending with "ed":

julia> lmatch = eachmatch(r"L.*?ed", text);

The result in lmatch is an iterable object containing all the matches, as RegexMatch objects. Now we can work through the iterator and look at each match in turn. You can access a number of fields of the RegexMatch, to extract information about the match. For example, the .match field contains the matched substring:

julia> for i in lmatch
           println(i.match)
       end

London - quite so! Your Majesty, as I understand, became entangled
Lodge. As it pulled
Lord, Mr. Wilson, that I was a red
League of the Red
League was founded
London when he was young, and he wanted
LSON" in white letters, upon a corner house, announced
League, and the copying of the 'Encyclopaed
Leadenhall Street Post Office, to be left till called
Let the whole incident be a sealed
Lestrade, being rather puzzled
Lestrade would have noted
...
Lestrade," drawled
Lestrade looked
Lord St. Simon has not already arrived
Lord St. Simon sank into a chair and passed
Lord St. Simon had by no means relaxed
Lordship. "I may be forced
London. What could have happened
London, and I had placed

Other fields include .captures, the captured substrings as an array of strings, .offset, the offset into the string at which the whole match begins, and .offsets, the offsets of the captured substrings.

If you don't want an iterable object, use the matchall() function instead:

julia> lmatches = matchall(r"L.*?ed", text);

Now the lmatches array contains the matching substrings, which you can inspect any way you want:

julia> lmatches[4:6]
3-element Array{SubString{String},1}:
 "League of the Red"
 "League was founded"
 "London when he was young, and he wanted"

The basic match() function looks for the first match for your regex. Use the .match field to extract the information from the RegexMatch object:

julia> match(r"She.*",text).match
"Sherlock Holmes she is always THE woman. I have seldom heard\r"

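Since the Sherlock Holmes file isn't available here, the following sketch uses a short made-up sentence in its place. It assumes Julia 1.x, where occursin() replaces ismatch() and collecting the results of eachmatch() stands in for matchall():

```julia
text = "Lestrade looked puzzled. London was founded long ago."

println(occursin(r"london", text))     # false: case-sensitive by default
println(occursin(r"london"i, text))    # true

# Collecting the matches from eachmatch() stands in for matchall().
found = [m.match for m in eachmatch(r"L.*?ed", text)]
println(length(found))                 # 2
println(found[1])                      # Lestrade looked

# A RegexMatch exposes the captured groups and their offsets.
m = match(r"(\w+) (\w+)", "Sherlock Holmes")
println(m.captures[2])                 # Holmes
println(m.offset)                      # 1
println(m.offsets)                     # [1, 10]

# match() returns nothing when the regex doesn't match.
println(match(r"Moriarty", text) === nothing)   # true
```
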
It's possible to use filter() with a regular expression directly on an array of strings:

julia> filter(r"(?i)Opium", map(chomp, readlines(open(f))))
20-element Array{AbstractString,1}:
 "opium. The habit grew upon him, as I understand, from some"
 "he had, when the fit was on him, made use of an opium den in the"
 "brown opium smoke, and terraced with wooden berths, like the"
 "wrinkled, bent with age, an opium pipe dangling down from between"
 "very short time a decrepit figure had emerged from the opium den,"
 "opium-smoking to cocaine injections, and all the other little"
 "steps - for the house was none other than the opium den in which"
 "lives upon the second floor of the opium den, and who was"
 "learn to have been the lodger at the opium den, and to have been"
 "doing in the opium den, what happened to him when there, where is"
 "\"Had he ever showed any signs of having taken opium?\""
 "room above the opium den when I looked out of my window and saw,"
 "opium, while the people at the house partook of the"
 "the powdered opium?  Above all, where could he, a"
 "summer.  The opium was probably brought from London. "
 "Powdered opium is by no means tasteless.  The flavor"
 "happened to come along with powdered opium upon the"
 "mutton for supper that night.  The opium was added"
 "                            opium, and poisons generally."
 "rebels, drunk with opium and with bang, were enough to remind us"

Testing and changing strings

There are lots of functions for testing and changing strings:

  • length(str) length of string
  • sizeof(str) size of the string in bytes
  • startswith(strA, strB) does strA start with strB?
  • endswith(strA, strB) does strA end with strB?
  • contains(strA, strB) does strA contain strB?
  • all(isalnum, str) is str alphanumeric?
  • all(isalpha, str) is str alphabetic?
  • isascii(str) is str ASCII?
  • all(iscntrl, str) is str control characters?
  • all(isdigit, str) is str 0-9?
  • all(islower, str) is str lowercase?
  • all(ispunct, str) does str consist of punctuation?
  • all(isspace, str) is str whitespace characters?
  • all(isupper, str) is str uppercase?
  • all(isxdigit, str) is str hexadecimal digits?
  • uppercase(str) return a copy of str converted to uppercase
  • lowercase(str) return a copy of str converted to lowercase
  • titlecase(str) return copy of str with the first character of each word converted to uppercase
  • ucfirst(str) return copy of str with first character converted to uppercase
  • lcfirst(str) return copy of str with first character converted to lowercase
  • chop(str) return a copy with the last character removed
  • chomp(str) return a copy with the last character removed only if it's a newline
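
A few of these functions in action, as a sketch assuming Julia 1.x, where isalpha() became isletter() and ucfirst() became uppercasefirst() (the sample strings are arbitrary):

```julia
str = "sherlock holmes"

println(startswith(str, "sher"))          # true
println(endswith(str, "mes"))             # true
println(all(isletter, "Watson"))          # true  (isletter replaces isalpha)
println(all(isdigit, "221b"))             # false
println(titlecase(str))                   # Sherlock Holmes
println(uppercasefirst(str))              # Sherlock holmes  (replaces ucfirst)
println(chop("Watson."))                  # Watson
println(chomp("a line\n") == "a line")    # true
println(length("é"), " ", sizeof("é"))    # 1 2
```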

Streams

To write to a string, you can use a Julia stream. The sprint() (String Print) function lets you use a function as the first argument, and uses the function and the rest of the arguments to send information to a stream.

For example, consider the following function, f. The body of the function maps an anonymous 'print' function over the arguments, enclosing them with angle brackets. When used by sprint, the function f processes the remaining arguments and sends them to the stream, which, with sprint(), is a string.

julia> function f(io::IO, args...)
    map((a) -> print(io,"<",a, ">"), args)
end

f (generic function with 1 method)
julia> sprint(f, "fred", "jim", "bill", "fred blogs")

"<fred><jim><bill><fred blogs>"

Functions like println() can take an IOBuffer or stream as their first argument. This lets you print to streams instead of printing to the standard output device:

julia> iobuffer = IOBuffer()

IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> for i in 1:100
           println(iobuffer, string(i))
       end

After this, the in-memory stream called iobuffer is full of numbers and newlines, even though nothing was printed on the terminal. To copy the contents of iobuffer from the stream to a string or array, you can use takebuf_string() (or takebuf_array()):

julia> takebuf_string(iobuffer)
"1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n32\n33\n34\n35\n36\n37\n38\n39\n40\n41\n42\n43\n44\n45\n46\n47\n48\n49\n50\n51\n52\n53\n54\n55\n56\n57\n58\n59\n60\n61\n62\n63\n64\n65\n66\n67\n68\n69\n70\n71\n72\n73\n74\n75\n76\n77\n78\n79\n80\n81\n82\n83\n84\n85\n86\n87\n88\n89\n90\n91\n92\n93\n94\n95\n96\n97\n98\n99\n100\n"
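
In Julia 1.x, takebuf_string() is gone; the usual replacement is String(take!(io)). The sketch below, with a shorter loop than the one above, shows that form alongside a one-step sprint() call using an anonymous function:

```julia
io = IOBuffer()
for i in 1:5
    println(io, i)
end

# take! empties the buffer and returns its bytes; String() converts them.
contents = String(take!(io))
println(repr(contents))    # "1\n2\n3\n4\n5\n"

# sprint() with an anonymous function builds a string in one step.
tagged = sprint(io -> foreach(a -> print(io, "<", a, ">"), ["fred", "jim"]))
println(tagged)            # <fred><jim>
```

This buffer-based approach is the one to prefer when building a long string piece by piece inside a loop, since repeated * concatenation creates a new string each time.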