Introducing Julia/Working with text files

From Wikibooks, open books for an open world

Working with text files

Reading from files

The standard approach for getting information from a text file is using the open(), read(), and close() functions.

Open

To read text from a file, first obtain a file handle:

f = open("sherlock-holmes.txt")

f is now Julia's connection to the file on disk. When you've finished with the file, you should close the connection, using:

close(f)

However, the recommended way to read a file in Julia is to wrap any file-processing functions inside a do block:

 open("sherlock-holmes.txt") do f
    # do stuff with the open file
 end

The open file is automatically closed when this block finishes. See Controlling the flow for more about do blocks.
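The automatic closing can be checked with a small sketch. It assumes Julia 1.x, where read(f, String) takes the place of readstring(f); the helper name count_chars and the temporary file are illustrative only and not part of the original text.

```julia
# Sketch for Julia 1.x. count_chars and the temporary file are illustrative only.
function count_chars(path)
    handle = nothing
    n = open(path) do f
        handle = f                   # keep a reference so it can be inspected afterwards
        length(read(f, String))      # the value of the block becomes the value of open()
    end
    return n, isopen(handle)
end

path, io = mktemp()                  # throwaway file standing in for a real text file
write(io, "first line\nsecond line\n")
close(io)

nchars, still_open = count_chars(path)
println(nchars, " characters; still open? ", still_open)
```

Because the do block is an ordinary function passed to open(), whatever the block returns is handed back to the caller, so there is usually no need to keep the handle around at all.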

Slurp: reading a file all at once

You can read the entire contents of an open file at once with readstring():

  s = readstring(f)

This returns a string, complete with newlines.

Or you can use readlines() to read in the whole file as an array, with each line an element:

julia> f = open("sherlock-holmes.txt");
julia> lines = readlines(f)
76803-element Array{String,1}:
 "THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE\r\n"
 "\r\n"
 "   I. A Scandal in Bohemia\r\n"
 "  II. The Red-headed League\r\n"
 ...
 "Holmes, rather to my disappointment, manifested no further\r\n"
 "interest in her when once she had ceased to be the centre of one\r\n"
 "of his problems, and she is now the head of a private school at\r\n"
 "Walsall, where I believe that she has met with considerable success.\r\n"
julia> close(f)

Now you can step through the lines:

counter = 1
for l in lines
   println("$counter $l")
   counter += 1
end
1 THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE
2
3    I. A Scandal in Bohemia
4   II. The Red-headed League
5  III. A Case of Identity
6   IV. The Boscombe Valley Mystery
...
12638 interest in her when once she had ceased to be the centre of one
12639 of his problems, and she is now the head of a private school at
12640 Walsall, where I believe that she has met with considerable success.

There's a better way to do this — see enumerate(), below.

You might find the chomp() function useful — it removes the trailing newline from a string.
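As a rough illustration of chomp(), the sketch below writes two lines with Windows-style endings to a temporary file and strips them again. It assumes Julia 1.x, where readlines() drops line endings by default and the keep=true keyword retains them; the file contents are made up.

```julia
# Sketch for Julia 1.x. The temporary file and its contents are illustrative only.
path, io = mktemp()
write(io, "alpha\r\nbeta\r\n")
close(io)

raw   = readlines(path, keep=true)   # line endings retained, as in the listing above
clean = chomp.(raw)                  # one trailing newline ("\n" or "\r\n") removed from each

println(repr(raw))
println(repr(clean))
```

The dot in chomp.(raw) broadcasts the function over every element of the array.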

Line by line

The eachline() function turns a source into an iterator. This allows you to process a file a line at a time:

f = open("sherlock-holmes.txt");
for ln in eachline(f)
       print("$(length(ln)), $ln")
end
close(f)
61, THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE
2,
28,    I. A Scandal in Bohemia
29,   II. The Red-headed League
26,  III. A Case of Identity
35,   IV. The Boscombe Valley Mystery

62, the island of Mauritius. As to Miss Violet Hunter, my friend
60, Holmes, rather to my disappointment, manifested no further
66, interest in her when once she had ceased to be the centre of one
65, of his problems, and she is now the head of a private school at
70, Walsall, where I believe that she has met with considerable success.

Another approach is to read until you reach the end of the file. Typically you also want to keep track of which line you're on:

 open("sherlock-holmes.txt") do f
   line = 1
   while !eof(f)
     x = readline(f)
     println("$line $x")
     line += 1
   end
 end

An easier approach is to use enumerate() on an iterable object — you'll get the line numbering for free:

open("sherlock-holmes.txt") do f
    for i in enumerate(eachline(f))
      println(i)
    end
end
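Each item produced by enumerate() is an (index, value) tuple, so it can be split into two variables directly in the loop header. A minimal sketch, assuming Julia 1.x; numbered_lines and the temporary file are hypothetical names used only for illustration:

```julia
# Sketch for Julia 1.x. numbered_lines is a hypothetical helper, not a Base function.
function numbered_lines(path)
    out = String[]
    open(path) do f
        # destructure each (index, line) tuple produced by enumerate()
        for (i, ln) in enumerate(eachline(f))
            push!(out, "$i $ln")
        end
    end
    return out
end

path, io = mktemp()
write(io, "To Sherlock Holmes\nshe is always\nthe woman\n")
close(io)

foreach(println, numbered_lines(path))
```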

If you have a specific function that you want to call on a file, you can use this alternative syntax:

julia> function bigUpFunc(f::IOStream)
    return uppercase(readstring(f))
end
julia> upversion = open(bigUpFunc, "sherlock-holmes.txt");
julia> upversion[1:21]
"THE COMPLETE SHERLOCK"

This opens the file, runs the bigUpFunc function on it, then closes it again, assigning the processed contents to the variable.
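The function-first form of open() can be tried on a small file. This is only a sketch, assuming Julia 1.x (read(io, String) in place of readstring()); the function name shout and the file contents are invented for the example.

```julia
# Sketch for Julia 1.x. shout is a hypothetical function name.
shout(io::IO) = uppercase(read(io, String))

path, io = mktemp()
write(io, "a scandal in bohemia\n")
close(io)

# open(function, filename) opens the file, applies the function to the
# stream, closes the file, and returns the function's result
result = open(shout, path)
print(result)
```

This is the same mechanism that the do block uses: a do block is simply a way of writing the function argument after the call rather than before it.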

You can use the readcsv() and readdlm() functions to read lines from CSV files or from files delimited with other characters, such as data files, arrays stored as text files, and tables. If you use the DataFrames package, there's also readtable(), designed specifically to read data into a table.

Working with paths and filenames

These functions will be useful for working with filenames:

  • cd(path) - change the current directory.
  • readdir(path) - return a list of the contents of the named directory, or of the current directory if no path is given.
  • abspath(path) - add the current directory's path to a filename to make an absolute pathname.
  • joinpath(str, str, ...) - assemble a pathname from pieces.
  • isdir(path) - test whether the path is a directory.
  • splitdir(path) - split a path into a tuple of the directory name and file name.
  • splitdrive(path) - on Windows, split a path into the drive letter part and the path part. On Unix systems, the first component is always the empty string.
  • splitext(path) - if the last component of a path contains a dot, split the path into everything before the dot and everything including and after the dot. Otherwise, return a tuple of the argument unmodified and the empty string.
  • expanduser(path) - replace a tilde character at the start of a path with the current user's home directory.
  • normpath(path) - normalize a path, removing "." and ".." entries.
  • realpath(path) - canonicalize a path by expanding symbolic links and removing "." and ".." entries.
  • homedir() - return the current user's home directory.
  • dirname(path) - get the directory part of a path.
  • basename(path) - get the file name part of a path.
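A few of these functions can be combined without touching the disk at all. The sketch below assumes Julia 1.x, and the path pieces ("projects", "julia", "notes.txt") are made-up names.

```julia
# Sketch for Julia 1.x. The path pieces are made-up names; nothing touches the disk.
p = joinpath("projects", "julia", "notes.txt")   # separator chosen for the current OS

dir, file = splitdir(p)       # directory part and file name part
stem, ext = splitext(file)    # the extension keeps its leading dot

println("directory: ", dir)
println("file:      ", file)
println("stem/ext:  ", stem, " / ", ext)
println("absolute:  ", abspath(p))
println("tidied:    ", normpath(joinpath("projects", "..", "notes.txt")))
```

Building paths with joinpath() rather than by concatenating strings keeps the code portable between Unix and Windows.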

To work on a restricted selection of files in a directory, use filter() and an anonymous function to filter the file names and just keep the ones you want. (filter() is more of a fishing net or sieve, rather than a coffee filter, in that it catches what you want to keep.)

for f in filter(x -> endswith(x, ".jl"), readdir())
    println(f)
end

Astro.jl
calendar.jl
constants.jl
coordinates.jl
...
pseudoscience.jl
riseset.jl
sidereal.jl
sun.jl
utils.jl
vsop87d.jl

If you want to match a group of files using a regular expression, then use ismatch(). Let's look for both JPG and PNG files (remembering to escape the "."):

for f in filter(x -> ismatch(r"\.jpg|\.png", x), readdir())
    println(f)
end

034571172750.jpg
034571172750.png
51ZN2sCNfVL._SS400_.jpg
51bU7lucOJL._SL500_AA300_.jpg
Voronoy.jpg
kblue.png
korange.png
penrose.jpg
r-home-id-r4.png
wave.jpg
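The same kind of filtering can be tried on a plain list of names. This sketch assumes Julia 1.x, where occursin(pattern, string) takes the place of ismatch(); the names are invented, and anchoring the pattern with "$" so that only true file extensions match is an addition of this example rather than something the listing above does.

```julia
# Sketch for Julia 1.x. The file names are invented for the example.
names = ["penrose.jpg", "kblue.png", "notes.txt", "wave.jpg.bak"]

# "\." matches a literal dot; "$" anchors the match at the end of the name
images = filter(n -> occursin(r"\.(jpg|png)$", n), names)

foreach(println, images)
```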

To examine a file hierarchy, use walkdir(), which lets you work through a directory, and examine the files in each directory in turn.
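A rough sketch of walkdir() in use, assuming Julia 1.x. The helper files_with_extension and the small directory tree are hypothetical and are built in a temporary directory purely for illustration.

```julia
# Sketch for Julia 1.x. files_with_extension is a hypothetical helper.
function files_with_extension(root, ext)
    found = String[]
    # walkdir yields a (directory, subdirectory names, file names) tuple per directory
    for (dir, subdirs, files) in walkdir(root)
        for name in files
            endswith(name, ext) && push!(found, joinpath(dir, name))
        end
    end
    return sort(found)
end

# build a small throwaway hierarchy to walk through
root = mktempdir()
mkpath(joinpath(root, "src", "utils"))
touch(joinpath(root, "main.jl"))
touch(joinpath(root, "src", "utils", "helpers.jl"))
touch(joinpath(root, "README.md"))

jl = files_with_extension(root, ".jl")
foreach(println, jl)
```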

File information

If you want information about a specific file, use stat("pathname"), and then use one of the fields to find out the information. Here's how to get all the information and the field names listed for a file "i":

julia> for n in fieldnames(stat(i))
    println(n, ": ", getfield(stat(i),n))
end

device: 16777219
inode: 2955324
mode: 16877
nlink: 943
uid: 502
gid: 20
rdev: 0
size: 32062
blksize: 4096
blocks: 0
mtime: 1.409769933e9
ctime: 1.409769933e9
julia>

Although you can access these fields via a 'stat' structure:

 julia> s = stat("Untitled1.ipynb")
 StatStruct(mode=100644, size=64424)

 julia> s.ctime
 1.446649269e9

you can also use some of them directly:

 julia> ctime("Untitled2.ipynb")
 1.446649269e9

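although not size, because size() in Julia returns the dimensions of an array; read the size field from the structure instead (or call filesize()):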

 julia> s.size
 64424
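The relationship between the stat structure and the shortcut functions can be sketched as follows, assuming Julia 1.x; the temporary file and its five-byte contents are made up.

```julia
# Sketch for Julia 1.x. The temporary file is illustrative only.
path, io = mktemp()
write(io, "12345")            # five bytes
close(io)

s = stat(path)
println("size via stat:     ", s.size)
println("size via filesize: ", filesize(path))
println("regular file?      ", isfile(path))
println("same mtime?        ", mtime(path) == s.mtime)
```

Calling stat() once and reading several fields from the result avoids asking the file system repeatedly for the same file.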

To work on specific files that meet conditions — all IPython files modified after a certain date, for example — you could use something like this:

 julia> function process_file(path)
            println(path, " ", stat(path).size)
        end

 julia> for afile in filter!(f ->
                    endswith(f, "ipynb") &&
                    (mtime(f) > Dates.datetime2unix(DateTime("2015-11-03T09:00"))),
                readdir())
            process_file(realpath(afile))
        end

Interacting with the file system

The cp(), mv(), rm(), and touch() functions have the same names and functions as their Unix shell counterparts.
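A minimal sketch of these four functions working together, assuming Julia 1.x. Everything happens inside a temporary directory, and the file names a.txt, b.txt, and c.txt are invented.

```julia
# Sketch for Julia 1.x. All files live in a throwaway temporary directory.
dir = mktempdir()
a = joinpath(dir, "a.txt")
b = joinpath(dir, "b.txt")
c = joinpath(dir, "c.txt")

touch(a)      # create an empty file
cp(a, b)      # copy a.txt to b.txt
mv(b, c)      # rename b.txt to c.txt
rm(a)         # delete a.txt

println(readdir(dir))
```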

To convert filenames to pathnames, use abspath(). You can map this over a list of files in a directory:

julia> map(abspath,readdir())
67-element Array{String,1}:
 "/Users/me/.CFUserTextEncoding"
 "/Users/me/.DS_Store"
 "/Users/me/.Trash"
 "/Users/me/.Xauthority"
 "/Users/me/.ahbbighrc"
 "/Users/me/.apdisk"
 "/Users/me/.atom"
...

To restrict the list to filenames that contain a particular substring, use an anonymous function inside filter() — something like this:

julia> filter(x -> contains(x, "re"),map(abspath, readdir()))
4-element Array{String,1}:
 "/Users/me/.DS_Store"
 "/Users/me/.gitignore"
 "/Users/me/.hgignore_global"
 "/Users/me/Pictures"
 ...

To restrict the list to regular expression matches, try this:

julia> filter(x -> ismatch(r"recur.*\.jl", x), map(abspath, readdir()))
2-element Array{String,1}:
 "/Users/me/julia/recursive-directory-scan.jl"
 "/Users/me/julia/recursive-text.jl"

Writing to files

To write to a text file, open it using the "w" flag and make sure that you have permission to create the file in the specified directory:

open("/tmp/t.txt", "w") do f
    write(f, "A, B, C, D\n")
end

Here's how to write 20 lines of 4 random numbers between 1 and 10, separated by commas:

function fourrandom()
    return rand(1:10,4)
end

open("/tmp/t.txt", "w") do f
    for i in 1:20
        n1, n2, n3, n4 = fourrandom()
        write(f, "$n1, $n2, $n3, $n4\n")
    end
end
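As a variation on the loop above, println() accepts a stream as its first argument and adds the newline itself, and join() builds the comma-separated text. The sketch assumes Julia 1.x; the "a" (append) flag is not mentioned in the text above and is shown here as an addition, and the file name and numbers are made up.

```julia
# Sketch for Julia 1.x. The file lives in a temporary directory; the numbers are made up.
path = joinpath(mktempdir(), "t.txt")

open(path, "w") do f                       # "w" creates the file or truncates an existing one
    println(f, join([1, 2, 3, 4], ", "))
end

open(path, "a") do f                       # "a" appends to the end instead of overwriting
    println(f, join([5, 6, 7, 8], ", "))
end

print(read(path, String))
```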

A quicker alternative to this is to use the writedlm() function, described next:

writedlm("/tmp/test.txt", rand(1:10, 20, 4), ", ")

Writing and reading arrays to and from a file

The convenient writedlm() and readdlm() functions let you store an array or collection in a text file and read it back. writedlm() writes the contents of an object to a text file, and readdlm() reads the data from a file into an array:

julia> numbers = rand(5,5)
5x5 Array{Float64,2}:
 0.913583  0.312291  0.0855798  0.0592331  0.371789
 0.13747   0.422435  0.295057   0.736044   0.763928
 0.360894  0.434373  0.870768   0.469624   0.268495
 0.620462  0.456771  0.258094   0.646355   0.275826
 0.497492  0.854383  0.171938   0.870345   0.783558

julia> writedlm("/tmp/test.txt", numbers)

You can see the file using the shell (type a semicolon ";" to switch):

shell>  cat "/tmp/test.txt"
.9135833328830523	.3122905420350348	.08557977218948465	.0592330821115965	.3717889559226475
.13747015238054083	.42243494637594203	.29505701073304524	.7360443978397753	.7639280496847236
.36089432672073607	.43437288984307787	.870767989032692	.4696243851552686	.26849468736154325
.6204624598015906	.4567706404666232	.25809436255988105	.6463554854347682	.27582613759302377
.4974916625466639	.8543829989347014	.17193814498701587	.8703447748713236	.783557793485824

The elements are separated by tabs unless you specify another delimiter. Here, a colon is used to delimit the numbers:

julia> writedlm("/tmp/test.txt", rand(1:6, 10, 10), ":")
shell>  cat "/tmp/test.txt"
3:3:3:2:3:2:6:2:3:5
3:1:2:1:5:6:6:1:3:6
5:2:3:1:4:4:4:3:4:1
3:2:1:3:3:1:1:1:5:6
4:2:4:4:4:2:3:5:1:6
6:6:4:1:6:6:3:4:5:4
2:1:3:1:4:1:5:4:6:6
4:4:6:4:6:6:1:4:2:3
1:4:4:1:1:1:5:6:5:6
2:4:4:3:6:6:1:1:5:5

To read in data from a text file, you can use readdlm().

julia> numbers = rand(5,5)
5x5 Array{Float64,2}:
 0.862955  0.00827944  0.811526  0.854526  0.747977
 0.661742  0.535057    0.186404  0.592903  0.758013
 0.800939  0.949748    0.86552   0.113001  0.0849006
 0.691113  0.0184901   0.170052  0.421047  0.374274
 0.536154  0.48647     0.926233  0.683502  0.116988

julia> writedlm("/tmp/test.txt", numbers)

julia> numbers = readdlm("/tmp/test.txt")
5x5 Array{Float64,2}:
 0.862955  0.00827944  0.811526  0.854526  0.747977
 0.661742  0.535057    0.186404  0.592903  0.758013
 0.800939  0.949748    0.86552   0.113001  0.0849006
 0.691113  0.0184901   0.170052  0.421047  0.374274
 0.536154  0.48647     0.926233  0.683502  0.116988

Since it's so common to use files where the elements are separated with commas rather than tabs (CSV files), Julia provides "-csv" versions of these "-dlm" functions, writecsv() and readcsv(). As ever, refer to the official documentation for options and keywords. There are also a number of Julia packages specifically designed for reading and writing data to files, including DataFrames.jl and CSV.jl. Look through the Julia package directory for these and more.
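In Julia 1.x the "-dlm" functions live in the DelimitedFiles standard library, and comma-separated files can be handled by passing ',' as the delimiter. The sketch below assumes Julia 1.x; the file name and the matrix are made up.

```julia
# Sketch for Julia 1.x, where readdlm/writedlm come from the DelimitedFiles stdlib.
using DelimitedFiles

path = joinpath(mktempdir(), "test.csv")   # throwaway file
original = [1 2 3; 4 5 6]

writedlm(path, original, ',')              # one row per line, elements separated by commas
restored = readdlm(path, ',', Int)         # third argument requests an integer matrix

print(read(path, String))
println(restored == original)
```

Without the element type argument, readdlm() works out a suitable element type from the contents of the file.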