Haskell/Libraries/Data structures primer
This chapter introduces some of the data structures available from the libraries. We will focus on common and particularly useful examples that every Haskeller should learn about.
This chapter continually emphasizes shortcomings with lists, but that does not mean you should quit using them! Lists are the default data structure in Haskell for good reasons: beyond their simplicity, lists have a pretty high power-to-weight ratio in a lazy, purely functional setting. Laziness makes it possible to use lists as streams where we sequentially process elements that are generated on demand. That process allows functions such as
zipWith to work as effective replacements for many common uses of iterative control structures.
As powerful as they may be, lists are better suited to patterns like streaming and control structure-like rather than simple data storage and retrieval. Of course, switching to a different data structure involves trade-offs. There will be advantages and disadvantages to any data structure, and the right choice depends on the problem at hand.
Lookups: Data.Map and co.
First, we'll consider a common problem: performing lookups on a data structure. Given a collection of associations between keys and values, we may want to retrieve the value, if any, corresponding to some key. We could simply store the associations as a list of pairs,
[(k, v)]. Indeed, Prelude contains
lookup :: Eq k => k -> [(k, v)] -> Maybe v, but to find a value from within a list, we have to go through all pairs in the list, testing the keys for equality until we reach either the one we are looking (or exhaust the list). A lookup in a plain association list is an O(n) operation, as the expected number of steps needed to perform it grows proportionally to the length of the list. It is easy to see how that can become a problem when there are a lot of associations.
We can achieve better lookups by switching to a more appropriate data structure. The
Map type provided by
Data.Map from the
containers package is a fine general-purpose choice. Note that
Data.Map is generally imported qualified, to avert name clashes with Prelude functions.
GHCi> import qualified Data.Map as M GHCi> :t M.empty M.empty :: M.Map k a
Map, keys and values are arranged in a (size balanced, binary) tree. That tree form makes looking for a key work by simply going down a particular branch of the tree. Tree manipulation, however, happens entirely behind the scenes.
Map, like many other data structures from the libraries, is used as an abstract type through an interface with no mention of the tree implementation backing it. In particular, constructors are not exported: a new
Map is built e.g. by either inserting associations into an
empty map or by using the utility function
GHCi> let foo = M.fromList [(1, "Robert"), (5, "Ian"), (6, "Bruce")] GHCi> :t foo foo :: M.Map Integer [Char]
Data.Map interface provides O(log n) lookups...
GHCi> :t M.lookup M.lookup :: Ord k => k -> M.Map k a -> Maybe a GHCi> M.lookup 5 foo Just "Ian" GHCi> M.lookup 7 foo Nothing
...as well as scores of other useful operations — unions, intersections, deletions and so forth. Instances for a handful of important type classes such as
Functor are available as well.
GHCi> M.size $ M.union foo $ M.fromList [(11, "Andrew"), (17, "Mike")] 5 GHCi> fmap reverse foo fromList [(1,"treboR"),(5,"naI"),(6,"ecurB")]
Other modules providing map and map-like data structures worth knowing about include:
containersprovides a more efficient map implementation limited to
Data.Set, also in
containers, provides a set implementation. Sets are appropriate when the interesting operation is simply finding whether a value is in a collection, rather than retrieving a value given a key. They are a lot like a map in which only the keys matter, and so many considerations about performance and implementations apply to both sets and maps.
unordered-containerspackage provides hash maps and sets. They offer efficiency gains (such as constant-time lookups) without limiting the type of the key to
Int, at the cost of a comparatively limited interface and the loss of the ordering guarantees of the
Peeking at both ends with Data.Sequence
One of the peculiarities of lists is that they are asymmetric. Given that we construct and deconstruct lists by their head with `(:)`, operations at the head are more efficient than the corresponding operations at the tail. For instance, while prepending an element with `(:)` takes constant time, naïvely appending a single element with
\xs x -> xs ++ [x] takes time proportional to the length of
xs. That means building up a list by repeatedly appending will take quadratic time in the number of elements, which is really bad.
When lots of operations at the middle or at the tail have to be done, an excellent list-like alternative to lists are sequences, as provided by the
Data.Sequence module, which is also part of
containers. Sequences and lists are quite different from one another, even though many of the familiar list functions reappear in some guise in
Data.Sequence. While lists are lazy and can be infinite, sequences are finite and strict. The trade-off which makes sequences useful is that, at the cost of some overhead with respect to lists, many operations which were troublesome with lists perform much better. Remarkably, we get both appending and prepending in constant time, length in constant time and also concatenation and random access in logarithmic time. All of that is available through a pleasant, purely functional interface.
GHCi> import qualified Data.Sequence as S GHCi> import Data.Sequence((<|), (|>), (><), ViewL(..), ViewR(..)) GHCi> let foo = S.fromList [1, 3, 5, 2, 9] GHCi> :t foo foo :: S.Seq Integer
(|>) appends, and
GHCi> 0 <| foo fromList [0,1,3,5,2,9] GHCi> foo |> 18 fromList [1,3,5,2,9,18] GHCi> foo >< foo fromList [1,3,5,2,9,1,3,5,2,9]
You can also pattern match at both ends. For that, you use either
viewr to get the desired view of the sequence and then match using
GHCi> S.viewl foo 1 :< fromList [3,5,2,9] GHCi> S.viewr foo fromList [1,3,5,2] :> 9 GHCi> let xs :> x = S.viewr foo GHCi> xs fromList [1,3,5,2] GHCi> x 9
Raw performance with arrays
Performance demands when dealing with large volumes of data can be quite stringent. For situations which require fast processing of bulk data, with laziness and streaming not being relevant concerns, Haskell offers true arrays in the vein of those found in C and elsewhere. Arrays are compact memory-wise, offer constant time random access and many blazingly fast operations (the main exceptions being those that require copying the arrays, such as immutable array concatenation), at the cost of a certain unwieldiness arising from the deep differences in behaviour between arrays and the purely functional data structures we normally deal with. There are several array libraries available; each of them generally providing a number of different kinds of arrays - from those whose usage do not feel very different from usual Haskell data structures to C-like mutable arrays of raw primitive values. Here we will mention three of the most popular array libraries.
vectoris a good default choice if you are just starting with arrays, or if you do not have specialised needs covered by other libraries. It provides one-dimensional arrays with an interface reasonably similar to that found in other data structure libraries such as
array, in contrast, has a more intimidating interface. As for features, it supports multi-dimensional arrays and custom indexing. Importantly, it is also part of the language standard, and is bundled with GHC, which makes it useful for library writers unwilling to incur additional dependencies. We provide an overview of standard arrays in a separate chapter, which can be used as an introduction to array-related terminology.
repais a sophisticated library providing state-of-the-art multi-dimensional parallelisable arrays. It is well suited for tasks such as image processing.
text, bytestring, and the problem with String
A quick glance at the
base libraries might leave us with the impression that
Strings are the preferred way of getting data into and out of a Haskell program. However, there are several issues with
String which make it a poor fit for such a role.
- The most obvious problem is performance. A
Stringis just a list of
Char. In applications that have to process even modestly large amounts of text or binary data, the advantages of general-purpose linked lists are overshadowed by the large losses of efficiency relative to what a specialised implementation would make possible.
- With binary data, a deeper issue is that a
Char-based representation makes little sense, as we are actually dealing with raw bytes.
- Finally, while Haskell
Chars are Unicode characters, overall the support for different encodings and internationalisation in the
baselibraries is somewhat lacking.
Those shortcomings of
String are addressed by the libraries
bytestring. Both are de facto standards; pretty much all modern libraries whose functionality involves any significant volumes of data input or output use them. They have clearly separate use cases:
textis for efficient processing of Unicode text. It supports conversion between encodings and, with the companion
text-iculibrary, a wide assortment of Unicode services.
bytestringis for efficient processing of binary data, in all of its guises - cases include network packets, raw image data, serialization (through libraries such as
cereal); you name it.
The core types of both libraries,
ByteString, are implemented as specialised, monomorphic containers of
Word8 (i.e. raw bytes) respectively. The internal representation is array-based and very compact. In terms of interfaces, both libraries are quite straightforward. The main subtlety to be aware of is that in both cases there are strict and lazy variants of the types. The strict versions are well-suited for processing large volumes of small pieces of data, while the lazy ones are processed in chunks, and therefore allow for streaming and processing of large pieces of monolithic data without memory consumption woes.
A convenience feature worth being aware of when dealing with
String replacements is that the
OverloadedStrings GHC extension make it possible to have automatic, type-directed conversion of string literals to
ByteString, which can be quite helpful, especially for
- It goes without saying that mutable arrays must live in either the