Haskell/Other data structures

From Wikibooks, open books for an open world
Jump to navigation Jump to search

In this chapter, we will work through examples of how the techniques we have studied thus far can be used to deal with more complex data types. In particular, we will see examples of recursive data structures, which are data types that can contain values of the same type. Recursive data structures play a vital role in many programming techniques, and so even if you are not going to need defining a new one often (as opposed to using the ones available from the libraries) it is important to be aware of what they are and how they can be manipulated. Besides that, following closely the implementations in this chapter is a good exercise for your budding Haskell abilities.

Note

The Haskell library ecosystem provides a wealth of data structures (recursive and otherwise), covering a wide range of practical needs. Beyond lists, there are maps, sets, finite sequences and arrays, among many others. A good place to begin learning about the main ones is the Data structures primer in the Haskell in Practice track. We recommend you to at least skim it once you finish the next few Intermediate Haskell chapters.


Trees[edit | edit source]

One of the most important types of recursive data structures is trees. There are several different kinds of trees, so we will arbitrarily choose a simple one to use as an example. Here is its definition:

data Tree a = Leaf a | Branch (Tree a) (Tree a)

As you can see, it's parameterized; i.e. we can have trees of Ints, trees of Strings, trees of Maybe Ints, trees of (Int, String) pairs and so forth. What makes this data type special is that Tree appears in the definition of itself. A Tree a is either a leaf, containing a value of type a or a branch, from which hang two other trees of type Tree a.

Lists as Trees[edit | edit source]

As we have seen in Lists II and Lists III, we break lists down into two cases: An empty list (denoted by []), and an element of the specified type plus another list (denoted by (x:xs)). That means the definition of the list data type might look like this:

 -- Pseudo-Haskell, will not actually work (because lists have special syntax).
data [a] = [] | (a:[a])

An equivalent definition you can actually play with is:

data List a = Nil | Cons a (List a)

Like trees, lists are also recursive. For lists, the constructor functions are [] and (:). They correspond to Leaf and Branch in the Tree definition above. That implies we can use Leaf and Branch for pattern matching just as we did with the empty list and the (x:xs).

Maps and Folds[edit | edit source]

We already know about maps and folds for lists. Now, we can write map and fold functions for our new Tree type. To recap:

data Tree a = Leaf a | Branch (Tree a) (Tree a) deriving (Show)
data [a]    = []     | (:)    a [a]
  -- (:) a [a] is the same as (a:[a]) with prefix instead of infix notation.

Note

Deriving is explained later on in the section Classes and types. For now, understand it as telling Haskell (and by extension your interpreter) how to display a Tree instance.


Map[edit | edit source]

Let's take a look at the definition of map for lists:

map :: (a -> b) -> [a] -> [b]
map _ [] = []
map f (x:xs) = f x : map f xs

If we were to write treeMap, what would its type be? Defining the function is easier if you have an idea of what its type should be.

We want treeMap to work on a Tree of some type and return another Tree of some type by applying a function on each element of the tree.

treeMap :: (a -> b) -> Tree a -> Tree b

This is just like the list example.

Now, when talking about a Tree, each Leaf only contains a single value, so all we have to do is apply the given function to that value and then return a Leaf with the altered value:

treeMap :: (a -> b) -> Tree a -> Tree b
treeMap f (Leaf x) = Leaf (f x)

This looks a lot like the empty list case with map. Now, if we have a Branch, it will include two subtrees; what do we do with those? The list-map uses a call to itself on the tail of the list (recursion), so we also shall do that with the two subtrees. The complete definition of treeMap is as follows:

treeMap :: (a -> b) -> Tree a -> Tree b
treeMap f (Leaf x) = Leaf (f x)
treeMap f (Branch left right) = Branch (treeMap f left) (treeMap f right)

We can make this a bit more readable by noting that treeMap f is itself a function with type Tree a -> Tree b. This gives us the following revised definition:

treeMap :: (a -> b) -> Tree a -> Tree b
treeMap f = g where
  g (Leaf x) = Leaf (f x)
  g (Branch left right) = Branch (g left) (g right)

If you didn't follow that immediately, try re-reading it. This use of pattern matching may seem weird at first, but it is essential to the use of datatypes. Remember that pattern matching happens on constructor functions.

When you're ready, read on for folds over Trees.

Fold[edit | edit source]

As with map, let's first review the definition of foldr for lists:

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f acc [] = acc
foldr f acc (x:xs) = f x (foldr f acc xs)

Recall that lists have two constructors:

(:) :: a -> [a] -> [a]  -- takes an element and combines it with the rest of the list
[] :: [a]  -- the empty list takes zero arguments

Thus foldr takes two arguments corresponding to the two constructors:

f :: a -> b -> b  -- a function takes two elements and operates on them to return a single result
acc :: b  -- the accumulator defines what happens with the empty list

Let's take a moment to make this clear. If the initial foldr is given an empty list, then the default accumulator is returned. With a non-empty list, the first element is combined (with f) with the result of folding the tail of the list, and so the fold proceeds until we get to the empty list.

Like foldr for lists, we want treeFold to transform a tree of some type into a value of some other type; so in place of [a] -> b we will have Tree a -> b. How do we specify the transformation? First note that Tree a has two constructors (just like lists have two constructors):

Branch :: Tree a -> Tree a -> Tree a
Leaf :: a -> Tree a

So treeFold will have two arguments corresponding to the two constructors:

fbranch :: b -> b -> b
fleaf :: a -> b

Putting it all together we get the following type definition:

treeFold :: (b -> b -> b) -> (a -> b) -> Tree a -> b

That is, the first argument, of type (b -> b -> b), is a function specifying how to combine subtrees into a single result; the second argument, of type a -> b, is a function specifying what to do with leaves (which are the end of recursion, just like empty-list for lists); and the third argument, of type Tree a, is the whole tree we want to fold.

As with treeMap, we'll avoid repeating the arguments fbranch and fleaf by introducing a local function g:

treeFold :: (b -> b -> b) -> (a -> b)  -> Tree a -> b
treeFold fbranch fleaf = g where
  -- definition of g goes here

The argument fleaf tells us what to do with Leaf subtrees:

g (Leaf x) = fleaf x

The argument fbranch tells us how to combine the results of "folding" two subtrees:

g (Branch left right) = fbranch (g left) (g right)

Our full definition becomes:

treeFold :: (b -> b -> b) -> (a -> b) -> Tree a -> b
treeFold fbranch fleaf = g where
  g (Leaf x) = fleaf x
  g (Branch left right) = fbranch (g left) (g right)

For examples of how these work, copy the Tree data definition and the treeMap and treeFold functions to a Haskell file, along with the following example Tree and example functions to fold over.

tree1 :: Tree Integer
tree1 = 
    Branch
       (Branch 
           (Branch 
               (Leaf 1) 
               (Branch (Leaf 2) (Leaf 3))) 
           (Branch 
               (Leaf 4) 
               (Branch (Leaf 5) (Leaf 6)))) 
       (Branch
           (Branch (Leaf 7) (Leaf 8)) 
           (Leaf 9))
 
doubleTree = treeMap (*2)  -- doubles each value in tree
sumTree = treeFold (+) id -- sum of the leaf values in tree
fringeTree = treeFold (++) (: [])  -- list of the leaves of tree

Then load it into GHCi and evaluate:

doubleTree tree1
sumTree tree1

Other datatypes[edit | edit source]

Map and fold functions can be defined for any kind of data type. In order to generalize the strategy applied for lists and trees, in this final section we will work out a map and a fold for a very strange, contrived datatype:

data Weird a b = First a
               | Second b
               | Third [(a,b)]
               | Fourth (Weird a b)

It can be a useful exercise to write the functions as you follow the examples, trying to keep the coding one step ahead of your reading.

General Map[edit | edit source]

The first important difference in working with this Weird type is that it has two type parameters. For that reason, we will want the map function to take two functions as arguments, one to be applied on the elements of type a and another for the elements of type b. With that accounted for, we can write the type signature of weirdMap:

weirdMap :: (a -> c) -> (b -> d) -> Weird a b -> Weird c d

Next step is defining weirdMap. The key point is that maps preserve the structure of a datatype, so the function must evaluate to a Weird which uses the same constructor as the one used for the original Weird. For that reason, we need one definition to handle each constructor, and these constructors are used as patterns for writing them. As before, to avoid repeating the weirdMap argument list over and over again a where clause comes in handy. A sketch of the function is below:

weirdMap :: (a -> c) -> (b -> d) -> Weird a b -> Weird c d
weirdMap fa fb = g
  where
    g (First x)          = --More to follow
    g (Second y)         = --More to follow
    g (Third z)          = --More to follow
    g (Fourth w)         = --More to follow

The first two cases are fairly straightforward, as there is just a single element of a or b type inside the Weird.

weirdMap :: (a -> c) -> (b -> d) -> Weird a b -> Weird c d
weirdMap fa fb = g
  where
    g (First x)          = First (fa x)
    g (Second y)         = Second (fb y)
    g (Third z)          = --More to follow
    g (Fourth w)         = --More to follow

Third is trickier because it contains a list whose elements are themselves data structures (the tuples). So we need to navigate the nested data structures, apply fa and fb on all elements of type a and b and eventually (as a map must preserve structure) produce a list of tuples – [(c,d)] – to be used with the constructor. The simplest approach might seem to be just breaking down the list inside the Weird and playing with the patterns:

    g (Third []) = Third []
    g (Third ((x,y):zs)) = Third ( (fa x, fb y) : ( (\(Third z) -> z) (g (Third zs)) ) )

This appears to be written as a typical recursive function for lists. We start by applying the functions of interest to the first element in order to obtain the head of the new list, (fa x, fb y). But what will we cons it to? As g requires a Weird argument, we need to make a Weird using the list tail in order to make the recursive call. But then g will give a Weird and not a list, so we have to retrieve the modified list from that – that's the role of the lambda function. Finally, there is also the empty list base case to be defined as well.

After all of that, we are left with a messy function. Every recursive call of g requires wrapping zs into a Weird, while what we really wanted to do was to build a list with (fa x, fb y) and the modified xs. The problem with this solution is that g can (thanks to pattern matching) act directly on the list head but (due to its type signature) can't be called directly on the list tail. For that reason, it would be better to apply fa and fb without breaking down the list with pattern matching (as far as g is directly concerned, at least). But there was a way to directly modify a list element-by-element...

    g (Third z) = Third ( map (\(x, y) -> (fa x, fb y) ) z)

...our good old map function, which modifies all tuples in the list z using a lambda function. In fact, the first attempt at writing the definition looked just like an application of the list map except for the spurious Weird packing and unpacking. We got rid of these by having the pattern splitting of z done by map, which works directly with regular lists. You could find it useful to expand the map definition inside g to see the difference more clearly. Finally, you may prefer to write this new version in an alternative and clean way using list comprehension syntax:

    g (Third z) = Third [ (fa x, fb y) | (x,y) <- z ]

Adding the Third function, we only have the Fourth left to define:

weirdMap :: (a -> c) -> (b -> d) -> Weird a b -> Weird c d
weirdMap fa fb = g
  where
    g (First x)          = First (fa x)
    g (Second y)         = Second (fb y)
    g (Third z)          = Third ( map (\(x, y) -> (fa x, fb y) ) z)
    g (Fourth w)         = --More to follow

All we need to do is apply g recursively:

weirdMap :: (a -> c) -> (b -> d) -> Weird a b -> Weird c d
weirdMap fa fb = g
  where
    g (First x)          = First (fa x)
    g (Second y)         = Second (fb y)
    g (Third z)          = Third ( map (\(x, y) -> (fa x, fb y) ) z)
    g (Fourth w)         = Fourth (g w)

General Fold[edit | edit source]

While we were able to define a map by specifying as arguments a function for every separate type, this isn't enough for a fold. For a fold, we'll need a function for every constructor function. With lists, the constructors are [] and (:). The acc argument in the foldr function corresponds to the [] constructor. The f argument in the foldr function corresponds to the (:) constructor. The Weird datatype has four constructors, so we need four functions – one for handling the internal structure of the datatype specified by each constructor. Next, we have an argument of the Weird a b type, and finally we want the whole fold function to evaluate to a value of some other, arbitrary, type. Additionally, each individual function we pass to weirdFold must evaluate to the same type weirdFold does. That allows us to make a mock type signature and sketch the definition:

weirdFold :: (something1 -> c) -> (something2 -> c) -> (something3 -> c) -> (something4 -> c) -> Weird a b -> c
weirdFold f1 f2 f3 f4 = g
  where
    g (First x)          = --Something of type c here
    g (Second y)         = --Something of type c here
    g (Third z)          = --Something of type c here
    g (Fourth w)         = --Something of type c here

Now, we need to figure out which types something1, something2, something3 and something4 correspond to. That is done by analyzing the constructors, since the functions must take as arguments the elements of the datatype (whose types are specified by the constructor type signature). Again, the types and definitions of the first two functions are easy enough. The third one isn't too difficult either because, for the purposes of folding the list of (a,b), tuples are no different from a simple type (unlike in the map example, the internal structure does not concern us now). The fourth constructor, however, is recursive, and we have to be careful. As with weirdMap, we also need to recursively call the g function. This brings us to the final definition:

weirdFold :: (a -> c) -> (b -> c) -> ([(a,b)] -> c) -> (c -> c) -> Weird a b -> c
weirdFold f1 f2 f3 f4 = g
  where
    g (First x)          = f1 x
    g (Second y)         = f2 y
    g (Third z)          = f3 z
    g (Fourth w)         = f4 (g w)

Note

If you were expecting very complex expressions in the weirdFold above and are surprised by the immediacy of the solution, it might be helpful to have a look on what the common foldr would look like if we wrote it in this style and didn't have the special square-bracket syntax of lists to distract us:

-- List a is [a], Cons is (:) and Nil is []
data List a = Cons a (List a) | Nil

listFoldr :: (a -> b -> b) -> (b) -> List a -> b
listFoldr fCons fNil = g
  where
    g (Cons x xs) = fCons x (g xs)
    g Nil         = fNil

Now it is easier to see the parallels. The extra complications are that Cons (that is, (:)) takes two arguments (and, for that reason, so does fCons) and is recursive, requiring a call to g. Also, fNil is of course not really a function, as it takes no arguments.


Folds on recursive datatypes[edit | edit source]

As far as folds are concerned, Weird was a fairly nice datatype to deal with. Just one recursive constructor, which isn't even nested inside other structures. What would happen if we added a truly complicated fifth constructor?

    Fifth [Weird a b] a (Weird a a, Maybe (Weird a b))

This is a valid and yet tricky question. In general, the following rules apply:

  • A function to be supplied to a fold has the same number of arguments as the corresponding constructor.
  • The type of the arguments of such a function match the types of the constructor arguments, except if the constructor is recursive (that is, takes an argument of its own type).
  • If a constructor is recursive, any recursive argument of the constructor will correspond to an argument of the type the fold evaluates to.[1]
  • If a constructor is recursive, the complete fold function should be (recursively) applied to the recursive constructor arguments.
  • If a recursive element appears inside another data structure, the appropriate map function for that data structure should be used to apply the fold function to it.

So f5 would have the type:

f5 :: [c] -> a -> (Weird a a, Maybe c) -> c

as the type of Fifth is:

Fifth :: [Weird a b] -> a -> (Weird a a, Maybe (Weird a b)) -> Weird a b

The definition of g for the Fifth constructor will be:

    g (Fifth list x (waa, mc)) = f5 (map g list) x (waa, maybeMap g mc)
      where
        maybeMap f Nothing = Nothing
        maybeMap f (Just w) = Just (f w)

Note that nothing strange happens with the Weird a a part. No g gets called. What's up? This is recursion, right? Well, not really. Weird a a and Weird a b are different types, so it isn't a real recursion. It isn't guaranteed that, for example, f2 will work with something of type 'a', where it expects a type 'b'. It can be true for some cases but is not reliable for every case.

Also look at the definition of maybeMap. Verify that it is indeed a map function as:

  • It preserves structure.
  • Only types are changed.

A nice sounding word[edit | edit source]

The folds we have defined here are examples of catamorphisms. A catamorphism is a general way to collapse a data structure into a single value. There is deep theory associated with catamorphisms and related recursion schemes; however, we won't go through any of it now, as our main goal here was exercising the mechanics of data structure manipulation in Haskell with believable examples.

Notes

  1. This sort of recursiveness, in which the function used for folding can take the result of another fold as an argument, is what confers the folds of data structures such as lists and trees their "accumulating" functionality.