A popular problem to write a program for is a Sudoku puzzle solver. Sudoku puzzles are relatively simple timewasters that many of us have solved on airplanes with paper and pencil (or pen, if one is brave). The basic concept of a Sudoku puzzle is a 9×9 grid broken into nine 3×3 subgrids. The goal is to find a marking of the grid such that the numbers 1 through 9 appear without repetition in each column, row, and 3×3 subgrid. One starts with an initial marking and fills in the empty squares with valid numbers until the grid is complete. For easy problems, the initial marking makes it pretty easy to fill out the puzzle, but for harder ones with a relatively sparse initial marking, the process can be challenging to do by hand.

A number of solver techniques can be used to attack the problem, such as classical tree search methods that explore the solution space and rule out markings that violate the problem constraints. In this example, we take a different approach and don’t actually write a solver. Instead, we take advantage of a very general type of solver known as a SAT solver, and show that by encoding a Sudoku puzzle as a SAT problem, the SAT solver can very rapidly generate a solution for us.

SAT solvers are very generic tools designed to solve boolean satisfiability problems. A satisfiability problem is one in which we start with a single large boolean formula and seek an assignment of truth values (true or false) to its variables such that the overall formula is true. If such an assignment can be found, we say that the formula is satisfiable, and the solver can emit a set of variable assignments that yield this satisfying state. If no such assignment exists, the solver will tell us that the formula is unsatisfiable (UNSAT). SAT solvers operate on boolean formulas in what is known as conjunctive normal form (CNF), where the formula is a conjunction (AND) of clauses, and each clause is a disjunction (OR) of variables or their negations. Any boolean expression can be converted into CNF by applying basic laws of propositional logic, such as De Morgan’s laws; that conversion is outside the scope of this document.
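To make the idea of a satisfying assignment concrete, here is a tiny brute-force check in Python. This is emphatically not how real SAT solvers work (they use much smarter search), and the clause representation is my own for illustration: a CNF formula is a list of clauses, each clause a list of signed integers, where k means "variable k is true" and -k means "variable k is false".

```python
from itertools import product

# Encodes (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3).
cnf = [[1, 2], [-1, 3], [-2, -3]]

def brute_force_sat(cnf, nvars):
    """Try every assignment; return one satisfying all clauses, or None (UNSAT)."""
    for bits in product([False, True], repeat=nvars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        # A clause is satisfied if any literal in it is true under the assignment.
        if all(any(assign[abs(l)] == (l > 0) for l in clause) for clause in cnf):
            return assign
    return None

model = brute_force_sat(cnf, 3)
```

Running this yields a satisfying assignment for the little formula above, while a contradictory formula like `[[1], [-1]]` yields None.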

For those who have Mathematica and would prefer to follow along in the notebook file from which this post was derived, it is available in my public github repo here.

In this post, we will use the SAT encoding for Sudoku as defined in this paper. To express Sudoku as a SAT problem we must encode the state of the grid, as well as the constraints that define a valid marking. In essence, we are going to encode something along the lines of “position I,J is marked 3, position M,N is marked 5, …” for the entire grid, as well as the constraints on the rows, columns, and 3×3 subgrids that determine whether or not a marking meets the rules of the game. Since a SAT problem requires us to encode everything in terms of booleans instead of integers, we can’t simply say that “I,J is 3”. The encoding we discuss here takes the approach of stating that for position I,J, there are 9 boolean values corresponding to which of the 9 potential values are actually in the grid cell. If I,J is marked 3, then we would say that the variable corresponding to that (call it I,J,3) is true, while all others (I,J,1 and I,J,2 and so on up to I,J,9) are false. So, the state of the grid is represented as a 9x9x9 set of boolean variables – every possible position, and every possible value.
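As an aside, if one were feeding a standalone SAT solver (say, via the DIMACS CNF format) rather than working inside Mathematica, the usual trick is to flatten the (row, column, value) triple into a single integer variable number. The helper names below are hypothetical, just to illustrate the numbering scheme:

```python
def var(x, y, v):
    """Map (row x, column y, value v), each in 1..9, to a unique
    variable number in 1..729, DIMACS-style."""
    return 81 * (x - 1) + 9 * (y - 1) + v

def unvar(n):
    """Inverse mapping: variable number back to (x, y, v)."""
    n -= 1
    return (n // 81 + 1, (n // 9) % 9 + 1, n % 9 + 1)
```

Every cell/value combination gets a distinct number, and the inverse lets us decode the solver's output back into grid markings.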

In Mathematica, we can create such a set of boolean variables by building a 9x9x9 table of uniquely named symbols.

s = Table[Table[Table[Unique[s], {z, 1, 9}], {y, 1, 9}], {x, 1, 9}];

The table s contains 9x9x9 uniquely named symbols that we can work with as our boolean variables. At this point, they don’t have any truth value associated with them – they just denote the labels for the variables, and the truth assignment will come later.

Before we proceed, we will create a couple of helper functions. One idiom we want to work with is conjunctions or disjunctions over a list of formulae. For example, if we want to say “s1 AND s2 AND s3 AND s4”, it would be convenient to say “AND[s1,s2,s3,s4]”. This can be achieved by folding the appropriate operator (And or Or) over a list, with the appropriate identity element as the initial value for the fold.

AndList[l_] := Fold[And, True, l];
OrList[l_] := Fold[Or, False, l];

Once we have this, we can start to set up the constraints that define the puzzle rules. The simplest constraint states that each cell has at least one value associated with it. So, cell x,y is either 1 or 2 or 3 or … or 9. That corresponds to a disjunction of the variables x,y,1 and x,y,2, and so on. Since this then applies for all possible instances of x,y, each set of disjunctions is joined with a conjunction. This is easily implemented with our helpers.

f1 = AndList[Table[
    AndList[Table[
      OrList[Table[s[[x, y, i]], {i, 1, 9}]],
      {x, 1, 9}]],
    {y, 1, 9}]];

Now, the rules also state that each value occurs at most once in each row and each column, which we capture in formulas f2 and f3.

f2 = AndList[Table[
    AndList[Table[
      AndList[Table[
        AndList[Table[
          Not[s[[x, y, z]]] || Not[s[[i, y, z]]],
          {i, x + 1, 9}]],
        {x, 1, 8}]],
      {z, 1, 9}]],
    {y, 1, 9}]];

f3 = AndList[Table[
    AndList[Table[
      AndList[Table[
        AndList[Table[
          Not[s[[x, y, z]]] || Not[s[[x, i, z]]],
          {i, y + 1, 9}]],
        {y, 1, 8}]],
      {z, 1, 9}]],
    {x, 1, 9}]];

Formulas f4 and f5 capture the final constraint, which states that each number appears at most once in each 3×3 subgrid. These are more complex since we need to repeat the constraints for each subgrid: the inner portion of the constraint deals with a single 3×3 subgrid, and the conjunctions outside iterate those inner constraints over the full set of 3×3 grids. While this looks potentially complex due to the deep nesting of constraints, writing the indices out by hand makes it pretty clear.

f4 = AndList[Table[
    AndList[Table[
      AndList[Table[
        AndList[Table[
          AndList[Table[
            AndList[Table[
              Not[s[[3 i + x, 3 j + y, z]]] || Not[s[[3 i + x, 3 j + k, z]]],
              {k, y + 1, 3}]],
            {y, 1, 3}]],
          {x, 1, 3}]],
        {j, 0, 2}]],
      {i, 0, 2}]],
    {z, 1, 9}]];

f5 = AndList[Table[
    AndList[Table[
      AndList[Table[
        AndList[Table[
          AndList[Table[
            AndList[Table[
              AndList[Table[
                Not[s[[3 i + x, 3 j + y, z]]] || Not[s[[3 i + k, 3 j + l, z]]],
                {l, 1, 3}]],
              {k, x + 1, 3}]],
            {y, 1, 3}]],
          {x, 1, 3}]],
        {j, 0, 2}]],
      {i, 0, 2}]],
    {z, 1, 9}]];

Once we have done this, we can look at the size of the constraint set.

Length[Flatten[f1 && f2 && f3 && f4 && f5]]

8829
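This number can be sanity-checked with a little counting. Each of f2 through f5 is built from binary "not both" clauses over pairs of cells; the sketch below (in Python, purely for the arithmetic) tallies them the same way the formulas above generate them:

```python
from itertools import combinations

def count_sudoku_clauses():
    """Reproduce the clause count of the f1..f5 encoding by enumeration."""
    pairs = len(list(combinations(range(9), 2)))  # 36 pairs among 9 positions
    n = 9 * 9              # f1: one at-least-one clause per cell
    n += 9 * 9 * pairs     # f2: each value at most once per column (9 values x 9 columns)
    n += 9 * 9 * pairs     # f3: each value at most once per row
    n += 9 * 9 * pairs     # f4 + f5 together: each value at most once per 3x3 block
    return n
```

The total comes out to 81 + 3 × 2916 = 8829, matching the Length computation above.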

As we expect from the paper that describes this encoding, we have a formula with 8829 clauses. At this point we can move on to encoding the initial marking of the puzzle as additional constraints. Once we do that, we can let the SAT solver loose to find an assignment of truth values such that all constraints (rules and initial marking) hold, which means the puzzle is solved. I decided to encode the initial markings as an association that maps coordinates to markings. So, the association of {3,5} to 4 means that at position {3,5} in the grid a 4 has been written in the box. Here is the marking for a simple Sudoku puzzle.

initial = <|
   {1, 2} -> 6, {1, 3} -> 1, {1, 4} -> 8, {1, 9} -> 7,
   {2, 2} -> 8, {2, 3} -> 9, {2, 4} -> 2, {2, 6} -> 5, {2, 8} -> 4,
   {3, 5} -> 4, {3, 7} -> 9, {3, 9} -> 3,
   {4, 1} -> 2, {4, 4} -> 1, {4, 5} -> 6, {4, 7} -> 3,
   {5, 1} -> 6, {5, 2} -> 7, {5, 8} -> 5, {5, 9} -> 1,
   {6, 3} -> 4, {6, 5} -> 2, {6, 6} -> 3, {6, 9} -> 8,
   {7, 1} -> 7, {7, 3} -> 5, {7, 5} -> 9,
   {8, 2} -> 9, {8, 4} -> 4, {8, 6} -> 2, {8, 7} -> 7, {8, 8} -> 3,
   {9, 1} -> 1, {9, 6} -> 8, {9, 7} -> 4, {9, 8} -> 6|>;

ShowMarking[initial]

Once we have this, we can convert it to a boolean formula. This entails enumerating the symbols s[x,y,v] that are true for the marked positions and values, as a list of variables connected by conjunctions.

fInit = AndList[ Map[s[[#[[1]], #[[2]], initial[#]]] &, Keys[initial]]];

At this point, we’re done! Solving the problem is a matter of asking whether or not a satisfying instance exists for the constraints AND the initial marking. So, we conjoin all of the constraints and the initial marking, pass along the full list of symbols that appear in the formula, and let the solver go.

soln = SatisfiabilityInstances[f1 && f2 && f3 && f4 && f5 && fInit, Flatten[s]];

What comes back is a list of boolean assignments for all of the symbols that were passed in, which in this case is the flattened 9x9x9 array of symbols corresponding to all possible markings. We need to reshape the flattened list of 9*9*9 values back into a 9x9x9 cube.

r = ArrayReshape[soln, {9, 9, 9}];

Once we have done this, we can visualize the answer. The core of this is the Position function, which is applied to the 9-element list for each cell that corresponds to the marking of the cell. Specifically, the entry in that list that is true indicates which value to write in that cell of the puzzle. So, if x,y,4 is true, that says that the cell x,y has a 4 in it.

Table[Table[Flatten[Position[r[[i, j]], True]][[1]], {j, 1, 9}], {i, 1, 9}]

{{4, 6, 1, 8, 3, 9, 5, 2, 7}, {3, 8, 9, 2, 7, 5, 1, 4, 6}, {5, 2, 7, 6, 4, 1, 9, 8, 3}, {2, 5, 8, 1, 6, 7, 3, 9, 4}, {6, 7, 3, 9, 8, 4, 2, 5, 1}, {9, 1, 4, 5, 2, 3, 6, 7, 8}, {7, 4, 5, 3, 9, 6, 8, 1, 2}, {8, 9, 6, 4, 1, 2, 7, 3, 5}, {1, 3, 2, 7, 5, 8, 4, 6, 9}}

If we want to look at the puzzle in its usual grid form, this is pretty easy in Mathematica.

ShowSolution[soln]

Once we have this methodology defined, it becomes easy to solve harder problems. Here is an initial marking that is much harder to work from by hand since it is much sparser.

initialHard = <|
   {1, 6} -> 6, {1, 8} -> 5,
   {2, 1} -> 2, {2, 3} -> 7, {2, 5} -> 8,
   {3, 3} -> 4,
   {4, 2} -> 6, {4, 6} -> 5,
   {5, 3} -> 8, {5, 5} -> 4, {5, 7} -> 1,
   {6, 4} -> 3, {6, 8} -> 9,
   {7, 7} -> 7,
   {8, 5} -> 1, {8, 7} -> 8, {8, 9} -> 4,
   {9, 2} -> 3, {9, 4} -> 2|>;

ShowMarking[initialHard]

fInit = AndList[ Map[s[[#[[1]], #[[2]], initialHard[#]]] &, Keys[initialHard]]];

soln = SatisfiabilityInstances[f1 && f2 && f3 && f4 && f5 && fInit, Flatten[s]];

The SAT solver has no trouble with it, and yields the following solution.

ShowSolution[soln]

Hopefully this shows some of the power of SAT solvers for solving general problems. Many problems can be written in terms of boolean satisfiability, and as soon as we do that we can take advantage of powerful SAT solvers. Beyond the one included in Mathematica, there are other popular SAT solvers such as MiniSAT and those embedded in more powerful tools known as SMT (Satisfiability Modulo Theories) solvers like Z3, Yices, and so on.

For those who are curious, here are the helper functions used to show the solutions as well as the initial markings.

ShowSolution[soln_] :=
  Module[{rarray, r},
    r = ArrayReshape[soln, {9, 9, 9}];
    rarray = Table[Table[Flatten[Position[r[[i, j]], True]][[1]], {j, 1, 9}], {i, 1, 9}];
    Graphics[
      Table[
        {EdgeForm[Thin],
         If[EvenQ[Floor[(j - 1)/3] + Floor[(i - 1)/3]*3], Lighter[Gray, 0.5], White],
         Rectangle[{i, j}, {i + 1, j + 1}],
         Black,
         Text[Style[rarray[[i, 10 - j]], Large], {i + 0.5, j + 0.5}]},
        {i, 1, 9}, {j, 1, 9}]]]

ShowMarking[marking_] :=
  Module[{},
    Graphics[
      Table[
        {EdgeForm[Thin],
         If[EvenQ[Floor[(j - 1)/3] + Floor[(i - 1)/3]*3], Lighter[Gray, 0.5], White],
         Rectangle[{i, j}, {i + 1, j + 1}],
         Black,
         If[KeyExistsQ[marking, {i, 10 - j}],
           Text[Style[marking[{i, 10 - j}], Large], {i + 0.5, j + 0.5}]]},
        {i, 1, 9}, {j, 1, 9}]]]


One of the big attractions of functional programming for me is the clear conceptual mapping from abstract algorithm to code that I find comes with the functional approach. So, today I’ll share one of my favorites, which came from some work I did on a project in program analysis. For the curious, we applied this algorithm (albeit a Haskell version) in a paper published at a workshop a couple of years ago on analyzing source control repositories. There is more to the code for that paper than is summarized here, but the tree edit distance algorithm described below is at the core of it all.

One of the classical problems in computer science is the edit distance problem – given two strings, what is the minimal sequence of edits (insertions, deletions, or substitutions) that must be performed to one of them in order to yield the other? A common approach to solving this problem efficiently is to use an algorithm based on dynamic programming. For those who have forgotten that part of algorithms class, dynamic programming is often used in problems where greedy approaches fail to find the optimal answer. Basically, these are problems where a locally sub-optimal decision is necessary to find a globally optimal answer.
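To ground the discussion, here is the classic dynamic programming solution to the string edit distance problem, sketched in Python for illustration (the rest of this post works in F#):

```python
def edit_distance(a, b):
    """Levenshtein distance via a dynamic programming table.
    d[i][j] holds the cost of turning a[:i] into b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or free match)
    return d[len(a)][len(b)]
```

For example, `edit_distance("kitten", "sitting")` is 3: two substitutions and one insertion. Note the 0th row and column acting as the "gutter" base cases, a pattern that reappears in the tree algorithm below.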

Not all structures of interest are equivalent to one-dimensional strings, so traditional string edit distance algorithms aren’t much use for comparing them. An example of such a structure is a tree, whether a plain old binary tree or a more general rose tree. Fortunately, the general algorithmic approach of dynamic programming can be used to compare trees too! One such algorithm was published in 1991 by Wuu Yang in an article entitled “Identifying syntactic differences between two programs”. In this post we’ll build a function using this algorithm that takes two trees and a label comparison function, and yields a pair of “edit trees” corresponding to the sequence of operations necessary to turn one of the inputs into the other (and vice versa).

Yang’s algorithm is essentially a recursive application of a traditional sequence edit distance algorithm applied at each level down the tree. The set of allowed edits is either keeping or deleting each subtree – unlike other edit distance algorithms, there is no substitution operation used. Edits are considered at each level of the tree, where the root is the first level, the children of the root are the second level, and so on. The algorithm seeks to find the best matching of the sequence of children directly below the subtree root, and recursively applies this to each of these subtrees. To illustrate a simple example of this, consider the trees below:

In this case, the color of the subtrees indicates equivalence – the red ones are the same, but the orange and green ones are unique to each tree. Similarly, the purple and yellow nodes appear in each tree. The edit sequence therefore corresponds to keeping the root level vertices (the purple one), and then at the second level, keeping the yellow vertex, discarding the orange subtree from the first tree and the green subtree from the second tree, and keeping the red subtree in both. The nodes at each level are considered to be ordered, so if we were to swap the orange node and red subtree in one of the trees and not the other, the result would be to discard everything below the purple vertex since no pairing in the order given exists. Whether or not this would matter to an application depends on the semantics of the tree structure. Here we are only concerned with trees where the order of children of each node in the tree matters.

Dynamic programming involves the construction of a table of intermediate costs associated with a sequence of actions (e.g., deletion, insertion, modification) applied to two input sequences. The step after constructing this table is to trace a path through it from one corner to the other that maximizes (or minimizes) the cumulative cost. If we are comparing two rooted subtrees, then we can associate the rows of the table with the children of the first subtree root, and the columns with the children of the second subtree root.

The path through the table starting from the lower right corner is composed of one of three actions: moving diagonally from the (x,y) coordinate to the (x-1,y-1) coordinate, moving left to (x-1,y), or moving up to (x,y-1). These moves correspond to maintaining both the xth child of the first tree root and the yth child of the second tree root, deleting the yth child of the second tree root (making no choice about the xth child of the first tree root yet), or deleting the xth child of the first tree root (and making no choice about the yth child of the second tree root yet).

So, how do we get started building this out in F#? For those who want to follow along with the full code available, it can be found on github. Our goal will be to build a function that implements Yang’s algorithm with the following type:

val yang: Tree<'a> -> Tree<'a> -> ('a -> 'a -> bool) -> int * (EditTree<'a> * EditTree<'a>)

This function takes two trees, a function to compare the contents of the trees, and yields a pair – the quantitative edit distance score followed by a pair corresponding to the edit sequence to apply to the first tree and the edit sequence to apply to the second.

First, we can start encoding the decision making process for traversing the table that will eventually be built. I typically start by defining types of use in modeling the algorithm.

type Direction = Left | Up | Diag

type EditOp = Keep | Delete

These can be related via a function that converts a direction to the corresponding actions to the first tree and the second tree. An option type is used to indicate whether we definitively do something (keep or delete) or make no choice yet.

let dirToOp = function
    | Up   -> (Some Delete, None)
    | Left -> (None, Some Delete)
    | Diag -> (Some Keep, Some Keep)

This corresponds pretty clearly to the text above – diagonal move, keep the element from both. A horizontal or vertical move corresponds to deleting from one tree, but deferring the decision about the element in the other tree until later along the path.

We need some trees to work with, so we can define a simple rose tree structure where each node of the tree is represented as a generic label type with a list of children. In F# speak, this type is an instance of a record.

type Tree<'a> = { label : 'a; children : Tree<'a> list }

Now, what do we need to encode in the table itself? At each cell, we need to know what the cost is so far, the direction to go from that cell, and something about the subtrees x and y under consideration at that point. As such, the table will be composed of cells containing a tuple of three elements – an integer, a direction, and a pair of indicators for how to interpret the xth and yth subtrees. In a common pattern in dynamic programming algorithms, we will have a special 0th row and 0th column of the table that acts as a sort of “gutter” that essentially encodes how to deal with the case where we’ve run out of contents for one of the sequences but not the other and want the algorithm to fall straight towards the (0,0) coordinate. These are indicated in the picture above as the cells with grey shading.

Let’s consider two cases: we are either dealing with a pair of subtrees to compare inside the table (x,y) where x>0 and y>0, or we are dealing with a location in one of the gutters (x,0) or (0,y). How do we interpret those? If we are inside the table, the subtree represented at that point has a root (of type ‘a) and a list of children, each of which is a tree containing labels of type ‘a. As we recurse down the tree, we determine whether or not these children would be deleted or kept. So, each cell of the table contains the root and a list of children where each child is a pair : an edit operation, and the corresponding tree. On the other hand, in the gutters we know that one of the sequences has been exhausted and the other isn’t going to be interpreted any further. The exhausted tree can be represented as a terminal element, and the tree that isn’t going to be interpreted any further can be treated as a leaf of the corresponding tree of edits where the entire subtree hangs off of the leaf.

This can all be encoded as the following type where an ENode corresponds to a rooted subtree with edits amongst its children, an ELeaf corresponds to a subtree below which no edits will take place, and ENil represents the terminal element for an exhausted tree:

type EditTree<'a> =
    | ENode of 'a * ((EditOp * EditTree<'a>) list)
    | ELeaf of Tree<'a>
    | ENil

To start creating the table, we first want to map our inputs, which are two trees (ta and tb) each with a list of subtrees, into arrays of subtrees (call them ak and bk for “a kids” and “b kids”). This is because we want constant time lookup of any element, and an array is a natural way to get that. We’ll also create variables to hold their lengths to make the rest of the code a bit cleaner to read.

let ak = Array.ofList (ta.children)
let bk = Array.ofList (tb.children)
let lena = Array.length ak
let lenb = Array.length bk

Now, we want to initialize the table to a reasonable default value, so we choose the tuple (0, Diag, (ENil, ENil)). All possible traces through this default basically result in empty trees with a zero score.

let ytable = Array2D.create (lena+1) (lenb+1) (0, Diag, (ENil, ENil))

As we work our way through the table, at any index (i,j) we are going to want to look at the table entries at (i-1,j-1), (i,j-1), and (i-1,j). In addition, depending on what part of the algorithm we are in, we may want to look at the score, the direction, or the pair of edit trees at that cell. Through the use of custom infix operators, we can avoid having our code peppered with a mixture of indexing logic and tuple unpacking by encapsulating it in functions.

Now, this is one of those parts of the code I thought hard about because custom infix operators are a mixed choice – on the one hand, they can make code very compact (which is my goal here), but on the other hand, they can be absolutely obtuse to read for someone who didn’t write the code in the first place. My rule of thumb is that I will use them in contexts within a function or module in a way where they cannot escape outside that context. In this case, when you look at the full function that we are defining, you’ll see that they are only visible within that single function. To me, that’s acceptable since a third party won’t have to look far for their definition.

In any case, let’s define three infix operators:

let (@!@) i j = let (a,_,_) = ytable.[i,j] in a
let (@+@) i j = let (_,b,_) = ytable.[i,j] in b
let (@%@) i j = let (_,_,c) = ytable.[i,j] in c

Each of these looks up the (i,j)’th cell and returns one element of the tuple. We’ll see why this is useful very soon. We can finally define the elements of the dynamic programming table:

for i = 0 to lena do
    for j = 0 to lenb do
        match (i,j) with
        | (0,0) -> ytable.[i,j] <- (0, Diag, (ENil, ENil))
        | (0,_) -> ytable.[i,j] <- (0, Left, (ENil, ELeaf (bk.[j-1])))
        | (_,0) -> ytable.[i,j] <- (0, Up, (ELeaf (ak.[i-1]), ENil))
        | (_,_) ->
            let (ijscore, (ijl, ijr)) = yang (ak.[i-1]) (bk.[j-1]) lblcmp
            let a = ((i-1 @!@ j-1) + ijscore, Diag, (ijl, ijr))
            let b = ((i-1 @!@ j), Up, (ijl, ijr))
            let c = ((i @!@ j-1), Left, (ijl, ijr))
            ytable.[i,j] <- maxByFirst [a;b;c]

That’s it. The rest of the function is a bit simpler. Once we have the table defined, a dynamic programming algorithm will trace the optimal path through the table using the decisions that were made at each cell.

When processing cell (x,y), we want to interpret the move to make (up, left, or diagonal), the edit trees that were stored in that cell for the xth and yth subtrees, and the coordinates of the cell along the trace path. We again make use of the infix operators to extract the second and third elements of the triplet at cell (x,y), and use pattern matching on the move direction to add an element to the traceback list and recurse. Due to the construction of the table (with the gutters), the base case is (0,0), which corresponds to the empty list.

let rec traceback = function
    | (0,0) -> []
    | (x,y) ->
        let move = x @+@ y
        let (l,r) = x @%@ y
        match move with
        | Up   -> ((x,y), Up,   (l,r)) :: (traceback ((x-1), y))
        | Left -> ((x,y), Left, (l,r)) :: (traceback (x, (y-1)))
        | Diag -> ((x,y), Diag, (l,r)) :: (traceback ((x-1), (y-1)))

The final step requires us to think about what we want this function to return. Given two trees, we want to return the score at that point as well as the list of edit trees corresponding to the children of the root of each tree. If we look at the list that is returned on the traceback, there are a few important properties to note:

- It’s backwards, since the list is built from the lower right corner to the upper left.
- The list of operations along the traceback is defined in terms of what we do to both sets of subtrees.
- When we map a direction to a pair of edit operations, the option type returned by dirToOp implies that at some points along the trace path we have no information about what to do to one of the trees, so we want to skip it.
- We want to pair up the edit operations with the corresponding edit subtree.

So, we want code that performs the traceback from the lower right corner, reverses the sequence, maps the direction at each step of the trace to a pair of edit operations that are associated with the pair of edit trees at that point in the tree, and then peels this list of pairs apart into a pair of lists.

let (tba, tbb) =
    traceback (lena, lenb)
    |> List.rev
    |> List.map (fun (_,d,(l,r)) ->
        let ops = dirToOp d
        ((fst ops, l), (snd ops, r)))
    |> List.unzip

This particular step may seem simple, but it represents a major advantage of languages like F# that provide an option type. In C# or other languages, we might simulate the option type with NULL values or something similar. Unfortunately, in those languages there is no way to distinguish, at the type level, a list that contains NULLs from one that does not. In F#, on the other hand, the presence or absence of these is explicit via the option type. By mapping from the list with optional values to a list without them, we can guarantee that the latter has no None values. In languages where we deal only with object references, the type system can’t help us separate code that tolerates NULL values from code that does not. This is a very compelling feature of languages with option types, in my opinion (or, for the Haskellers out there, the Maybe type).

Looking at the function above, we can see one of my favorite parts of F# and related languages – the code is not hugely different than the description of what it is supposed to do.

Finally, we can spin through each of these lists to purge all of the steps in which the direction yielded a None operation, and at the same time unpack the Some instances to get rid of the option type.

let removeNones =
    List.choose (fun (x,y) ->
        match x with
        | Some s -> Some (s,y)
        | None -> None)

let aekids = tba |> removeNones
let bekids = tbb |> removeNones
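As an aside, the same "map away the optionals" step can be sketched in Python, where None plays the role of F#'s None but the type system offers no guarantee; the filtering is purely a runtime convention (the names here are mine, not from the F# code):

```python
def remove_nones(pairs):
    """Keep only the (op, tree) pairs whose op is not None.
    No unwrapping is needed since Python has no Option type to peel off."""
    return [(op, tree) for (op, tree) in pairs if op is not None]

steps = [("Keep", "yellow"), (None, "orange"), ("Delete", "green")]
kept = remove_nones(steps)   # [("Keep", "yellow"), ("Delete", "green")]
```

Nothing stops a caller from handing the filtered list right back to code that assumes None can still appear, which is exactly the guarantee the option type provides for free.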

Now we can finally define the score and bundle the two lists of edit trees with the root of each tree, returning the score and the pair of edit trees. The score is 0 if the roots don’t match; otherwise it is one plus the score at the lower right corner of the dynamic programming table. Given that the tree type is polymorphic in the contents of the tree, we use a comparison function that is passed in. This could also be accomplished by constraining the tree’s type parameter to be comparable – I chose not to do that, and don’t remember why (I wrote this code a few years ago).

let score = if lblcmp ta.label tb.label then 1 + (lena @!@ lenb) else 0

The final bundling of return trees therefore is:

let (reta, retb) =
    if (score = 0) then (ELeaf ta, ELeaf tb)
    else (ENode (ta.label, aekids), ENode (tb.label, bekids))
(score, (reta, retb))

To see the code in action, one can play with the script in the repository that illustrates the sequence of edit actions described for the purple-rooted tree above. We start off with two trees, purp1 and purp2. *[Edit 1/3/16: As a commenter somewhere pointed out, the subtrees for orange and green here have empty lists of children yet are illustrated as subtrees in the picture above. This makes no difference – because the root of each of these subtrees is where the mismatch occurs, what lies beneath them doesn’t change the result of applying the algorithm. I made a conscious choice to make them empty subtrees to avoid bloating up the code below. If one needs convincing, go ahead and check the code out and give it a try with non-empty subtrees for orange and green!]*

val purp1 : Tree<string> =
  {label = "purple";
   children = [{label = "orange"; children = [];};
               {label = "yellow"; children = [];};
               {label = "red"; children = [{label = "rchild"; children = [];}];}];}

val purp2 : Tree<string> =
  {label = "purple";
   children = [{label = "yellow"; children = [];};
               {label = "green"; children = [];};
               {label = "red"; children = [{label = "rchild"; children = [];}];}];}

When we invoke the tree differencing code as:

let (p1,p2) = treediff purp1 purp2 lblcmp

The result is the edit trees for each side. The first one looks like this:

val p1 : EditTree<string> =
  ENode ("purple",
         [(Delete, ELeaf {label = "orange"; children = [];});
          (Keep, ENode ("yellow", []));
          (Keep, ENode ("red", [(Keep, ENode ("rchild", []))]))])

How do we read this? Given that the roots match, we are handed back an edit tree that is rooted with a purple node that has three child edit trees: a child that states that the orange subtree is to be deleted, the yellow subtree is to be kept, and the red subtree is to be kept. Similarly, the other edit tree looks like:

val p2 : EditTree<string> =
  ENode ("purple",
         [(Keep, ENode ("yellow", []));
          (Delete, ELeaf {label = "green"; children = [];});
          (Keep, ENode ("red", [(Keep, ENode ("rchild", []))]))])

In both cases, the result of applying the edits to the corresponding input tree would be a purple rooted tree with two children: a yellow node followed by a red subtree.
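To make "applying the edits" concrete, here is a hypothetical Python sketch that replays Keep/Delete decisions from an edit tree to produce the surviving tree. The tuple representation and lowercase op strings are my own, loosely mirroring ENode/ELeaf, not anything from the F# code:

```python
def apply_edits(etree):
    """Replay an edit tree: recursively keep 'keep' children, drop 'delete' ones.
    An edit node is ("node", label, [(op, child), ...]); a subtree below which
    no edits occur is ("leaf", tree). A plain tree is (label, children)."""
    if etree[0] == "leaf":
        return etree[1]
    _, label, kids = etree
    kept = [apply_edits(child) for (op, child) in kids if op == "keep"]
    return (label, kept)

# The edit tree for the first purple-rooted example above, transliterated.
p1 = ("node", "purple",
      [("delete", ("leaf", ("orange", []))),
       ("keep",   ("node", "yellow", [])),
       ("keep",   ("node", "red", [("keep", ("node", "rchild", []))]))])
```

Applying it yields the purple-rooted tree with just the yellow node and the red subtree, as described in the text.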

The code associated with this post can be found on github. The code that is related to the paper I mention at the beginning can also be found on github and sourceforge (yes, I know – sourceforge. Don’t ask why…), which includes the edit distance code as well as algorithms that consume the result of the edit distance algorithm to identify specific changes made in trees. Eventually I would like to port those algorithms over to F# since they are pretty useful.


On hack day at the conference, I joined the group working on ZooniBot – a program developed the previous year to aid users of the Zooniverse forums by automatically answering questions without moderators needing to jump in and respond manually. Early in the day, I asked people in the room what would be useful things for ZooniBot to be able to do that it couldn’t already. While I didn’t spend much time on it that day, one of the problems that I really liked, and that has stuck with me since, was suggested by Brooke Simmons; it addresses a common question that comes up on the GalaxyZoo site. Consider the following image:

To the eye of a Zooniverse user who is new to the Galaxy Zoo, it is quite likely that this would stand out as unusual. What is the streak? These common artifacts are caused by satellites zipping along in the field of view while the image was being captured. A satellite appears as a bright streak that flashes across the image, and due to its speed, frequently traverses the frame while one color filter is engaged – which is why the bright streaks tend to look green. Of course, a few lucky satellites wandered past while other filters were engaged.

In some cases, two artifacts crossed the field of view resulting in trails of different colors.

How do we build an automatic system that can inform curious GalaxyZoo users that these images contain satellite trails? Given that the image data set is huge and unannotated with metadata about what kinds of artifacts the images contain, such a system must look at each image and perform some kind of image analysis to guess whether or not it contains an artifact like a trail. At first blush, this seems relatively straightforward.

For this specific kind of artifact, we can make a couple of observations:

- The artifacts appear as straight lines.
- The artifacts commonly appear in one color channel.

The natural building block for an automated detector is the Hough transform. The basic concept behind the Hough transform is that we can take an image and compute its corresponding Hough image. In the case of the two lines above, the corresponding Hough image is:

In the source image, we have a set of pixel values that have an (x,y) location as well as an intensity value in each of the color channels (r,g,b). Before applying the Hough transform, we map each pixel to a binary value indicating whether or not the pixel is bright enough to be considered on or off. This is achieved by applying some form of image segmentation that maps the RGB image to a binary image. In this example, I used Otsu’s method for computing the best threshold for each image. Once the binary image is available, the Hough transform looks at every line that goes through the image by varying the angle of the line over [0,pi], and for every offset from the upper left corner of the image to the upper right corner. The result is the Hough image we see above, where the X axis corresponds to the angle, and the Y axis corresponds to the line offset. The intensity for each angle/offset combination is the number of pixels that were set to 1 in the binary image along the corresponding line. As we can see, there are two bright spots. Looking more closely at the lines that correspond to those bright spots (known as “Hough peaks”), we see that they match relatively well to the two lines that we see in the blue and green channels.
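To make the voting scheme concrete, here is a minimal NumPy sketch of the transform described above. This is a toy implementation for illustration only (the function name and parameterization are invented here); real code would use an optimized routine such as the one in an image processing library.

```python
import numpy as np

def hough_transform(binary, n_angles=180):
    """Naive Hough transform: accumulate votes for (offset, angle) pairs.
    Rows of the accumulator index the signed distance (offset) of a line
    from the image origin; columns index the line angle in [0, pi)."""
    h, w = binary.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    acc = np.zeros((2 * diag + 1, n_angles), dtype=np.int64)
    ys, xs = np.nonzero(binary)  # "on" pixels from the segmented image
    for theta_idx, theta in enumerate(thetas):
        # signed distance from the origin to the line through (x, y) at this angle
        rhos = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int)
        np.add.at(acc[:, theta_idx], rhos + diag, 1)
    return acc, thetas, np.arange(-diag, diag + 1)
```

A bright peak in `acc` then corresponds to a line in the binary image, exactly as in the Hough image shown above.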

Once we’ve applied the Hough transform to identify the lines in an image, we can then extract out the pixel values in the original image along the detected lines. For example, from the image above where we had two lines (one blue, one green), we can see the intensity in each of the three color channels in the plots below.

Once we have extracted lines, we are left with a set of candidate lines that we’d like to test to see if they meet the criteria for being a trail. The important thing is to filter out line-like features that aren’t actually trails, like galaxies that are viewed edge-on. For example, the following image will yield a strong line-like feature along the center of the galaxy.

A simple filter is achieved by taking all pixels that lie upon each detected line and computing some basic statistics – the mean and standard deviation of the intensity along the line. If our goal is to find things that are dominant in one color channel, then we can ask the following: is the mean value for one color channel significantly higher than the other two channels? The heuristic chosen in the filter currently tests whether the channel with the highest mean is at least one standard deviation (computed from its own intensities) above the means of the other two channels.
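The heuristic can be sketched in a few lines of NumPy (an illustrative sketch, not the project’s code; the function name and return convention are invented here):

```python
import numpy as np

def dominant_channel(line_pixels):
    """line_pixels: (n, 3) array of RGB samples taken along a detected line.
    Returns the index of the dominant channel (0=R, 1=G, 2=B), or None if no
    channel's mean exceeds both other means by at least one standard
    deviation (computed on the candidate channel's own intensities)."""
    means = line_pixels.mean(axis=0)
    best = int(np.argmax(means))
    margin = line_pixels[:, best].std()
    others = np.delete(means, best)
    if np.all(means[best] >= others + margin):
        return best
    return None
```

A green satellite trail should report channel 1 as dominant, while a roughly gray feature such as an edge-on galaxy should report no dominant channel.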

Unfortunately, the data set has curveballs to throw at this test. For example, sometimes trails are short and don’t span the full image.

These trails are harder to detect with a basic heuristic on the mean and standard deviation of intensities along the line, since the portion of the detected line that covers the region of the image where the trail is missing drags the mean down and pushes the standard deviation up. Even worse, there are images that for whatever reason have saturation in a single channel all over the image, meaning that any line that is detected ends up passing the heuristic test.

Clearly something a bit more discerning than heuristics based on basic summary statistics is necessary. This work is ongoing, and will hopefully eventually lead to something of value to the GZ talk community. In the meantime, I’ve put the code related to this post up on github for folks curious about it. If this topic is of interest to you, feel free to drop me an e-mail to see what the status is and what we’re currently up to with it. I’m eager to see what new problems like this .dotastro 6 will send my way.

]]>

Given just the image and its object identifier (1237645941297709234), we might be curious to learn a few things about it:

- Where in the sky did it come from?
- What kind of object is it?

Answering these questions requires a bit of work. First, we need to query the SDSS database to retrieve the (ra, dec) coordinates of the object. Once we have this, it is possible to then go to another database like SIMBAD to learn if it is a known astronomical object, and if so, what kind of object it is.

Both the SDSS and SIMBAD databases are accessible via HTTP queries, making programmatic access easy. In this post I’m using their basic interfaces. SDSS offers a number of access methods, so there is likely to be a cleaner one than I’m using here – I’m ignoring that for the moment. SIMBAD on the other hand presents a relatively simple interface that seems to predate modern web services, so instead of well structured JSON or some other format, dealing with its response is an exercise in string parsing.

To start off, I defined a few types that are used to represent responses from SIMBAD.

```fsharp
type SimbadObject =
    | S_Galaxy
    | S_PlanetaryNebula
    | S_HerbigHaro
    | S_Star
    | S_RadioGalaxy
    | S_GalaxyInGroup
    | S_GalaxyInCluster
    | S_Unknown of string

type SimbadResponse =
    | SimbadValid of SimbadObject
    | SimbadError of string
    | SimbadEmpty
```

Interpreting the SIMBAD object type responses was a simple exercise of matching strings to the corresponding constructors from the SimbadObject discriminated union.

```fsharp
let interpret_simbad_objstring s =
    match s with
    | "PN" -> S_PlanetaryNebula
    | "HH" -> S_HerbigHaro
    | "Star" -> S_Star
    | "RadioG" -> S_RadioGalaxy
    | "Galaxy" -> S_Galaxy
    | "GinGroup" -> S_GalaxyInGroup
    | "GinCl" -> S_GalaxyInCluster
    | _ -> S_Unknown s
```

Before moving on, I needed a few helper functions. The first two exist to safely probe lists to extract either the first or second element. By “safely”, I mean that in the case that a first or second element doesn’t exist, a reasonable value is returned. Specifically, I make use of the usual option type (Maybe for the Haskell crowd).

```fsharp
let getfirst l =
    match l with
    | [] -> None
    | (x::xs) -> Some x

let getsecond l =
    match l with
    | [] -> None
    | (x::xs) -> getfirst xs
```

I could roll these into a single function, “get_nth”, but as I said, this was a quick exercise and these functions play a minor role in things so I didn’t care much about it. Another utility function that is required is one to take a single string that contains multiple lines and turn it into a list of lines, excluding all empty lines. This function should also be agnostic about line terminators: CR, LF, CR-LF all should work.

```fsharp
let multiline_string_to_lines (s:string) =
    s.Split([|'\r'; '\n'|])
    |> Array.filter (fun s -> s.Length > 0)
    |> Array.toList
```

With these helpers, we can finally write the code to query SIMBAD. This code assumes that the FSharp.Data package is available (this is accessible via Nuget in VS and under mono on Mac/Linux). Given a coordinate (ra,dec), we can define the following function:

```fsharp
let simple_simbad_query (ra:float) (dec:float) =
    let baseurl = "http://simbad.u-strasbg.fr/simbad/sim-script"
    let script = "format object \"%OTYPE(S)\"\nquery coo "+string(ra)+" "+string(dec)

    let rec find_data (lines:string list) =
        match lines with
        | [] -> SimbadEmpty
        | (l::ls) ->
            if l.StartsWith("::data::") then
                match getfirst ls with
                | None -> SimbadEmpty
                | Some s -> SimbadValid (interpret_simbad_objstring s)
            elif l.StartsWith("::error::") then
                match getsecond ls with
                | None -> SimbadEmpty
                | Some s -> SimbadError s
            else
                find_data ls

    Http.RequestString(baseurl, query=["script",script], httpMethod="GET")
    |> multiline_string_to_lines
    |> find_data
```

The first two lines of the function body are related to the SIMBAD query – the base URL to aim the request at, and the script that will be sent to the server to execute and extract the information that we care about. The script is parameterized with the ra and dec coordinates that were passed in. Following those initial declarations, we have a recursive function that spins over a SIMBAD response looking for the information that we wanted. When all goes well, at some point in the SIMBAD response a line that looks like “::data::::::::” will appear, immediately followed by a line containing the information we were actually looking for. If something goes wrong, such as querying for a (ra,dec) that SIMBAD doesn’t know about, we will have to look for an error case that follows a line starting with “::error::::::”. In the error case, the information we are looking for is actually the second line following the error sentinel.
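For illustration, the sentinel-scanning logic described above can also be sketched in Python. This mirrors, but is not, the F# find_data helper; the tuple-based return values here are invented for the sketch, and blank lines are assumed to have been stripped already (as multiline_string_to_lines does).

```python
def find_data(lines):
    """Scan a SIMBAD sim-script response for the data/error sentinels.
    Returns ('valid', objtype), ('error', message), or ('empty', None)."""
    for i, line in enumerate(lines):
        if line.startswith("::data::"):
            rest = [l for l in lines[i + 1:] if l]
            # the object type is the first line after the data sentinel
            return ("valid", rest[0]) if rest else ("empty", None)
        if line.startswith("::error::"):
            rest = [l for l in lines[i + 1:] if l]
            # the message is the second line after the error sentinel
            return ("error", rest[1]) if len(rest) > 1 else ("empty", None)
    return ("empty", None)
```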

In the end, the find_data helper function will yield a value from a discriminated union representing SIMBAD responses:

```fsharp
type SimbadResponse =
    | SimbadValid of SimbadObject
    | SimbadError of string
    | SimbadEmpty
```

This encodes valid responses, empty responses, and error responses in a nice type where the parameter represents the relevant information depending on the circumstance.

With all of this, the simple_simbad_query function body is formed from a simple pipeline in which an HTTP request is formed from the base URL and the query script. This is fed into the function to turn a multiline string into a string list, and then the recursive find_data call is invoked to scan for the data or error sentinels and act accordingly. Nothing terribly subtle here. What is nice, though, is that in the end we get a well-typed response that has been interpreted and brought into the FSharp type system as much as possible. For example, if an object was a galaxy, the result would be a value “SimbadValid S_Galaxy”.

A similar process is used to query the SDSS database to look up the (ra, dec) coordinates of the object given just its identifier.

```fsharp
let simple_sdss_query (objid: string) =
    let sdss_url = "http://skyserver.sdss3.org/public/en/tools/search/x_sql.aspx"
    let response = Http.RequestString(sdss_url,
                                      query=["format","json";
                                             "cmd","select * from photoobj where objid = "+objid],
                                      httpMethod="GET")
    let jr = JsonValue.Parse(response)
    let elt = jr.AsArray() |> Array.toList |> List.head
    let first_row = elt.["Rows"].AsArray() |> Array.toList |> List.head
    let ra, dec = (first_row.["ra"].AsFloat()), (first_row.["dec"].AsFloat())
    ra, dec
```

As before, we form the request to the given URL. Fortunately, SDSS presents a reasonable output format – instead of a weird textual representation, we can ask for JSON and take advantage of the JSON parser available in FSharp.Data. Of course, I immediately abuse the well structured format by making a couple dangerous but, in this case, acceptable assumptions about what I get back. Specifically, I immediately turn the response into a list and extract the first element since that represents the table of results that were returned for my SQL query. I then extract the rows from that table, and again collapse them down to a list and take the first element since I only care about the first row. What is missing from this is error handling for the case when I asked for an object ID that doesn’t exist. I’m ignoring that for now.

Once we have the row corresponding to the object it becomes a simple task of extracting the “ra” and “dec” fields and turning them into floating point numbers. These then are returned as a pair.
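The row extraction can be sketched in Python for illustration. The JSON shape assumed here matches the assumptions the F# code makes (a one-element array whose first element carries a "Rows" list); the function name is invented for the sketch.

```python
import json

def extract_radec(response_text):
    """Pull (ra, dec) out of an SDSS x_sql JSON response.
    Assumes a one-element array of tables, each with a "Rows" list,
    and takes only the first row, mirroring the F# code above."""
    tables = json.loads(response_text)
    row = tables[0]["Rows"][0]
    return float(row["ra"]), float(row["dec"])
```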

Given this machinery, it then becomes pretty simple to ask both SDSS and SIMBAD about SDSS objects. Here is a simple test harness that asks about a few of them and prints the results out.

```fsharp
let main argv =
    let objids = ["1237646585561219107"; "1237645941297053860";
                  "1237646586102349867"; "1237646588244918500";
                  "1237646647297376587"; "1237660558135787607";
                  "1237646586638827702"; "1237657608571125897";
                  "1237656539664089169"]
    for objid in objids do
        let ra, dec = simple_sdss_query objid
        let objtype = simple_simbad_query ra dec
        let objstring =
            match objtype with
            | SimbadValid s -> "OBJTYPE="+(sprintf "%A" s)
            | SimbadError s -> "ERROR="+s
            | SimbadEmpty -> "Empty Simbad response"
        printfn "RA=%f DEC=%f %s" ra dec objstring
    0
```

The resulting output is:

```
RA=77.538556 DEC=-0.946330 OBJTYPE=S_Galaxy
RA=62.098838 DEC=-1.075872 OBJTYPE=S_Galaxy
RA=87.319430 DEC=-0.591992 ERROR=No astronomical object found :
RA=76.117486 DEC=1.164739 ERROR=No astronomical object found :
RA=71.760602 DEC=0.066689 OBJTYPE=S_GalaxyInCluster
RA=69.348799 DEC=25.042414 OBJTYPE=S_PlanetaryNebula
RA=86.456666 DEC=-0.086512 OBJTYPE=S_HerbigHaro
RA=122.733827 DEC=36.829334 OBJTYPE=S_RadioGalaxy
RA=345.357080 DEC=-8.465958 OBJTYPE=S_Galaxy
```

A few closing thoughts. My goal with doing this was to take advantage of the type system that FSharp provides to bring things encoded as strings or enumerated values into a form where the code can be statically checked at compile time. For example, we have three possible SIMBAD responses: valid, error, or empty. Using discriminated unions allows me to avoid things like untyped Nil values or empty strings, neither of which capture useful semantics in the data. I’ve also isolated the code that maps the ad-hoc string representations used in the database responses into specific functions, outside of which the string-based nature of the response is hidden so that the responses can be consumed in a type-safe and semantically meaningful manner. An unanticipated response will immediately become apparent as an error in the string interpretation functions, instead of potentially percolating out into a function that consumes the responses, leading to hard-to-debug situations.

Of course, there are likely better ways to achieve this – either better FSharp idioms to clean up the code, or better interfaces to the web-based databases that would allow me to use proper WSDL type providers or LINQ database queries. I’m satisfied with this little demo though for a two hour exercise on a Saturday night.

]]>

The paper, entitled *“Identifying Change Patterns in Software History”*, is available on the arXiv – the final version isn’t much different (the reviewers liked it and didn’t request many changes), and fortunately the workshop has an open access policy on its publications. So the final version will be available in the wild in the not-too-distant future and won’t be hard to turn up.

The problem we looked at is simply stated: what does the change history of a piece of software as represented in a source control system tell us about what people did to it over time? Anyone who has worked on a project for any substantial amount of time knows that working on code isn’t dominated by adding new features – it is mostly an exercise of cleaning up and repairing flaws, reorganizing code to make it easier to add to in the future, and adding things that make the code more robust. During the process of making these changes, I have often found that it feels like I do similar things over and over – add a null pointer check here, rearrange loop counters there, add parameters to functions elsewhere. Odds are, if you asked a programmer “what did you have to do to address issue X in your code”, they would describe a pattern instead of an explicit set of specific changes, such as “we had to add a status parameter and tweak loop termination tests”.

This led us to ask: *can we identify the patterns that the programmers had in mind, and would describe in answer to the question “what did you do”, simply by looking at the evolutions stored in a version control system?* While easily stated, the question we looked at in our paper was how one actually goes about doing this.

We started with some work I had developed as part of a Dept. of Energy project called “COMPOSE-HPC” where we built infrastructure to manipulate programs in their abstract syntax form via a generic text representation of their syntax trees. The representation we chose was the Annotated Term form used by the Stratego/XT and Spoofax Workbench projects. A benefit of the ATerm form for programs is that it allows us to separate the language parser from the analyzer – parsing takes place in whatever compiler front end is available, and all we require is a traversal of the resulting parse tree or resulting abstract syntax tree that can emit terms that conform to the ATerm format.

Instead of reproducing the paper here, I recommend that interested people download it and read it over. I believe the slides will be posted online at some point so you can see the presentation that Jason gave. The short story is that given the generic ATerm representation for the program, we combined a structural differencing algorithm (specifically, the one discussed in this paper) with the well known anti-unification algorithm to identify change points and distill the abstract pattern represented by the change. The details in the paper that aren’t reproduced here are how we defined a metric of similarity for clustering changes such that anti-unification could be applied over groups of similar changes to yield meaningful patterns. To show the idea at work, we used the existing Haskell language-java parser to parse code and wrote a small amount of code to emit an ATerm representation that could be analyzed. We applied it to two real open source repositories – the one for the ANTLR parser project and the Clojure compiler. It was satisfying to apply it to real repositories instead of contrived toy repositories – we felt that the fact the idea didn’t fall over when faced with the complexity and size of real projects indicated that we had something of real interest to share with the world here.

The code for this project will be up on github in the not too distant future – I will update this post when it becomes available.

]]>

How do we find the number of indentations in each shape? A few years ago I was doing work with microscope slide images looking at cells in tissue, and similar shape analysis problems arose when trying to reason about the types of cells that appeared in the image.

Aside from any specific context (like biology), counting the dents in the black and white blobs shown above looked like an interesting little toy problem that wouldn’t be too hard to solve for simple shapes. The key tool we will use is the interpretation of the boundary of the shape as a parametric curve that we can then approximate the curvature along. When the sign of the curvature changes (i.e., when the normal to the curve switches from pointing into the shape to outside, or vice versa), we can infer that an inflection point has occurred corresponding to a dent beginning or ending.

The tools we need to solve this are:

- A way to find a sequence of points (x,y) that represent a walk around the boundary of the shape,
- A way to compute the derivative and second derivative of that boundary to compute an approximation of the curvature of the boundary at each point,
- A method for identifying dents as changes in the curvature value.

In Matlab, most of these tools are readily available for us. The bwboundaries function can take a binary image with the shape and produce the (x,y) sequence that forms the set of points around the boundary in order. The fast Fourier transform then can be used to turn this sequence into a trigonometric polynomial as well as computing derivatives necessary to approximate the curvature at the points along the boundary.

Working backwards, our goal is to be able to compute the curvature at any point along the boundary of the shape. We will be working with a parametric function defined as x=x(t) and y=y(t), which allows us to compute the (unsigned) curvature via k = |x′y″ − x″y′| / (x′² + y′²)^(3/2).

This means we need a way to compute the derivatives of the parametric function: x′, x″, y′, and y″. It turns out this is pretty straightforward if we employ the Fourier transform. A useful property of the discrete Fourier transform is its relationship to derivatives of functions: if X(k) is the k-th Fourier coefficient of a function x(t) sampled at N points, then the k-th coefficient of the derivative x′(t) is simply (2πik/N)·X(k), and the second derivative picks up that factor squared.

That relationship happens to be quite convenient, since it means that if we can take the Fourier transform of our parametric function, then the derivatives come with little effort. We can perform some multiplications in the frequency domain and then use the inverse Fourier transform to recover the derivative.
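As a quick illustration of the property (in NumPy rather than the Matlab used below, with names invented for the sketch), differentiating a closed curve spectrally and plugging the results into the curvature formula recovers the expected constant curvature 1/r for a circle of radius r:

```python
import numpy as np

def spectral_derivatives(f):
    """First and second derivatives of a periodic sequence f(t), t in [0,1),
    computed by scaling Fourier coefficients by (2*pi*i*k) and its square."""
    n = len(f)
    k = np.fft.fftfreq(n, d=1.0 / n)  # frequencies 0, 1, ..., -2, -1
    F = np.fft.fft(f)
    d1 = np.real(np.fft.ifft((2j * np.pi * k) * F))
    d2 = np.real(np.fft.ifft((2j * np.pi * k) ** 2 * F))
    return d1, d2

def curvature(x, y):
    """Unsigned curvature of the closed parametric curve (x(t), y(t))."""
    dx, ddx = spectral_derivatives(x)
    dy, ddy = spectral_derivatives(y)
    return np.abs(ddy * dx - ddx * dy) / (dx ** 2 + dy ** 2) ** 1.5
```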

Armed with this information, we must first obtain the sequence of points along the shape boundary that represent discrete samples of the parametric function that defines the boundary curve. Starting with a binary image with just a few dents in the shape, we can extract this with bwboundaries.

```matlab
[b,~] = bwboundaries(im, 8);
b1 = b{1};
x = b1(:,2);
y = b1(:,1);
```

This function returns a cell array in the event that more than one boundary can be found (e.g., if the shape has holes). In the images considered here, there are no holes, so the first element of the cell array is used and the rest (if any are present) are ignored. The x coordinates are extracted from the second column of b1, and the y coordinates from the first column. Now we want to head into the frequency domain.

```matlab
xf = fft(x);
yf = fft(y);
```

As discussed above, differentiation via the FFT is just a matter of scaling the Fourier coefficients. More detail on how this works can be found in this document as well as this set of comments on Matlab central.

```matlab
nx = length(xf);
hx = ceil(nx/2)-1;
ftdiff = (2i*pi/nx)*(0:hx);
ftdiff(nx:-1:nx-hx+1) = -ftdiff(2:hx+1);
% second-derivative multiplier is the square of the first-derivative one
ftddiff = ((2i*pi/nx)*(0:hx)).^2;
ftddiff(nx:-1:nx-hx+1) = ftddiff(2:hx+1);
```

Before we continue, we have to take care of pixel effects that will show up as unwanted high frequency components in the FFT. If we look closely at the boundary that is traced by bwboundaries, we see that it is jagged due to the discrete pixels. To the FFT, this looks like a really high frequency component of the shape boundary. In practice though, we aren’t interested in pixel-level distortions of the shape – we care about much larger features, which lie in much lower frequencies than the pixel effects. If we don’t deal with these high frequency components, we will see oscillations all over the boundary, and a resulting huge number of places where the curvature approaches zero. In the figure below, the red curve is the discrete boundary defined by bwboundaries, and the green curve is the boundary after low pass filtering.

We can apply a crude low pass filter by simply zeroing out frequency components above some frequency that we choose arbitrarily. There are better ways to perform filtering, but they aren’t useful for this post. In this case, we will preserve only the 24 lowest frequency components. Note that we preserve both ends of the FFT since it is symmetric, and preserve one more value on the lower end of the sequence since the first element is the DC offset and not the lowest frequency component.

```matlab
xf(25:end-24) = 0;
yf(25:end-24) = 0;
```

The result looks reasonable.

Finally, we can apply our multipliers to compute the derivative and second derivative of the parametric function describing the shape boundary. We only want the real components of the complex FFT.

```matlab
dx = real(ifft(xf.*ftdiff'));
dy = real(ifft(yf.*ftdiff'));
ddx = real(ifft(xf.*ftddiff'));
ddy = real(ifft(yf.*ftddiff'));
```

Here we can see the derivatives plotted along the curve. The blue arrows are the first, and the cyan arrows are the second derivative.

Finally, we can compute our approximation for the curvature.

```matlab
k = sqrt((ddy.*dx - ddx.*dy).^2) ./ ((dx.^2 + dy.^2).^(3/2));
```

The last step is to identify places where the curvature approaches zero. Ideally, we’d be working with the signed curvature so that we can identify where inflection points are by observing the normal vector to the boundary surface switching from pointing outward to inward and vice versa. The approximation above is not signed, so we rely on another little hack where we make the following assumption: the shape that we are dealing with never has flat edges where the curvature is zero. Obviously, this isn’t a good assumption in the general case, but it is sufficient to demonstrate the technique here. If this technique is applied in a situation where we want to do the right thing, the signed curvature is the thing we’d want to compute and we would look for sign changes.

Instead, we look for places where the curvature goes very close to zero, and assume those are our inflection points. Now, often more than one point near an inflection point exhibits near zero curvature, so we look for gaps of more than one point where the curvature is near zero. For each dent, we should see two inflection points – one where the boundary enters the dent, and one where it leaves it.

```matlab
thresh = 0.001;
pts = find(k < thresh);
dpts = diff(pts);
n = (length(find(abs(dpts) > 1))+1)/2;
```

For the example images above, the code yields the correct number of dents (2, 7, and 3).
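For illustration, the same thresholding and run-counting logic can be sketched in NumPy (the function name is invented; the post’s actual code is the Matlab above):

```python
import numpy as np

def count_dents(curvature, thresh=0.001):
    """Count dents from an unsigned curvature signal: find runs of points
    where curvature is near zero, then pair them up, since each dent
    contributes an entry and an exit inflection point."""
    pts = np.flatnonzero(curvature < thresh)
    if len(pts) == 0:
        return 0
    # a jump of more than 1 between consecutive indices starts a new run
    runs = 1 + np.count_nonzero(np.diff(pts) > 1)
    return runs // 2
```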

Before leaving, it’s worth noting some issues with this. First, the assumption that curvature diving towards zero implies that in the signed case it would have switched sign is definitely not valid in a general sense. Second, the filtering process is damaging to dents that are “sharp” – it has the effect of rounding them off, which could cause problems.

On the other hand, working with the boundary as a parametric curve and computing derivatives using the Fourier transform does buy us robustness since we stay relatively true to a well defined (in a mathematical sense) notion of what a dent in the shape is.

[**NOTE**: After posting, I discovered that some of the images were corrupted and have since replaced them with JPG equivalents. The quality inline is not as good as I’d like, but I am having issues exporting my Matlab plots to PNG.]

]]>

]]>

The topic of the summer school will be research at the intersection of geometric measure theory and geometric analysis with data analysis, especially in the presence of uncertainty.

This should be very interesting, as it is often the case that finding structure and information in a large pile of data is a question of understanding its fundamental geometric properties. I’ll be participating as one of the people who comes at the problem from the computational side. Check out the page for the summer school here, which has links to the poster for the session as well as a document that summarizes the topics to be covered. Students can find information on applying on the page as well.

]]>

The model that the book discusses is known as the Fermi-Pasta-Ulam (or FPU) problem. It is quite simple. Consider a 1D string that is represented as a sequence of line segments connected at n vertices. The connections between vertices are basically springs that obey Hooke’s law. In the model, the focus is on the displacement of these vertices over time. For the case where all vertices are equally spaced, the string doesn’t move at all since it is in an equilibrium state. When they are not uniformly spaced, interesting things start happening. The following video shows the case where the vertices are displaced based on a sine wave.

*(Note: the slowing and speeding up of the video is artificial and not part of the model, but is due to variations in compute time per frame on my computer since I recorded this as a screen capture).*

The model is defined by tracking the second derivative of the position of the vertices. The unperturbed linear model is described by q″_j = q_(j+1) − 2q_j + q_(j−1).

Adding in a nonlinear perturbation is what introduces the interesting behavior: q″_j = q_(j+1) − 2q_j + q_(j−1) + alpha·[(q_(j+1) − q_j)² − (q_j − q_(j−1))²].

The parameter alpha allows us to control the amount that this nonlinear term contributes to the evolution of the system. In my example movies, I let alpha be somewhere around 0.5. In Matlab, this is easily written as:

```matlab
function fpu(n,alpha,s,dt,iters)
    % initial positions for n points on the line
    % and two additional points for the fixed ends
    q = (0:n+1)./(n+1);
    qrest = (0:n+1)./(n+1);
    % velocities start as zero
    qvel = zeros(1,n+2);
    % perturbation added with a sinusoid
    q = q + 0.1.*sin(s.*q .* pi);
    % make sure ends are fixed at 0 and 1
    q(n+2) = 1;
    q(1) = 0;
    for i=1:iters
        % first term of q"
        qterm1 = q(3:end)-q(2:end-1);
        % second term of q"
        qterm2 = q(1:end-2)-q(2:end-1);
        % qacc is q". note that (a-b)^2 = (-(b-a))^2
        qacc = qterm1 + qterm2 + alpha .* (qterm1.^2 - qterm2.^2);
        % velocity is velocity + acc
        qvel(2:end-1) = qvel(2:end-1) + qacc;
        % position is updated by velocity * time step
        q(2:end-1) = q(2:end-1) + qvel(2:end-1).*dt;
    end
end
```

Adding a few strategic plot commands at the end of that loop lets you visualize it and see the plot I showed above.
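For readers who would rather not use Matlab, here is a rough NumPy translation of the same update loop. This is an illustrative sketch, not the original code; the function returns the final positions instead of plotting, and parameter names follow the Matlab above.

```python
import numpy as np

def fpu_simulate(n=32, alpha=0.25, s=1.0, dt=0.01, iters=500):
    """Euler-style integration of the alpha-FPU chain: n interior vertices
    plus two fixed endpoints, perturbed by a sinusoid at startup."""
    q = np.arange(n + 2) / (n + 1.0)      # equally spaced rest positions
    q = q + 0.1 * np.sin(s * q * np.pi)   # sinusoidal perturbation
    q[0], q[-1] = 0.0, 1.0                # pin the endpoints
    qvel = np.zeros(n + 2)
    for _ in range(iters):
        t1 = q[2:] - q[1:-1]              # stretch toward right neighbor
        t2 = q[:-2] - q[1:-1]             # stretch toward left neighbor
        qacc = t1 + t2 + alpha * (t1**2 - t2**2)
        qvel[1:-1] += qacc                # update interior velocities
        q[1:-1] += qvel[1:-1] * dt        # update interior positions
    return q
```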

The goal of the model originally was to look at how the energy in the system moved between the different vibrational modes of the system. As can be seen pretty clearly in the video above, over time, the single low frequency sinusoid that the system starts with evolves into a more complex motion where higher frequency modes gain energy and contribute more to the state of the system.

We can look at the energy in each mode by writing the Hamiltonian of the system as a sum of per-mode energies, H = Σ_k E_k, where E_k = (1/2)·(a′_k² + ω_k²·a_k²). Here a_k is the amplitude of mode k, obtained by projecting the displacements onto the mode shapes sin(jkπ/(n+1)), a′_k is its time derivative, and ω_k = 2 sin(kπ/(2(n+1))) is the frequency of mode k.

So, if we compute each component of the sum and look at them individually (instead of looking at the overall energy of the system represented by H), we can see how the different modes contribute to the overall energy of the system. The code above can be easily modified to add this computation after the update to the q vector:

```matlab
for k=1:n
    a(i,k) = (sqrt(2/(n+1))) * sum(q(2:end-1).*sin((1:n)*k*pi/(n+1)));
    ap(i,k) = (sqrt(2/(n+1))) * ...
        sum(((1:n)./(n+1)).*pi.*q(2:end-1).*cos((1:n)*k*pi/(n+1)));
end
h(i,:) = 0.5 .* (ap(i,:).^2 + omega.^2 .* a(i,:).^2);
```
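The projection onto modes can also be sketched in NumPy. This is an illustrative sketch rather than the post’s code: it takes positions and velocities directly, and assumes the standard fixed-end chain frequencies ω_k = 2 sin(kπ/(2(n+1))).

```python
import numpy as np

def mode_energies(q, qvel, n):
    """Project interior displacements and velocities onto the sine modes of
    the fixed-end chain and return the per-mode energies
    E_k = 0.5*(adot_k**2 + omega_k**2 * a_k**2)."""
    j = np.arange(1, n + 1)
    k = np.arange(1, n + 1)
    # mode shapes: S[k-1, j-1] = sin(j*k*pi/(n+1))
    S = np.sin(np.outer(k, j) * np.pi / (n + 1))
    norm = np.sqrt(2.0 / (n + 1))
    a = norm * S @ q[1:-1]        # mode amplitudes
    adot = norm * S @ qvel[1:-1]  # mode velocities
    omega = 2.0 * np.sin(k * np.pi / (2 * (n + 1)))
    return 0.5 * (adot ** 2 + omega ** 2 * a ** 2)
```

Starting the chain in a single mode shape should concentrate all of the energy in that mode, which the orthogonality of the sine modes guarantees.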

Here is a look at the energy of modes 2-8 for the first 30000 iterations of the Matlab code:

As we can see, over time some of the higher frequency modes begin to gain more energy. This is apparent in the video as well.

Just for fun, we can also look at an instance of the problem where we intentionally start off with a perturbation. In this case, we start off with initial conditions that cause the right-most link in the chain to be stretched further than the rest, leading to a wave that moves back and forth down the string. The impact of this on the energy distribution is interesting as well. This video was created using the code above, but instead of passing a nice integer value for the parameter s (like 1.0), I passed in a value like s=1.3.

Looking at the plot of the energy in each mode, we see that now things are less regular. For example, look at the plot for mode 3 – instead of being a nice smooth peak, the shape of the peaks over time changes. This plot spans the first 3000 iterations of the model for this case, allowing us to look a bit more carefully at the behavior at a fine time scale.

Overall, I think this problem is an interesting one to play with and use to learn a little about dynamical systems. I’m not done with the book yet, but so far it is pretty interesting. Fortunately, these days we have computers far faster than those that Fermi and friends had, so it is possible to casually play with problems like this that were quite challenging to work on back in the earliest days of computing.

]]>

Over the last few years, due to multicore and manycore processor trends, concurrency and parallelism have been written about many times. For many programmers, books play an important role relative to blogs or magazines simply due to the level of detail and breadth that can fit in a document with a generous page budget that is rigorously edited and technically reviewed. As some of my readers here may have noticed in the sidebar next to the posts (you probably can’t see the link if you get my posts via an RSS reader), I’m an author of one of these books, and wanted to talk a little about it here. I recently found out that the book is offered in a Kindle for Mac or PC edition, at a fraction of the price of the original, so I thought it would be worth discussing given the lower cost now available to potential readers.

The book came out and was widely available in early 2010, and has been selling as well as I would hope with essentially no marketing. My hope is that getting word out via an article on here will lead to interested people noticing this project that took about two years to put together with my co-authors Tim Mattson of Intel and Craig Rasmussen of LANL (I started the book with Craig while I was still working at LANL).

**Why did we write the book?**

Our goal in writing this book was to provide a resource for people interested in concurrency and parallelism that took a bit of a different approach to the topic than other books out there. As I discuss in more detail below, we made a point of focusing on the topic without adopting a specific implementation technique or programming model, with a large dose of historical context and a grounding in generally applicable patterns that appear in concurrent code. I believe it is important to learn the topic through a general conceptual foundation, rooted in the history of how today’s technology evolved from prior efforts over the past 30-40 years.

In researching the current playing field of books in the area when I started writing, I asked myself “why don’t I find these books totally satisfying?” The most common answer I came up with was that the books were often heavily rooted in one way of implementing concurrent or parallel programs. For example, it is not uncommon to find a book that discusses the topic solely in the context of MPI, or OpenMP, or Java, or (insert your favorite library here). This is quite reasonable for a practitioner who wants to learn the theory of the area in the context of the specific implementation in which they plan to work. For example, I found a book on concurrency in Java (“Java Concurrency in Practice”) to be quite wonderful, especially for someone interested in then going off to write Java code (which, at the time I read it, I was). Similar books exist for other popular systems, and even lesser-known ones like ZPL (also, a very good book).

Unfortunately, this mingling of general concepts with specific implementations of them can often lead to some generality being lost in the basic concepts. Implementations often adopt a specific way of approaching a problem, leading to a skewed view of the broader topic. This isn’t unique to the topic of this book – a similarly limited focus arises in other areas, like GUI programming, distributed programming, web frameworks, and so on. Furthermore, it is quite common for libraries or languages to fall out of favor over time, either due to technology trends or simply a lack of maintenance of the specific tools. For example, the Parallel Virtual Machine library, a one-time competitor to MPI, is no longer a serious contender for parallel programming – causing many of the books that chose it as their implementation vehicle to become similarly dusty and inaccessible to readers over time.

So, we embarked on the project with a few high-level goals in mind:

**Don’t start with the premise that parallelism and concurrency are intrinsically difficult concepts to grasp.**

As was recently pointed out to me when talking to someone about this at a conference, words like concurrent, simultaneous, contention, and so on were in common usage long before computers existed. Unsurprisingly, this is because they come up in our daily lives quite frequently – in activities ranging from cooking a meal (which I use as part of my argument in the book that concurrency is quite natural), to coordinating a meeting with a group of people, to “multitasking” at work. Concurrency is an intrinsic property of the real world, and humans are actually pretty good at managing it when it comes to physical and mental activities. Starting with this premise, and focusing on the technologies we have at hand, we can begin to understand where the difficulties actually lie. How often are they due to limitations of technology and implementation choices versus inherently challenging properties of a particular problem to be solved?

**Choose established, stable languages.**

While it was inevitable that we would choose concrete languages to demonstrate examples, I set down strict criteria for the languages we chose. MPI isn’t a language (it’s a library), and it is relatively well covered in other literature. OpenMP, on the other hand, is standardized, available to many programmers in multiple languages, and resides fairly close to the language itself. Cilk represents a relatively mature extension of C that is very (very!) simple, yet very powerful. It was not surprising to see Intel buy Cilk++ a year or two ago. Erlang was chosen because it was well established and mature, and represented a fundamentally different programming model from the rest. Other declarative languages have been in the news for their concurrency features, but they were still moving targets in 2008 when the bulk of the book was being written. Fortran was originally included (the appendix that was cut from the book is available for free on the book web site), but was dropped due to a likely limited audience. I put the appendix on the web site after the book was published since, while not a commonplace language, Fortran 2008 actually represents a fairly modern parallel programming language. This surprises some folks, especially those whose last exposure to Fortran was code peppered with capital letters, line numbers, and goto statements. It seems many people are unaware of the 1990, 1995, 2003, and 2008 revisions of the language, which removed much of the arcane cruft and added relatively interesting new features.

**Provide a strong, language neutral conceptual basis.**

Readers really should learn about concurrency control and the corresponding types of problems that arise in concurrent systems before diving into any specific implementation. This material is often given light treatment in books, so students must accumulate the knowledge by fusing concepts learned in operating systems, databases, distributed systems, and parallel programming courses. I tried to pull much of this together into a discussion covering the essentials from all of these sources. This was rooted in my observations during a couple-year stay at the University of Oregon as an adjunct faculty member, where I taught parallel programming, operating systems, and distributed systems. Those courses were quite valuable for testing the material that eventually made it into the book, and some of my students from those courses provided great feedback as I put the material together (they are acknowledged in the introduction).

**Present the history – How did technology evolve to what we see now?**

This is probably the most fun part of the book for me, and I really think it is critical for any reader who wants to deeply understand the state of the art. The hot technologies of today are definitely not new — almost all of them derive directly from work performed in the past, and how those past efforts evolved into what we see now is very informative. History tells us what worked, why it worked, and what didn’t work. History is where we see technologies that fell out of popularity due to trends in industry or technology, yet are relevant again as technology comes full circle back to concepts invented decades ago. For example, GPUs bear a strong resemblance to the vector machines and massively parallel computers of the 1980s. Unsurprisingly, the programming models popular today from NVIDIA and others are extremely similar to those developed twenty or thirty years ago. Programmers are not immune from the old saying, “Those who cannot remember the past are condemned to repeat it” — knowing the history of your technologies makes you a wiser software engineer and developer.

About nine months into the project, we asked Tim to join us to help ground the book in how we discussed and demonstrated the concepts. The original plan was for the later third to half of the book to focus on demonstrating concepts in the context of his parallel design patterns – so, why not ask him to help out? I was quite happy when he agreed to join in. This led to the next goal:

**Present examples in a framework that programmers can apply generally.**

The later chunk of the book focuses on applying what we discuss early on through a series of examples. Instead of laying chapters out in a specialized fashion (e.g., a chapter on a web server, a chapter on a parallel image processing algorithm, and so on), we chose to frame the examples in terms of a set of generalizable patterns that emerge in concurrent and parallel programs. By structuring the book this way, presenting concrete examples in the context of a generalizable framework for thinking about programming, we hoped to provide the reader with knowledge that is more easily applicable to the problems they will face themselves.

**Concluding remarks**

Two years on from finishing, I am glad we wrote the book the way we did instead of falling victim to the temptation to focus on either technologies that may be passing fads or unstable, ever-changing research platforms. I think the book should have, forgive the pun, a long shelf life.

I do hope you enjoy the book if you choose to grab a copy, and we are quite receptive to feedback on ways the book could be improved if we ever get around to releasing a second edition. We maintain an errata list on the book web page, along with a set of source code examples, pointers to a variety of concurrent and parallel programming languages, and slides for those who wish to use the book in a classroom setting.

For those interested in more, here are some links:

– Table of contents (PDF)

– Book web page

– Amazon

– CRC Press book page

And don’t take my word on the book – we’ve had a number of positive reviews! I keep a set of links to book reviews here, and currently have four up – ACM Computing Reviews, the Times Higher Education supplement, InsideHPC, and the Scientific Programming journal.
