Python to Clojure: Getting Used to Immutability

tl;dr clojure is an interesting language emphasizing immutable data and higher order functions. While these offer some tangible benefits, effective use requires a style of thinking about programming that is quite different from the imperative style many of us know

I have been following the Clojure community for a little while and have been interested in their way of doing things. Not only are there a lot of smart people who have been involved in designing the language, but the support for both the JVM and for Javascript means that Clojure is a language that may yield a single language to “rule them all”, or at least to allow portability to just about any platform. JVM plus JS is a pretty nice combo. That Clojure is a lisp is also appealing since hearing of how lisp was Paul Graham’s “secret weapon” in Beating the Averages. But in addition to being a lisp, Clojure prides itself on immutable data and functional programming. These two are interesting concepts that are catching on in many areas of programming and are in the same spirit of reproducible research where the processing and analysis of data should be as transparent as possible and should focus on what we are ultimately interested in: the values themselves.

Although the functional language purists will have other criteria (functions as first class values (python is already fine on this count)) the functional programming ethos is what is of interest to me: the idea of describing data transformations and piping the data through these transformations. Thats the essential, practical take-home message of functional programming and reproducibility. In pursuing these goals, we see the incorporation of some great libraries for R that allow piping and laziness, facilitating the easy composition of functional pipelines. Unsurprisingly, these libraries are written by Hadley Wickham and show the same simple, composable style he brought to gglot. In the python community, there are a number of interesting projects, but I particularly like the efforts of Matthew Rocklin who has been working on tools that emphasize data transformations in the functional style, or using the logic of data transformation to describe data transformations for large datasets, abstracting away the implementation details which are taken care of by well crafted software.

While many aspects of functional programming are becoming popular, Clojure’s emphasis on immutability everywhere is one of the distinguising characteristics. So I wanted to try it on chemistry function, the Smallest Set of Smallest Rings, an algorithm for finding rings in chemical graphs. (Basic SSSR overview here; John May/Christopher Steinbeck’s coverage here and here). I followed the algorithm published by John Figueras, more or less copying his algorithm into python and then porting it to clojure - my code can be found as a gist.

The heart of the algorithm is the ringfinding function. This function takes a molecular graph and a starting node and then initiates a breadth first search to look for the shortest closed path. The algorithm uses a queue to keep track of nodes that need to be processed, and a map/dictionary to keep track of the paths.The python port was straightforward - almost a straight copy but easier since I don’t need to worry about garbage collection. Its easy to read and, once I figured out what I was doing, straightforward to write/debug. Porting it to Clojure, however, took a lot more time and mental clash. I’m not as comfortable in clojure so there is a lot of data type mismatch - like trying to merge a set and value when you meant to merge two sets, or using the incorrect function/arity. Those are annoying errors and its very easy to make them although I’m sure the cognitive dissonance reduces as you get more comfortable with the language. I ended up with functioning code that will find rings in a molecule graph but I wanted to highlight a few of the conceptual differences I encountered using the following code as illustration.


(def four_six_plus2 
    "molecule graph with two rings and some non-ring atoms"
    {\A #{\B \H \I}, \B #{\A \C} \C #{\B \D \H}, \D #{\C \E}
     \E #{\D \F},    \F #{\E \G} \G #{\H \F},    \H #{\A \C \G}
     \I #{\A \J},    \J #{\I} })

(defn- rmvals [coll k] 
  "takes a map of {key #{}}, removes the specified key and the references to that key"
  (let [conns (get coll k)
        mol-nokey (dissoc coll k)
        rmval (fn [mol k1] (update-in mol [k1] disj k))]
    (reduce rmval mol-nokey conns)))

(defn- prune [Molecule]
  "remove atoms from a map that have less than 2 connections.
  repeat until all atoms have 2 or more connections"
  (let [singles (map first (filter #(< (count (second %)) 2) Molecule))]
    (if (empty? singles)
      Molecule
      (prune (reduce rmvals Molecule singles)))))

This bit of code is used to “prune” a molecule graph of atoms that do not have two connections where prune is the main function, rmvals is a helper function, and four_six_plus2 is an example molecule that needs pruning. You can see that prune gets the molecule keys that are singles and then uses reduce to successively apply the rmval function to each key in the singles collection, using the Molecule as an initial value. Although I’ve used other higher order functions before, map in particular, the use of reduce in this case was highly counterintuitive to me. It was necessitated because I could not do what I wanted to which was to simply modify the Molecule in place, which is exactly what was done using python and its mutable values. Furthermore, you can see by the rmvals function, that I need to check the values and remove the references to the atom I will delete. In python prune could look something like this pseudocode:

for single in singles:        #get singles
    keys = Molecule[singles]  #get keys
    for key in keys:          #remove references to that key
        Molecule[key] = Moleceule[key].remove(single) 
    Molecule.del(single)      # remove the key

In this case all the deletions are done on the initial object and are as loop through. In Clojure, this is not possible unless that object is specifically marked as mutable; an atom. And even then you need a special syntax to denote a change. To get around this requires rethinking how you would approach a problem and that is the hardest part. To use reduce with an initial collection and change it using a function acting on a sequence of values is not substantively different than a nested for loop ; however I am unable to access the intermediate values! This is sort of the point: the use of higher order functions reduces exposure to intermediate state changes…. and that is a good thing but it can be tough to wrap your mind around. And what if you definitely need state? Here is the getring function illustrating the use of atoms and queues:

(defn- getring
  "breadth-first search for smallest ring modified BFS algorithm after Figueras
   all mutable state is kept in atoms:
      the atomqueue is used to kep track of which atom to evaluate
      the paths is a map of sets that keep track of all the paths starting at the rootnode
      the visited atom keeps track of atoms you've seen
      the ring is set to nil and awaits discovery

  Evaluation of smallest ring is assessed by merging your paths and making sure the intersection
  is only one."
  [Molecule rootnode]
  (let [init-paths (into {} (for [[k v]  Molecule] [k #{}]))
        paths (atom (update-in init-paths [rootnode] conj rootnode))
        atomqueue (atom (conj clojure.lang.PersistentQueue/EMPTY rootnode))
        visited (atom #{rootnode})
        ring (atom nil)]
     (while (nil? @ring)
       (let [atomcode (dequeue! atomqueue)
             connectedatoms (->> (get Molecule atomcode)
                                 (remove @visited))]
         (doseq [atm connectedatoms]
           ;check for rings and return if a ring is found
           (let [currentpath (get @paths atomcode)
                 nextpath  (get @paths atm)
                 intersect (clojure.set/intersection currentpath nextpath)]
             (cond
               (and (> (count @paths) 1) (= 1 (count intersect)))
                 (reset! ring (clojure.set/union currentpath nextpath))
               (= 0 (count nextpath))
                 (do
                  (swap! paths update-in [atm] clojure.set/union currentpath)
                  (swap! paths update-in [atm] conj atm)
                  (swap! atomqueue conj atm)))))
         (swap! visited conj atomcode)))
    @ring))

In this caase I could not figure out how not to use mutation. A queue is needed to keep track of which atoms to check, a path is needed to keep track of which paths have been visited, and a ring varaible is needed to indicate if a ring has been found in order to terminate the search. These have to be explicity made mutable using (atom ...), they have a different way of accessing the values - they need to be “derefed” using @ as in @ring, and they need to be explicitly mutated using swap!. These differences in syntax from the base Cloure values can be confunsing and also lead to trouble. There are other types of mutability in Clojure but for most basic purposes, and expecially as a local variable like in this case, they are not needed.

Clojure also has laziness built in. What if you change (doseq [atm connected atoms] .....) to (for [atm connectedatoms]....)? Nothing right? Wrong. Clojure’s sequence abstraction defaults to lazy values and in the case of the for loop you simply do not get execution and, in this case, the while is never exited. So thats another fundamental difference in execution styles between your average imperative language and Clojure.

So in closing I would have a few recommendations for those who are new to Clojure:

  1. Use an IDE like Cursive Clojure. This will help you avoid innumerable errors that arise when learning a new language, especially one that is both conceptually and syntactically different. The new version of clojure is also good at picking up “arity” errors which can be a problem for neophytes - especailly those who sprinkle (do (println ...)) in their for, let and loop blocks - you may be unintentionally changing your function……
  2. Use atoms for mutable state. For now don’t even bother looking at transients and refs; atoms get the job done. It will take some getting used to as the use of @, swap! and reset! have different syntax than their non-atom counterparts, but there are times where mutability will be necessary.
  3. Reduce is your friend. Using reductions is the biggest conceptual jump related to using immutable data. Although most examples of reducing functions use + or some simple function, reduce can and should be used to ‘consume’ collections where that collection can hold some form of state. See this stackoverflow for an example and check out the recent example by Tom Macwright on using reduce in Javascript . The beauty of reduce is the ability to alter the collection within the function and return the altered collection, “passing the data” through the functions without explicity changing it as you would in an imperative language. To be effective with reduce you need to really embrace clojure’s emphasis of everything as a sequence. Writing that prune function was a bit of an epiphany as it required rethinking the nested loops into a sequence that I could pass to a reducing function that would keep track of the state (removing atoms and their connections).

So all-in-all, I would say this is one of my first non-trivial clojure functions and it forced me to embrace the “clojure way” - or at least some of it. Beyond the semantic difficulties of being in a lisply language, my major barrier to writing this function in clojure was not having a good strategy for deciding where and how to make things mutable. I ended up arriving at two types of “mutation”: the use of atoms as a substitute for classic variables and the use of reducing functions to “pull state through” the function. Clojure is still a bit of mental mismatch but I am going to stick with it for a bit - in particular for Clojurescript.