nrgb

Random thoughts on normal things.

Clojure Dojo: Bloom Filter

Clojure Dojo

The other day I was fortunate enough to go to a Clojure Dojo held by the London Clojurians at ThoughtWorks. It was my first ever Dojo and I had a fantastic time. I must say thanks to the organizers and to ThoughtWorks for the beer and pizza. Superb!

After everyone had introduced themselves we voted on which Clojure programming task we would like to work on. There were choices such as making use of a testing framework and using a music library called Overtone. However, I chose to implement a Bloom Filter simply because I had not come across them before. We were split into our respective teams of approximately five and we set to work.

Bloom Filter Task

So what is a Bloom Filter? A detailed explanation is beyond the scope of this blog but here’s a summary from Wikipedia:

Bloom Filter: A space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed.

I am fairly new to Clojure. Fortunately, one of the team members, not only knew Clojure but he also had a basic idea about Bloom Filters – Result! We all participated and it was interesting to observe the tactics of using the Clojure REPL to successfully build a solution in a interactive “bottom-up” way. I could not recall the exact implementation therefore I decided to write my own.

In my experience of writing Investment Banking software typically we try to follow a “top-down” process consisting of:

  1. Gather the Functional Requirements
  2. State some high-level Design Objectives
  3. Implement the code including unit tests

This got me thinking: How would I write my own Clojure Bloom Filter using a top-down approach?

Functional Requirements

This is quite simple as the Wikipedia summary details the expected functionality:

  1. Add textual string words to a Bloom Filter
  2. Check whether a word is in the Bloom Filter returning true if it is, otherwise false.

Design Objectives

I have decided to embrace Functional Programming ideals to define the objectives by three categories: Data structures, Functions and Other Considerations:

  1. Data Structures:
    1. Bloom Filter Buffer (BFB) of configurable size where each element can either be 1 or 0 (default).
    2. A configurable list of hash functions.
    3. A configurable list of input words.
  2. Functions:
    1. A hash function which takes a word and returns a hash number within the bounds of the BFB size
    2. A function which takes a hash function and a word  and to returns a BFB with the appropriate element (as defined by the hash value of the word) set to 1.
    3. A function to process a word through a collection of hash functions
    4. A functions to process collection of words
    5. A check function to see if a word is in the BF or not.
  3. Other Considerations:
    1. Ignore any performance optimizations and memory issues for now
    2. Ignore any diversity issues around hash-functions
    3. Use the REPL to test the functions.

In other words…just make it work…for now! 🙂

Implementation

Coming from the Java world I am more used to NetBeans whereas nearly everyone at the Dojo used Emacs. I do have the entry: Learn Emacs on my To-Do list but for now I am using Counterclockwise Clojure plug-in for Eclipse running Clojure 1.6.

Here’s my implementation in almost the exact order that reflected my thought process:

(ns clj1.bloomfilter)

;; ———————————————————————————-
;; 1. Define a function to create a bloom filter buffer
;; Let’s try creating a vector of size n initialised to 0
(apply vector (take 5 (repeat 0)))
; [0 0 0 0 0]
; Ok, that worked so let’s create a function to create our bloom filter buffer.

(defn buffer
“Create a Bloom Filter Buffer initialised with 0’s”
[n]
(apply vector (take n (repeat 0))))

; Now lets quickly test the fn
(buffer 10)
; [0 0 0 0 0 0 0 0 0 0]

;; ———————————————————————————-
;; 2. Create our initial BFB
(def bloombuf (buffer 10))

;; ———————————————————————————-
;; 3. Let’s try write some hash functions. Note: hash fns must be from 0 to size of buf
;; Play around with the internal hash fn.
(mod (hash “hello”) (count bloombuf))
; 9

;; Use the internal hash fn for our first hash function.
(defn built-in-hash
[elem]
(mod (hash elem) (count bloombuf)))

;; As per the design objectives, write another hash function
;; But for now, simply re-use the internal hash function value, but incremented by 1.
(defn my-hash
[elem]
(mod (inc (hash elem)) (count bloombuf)))

;; Try out our hash functions
(built-in-hash “hello”)
; 9
(my-hash “hello”)
; 0

;; Define the collection of hash-fns which the Bloom Filter will use
(def hashfns [built-in-hash my-hash])
;; ———————————————————————————-
;; 4. Define some words which we will use to pre-populate our Bloom Filter Buffer.
(def words [“hello” “goodbye”])

;; ———————————————————————————-
;; 5. Define a function to hash a word and return a BFB of it
;; Let’s play with setting the buffer depending on the hash value.
(buffer 10)
; [0 0 0 0 0 0 0 0 0 0]

;; Use our built-in-hash function with “hello” should set element 9
(assoc (buffer 10) (built-in-hash “hello”) 1)
; [0 0 0 0 0 0 0 0 0 1]

;; Use our 2nd hash function with “hello” should set element 0
(assoc (buffer 10) (my-hash “hello”) 1)
; [1 0 0 0 0 0 0 0 0 0]

;; Ok use the above code to write a specific put function
(defn put
“Return a Bloom Filter Buffer with the appropriate element
set to 1 corresponding to the hash value of the word”
[word buf hashfn]
; (println “put: word is ” word)
; (println “put: buf is ” buf)
(assoc buf (hashfn word) 1))

;; Test the put function
(put “hello” bloombuf built-in-hash)
; [0 0 0 0 0 0 0 0 0 1]
;; ———————————————————————————-
;; 6. Now a implement a way to bloom a single word over
;; our collection of hash-fns by using reduce.
(defn bloom
“Return a Bloom Filter Buffer with the appropriate elements
set to 1 corresponding to hash values of the word against
a collection of hashfns”
[word bb]
; (println “bloom: word is ” word)
; (println “bloom: bb is ” bb)
(reduce (fn [b hfn] (put word b hfn))
bb
hashfns))

;; And test the bloom function
(bloom “hello” bloombuf)
; [1 0 0 0 0 0 0 0 0 1]

(bloom “goodbye” bloombuf)
; [0 0 1 1 0 0 0 0 0 0]

;; ———————————————————————————-
;; 6. Implement a way to bloom-filter over a collection of words
(defn bloomAll
“Return a Bloom Filter Buffer by processing a collection of words”
[words bb]
(reduce (fn [b word] (bloom word b))
bb
words))

;; And test bloomAll. We expect to see the union of the bloom-filter results of
;; “hello” and “goodbye”
(bloomAll words bloombuf)
; [1 0 1 1 0 0 0 0 0 1]

;; Define a bloom filter pre-populated with our designated words.
(def bbuf (bloomAll words bloombuf))
bbuf
; [1 0 1 1 0 0 0 0 0 1]
;; ———————————————————————————-
;; 7. Write a check function to see if a word has been bloomed using a given
;; hash-fn.
(defn check?
“Check a word against a pre-populated Bloom Filter Buffer using a passed in hash fn”
[word buf hashfn]
; (println “check?: word is ” word)
; (println “check?: buf is ” buf)
(= (buf (hashfn word)) 1))

;; Test the check? functions
(check? “Hello” bbuf built-in-hash)
; false

(check? “hello” bbuf built-in-hash)
; true

(check? “Hello” bbuf my-hash)
; false

(check? “hello” bbuf my-hash)
; true
;; ———————————————————————————-
;; 8. Write a function to see if a word has been bloomed for our collection of hash-fns
(defn bloomed?
“Check a word against a pre-populated Bloom Filter Buffer”
[word bb]
(reduce (fn [b hfn] (and b (check? word bb hfn)))
true
hashfns))

;; Finally test the bloomed? function.
(bloomed? “hello” bbuf)
; true

(bloomed? “Hello” bbuf)
; false

(bloomed? “goodbye” bbuf)
; true

(bloomed? “byegood” bbuf)
; false

;; ———————————————————————————-
;; 9. Finally, have an ice-cold beer 😀

Conclusions

As a Clojure newbie this has been a fun exercise where I feel I’ve learnt a number of things:

  1. Clojure Dojos are fun!
  2. Thinking in terms of Data and Functions in a top-down manner works well.
  3. Clojure with the REPL is a lot of fun.

The above solution could be improved by:

  • Each call to put will generate a new buffer. Therefore, use a single mutable Bloom Filter buffer.
  • Have multiple threads to add/check a word.
  • Improve the number and types of hash functions.

I look forward to the next Dojo. 😀

PS: If anyone knows of a nicer way to show Clojure code then I’d be grateful if you would please let me know.

 

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: