Parsing custom binary formats in Clojure

Table of Contents

  1. Parsing custom binary formats
    1. Exploration
  2. Binary Format Basics
  3. Reading the MNIST data files
  4. Misc I/O
  5. An alternative higher-level API, ByteBuffer -> asIntBuffer

Parsing custom binary formats

This is a rough draft of some notes on working with binary data in Clojure.

I’m going to use the “Parsing Binary Files” section from Practical Common Lisp as a reference, but I’m going to write the code in Clojure.

The practical thing we are going to do is read image data from the MNIST database. http://yann.lecun.com/exdb/mnist/

 TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Most image data sets I’ve dealt with up until now have just been file directories. The training images will be files in one directory, the test images will be files in another directory, the labels will be part of the filename… something like that.

When I unzipped the image dataset from the link above, I was surprised it wasn’t a tar archive that extracted into a directory structure. It was just a single file. That’s when I read the rest of the page, saw the description of the format, and thought this would be a good opportunity to do some parsing of binary data.

Exploration

I quickly wanted to just view the file, byte-by-byte.

Clojure’s input-stream from clojure.java.io returns a java.io.InputStream.

You can see in the Clojure source code of clojure.java.io where the File protocol is extended with an implementation of make-input-stream that takes a File and returns a java.io.FileInputStream (which inherits from java.io.InputStream https://docs.oracle.com/javase/7/docs/api/java/io/FileInputStream.html)

java.io.FileInputStream has a read function which reads a byte of data from the input stream https://docs.oracle.com/javase/7/docs/api/java/io/FileInputStream.html#read().

Let’s see it in action!

(with-open [in (io/input-stream "resources/train-images-idx3-ubyte")]
  (println (.read in))
  (println (.read in))
  (println (.read in))
  (println (.read in)))

There are the first 4 bytes of the file.

That kind of looks like what is described by the website!

 TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images

That says the first 32 bits is a magic number with value 0x00000803 in hexadecimal (2051 in decimal)

Note that it’s only a coincidence that the bytes we just read and printed out look so much like the bytes described on the website. The bytes 0x00000803 are represented in hexadecimal but when we run (println ,,,) in Clojure, the values are printed out in decimal. The only reason they look the same is because 0x08 and 0x03 in hex have the same representation as 08 and 03 in decimal. If the first bytes in the data format were 0x00000C0A, then our output lines would have been 0, 0, 12, 10 instead of 0, 0, 8, 3 and it wouldn’t have been so obvious that we were on the right track.

Ok. Cool. We can read bytes one at a time. Now what? The data format doesn’t really care about individual bytes. It treats data in chunks of 32 bits, or 4 bytes, at a time.

How can we write our code so that we are dealing with 32 bit chunks? Should we read in the entire stream into a list and then partition it into 4-byte partitions?

.read returns -1 when the end of the file is reached. So we can use loop to keep calling .read until the byte we got is -1 and we can append each byte to a vector.

(with-open [in (io/input-stream "resources/train-images-idx3-ubyte")]
  (let [byte-vector (loop [result [] current-byte (.read in)]
                      (if (= -1 current-byte) result
                          (recur (conj result current-byte)
                                 (.read in))))
        byte-32-vector (partition 4 byte-vector)]
    (take 4 byte-32-vector)))

((0 0 8 3) (0 0 234 96) (0 0 0 28) (0 0 0 28))

That’s kind of cool. But there’s a few things that feel wrong.

For one thing, it takes a while to run those few lines of code. We are only asking for the first 4 32-bit integers in the end. But we are having to read the entire file before getting them. Maybe we could use some of Clojure’s built-in laziness. Let’s just make note of this little annoyance for now and move on.

The other thing is that we’re still dealing with an odd format. Each 32-bit integer is a list of 4 8-bit integers? That’s inconvenient. Clojure’s number data type is a 64-bit Long. Can’t we just use that rather than this list of 8-bit integers? Maybe we’re stuck reading 8 bytes at a time since that’s what the FileInputStream gives us, but is there a way we can combine the bytes into groups of things that are easier to work with?

Binary Format Basics

Practical Common Lisp mentions the case where you’re dealing with a binary format that uses an unsigned 16-bit integer as a primitive data type. The description of the MNIST data tells us that it’s using 32 bit integers. It sounds like we’ll be able to follow along with Practical Common Lisp and just make a few small adjustments to make the code work with 32 bit integers in Clojure rather than 16-bit integers in Common Lisp.

Peter explains that to read an integer from a binary format that uses a 16-bit integer as a primitive data bype, you need to read two bytes (a byte is 8 bits) and then combine them into a single number by multiplying one byte by 256 (28), and then adding it to the other byte.

If the binary format specifies that such 16-bit quantities are stored in big-endian form, with the most significant byte first, you can read such a number with this function (in Common Lisp).

(defun read-u2 (in)
  (+ (* (read-byte in)) 256 (read-byte in)))

In Clojure, it might look like this.

(defn read-u2 [in]
  (+ (* (.read in) 256) (.read in)))

PCL mentions that Common Lisp provides a more convenient way to perform this kind of bit twiddling, the LDB or “load byte” function. That can be used with setf to set any number of contiguous bits from an integer.

I don’t see an equivalent in Clojure/Java, but maybe the bitwise operators are a good comparison?

(defn read-u2 [in]
  (let [u2 0]
    (-> u2
        (bit-or (.read in))
        (bit-shift-left 8)
        (bit-or (.read in)))))

But that’s not much different than the previous option. What would an actual “load byte” function look like? Let’s write one.

It will help to see what’s going on if we have an easy way to convert a binary string to an integer.

(defn b->i [b]
  (Integer/parseInt b 2))

(comment
    (b->i "01")
    ;; => 1
    (b->i "10")
    ;; => 2
    (b->i "11")
    ;; => 3
    )

(defn load-byte
  "Returns `byte-size` number of bits of `n` starting from the bit at `byte-position`.
  Example:
  `11110011` in binary is `243` in decimal.
  If we want to get 4 bits from `11110011` starting from the bit that is 2 from the right:
  (load-byte [4 2] 243) -> `1100`
  "
  [[byte-size byte-position] n]
  (-> n
      ; First, drop everything to the right of byte-position.
      (bit-shift-right byte-position)
      ; Then, drop everything to the left of byte-size.
      (bit-and (dec (bit-shift-left 1 byte-size)))))

(comment
  (let [byte-size 4
        byte-position 2
        ; 4 bytes starting from position 2 should be:
        ;             "..1100.."
        integer (b->i "11110011")]
    (-> (load-byte [byte-size byte-position] integer)
        (Integer/toBinaryString)))
  ;; => "1100"
  )

(defn deposit-byte [new-byte [byte-size byte-position] integer]
  (let [deposit (-> new-byte
                    (bit-and (dec (bit-shift-left 1 byte-size)))
                    (bit-shift-left byte-position))
        mask (bit-not
              (bit-shift-left
               (dec (bit-shift-left 1 byte-size))
               byte-position))]
    (bit-or deposit (bit-and integer mask))))

(comment
  (let [byte-size 3
        byte-position 1
                           ; aligned for demo
                           ; size 3
                           ; position 1
        new-byte (b->i     "101")
        integer  (b->i "11110001")
        ; expected out "11111011"
        ]
    (Integer/toBinaryString
     (deposit-byte new-byte [byte-size byte-position] integer)))
  ;; => "11111011"
  )

Now we have load-byte and deposit-byte functionality similar to what Peter Siebel refers to in PCL.

Let’s proceed.

Re-writing read-u2 with our new functions looks like this.

(defn read-u2 [in]
  (reduce
   (fn [u2 byte-position]
     (deposit-byte (.read in) [8 byte-position] u2))
   0
   (range 8 -1 -8)))

(comment
  (let [baos (java.io.ByteArrayOutputStream. 2)]
    (.write baos 8)
    (.write baos 3)
    (let [bais (java.io.ByteArrayInputStream. (.toByteArray baos))]
      (read-u2 bais)))
  ;; => 2051 
  )

Note that we can’t use the thread macro in what might appear to be an intuitive way.

(defn read-u2 [in]
  (let [u2 0]
    (->> u2
         (deposit-byte (.read in) [8 8])
         (deposit-byte (.read in) [8 0]))))

(macroexpand
 '(->> u2
       (deposit-byte (.read in) [8 8])
       (deposit-byte (.read in) [8 0])))
;; => (deposit-byte (.read in) [8 0] (deposit-byte (.read in) [8 8] u2))

The threading macro expands so that the first time (.read in) is called is when the byte-specifier is [8 0] and the second time (.read in) is called is when the byte-specifier is [8 8]. That’s the opposite order of what we need.

Rather than fight with the ordering when dealing with stateful Java interop, I’d rather just use reduce over a range of byte-position\s. The other benefit of that is by simply changing the range we can easily move from a 16 bit data structure to a 32 or 64 bit data structure.

Also note that we’re kind of trying to fit a square peg in a round hole by writing functions to mimic Common Lisps load-byte and deposit-byte.

They are kind of nice to have when you want to do certain types of bit fiddlring. But for just reading multiple 8-bit values into a 16 or 32 bit value, we could use something much simpler.

read-u2 could be written in a very simple and direct way.

(defn read-u2 [in]
  (bit-or (bit-shift-left (.read in) 8) (.read in)))

Reading the MNIST data files

The first few things to read in the MNIST data format are 32 bit integers.

We can make a slight adjustment to our read-u2 function above to change it from reading in 16-bits to reading in 32 bits.

(defn read-u4 [in]
  (loop [r 0 i 3]
    (if (= i -1)
      r
      (recur
       (bit-or
        r
        (bit-shift-left (.read in) (* i 8)))
       (dec i)))))

A big chunk of code is soon to follow. Here’s a quick reminder of the structure of the MNIST data.

 TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

We’ll eventually want to be working with a list of “images”, where an “image” is a 28x28 matrix of pixel values.

So we’ll open the file and read off 4 32-bit integers, one for the magic number, one for the number of images, one for the number of rows, and one for the number of columns. Then we just have a sequence of bytes where each byte represents a pixel.

Since each image is 28 pixels wide by 28 pixels tall, then we know there are 28 \* 28, or 784 pixels per image. By reading 784 pixels at a time and then partititioning those 784 pixels into 28 pixel chunks, we have our 28 rows by 28 columns representation that we’re looking for.

The code that follows is one way of doing that. There’s some helper functions to generate an inline comment that helps visualize the data. If you stand back from your screen, you can almost see the numbers in the images in the inline comments at the bottom.

(defn read-n
  "Calls `.read` on `in` `n` times and returns a vector of the result.
  If we get to the end of the input stream, returns the result vector
  with meta-data `:end` so that the calling function can handle end-of-input."
  [in n]
  (loop [r []
         i 0
         c (.read in)]
    (cond
      (= -1 c) ^:end r
      (= i n) r
      :else (recur (conj r c) (inc i) (.read in)))))

(defn lazy-read-n
  "Returns a lazy sequence of reading `n` values from `in`."
  [in n]
  (let [c (read-n in n)]
    (if (:end (meta c))
      (cons c nil)
      (cons c (lazy-seq (lazy-read-n in n))))))

(defn load-images
  [file]
  (let [in (io/input-stream (io/file file))
        _ (read-u4 in) ; magic number
        num-images (read-u4 in)
        num-rows (read-u4 in)
        num-cols (read-u4 in)
        num-pixels (* num-rows num-cols)
        images (map
                (partial partition num-rows)
                (lazy-read-n in num-pixels))]
    images))

(defn format-pixel
  "Normalizes values between 0 and 255 to values between 0 and 9.
  Makes it easier to visualize a 28x28 matrix of 8-bit integers."
  [p]
  (let [p (/ p 256.0)]
    (int (* 10 p))))

(comment
  (let [in (io/input-stream (io/file "resources/train-images-idx3-ubyte"))]
    (->> (take 2 (load-images "resources/train-images-idx3-ubyte"))
         (map
          (fn [image]
            (map
             (fn [row]
               (map format-pixel row))
             image)))))
  ;; => (((0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 5 6 1 6 9 9 4 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 1 1 3 6 6 9 9 9 9 9 8 6 9 9 7 2 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 1 9 9 9 9 9 9 9 9 9 9 3 3 3 2 1 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 8 9 9 9 9 9 7 7 9 9 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 3 6 4 9 9 8 0 0 1 6 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 6 9 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 5 9 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 7 9 2 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 1 9 8 6 4 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 3 9 9 9 4 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 7 9 9 5 1 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 9 9 7 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 9 9 2 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 5 7 9 9 8 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 1 5 8 9 9 9 9 7 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 4 8 9 9 9 9 7 3 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 2 8 9 9 9 9 7 3 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 6 8 9 9 9 9 7 3 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 2 6 8 9 9 9 9 9 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 5 9 9 9 8 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0))
  ;;     ((0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 6 9 6 1 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 1 9 9 9 9 9 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 2 8 9 9 9 9 9 2 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 2 8 9 9 9 7 3 9 9 4 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 6 9 9 9 9 9 9 3 7 9 6 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 1 9 9 9 7 4 9 8 1 3 9 6 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 1 9 9 9 6 0 2 4 0 0 0 9 9 1 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 1 6 9 9 8 3 0 0 0 0 0 0 9 9 6 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 6 9 9 2 0 1 0 0 0 0 0 0 9 9 7 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 2 9 9 2 0 0 0 0 0 0 0 0 0 9 9 7 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 7 9 7 0 0 0 0 0 0 0 0 0 0 9 9 7 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 2 9 9 4 0 0 0 0 0 0 0 0 0 0 9 9 5 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 3 9 8 0 0 0 0 0 0 0 0 0 0 5 9 7 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 3 9 8 0 0 0 0 0 0 0 0 0 5 9 8 2 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 3 9 5 0 0 0 0 0 0 0 1 6 9 6 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 3 9 8 0 0 0 0 0 0 4 9 9 6 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 3 9 9 5 1 1 3 6 8 9 8 6 2 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 3 9 9 9 8 8 9 9 9 7 5 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 1 7 9 9 9 9 9 9 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 5 9 9 9 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  ;;      (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)))
  )

One problem with the code above is that it’s slow. I won’t go into it now since this is just a brief overview of byte-level i/o.

Misc I/O

Common Lisp has a few IO functions that are very similar to what Clojure has available.

In Practical Common Lisp, Peter Siebel uses with-open-file to read unsigned-byte\s from a file.

In Common Lisp, every time an input stream is passed to read-byte, it returns an integer between 0 and 255 (assuming :element-type was (unsignted-byte 8)).

Let’s see if this is how Clojure behaves. Let’s see how .read and .write work from BufferedInputStream and BufferedOutputStream.

We’re going to be doing a lot of inspecting of binary numbers here, so let’s get a little helper function so it will be more obvious what we are looking at.

(require '[clojure.string :as string])
(defn as-binary-string [b]
  (string/replace
   (format "%8s" (Integer/toBinaryString b))
   #" "
   "0"))
(comment
  (as-binary-string 1)
  ;; => "00000001"
  (as-binary-string 3)
  ;; => "00000011"
  (as-binary-string 255)
  ;; => "11111111"
  (as-binary-string 256)
  ;; => "100000000"
  (as-binary-string (dec (Math/pow 2 31)))
  ;; => "1111111111111111111111111111111"
  (count (as-binary-string (dec (Math/pow 2 31))))
  ;; => 31
  )

Anything bigger than (dec (Math/pow 2 31)) results in an exception, Value out of range for int: 2147483649.

If Integers are stored as 32 bits, why is (Math/pow 2 32) out of range? Because these are “signed” integers in Clojure, and that 32nd bit is used to signify the sign.

Ok. Now we have a way to easily view the actual bits of some bytes.

Let’s go test some reading and writing bytes with files.

(with-open [out (io/output-stream "resources/test")]
  (.write out 255))
(with-open [in (io/input-stream "resources/test")]
  (.read in))

Sweet. That works. What happens if we try to write something that can’t fit in an 8-bit byte?

(with-open [out (io/output-stream "resources/test")]
  (.write out 256))
(with-open [in (io/input-stream "resources/test")]
  (.read in))

Hmmm. We know 255 is the largest 8-bit number; 11111111. 256 takes up 9 bits, 100000000. We can see in the documentation that (.read in) only reads 8 bits (1 byte) at a time. What happens if we try to write more than 8 bites? Can we write more than 8 bits with (.write out 256)?

Let’s try reading 2 bytes from the file and what’s in the following byte after writing 256.

(with-open [out (io/output-stream "resources/test")]
  (.write out 256))
(with-open [in (io/input-stream "resources/test")]
  (.read in)
  (.read in))

According to the Java docs, -1 indicates the end of a stream has been reached.

So, if we try to write more than 8 bits, then any bits above 8 just get truncated and lost.

1

An alternative higher-level API, ByteBuffer -> asIntBuffer

Here’s an option if your entire file is 32 bit integers. It won’t work for the MNIST dataset since only the first few records are 32 bit ints and the image pixels are 8-bit bytes.

(require '[clojure.java.io :refer [input-stream] :as io])
(with-open [xin (input-stream (file "resources/train-images-idx3-ubyte"))
            xout (java.io.ByteArrayOutputStream.)]
  (io/copy xin xout)
  (-> (.toByteArray xout)
      (java.nio.ByteBuffer/wrap)
      (.asIntBuffer)
      (.get 1)))
Show Comments