Visualising Entropy

Entropy has many meanings in different contexts, but the general idea is that it's a measure of randomness. The entropy to be visualised in this discussion is the randomness of a large body of raw binary data, such as a data file or a block device.
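The text does not define entropy formally. One common formalisation, assumed here purely for illustration, is the Shannon entropy of the byte values in a block: 0 bits per byte when every byte is the same and 8 bits per byte when all 256 values are equally likely. The helper name byte_entropy is hypothetical and is not part of the tool discussed later.

```ruby
# Shannon entropy of a byte string, in bits per byte (0.0 to 8.0).
# Illustrative sketch only: one possible way to put a number on "randomness".
def byte_entropy(bytes)
  return 0.0 if bytes.empty?

  # Count how often each of the 256 possible byte values occurs.
  counts = Array.new(256, 0)
  bytes.each_byte { |b| counts[b] += 1 }

  total = bytes.bytesize.to_f
  counts.reject(&:zero?).sum do |count|
    prob = count / total
    -prob * Math.log2(prob)
  end
end

zeroes  = "\0".b * 1024                    # unused space: a single repeated value
uniform = (0..255).to_a.pack("C*") * 4     # every byte value equally often

puts byte_entropy(zeroes)
puts byte_entropy(uniform)
```

Under this definition a zero-filled region scores at the bottom of the scale, while well-encrypted data would be expected to score close to the top.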

The intention is to visualise the data stored on a standard (12cm diameter) DVD optical disc. The capacity of such discs is 4.7GB for a typical single layer disc or 8.5GB for a dual layer one. (1GB=10^9 bytes.)

Searching the web for an existing solution revealed some pages that discussed similar ideas but at a file level, an order of magnitude smaller than what's required here.

Some quick calculations give an idea of what's involved. Assuming the visualisation will take the form of a 16:9 widescreen 1920x1080 pixel image:

  • Data size = 4,700,372,992 bytes (DVD+R media size)
  • Image size = 1920 x 1080 = 2,073,600 pixels
  • Data resolution = 4,700,372,992 / 2,073,600 = 2266.77 bytes per pixel

The data resolution is the ratio of data bytes to image pixels: it means that there will be one pixel in the image for roughly every 2266 bytes of input data.
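The arithmetic above can be reproduced in a few lines of Ruby. This is a minimal sketch using only the figures quoted in the list. The variable names and the offset_for helper are hypothetical, and rounding the period down to a whole number of bytes is an assumption made here.

```ruby
# Figures taken from the list above.
data_size = 4_700_372_992   # bytes on a single-layer DVD+R
width     = 1920
height    = 1080

pixels          = width * height
data_resolution = data_size.to_f / pixels   # bytes of input per pixel
period          = data_resolution.floor     # whole bytes per pixel (assumption)

# Hypothetical helper: byte offset at which the data for pixel (x, y) starts,
# with pixels laid out left to right, top to bottom.
def offset_for(x, y, width, period)
  (y * width + x) * period
end

puts format("%d pixels, %.2f bytes per pixel, period %d",
            pixels, data_resolution, period)
# prints: 2073600 pixels, 2266.77 bytes per pixel, period 2266
```

Rounding the period down to 2266 means that 1,595,392 bytes at the very end of the disc fall beyond the last sampling period, which is negligible at this scale.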

This leaves two questions: how to translate the input data into pixel data and how to organise the pixel data in the image.

A simplistic answer to these questions is to sample the input data with a period equal to the data resolution: the first three of every 2266 bytes become pixel RGB values and the remainder are discarded. The pixels are then placed linearly top-left to bottom-right. A small tool written in Ruby shows what this looks like and allows an encrypted and plain volume to be compared:

This image shows a disc image that contains three volumes, followed by a lot of unused space at the end (it is all zeroes and shows as black). Distinct lines are visible separating the three volumes, and a pattern is visible in the data. The next image shows another disc image containing the same three volumes as before, but this one is encrypted. The difference is plain to see.
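The sampling scheme described earlier can be sketched as follows. This is not the original tool, whose source is not shown here. It is a minimal illustration that assumes a seekable input, a whole-byte period, and binary PPM (P6) as the output format, chosen only because it needs no libraries. The names sample_pixels and write_ppm are hypothetical.

```ruby
require "stringio"

# Take the first three bytes of every `period` bytes as one RGB pixel and
# discard the rest. Returns width * height * 3 bytes of raw RGB, ordered
# left to right, top to bottom.
def sample_pixels(io, width, height, period)
  rgb = "".b
  (width * height).times do |i|
    io.seek(i * period)
    triple = (io.read(3) || "").b        # read returns nil at end of input
    rgb << triple.ljust(3, "\0".b)       # pad short reads with zeroes (black)
  end
  rgb
end

# Write raw RGB data as a binary PPM (P6) file.
def write_ppm(path, rgb, width, height)
  File.open(path, "wb") do |f|
    f.write("P6\n#{width} #{height}\n255\n")
    f.write(rgb)
  end
end

# Tiny in-memory demonstration: 20 input bytes, a 2x2 image, period 5.
demo = StringIO.new((0...20).to_a.pack("C*"))
p sample_pixels(demo, 2, 2, 5).bytes
# prints: [0, 1, 2, 5, 6, 7, 10, 11, 12, 15, 16, 17]
```

For a real disc image the same two functions would be called with a file opened in binary mode, a width of 1920, a height of 1080 and a period of 2266. One seek per pixel is simple rather than fast, and reading the input sequentially in larger chunks would be an obvious refinement.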

The next question is whether there are other ways to visualise the data that can reveal more about it.