The first step in any `morloc` project is to describe the relevant data. In
this post, I will describe how a set of training and testing data for a machine
learning project can be typed in `morloc`. This post will introduce the
concept of dimensional typing, which is not yet implemented in `morloc`, so the
`morloc` code in this post will not currently run. Instead, the focus of this
post is to describe the features I **want** to build into `morloc` in the near
future.

As a concrete example of a training/testing dataset, I will use the MNIST collection of hand-drawn digits. This collection consists of 60000 28x28 pixel training images and 10000 28x28 pixel testing images. All images are grayscale, with values between 0 and 255 representing intensities (from black to white).
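For intuition, here is a sketch in Python of the shapes involved, using NumPy with random placeholder data standing in for the real images and labels (the array names are mine, not part of any MNIST loader):

```python
import numpy as np

# Placeholder arrays with the MNIST shapes described above
# (random data standing in for the real images and labels).
x_train = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)
y_train = np.random.randint(0, 10, size=(60000,), dtype=np.uint8)
x_test = np.random.randint(0, 256, size=(10000, 28, 28), dtype=np.uint8)
y_test = np.random.randint(0, 10, size=(10000,), dtype=np.uint8)

# The top-level shape: a pair of (data, label) pairs.
mnist = ((x_train, y_train), (x_test, y_test))
```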

We can represent `MNIST` as a type with 4 parameters. This type captures the
top-level shape of the data (a pair of training and testing pairs).

```
type (MNIST x_training y_training x_testing y_testing) =
  ((x_training, y_training), (x_testing, y_testing))
```

To capture the shape one layer deeper, we can replace the type parameters with a list of lists of lists of integers (awkward) for the data and a list of integers for the labels.

`type MNIST = (([[[Int]]], [Int]), ([[[Int]]], [Int]))`

Where `[a]` represents an ordered list of elements of type `a`. Besides being
awkward, the term `[]` does not capture the rectangular shape of the
data. We can replace the internal `[[Int]]` with `Matrix Int` as below:

`type MNIST = (([Matrix Int], [Int]), ([Matrix Int], [Int]))`
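The rectangularity problem is easy to demonstrate. In Python terms, nested lists happily hold ragged data that a matrix type would reject (this is an illustration, not `morloc` semantics):

```python
import numpy as np

# Nested lists admit ragged data -- nothing enforces 28x28:
ragged = [[0, 255], [128]]          # a "[[Int]]" that is not rectangular

# A matrix/array type enforces a rectangular shape:
image = np.array([[0, 255], [128, 64]])
assert image.shape == (2, 2)
```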

This type still does not capture the dependencies between data and label dimensions. So we can extend the types with explicit dimensional data:

```
type MNIST =
( (Tensor_{60000,28,28} Int, Tensor_{60000} Int)
, (Tensor_{10000,28,28} Int, Tensor_{10000} Int))
```
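Although `morloc` would check these dimensions at compile time, the same checks can be emulated today at runtime. Here is a sketch in Python (the function name `check_dims` is my own invention):

```python
import numpy as np

def check_dims(array, expected_shape):
    """Raise if an array's shape differs from the declared dimensions."""
    if array.shape != tuple(expected_shape):
        raise TypeError(
            f"expected shape {tuple(expected_shape)}, got {array.shape}")

# Validate a training pair against the declared MNIST dimensions.
x_train = np.zeros((60000, 28, 28), dtype=np.uint8)
y_train = np.zeros((60000,), dtype=np.uint8)
check_dims(x_train, (60000, 28, 28))
check_dims(y_train, (60000,))
```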

The addition of dimensions to the type allows the dimensionality of the program
to be typechecked at compile time and serves as machine- and human-readable
documentation. Dimensions also allow runtime validation of input data. One further
step we could take would be to replace the `Tensor` type by generalizing the
`[]` notation to arbitrary dimensions, as shown below:

```
type MNIST =
( ([Int]_{n=60000,28,28}, [Int]_n)
, ([Int]_{m=10000,28,28}, [Int]_m))
```

I am also introducing dimension variables (`m` and `n`). Note that all indexing
expressions in a given type signature are in the same scope. They form a small,
integer-based language nested inside the larger type language. Repeated
dimensions could be removed, though in this case it doesn't improve
readability:

```
type MNIST =
( ([Int]_{n=60000, x=28, y=x}, [Int]_n)
, ([Int]_{m=10000, x, y }, [Int]_m))
```
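To make the shared scope of dimension variables concrete, here is a hypothetical sketch in Python of how a checker might unify variables like `n` and `x` across a signature (the function `unify_dims` and its representation of dimensions are my assumptions, not `morloc` internals):

```python
def unify_dims(declared, observed, env=None):
    """Match declared dimensions (ints or variable names) against observed
    sizes, binding variables in a shared environment."""
    env = {} if env is None else env
    if len(declared) != len(observed):
        raise TypeError(f"rank mismatch: {declared} vs {observed}")
    for dim, size in zip(declared, observed):
        if isinstance(dim, int):
            if dim != size:
                raise TypeError(f"dimension mismatch: {dim} != {size}")
        else:  # a dimension variable like 'n' or 'x'
            if dim in env and env[dim] != size:
                raise TypeError(f"{dim} bound to {env[dim]}, saw {size}")
            env[dim] = size
    return env

# The data and label components share the variable 'n':
env = unify_dims(["n", "x", "y"], (60000, 28, 28))
env = unify_dims(["n"], (60000,), env)   # consistent: n = 60000
```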

In the case of the MNIST data, we already know the dimensionality of the data. But if we want a description of learning data that can be reused in many contexts, we can generalize the type as follows:

```
type (MLData cell label) =
( ([cell]_{n, x...}, [label]_n)
, ([cell]_{m, x...}, [label]_m))
```

Where `n` and `m` represent the number of training or testing objects; `cell`
and `label` represent the generic data and label types; and `x...` represents
the dimensionality of the data. This general machine learning input data type
requires that the form and dimensions of the training and test data be the same
and that there be exactly one label for each data object. The `MNIST` type
is a specialization of this more general `MLData` type.
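The constraints that `MLData` encodes can be sketched in Python using generics, with the dimension dependencies checked at construction time (the class and its checks are my own illustration of the type above, not a `morloc` feature):

```python
from typing import Generic, TypeVar
import numpy as np

Cell = TypeVar("Cell")
Label = TypeVar("Label")

class MLData(Generic[Cell, Label]):
    """A generic train/test dataset: each data array shares its leading
    dimension (n or m) with its label array, and train/test data share
    their trailing dimensions (x...)."""
    def __init__(self, x_train, y_train, x_test, y_test):
        if len(x_train) != len(y_train):
            raise TypeError("training data and labels disagree on n")
        if len(x_test) != len(y_test):
            raise TypeError("testing data and labels disagree on m")
        if x_train.shape[1:] != x_test.shape[1:]:
            raise TypeError("train/test data disagree on trailing dims x...")
        self.train = (x_train, y_train)
        self.test = (x_test, y_test)

# MNIST fits this shape: n = 60000, m = 10000, x... = 28,28
mnist = MLData(np.zeros((60000, 28, 28)), np.zeros(60000),
               np.zeros((10000, 28, 28)), np.zeros(10000))
```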

There is more type information we could encode in the `MNIST` type, such as
constraints on the allowed data values (0-255) and label values (0-9). I will
discuss constraints later in this post. We might also want to layer an
ontology over the data, for example stating that the input matrices are of
logical type "GrayscaleImage", which is itself a term in a broader ontology.
Mapping `morloc` types into deep ontological frameworks is of great importance
to the `morloc` ecosystem, but I will leave that discussion to a future post.