Array Systems

What is an array system?

“Array systems” are my answer to the question of “what is the most basic mathematical formalism for a collection of 1-dimensional arrays”?

In order to specify such a data structure, we need to know three things.

  1. How many arrays are there?
  2. How long is each array?
  3. What is in each array?

One approach would be to specify in the schema the integer lengths of each array. However, really what we want is a collection of “abstract domains”, and in an instance of the schema, all arrays with the same abstract domain have the same length.

For instance, the src and tgt arrays in a data structure for graphs both have the same length, which is equal to the number of edges in the array.

So we can start to describe the schema as two finite sets, \mathtt{Domain} and \mathtt{Property}, along with a function \mathrm{dom} \colon \mathtt{Property} \to \mathtt{Domain}. An instance of this schema consists of a function \mathrm{size} \colon \mathtt{Domain} \to \mathbb{N}, along with for every p \in \mathtt{Property}, an array \mathrm{values}(p) of length \mathrm{size}(\mathrm{dom}(p)). But what is in that array?

One option would to simply have a set \mathtt{Type} of types for properties, so for instance, \mathtt{Int}, \mathtt{Float}, \mathtt{String}, ... \in \mathtt{Type}. Then a schema would be a finite set \mathrm{Domain}, a finite set \mathrm{Property}, and a span (\mathrm{dom}, \mathrm{codom}) of type

\mathtt{Domain} \leftarrow \mathtt{Property} \to \mathtt{Type}

An instance of this schema consists of a function \mathrm{size} \colon \mathtt{Domain} \to \mathbb{N}, along with for every p \in \mathtt{Property}, an array \mathrm{values}(p) of length \mathrm{size}(\mathrm{dom(p)}) with element type [\mathrm{codom}(p)], where for every t \in \mathtt{Type}, [t] is the instantiation of that type in a programming language.

We can express this mathematically as a morphism of spans:

In the above diagram, \mathtt{Array} is the set of all arrays of all lengths and element types, and the two functions out of it are the obvious ones.

However, this does not allow us to express the constraint that, for instance, each value of the src array in a graph should point to an edge.

So instead, we consider a grammar of types, parameterized on the set of domains.

variables: t1, t2 in Type(Domain), d in Domain

Type(Domain) := Int
              | Float
              | t1 x t2
              | t2 + t2
              | d

This turns \mathtt{Type} into a functor \mathsf{Set} \to \mathsf{Set}.

Then a schema is a span

\mathtt{Domain} \leftarrow \mathtt{Property} \to \mathtt{Type}(\mathtt{Domain})

and an instance is a span morphism

An array with element type, say, 5, is simply an array of integers between 1 and 5, but we could also have an array with element type 5 + \mathtt{String}, whose values would be either integers between 1 and 5 or strings.

I am currently calling such a diagram an array system, because while these may or may not turn out to be categorically convenient, it seems to me that these are the simplest mathematical description of “a bunch of arrays that can contain data and also refer to one another”.

The mathematics of array systems

What I have described so far is sufficient to give a full implementation of array systems in a manner similar to how we have implemented ACSets in Julia, Python, Scala, and TypeScript.

However, the category theorist is naturally drawn to ask “what is the higher categorical structure of array systems”. And I believe that with Kevin Arlin I have worked out an idea of what this should be.

To spoil the punch line, I think that the diagram above should be thought of as “a functor between \mathtt{Type}-comulticategories where \mathtt{Type} is reinterpreted as a 2-monad on \mathsf{Cat}” (see generalized multicategory)

What does this mean? Basically, if a category is a span (in \mathsf{Set} or another category with pullbacks)

along with composition and identity operators of the form

then a T-comulticategory for a monad (T, \eta, \mu) on some category is a span

along with slightly modified composition and identity operators of the form

T-comulticategories are less famous than their dual cousins, T-multicategories, which include things like multicategories and virtual double categories.

For our purposes, we would want \mathtt{Type} to be defined in something like the following way.

Let \mathtt{Prim} be a category whose objects are “primitive” types (like \mathtt{Int}, \mathtt{String}, \ldots) and whose morphisms are primitive operations on these types, like lengths : String -> Int. We might also put some “type variables” in \mathtt{Prim}. Let \mathrm{PC} \colon \mathsf{Cat} \to \mathsf{Cat} be the finite product/coproduct completion 2-monad. Then define \mathtt{Type} \colon \mathsf{Cat} \to \mathsf{Cat} by \mathtt{Type}(C) = \mathrm{PC}(C + \mathtt{Prim}). So an object of \mathtt{Type}(\{A, B\}) might look something like A^2 + \mathtt{String} \times B + \mathtt{Int}, and morphisms in \mathtt{Type}(\{A,B\}) would be built out of the universal property of products/coproducts along with primitive operations on the primitive types.

Then a schema is a \mathtt{Type}-comulticategory with a discrete, finite object category, and a discrete, finitely-generated property category. Any span of the form \mathtt{Domain} \leftarrow \mathtt{Property} \to \mathtt{Type}(\mathtt{Property}) freely generates such a schema.

An instance of a schema is a functor into the \mathtt{Type}-comulticategory \mathsf{Array}, where the object category is \mathsf{FinSet}, and a morphism from A \in \mathsf{FinSet} to T \in \mathtt{Type}(\mathsf{FinSet}) is an A-indexed, T-valued array. If we have included type variables in \mathtt{Prim}, then we have to “concretize” them to define the category \mathsf{Array}, i.e. assign them to actual types in a programming language. So we would really have a \mathtt{Type}-comulticategory \mathsf{Array}_K for each concretization K.

Now, what does reframing the discussion in the first section to be about \mathtt{Type}-comulticategories give us?

  1. We can compose properties using the product/coproduct structure of \mathtt{Type}. For instance, we could compose a property f \colon A \to B + C with two properties g \colon B \to D and h \colon C \to D to get a property A \to D. This allows us to express laws in the schema.
  2. We get a category of instances of a schema, because we can consider natural transformations between the functors as instances.

Finally, if we let \mathtt{Type}(C) = C + A, for some fixed set A, then we recover the notion of “ACSets with attribute types in A”, and the corresponding category of ACSets. However, I think that array systems are simpler and more general than ACSets. They are simpler because there are only two types of thing: domains and properties, rather than “obs, homs, attrs, attrtypes”. They are more general because they encompass all of the things that it is practical to put in an array, i.e. primitive types, elements of finite sets, elements of type variables, and product/sum types of all of the above.

Working out the details of all of this will definitely be some non-trivial higher category theory. But I’m pretty sure it will work out, and so I’m going to try and implement this before the theory is fully worked out, because we need these in AlgebraicJulia yesterday.


If we let \mathtt{Domain} be a poset, this could express a subtype relation between domains which we interpreted by having each domain be a subset of the set of UUIDs, and the subtyping relation between domains be interpreted as the subset relation between subsets of UUIDs, which would be a slick way of doing what we now do with m-category schemas @epatters.

Now that I think about it, I think that we should just use the traditional naming convention of category theory. A schema S is a diagram

\mathrm{Ob}(S) \leftarrow \mathrm{Hom}(S) \to \mathtt{Type}(\mathrm{Ob}(S))

or just

\mathrm{Ob} \leftarrow \mathrm{Hom} \to \mathtt{Type}(\mathrm{Ob})

when we aren’t talking about multiple schemas. A morphism in a \mathtt{Type}-comulticategory should just be called a morphism! Forget all this Attribute/Property/Column/ForeignKey stuff, we don’t need it because we only have one type of morphism!

1 Like

@dspivak I think \mathtt{Type} should be a prolynomial monad, specifically \mathcal{s} \triangleleft \mathcal{s}^{\mathrm{op}} \triangleleft (y + \mathtt{Prim}), or whatever the finite version of this (I’m taking \mathcal{s} from It's Proly like Poly but better // The Topos Lab). Which also means that morphisms in \mathtt{Type}(\mathrm{Ob}) are basically multivariate polynomial morphisms.

1 Like