Structured Version Control for Scientific Models

Introduction

State management is a hard problem. The traditional approaches are to store state either in a database, or in a textual version control system such as git.

However, neither of these options is suitable for scientific models. Scientific models are composed of both combinatorial and recursive data. By combinatorial, I mean that scientific models often have actions of permutation groups on them that leave their semantics unchanged; by recursive, I mean that scientific models both contain symbolic expressions, represented as trees, and are typically built compositionally out of other scientific models.

Because of the combinatorial data, comparison of two scientific models may involve functions between finite sets of variables, and this is very ill-suited to textual line-by-line diffs.

But because of the recursive data, scientific models don’t fit very well in relational databases, and thus the change tracking present in relational databases also cannot be effectively used.

This tempts the user to use a so-called “document database”, which has a data model similar to JSON. But such a document database is just as poor a representation for combinatorial data as text.

So the only choice is to boldly go forth and build our own system, drawing from the best aspects of the old systems.

Principles

In this section, I aim to lay out a generic structure for version control that could be suitable for scientific models. Of course, the simplest way of doing this will likely generalize far beyond scientific models.

Generality and simplicity are goals here, but fantastic performance is not. Rather, we should have reasonable performance which is “fast enough” to handle moderately sized scientific models.

Categorical Version Control

Let \mathtt{B} be the set of bytestrings. Then define a patch category to be a polynomial comonoid C (i.e. a category where morphisms are indexed by domain and fibered by codomain), along with:

  • An initial object 0 \in C
  • For every i \in C, an injection e_i \colon C[i] \to \mathtt{B}, where C[i] is the set of morphisms out of i (this is what it means to be indexed by domain)
  • For every i \in C, a commutative monoid operation + on C[i] such that f + g is a pushout of f and g, and such that + commutes with precomposition. Note that this means + is not idempotent! If I add a new species to a Petri net, and you add a new species to the same Petri net, we should get two new species even though there is nothing to distinguish these species. Another way of doing this would be to have + be idempotent, and for C[i] to be a semilattice, in which case we would give each entity a UUID or something so that different people adding entities would produce two entities.

The intuition behind patch categories is that the objects of the patch category represent something that you want to version control, and the morphisms represent patches.
Having an initial object means that each object can be represented as a patch out of the initial object. Given an object, we can serialize and deserialize patches out of that object. And finally, we can merge patches via chosen, coherent pushouts.
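As a toy illustration, here is a minimal Python sketch of a patch category whose objects are bags of species names and whose patches add species. All of the names here (`INITIAL`, `apply`, `merge`, `serialize`) are invented for the example; they are not part of the definition above.

```python
# Toy patch category: an object is a bag (multiset) of species names,
# and a patch out of an object is a list of species to add.
import json
from collections import Counter

INITIAL = Counter()   # the initial object 0: the empty model

def apply(obj, patch):
    """Take the codomain of a patch: the object with new species added."""
    return obj + Counter(patch)

def merge(patch_f, patch_g):
    """The commutative monoid + on patches out of the same object.
    Note it is NOT idempotent: if both patches add "grass", the
    merged model gains two new species, as in the Petri net example."""
    return Counter(patch_f) + Counter(patch_g)

def serialize(patch):
    """The injection e_i : C[i] -> B into bytestrings."""
    return json.dumps(sorted(Counter(patch).elements())).encode()

m = apply(INITIAL, ["wolf", "sheep"])
merged = merge(["grass"], ["grass"])
assert apply(m, merged)["grass"] == 2   # two new species, not one
```

Note how non-idempotence falls out of using a multiset: two independent additions of “grass” stay distinct, exactly the behavior described for + above.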

In order for pushouts to always exist, we might have to artificially enlarge our definition of object to explicitly represent merge conflicts, as is done in Pijul.

The hope is that whatever syntax for scientific models we choose, if we can make it the objects of a patch category, then we can use generic tools to version it.

Version Control Primitives

Version control systems are typically built from two parts.

The first part is a content-addressed storage. This consists of:

  • A set of bytestrings A
  • A mapping lookup from hashes to bytestrings, such that for s in A, lookup[hash(s)] = s
  • For each bytestring a, a collection P[a] of hashes of other bytestrings, which we call pointers.

The third component allows the block storage to track the dependencies of bytestrings. This is useful because, in order to interpret a certain bytestring, it might be necessary to have certain other bytestrings. Keeping track of dependencies lets us know, when syncing or garbage collecting, that we should also sync dependencies, or not throw them out. In git, we might store the diff for a commit as a bytestring in the set of bytestrings, and then store the parents of that commit as pointers.

Different computers contain local content-addressed storages which they synchronize. Synchronization is easy, because content-addressed storages are conceptually append-only sets, so the only possible result of synchronization is to enlarge the set of bytestrings, which can be done monotonically.

The second part is a mutable reference storage. This consists of a mapping from a collection of paths (i.e. strings like olynch/bugfix) to hashes that are available in the local content-addressed storage. This mapping can be mutated at will.

In git, these are used for branches and tags. Conceptually, branches and tags are the same thing, they are just used in different ways; branches are meant to be updated, while tags are meant to stay the same once assigned.

I propose a slight elaboration of the mutable reference store, which gives it some better properties.

First of all, each “node” (i.e. repository) should have a UUID. Then, references should be scoped to a UUID, instead of doing what git does which is just have a naming convention of <remote name>/branch to track remote branches. Each UUID should also have a URL for where it can be found, and a name.

Secondly, instead of mutating the hash assigned to each reference, there should just be a set of “timestamped reference change events”. This means that even if a commit is not in the current commit history of a reference, one can still follow the reference itself back in time, which solves the problem of history being lost after a rebase.

Finally, as a consequence of this, it makes sense to store the reference storage itself inside the content-addressed storage, as a hash-consed list of change events. Then the reference storage can be just a single hash per remote. Syncing from a remote just pulls in all of the changes. Each list of change events is unique to an instance, so there are never any conflicts.
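Here is one way such an event-sourced reference store might look, sketched in Python with a plain dict standing in for the content-addressed storage; all names here are hypothetical.

```python
import hashlib, json, time

class RefLog:
    """Sketch: reference changes as a hash-consed linked list of
    timestamped events. The whole reference state of a node is then a
    single head hash, and history survives rebases."""
    def __init__(self, node_uuid: str):
        self.node = node_uuid
        self.blobs = {}    # stands in for the content-addressed storage
        self.head = None   # hash of the latest change event

    def set_ref(self, path: str, target: str) -> str:
        event = json.dumps({
            "node": self.node, "path": path, "target": target,
            "time": time.time(), "prev": self.head,
        }, sort_keys=True).encode()
        h = hashlib.sha256(event).hexdigest()
        self.blobs[h] = event
        self.head = h
        return h

    def resolve(self, path: str):
        """Walk the event list back in time to the latest assignment."""
        h = self.head
        while h is not None:
            e = json.loads(self.blobs[h])
            if e["path"] == path:
                return e["target"]
            h = e["prev"]
        return None

log = RefLog("node-1234")
log.set_ref("olynch/bugfix", "aaa111")
log.set_ref("olynch/bugfix", "bbb222")   # e.g. after a rebase
assert log.resolve("olynch/bugfix") == "bbb222"
```

Because each event records its predecessor, older targets like "aaa111" remain reachable by walking the list, even though no current branch points at them.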

Of course, in practice a “running” database will want a “cache” of the reference storage for performance, but that cache is a pure function of the storage, so maintaining it is a purely local problem, not a synchronization problem.

Implementation

With the definition of a patch category, and the primitives at hand, it is fairly straightforward to implement a version control system.

A commit is a serialization of a patch (i.e. morphism in the patch category), along with a pointer to a previous commit, or no pointer in the case of a patch out of the initial object. To recreate an object from a commit, one simply follows the linked list of patches back to the initial object, and then applies them in order (i.e. repeatedly takes codomains) to arrive at an object of the patch category. A commit might also contain metadata, like author, timestamp, and a list of commit messages (when squashing commits, we append their lists of commit messages).
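A sketch of this commit structure, on a toy patch category where a patch is just a list of species to add; `apply_patch`, `checkout`, and the JSON encoding are illustrative choices, not fixed by the design.

```python
import hashlib, json

store = {}  # hash -> serialized commit, standing in for content storage

def commit(patch, parent=None, meta=None):
    """A commit: a serialized patch, an optional parent pointer, and
    metadata, stored content-addressed."""
    data = json.dumps({"patch": patch, "parent": parent,
                       "meta": meta or {}}, sort_keys=True).encode()
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def apply_patch(obj, patch):
    return obj + patch            # toy patch: a list of new species

def checkout(h):
    """Follow parent pointers back to the initial object, then apply
    the patches in order (i.e. repeatedly take codomains)."""
    patches = []
    while h is not None:
        c = json.loads(store[h])
        patches.append(c["patch"])
        h = c["parent"]
    obj = []                      # the initial object 0
    for p in reversed(patches):
        obj = apply_patch(obj, p)
    return obj

c1 = commit(["wolf"], meta={"author": "olynch"})
c2 = commit(["sheep"], parent=c1)
assert checkout(c2) == ["wolf", "sheep"]
```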

Given two commits, we can merge them by following them back to a common ancestor, composing the patches for each commit back to the common ancestor using the categorical composition, and then adding them together using the monoid structure on morphisms out of the common ancestor. Note that this creates a new commit whose parent is the common ancestor. While this may appear to “lose history”, because of the design of the reference store we don’t actually lose history. This replicates the common pattern of squashing before merging.
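The merge algorithm can be sketched on a toy patch category where patches are lists of added species, composition is concatenation, and + on patches out of the same object is also concatenation (a pushout here, since independently added species never interact). Names like `compose_back_to` are invented for the sketch.

```python
commits = {}   # id -> (patch, parent)

def add_commit(cid, patch, parent=None):
    commits[cid] = (patch, parent)

def ancestors(cid):
    out = []
    while cid is not None:
        out.append(cid)
        cid = commits[cid][1]
    return out

def compose_back_to(cid, ancestor):
    """Compose the patches from `ancestor` (exclusive) up to `cid`."""
    patch = []
    while cid != ancestor:
        p, cid = commits[cid]
        patch = p + patch          # categorical composition
    return patch

def merge(a, b):
    """Find a common ancestor, compose each side back to it, and add
    the results with +. The new commit's parent is the ancestor."""
    common = next(x for x in ancestors(a) if x in set(ancestors(b)))
    merged = compose_back_to(a, common) + compose_back_to(b, common)
    return (merged, common)

add_commit("base", ["grass"])
add_commit("mine", ["wolf"], parent="base")
add_commit("yours", ["sheep"], parent="base")
patch, parent = merge("mine", "yours")
assert parent == "base" and sorted(patch) == ["sheep", "wolf"]
```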

References in the reference storage all point to commits. Operations like fetching and pushing happen via syncing reference stores and content stores between repositories.

Further work: compositional structure

@david.jaz came up with an idea for how to represent “mutable hierarchical compositions of systems” via polynomials in the category of delta lenses. If this all works out, this should allow us to construct a patch category from a “patch operad and patch algebra” where the objects are nested compositions of systems.

That is, an object consists of some syntax for composing systems (i.e. a wiring diagram), along with a subsystem for each of the “boxes” in that syntax. These subsystems themselves might be primitive subsystems, or they may again be a wiring diagram with subsystems of their own.

If the compositional syntax and the primitive subsystems are both objects in patch categories, then I think that a “composed system commit” should be represented by a tuple of “syntax commit” and then an assignment of boxes in that syntax commit to “system commits”, where a “system commit” is either a primitive system commit, or another composed system commit.
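One possible shape for such a recursive commit, sketched as Python dataclasses; the names and fields here are guesses at what the eventual design might look like, not a settled proposal.

```python
from dataclasses import dataclass
from typing import Dict, Union

# A "system commit" is either a primitive-system commit, or a syntax
# (wiring diagram) commit together with an assignment of boxes to
# further system commits.
@dataclass
class PrimitiveCommit:
    patch_hash: str                     # hash of a primitive-system commit

@dataclass
class ComposedCommit:
    syntax_commit: str                  # hash of the wiring-diagram commit
    box_assignments: Dict[str, "SystemCommit"]   # box name -> subsystem

SystemCommit = Union[PrimitiveCommit, ComposedCommit]

leaf = PrimitiveCommit(patch_hash="abc123")
nested = ComposedCommit(syntax_commit="def456",
                        box_assignments={"predator": leaf,
                                         "prey": PrimitiveCommit("789aaa")})
```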

Of course, the mathematics of this needs to be worked out. But in principle I think that this makes sense, and is not terribly difficult to implement, based on what I just sketched out.


Just a couple of remarks:

  • Every map in a patch category has to be epi to get the pushout-monoids to be homomorphic, since if you precompose a pushout of (\mathrm{id},\mathrm{id}) with some map f then the result is a pushout if and only if f is epi. I figure every category of epis with an initial object is equivalent to a patch category.

  • The suggested algorithm for applying a commit seems to depend on the assumption that I have a nice data structure for encoding a morphism of my patch category without explicitly encoding its domain and codomain. Makes sense, just had me confused for a bit.

  • How does this story compare to the Pijul approach?

Interesting about epi-ness; I think that this speaks to somehow using UUIDs for part numbers, in such a way that two changes which add the same UUID are actually idempotent. So the patches are kind of like spans in a poset; the diff between two C-sets is given by a literal set difference. In the category of spans in a poset, are all the morphisms epi?

Let’s see, a span in a poset (with binary meets) is just a triple (a,b,c) with b\le a,c and the composition (a,b,c)\cdot (c,d,e) is (a,b\wedge d,e), so (a,b,c)\cdot (c,d,e)=(a,b,c)\cdot (c,d',e) if and only if b\wedge d=b\wedge d', but I guess no, that doesn’t generally imply (c,d,e)=(c,d',e), i.e. d=d'.
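This computation can be checked concretely in the poset of subsets of \{1,2\} ordered by inclusion, with meet given by intersection:

```python
# A span (a, b, c) has b <= a and b <= c; composition takes the meet
# of the two middle legs, as in the formula above.
def compose(s, t):
    (a, b, c), (c2, d, e) = s, t
    assert c == c2                      # spans must be composable
    return (a, b & d, e)                # & is intersection = meet

a = c = e = frozenset({1, 2})
b = frozenset({1})
f = (a, b, c)                           # the candidate epi
g = (c, frozenset({1}), e)              # d  = {1}
h = (c, frozenset({1, 2}), e)           # d' = {1, 2}
assert compose(f, g) == compose(f, h)   # b ∧ d = b ∧ d' = {1}
assert g != h                           # but d != d': f is not epi
```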

I was thinking that if we like the epi-ness it might be because a morphism is basically a sequence of patches, so that precomposing just makes the sequence longer, without ever erasing any information.