# Variable-first categorical probability

A typical approach to probability theory within category theory is by considering the Kleisli category of the Giry monad on some suitable category of spaces, usually either finite or measurable.

This approach emphasizes the “pure-math” point of view on probability, that probability theory is about measures on measurable spaces. However, in applied math probability theory is treated differently; probability theory is seen as the study of *random variables*. From the scientist’s perspective, this makes a lot of sense; we don’t observe the underlying probability space, we just observe variables in the world. The typical “hack” to work around this in measure theory is that we postulate some measure space \Omega which is left unspecified, and then variables are measurable functions X \colon \Omega \to \mathbb{R}.

I have always felt this was somewhat unsatisfying from a math perspective, because it is non-compositional. We assume that there is a global world state \Omega and all variables are globally defined, which precludes thinking about “local” models that are composed after the fact.

It is also somewhat unsatisfying from a computer science perspective, because in computer science one actually would have to pick an \Omega. This is not how the practice of doing stochastic and statistical computations works. In practice, we have a stream of pseudo-random values, and then a variable is a way of translating a finite sequence of pseudo-random values into a single number.
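To make the "stream of pseudo-random values" picture concrete, here is a minimal sketch using only the Python standard library. The names `uniform_stream`, `coin`, `heads_in_three`, and `estimate_mean` are made up for illustration; the only point is that a variable is a function consuming finitely many values from a stream, with no global \Omega in sight.

```python
import random
from typing import Callable, Iterator

# A "variable" is any function that consumes finitely many values
# from a stream of pseudo-random numbers and returns a single number.
Variable = Callable[[Iterator[float]], float]


def uniform_stream(seed: int) -> Iterator[float]:
    """Endless stream of pseudo-random floats in [0, 1)."""
    rng = random.Random(seed)
    while True:
        yield rng.random()


def coin(stream: Iterator[float]) -> float:
    """Consumes one value: 1.0 for heads, 0.0 for tails."""
    return 1.0 if next(stream) < 0.5 else 0.0


def heads_in_three(stream: Iterator[float]) -> float:
    """Consumes three values: the number of heads in three flips."""
    return sum(coin(stream) for _ in range(3))


def estimate_mean(x: Variable, seed: int, n: int = 10_000) -> float:
    """Monte Carlo estimate of the expectation of x."""
    stream = uniform_stream(seed)
    return sum(x(stream) for _ in range(n)) / n
```

Calling `estimate_mean(heads_in_three, seed=0)` should land close to 1.5, and new variables are built by composing functions like `coin` rather than by enlarging a sample space.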

So I was led to search for an alternative foundation for probability theory that would have a more satisfactory (to me) treatment of random variables. In this post, I give a brief account of where this search has taken me.

The duality between algebra and geometry has been on my mind for the last couple of months (see Algebraic Geometry for the Working Programmer), and I had already been thinking about a “variable-first” approach to differential geometry following Jet Nestruev’s “Smooth Manifolds and Observables”. So I was led to look to algebra for some structure suitable for this problem.

I found this structure in commutative C^\ast-algebras, which I had learned about in a functional analysis class. Most of the content for this post is just from Conway’s “A Course in Functional Analysis”, so if you want to learn more about C^\ast-algebras I encourage you to start there.

Noncommutative C^\ast-algebras are an important part of quantum mechanics, but it turns out that commutative C^\ast-algebras can play a similar role for stochastic mechanics! This is interesting because it implies that C^\ast-algebras in general bridge the classical and quantum worlds, but I won’t go too far into this now.

But what are C^\ast-algebras? The idea is the following. Given variables X_1,X_2,\ldots, what can we do with them? Well, we can form new variables X_1 + X_2, X_1 X_2, \lambda X_1 for \lambda \in \mathbb{R}, etc. More generally, given any polynomial p(x_1,\ldots,x_n), we can make a new variable p(X_1,\ldots,X_n). Additionally, *under certain conditions*, if we have a sequence of variables Y_1,Y_2,\ldots, we can take their limit Y. Coming up with necessary conditions for such a convergence is a tricky problem in general, but a sufficient condition is that the variables form a Cauchy sequence for the supremum norm.

So let’s try and axiomatize this. The fact that we can take products, sums, and multiply by real numbers means that we have the structure of an \mathbb{R}-algebra (which you can either think of as a ring where you can multiply by real scalars, or a vector space where you can multiply elements of the vector space). The fact that we can take limits of convergent sequences means that we need a *norm* on the \mathbb{R}-algebra, and the \mathbb{R}-algebra needs to be *complete* with respect to this norm.

This gets us to a *Banach algebra* (nlab, wikipedia). A good Banach algebra to have as an example to keep in mind is L^\infty(\mathcal{X}) for \mathcal{X} a measure space.
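As a toy instance of these axioms, here is a sketch of L^\infty on a finite sample space, where a variable is just one number per outcome. The class name `Var` and its methods are hypothetical, chosen for illustration; in finite dimensions completeness is automatic, so the sketch only exhibits the algebra operations and the supremum norm.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Var:
    """A variable on a finite sample space, stored as one value per outcome."""
    values: Tuple[float, ...]

    def __add__(self, other: "Var") -> "Var":
        # Pointwise sum of two variables.
        return Var(tuple(a + b for a, b in zip(self.values, other.values)))

    def __mul__(self, other: "Var") -> "Var":
        # Pointwise product of two variables.
        return Var(tuple(a * b for a, b in zip(self.values, other.values)))

    def scale(self, lam: float) -> "Var":
        # Multiplication by a real scalar.
        return Var(tuple(lam * a for a in self.values))

    def norm(self) -> float:
        # Supremum norm: the least constant bound on |X|.
        return max(abs(a) for a in self.values)


X = Var((1.0, -2.0, 3.0))
Y = Var((0.5, 4.0, -1.0))

# p(X, Y) for the polynomial p(x, y) = x*y + 2*x
P = X * Y + X.scale(2.0)
```

The norm interacts with the algebra as a Banach algebra requires: `(X * Y).norm() <= X.norm() * Y.norm()` and `(X + Y).norm() <= X.norm() + Y.norm()` both hold for these values.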

C^\ast-algebras are then Banach algebras with one more feature. Before we get into that feature, let’s talk a little bit about probability theory in the setting of Banach algebras.

Assume that we have a Banach algebra A. What is a probability distribution in this setting? Well, given a probability distribution \mu and a variable X \in A, we can take the expectation \mathbb{E}_\mu[X] \in \mathbb{R}. In the variables-first perspective, we identify the probability distribution with the map \mathbb{E}_\mu \colon A \to \mathbb{R}.

What properties should this map have?

- It should be linear. \mathbb{E}_\mu[\lambda X + \eta Y] = \lambda \mathbb{E}_\mu[X] + \eta \mathbb{E}_\mu[Y]
- It should be the identity on constants. \mathbb{E}_\mu[\lambda] = \lambda
- It should be *bounded*: if X \leq a always (or almost surely), then \mathbb{E}_\mu[X] \leq a. If we interpret the norm on our Banach algebra as giving the “least constant bound” on the absolute value of a variable, then this is equivalent to \mathbb{E}_\mu[X] \leq \|X\|.
- It should be *positive*: \mathbb{E}_\mu[X^2] \geq 0
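The four properties can be checked by hand on a small example. The sketch below assumes a hypothetical three-outcome sample space with weights 1/2, 1/3, 1/6 and uses exact fractions so that the equalities are not blurred by floating point; `expect` and `sup_norm` are illustrative names.

```python
from fractions import Fraction as F

# Hypothetical three-outcome sample space with probabilities 1/2, 1/3, 1/6.
mu = [F(1, 2), F(1, 3), F(1, 6)]


def expect(mu, x):
    """Expectation of the variable x (one value per outcome) under mu."""
    return sum(p * v for p, v in zip(mu, x))


def sup_norm(x):
    """Least constant bound on the absolute value of x."""
    return max(abs(v) for v in x)


X = [F(1), F(-2), F(3)]
Y = [F(0), F(4), F(-1)]
lam, eta = F(2), F(-3)

# Linearity: E[lam X + eta Y] = lam E[X] + eta E[Y]
combo = [lam * a + eta * b for a, b in zip(X, Y)]
linear = expect(mu, combo) == lam * expect(mu, X) + eta * expect(mu, Y)

# Identity on constants: E[lam] = lam
unital = expect(mu, [lam] * 3) == lam

# Boundedness: |E[X]| <= ||X||
bounded = abs(expect(mu, X)) <= sup_norm(X)

# Positivity: E[X^2] >= 0
positive = expect(mu, [a * a for a in X]) >= 0
```

All four flags come out `True` here. This is only a sanity check on one example, not a proof of the general statement.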

This last condition is not quite right, however. This is because any complex Banach algebra is also a real Banach algebra, and in a complex Banach algebra, \mathbb{E}[i^2] = -1. Also, what should \mathbb{E}_\mu[i] be? So we need a better conception of what it means to be positive, and we need to let the expectation take values in the complex numbers.

Technology to do this is provided by the notion of a C^\ast-algebra. A C^\ast-algebra is a Banach algebra that has an involution (-)^\ast on it, such that (-)^\ast behaves like the adjoint operation. That is:

- (a + b)^\ast = a^\ast + b^\ast
- (ab)^\ast = b^\ast a^\ast
- (\lambda a)^\ast = \bar{\lambda} a^\ast for \lambda \in \mathbb{C}
- \|a\|^2 = \|a a^\ast\| = \|a^\ast a\|

In the case of A = L^\infty(\mathcal{X}), where we are just taking real-valued functions, then a^\ast = a. If we take complex-valued functions, then a^\ast is pointwise complex conjugation.
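Here is a small numerical check of the involution axioms for complex-valued functions on a three-point space, with pointwise conjugation as the involution and the supremum norm. The helper names (`star`, `mul`, `norm`, `close`) and the sample values are assumptions made for this sketch.

```python
import math


def star(a):
    """Involution: pointwise complex conjugation."""
    return tuple(z.conjugate() for z in a)


def mul(a, b):
    """Pointwise product."""
    return tuple(x * y for x, y in zip(a, b))


def norm(a):
    """Supremum norm."""
    return max(abs(z) for z in a)


def close(a, b, tol=1e-12):
    """Entrywise comparison up to a small tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


a = (3 + 4j, 1j, -2 + 0j)
b = (1 - 1j, 2 + 0j, 0.5j)
lam = 2 - 1j

# (ab)* = b* a*
anti_mult = close(star(mul(a, b)), mul(star(b), star(a)))

# (lam a)* = conj(lam) a*
conj_linear = close(
    star(tuple(lam * z for z in a)),
    tuple(lam.conjugate() * z for z in star(a)),
)

# ||a||^2 = ||a a*|| = ||a* a||
c_star_identity = (
    math.isclose(norm(a) ** 2, norm(mul(a, star(a))))
    and math.isclose(norm(a) ** 2, norm(mul(star(a), a)))
)

# Real-valued functions are fixed by the involution.
r = (1 + 0j, -2 + 0j, 0.5 + 0j)
self_adjoint = close(star(r), r)
```

Since this algebra is commutative, the order reversal in (ab)^\ast = b^\ast a^\ast is invisible; it only matters in the noncommutative case.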

The point is that for a proper conception of “positivity”, we need to have the structure of a C^\ast-algebra, and we need to require that

- \mathbb{E}_\mu[X X^\ast] \geq 0

For this post, we call a map A \to \mathbb{C} that satisfies conditions 1-4 (with positivity in this revised form) a positive functional.

It turns out that positive functionals on A=L^\infty(\mathcal{X}) are in bijective correspondence with probability measures on \mathcal{X}; this is the content of the famous Riesz-Markov-Kakutani theorem. This is connected to the fact that the distribution of a variable X is determined by the moments \mathbb{E}[X^n] for every n.

Notice one condition that we did *not* place on positive functionals: we did not require preservation of multiplication! This is because in general, \mathbb{E}_\mu[XY] \neq \mathbb{E}_\mu[X]\mathbb{E}_\mu[Y]. If we require preservation of multiplication, then \mu must be a Dirac delta distribution. This is because the only way that all variables can be independent is for the measure to be concentrated in a single point. We will call a positive linear functional that preserves multiplication and conjugation an m-functional (I couldn’t find/remember if there’s a standard short name for this, so I made that up).
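On a finite sample space the claim about Dirac deltas can be tested directly. By bilinearity it is enough to check \mathbb{E}_\mu[XY] = \mathbb{E}_\mu[X]\mathbb{E}_\mu[Y] on indicator variables, which is what the hypothetical helper `is_multiplicative` below does; the distributions `dirac` and `uniform` are example inputs.

```python
from fractions import Fraction as F


def expect(mu, x):
    """Expectation of the variable x (one value per outcome) under mu."""
    return sum(p * v for p, v in zip(mu, x))


def indicator(i, n):
    """Indicator variable of outcome i on an n-point sample space."""
    return [F(1) if j == i else F(0) for j in range(n)]


def is_multiplicative(mu):
    """Check E[XY] = E[X] E[Y] on all pairs of indicator variables."""
    n = len(mu)
    basis = [indicator(i, n) for i in range(n)]
    return all(
        expect(mu, [a * b for a, b in zip(x, y)]) == expect(mu, x) * expect(mu, y)
        for x in basis
        for y in basis
    )


dirac = [F(0), F(1), F(0)]   # all mass on the middle outcome
uniform = [F(1, 3)] * 3      # mass spread evenly
```

Here `is_multiplicative(dirac)` is `True` and `is_multiplicative(uniform)` is `False`: taking X = Y to be the indicator of outcome i forces p_i = p_i^2, so every weight must be 0 or 1.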

We think of an m-functional as a “point” of A, and a positive functional as a “state” on A; in fact some authors define a state in this way.

We can generalize functionals to general maps. Define a positive map f between C^\ast-algebras A and B to be a map that satisfies conditions 1-3 and additionally preserves positivity, so that positive elements of A are sent to positive elements of B. Then notice that if you have a point \phi \colon B \to \mathbb{C}, we can postcompose it with f to get \phi \circ f, which is a state on A. This means that we can think of f like a Markov kernel; it sends points to states! This is formally explored in this paper; I’m not totally up to date on the literature with this so there might be a better reference.

EDIT: Evan commented to tell me about the 2020 paper Gelfand duality for commutative von Neumann algebras, which gives this story in full detail; it turns out there’s a good amount of subtlety to it!
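In the finite case, the "positive maps are Markov kernels" picture can be spelled out in a few lines. The sketch below assumes A is functions on a three-point space and B is functions on a two-point space; the stochastic matrix `K` and the helpers `f`, `point`, and `state_from` are made up for illustration.

```python
from fractions import Fraction as F

# Hypothetical kernel: row b is a probability distribution over the
# three outcomes underlying A, one row per outcome underlying B.
K = [
    [F(1, 2), F(1, 2), F(0)],
    [F(0), F(1, 4), F(3, 4)],
]


def f(x):
    """Positive unital map A -> B: average x against each row of K."""
    return [sum(k * v for k, v in zip(row, x)) for row in K]


def point(b):
    """Evaluation at outcome b: a multiplicative functional on B."""
    return lambda y: y[b]


def state_from(b):
    """The composite of f followed by the point b: a functional on A."""
    phi = point(b)
    return lambda x: phi(f(x))


one = [F(1)] * 3
x = [F(2), F(0), F(4)]
s0 = state_from(0)

# s0 is the expectation against row 0 of K. It fixes constants but
# does not preserve products, so it is a state and not a point.
x_squared = [v * v for v in x]
is_point = s0(x_squared) == s0(x) * s0(x)
```

With these values `s0(x)` is 1 while `s0(x_squared)` is 2, so `is_point` is `False`. If every row of `K` were a Dirac delta, `f` would preserve products and points would go to points.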

Moreover, I believe (I have not checked this fully either, but I would be surprised if it’s not in the literature) that the dual category of C^\ast-algebras and positive maps has the structure of a Markov category when you take… one of the definitions of tensor product for the symmetric monoidal product. This is because every C^\ast-algebra has a monoid structure given by its multiplication, and a positive map is “deterministic” (i.e. sends points to points) when it preserves the multiplication, i.e. the monoid structure. So obviously in the dual category, it is deterministic precisely when it preserves the comonoid structure, which is the defining feature of a Markov category. If anyone knows a reference for somewhere which spells this out in more detail, I’d love to see it, otherwise I suppose I’d better roll up my sleeves and prove it myself.

I find this all interesting because it seems to me that “variables-first” is somehow more similar to the way that scientists use probability, and understanding this in depth might lead to better software abstractions for doing probabilistic programming and statistics. Additionally, in a future blog post I’m going to talk about the Hille-Yosida theorem, which characterizes continuous-time stochastic processes as well as a variety of other things, and the functional analysis framing is essential for that. Although I expect the category of C^\ast-algebras and positive maps to be equivalent as a Markov category to some Kleisli category of a Giry monad, it seems to me that C^\ast-algebras are classically better suited for thinking about dynamical systems, so I’m hoping that better understanding the connection will solve some issues in the category theory of continuous-time stochastic dynamical systems.