For some time now, I’ve been mulling over an idea for a small multimodal health study that uses ACSets as the core data structure to relate the different modalities of data I have, with the goal of investigating a research question (of any kind at the moment) as an exercise in what is possible with this data structure. To describe my data, here are the datasets I have (a rough schema sketch follows this list):
- Dataset 1: Climate data sampled as a regular time series
  - Exists at the level of geographic regions (census groups, states, territories, counties, etc.)
  - Sampling can occur hourly or even more frequently over long time horizons (i.e. decades)
  - Exists as a database or file
- Dataset 2: Electronic health records sampled as an irregular time series
  - Exists at the individual person level
  - Sampling can occur very irregularly over long time horizons (i.e. decades)
  - Exists as a large database
- Dataset 3: Census microdata sampled as a regular time series over a long time horizon
  - Can exist at general population levels, geographic regions, and more
  - Sampling occurs regularly but is very sparse over decades' worth of time
  - Wide variety of data ranging from socioeconomic to demographic information
  - Exists as a database or file
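To make this a bit more concrete (and please correct me if this is not how one would idiomatically set things up), here is a very rough sketch of how I imagine a single schema might relate these modalities, assuming Catlab.jl / ACSets.jl-style syntax. Every object, morphism, and attribute name below is a placeholder I made up:

```julia
using Catlab  # assuming Catlab.jl (which provides the ACSets machinery) is available

# Hypothetical schema: every name below is a placeholder I made up.
@present SchHealthStudy(FreeSchema) begin
  # Combinatorial part: objects and the morphisms (think foreign keys) between them
  (Patient, Region, ClimateReading, Encounter, CensusRecord)::Ob
  lives_in::Hom(Patient, Region)              # links Dataset 2 to Datasets 1 and 3
  reading_region::Hom(ClimateReading, Region) # Dataset 1: climate sampled per region
  encounter_patient::Hom(Encounter, Patient)  # Dataset 2: EHR encounters per patient
  census_region::Hom(CensusRecord, Region)    # Dataset 3: census microdata per region

  # Attribute part: concrete values hanging off the combinatorial skeleton
  (Date, Temp, Code, Income)::AttrType
  reading_date::Attr(ClimateReading, Date)
  temperature::Attr(ClimateReading, Temp)
  encounter_date::Attr(Encounter, Date)
  diagnosis::Attr(Encounter, Code)
  census_date::Attr(CensusRecord, Date)
  median_income::Attr(CensusRecord, Income)
end

# Generate a concrete ACSet type, indexing the morphisms used for lookups/joins
@acset_type HealthStudy(SchHealthStudy,
  index=[:lives_in, :reading_region, :encounter_patient, :census_region])
```

My (possibly naive) reading of the paper is that the morphisms would carry the combinatorial structure relating the datasets, while the dates, temperatures, diagnosis codes, and income values would live as attributes.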
What I have been curious about is how ACSets could truly be operationalized in this problem space.
Suppose, for example, I am running a study where I am trying to predict which patients go on to have a heart attack after a history of heat-related illnesses (or illnesses highly correlated with high temperatures). How I might normally do this is the following (bear with me, I am skipping over some of the specific nuance; a rough code sketch of steps 3-5 follows the list):
1. Define a patient population who have had a history of heat-related illnesses but have not had a heart attack at some initial time point, t_1 (using Dataset 2)
2. Define a patient population who have had heart attacks at some later time point, t_2 (using Dataset 2)
3. Harmonize Datasets 1 and 3 against the patient population defined in step 1 using some key or index (maybe geographic location or demographic group)
4. Construct a data frame with a variety of prediction features and an outcome column derived from the population defined in step 2
5. Run a simple logistic regression to see how well my features predict which patients from the initial group go on to have heart attacks
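For reference, the conventional version of steps 3-5 looks roughly like the sketch below. The tables, column names, join key, and values are all made up, and I am assuming DataFrames.jl and GLM.jl for the join and the logistic regression:

```julia
using DataFrames, GLM  # assuming DataFrames.jl and GLM.jl

# Toy, made-up data standing in for the real extracts (all names are placeholders).
# cohort: one row per patient from step 1, with the outcome column from step 2
cohort = DataFrame(
    patient_id = 1:12,
    region_id = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    heat_illness_count = [3, 0, 5, 1, 2, 4, 0, 2, 1, 3, 0, 6],
    had_heart_attack   = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0])
climate = DataFrame(region_id = 1:3, mean_summer_temp = [31.2, 27.5, 29.8])   # Dataset 1
census  = DataFrame(region_id = 1:3, median_income    = [42.0, 61.0, 55.0])   # Dataset 3 (in $1000s)

# Step 3: harmonize Datasets 1 and 3 against the cohort via a shared region key
df = leftjoin(leftjoin(cohort, climate, on = :region_id), census, on = :region_id)

# Steps 4-5: assemble features and fit a simple logistic regression
model = glm(@formula(had_heart_attack ~ heat_illness_count + mean_summer_temp + median_income),
            df, Binomial(), LogitLink())
```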
In this situation, I would do a lot of manual harmonization and some feature engineering. Although this example is somewhat contrived, it is similar to some of the questions I would love to explore. Additionally, I know a possible critique of my line of thought is, “well, if this approach works for you, why don’t you just continue using it?” The whole reason I am going through this mental exercise is that I am very curious to see what sort of questions or process improvements the adoption of ACSets in my workflow could enable or produce.
I know that @slwu has been looking at ACSet use here and there and I’ve come across some of the applied category theory work of Simon Frost and, of course, Nathaniel Osgood’s work. Finally, I am just very curious about pushing the bounds of what can be done with this data structure.
Reading up on the original ACSet paper, *Categorical Data Structures for Technical Computing*, I know the following about ACSets (a small example to check my understanding follows the list):
- Act as a unifying abstract data type
- Particularly useful for data structures such as graphs and data frames
- Combinatorial data could be thought of as the data that:
  - Exists solely within a graph structure
  - Defines vertices in a graph structure
  - Defines edges in a graph structure
  - Is only defined up to isomorphism: the set of all vertices and the set of all edges can be relabeled to give an isomorphic structure, as long as edge-vertex relationships are maintained
- Attribute data has something concrete that describes it apart from a graph structure
  - Encodes symmetries or relationships in the data that are important to that data
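To check my understanding of that combinatorial/attribute split, the example I keep coming back to is the weighted graph from the paper. Here is a minimal sketch, again assuming Catlab.jl / ACSets.jl-style syntax, with made-up weights and my own schema names (to avoid clashing with the graph types the library already provides):

```julia
using Catlab  # assuming Catlab.jl

# The weighted graph: V, E, src, and tgt form the combinatorial data,
# while the edge weights are attribute data drawn from a concrete type.
@present SchMyWeightedGraph(FreeSchema) begin
  (V, E)::Ob
  (src, tgt)::Hom(E, V)
  Weight::AttrType
  weight::Attr(E, Weight)
end

@acset_type MyWeightedGraph(SchMyWeightedGraph, index=[:src, :tgt])

# A tiny instance with made-up weights: a path 1 -> 2 -> 3
g = @acset MyWeightedGraph{Float64} begin
  V = 3
  E = 2
  src = [1, 2]
  tgt = [2, 3]
  weight = [0.5, 1.5]
end

# Relabeling the vertices/edges gives an isomorphic graph (only the incidences
# matter), whereas the weight values are concrete and not defined up to isomorphism.
```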
And the paper mentions that this could lead to the development of novel data structures, but at the moment I am not quite seeing how I could construct such structures myself. What else am I missing or not understanding?
In conclusion, the following questions are still floating in my mind:
- How could I use ACSets to associate my datasets together?
- How can I perform a statistical method on top of the relations I defined with an ACSet?
- What am I missing in my mental model of developing a study with ACSets as a core data structure? What other categorical knowledge should I recall about this work?
Like I said, this really is more of an exercise to see what is possible as well as for me to “push the envelope” in my own research. If I can come out of this exercise with more knowledge about even greater visions for ACT applications, that would be a huge win for me. A paper or at least a blog post would be another interesting outcome too!
Thanks all!
jz
P.S. I was also inspired by your work on ODEs and ACSets @kris-brown so I am going to CC you and @owenlynch here.