Data Systems in Computational Biology
The systems that shape my approach to bioinformatics
Research engineering sits at a weird intersection: the data is heterogeneous, schemas evolve constantly, and the questions asked rarely fit into the boxes that traditional data infrastructure expects. This is a tour through the systems that shape how I think about data engineering.
The Metadata Problem: Graphs
For my first job, I inherited a Laboratory Information Management System (LIMS) built on Google Sheets. It worked, but was struggling at scale (the year that I joined, the genomics platform alone completed 2,903 sequencing runs and 1.4M+ libraries).
The primary problem I faced was that biological experiments don't have fixed schemas. A sequencing run might involve 50 metadata fields one week and 80 the next, depending on the protocol, the lab, and whatever novel technique someone just published.
The traditional approach—endless ALTER TABLE migrations—doesn't scale when dozens of labs are submitting data with different conventions and only 20% of a full-time engineer is available for maintenance. I wanted something that could treat schema heterogeneity as a feature, not a bug.
Graph databases were the core technology I experimented with. Adding new node types, relationships, or properties doesn't require changing existing data or rewriting application code—schema transitions are as simple as inserting new graph structures.
Stepping back through data provenance is one of the high-friction yet high-frequency tasks any (data) scientist engages in—particularly when experiments or data come out 'wrong'. The graph let scientists traverse relationships that would've required multi-way joins in SQL. Samples connect to sources, libraries connect to sequencing runs, runs connect to pipelines, pipelines produce outputs that feed into analyses:
MATCH (sample:Sample)-[:SEQUENCED_IN]->(run:Run)
-[:PROCESSED_BY]->(pipeline:Pipeline)
-[:GENERATED]->(result:Result)
WHERE result.name = 'zebrafish_123_repeat_2'
RETURN sample, run, pipeline
The biggest unanswered question I had was scalability. As data load increases over years, do graph queries remain performant—particularly when a given node accumulates thousands of edges? This is the well-documented supernode problem: nodes with disproportionately many relationships degrade traversal performance. Mitigations exist (relationship type filtering, relationship property indexes, proxy node modeling), but the concern is real for LIMS where entities like common reagents or popular gene ontology terms naturally attract dense connections.
Pipeline Orchestration: From DSLs to Expressions
Bioinformatics pipelines are fragile and finicky. Dependencies break, reference datasets change, multiple languages get roped together, scalability is challenging, and producing deterministic outcomes can be a nightmare.
The DSL Approach: Nextflow and Viash (CZ Biohub, 2021-2023)
Nextflow makes pipeline deployment simpler by separating what needs to run from how it runs. The documentation puts it directly: while a process defines what command or script has to be executed, the executor determines how that script is actually run on the target system.
What runs is a series of process blocks chained together into a workflow and run in containers:
params.query = "/some/data/sample.fa"
params.db = "/some/path/pdb"

process blast_search {
    input:
    path query
    path db

    output:
    path "top_hits.txt"

    script:
    """
    blastp -db $db -query $query -outfmt 6 > blast_result
    cat blast_result | head -n 10 | cut -f 2 > top_hits.txt
    """
}

process extract_top_hits {
    input:
    path top_hits
    path db

    output:
    path "sequences.txt"

    script:
    """
    blastdbcmd -db $db -entry_batch $top_hits > sequences.txt
    """
}

workflow {
    def query_ch = channel.fromPath(params.query)
    blast_search(query_ch, params.db)
    extract_top_hits(blast_search.out, params.db).view()
}
How a pipeline runs is organized by executor configuration, which adapts a generic pipeline to run on AWS Batch, an HPC cluster, Kubernetes, locally, or elsewhere—without changing the pipeline code.
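In practice this lives in a nextflow.config file next to the pipeline. A minimal sketch of switching the same pipeline onto AWS Batch (queue name and region are hypothetical values, not recommendations):

```groovy
// nextflow.config — run every process on AWS Batch instead of locally.
process {
    executor = 'awsbatch'
    queue = 'my-batch-queue'   // hypothetical Batch job queue
}
aws {
    region = 'us-west-2'
}
```

Deleting these blocks (or overriding them with a profile) drops the pipeline back to local execution with no changes to the process definitions.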
One of the limitations of Nextflow is that it is designed around stable process blocks, making it clunky when components of your pipeline involve unstable experimental methodologies. The Groovy-based DSL also presents a learning curve for teams more comfortable in Python or R.
Viash extends Nextflow by adding a component abstraction layer. Instead of writing Nextflow-specific processes, you define components in a platform-agnostic config.vsh.yaml format paired with a script (Bash, Python, R, Scala, JavaScript, or C#) that Viash compiles into Nextflow modules, Docker containers, or standalone executables. Each component gets automatic CLI generation with --help, argument parsing, and type validation.
Viash is an incredibly helpful addition to Nextflow, making it simple to chain together complex workflows with component-level containerization. However, integrating with Python-native ML tooling still requires subprocess bridges or serialization hacks. Nextflow/Viash is best suited to quickly implementing and deploying stable pipelines.
The Expression Approach: Redun (insitro, 2023-2025)
Redun—yet another redundant workflow engine—takes a different approach. Instead of defining pipelines in a DSL, you write regular (ideally pure) Python functions decorated with @task:
from redun import task

@task()
def step1(x):
    return x + 1

@task()
def adder(values):
    # Arguments are always guaranteed to be concrete values.
    return sum(values)

@task()
def main():
    results = [step1(1), step1(2), step1(3)]
    # `results` is now a list of Expressions.
    # To add them together we need to pass them to another task.
    total = adder(results)
    return total
A task call returns immediately with a TaskExpression rather than blocking for results. A scheduler builds the DAG by reducing these expressions, analyzing dependencies, and parallelizing independent tasks on the executor (local threads/processes, AWS Batch). An optimized dependency graph emerges naturally from how you compose function calls.
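The deferral mechanism can be sketched in a few lines of plain Python. This is a toy model, not redun's actual classes—real reduction is incremental, cached, and parallel:

```python
# Toy model of expression-based scheduling (not redun's real implementation).
class Expr:
    def __init__(self, fn, args):
        self.fn, self.args = fn, args

    def evaluate(self):
        # Reduce child expressions (including ones nested in lists) first.
        def reduce(a):
            if isinstance(a, Expr):
                return a.evaluate()
            if isinstance(a, list):
                return [reduce(x) for x in a]
            return a
        return self.fn(*[reduce(a) for a in self.args])

def task(fn):
    # Calling a "task" builds an Expr instead of running fn immediately.
    return lambda *args: Expr(fn, args)

step1 = task(lambda x: x + 1)
adder = task(lambda values: sum(values))

total = adder([step1(1), step1(2), step1(3)])  # nothing has executed yet
print(total.evaluate())  # reducing the expression tree runs the DAG: 9
```

The point is that composition happens in ordinary Python while execution is deferred to whoever walks the tree—which is exactly the hook a scheduler needs.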
Content-addressed caching means memoization is based on both data and code. Cache keys are computed as hash([hash(task), hash(args), hash(kwargs)]). When upstream tasks change, the scheduler can skip unchanged portions of the graph, re-evaluating only affected downstream tasks.
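The scheme can be illustrated with a stdlib sketch (redun hashes serialized task definitions and values; this toy stands in with JSON serialization):

```python
import hashlib
import json

def stable_hash(obj) -> str:
    # Content hash of any JSON-serializable value.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def cache_key(task_source: str, args: list, kwargs: dict) -> str:
    # Mirrors hash([hash(task), hash(args), hash(kwargs)]).
    return stable_hash(
        [stable_hash(task_source), stable_hash(args), stable_hash(kwargs)]
    )

k_before = cache_key("def step1(x): return x + 1", [1], {})
k_same   = cache_key("def step1(x): return x + 1", [1], {})
k_edited = cache_key("def step1(x): return x + 2", [1], {})

assert k_before == k_same    # identical code + inputs: cache hit
assert k_before != k_edited  # edited code: affected tasks re-evaluate
```

Because the task's own code participates in the key, editing one function invalidates only that function's results and their descendants—not the whole pipeline.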
Data provenance is first-class. Redun persists a content-addressable Call Graph in a SQL database: CallNode objects with inputs, outputs, and code, structured similarly to a Merkle tree. In a given pipeline, you can trace the exact sequence of tasks, inputs, and code that produced any output:
redun log --task step1
redun log <execution_id> --output # trace provenance of a specific result
Redun also has several well-designed abstractions:
- executor and scheduler configs deploy pipelines across compute environments (local, AWS Batch, Spark).
- File classes (backed by fsspec) allow engineers to work transparently across local storage, S3, GCS, and any fsspec-compatible backend—critical when bioinformatics workflows pass around files rather than in-memory results.
- Backend allows users to define where and how the Call Graph is persisted.
The Python-native approach has consequences. Control flow uses standard Python—conditionals, loops, recursion, higher-order functions. Error handling is try/except. Type checking works. The entire Python ecosystem (Pandas, NumPy, PyTorch) integrates naturally. Writing experimental pipelines or code is intuitive and accessible.
One tradeoff is ecosystem maturity. Nextflow has years of community bioinformatics modules (nf-core), tools, and infrastructure that Redun doesn't have. A second is that writing pure Python functions in practice is sometimes unrealistic—side effects creep in. Third, containerization can become complex for multi-modal pipelines (component-level containerization comes ready-made with Viash + Nextflow).
All Data is an Array
All data can be construed as some form of array (a table is a 2D array with heterogeneous column types, an image is a 2D array, an index is a 1D array, a graph is a pair of inter-related arrays...).
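As a toy illustration of the last case (the entity names are hypothetical), a graph reduces to a node array plus an edge array of index pairs:

```python
# A graph as two inter-related arrays: entities and index-pair edges.
nodes = ["sample_1", "run_7", "pipeline_rna"]   # 1D array of entities
edges = [(0, 1), (1, 2)]                         # pairs of indexes into `nodes`

# Traversal is just index lookups between the two arrays.
downstream_of_sample = [
    nodes[dst] for src, dst in edges if nodes[src] == "sample_1"
]
print(downstream_of_sample)  # ['run_7']
```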
One of the most frustrating problems biotech faces is data system fragmentation. I observed significant friction from having to navigate SQL databases, cloud file storage, and NoSQL systems in order to link together related data. This is accentuated by the fact that scientific data comes in a variety of file formats, each requiring its own boutique Python or R library. Navigating these systems is not a good use of any computational biologist's or scientist's time.
Despite the fragmentation, there are fundamental constants. Data is frequently worked with in memory as a NumPy array or Pandas DataFrame. Data is frequently retrieved in a slice-by-slice fashion. Annotations are frequently accessed to contextualize data. An emergent thought from this: why don't we consolidate into a singular data API? This was heavily influenced by experience working with AnnData—which provides a dynamic yet consistent UX. Several projects have already been built around this paradigm, notably XArray and TileDB.
H5AD
H5AD is the on-disk file format for AnnData objects, built on top of HDF5. While single-cell transcriptomics datasets were limited to fewer than 500,000 cells, this format worked well. It enabled a hierarchical organization of NumPy arrays and Pandas DataFrames in a single file. The AnnData API constructs this into an in-memory data object aligned on shared indexes. The on-disk representation is consistent with the in-memory representation.
As single-cell datasets have grown, the limitations of H5AD became apparent. It is possible to read individual slots from an H5AD file without loading the whole thing (using anndata.read_elem() or passing backed='r'):
import h5py
from anndata import read_elem

with h5py.File("dataset.h5ad", "r") as f:
    obs_df = read_elem(f["obs"])  # read only cell metadata
However, backed mode has significant limitations: only X is lazily accessed—all annotations (obs, var, obsm, uns) still load into memory. Files can't be opened by multiple processes simultaneously. Reads and writes cannot be performed directly to cloud-object stores (where most large datasets are hosted). As larger single-cell atlases grew to millions of cells across tens of thousands of genes, even the backed X matrix became unwieldy.
TileDB
TileDB represents a fundamentally similar yet improved approach: data is stored as multi-dimensional arrays with native support for sparse data, cloud storage (S3, Azure, GCS), and parallel I/O. Rather than loading an entire dataset, you query slices of any array using NumPy-like syntax:
import tiledb

def read_slice(array_name):
    with tiledb.open(array_name, mode="r") as A:
        # Slice only rows 1, 2 and cols 2, 3, 4.
        data = A[1:3, 2:5]
        print(data["a"])
TileDB reads only the relevant 'tiles' of an array into memory and returns the sliced portion. Dimensions act as the indexes (that users query by) and attributes represent a dictionary of data at a given index.
The tiledb.Config class provides extensive parameter tuning: tiling strategies, memory buffers, parallelization, cloud storage-specific parameters, and more. Decisions around what to set as dimensions and how to tile your data on-disk can significantly impact query performance. It is a little overwhelming but enormously valuable—particularly when fine-tuning custom data structures for machine-learning or analysis.
TileDB-SOMA builds on TileDB specifically to scale AnnData. It implements the SOMA (Stack of Matrices, Annotated) specification—an API jointly developed by the Chan Zuckerberg Initiative and TileDB. It is a hierarchical data structure unified by a shared integer index (soma_joinid):
- SOMAExperiment: The primary container, with two mandatory components: obs (observation annotations) and ms (a collection of measurements containing X matrices and axis annotations)
- SOMADataFrame: Multi-column tables with user-defined schemas and point indexing (obs, var, ...)
- SOMASparseNDArray: N-dimensional sparse arrays supporting disjoint access patterns—critical for larger-than-memory operations (X)
- SOMACollection: Persistent named containers enabling hierarchical organization (used for layers of X)
H5ADs are usually generated per dataset, scattered across a cloud object store. Dataset metadata is split between Jira tickets, Benchling, and S3 (obs or CSV files). Finding related datasets—say, all scRNA-seq from a specific cell line—can take weeks of hunting. A SOMA atlas unifies datasets into a queryable structure:
import tiledbsoma
census = tiledbsoma.open("s3://example-atlas/rna")
# Query cell-line annotations across all experiments
cells = census["census_data"]["homo_sapiens"].obs.read(
column_names=["cell_type", "dataset_id", "disease"],
value_filter="disease == 'normal'"
).concat().to_pandas()
TileDB-SOMA is ready-made for analysis and machine-learning at scale. The CZ CELLxGENE Census is the flagship deployment: 93M+ cells queryable through this exact API.
XArray
XArray introduces labeled, multi-dimensional arrays with lazy evaluation and partial reads:
import numpy as np
import xarray as xr

# Illustrative volume; a real scan would be loaded from disk.
radiology_data = np.zeros((192, 256, 192))

brain_scan = xr.DataArray(
    data=radiology_data,
    dims=["x", "y", "z"],
    coords={
        "x": np.arange(192),
        "y": np.arange(256),
        "z": np.arange(192),
    },
    attrs={"patient_id": "P001", "modality": "T1w"},
)

# Operations preserve labels
midline = brain_scan.sel(x=96)  # Named indexing
Labeled dimensions prevent an entire class of bugs—transposed arrays, off-by-one errors in axis selection, metadata that drifts out of sync with data.
XArray and TileDB are best understood as complementary rather than alternatives. TileDB is a storage engine with its own open-source, directory-based format optimized for cloud object stores and parallel I/O, with database features like time-travel versioning. XArray is a Python library for labeled multi-dimensional data that works on top of various storage backends—NetCDF4, HDF5, Zarr, and others. The key distinction: XArray gives you the in-memory representation and labeling semantics; the engineer is responsible for choosing how data is written to and read from disk (e.g. via Dask for parallelism, stored as Zarr or Parquet). TileDB bundles storage format, I/O optimization, and cloud-native access into a single system.
Tidy Data
Hadley Wickham's tidy data principles—articulated for R's tidyverse—further influenced how I think about data structures.
Tidy data is Codd's 3rd normal form adapted for statisticians: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. The distinction matters: a variable measures the same attribute across units (e.g., gene expression), while an observation contains all measurements for the same unit (e.g., a single cell). Fixed variables describing experimental design come first, followed by measured variables, ordered so related variables are contiguous.
This structure enforces alignment: if you subset cells, the expression matrix and all metadata stay synchronized. It also enables tool composition—the output of one tidy operation becomes valid input for another. Messy data violates these rules in predictable ways: column headers that aren't variable names, multiple variables encoded in one column, or observations scattered across files. Most can be resolved through melting, string splitting, and casting.
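Melting can be sketched with a toy messy table (the gene counts are hypothetical) whose column headers are values, not variable names:

```python
# Messy: the `cell_1`/`cell_2` headers are values of a `cell` variable.
messy = {"gene": ["TP53", "BRCA1"], "cell_1": [5, 0], "cell_2": [3, 2]}

# Tidy: one row per (gene, cell) observation; each variable is a column.
tidy = [
    {"gene": gene, "cell": cell, "count": messy[cell][i]}
    for i, gene in enumerate(messy["gene"])
    for cell in ("cell_1", "cell_2")
]
print(tidy[0])  # {'gene': 'TP53', 'cell': 'cell_1', 'count': 5}
```

This is what pandas.melt or tidyr's pivot_longer do at scale; after the melt, any tool expecting variable-as-column input composes cleanly with the result.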
This framework is a guidebook for how hierarchical data structures should be composed—particularly if all data is an array.
Visualization
Visualization in computational biology often means writing notebooks. This doesn't scale: every analysis requires custom code, results aren't interactive, and non-programmers can't explore the data. CELLxGENE Discover is one of my north stars when it comes to visualizing complex data. The full-stack codebase is well-organized and easily adapted. It is straightforward to trace how and where data is transformed and why it is transformed there based on compute requirements.
Layered Approach to Scientific Data Access
CELLxGENE's broader approach inspired how I build scientific tooling in a layered fashion.
This provides ample opportunity for pivots, refactors, and restructuring based on feedback loops. I tend to treat system design as an emergent property of writing code, rather than something I start with.
First Principles
Looking across these systems, a pattern emerges: the tools that work for biological data share certain properties.
Cloud compute abstractions. In an environment where you may have to frequently change where your code runs, effective abstractions over compute are essential. In my experience this has primarily meant executor abstractions (Nextflow executors, Redun executors) and filesystem abstractions (Redun's redun.File, TileDB's VFS).
Schema flexibility over rigid structure. Biology evolves faster than database migrations. Graph databases, sparse arrays, and document stores all handle evolving schemas better than normalized relational tables.
Lazy evaluation over eager loading. Datasets are too large to fit in memory. Query metadata (annotations) first, filter, then load only what's needed. This principle runs through AnnData's backed mode, TileDB's tile-based I/O, and XArray's lazy evaluation via Dask.
Labeled dimensions over positional indexing. The axis-order bugs in manual data joins are predictable and preventable. Named and shared dimensions—XArray's dims, SOMA's soma_joinid, tidy data's variable-as-column—catch errors at the point of access rather than downstream in analysis.
Composable abstractions over monolithic tools. Custom point solutions don't scale across multi-modal research organizations. The tools that endure are those with clean separation of concerns: Redun separating task logic from execution, Nextflow separating process from executor, TileDB separating storage from query.