
RadiObject: A cloud-native radiology atlas

Building RadiObject


Radiology data is a rich and informative modality for computational analysis of human biology. There are so many features of a given scan that are 'invisible' to the human eye, yet capture vital prognostic data relevant to human health. The data formats used to store a given scan (or 'series' in radiology terms) were developed in the 1990s and early 2000s, and have not seen a facelift since. RadiObject is meant to provide an updated in-memory and on-disk representation of individual scans, as well as population-scale datasets, that is simple and easy to work with.

What's in a Series?

These are a few examples of key radiology modalities, though this list extends to many more:

| Modality | Physics | Biological Information |
| --- | --- | --- |
| CT | X-ray attenuation | Tissue density, calcification, anatomy |
| MRI | Magnetic resonance | Soft tissue contrast, water/fat content, diffusion |
| PET | Radiotracer uptake | Metabolic activity, receptor expression |

Any given series contains a wealth of quantitative and structural information hidden from the human eye, such as:

| Category | Examples | What It Captures |
| --- | --- | --- |
| First Order | Mean, entropy, skewness, sphericity | Intensity distribution and geometric morphology |
| Second Order | Contrast, correlation, run length emphasis | Spatial relationships between neighboring voxels |
| Higher-Order | Wavelet, Gabor, fractal features | Multi-scale structural analysis |

The goal of radiomics is to formalize the extraction of these features into methods. One such example of these invisible (Higher-Order) structures is fat infiltration into muscle (myosteatosis)—one of the single best predictors of longevity from CT scans. Pickhardt et al. (2023) demonstrated that AI-derived myosteatosis was the most important body composition predictor of mortality in asymptomatic adults, while Linge et al. (2021) showed adverse muscle composition carried a hazard ratio of 3.71 for all-cause mortality across 40,178 UK Biobank participants. In order to conduct these studies, data scientists had to analyze radiology 'data'.

The 'data' is a 3-dimensional array. Sometimes 4 for fMRI—where the fourth dimension is time. Each 'voxel' (or position) in this array is a physical measurement of tissue properties (CT scans measure tissue density). Variations across voxels in different dimensions (e.g. time) provide a quantitative measurement of other biological features, such as water content, protein binding, iron deposition and many more.
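Concretely, a series in memory is nothing more than a dense NumPy array of physical measurements. A minimal sketch (synthetic data at a reduced shape; the -190 to -30 HU fat window is a common body-composition convention, not a figure from this post):

```python
import numpy as np

# A CT series is just a dense array of physical measurements. Real scans are
# ~512 x 512 x 300 int16 voxels in Hounsfield Units (HU); synthetic data and
# a smaller shape stand in here.
rng = np.random.default_rng(42)
ct = rng.integers(-1000, 1000, size=(128, 128, 60), dtype=np.int16)

voxel = ct[64, 64, 30]          # one measurement of tissue density at (x, y, z)
print(ct.shape, ct.dtype)       # (128, 128, 60) int16

# Quantitative features fall out of simple array operations, e.g. the
# fraction of voxels in the approximate fat range of -190..-30 HU:
fat_fraction = ((ct > -190) & (ct < -30)).mean()
print(f"fat-range voxels: {fat_fraction:.1%}")
```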

How are these Arrays Represented?

DICOM files were the original representation of radiology data; the modern DICOM 3.0 standard, successor to the earlier ACR-NEMA versions, was released in 1993. Thirty years on, the format remains the clinical standard, but it was never designed for computational post-processing. Each 2D slice (or 'instance') from a given MRI scan is saved as a single .dcm file:

[ SINGLE DICOM FILE (.dcm) ]
+-------------------------------------------------------+
|  HEADER (Variable Length)                             |
|  Structure: (Group, Element) | VR | Length | Value    |
|  ---------------------------------------------------  |
|                                                       |
|  [Patient Level]                                      |
|  (0010,0010) PN: "Doe^John"                           |
|  (0010,0020) LO: "123456" (Patient ID)                |
|                                                       |
|  [Study Level]                                        |
|  (0020,000D) UI: "1.2.840..." (Study Instance UID)    |
|                                                       |
|  [Series Level] (The Linker)                          |
|  (0020,000E) UI: "1.2.840..." (Series Instance UID)   |
|  *All 512 files share this SAME ID.*                  |
|                                                       |
|  [Image/Slice Level] (The Unique Parts)               |
|  (0020,0013) IS: "1" (Instance Number)                |
|  (0020,0032) DS: "-120.5\-30.0\105.0" (Position X/Y/Z)|
|  (0028,0030) DS: "0.5\0.5" (Pixel Spacing)            |
+-------------------------------------------------------+
|                                                       |
|  PIXEL DATA (7FE0,0010)                               |
|  ---------------------------------------------------  |
|  Raw 2D Array (Uint16)                                |
|  Typically 512 x 512 pixels                           |
|                                                       |
|  [ 00 00 ... 04 A1 ... ]                              |
|                                                       |
+-------------------------------------------------------+

A single series is saved as a directory containing hundreds of these files, which must be composed into one scan (by software systems) using the 'Position' metadata in each file's header. Simply to construct a tensor for analysis, a user must:

  1. Open 500 files
  2. Parse 500 headers
  3. Sort them by float coordinates (handling floating point errors)
  4. Stack the arrays
  5. Handle missing slices (if the patient moved)
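In code, steps 2-5 amount to a sort-and-stack over (position, pixel array) pairs. Here is a minimal sketch with synthetic slices standing in for real DICOM reads (with pydicom, each pair would come from a file's ImagePositionPatient tag and pixel_array); `assemble_volume` is an illustrative helper, not part of any DICOM library:

```python
import numpy as np

def assemble_volume(slices, atol=1e-3):
    """Sort 2D slices by their z-coordinate and stack them into a 3D volume.

    `slices` is a list of (z_position, 2D array) pairs, as would be parsed
    from the ImagePositionPatient tag of each DICOM file.
    """
    # Sort by float z-coordinate
    slices = sorted(slices, key=lambda s: s[0])
    zs = np.array([z for z, _ in slices])

    # Detect missing slices: spacing between neighbors should be uniform
    spacing = np.diff(zs)
    if spacing.size and not np.allclose(spacing, spacing[0], atol=atol):
        raise ValueError("non-uniform slice spacing: a slice may be missing")

    # Stack 2D arrays into a (rows, cols, n_slices) volume
    return np.stack([arr for _, arr in slices], axis=-1)

# Synthetic example: 5 slices of 4x4, shuffled on arrival
order = [3, 0, 4, 1, 2]
slices = [(z * 1.0, np.full((4, 4), float(z))) for z in order]
vol = assemble_volume(slices)
print(vol.shape)          # (4, 4, 5)
print(vol[0, 0, :])       # z-values back in sorted order: [0. 1. 2. 3. 4.]
```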

How Much Waste Are We Talking About?

Let's do some napkin maths for a single CT scan (512×512 pixels, 300 slices, uint16):

| Resource | Per Volume | 100-Subject Study |
| --- | --- | --- |
| Files opened | 300 | 30,000 |
| Headers parsed | 300 × ~2 KB each = 600 KB | 60 MB of redundant metadata |
| Sorting overhead | 300 floats with IEEE 754 comparison | 30,000 floats |
| Raw pixel data | 512 × 512 × 2 bytes × 300 = 150 MB | 15 GB |

Each of those 300 file opens triggers a system call, a file descriptor allocation, and a header parse. On a standard SSD, opening and parsing a single DICOM file takes ~1-2ms—meaning just the I/O overhead of assembling a single volume is 300-600ms before you've touched the pixel data. Multiply that by 100 subjects across 4 modalities (CT, FLAIR, T1w, T2w) and you're spending 2-4 minutes purely on file system operations.

The redundancy is equally absurd: every one of those 300 files carries near-identical patient/study/series metadata in its header. For a 100-subject, 4-modality study, that's 120,000 files each containing duplicated information that only varies by a few bytes (instance number and position).
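The napkin math is easy to reproduce (figures per the assumptions above: 512×512 uint16 slices, 300 per volume, ~2 KB of header per file, 100 subjects):

```python
# Napkin math behind the waste table above
rows, cols, slices_per_vol, subjects = 512, 512, 300, 100
bytes_per_voxel = 2                                    # uint16

pixel_mb = rows * cols * bytes_per_voxel * slices_per_vol / 2**20
study_gb = pixel_mb * subjects / 1024
redundant_kb = slices_per_vol * 2                      # ~2 KB of header per file

print(f"pixel data per volume: {pixel_mb:.0f} MB")     # 150 MB
print(f"pixel data per study:  {study_gb:.1f} GB")     # ~14.6 GB (~15 GB)
print(f"redundant headers/vol: {redundant_kb} KB")     # 600 KB
```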

NIfTI is the next format in the analysis chain (pipelines typically convert DICOM → NIfTI as a first step). Its biggest improvement is consolidating DICOM's stack of 2D arrays into a single 3D array in its voxel data block.

[ NIfTI FILE (.nii / .nii.gz) ]
+-------------------------------------------------------+
|  HEADER (348 Bytes)                                   |
|  ---------------------------------------------------  |
|  [MetaData]                                           |
|   • dim[8]:      Shape (e.g., 512, 512, 300)          |
|   • pixdim[8]:   Voxel Size (e.g., 0.5mm, 0.5mm, 1mm) |
|   • datatype:    INT16, FLOAT32, etc.                 |
|                                                       |
|  [The "Rosetta Stone" - Coordinate Mapping]           |
|   • sform_code:  Method to align to world space       |
|   • srow_x:      [ Row 1 of Affine Matrix ]           |
|   • srow_y:      [ Row 2 of Affine Matrix ]           |
|   • srow_z:      [ Row 3 of Affine Matrix ]           |
+-------------------------------------------------------+
|  EXTENSION (Optional, usually variable length)        |
|  Custom JSON data, DICOM tags, etc.                   |
+-------------------------------------------------------+
|                                                       |
|  VOXEL DATA (The "Payload")                           |
|  ---------------------------------------------------  |
|  Raw binary block of numbers.                         |
|  No shape info here, just a long 1D stream of bytes.  |
|  Must be reshaped using 'dim' from header.            |
|                                                       |
|  [ 00 A4 ... FF 01 ... ] (Compressed with Gzip)       |
|                                                       |
+-------------------------------------------------------+
| Feature | NIfTI (.nii) | DICOM (.dcm folders) |
| --- | --- | --- |
| Data Model | A 3D Volume | A stack of 2D Slices |
| Metadata | Centralized (One Header) | Distributed (Redundant headers in every file) |
| Reading | f.seek() to one offset | os.listdir() then open 500 file handles |
| Sorting | Pre-sorted | Must sort manually by Z-coordinate |
| Use Case | Machine Learning / Research | Hospital PACS / Clinical Archive |
  • Cloud-Native Data Access
    • Population-scale radiology datasets are often stored in cloud object stores (think AWS S3, Azure Blob Storage...), as these datasets easily reach terabytes in size. The UK Biobank imaging study alone generates ~2.7 GB per participant across 100,000 subjects—that's 270 TB of imaging data. Even a moderate 1,000-subject, 4-modality study at 300 MB per volume weighs in at 1.2 TB. Hosting 10 TB on AWS S3 Standard costs ~$235/month, and at 100 TB you're looking at ~$2,300/month—making cloud storage a necessity, not a luxury, for datasets of this scale.
    • For any analysis, data scientists must download significant chunks of data to local disk before reading them into memory. This download operation is slow, expensive, and unnecessary given modern software.
  • Metadata
    • Metadata is stored in the headers of these files. Even in a moderately sized dataset, the cost of scanning hundreds of file headers to extract metadata for a basic operation/analysis is prohibitive.
    • Oftentimes metadata is exported to a SQL table/CSV file containing a column that links each row to a specific file. This approach is more functional, but flaky: the CSV/SQL table lives apart from the data itself, and that link is all-too-easily broken as data is transformed during analysis.

These are the two primary drivers behind my frustrations with NIfTI and DICOM for analysis. A much more comprehensive list of NIfTI nits can be found on Brainder's NIfTI explanation blog.

What does analysis look like?

Most analysis is done from the NIfTI format. NIfTI images can be read from a local disk using NiBabel:

import nibabel as nib
import numpy as np

# Load a NIfTI file - this is lazy: the header is parsed, voxel data untouched
img = nib.load("scan.nii.gz")

# Access the header (cheap - already parsed)
header = img.header
voxel_dims = header.get_zooms()       # e.g. (0.5, 0.5, 1.0) mm
shape = header.get_data_shape()       # e.g. (512, 512, 300)

# Get voxel data as a NumPy array - THIS is where the full decompression happens
# For a 512x512x300 volume: ~314 MB in float32 (~629 MB with the float64 default)
data = img.get_fdata(dtype=np.float32)

# Want just a single axial slice? Too bad - you already loaded the whole thing.
axial_slice = data[:, :, 150]         # 512x512 slice, but the cost was 314 MB

The critical thing to notice: there is no way to read a partial volume from a .nii.gz file. Gzip is a stream-compressed format—you cannot seek to byte offset N without decompressing everything before it. Even if you only need a single 512×512 slice (0.5 MB), you must decompress and load the full 314 MB volume.
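The stream-compression constraint is easy to demonstrate with synthetic data: an uncompressed payload supports direct offset reads, while gzip's seek only works by decompressing everything before the target offset.

```python
import gzip
import io
import numpy as np

# Synthetic volume: 32 slices of 64x64 float32, serialized like a .nii payload
vol = np.arange(32 * 64 * 64, dtype=np.float32).reshape(32, 64, 64)
raw = vol.tobytes()
compressed = gzip.compress(raw)

slice_bytes = 64 * 64 * 4                # one 2D slice in bytes
k = 20                                   # the only slice we actually want

# Uncompressed: jump straight to the slice's byte offset
buf = memoryview(raw)[k * slice_bytes:(k + 1) * slice_bytes]
direct = np.frombuffer(buf, dtype=np.float32).reshape(64, 64)

# Gzip: .seek() exists, but under the hood it decompresses and discards every
# byte before the target offset -- there is no random access into the stream.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
    f.seek(k * slice_bytes)              # decompresses slices 0..k-1 first
    streamed = np.frombuffer(f.read(slice_bytes), dtype=np.float32).reshape(64, 64)

assert np.array_equal(direct, streamed)  # same data, very different cost
```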

TorchIO is a commonly-used framework for doing machine learning from NIfTIs/NumPy data. Published by Pérez-García et al. (2021), it provides significant savings when it comes to GPU memory—though it cannot escape the fundamental I/O constraints of the flat-file format. This is a brief example of what a standard setup looks like for analysis:

import torchio as tio
from torch.utils.data import DataLoader

# Create dataset from filepaths
subjects = [
    tio.Subject(
        image=tio.ScalarImage(nifti_path),
        label=tio.LabelMap(label_path),
    )
    for nifti_path, label_path in zip(image_paths, label_paths)
]

dataset = tio.SubjectsDataset(subjects, transform=tio.Compose([
    tio.ToCanonical(),                    # Reorient to RAS
    tio.Resample(1.0),                    # Isotropic 1mm spacing
    tio.ZNormalization(),                 # Zero mean, unit variance
]))

# Patches are extracted from FULL volumes loaded into CPU RAM
sampler = tio.UniformSampler(patch_size=(96, 96, 96))
queue = tio.Queue(dataset, max_length=300, samples_per_volume=10, sampler=sampler)
loader = DataLoader(queue, batch_size=16)

for batch in loader:
    images = batch['image'][tio.DATA]      # (B, C, 96, 96, 96)
    labels = batch['label'][tio.DATA]
    # ... training step

A single CT scan at 512×512×300 voxels in float32 is 314 MB per volume. With a batch of four full volumes plus model activations, you can easily exceed even 80 GB of GPU memory. This is the fundamental constraint driving TorchIO/MONAI architectural decisions:

NIfTI/DICOM (Disk)  →  NumPy Array (RAM)  →  Torch Tensor (GPU)
     ↓                      ↓                      ↓
  .nii.gz file         float32 ndarray        CUDA tensor
  ~50-200 MB            ~300 MB               ~300 MB + gradients
  compressed            uncompressed          + activations (~2-10x)

TorchIO mitigates this by passing only the relevant chunks of data to GPU RAM:

  • CPU RAM: Full NIfTI volumes do load entirely (no way around this—you can't randomly access compressed .nii.gz). But the Queue keeps only a rolling subset loaded, not the entire dataset.
  • GPU RAM: Only patches (e.g., a batch of 16 96³ float32 patches ≈ 54 MiB) are passed instead of full volumes (4×512×512×300 float32 ≈ 1.2 GB). This is where the real savings happen for TorchIO.
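The patch-versus-volume savings are simple arithmetic, assuming single-channel float32 tensors:

```python
bytes_f32 = 4

# Full-volume batch: 4 volumes of 512 x 512 x 300 float32 voxels
full_batch = 4 * 512 * 512 * 300 * bytes_f32
print(f"full-volume batch: {full_batch / 2**30:.2f} GiB")   # 1.17 GiB

# Patch batch: 16 patches of 96^3 float32 voxels
patch_batch = 16 * 96**3 * bytes_f32
print(f"patch batch:       {patch_batch / 2**20:.1f} MiB")  # 54.0 MiB

print(f"ratio: {full_batch / patch_batch:.0f}x less data per batch")
```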

Where is the bottleneck?

Consider a typical end-to-end ML training workflow, from cloud storage to GPU. The bottleneck isn't the GPU—it's everything before it. Let's quantify the cost of a realistic training run: 1,000 subjects, 4 modalities (CT, FLAIR, T1w, T2w), 300 MB per volume:

| Stage | What Happens | Cost |
| --- | --- | --- |
| Download | aws s3 sync entire dataset to local SSD | 1.2 TB transferred, ~2.7 hours at 1 Gbps |
| File opens per epoch | NiBabel opens one file per volume per epoch | 4,000 file handles per epoch |
| Decompression | Every .nii.gz is fully gzip-decompressed | 4,000 × 457 ms ≈ 30 min CPU per epoch |
| Wasted reads | Need a 96³ patch (3.5 MB) but load 300 MB | 98.8% of data read is discarded |

For a 100-epoch training run, you're decompressing the same files 400,000 times, spending ~50 hours on gzip decompression alone—for a dataset that fits comfortably in cloud storage.

There are better, more performant mechanisms. The answer lies in how other fields have already solved this problem.

TileDB and Cloud-Native Analysis

The bioinformatics community faced a remarkably similar problem a few years earlier: population-scale genomic datasets outgrew flat files and needed cloud-native random access. The migration path they carved out is instructive.

Zarr emerged as an early cloud-native array format, gaining traction in geospatial (adopted as an OGC community standard) and bioimaging (OME-Zarr) communities. Its key insight was simple: store N-dimensional arrays as independently-readable chunks in a key-value store (like S3), rather than as a monolithic blob. This meant you could read a single chunk without downloading the entire array.
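The chunk-addressing insight can be sketched in a few lines: a read request over an N-dimensional array maps to the set of chunk keys it overlaps, and only those objects need to be fetched from the store. The key naming below is illustrative, not Zarr's actual on-disk layout:

```python
from itertools import product

def chunk_keys(start, stop, chunks):
    """Keys of all chunks overlapped by the region [start, stop) per axis."""
    ranges = [
        range(lo // c, (hi - 1) // c + 1)
        for lo, hi, c in zip(start, stop, chunks)
    ]
    return [".".join(map(str, idx)) for idx in product(*ranges)]

# A (512, 512, 300) array stored as 64^3 chunks: reading a 64^3 patch
# touches at most 8 chunks -- never the whole array.
keys = chunk_keys(start=(100, 100, 50), stop=(164, 164, 114), chunks=(64, 64, 64))
print(keys)       # 8 chunk keys, '1.1.0' ... '2.2.1'
print(len(keys))  # 8
```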

AnnData, the de facto standard for single-cell genomics data (via Scanpy), evolved in a similar direction—adding Zarr-backed storage to support out-of-core access to datasets that no longer fit in RAM.

TileDB took a more principled approach. Published by Papadopoulos et al. (2016) out of Intel Labs and MIT, TileDB is a full multi-dimensional array storage engine—not just a format—with built-in compression, tiling, cloud-native I/O, and concurrent access. Its adoption in genomics via TileDB-VCF demonstrated that even massive variant call datasets could be queried efficiently from S3 without local copies.

The most compelling example of this paradigm shift is the SOMA specification (Stack Of Matrices, Annotated), developed jointly by the Chan Zuckerberg Initiative and TileDB. SOMA provides a language-agnostic API for storing and querying annotated multi-dimensional biological data. It powers the CZ CELLxGENE Census—a unified atlas of ~85 million single-cell observations from 700+ datasets, all queryable cloud-natively from S3. The key architectural ideas in SOMA are:

  1. Hierarchical composition: Groups of arrays, not a single monolithic file
  2. Shared indexes: Observation metadata linked to data arrays via shared keys
  3. Cloud-native access: Every read translates to targeted S3 range requests
  4. Schema flexibility: Heterogeneous data (dense arrays, sparse matrices, dataframes) under one roof

SOMA's limitation? It's designed for tabular + matrix data (gene expression counts). Radiology data lives in a fundamentally different shape: dense 3D/4D arrays with spatial structure, where tiling strategy directly impacts I/O performance. This is the gap RadiObject fills.

RadiObject System Design

RadiObject is inspired by SOMA's architectural principles—hierarchical composition, shared indexes, cloud-native access—but purpose-built for dense volumetric radiology data. The core insight is:

Ingest once from NIfTI/DICOM into a tiled TileDB structure, then read only the tiles you need—forever after.

There is an upfront cost: the initial read/write step from NIfTI/DICOM into RadiObject takes time and increases storage (TileDB's tile-level compression is less aggressive than monolithic gzip—~5.7 GB vs 2.1 GB for a 20-subject dataset). But once that one-time ingestion is done, every subsequent read is dramatically more efficient.

Before looking at code, it helps to understand how RadiObject identifies data:

| Concept | RadiObject | DICOM | BIDS | SOMA / AnnData |
| --- | --- | --- | --- | --- |
| Subject identity | obs_subject_id | PatientID | sub-XX | obs columns |
| Volume identity | obs_id | SeriesInstanceUID | filename stem | obs index |
| Subject metadata | obs_meta | Patient-level tags | participants.tsv | .obs DataFrame |
| Volume metadata | obs | Series-level tags | sidecar JSON | .var / measurement metadata |
| Data grouping | VolumeCollection | Series Description | modality suffix | Measurement / layer |

Subject-level annotations that vary per volume (temporal measurements, for example) can be captured in obs.

from radiobject import RadiObject

# One-time ingestion from NIfTI (the "tax" you pay once)
radi = RadiObject.from_niftis(
    uri="s3://my-bucket/study",
    images={
        "CT": "./imagesTr/*.nii.gz",
        "seg": "./labelsTr",
    },
    validate_alignment=True,
    obs_meta=metadata_df,             # DataFrame with obs_subject_id column
)

# Every read after this is cloud-native and tile-aware
vol = radi.CT.iloc[0]                           # O(1) metadata lookup - no data loaded
vol.obs_id                                      # "sub-01_CT" — globally unique
patch = vol[100:164, 100:164, 50:114]           # Reads ~1-8 tiles, NOT the full volume
axial = vol.axial(z=50)                         # Single tile read: 3.4ms vs 1,053ms (MONAI)

The Architecture

RadiObject provides two things simultaneously: a hierarchical framework for organizing population-scale datasets, and an ML-optimized storage structure for analysis. The storage layout maps directly to the identity model. obs_meta holds subject-level metadata indexed by obs_subject_id—analogous to participants.tsv in BIDS or the .obs DataFrame in AnnData. Each VolumeCollection has its own obs array with per-volume metadata, where obs_id provides a globally unique volume key—analogous to DICOM's SeriesInstanceUID. The obs_subject_id foreign key is the join that links subject metadata to volume metadata across all collections.

This means you can query at the subject level (all volumes for patients over 40) or at the volume level (the CT with obs_id sub-03_CT), and the metadata relationship is always structurally enforced—not maintained by an external CSV.

Why This Structure Works for ML

The tiled storage is what makes RadiObject fundamentally different from NIfTI for ML workloads. ML training almost always reads chunks of data—patches, slices, ROIs—not full volumes. RadiObject's tile structure aligns storage with access patterns.

The benchmarks tell the story:

| Operation | RadiObject (local) | RadiObject (S3) | MONAI | TorchIO |
| --- | --- | --- | --- | --- |
| 2D Axial Slice | 3.4 ms | 152 ms | 1,053 ms | 731 ms |
| 64³ ROI Patch | 1.6 ms | 151 ms | 1,022 ms | 722 ms |
| Full Volume | 525 ms | 7,135 ms | 1,244 ms | 756 ms |
| Memory (slice) | 1 MB | 1 MB | 912 MB | 304 MB |

RadiObject is 300-640× faster for partial reads because it reads only the tiles containing the requested data. Even over S3, partial reads are 5-7× faster than loading a local NIfTI file with MONAI/TorchIO—because those frameworks must decompress the entire volume regardless.

The tiling strategy you choose at ingestion time determines which access pattern is optimized:

| Strategy | Tile Shape | Best For | Config |
| --- | --- | --- | --- |
| Axial | Full X-Y plane, Z=1 | 2D slice viewing, radiologist workflows | SliceOrientation.AXIAL |
| Sagittal | Full Y-Z plane, X=1 | Sagittal slice access | SliceOrientation.SAGITTAL |
| Coronal | Full X-Z plane, Y=1 | Coronal slice access | SliceOrientation.CORONAL |
| Isotropic | 64³ cubes | 3D patch extraction, ML training | SliceOrientation.ISOTROPIC |
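The consequence of each strategy can be sketched by counting how many tiles a given request overlaps (the helper below is illustrative arithmetic, not RadiObject's internals):

```python
from math import prod

def tiles_touched(start, stop, tile):
    """Number of tiles overlapped by the region [start, stop) on each axis."""
    return prod((hi - 1) // t - lo // t + 1 for lo, hi, t in zip(start, stop, tile))

strategies = {
    "axial":     (512, 512, 1),    # full X-Y plane, Z=1
    "isotropic": (64, 64, 64),     # 64^3 cubes
}

# One axial slice (z=50): 1 tile under AXIAL, 64 tiles under ISOTROPIC
for name, tile in strategies.items():
    n = tiles_touched((0, 0, 50), (512, 512, 51), tile)
    print(f"axial slice, {name}: {n} tiles")

# A 64^3 ML patch: at most 8 tiles under ISOTROPIC, 64 under AXIAL
for name, tile in strategies.items():
    n = tiles_touched((100, 100, 50), (164, 164, 114), tile)
    print(f"64^3 patch, {name}: {n} tiles")
```

Neither layout wins everywhere, which is why the tiling strategy is a per-collection ingestion choice rather than a fixed format decision.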

How It Fits Into Existing Ecosystems

RadiObject is not a replacement for TorchIO or MONAI—it replaces the I/O layer underneath them. Use RadiObject for data loading and storage, and your existing framework for transforms and augmentation:

# With MONAI transforms
from monai.transforms import Compose, NormalizeIntensityd, RandFlipd
from radiobject.ml import create_training_dataloader

transform = Compose([
    NormalizeIntensityd(keys="image"),
    RandFlipd(keys="image", prob=0.5, spatial_axis=[0, 1, 2]),
])

loader = create_training_dataloader(collections=radi.CT, transform=transform)

# With TorchIO Queue
from radiobject.ml import VolumeCollectionSubjectsDataset
import torchio as tio

dataset = VolumeCollectionSubjectsDataset(collections=radi.T1w)
queue = tio.Queue(dataset, max_length=100, samples_per_volume=10)

The VolumeCollectionDataset outputs {"image": tensor, ...} dicts—directly compatible with MONAI dict transforms. The VolumeCollectionSubjectsDataset yields tio.Subject objects for TorchIO's Queue-based training. Both leverage RadiObject's tile-aware reads under the hood.

The Data Access API

RadiObject's data access is split into two modes, inspired by Polars' lazy/eager distinction. Both operate at the VolumeCollection level and use the same transform signature: (volume: np.ndarray, obs_row: pd.Series) → volume or → (volume, obs_updates_dict).

Eager Queries

Eager queries execute immediately. Call .map() on a VolumeCollection; chain multiple .map() calls, with obs_updates propagating through the pipeline.

radi = RadiObject("s3://bucket/study")

# Eager: transforms execute immediately, results held in memory
ct_processed = (
    radi.CT
    .map(normalize_intensity)
    .map(compute_radiomics)          # returns (volume, {"entropy": 4.2, ...})
    .write("s3://bucket/processed")  # obs_updates persisted to obs metadata
)

Lazy Queries

vc.lazy() returns a LazyQuery—a deferred execution plan. No data is touched until a terminal method (.write(), .count(), .iter_volumes(), .to_numpy_stack()) is called. The lazy API exposes filtering (.filter(), .filter_subjects(), .iloc(), .loc(), .head(), .tail(), .sample()), transforms (.map()), and introspection (.count(), .to_obs()).

# Lazy: deferred until terminal call, enables single-pass streaming
(
    radi.CT.lazy()
    .filter("diagnosis == 'tumor'")
    .head(100)
    .map(normalize_intensity)
    .map(compute_radiomics)
    .write("s3://bucket/tumor-cohort")
)

# Lazy iteration — volumes stream one at a time
for vol in radi.CT.lazy().filter_subjects(["sub-01", "sub-03"]).iter_volumes():
    process(vol)

# Metadata-only queries (no voxel data loaded)
count = radi.CT.lazy().filter("age > 40").count()
obs_df = radi.CT.lazy().filter("age > 40").to_obs()

Transform Composition

Transforms returning (volume, obs_updates_dict) annotate metadata that persists through the pipeline and is written to obs on .write(). In chained .map() calls, obs_updates from earlier transforms are merged into obs_row before the next transform sees it.

def compute_brain_volume(volume: np.ndarray, obs_row: pd.Series) -> tuple:
    mask = volume > obs_row.get("threshold", 0.5)
    return volume, {"brain_voxel_count": int(mask.sum())}

# brain_voxel_count is available in obs_row for downstream_fn
radi.CT.lazy().map(compute_brain_volume).map(downstream_fn).write(uri)

The Storage-Performance Tradeoff

Let's be honest about the tradeoff:

| Format | Size (20 subjects) | Can Partial Read? | Full Volume Load |
| --- | --- | --- | --- |
| NIfTI (.nii.gz) | 2.1 GB | ❌ No | 457 ms |
| RadiObject | 5.7 GB | ✅ Yes (local & S3) | 120 ms (isotropic) |
| NIfTI (.nii) | 6.7 GB | ❌ No | 38 ms |

RadiObject uses ~2.7× more disk space than gzip-compressed NIfTI. That's the price of tile-level compression vs. monolithic compression. But this trades disk space for random access—and for ML workloads where you're reading patches, slices, or ROIs, the I/O savings are enormous.

Furthermore, users can update the underlying TileDB configuration through the RadiObject context API and optimize along whichever axis they prioritize: less disk space, faster reads, or more parallelism.


The full source code, tutorials, benchmarks, and API documentation are available on GitHub. If you're working with radiology data at scale—or just tired of waiting for gzip to decompress—give it a try:

pip install radiobject