RadiObject: A cloud-native radiology atlas
Building RadiObject
Radiology data is a rich and informative modality for computational analysis of human biology. Many features of a given scan are 'invisible' to the human eye, yet capture vital prognostic information about human health. The data formats used to store a given scan (or 'series', in radiology terms) were developed in the 1990s and early 2000s, and they have not seen a facelift since. RadiObject provides an updated in-memory and on-disk representation of individual scans, as well as population-scale datasets, that is simple and easy to work with.
What's in a Series?
These are a few examples of key radiology modalities, though this list extends to many more:
| Modality | Physics | Biological Information |
|---|---|---|
| CT | X-ray attenuation | Tissue density, calcification, anatomy |
| MRI | Magnetic resonance | Soft tissue contrast, water/fat content, diffusion |
| PET | Radiotracer uptake | Metabolic activity, receptor expression |
Any given series contains a wealth of quantitative and structural information hidden from the human eye, such as:
| Category | Examples | What It Captures |
|---|---|---|
| First Order | Mean, entropy, skewness, sphericity | Intensity distribution and geometric morphology |
| Second Order | Contrast, correlation, run length emphasis | Spatial relationships between neighboring voxels |
| Higher-Order | Wavelet, Gabor, fractal features | Multi-scale structural analysis |
The goal of radiomics is to formalize the extraction of these features into reproducible methods. One example of these invisible (higher-order) structures is fat infiltration into muscle (myosteatosis)—one of the best predictors of longevity available from CT scans. Pickhardt et al. (2023) demonstrated that AI-derived myosteatosis was the most important body-composition predictor of mortality in asymptomatic adults, while Linge et al. (2021) showed adverse muscle composition carried a hazard ratio of 3.71 for all-cause mortality across 40,178 UK Biobank participants. To conduct these studies, data scientists had to analyze radiology 'data'.
The 'data' is a 3-dimensional array—sometimes 4-dimensional for fMRI, where the fourth dimension is time. Each 'voxel' (position) in this array is a physical measurement of tissue properties (CT scans, for example, measure tissue density). Variations across voxels in different dimensions (e.g. time) provide quantitative measurements of other biological features, such as water content, protein binding, iron deposition, and many more.
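To make that concrete, here is a synthetic sketch (toy shapes and made-up values, not a real scan) of what these arrays look like in NumPy:

```python
import numpy as np

# Synthetic stand-in for a CT series: a 3D array of Hounsfield-like values.
# Real scans are typically 512 x 512 in-plane with hundreds of slices;
# a tiny shape is used here for illustration.
ct = np.full((8, 8, 4), -1000, dtype=np.int16)   # air everywhere (~ -1000 HU)
ct[2:6, 2:6, 1:3] = 40                           # a block of soft tissue (~40 HU)
ct[3:5, 3:5, 1:3] = 1000                         # a calcified core (~1000 HU)

print(ct.shape)       # (8, 8, 4): X, Y, Z
print(ct[3, 3, 1])    # 1000 -> one voxel, one physical measurement

# fMRI adds a 4th axis: time. Each voxel becomes a 1D signal.
fmri = np.random.default_rng(0).normal(size=(8, 8, 4, 30))
voxel_timeseries = fmri[3, 3, 1, :]              # 30 time points for one voxel
print(voxel_timeseries.shape)                    # (30,)
```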
How are these Arrays Represented?
DICOM files were the original representation of radiology data, standardized in 1993 as DICOM 3.0 (the successor to the earlier ACR-NEMA standards). Thirty years on, the format remains the clinical standard, but it was never designed for computational post-processing. Each 2D slice (or 'instance') from a given MRI scan is saved as a single .dcm file:
[ SINGLE DICOM FILE (.dcm) ]
+-------------------------------------------------------+
| HEADER (Variable Length) |
| Structure: (Group, Element) | VR | Length | Value |
| --------------------------------------------------- |
| |
| [Patient Level] |
| (0010,0010) PN: "Doe^John" |
| (0010,0020) LO: "123456" (Patient ID) |
| |
| [Study Level] |
| (0020,000D) UI: "1.2.840..." (Study Instance UID) |
| |
| [Series Level] (The Linker) |
| (0020,000E) UI: "1.2.840..." (Series Instance UID) |
| *All 512 files share this SAME ID.* |
| |
| [Image/Slice Level] (The Unique Parts) |
| (0020,0013) IS: "1" (Instance Number) |
| (0020,0032) DS: "-120.5\-30.0\105.0" (Position X/Y/Z)|
| (0028,0030) DS: "0.5\0.5" (Pixel Spacing) |
+-------------------------------------------------------+
| |
| PIXEL DATA (7FE0,0010) |
| --------------------------------------------------- |
| Raw 2D Array (Uint16) |
| Typically 512 x 512 pixels |
| |
| [ 00 00 ... 04 A1 ... ] |
| |
+-------------------------------------------------------+
A single series is saved as a directory of hundreds of these files, which must be composed into a single scan (by software systems) using the 'Position' metadata saved in each file header. Just to construct a tensor for analysis, a user must:
- Open 500 files
- Parse 500 headers
- Sort them by float coordinates (handling floating point errors)
- Stack the arrays
- Handle missing slices (if the patient moved)
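Every DICOM loader re-implements some version of this dance. Below is a minimal sketch of the assembly logic, using plain dicts with synthetic values in place of parsed pydicom headers:

```python
import numpy as np

# Each dict stands in for one parsed .dcm header plus its 2D pixel array.
# Real code would read these with pydicom; "z" is ImagePositionPatient[2].
slices = [
    {"z": 2.0000001, "pixels": np.full((4, 4), 2, dtype=np.uint16)},
    {"z": 0.0,       "pixels": np.full((4, 4), 0, dtype=np.uint16)},
    {"z": 1.0,       "pixels": np.full((4, 4), 1, dtype=np.uint16)},
    {"z": 4.0,       "pixels": np.full((4, 4), 4, dtype=np.uint16)},  # z=3.0 missing
]

# 1. Sort by float z-coordinate (rounding guards against FP jitter).
slices.sort(key=lambda s: round(s["z"], 3))

# 2. Detect missing slices: gaps noticeably larger than the typical spacing.
zs = [s["z"] for s in slices]
spacings = np.diff(zs)
expected = np.median(spacings)
gaps = [i for i, d in enumerate(spacings) if d > 1.5 * expected]

# 3. Stack the 2D arrays into a single (Z, Y, X) volume.
volume = np.stack([s["pixels"] for s in slices])
print(volume.shape)   # (4, 4, 4)
print(gaps)           # [2] -> a slice is missing between index 2 and 3
```

Real pipelines must additionally handle duplicate slices, varying orientations, and multi-frame files, which is exactly why this logic keeps getting rewritten.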
How Much Waste Are We Talking About?
Let's do some napkin maths for a single CT scan (512×512 pixels, 300 slices, uint16):
| Resource | Per Volume | 100-Subject Study |
|---|---|---|
| Files opened | 300 | 30,000 |
| Headers parsed | 300 × ~2 KB each = 600 KB | 60 MB of redundant metadata |
| Sorting overhead | 300 floats with IEEE 754 comparison | 30,000 floats |
| Raw pixel data | 512 × 512 × 2 bytes × 300 = 150 MB | 15 GB |
Each of those 300 file opens triggers a system call, a file descriptor allocation, and a header parse. On a standard SSD, opening and parsing a single DICOM file takes ~1-2ms—meaning just the I/O overhead of assembling a single volume is 300-600ms before you've touched the pixel data. Multiply that by 100 subjects across 4 modalities (CT, FLAIR, T1w, T2w) and you're spending 2-4 minutes purely on file system operations.
The redundancy is equally absurd: every one of those 300 files carries near-identical patient/study/series metadata in its header. For a 100-subject, 4-modality study, that's 120,000 files, each containing duplicated information that varies only by a few bytes (instance number and position).
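The table's arithmetic can be checked in a few lines (reading its 'MB' as binary mebibytes, 1 MiB = 2²⁰ bytes, which is how the 150 MB figure comes out exactly):

```python
# Napkin maths for one CT volume: 512 x 512 pixels, 300 slices, uint16 (2 bytes).
SLICES, PX, BYTES_PER_VOXEL = 300, 512, 2
MIB = 1024 ** 2

raw_mib = PX * PX * BYTES_PER_VOXEL * SLICES / MIB
print(raw_mib)                        # 150.0 (MiB per volume)

subjects = 100
study_gib = raw_mib * subjects / 1024
print(round(study_gib, 1))            # 14.6 (GiB, i.e. ~15 GB, for the study)

# Header redundancy: ~2 KiB of near-identical metadata in every file.
per_volume_kib = 2 * SLICES           # 600 KiB of headers per volume
study_header_mib = per_volume_kib * subjects / 1024
print(round(study_header_mib, 1))     # 58.6 (MiB of duplicated headers, ~60 MB)

# I/O overhead: ~1-2 ms to open and parse each file, before touching pixels.
per_volume_s = (1 * SLICES / 1000, 2 * SLICES / 1000)
print(per_volume_s)                   # (0.3, 0.6) seconds per volume
```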
The NIfTI format is the next data structure used for analysis; most pipelines convert DICOM → NIfTI as a first step. NIfTI's biggest change is consolidating DICOM's stack of 2D arrays into a single 3D array in its voxel data block.
[ NIfTI FILE (.nii / .nii.gz) ]
+-------------------------------------------------------+
| HEADER (348 Bytes) |
| --------------------------------------------------- |
| [MetaData] |
| • dim[8]: Shape (e.g., 512, 512, 300) |
| • pixdim[8]: Voxel Size (e.g., 0.5mm, 0.5mm, 1mm) |
| • datatype: INT16, FLOAT32, etc. |
| |
| [The "Rosetta Stone" - Coordinate Mapping] |
| • sform_code: Method to align to world space |
| • srow_x: [ Row 1 of Affine Matrix ] |
| • srow_y: [ Row 2 of Affine Matrix ] |
| • srow_z: [ Row 3 of Affine Matrix ] |
+-------------------------------------------------------+
| EXTENSION (Optional, usually variable length) |
| Custom JSON data, DICOM tags, etc. |
+-------------------------------------------------------+
| |
| VOXEL DATA (The "Payload") |
| --------------------------------------------------- |
| Raw binary block of numbers. |
| No shape info here, just a long 1D stream of bytes. |
| Must be reshaped using 'dim' from header. |
| |
| [ 00 A4 ... FF 01 ... ] (Compressed with Gzip) |
| |
+-------------------------------------------------------+
| Feature | NIfTI (.nii) | DICOM (.dcm folders) |
|---|---|---|
| Data Model | A 3D Volume | A stack of 2D Slices |
| Metadata | Centralized (One Header) | Distributed (Redundant headers in every file) |
| Reading | f.seek() to one offset | os.listdir() then open 500 file handles |
| Sorting | Pre-sorted | Must sort manually by Z-coordinate |
| Use Case | Machine Learning / Research | Hospital PACS / Clinical Archive |
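To make the "one centralized header, fixed offsets" point concrete, here is a sketch that builds a minimal NIfTI-1 header from scratch and parses it back. The byte offsets follow the public nifti1.h definition (sizeof_hdr at byte 0, dim[8] at byte 40, datatype at 70, pixdim[8] at 76, magic at 344); everything else is left zeroed for brevity:

```python
import struct

# Build a minimal 348-byte NIfTI-1 header, field by field.
hdr = bytearray(348)
struct.pack_into("<i", hdr, 0, 348)                             # sizeof_hdr
struct.pack_into("<8h", hdr, 40, 3, 512, 512, 300, 1, 1, 1, 1)  # dim[8]
struct.pack_into("<h", hdr, 70, 4)                              # datatype: 4 = INT16
struct.pack_into("<8f", hdr, 76, 0, 0.5, 0.5, 1.0, 0, 0, 0, 0)  # pixdim[8]
hdr[344:348] = b"n+1\x00"                                       # magic: single-file .nii

# Parsing is the mirror image: one buffer, a few fixed-offset unpacks.
(sizeof_hdr,) = struct.unpack_from("<i", hdr, 0)
dim = struct.unpack_from("<8h", hdr, 40)
pixdim = struct.unpack_from("<8f", hdr, 76)

shape = dim[1:1 + dim[0]]   # dim[0] = number of dimensions
print(shape)                # (512, 512, 300)
print(pixdim[1:4])          # (0.5, 0.5, 1.0) mm voxels
```

Compare that with DICOM, where the same facts are scattered across hundreds of variable-length tag streams.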
Cloud-Native Data Access
- Population-scale radiology datasets are often stored in cloud object stores (think AWS S3 or Azure Blob Storage), as these datasets easily reach terabytes in size. The UK Biobank imaging study alone generates ~2.7 GB per participant across 100,000 subjects—that's 270 TB of imaging data. Even a moderate 1,000-subject, 4-modality study at 300 MB per volume weighs in at 1.2 TB. Hosting 10 TB on AWS S3 Standard costs ~$235/month, and at 100 TB you're looking at ~$2,300/month—making cloud storage a necessity, not a luxury, for datasets of this scale.
- For any analysis, data scientists must download significant chunks of data to local disk before reading them into memory. This download operation is slow, expensive, and unnecessary given modern software.
Metadata
- Metadata is stored in the headers of these files. Even in a moderately sized dataset, the cost of scanning hundreds of file headers to extract metadata for a basic operation/analysis is prohibitive.
- Oftentimes metadata is exported to a SQL table or CSV file that contains a column linking each row to a specific file. This approach is more functional, but flaky: the CSV/SQL context is not directly tied to the data, and that link is all too easily broken as data is transformed during analysis.
These are the two primary drivers behind my frustrations with NIfTI and DICOM for analysis. A much more comprehensive list of NIfTI nits can be found on Brainder's NIfTI explanation blog.
What does analysis look like?
Most analysis is done from the NIfTI format. NIfTI images can be read from a local disk using NiBabel:
import nibabel as nib
import numpy as np

# Load a NIfTI file - cheap: only the header is parsed, voxel data stays on disk
img = nib.load("scan.nii.gz")

# Access the header (already parsed)
header = img.header
voxel_dims = header.get_zooms()      # e.g. (0.5, 0.5, 1.0) mm
shape = header.get_data_shape()      # e.g. (512, 512, 300)

# Get voxel data as a NumPy array - THIS is where the full gzip decompression
# happens. For a 512x512x300 volume: ~314 MB at float32. (get_fdata defaults
# to float64, which would double that.)
data = img.get_fdata(dtype=np.float32)

# Want just a single axial slice? Too bad - you already loaded the whole thing.
axial_slice = data[:, :, 150]        # 512x512 slice, but the cost was 314 MB
The critical thing to notice: there is no way to read a partial volume from a .nii.gz file. Gzip is a stream-compressed format—you cannot seek to byte offset N without decompressing everything before it. Even if you only need a single 512×512 slice (0.5 MB), you must decompress and load the full 314 MB volume.
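You can watch this constraint in action with Python's own gzip module. GzipFile exposes seek(), but there is no index into the stream: a forward seek simply decompresses and discards every byte before the target.

```python
import gzip
import io

# A fake "volume": 10 MiB of raw bytes, gzip-compressed like a .nii.gz.
raw = bytes(range(256)) * (10 * 1024 * 1024 // 256)
compressed = gzip.compress(raw)

f = gzip.GzipFile(fileobj=io.BytesIO(compressed))

# Pretend this offset is "slice 150" deep inside the volume.
target = 8 * 1024 * 1024
f.seek(target)          # works, but only by decompressing all 8 MiB before it
chunk = f.read(512)

print(chunk == raw[target:target + 512])   # True: correct bytes, full-stream cost
```

The bytes come back correct, but the decompressor had to chew through everything preceding them, which is exactly why .nii.gz cannot support partial reads.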
TorchIO is a commonly-used framework for doing machine learning from NIfTIs/NumPy data. Published by Pérez-García et al. (2021), it provides significant savings when it comes to GPU memory—though it cannot escape the fundamental I/O constraints of the flat-file format. This is a brief example of what a standard setup looks like for analysis:
import torchio as tio
from torch.utils.data import DataLoader
# Create dataset from filepaths
subjects = [
    tio.Subject(
        image=tio.ScalarImage(nifti_path),
        label=tio.LabelMap(label_path),
    )
    for nifti_path, label_path in zip(image_paths, label_paths)
]
dataset = tio.SubjectsDataset(subjects, transform=tio.Compose([
    tio.ToCanonical(),     # Reorient to RAS
    tio.Resample(1.0),     # Isotropic 1mm spacing
    tio.ZNormalization(),  # Zero mean, unit variance
]))
# Patches are extracted from FULL volumes loaded into CPU RAM
sampler = tio.UniformSampler(patch_size=(96, 96, 96))
queue = tio.Queue(dataset, max_length=300, samples_per_volume=10, sampler=sampler)
loader = DataLoader(queue, batch_size=16, num_workers=0)  # workers belong to the Queue, not the loader
for batch in loader:
    images = batch['image'][tio.DATA]  # (B, C, 96, 96, 96)
    labels = batch['label'][tio.DATA]
    # ... training step
A single CT scan at 512×512×300 voxels with float32 = 314 MB per volume. With a batch size of 4 and model activations, you can easily exceed 80 GB of GPU memory. This is the fundamental constraint driving TorchIO/MONAI architectural decisions:
NIfTI/DICOM (Disk) → NumPy Array (RAM) → Torch Tensor (GPU)
↓ ↓ ↓
.nii.gz file float32 ndarray CUDA tensor
~50-200 MB ~300 MB ~300 MB + gradients
compressed uncompressed + activations (~2-10x)
TorchIO addresses this by passing only the relevant chunks of data to GPU RAM:
- CPU RAM: Full NIfTI volumes do load entirely (no way around this—you can't randomly access compressed .nii.gz). But the Queue keeps only a rolling subset loaded, not the entire dataset.
- GPU RAM: Only patches are passed (e.g., a batch of 16 × 96³ float32 patches ≈ 57 MB) instead of full volumes (4 × 512 × 512 × 300 float32 ≈ 1.26 GB). This is where the real savings happen for TorchIO.
Where is the bottleneck?
Consider a typical ML training workflow end-to-end, from cloud storage to GPU. The bottleneck isn't the GPU—it's everything before it. Let's quantify the cost of a realistic training run: 1,000 subjects, 4 modalities (CT, FLAIR, T1w, T2w), 300 MB per volume:
| Stage | What Happens | Cost |
|---|---|---|
| Download | aws s3 sync entire dataset to local SSD | 1.2 TB transferred, ~2.7 h at 1 Gbps |
| File opens per epoch | NiBabel opens one file per volume per epoch | 4,000 file handles per epoch |
| Decompression | Every .nii.gz is fully gzip-decompressed | 4,000 × 457 ms = 30 min CPU per epoch |
| Wasted reads | Need a 96³ patch (3.5 MB) but load 300 MB | 98.8% of data read is discarded |
For a 100-epoch training run, you're decompressing the same files 400,000 times, spending ~50 hours on gzip decompression alone—for a dataset that fits comfortably in cloud storage.
There are better, more performant mechanisms. The answer lies in how other fields have already solved this problem.
TileDB and Cloud-Native Analysis
The bioinformatics community faced a remarkably similar problem a few years earlier: population-scale genomic datasets outgrew flat files and needed cloud-native random access. The migration path they carved out is instructive.
Zarr emerged as an early cloud-native array format, gaining traction in geospatial (adopted as an OGC community standard) and bioimaging (OME-Zarr) communities. Its key insight was simple: store N-dimensional arrays as independently-readable chunks in a key-value store (like S3), rather than as a monolithic blob. This meant you could read a single chunk without downloading the entire array.
AnnData, the de facto standard for single-cell genomics data (via Scanpy), evolved in a similar direction—adding Zarr-backed storage to support out-of-core access to datasets that no longer fit in RAM.
TileDB took a more principled approach. Published by Papadopoulos et al. (2016) out of Intel Labs and MIT, TileDB is a full multi-dimensional array storage engine—not just a format—with built-in compression, tiling, cloud-native I/O, and concurrent access. Its adoption in genomics via TileDB-VCF demonstrated that even massive variant call datasets could be queried efficiently from S3 without local copies.
The most compelling example of this paradigm shift is the SOMA specification (Stack Of Matrices, Annotated), developed jointly by the Chan Zuckerberg Initiative and TileDB. SOMA provides a language-agnostic API for storing and querying annotated multi-dimensional biological data. It powers the CZ CELLxGENE Census—a unified atlas of ~85 million single-cell observations from 700+ datasets, all queryable cloud-natively from S3. The key architectural ideas in SOMA are:
- Hierarchical composition: Groups of arrays, not a single monolithic file
- Shared indexes: Observation metadata linked to data arrays via shared keys
- Cloud-native access: Every read translates to targeted S3 range requests
- Schema flexibility: Heterogeneous data (dense arrays, sparse matrices, dataframes) under one roof
SOMA's limitation? It's designed for tabular + matrix data (gene expression counts). Radiology data lives in a fundamentally different shape: dense 3D/4D arrays with spatial structure, where tiling strategy directly impacts I/O performance. This is the gap RadiObject fills.
RadiObject System Design
RadiObject is inspired by SOMA's architectural principles—hierarchical composition, shared indexes, cloud-native access—but purpose-built for dense volumetric radiology data. The core insight is:
Ingest once from NIfTI/DICOM into a tiled TileDB structure, then read only the tiles you need—forever after.
There is an upfront cost: the initial read/write step from NIfTI/DICOM into RadiObject takes time and increases storage (TileDB's tile-level compression is less aggressive than monolithic gzip—~5.7 GB vs 2.1 GB for a 20-subject dataset). But once that one-time ingestion is done, every subsequent read is dramatically more efficient.
Before looking at code, it helps to understand how RadiObject identifies data:
| Concept | RadiObject | DICOM | BIDS | SOMA / AnnData |
|---|---|---|---|---|
| Subject identity | obs_subject_id | PatientID | sub-XX | obs columns |
| Volume identity | obs_id | SeriesInstanceUID | filename stem | obs index |
| Subject metadata | obs_meta | Patient-level tags | participants.tsv | .obs DataFrame |
| Volume metadata | obs | Series-level tags | sidecar JSON | .var / measurement metadata |
| Data grouping | VolumeCollection | Series Description | modality suffix | Measurement / layer |
Temporal or other volume-specific annotations can be captured in obs.
from radiobject import RadiObject
# One-time ingestion from NIfTI (the "tax" you pay once)
radi = RadiObject.from_niftis(
    uri="s3://my-bucket/study",
    images={
        "CT": "./imagesTr/*.nii.gz",
        "seg": "./labelsTr",
    },
    validate_alignment=True,
    obs_meta=metadata_df,  # DataFrame with obs_subject_id column
)
# Every read after this is cloud-native and tile-aware
vol = radi.CT.iloc[0] # O(1) metadata lookup - no data loaded
vol.obs_id # "sub-01_CT" — globally unique
patch = vol[100:164, 100:164, 50:114] # Reads ~1-8 tiles, NOT the full volume
axial = vol.axial(z=50) # Single tile read: 3.4ms vs 1,053ms (MONAI)
The Architecture
RadiObject provides two things simultaneously: a hierarchical framework for organizing population-scale datasets, and an ML-optimized storage structure for analysis. The diagram below maps directly to the identity model. obs_meta (green) holds subject-level metadata indexed by obs_subject_id—analogous to participants.tsv in BIDS or the .obs DataFrame in AnnData. Each VolumeCollection (orange) has its own obs array with per-volume metadata, where obs_id provides a globally unique volume key—analogous to DICOM's SeriesInstanceUID. The dotted obs_subject_id FK arrow is the join key linking subject metadata to volume metadata across all collections.
This means you can query at the subject level (all volumes for patients over 40) or at the volume level (the CT with obs_id sub-03_CT), and the metadata relationship is always structurally enforced—not maintained by an external CSV.
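The join semantics can be sketched with synthetic pandas DataFrames. These are illustrative stand-ins only; RadiObject stores obs_meta and obs as TileDB arrays, not DataFrames, and the column values here are invented:

```python
import pandas as pd

# Subject-level metadata, keyed by obs_subject_id (like participants.tsv).
obs_meta = pd.DataFrame({
    "obs_subject_id": ["sub-01", "sub-02", "sub-03"],
    "age": [38, 45, 62],
    "diagnosis": ["healthy", "tumor", "tumor"],
})

# Per-volume metadata for one collection, keyed by obs_id.
ct_obs = pd.DataFrame({
    "obs_id": ["sub-01_CT", "sub-02_CT", "sub-03_CT"],
    "obs_subject_id": ["sub-01", "sub-02", "sub-03"],
    "slice_thickness_mm": [1.0, 0.5, 1.0],
})

# The obs_subject_id foreign key joins subject metadata to volume metadata.
joined = ct_obs.merge(obs_meta, on="obs_subject_id", how="left")

# Subject-level query: all CT volumes for patients over 40.
over_40 = joined[joined["age"] > 40]["obs_id"].tolist()
print(over_40)   # ['sub-02_CT', 'sub-03_CT']
```

The difference from the CSV-sidecar approach is that in RadiObject this relationship lives inside the same storage hierarchy as the voxel data, so it cannot silently drift.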
Why This Structure Works for ML
The tiled storage is what makes RadiObject fundamentally different from NIfTI for ML workloads. ML training almost always reads chunks of data—patches, slices, ROIs—not full volumes. RadiObject's tile structure aligns storage with access patterns.
The benchmarks tell the story:
| Operation | RadiObject (local) | RadiObject (S3) | MONAI | TorchIO |
|---|---|---|---|---|
| 2D Axial Slice | 3.4 ms | 152 ms | 1,053 ms | 731 ms |
| 64³ ROI Patch | 1.6 ms | 151 ms | 1,022 ms | 722 ms |
| Full Volume | 525 ms | 7,135 ms | 1,244 ms | 756 ms |
| Memory (slice) | 1 MB | 1 MB | 912 MB | 304 MB |
RadiObject is 300-640× faster for partial reads because it reads only the tiles containing the requested data. Even over S3, partial reads are 5-7× faster than loading a local NIfTI file with MONAI/TorchIO—because those frameworks must decompress the entire volume regardless.
The tiling strategy you choose at ingestion time determines which access pattern is optimized:
| Strategy | Tile Shape | Best For | Config |
|---|---|---|---|
| Axial | Full X-Y plane, Z=1 | 2D slice viewing, radiologist workflows | SliceOrientation.AXIAL |
| Sagittal | Full Y-Z plane, X=1 | Sagittal slice access | SliceOrientation.SAGITTAL |
| Coronal | Full X-Z plane, Y=1 | Coronal slice access | SliceOrientation.CORONAL |
| Isotropic | 64³ cubes | 3D patch extraction, ML training | SliceOrientation.ISOTROPIC |
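The "~1-8 tiles" figure from the earlier patch read follows from simple tile arithmetic. Here is a sketch (plain Python, not RadiObject's internals) counting how many isotropic 64³ tiles a slicing request touches:

```python
from math import floor

def tiles_touched(slices, tile=64):
    """Per axis, count tile-aligned chunks a half-open (start, stop) range hits."""
    per_axis = [floor((stop - 1) / tile) - floor(start / tile) + 1
                for start, stop in slices]
    total = 1
    for k in per_axis:
        total *= k
    return per_axis, total

# The unaligned request from the example: 64 voxels per axis, but straddling
# a tile boundary on every axis -> 2 tiles per axis -> 8 tiles total.
print(tiles_touched([(100, 164), (100, 164), (50, 114)]))  # ([2, 2, 2], 8)

# The same-sized request, tile-aligned: 1 tile per axis -> 1 tile total.
print(tiles_touched([(64, 128), (64, 128), (64, 128)]))    # ([1, 1, 1], 1)

# Versus the full 512 x 512 x 300 volume's worth of tiles.
print(tiles_touched([(0, 512), (0, 512), (0, 300)]))       # ([8, 8, 5], 320)
```

Either way the patch read touches at most 8 of 320 tiles, which is why partial reads stay fast even over S3.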
How It Fits Into Existing Ecosystems
RadiObject is not a replacement for TorchIO or MONAI—it replaces the I/O layer underneath them. Use RadiObject for data loading and storage, and your existing framework for transforms and augmentation:
# With MONAI transforms
from monai.transforms import Compose, NormalizeIntensityd, RandFlipd
from radiobject.ml import create_training_dataloader
transform = Compose([
    NormalizeIntensityd(keys="image"),
    RandFlipd(keys="image", prob=0.5, spatial_axis=[0, 1, 2]),
])
loader = create_training_dataloader(collections=radi.CT, transform=transform)
# With TorchIO Queue
from radiobject.ml import VolumeCollectionSubjectsDataset
import torchio as tio
dataset = VolumeCollectionSubjectsDataset(collections=radi.T1w)
queue = tio.Queue(dataset, max_length=100, samples_per_volume=10)
The VolumeCollectionDataset outputs {"image": tensor, ...} dicts—directly compatible with MONAI dict transforms. The VolumeCollectionSubjectsDataset yields tio.Subject objects for TorchIO's Queue-based training. Both leverage RadiObject's tile-aware reads under the hood.
The Data Access API
RadiObject's data access is split into two modes, inspired by Polars' lazy/eager distinction. Both operate at the VolumeCollection level and use the same transform signature: (volume: np.ndarray, obs_row: pd.Series) → volume or → (volume, obs_updates_dict).
Eager Queries
Eager queries execute immediately. Call .map() on a VolumeCollection; chain multiple .map() calls, with obs_updates propagating through the pipeline.
radi = RadiObject("s3://bucket/study")
# Eager: transforms execute immediately, results held in memory
ct_processed = (
    radi.CT
    .map(normalize_intensity)
    .map(compute_radiomics)          # returns (volume, {"entropy": 4.2, ...})
    .write("s3://bucket/processed")  # obs_updates persisted to obs metadata
)
Lazy Queries
vc.lazy() returns a LazyQuery—a deferred execution plan. No data is touched until a terminal method (.write(), .count(), .iter_volumes(), .to_numpy_stack()) is called. The lazy API exposes filtering (.filter(), .filter_subjects(), .iloc(), .loc(), .head(), .tail(), .sample()), transforms (.map()), and introspection (.count(), .to_obs()).
# Lazy: deferred until terminal call, enables single-pass streaming
(
    radi.CT.lazy()
    .filter("diagnosis == 'tumor'")
    .head(100)
    .map(normalize_intensity)
    .map(compute_radiomics)
    .write("s3://bucket/tumor-cohort")
)
# Lazy iteration — volumes stream one at a time
for vol in radi.CT.lazy().filter_subjects(["sub-01", "sub-03"]).iter_volumes():
    process(vol)
# Metadata-only queries (no voxel data loaded)
count = radi.CT.lazy().filter("age > 40").count()
obs_df = radi.CT.lazy().filter("age > 40").to_obs()
Transform Composition
Transforms returning (volume, obs_updates_dict) annotate metadata that persists through the pipeline and is written to obs on .write(). In chained .map() calls, obs_updates from earlier transforms are merged into obs_row before the next transform sees it.
def compute_brain_volume(volume: np.ndarray, obs_row: pd.Series) -> tuple:
    mask = volume > obs_row.get("threshold", 0.5)
    return volume, {"brain_voxel_count": int(mask.sum())}
# brain_voxel_count is available in obs_row for downstream_fn
radi.CT.lazy().map(compute_brain_volume).map(downstream_fn).write(uri)
The Storage-Performance Tradeoff
Let's be honest about the tradeoff:
| Format | Size (20 subjects) | Can Partial Read? | Full Volume Load |
|---|---|---|---|
| NIfTI (.nii.gz) | 2.1 GB | ❌ No | 457 ms |
| RadiObject | 5.7 GB | ✅ Yes (local & S3) | 120 ms (isotropic) |
| NIfTI (.nii) | 6.7 GB | ❌ No | 38 ms |
RadiObject uses ~2.7× more disk space than gzip-compressed NIfTI. That's the price of tile-level compression vs. monolithic compression. But this trades disk space for random access—and for ML workloads where you're reading patches, slices, or ROIs, the I/O savings are enormous.
Furthermore, users can easily update the TileDB configuration itself through the RadiObject context API and optimize along whichever axis they prioritise, whether that's less disk space, faster reads, or more parallelism.
The full source code, tutorials, benchmarks, and API documentation are available on GitHub. If you're working with radiology data at scale—or just tired of waiting for gzip to decompress—give it a try:
pip install radiobject