6 Hours to Fully Reproduce 5 Years of Research

Data Engineering

Project Management

Open Science

Data Visualisation

How I built an end-to-end analysis pipeline that condensed 40 TB of PhD data into 12 fully reproducible figures. The real stress-test: the day before my defence, I wanted a new colour scheme for my slides — so I regenerated every figure in the paper in a single afternoon. The system held.

Published

June 30, 2026

The Challenge

After five years of PhD experiments and a collaboration spanning institutions in three countries (microscopy, laser ablation, pipette aspiration, simulations), I had 40 TB of raw data and no shared system to organise or reproduce any of it.

In biological research labs, data is rarely documented or version-controlled systematically. Without such a system, regenerating a single figure means days of manual work with no guarantee of consistency.

I built one that could do it in an afternoon. This is how.

What I Built

I designed a modular, figure-centric analysis pipeline from first principles, treating the paper as a product and each figure as a tracked deliverable.

40 TB+ raw data consolidated

12 figures, all reproducible

215 commits of audit trail

3 institutions coordinated

Repository architecture: a structured Git repo with per-figure folders and six dedicated notebooks, designed so collaborators could work in parallel without waiting on a central coordinator. This cut the back-and-forth that slows most multi-institution projects.

Standardised ingestion — all raw experimental data and simulation outputs (from collaborators Dr. Yann-Edwin Keta at ESPCI Paris and Dr. Silke Henkes at Leiden University) were normalised into a single format. Three countries, three languages, one schema.

Single-source-of-truth figures: every figure is generated directly from raw data by its own notebook, with no manual steps in between. Changing a parameter, colour scheme, or dataset propagates automatically across the entire paper, which eliminates a whole category of human error.

Future-proofed environment: the full analysis environment is reproducible with a single command, with every dependency pinned to an exact version. Step-by-step documentation ensures that any future researcher or journal reviewer can recreate the results independently, with no configuration required.

Cross-continental coordination: I managed asynchronous collaboration across three countries over two years, aligning contributors on data standards, figure revisions, and code conventions that meet publisher requirements.

Data consolidation at scale: 40 TB of raw experimental data, collected over five years in ever-changing bio-image formats and conventions, was systematically catalogued, cleaned, and reduced into structured datasheets that feed directly into the pipeline. Every number in the paper traces back to a specific archived file.
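To make the traceability idea concrete, here is a minimal sketch of a checksum manifest that ties each figure to the exact files it reads. It is not the code from the repository: the function names (`sha256_of`, `build_manifest`, `changed_files`) and the manifest layout are assumptions made for illustration, and it uses only the Python standard library.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large raw files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(figure_sources: dict[str, list[Path]]) -> dict[str, list[dict[str, str]]]:
    """Map each figure name to the file name and checksum of every source it reads."""
    return {
        figure: [{"file": src.name, "sha256": sha256_of(src)} for src in sorted(sources)]
        for figure, sources in figure_sources.items()
    }


def changed_files(old: dict, new: dict) -> list[str]:
    """List 'figure/file' entries whose checksum differs between two manifests."""
    changed = []
    for figure, entries in new.items():
        previous = {e["file"]: e["sha256"] for e in old.get(figure, [])}
        for entry in entries:
            if previous.get(entry["file"]) != entry["sha256"]:
                changed.append(f"{figure}/{entry['file']}")
    return changed
```

Committing a manifest like this alongside the notebooks would let anyone check that a figure was built from the same bytes as before, and `changed_files` would point at the source file responsible whenever a number moves.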

The pipeline wasn’t designed once and handed over — it evolved across 2.5 years as the scientific story changed.

Figures were added, reframed, and cut entirely.

Reviewer feedback demanded new analyses mid-process.

The system had to be flexible enough to absorb that ambiguity without breaking — and robust enough that every change was traceable.

The Result

The paper was submitted to Nature Communications. All 12 figures, drawing on 40 TB of source data, are fully reproducible from a single repository, with no manual steps and no institutional knowledge required.

The real test came the day before my PhD defence. I needed a different colour scheme for my slides, so I regenerated all 12 figures from scratch. It took under 6 hours. Without the pipeline, the same task would have taken days — and risked introducing inconsistencies across the paper.
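The reason a colour change can be a one-afternoon job is that no figure hard-codes its own styling. The sketch below illustrates that pattern under stated assumptions: the `Theme` class, the `figure` decorator, the two placeholder builders, and the hex colours are all hypothetical, and the builders emit tiny SVG strings with the standard library instead of the real plots.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Theme:
    """Shared styling; every figure reads colours from here and nowhere else."""
    background: str
    primary: str
    accent: str


# Registry of figure builders, filled in by the decorator below.
FIGURES: dict[str, Callable[[Theme], str]] = {}


def figure(name: str):
    """Register a function that renders one figure from a Theme."""
    def register(builder: Callable[[Theme], str]) -> Callable[[Theme], str]:
        FIGURES[name] = builder
        return builder
    return register


@figure("fig1_example")
def fig1(theme: Theme) -> str:
    # Placeholder drawing: a real builder would load data and plot it.
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="80">'
        f'<rect width="120" height="80" fill="{theme.background}"/>'
        f'<circle cx="60" cy="40" r="20" fill="{theme.primary}"/>'
        "</svg>"
    )


@figure("fig2_example")
def fig2(theme: Theme) -> str:
    # Second placeholder, using the accent colour from the same theme.
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="80">'
        f'<rect width="120" height="80" fill="{theme.background}"/>'
        f'<line x1="10" y1="70" x2="110" y2="10" stroke="{theme.accent}"/>'
        "</svg>"
    )


def regenerate_all(theme: Theme) -> dict[str, str]:
    """Re-render every registered figure with the given theme."""
    return {name: build(theme) for name, build in FIGURES.items()}


# One call per output target: same figures, different colours.
paper = regenerate_all(Theme(background="#ffffff", primary="#1f77b4", accent="#d62728"))
slides = regenerate_all(Theme(background="#000000", primary="#ffd166", accent="#06d6a0"))
```

With this layout, switching the slide palette means editing one `Theme` and rerunning `regenerate_all`; no individual figure needs to be touched.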

The full repository is publicly archived under CC BY-SA 4.0 as an open-science resource.

View repository →