6 Hours to Fully Reproduce 5 Years of Research
The Challenge
After five years of PhD experiments and a collaboration spanning institutions across three countries (microscopy, laser ablation, pipette aspiration, simulations), I had 40 TB of raw data and no shared system to organize or reproduce any of it.
Traditionally, in biological research labs, data is rarely documented or version-controlled systematically. Without such a system, regenerating a single figure means days of manual work with no guarantee of consistency.
I built one that could do it in an afternoon. This is how.
What I Built
I designed a modular, figure-centric analysis pipeline from first principles, treating the paper as a product and each figure as a tracked deliverable.
40 TB+ raw data consolidated
12 figures, all reproducible
215 commits of audit trail
3 institutions coordinated
Repository architecture — a structured Git repo with per-figure folders and six dedicated notebooks, designed so collaborators could work in parallel without waiting on a central coordinator — cutting the back-and-forth that slows most multi-institution projects.
Standardised ingestion — all raw experimental data and simulation outputs (from collaborators Dr. Yann-Edwin Keta at ESPCI Paris and Dr. Silke Henkes at Leiden University) were normalised into a single format. Three countries, three languages, one schema.
Single-source-of-truth figures — every figure is generated directly from raw data via its own notebook, with no manual steps in between. Changing a parameter, colour scheme, or dataset propagates automatically across the entire paper. This eliminated a whole category of human error.
Future-proofed environment — the full analysis environment is reproducible in a single command, with every dependency pinned to an exact version. Step-by-step documentation ensures any future researcher or journal reviewer can recreate the results independently, with no configuration required.
Cross-continental coordination — managed asynchronous collaboration across three countries over two years, aligning contributors on data standards, figure revisions, and code conventions that meet publisher requirements.
Data consolidation at scale — 40 TB of raw experimental data, collected across five years using ever-changing bio-image formats and conventions, was systematically catalogued, cleaned, and reduced into structured datasheets that feed directly into the pipeline. Every number in the paper traces back to a specific archived file.
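To make the traceability idea concrete, here is a minimal sketch of a checksum manifest, assuming Python. It is not the code from the repository: the function names (sha256_of, build_manifest, changed_files) and the manifest layout are hypothetical, chosen only to show how a number in a figure could be tied to one exact archived file.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large raw files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map each file path (relative to raw_dir) to its SHA-256 checksum."""
    return {
        path.relative_to(raw_dir).as_posix(): sha256_of(path)
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return recorded files that are now missing or whose content differs."""
    current = build_manifest(raw_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )


if __name__ == "__main__":
    # Hypothetical demo on a throwaway directory, not on real data.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "scan_001.csv").write_text("t,area\n0,1.0\n")
        manifest = build_manifest(root)
        (root / "scan_001.csv").write_text("t,area\n0,1.1\n")  # silent edit
        print(changed_files(root, manifest))
```

Storing such a manifest next to the structured datasheets (for example as a JSON file under version control) would let a reviewer confirm that the archived files are byte-for-byte the ones the figures were built from.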
The pipeline wasn’t designed once and handed over — it evolved across 2.5 years as the scientific story changed.
Figures were added, reframed, and cut entirely.
Reviewer feedback demanded new analyses mid-process.
The system had to be flexible enough to absorb that ambiguity without breaking — and robust enough that every change was traceable.
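One way to get that flexibility is a small figure registry with a shared style object. The sketch below assumes Python and is illustrative only: Style, FIGURES, figure, and build_all are hypothetical names, the palette values are placeholders, and the builders return plain dictionaries instead of drawing real plots so the example has no dependencies.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Style:
    """Paper-wide settings shared by every figure (placeholder values)."""
    palette: tuple[str, ...] = ("#1b9e77", "#d95f02", "#7570b3")
    font_size: int = 8


# Registry mapping a figure name to the function that builds it.
FIGURES: dict[str, Callable[[Style], dict]] = {}


def figure(name: str):
    """Decorator that registers a figure builder under `name`."""
    def register(func: Callable[[Style], dict]) -> Callable[[Style], dict]:
        FIGURES[name] = func
        return func
    return register


@figure("fig1")
def fig1(style: Style) -> dict:
    # A real builder would load a structured datasheet and draw the plot.
    return {"colours": style.palette[:2], "font_size": style.font_size}


@figure("fig2")
def fig2(style: Style) -> dict:
    return {"colours": style.palette[2:], "font_size": style.font_size}


def build_all(style: Style) -> dict[str, dict]:
    """Rebuild every registered figure from the same style object."""
    return {name: build(style) for name, build in FIGURES.items()}


if __name__ == "__main__":
    paper = build_all(Style())
    # Switching the colour scheme is a single change applied to all figures.
    slides = build_all(replace(Style(), palette=("#000000", "#e69f00", "#56b4e9")))
    print(paper["fig1"], slides["fig1"])
```

With this layout, adding a figure means adding one decorated function, cutting a figure means deleting it, and a new colour scheme is a single edit that reaches every figure on the next rebuild.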
The Result
The paper was submitted to Nature Communications. All 12 figures, drawing on 40 TB of source data, are fully reproducible and stay reproducible from a single repository, with no manual steps and no institutional knowledge required.
The real test came the day before my PhD defence. I needed a different colour scheme for my slides, so I regenerated all 12 figures from scratch. It took under 6 hours. Without the pipeline, the same task would have taken days — and risked introducing inconsistencies across the paper.
The full repository is publicly archived under CC BY-SA 4.0 as an open-science resource.