Preserving Science: The case for scalable, sustainable cloud-based data archiving
In an era of data-intensive science, we are facing a quiet but growing crisis: how to sustainably preserve the sheer volume of data produced by modern research, while ensuring it remains accessible, verifiable, and usable for future generations.
Whether it’s genomics, climate modelling, particle physics, or AI training sets, scientific workflows are generating petabytes of data, often at speeds outpacing our capacity to store, curate, and retrieve them meaningfully. As data volumes grow, so too does the risk of digital obsolescence, fragmented access, excessive power usage, and prohibitive storage costs.
The imperative for low-cost, long-term scientific data storage
High-value research data cannot be treated as disposable. Institutions like Cambridge University Press and members of the Digital Preservation Coalition recognise that scientific knowledge is a cultural asset, and digital preservation is fundamental to scholarly continuity. Yet traditional cloud storage models are often cost-prohibitive for long-term retention, especially when retrieval fees and unpredictable egress costs are factored in.
Scientific institutions need archiving solutions that:
Are cost-effective enough to scale with exponential data growth.
Provide independent, verifiable storage with transparent metadata and integrity guarantees.
Allow easy and timely retrieval, not only for the data producers but also for external collaborators, auditors, and future researchers.
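The integrity guarantees listed above are commonly implemented with fixity checks: record a cryptographic checksum for every archived file, then recompute and compare on a schedule. A minimal sketch in Python, assuming a local directory of archived files (the function names are illustrative, not any particular platform's API):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(archive_dir: Path) -> dict:
    """Record one checksum per file at ingest time."""
    return {
        str(p.relative_to(archive_dir)): sha256_of(p)
        for p in sorted(archive_dir.rglob("*")) if p.is_file()
    }

def verify_manifest(archive_dir: Path, manifest: dict) -> list:
    """Return files whose current checksum no longer matches the manifest,
    i.e. candidates for silent corruption or tampering."""
    return [name for name, expected in manifest.items()
            if sha256_of(archive_dir / name) != expected]
```

Storing the manifest alongside the data (and with an independent party) is what makes the archive independently verifiable years later.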
Bridging infrastructure and stewardship: a new model for research storage
Platforms like Cloud Cold Storage are increasingly relevant in hybrid data infrastructures, where active AI and high-performance computing workloads coexist with vast, infrequently accessed datasets. Digital Realty’s AI-ready architectures emphasise the need for tiered data strategies: optimising hot storage for real-time inference while offloading reference datasets, training corpora, or audit logs to low-cost cold storage layers. This approach not only ensures performance but also dramatically reduces energy and operational costs.
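A tiered strategy of this kind can be expressed as a simple placement policy. The sketch below is illustrative only: the dataset record, the 90-day threshold, and the per-GB prices are all assumptions for the example, not vendor defaults:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Dataset:
    name: str
    size_gb: float
    last_accessed: datetime

def choose_tier(ds: Dataset, now: datetime,
                cold_after: timedelta = timedelta(days=90)) -> str:
    """Keep recently used data hot; move dormant reference data to cold storage."""
    return "cold" if now - ds.last_accessed > cold_after else "hot"

def estimate_monthly_cost(datasets, now,
                          hot_per_gb=0.023, cold_per_gb=0.004) -> float:
    """Illustrative per-GB prices; real tariffs vary by provider and region."""
    return sum(
        ds.size_gb * (cold_per_gb if choose_tier(ds, now) == "cold" else hot_per_gb)
        for ds in datasets
    )
```

Even this toy policy shows why tiering matters: a dormant terabyte priced at a cold-tier rate costs a fraction of the same data kept hot, before egress fees are considered.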
At the same time, organisations such as Arkivum highlight the importance of policy-driven, compliant data stewardship throughout the data lifecycle, while Cambridge University Library has published a blueprint for an open-source repository for searchable research data. Arkivum advocates digital archiving aligned with the FAIR principles, ensuring that data remains Findable, Accessible, Interoperable, and Reusable while adhering to strict regulatory frameworks such as GDPR, GxP, and HIPAA. This becomes particularly critical in sectors like healthcare, higher education, and pharmaceutical research, where data integrity, provenance, and chain of custody must be verifiable years, or even decades, after collection. Cambridge University Library’s approach, meanwhile, shows how scalable preservation activities can be embedded into normal workflows, with no specialist digital-preservation expertise required, in an open environment that makes research data accessible to all.
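The FAIR principles are concrete enough to sketch in code. Below is a minimal, illustrative metadata record; the field names loosely follow common repository conventions but are assumptions for the example, not a formal schema:

```python
import json

def fair_record(identifier, title, creators, licence, checksum_sha256, access_url):
    """A minimal record touching each FAIR principle:
    Findable (identifier, title), Accessible (access_url),
    Interoperable (plain JSON), Reusable (licence, fixity checksum)."""
    return {
        "identifier": identifier,            # e.g. a DOI, for findability
        "title": title,
        "creators": creators,
        "licence": licence,                  # reuse terms stated up front
        "checksum_sha256": checksum_sha256,  # fixity, supporting provenance
        "access_url": access_url,
    }

# Illustrative values only.
record = fair_record(
    identifier="10.1234/example-dataset",
    title="Example climate model outputs",
    creators=["A. Researcher"],
    licence="CC-BY-4.0",
    checksum_sha256="0" * 64,  # placeholder digest
    access_url="https://repository.example.org/datasets/42",
)
print(json.dumps(record, indent=2))
```

The point is not the schema itself but that every field is machine-readable, so findability and reuse do not depend on a human remembering where the data lives.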
These approaches signal a growing convergence between infrastructure efficiency and responsible data curation: hybrid architectures underpinned by cold storage provide the technical foundation, while governance frameworks like those Arkivum supports provide the operational assurance. For scientific institutions seeking both scalability and sustainability, this model represents a compelling blueprint for long-term research data strategy.
Preservation is no longer optional
Scientific rigour demands reproducibility. But reproducibility depends on access, not only to published findings but to the underlying data itself. As preservation responsibilities shift from short-term research projects to long-term institutional strategy, infrastructure must adapt. The tools and platforms used to store this data must be transparent, affordable, and compatible with how science is done today and tomorrow.
What next?
As research institutions, data architects, and funders consider their long-term preservation strategies, the conversation must evolve beyond simple backups or “just-in-case” storage. We must treat archival storage as a foundational layer of scientific infrastructure.
To explore emerging models of scientific data preservation, including low-cost cold storage with guaranteed access, visit CloudColdStorage.com/research.
Ensure the science produced today can be verified, built upon, and trusted in the decades to come.