The Challenges of Digital Archiving

Many people think of the web, email, social media and other forms of digital communication as somewhat ephemeral — certainly not as weighty and solid as the books, letters, buildings and historical artefacts that make up our valued cultural history. But in fact, these material objects can be lost or destroyed, and while copying offers some protection, it is often laborious and imperfect.

2025 Science Breakthrough Radar

Jane Winters, Professor of Digital Humanities, School of Advanced Study, University of London, UK

Digital data, on the other hand, can be copied accurately and repeatedly, and can be accessed in perpetuity if appropriately stored, maintained and preserved.

But that is far from what is happening. Much of the data our actions generate today — ever more of which exists only in digital form — vanishes quickly. From a historian's perspective, it is absolutely essential that we archive this “born digital” material in all its ever-changing complexity. If we can grab and store it now, today’s researchers can use it to explore how our culture and society are developing, while their successors will find it invaluable in understanding our times retrospectively.

As an example, consider the study of recent elections in the Global North. It is impossible to make sense of the campaigns and results without understanding how social media has affected how people form their world-views. But there is no systematic archiving of this data — certainly not in forms that offer accessibility and permanence. Much of the history of computing and the web has already been lost, trapped on degrading and inaccessible media, discarded by corporate fiat or never stored in the first place.

So how, in the digital age, do we curate our history and give it permanence? To date, most of the work done has focused on simply archiving the material. Websites disappear all the time, and hacking interventions or high-level decisions can cause government bodies or companies to lose or discard years’ worth of data. That makes it important for us to capture all of this material now — we can start to think about what we do with it once it is safely stored.

There is already a substantive archive of digital material in the Internet Archive’s Wayback Machine, which started archiving the web in late 1996, and many national memory institutions also have significant web archives. But there are other things to capture, such as social media, emails, memes — plus the algorithms that select which content gets delivered. Ideally, our archives would include not just the communication itself, but how it was shared, used and reused. Only then can we really start to think about how it might have affected people's behaviours, experiences and futures.

With social media, there is the dynamic aspect to consider. Capturing a post at a time when it has only been liked twice, before it goes viral, means you haven't really captured its meaning and context at all. There is good work under way here: for example, Valérie Schafer at the Luxembourg Centre for Contemporary and Digital History, University of Luxembourg, led a team aiming to find ways to archive such virality, capturing memes and the ways in which they circulate, so that we can understand more about how information and ideas are distributed in the digital age.

There are many challenges facing the field — including legal issues. In many countries, national libraries’ and archiving institutions’ web-archiving activities are conducted on the basis of legal deposit legislation, which gives a sound legal footing on which to crawl the web and to store contents without having to ask permission. Such legislation doesn’t exist in some countries, however, meaning that archiving is only possible where permission has been granted, making its scope very limited. In other cases the legislation treats an archived web page as if it were a physical book — a single object that two people can’t both access simultaneously, even though it was once accessible to any and all on the public internet. Some workarounds do exist: researchers based in a Danish research institution can apply to have remote access to an archive held at the Royal Danish Library, for example. But the web is largely archived — with the exception of the Internet Archive — on a national domain basis.

Privacy laws add a layer of complexity to the storage of web archives and emails. The UK National Archives is doing interesting work to try to make sure that personal information can be stripped out so that the records can be opened up to the public at the appropriate time; this might be relevant to the archiving of digital culture. Often, though, people simply aren’t aware that their online presence is being preserved as part of a national heritage, and we need to find ways to make sure we aren’t showcasing people who don’t want to be put on display. We also need to make sure that such records come from diverse sources and represent a broad swathe of our societies.

Another challenge is the infrastructure required to archive our digital existences. The cloud-based storage requirements for the petabytes of data are huge and growing, and ever more video and audio streaming means that they will continue to mushroom. The costs of storing all this are environmental as well as financial, and in practice we need to consider what we’re choosing to store, especially since there can be a huge amount of duplication. Indexing and discoverability are significant challenges for digital archiving, especially when multiple copies or versions of web pages are being stored. How do we differentiate between them and make accessing specific data easier, while also appending metadata about how and when it was created, used and shared?
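One common way to approach the duplication and versioning problem is to key stored content by a cryptographic digest, so that identical captures share one stored copy while each capture keeps its own metadata. The following is a minimal sketch of that idea, not a description of how any particular archive works; the `Capture` and `DedupArchive` names, the record fields and the in-memory storage are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Capture:
    """One observation of a URL at a point in time (hypothetical record shape)."""
    url: str
    captured_at: str  # ISO 8601 UTC timestamp, e.g. "2024-05-01T12:00:00Z"
    digest: str       # SHA-256 of the payload, used as the storage key


@dataclass
class DedupArchive:
    """Toy in-memory archive: payloads stored once, captures recorded every time."""
    payloads: dict = field(default_factory=dict)  # digest -> payload bytes
    captures: list = field(default_factory=list)  # every capture, in arrival order

    def add(self, url: str, captured_at: str, payload: bytes) -> Capture:
        digest = hashlib.sha256(payload).hexdigest()
        # Store the bytes only the first time this exact content is seen.
        self.payloads.setdefault(digest, payload)
        capture = Capture(url, captured_at, digest)
        self.captures.append(capture)
        return capture

    def versions(self, url: str) -> list:
        """Return the captures of `url` at which the content changed, oldest first."""
        # Sorting the timestamp strings is chronological only because they are
        # assumed to share one ISO 8601 UTC format.
        history = sorted(
            (c for c in self.captures if c.url == url),
            key=lambda c: c.captured_at,
        )
        distinct = []
        for c in history:
            if not distinct or distinct[-1].digest != c.digest:
                distinct.append(c)
        return distinct
```

In this sketch, three captures of the same page, two of them byte-identical, would occupy two stored payloads, and `versions()` would report two distinct versions. A real archive would also need to handle near-duplicates, such as pages that differ only in a timestamp or an advertisement, which exact hashing does not detect.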

It may well be that, suitably directed, AI is able to perform some of the detailed analysis work required for this. But AI is also an object of study in itself. Just as historians study the industrial revolution through its objects and artefacts, we are beginning to study the similarly disruptive revolution that AI will bring. To do so, we need to understand the infrastructures that are delivering AI, as well as the information it processes and how those processes are shaped by algorithms. That means creating frameworks for the archiving of what Richard Sandford at University College London has described as algorithmic heritage, among other endeavours.

All of this means there is a need for programmers and other skilled technologists in this area. For instance, the Portuguese web archive has just completed a technical initiative that reanimated Flash-based websites, which had been abandoned after the technology became obsolete. Many more such projects will be required if we are to store a truly representative set of digital artefacts. That includes learning how to preserve and work with “hybrid” archives that include hard drives, floppy disks and online data as well as printed and manuscript materials.

There is also a danger of huge inequities arising in terms of geography and wealth. Web archiving is primarily an activity in and of the Global North. When we archive material related to something like the COVID-19 pandemic, for which there was an internationally coordinated set of activities, it is important to ensure that we collect public-health information, news bulletins and so on right from the earliest days of the pandemic and from all around the world. That is a huge challenge when so much more material is readily accessible for countries in, say, Europe than for countries in Africa.

No archive has ever been truly representative: archiving is always shaped by criteria that limit both its scope and its relevance as times change and cultural priorities shift. Digital archives will be no different. But the more thought and resources we can pour into them now, the more grateful future historians — and perhaps future societies — will be.

It is urgent that we should do this work and do it collectively, drawing on the knowledge and experience of historians, archivists, technologists and data scientists to ensure that at least some of our digital present persists into the digital future. It will be vital to invest in and sustain a community of practice in the next decade, both to effectively globalise born-digital archiving and to develop solutions for emerging digital formats. What will succeed generative AI, for example, and will we be prepared to archive and preserve it when it emerges? By that point, perhaps, we will have been able to combine technological innovation with human skill and imagination to make sense of the vast born-digital archives that document contemporary society and culture. In doing so, we will gain insight into the political, economic, health and environmental challenges that mark the first quarter of the 21st century.