How the Danish web is being mapped

By Nicolai Devantier, 11/07/19
Using the supercomputer at The Royal Danish Library and newly developed algorithms, Professor Niels Brügger dives into the Danish part of the World Wide Web to map our digital history. Here, he explains how.

Niels Brügger researches and writes about digital media history. In connection with his research, he has delved into the Danish part of the World Wide Web and analysed its historical development between 2005 and 2015.

“It is a rather crazy project, and when I tell people about it, the reaction is often: “Aaarrhh, is that possible?” says Niels Brügger, newly appointed professor of Media Studies at Aarhus University.

“But it is!”

The data for his research, which is carried out with Assistant Professor Janne Nielsen from Media Studies at Aarhus University and Ditte Laursen from The Royal Danish Library, comes from the Danish Netarchive (Netarkivet) at The Royal Danish Library, which stores a copy of the Danish web comprising around one million websites.

On top of this data, a set of algorithms forms the framework for the calculations on DeiC's high-performance computing system, the Cultural Heritage Cluster, located at The Royal Danish Library in Aarhus.

The purpose of this supercomputer is to give researchers, primarily within the humanities and social sciences, the opportunity to work quantitatively with big data.

Niels Brügger is a Professor of Media Studies at the School of Communication and Culture, Aarhus University. He is head of NetLab (a part of DIGHUMLAB), head of the Centre for Internet Studies (CFI), coordinator of the European network RESAW, and managing editor of the international journal Internet Histories: Digital Technology, Culture and Society.

Most websites are 10 to 20 megabytes

From the beginning, the project has been divided into two main phases. The first phase is to create historical knowledge about the web from 2005 to 2015.

The second phase deals with developing methods and procedures for doing this in practice through supercomputing.

“The Danish part of the web has naturally grown over the period, but not as intensely as one might think. In 2005, it amounted to four TB, and in 2015, it was only around four times as big,” he says.

Still, data of this size requires serious computing power to reach every nook and cranny. The team's approach was, popularly speaking, to send a probe into the “data haystack” and draw out the desired information.

“At first, it was all about counting files and objects to find out how big Danish websites really are. And actually, 97 % of all Danish websites are between 10 and 20 megabytes,” he says, and continues:

“The rest are the really large sites, whose share has not changed significantly over the 10 years we have examined.”
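This kind of counting can be illustrated with a minimal Python sketch. It assumes a simple index mapping each site to its total size; the index format, the site names, and the numbers are invented for the example and are not the project's actual data or tooling.

```python
# Minimal sketch: computing the share of websites whose total size
# falls in the 10-20 MB range, given a toy (domain -> bytes) index.

MB = 1024 * 1024

# Toy stand-in for an archive index of site sizes (illustrative only)
site_sizes = {
    "example-a.dk": 12 * MB,
    "example-b.dk": 15 * MB,
    "example-c.dk": 480 * MB,   # one of the few very large sites
    "example-d.dk": 11 * MB,
}

def share_in_range(sizes, lo_mb, hi_mb):
    """Fraction of sites whose total size lies in [lo_mb, hi_mb] MB."""
    in_range = sum(1 for s in sizes.values() if lo_mb * MB <= s <= hi_mb * MB)
    return in_range / len(sizes)

print(f"{share_in_range(site_sizes, 10, 20):.0%} of sites are 10-20 MB")
# With the toy index above, 3 of 4 sites fall in the range → 75%
```

At the project's scale the same tally would be run over the full archive index on the cluster rather than an in-memory dictionary, but the counting logic is the same.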

In addition, Niels Brügger and his team have looked at, for instance, file types, the number of images, and the number of password-protected sites on the Danish web to map its elements from different angles.

One of the upcoming tasks is to analyse links from all sites, a task that demands solid data cleansing to ensure that each link is represented only once.
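A sketch of that kind of cleansing, in Python: normalising URLs before deduplicating them, so that trivially different spellings of the same link count only once. The normalisation rules below are common-sense assumptions for illustration, not the project's actual rules.

```python
# Hypothetical link-cleansing sketch: normalise, then deduplicate.
from urllib.parse import urlsplit

def normalise(url):
    """Reduce trivially different spellings of a link to one canonical form:
    lowercase scheme and host, and strip a trailing slash from the path."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme.lower()}://{host}{path}"

raw_links = [
    "http://Example.dk/page/",
    "http://example.dk/page",
    "HTTP://example.dk/page",
]

# All three spellings collapse to one normalised link
unique = {normalise(u) for u in raw_links}
print(unique)
```

Real archived links also vary in query-string ordering, fragments, and redirects, so a production cleansing step would need more rules than this, but the normalise-then-deduplicate pattern is the core of it.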

The development of the Danish web (number of objects in millions and size in TB) from 2005 to 2015. Source: Digital media and digital media science, Aarhus University.

Teamwork between humanists and IT experts

One of the main tasks in the project has been to find sorting methods: algorithms that extract the defined data from the big pile of information.

“We have felt our way, and for that reason, there has been a lot of documentation throughout the process, during which we have constantly kept track of our queries and searches. This way, we have continuously refined the method and defined the rules for analysis,” Niels Brügger explains.

In this regard, the Cultural Heritage Cluster has been an indispensable partner: with these amounts of data, the counting work is extremely comprehensive.

“The teamwork between IT specialists, archivists and researchers has also been a necessity to obtain the concrete results. Naturally, it has required us to speak the same “language”, so we can communicate precisely with each other. It takes time, but it is also an exciting and rewarding process.”

In short, the researchers need to know what is technically feasible, and they need to tell the IT experts exactly what they are looking for.

Now, the cooperative culture, the computer models and the algorithms are in place, and the comprehensive work has put Denmark firmly on the global map in terms of mapping national internet content.

“We are the first in the world to have used HPC [high-performance computing] to conduct analyses of a national part of the internet, which means that there is a lot of interest in our project from abroad,” Niels Brügger says.

All the work has been described in detail, and the scripts are publicly available, so researchers from abroad can draw on the experience from this project to analyse other parts of the internet.

About the project

The overall purpose of the project is to analyse the historical development of the entire Danish web based on the material in the Danish web archive Netarkivet. The first ideas for the project were drawn up in 2014, but it only became possible once the Cultural Heritage Cluster was established.

The project runs until the end of 2019 and is supported by Aarhus University, The Royal Danish Library and, in 2016-17, by The Ministry of Culture Research Fund.
