RESAW 2025: THE DATAFIED WEB
PROGRAM FOR FRIDAY, JUNE 6TH

09:00-10:30 Session 9: KEYNOTE by Jonathan Gray: Public data cultures (chair: Tatjana Seitz)
KEYNOTE: Public data cultures

ABSTRACT. This talk explores how data is made public on the Internet amidst the rise of social media, platforms and AI. Retracing the emergence of legal and technical conventions of open data, it looks towards a more expansive understanding of public data cultures which shape how we know and live together. Through a series of empirical vignettes, the talk reconsiders data as cultural material, medium of participation and site of transnational coordination. It then turns to two forms of intervention: making data that is considered missing and entry points for critical data practice. As well as situating public data cultures in relation to the datafication and platformisation of the web, the talk will highlight the role of web archives in studying these developments.

10:30-11:00 Coffee Break
11:00-12:30 Session 10A: Panel: The Skybox research programme
11:00
More than data: the Skybox research programme

ABSTRACT. Skyblog (2002 - ∞) was a pioneering and emblematic social networking platform of the French web in the 2000s. By 2011, it hosted up to 33.5 million blogs, 90% of them created by teenagers. Skyblog offered users a free and customizable digital space where they could easily create blogs, share content such as text, images, videos, and music, personalize their pages, and connect with others through virtual friendships. The platform left a significant mark on web culture and the history of the French web. In 2023, Skyrock, the platform's editor, announced the closure of Skyblog, sparking a collaboration with digital heritage institutions such as the National Library of France (BnF) and the National Audiovisual Institute (Ina) to guarantee its long-term preservation. Through this cooperation, the BnF optimized its crawl processes and collected original datasets, resulting in a collection of up to 12.6 million blogs and 40 terabytes of data. The aim of this panel is to present the challenges of managing a vast digital archive, with a particular focus on the inherent difficulties faced by the web archivists and research teams involved in the Skybox research programme. We will begin by reviewing the technical aspects and methodological goals of the Skybox research programme, scheduled to run from 2024 to 2027. The project's objective is to develop an epistemology of the web archive based notably on quantitative methods and computational approaches, using the Skyblog collection as a field of study. One of the researchers involved in the Skybox project is Quentin Lobbé, whose work focuses on the analysis of digital migrations. He studies how skybloggers moved from and within the platform. Emmanuelle Bermès employs a methodology that combines link mapping and data visualization with personal narratives, mapping connections between blogs while addressing sensitive content, particularly that involving minors, in order to preserve the emotional depth of the datafied web.

Presentation no. 1 (500 words): The Skyblog web archive behind the scenes - Alexandre Faye, Sara Aubry and Marina Hervieu (BnF). The Skyblog collection is without doubt one of the most complex and comprehensive preservation projects ever undertaken by the BnF web archiving team within the context of its legal deposit mission [1]. The team had previously engaged in the preservation of French blogging platforms, yet none on the scale of Skyblog. Preliminary estimates, based on data from the pilot collection, indicated a capture period of more than two years and a data volume of 80 terabytes, which proved technically unfeasible. How could the team capture not only all available blog content (mainly texts, images and photos) but also the platform's entire social network dimension (comments, followers, favorites, avatars, rewards)? This challenge was overcome through a methodical data collection preparation process and a collaborative relationship with the technical team at Skyrock [2]. The optimisation process entailed modifications to the blogs' codebase and the platform's back-end infrastructure. For instance, the source code was changed to display 24 posts per page instead of 8, allowing the crawlers to archive more information more quickly. Another aspect of the project involved identifying the datasets managed by Skyblog, including usage statistics, user profile data, editorial information, moderation terms, and music files, and integrating them into the collection. This collection gives rise to questions pertaining to the professional practices of archivists. How can this mass of data be rendered accessible? What are the legal, ethical and technical issues involved in using the data? What would be the best tools, existing or yet to be developed, to make it searchable and usable? Given the heterogeneity of the large dataset, the age of the bloggers and the diversity of their (often unconventional) practices on the web [3], it is evident that the majority of the data is not so much given as capta [4]. For example, the datasets recovered are of different types: technical tracking data that can be quantified (creation date, number of posts, number of comments, number of friends) and user-generated data (username, place of residence, body size and weight, astrological sign, etc.). The challenge is to facilitate informed exploration of this diversity of datasets, to take into account the actual needs, achievements and ideas of researchers, and to create a synergy around this collection. This is the whole purpose of the collective project Skybox, which aims to establish a unified interface for consulting datasets, use cases and methodological recommendations. The project transforms the web archive into a datafied object of study. Concurrently, it also contributes to the process of making Skyblog part of our heritage. In this presentation, we will look back at the methodological and technical challenges of the preservation process, as well as the preliminary thoughts and work on the Skybox research programme.
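To make the effect of the pagination change concrete: the number of listing pages a crawler must fetch for a blog is its post count divided by the posts-per-page setting, rounded up. A minimal sketch (the 240-post blog is a hypothetical example, not a figure from the project):

```python
import math

def listing_pages(num_posts: int, posts_per_page: int) -> int:
    """Paginated listing pages a crawler must fetch to see every post."""
    return math.ceil(num_posts / posts_per_page)

# Hypothetical blog with 240 posts:
print(listing_pages(240, 8))   # 30 page requests before the change
print(listing_pages(240, 24))  # 10 page requests after: a threefold reduction
```

Multiplied across millions of blogs, reductions of this kind are what helped bring the initially unfeasible estimates within reach of a workable crawl.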

References:
1. BnF. 2024. Le web français collecté par la BnF pour le patrimoine et la recherche. Paris. URL: https://www.bnf.fr/fr/depot-legal-du-web
2. Faye, Alexandre. 04/09/2024. Aujourd'hui les skyblogs entrent dans l'Histoire. Web Corpora. URL: https://doi.org/10.58079/128yr
3. Deseilligny, Oriane. 2009. Pratiques d'écriture adolescentes : l'exemple des Skyblogs. Le Journal des psychologues, (9), Paris, France, 30-35.
4. Drucker, Johanna. 2020. Visualisation. L'interprétation modélisante. Rennes, France: B42 Press.

Presentation no. 2 (500 words): "I'm shutting down my blog, follow me!" Digital migration: a mirror for identity formation in adolescence - Quentin Lobbé (EHESS). "Digital migration" refers to the way all or part of an online community can move from one territory of the web to another, whether or not this move is coordinated — a recent example being users massively migrating towards the federated network Mastodon after Elon Musk bought Twitter. Digital migrations are extremely interesting to study from a historical point of view, since they can reflect: major evolutions within the web itself (web 1.0 to web 2.0, the launch of social media platforms, the popularisation of the mobile web, etc.) [1,2]; frustration, weariness or disappointment with a given platform [3]; or a reaction to a socio-political context outside the web (repressive legislation, surveillance, censorship, war, etc.) [4]. In this presentation, I aim to analyse the migration trajectories of users of Skyblog - a French blogging platform - based on the National Library of France's web archive containing nearly 13 million blogs. As studying migration movements means dealing with the issues of discontinuity inherent to any body of web archives, I will first focus on the technical difficulties such a project implies. How can one follow a collective trajectory, let alone an individual one, through potentially incomplete, inconsistent or redundant archives? I will detail how I was able, on the one hand, to automate the detection of potential traces of digital migration in the Skyblog archives and, on the other, to reconstruct migration trajectories spanning several years. I will then explain why the Skyblog platform is, in my view, a special case in the history of the web. My first hypothesis was that Skyblog users had gradually abandoned their blogs for more modern platforms such as Facebook or Twitter, but my first results show that over 80% of detected digital migrations away from Skyblog are actually migrations within the platform itself: Skyblog users create a first blog, then shut it down to open a second, then a third, and so on. Although these are individual migrations, they very often take place as part of a collective movement, since the motives for most detected migrations are advertised in a text posted on the closing blog so that online friends can read it. For many young French speakers of the late 2000s, Skyblog was indeed the ideal place to discover a new way of socialising online [5,6]. My research also shows that the main reason for shutting down a Skyblog is a discrepancy between a past identity and a new one, even one still under construction. Digital migrations on the Skyblog platform therefore appear as a mirror for identity formation in adolescence. With this contribution I aim to enrich the scientific literature of the 2010s on "the online Self".
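The abstract does not describe the detection method in detail; purely as an illustration of what automated detection of migration traces might look like, the sketch below flags farewell posts with hypothetical regular expressions (the French phrasings are invented examples, not patterns from the study):

```python
import re

# Hypothetical patterns for migration announcements on closing blogs
# (French phrases of the type "suivez-moi sur ...", "je ferme ce blog").
# Destination-capturing patterns are tried first.
MIGRATION_PATTERNS = [
    re.compile(r"suivez[- ]moi\s+sur\s+(?P<dest>\S+)", re.IGNORECASE),
    re.compile(r"nouveau\s+blog\s*:?\s*(?P<dest>http\S+)", re.IGNORECASE),
    re.compile(r"je\s+ferme\s+(ce|mon)\s+(blog|skyblog)", re.IGNORECASE),
]

def detect_migration(last_post_text: str):
    """Return (True, destination-or-None) if the post announces a migration."""
    for pattern in MIGRATION_PATTERNS:
        match = pattern.search(last_post_text)
        if match:
            dest = match.groupdict().get("dest")  # destination, if captured
            return True, dest
    return False, None

print(detect_migration("Je ferme mon blog, suivez-moi sur http://autre-blog.example"))
```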

References:
1. Weltevrede, Esther, and Helmond, Anne. 2012. Where do bloggers blog? Platform transitions within the historical Dutch blogosphere. First Monday.
2. Lobbé, Quentin. 2018. Where the dead blogs are: a disaggregated exploration of web archives to reveal extinct online collectives. In: International Conference on Asian Digital Libraries, Springer, pp. 112-123.
3. Horbinski, Andrea. 2018. Talking by letter: the hidden history of female media fans on the 1990s internet. Internet Histories, 2(3-4), 247-263.
4. Ermoshina, Ksenia, and Musiani, Francesca. 2021. The Telegram ban: How censorship "made in Russia" faces a global Internet. First Monday, 26(5).
5. Cardon, Dominique, and Delaunay-Téterel, Hélène. 2006. La production de soi comme technique relationnelle : un essai de typologie des blogs par leurs publics. Réseaux, (4), 15-71.
6. Fluckiger, Cédric. 2006. La sociabilité juvénile instrumentée : l'appropriation des blogs dans un groupe de collégiens. Réseaux, (4), 109-138.
7. Stora, Michaël. 2009. « Ça ne regarde que les autres ! » ou le blog à l'épreuve de l'adolescence. Empan, (4), 66-71.
8. Deseilligny, Oriane. 2009. Pratiques d'écriture adolescentes : l'exemple des Skyblogs. Le Journal des psychologues, (9), Paris, France, 30-35.

Presentation no. 3 (500 words): From data to emotions - Emmanuelle Bermès (ENC). With 12.6 million blogs and 40 terabytes, the 2023 Skyblog archive is probably the largest web corpus within the BnF collections [4]. When entering such an impressive amount of content, the researcher can only be overwhelmed and disoriented. Using quantitative methods and distant reading may be the first idea that comes to mind. However, there is a long way from lists, counts, statistics and metrics to the individual stories hidden in the archive: stories of men, women and teenagers who often unveil themselves in an intimate and intense way. By treating these persons as if they were only data, there is a high risk of betraying their legacy. Ethical concerns should be at the forefront of our preoccupations when studying vernacular content from the early social networks, especially when minors are involved [3]. In order to enter the corpus in a way that allows us to connect with the emotions it conveys, we have designed a research method that closely resembles the crawling process used by archiving bots. We start with seeds: individual websites that have been identified and selected by librarians and their partners in the course of the creation of the BnF web archive collections since the mid-2000s. Using the web crawler Hyphe, developed by the médialab at Sciences Po and tailored to recrawl the BnF web archive during the ResPaDon project [1], we explore the Skyblog corpus by conducting what we have called a "fractal exploration". Following the numerous links between blogs on the Skyblog platform, we discover hundreds of new blogs, and we can then leverage data visualization to identify clusters. In this presentation, we will show how this method leads to the identification of communities and provides a way for the researcher to progress towards close reading and the discovery of significant stories. Instead of trying to get an overview of the corpus using numbers, we use the links that originate from skybloggers connecting with one another, like breadcrumbs showing a way through this mass of content. We will discuss the benefits of this approach, but also question its limitations in the context of the crawl carried out by the BnF in 2023, more than ten years after the platform's peak of popularity, considering that only a third of the blogs remained online at that time. Finally, we will discuss how the combination of distant reading and close reading can help to get a global sense of the corpus without relinquishing the emotions, which are constitutive of any type of cultural heritage [2].
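The method itself is implemented with Hyphe; the sketch below is only a schematic rendering of the underlying idea (breadth-first expansion from curated seeds along inter-blog links, then clustering on the resulting graph), with fetch_outlinks as a hypothetical stand-in for the crawler's link extraction:

```python
from collections import deque
import networkx as nx

def fetch_outlinks(blog_url: str) -> list[str]:
    """Hypothetical stand-in for link extraction on an archived page."""
    raise NotImplementedError

def fractal_exploration(seeds: list[str], max_depth: int = 2) -> nx.DiGraph:
    """Breadth-first expansion of the corpus along inter-blog links."""
    graph = nx.DiGraph()
    queue = deque((seed, 0) for seed in seeds)
    seen = set(seeds)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for target in fetch_outlinks(url):
            graph.add_edge(url, target)
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return graph

# Candidate communities can then be read off the link graph, e.g. with
# nx.community.louvain_communities(graph.to_undirected())
```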

References:
1. Aubry, Sara, Audrey Baneyx, Emmanuelle Bermès, Laurence Favier, Alexandre Faye, Marie-Madeleine Géroudet, and Benjamin Ooghe-Tabanou. 2024. "A network to develop the use of web archives: Three outcomes of the ResPaDon project". In Exploring the Archived Web During a Highly Transformative Age, edited by Sophie Gebeil and Jean-Christophe Peyssard. Florence, Italy: Firenze University Press.
2. Bermès, Emmanuelle. 2024. De l'écran à l'émotion : quand le numérique devient patrimoine. Paris, France: École nationale des chartes-PSL.
3. Milligan, Ian. 2019. "Learning to See the Past at Scale: Exploring Web Archives through Hundreds of Thousands of Images". In Seeing the Past with Computers: Experiments with Augmented Reality and Computer Vision for History, edited by Kevin Kee and Timothy Compeau, 116-136. University of Michigan Press. URL: https://www.jstor.org/stable/j.ctvnjbdr0.10
4. Tybin, Vladimir. 2024. "Les skyblogs au service de la science". Chroniques, March 2024. URL: https://www.bnf.fr/fr/les-skyblogs-au-service-de-la-science

11:00-12:30 Session 10B: Platform Histories Roundtable
11:00
Platform Histories Roundtable (with Miglė Bareikytė, Marcus Burkhardt, Devika Naraya, Anne Helmond, Fernando van der Vlist)

ABSTRACT. Platforms have multiple histories. The global histories of the political economy of platform capitalisms can be dated from the racial capitalism of so-called "platform or racial fixes" during the global financial crisis of 2008 back to the history of flexibilization in just-in-time production in 1960s Japan. In terms of media history, platform histories tie in with the modularization and outsourcing of software development, the archeology of algorithmic techniques, the privatization and monopolization of infrastructural services, and capitalist data capture. Platform histories scale from the development of singular modules and platform ecosystems to the global political economies of platforms. Despite these many historical perspectives on platforms, platform historiography remains largely a desideratum of platform studies and lacks systematic theoretical and methodological approaches. The proposed roundtable aims to provide the first collection dedicated to drawing together and synthesizing the existing multiplicity of platform-centric research as well as cross-platform histories, while focusing on exploring and developing multifaceted platform histories. Platform giants are internationally operating organizations embedded in complex technologies and infrastructures. For instance, social media platforms rely on exploitative, labor-intensive content moderation, while platform labor is organized within and through meticulously designed interfaces, apps, and their infrastructures. We aim to bring together perspectives from platform labor research and platform-centric research. How can critical platform history be written amidst the tensions and forces of infrastructural power, data-intensive economies, and geographic specificities? This roundtable responds to this challenge theoretically and methodologically, in a multi-layered, multi-sided and globally entangled way. Within the roundtable, we want to discuss these historiographical approaches to the most central infrastructures of the datafied web – platforms – with researchers from various fields. The roundtable serves as preparation for a special issue, which will be the first to systematically deal with platform histories.

Organised and moderated by Sebastian Randerath and Tatjana Seitz, with the participation of Miglė Bareikytė, Marcus Burkhardt, Devika Naraya, Anne Helmond and Fernando van der Vlist.

11:00-12:30 Session 10C: Past Metrics
11:00
Translating Web Data into Media History: A Methodological Reflection on Archiving and Analyzing the XS4ALL Homepage Collection

ABSTRACT. Web archives have become an invaluable resource for contemporary historical research, providing new primary sources and unique opportunities to investigate online cultures (Milligan, 2019). The increasing reliance on born-digital materials, such as websites, has led to the adoption of digital humanities methods in historical research, notably through the use of a "web-minded approach" (Brügger, 2018). This approach stresses the need to consider the specific characteristics of archived web pages, to be mindful of the processes behind their archiving, and to apply methods appropriate for working with such material. While historians have traditionally depended on source criticism, engaging with web archives requires additional skills and insights to interpret these digital artifacts and translate them into meaningful historical analysis. This paper examines the steps involved in this process, fostering dialogue between a web archivist and a media history scholar. It offers a methodological reflection on the types of data that are significant within web archives, why these are crucial for historians, and how they can be effectively incorporated into historical research.

Key aspects of archived web collections that matter to both the archivist and the historian will be discussed, such as collection formation, metadata selection, and sample preparation for tools like SolrWayback. Furthermore, the paper reflects on how the various types of data included in a collection can be appropriated for DH methods like multi-modal content analysis, link analysis, or topic modelling. Preliminary phases should be taken into account as well; hence, curatorial decisions and related technical considerations, such as harvest dates and crawl depth, will also be examined. All of these factors are weighed by the web archivist and subsequently affect the content of a collection as well as the material's periodisation and authenticity, and thus the notions scholars can construct using them.

Historical research using the XS4ALL homepage collection archived at the Dutch Royal Library forms the exemplary base for this paper. This collection includes a variety of URLs of websites created by XS4ALL subscribers, who could design their own homepages (de Bode & Teszelszky, 2021). The collection presents notable cases to be considered by both archivists and historians. For example, it was harvested from a curated list of URLs rather than being indexed by search engines, owing to the historical significance of XS4ALL. Furthermore, this period of the early web offers interesting obsolete technologies to be studied (e.g. Flash), as well as data challenges such as independent websites that are not linked to any other URL or lack inbound links altogether. Another technical aspect is that XS4ALL websites underwent a domain name change. The leading question is which aspects of web data historians should know in order to use archived web collections properly.

This paper seeks to investigate the translation of web data into historical narratives by examining the XS4ALL homepage collection through both archival and historical lenses, employing a web-minded approach. This process is shaped by the interplay of curatorial, archival, and technical decisions that affect how born-digital materials should be interpreted and understood by scholars – a type of source that will continue to gain prominence in contemporary historical research.

11:20
The early datafied web: Visitor counters on the Danish web in the 1990s

ABSTRACT. One of the earliest ways of datafying the web was the web counter, which calculated the number of visitors to a website. Visitor counters were the only way that a website holder could get automatic feedback from and about visitors. Although the information about the number of visitors was very limited and not very detailed, it gave the website owner a sense of how popular the website was, while at the same time flagging this information for the users of the website.

This paper discusses the emergence, spread and development of early visitor counters on the Danish web in the 1990s. The paper takes its point of departure in the research project 'Histories of the Danish web in the 1990s' and is guided by the following research question: which role(s) did visitor counters play as one of the early web's fundamental infrastructure elements?

Based on this research question, the paper presents an initial mapping by investigating the following aspects of the development of early visitor counters. (1) Producers, companies, economy and market: an analysis of the main actors who produced visitor counters, including their business models and the market, from early hand-coding, via peer-to-peer distribution of the relevant HTML code, to professional international web companies like Digits (digits.com) or Internet Audit Bureau (internet-audit.com), as well as Danish counter providers like chart.dk and Danmarks Top100 (danmarks-top100.dk), and web hotels like Cybernet (cybernet.dk). (2) Technology: an analysis of how visitor counters were constructed and how they worked on a website, including how they collected clicks and communicated with the visitor counter companies, with a view to the website owner becoming part of the top hit charts on the producer's website. (3) Statistics: a mapping of how many visitor counters existed on the Danish web, in total and relative to the total number of websites. (4) Network: an analysis of the hyperlink network between websites using a visitor counter and the providers of counters. (5) Website owners, forms of use, and aesthetics: an analysis of which types of websites visitor counters were used on, and of how they were communicatively and aesthetically framed by the website owner, including wording, icons, placement on the web page, etc.
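Regarding aspect (2): the canonical 1990s counter was a small server-side CGI program referenced from the page; each page view triggered the script, which incremented a count kept in a file and returned the total (hosted services like digits.com returned it rendered as an image, which also let the provider log the click for its top charts). A minimal Python sketch of the general pattern, not any specific provider's code:

```python
#!/usr/bin/env python3
# Minimal sketch of a 1990s-style visitor counter as a CGI script.
# Each page view requests this script, which increments a per-site
# count kept in a plain text file and echoes the new total.
from pathlib import Path

COUNTER_FILE = Path("counter.txt")  # hypothetical storage location

def bump() -> int:
    count = int(COUNTER_FILE.read_text()) if COUNTER_FILE.exists() else 0
    count += 1
    COUNTER_FILE.write_text(str(count))
    return count

print("Content-Type: text/plain\r\n\r\n", end="")  # CGI response header
print(f"You are visitor number {bump()}")
```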

Sources: websites from the Internet Archive, extracted from the national Danish web archive Netarkivet and accessed through a SolrWayback interface, which allows for free-text search and extraction of all elements of the web pages; internal documents from visitor counter companies, insofar as these have been provided; and research interviews with a limited number of website holders from the 1990s.

The presentation will outline the results within each of the focus areas above, including how they interrelate.

11:40
From Hit Counters to the Professionalisation of Web Metrics in Luxembourg (1990s-Mid-2000s)

ABSTRACT. The objective of this presentation is twofold: firstly, to identify the top-ranking websites in Luxembourg during the late 1990s; and secondly, to trace the evolution of the professionalization of measurement metrics in the country. This study focuses on how various stakeholders organised themselves to provide standardised data, thereby fostering the development of the nascent online advertising industry.

Furthermore, this presentation seeks to elucidate the methodological challenges inherent to the analysis of website metrics, particularly those associated with the use of web archives. These challenges include the limitations of web archives in capturing the user perspective (Meyer, Thomas & Schroeder, 2011) and the inherent issues of web archives themselves, such as incompleteness and temporal and spatial inconsistencies between archived fragments (Brügger, 2018). To illustrate, an analysis of the website cim.be, a Belgian company with which many Luxembourg editors were affiliated in 2001 to ensure certified Internet audience measurement and data veracity, revealed only fragmented data, which makes it difficult to draw comparisons over time. This data was supplemented with information from newspapers, magazines, and company websites to gain insight into the audience of the 1990s and 2000s (Arend, 2006).

In the 1990s, website traffic was measured by analysing server logs, which provided information through thousands of lines. However, the market soon evolved towards more user-friendly web analytics solutions, such as web counters. Each company had its own system, which often lacked reliability (Webster, Phalen & Lichty, 2013; Shiu, n.d.). It can be argued that one of the driving forces behind Luxembourg's development of standardised metrics, and the proposal for a neutral institution to oversee them, was the burgeoning online advertising industry, which led to the first conference on online advertising as early as 1999.

In addition, we provide two lists of the most visited websites: one covering December 1997 to August 1998, drawn from the first Internet directories and web portals for Luxembourg websites, and one for 2004, drawn from CIM.be, in order to include users in the website mapping of Luxembourg.

References:

Brügger, N. (2008). The archived website and website philology: A new type of historical document? Nordicom Review, 29, 155-175.
Meyer, E., Thomas, A., & Schroeder, R. (2011, June 30). Web Archives: The Future(s). SSRN Scholarly Paper. Rochester, NY. https://doi.org/10.2139/ssrn.1830025
Arend, Olivia. (2006, March). 2001-2005 : Splendeur et misères du Web luxembourgeois. Paperjam, 118-121.
Webster, J. G., Phalen, P. F., & Lichty, L. W. (2013). The Audience Measurement Business. In Ratings Analysis (4th ed.). Routledge.
Shiu, Alicia. (n.d.). The Early Days of Web Analytics. Amplitude. Retrieved October 15, 2024, from https://amplitude.com/blog/the-early-days-of-web-analytics

12:30-14:00 Lunch Break
14:00-15:30 Session 11A: Data Regimes
14:00
Historicizing Environmental Data on the Web: Surfrider.org, 1997-2024

ABSTRACT. Web-based environmental data dashboards provide critical points of access for users hoping to gain knowledge concerning their surroundings, yet their historical development has not been explicitly tracked in the existing literature. This paper historicizes environmental data on the web by examining preserved copies of the US-based nonprofit Surfrider Foundation's coastal water quality monitoring data, dating back to the site's earliest Wayback Machine crawl in October 1997.

The Surfrider Foundation was founded in 1984 to pursue coastal environmentalism. Successful early initiatives included various sewage runoff and industrial waste management infrastructure improvements across the US east and west coasts. Since at least its earliest web crawl, the organization has provided information concerning Southern California's coastal water quality to users via its website at surfrider.org. Today, the organization's Blue Water Task Force tracks water quality data through a network of volunteers collecting, processing, and logging water samples in dozens of locations globally. While data were published as text on a static HTML web page in the late 1990s, today they are presented in downloadable JSON and CSV files, which are in turn contextualized within a dynamic JavaScript-based map. This study examines the earliest iterations of Surfrider's water quality data publication efforts on the web and compares them to its most recent.

Analysis comprises two phases. First, I examine the nature of the water quality data and data structures presented on the Surfrider website, noting data categories and formats. Next, I examine the Surfrider website's source code to identify the web design techniques used to publicize environmental data. In each phase of analysis, findings from the 1997 Wayback Machine crawl are compared against findings from the website in its current form to better understand the historical development of public-facing environmental data sharing practices on the web over time.
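As a sketch of how the two states of the page can be put side by side, the Wayback Machine's public availability API resolves a URL and target timestamp to the nearest capture; the code below is a minimal illustration of that retrieval step, not the paper's actual tooling:

```python
import requests

def closest_snapshot(url: str, timestamp: str) -> str | None:
    """Resolve the capture nearest to `timestamp` via the Wayback availability API."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

early = closest_snapshot("surfrider.org", "19971001")  # earliest era of the site
if early is not None:
    html_1997 = requests.get(early, timeout=30).text   # static HTML with inline data
current = requests.get("https://www.surfrider.org/", timeout=30).text
# The two sources can then be compared for data categories, formats, and
# the web design techniques used to publish them.
```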

Conceptually speaking, this study builds on recent scholarship concerning the role of data dashboards in the sociotechnical construction of coastal water quality knowledge (Hodges 2024), and contributes to the literature by introducing a historical perspective. While previous research has shown contemporary coastal water quality data initiatives to emphasize bacteria quantities above all other water quality metrics, the present study shows that during Surfrider Foundation's earlier phases, water quality data assumed the form of richer, "thicker" descriptions more akin to ethnographic field notes than discrete bacteria counts. Each approach in turn performs a different form of "synchronization" between data and reality, thus facilitating different ideological and material activities (Bowker 2005, p. 48). In conclusion, I argue that Surfrider's current emphasis on discrete, tabular bacteria data synchronizes its initiatives with an emphasis on the potential for acute bacterial illnesses, rather than the long-term illnesses caused by other forms of pollution or the embodied risk-management practices outlined in its 1990s-era descriptions.

References:

Bowker, G. C. (2005). Memory Practices in the Sciences. MIT Press.

Hodges, J.A. (16 July, 2024) “Comparing Ocean Epistemologies: Reverse-Engineering Los Angeles’ Data Dashboards.” Society for the Social Studies of Science/European Association for the Study of Science and Technology (4S/EASST), quadrennial joint meeting. Amsterdam, NL.

14:20
The un/expected work of open data policies

ABSTRACT. This paper examines the history of the datafied web from a literal perspective: the use of the web to make scientific research data accessible. Here we focus on how policies, laws, and guidelines have constructed the web as a site for sharing and accessing datasets, with a focus on open science data. The history of "open science" policies is distinct from that of "open data"; both are important to accounts of data access on the web. While the US government has long been involved in producing scientific data (e.g. Aronova et al., 2010; Edwards, 2010) and collecting data about its citizens (Bouk, 2017; Igo, 2018) that is useful to scientists and social scientists, the precursors to open science policies that ensure citizens have access to scientific research have been traced to the start of the National Science Foundation after World War II (Pasek, 2017). Meanwhile, open data in US policy is rooted in traditions of governmental transparency and online digital access initiatives (Schrock, 2016). The newest iteration of open data laws and policies emphasizes that data needs to be "machine readable" or "machine actionable" (Rep. Ryan, 2022; Wilkinson et al., 2016). Scholars have examined how open technology activism reproduces neoliberal ideologies (e.g. Hester, 2016; Kelty, 2008), and in the realm of open science data, private corporations are often best positioned to make these resources serve their own profit-seeking ends (Leonelli, 2013; Mirowski, 2018).

The discourses around open data imagine that data is something that can be plucked from its context via the open web and used elsewhere, but the worlds in which many of these policies went into effect have shifted rapidly as AI firms access data from the open web. Data misuse occurs when data's original context, intended use, or "originary domain" limits are ignored (Acker, 2018). Furthermore, as many indigenous and Black feminist scholars have shown, data access is typically envisioned for those who will use the data, and not always for those who may be most impacted by data's reuse and deployment (e.g. Carroll et al., 2022; Sutherland, 2024). The consequences of the context of data reuse are further shifted by the new emphasis on machine actionability, because AIs can become the new context of data reuse. By incorporating open data into AI via web infrastructures, new ethical, material, labor, intellectual property and fairness dimensions of open access come into focus. AI shifts the stakes of unbridled access to open science data and prompts us to revisit the policies governing web infrastructure.

14:40
Investigative turn in the Baltics in times of war in Europe

ABSTRACT. Datafication during polycrises (Henig & Knight, 2023; Norman, Ford & Cold-Ravnkilde, 2024) has contributed to the "investigative turn": an intensification of (digital) investigative practices that work with digital media and data to resist emergent digital injustices. Professional journalists have been joined by think tanks, government institutions and individuals in using different types of data to analyze and critique the growing dark side of digitalisation and infrastructural globalization, including disinformation, corruption networks and polarisation. These actors use different kinds of (digital platform) data as evidence in producing new narratives about ongoing controversies, conflicts and wars (Bedenko & Bellish, 2024; Pastor-Galindo et al., 2020). Iconic examples of such contemporary data-based investigations include Bellingcat's work on the downing of MH17, which geolocated the origin of the Buk missile using geographical landmarks and intercepts from Russian security services, and Mnemonic's documentation of digital evidence of the Syrian revolution in building the Syrian Archive. Nevertheless, systematic efforts to gather and utilize open source information can be traced back to the mid-19th century in the United States and the early 20th century in Europe (Block, 2023); moreover, the geographical diversity of investigative actors goes beyond those located in the western parts of Europe. Since the illegal occupation of Crimea and the ongoing Russian war against Ukraine, a complex landscape of investigative actors has also emerged in Central and Eastern Europe, with a diverse focus in terms of topics, strategies and methods of cooperation. Within these frameworks, investigative practices aim to: counter disinformation (Denisenko, 2023); expose corruption and sanctions-evasion networks; preserve the digital memory of the ongoing war (Nazaruk, 2022; Bareikyte & Skop, 2022); develop new narratives, methodologies and sustainable digital infrastructures to research the war (Bareikyte et al., 2024) and its aftermath; create new cultures of evidence-based research; and securitise societies and environments in Central and Eastern Europe. Within CEE, the Baltic states (Estonia, Latvia and Lithuania) represent an interesting but complicated case in the context of investigative practices. While these countries are currently not under direct attack from Russia, as Ukraine is, their well-developed digital infrastructures have experienced attacks at both the narrative and infrastructure levels, including disinformation, cyber-attacks and GPS jamming (Braw, 2024; LETA/TBT, 2024). While investigative journalism experienced a massive decline a decade ago (Houston, 2010), investigative media and data practices have been increasingly used in Estonia, Latvia and Lithuania in recent years as a response to Russia's information war (Denisenko, 2023). The analytical, critical, and educational role of investigative journalists, citizen activists, think tanks, and scholars in countering informational attacks through investigative practices and digital data is crucial to the formation and development of cooperative action and meaning-making practices in complicated times for this region (Chakars & Ekmanis, 2022).
This diverse range of actors, working on different "fronts" and in different parts of society, illustrates an emergent culture of premonition of war that shapes contemporary cultures of preparedness in the Baltics. In our talk, we focus on investigative practices in the Baltics, including investigative journalism, fact-checking, OSINT and experimental-educational work, which we explore through semi-structured interviews and fieldwork in 2024-2025. Interviewees comprise representatives of non-profit organisations, public broadcasters, private media companies, academics and freelance journalists. We map and present the actors, focusing on their audience strategies, the role of platform and other types of data in their work, and their cooperation practices, outlining the meaning of investigative practices in contemporary datafied and securitised cultures in Central and Eastern Europe.

14:00-15:30 Session 11B: Web Archives Practices
14:00
Temporally Extending Existing Web Archive Collections for Longitudinal Analysis

ABSTRACT. The Environmental Data and Governance Initiative (EDGI) regularly crawled US federal environmental websites between 2016 and 2020 to capture changes between two presidential administrations. However, because the collection does not include the previous administration, which ended in 2008, it is unsuitable for answering our research question: "Were the website terms deleted by the Trump administration added by the Obama administration?" Thus, like many researchers using the Wayback Machine's holdings for historical analysis, we do not have access to a complete collection suiting our needs. To answer our research question, we must extend the EDGI collection back to January 2008. This includes discovering relevant pages that were not included in the EDGI collection but persisted through 2020, not just going further back in time with the existing pages. We pieced together artifacts collected by various organizations for their own purposes through many means (Save Page Now, Archive-It, and more) in order to curate a dataset sufficient for our intentions.

In this paper, we contribute a methodology to extend existing web archive collections temporally to enable longitudinal analysis, including a dataset extended with this methodology. We identified the reasons URL candidates could be missing from the initial EDGI dataset, and crawled the past web of 2008 in order to identify these missing pages. We also identified small domains that were vulnerable to being missed by our past web crawl, and found that these domains benefited from a complete web archive index lookup instead. We probed another large collection, the End of Term 2008 dataset, for additional longitudinal candidates, but found that crawler traps were inflating the size of the dataset, leading to only a small number of additional URLs. By analyzing the provenance of the final collection, we determined that this new longitudinal dataset covering three US presidential administrations only exists because of aggregation of artifacts collected by many organizations. We also found that automated brute-force methods alone were not sufficient to create this collection, and that iterative manual analysis of automated results produced more seeds for candidates. Our new dataset includes 1,220 archived triplets (2008, 2016, and 2020) of US federal environmental webpages.
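One building block of such temporal extension can be illustrated with the Internet Archive's public CDX API, which lists captures of a URL within a date range; querying it for each target year yields the kind of (2008, 2016, 2020) triplet the dataset is built from. A minimal sketch (the example URL is hypothetical; error handling is omitted):

```python
import requests

CDX = "http://web.archive.org/cdx/search/cdx"

def first_capture_in_year(url: str, year: int) -> str | None:
    """Return the Wayback URL of the first 200-OK capture of `url` in `year`."""
    params = {
        "url": url,
        "from": str(year),
        "to": str(year),
        "filter": "statuscode:200",
        "output": "json",
        "limit": "1",
    }
    rows = requests.get(CDX, params=params, timeout=60).json()
    if len(rows) < 2:  # first row is the header; no data row means no capture
        return None
    timestamp, original = rows[1][1], rows[1][2]
    return f"https://web.archive.org/web/{timestamp}/{original}"

# Hypothetical page from the collection:
triplet = {y: first_capture_in_year("epa.gov/climatechange", y) for y in (2008, 2016, 2020)}
```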

We use our new dataset to analyze our question, "Were the website terms deleted by the Trump administration added by the Obama administration?" We find that 81 percent of the pages in the dataset changed between 2008 and 2020, and that 87 percent of the pages with terms deleted by the Trump administration were terms added during the Obama administration. We probed for change trends: when agencies had the same terms repeatedly removed across their websites. We found that certain agencies experienced a large number of change trends, including OSHA, NIH, and NOAA, while 17 of the 30 agencies, including NASA and the Department of Energy, experienced no change trends. Finally, we analyzed the 56 deleted terms and phrases tracked by EDGI and found that the terms fell into two categories: climate and regulation, and identified that there were more change trends in regulation term deletions than climate term deletions.

14:20
Engaging audiences with the UK Web Archive: Strategies for general readers, data users, and the digitally curious

ABSTRACT. This paper explores approaches to engaging three distinct audiences (general readers, data users, and the digitally curious) with the UK Web Archive. Building on collaborative work with the National Archives UK, and drawing on experiences from Cambridge University Libraries and the National Library of Scotland, we present practical recommendations and demonstrate best practices for designing web archives that meet diverse user needs while ensuring broad and equitable access to digital resources. To enhance the experience of general readers, we have introduced exploratory, user-friendly, and gamified interfaces that encourage interactive exploration of web archive collections. Additionally, public engagement is a key focus, with outreach events such as exhibitions designed to raise awareness of these valuable digital resources among library users. By creating engaging experiences that invite discovery, we aim to bridge the gap between casual web users and the rich historical material contained within web archives.

For data users, we prioritize curating detailed metadata and implementing Datasheets for Datasets to support the quantitative analysis of web archive collections. Outreach initiatives for this community include hands-on workshops and data visualization calls, which invite users to interpret and represent the collections through visual mediums. The visual outputs from these calls often enrich future public-facing resources, further enhancing the archives' accessibility to general readers. Through these efforts, we aim to foster a collaborative ecosystem that encourages innovative research and deeper exploration of the collections.

A major focus of our work is addressing the digital skills gap, particularly for the digitally curious—those who recognize the potential of web archives but lack the technical skills to fully engage with them. To support this group, we are developing in-library workshops tailored to building foundational digital literacy and data analysis skills. These workshops are designed not only to upskill participants but also to inspire them to explore web archives more confidently. By equipping users with the tools to navigate and analyze collections, we hope to empower a broader demographic to engage with these resources.

In summary, in this paper, we present a strategy to improve the usability of the UK Web Archive across varied institutions. Through a combination of material development (datasets, interfaces) and diverse outreach events (exhibitions, data visualization calls, workshops), we aim to meet the needs of general readers, data users, and the digitally curious. By tailoring our approach to these distinct groups, we strive to create an inclusive, dynamic web archive experience that invites exploration, research, and digital empowerment.

14:40
Seed lists on themes and events on Arquivo.pt: a curious starting point for discovering a web archive

ABSTRACT. Every year, Arquivo.pt makes special collections dedicated either to a particular topic or to events. To do this, it starts by producing a list of seeds (addresses of selected web pages), which it then crawls and records. The recorded content becomes accessible after a one-year embargo, along with additional information such as the seed lists, contextual information and, in some cases, logs and CDX indexes.

This presentation briefly explains: 1) the mission of Arquivo.pt to support research; 2) the criteria for creating a thematic collection or a collection about an event; 3) the method for obtaining lists of seeds; 4) the results obtained; 5) issues relating to the recording of seeds, namely the tools used and their limitations; 6) the use case of the special collection on Portuguese artists; and 7) the lessons learnt and the challenges raised by researchers.

Arquivo.pt (the Portuguese Web Archive) is a public service that anyone can use to find old web pages. However, its primary mission is to serve scientific research. Institutionally, it belongs to Fundação para a Ciência e a Tecnologia, the government organisation for research support. Arquivo.pt makes all its data available through various strategies and services: a search interface, an API for automatic processing, open datasets and seed lists.

Arquivo.pt has made special collections on the occasion of events such as political elections, the Olympic Games and the start of the pandemic. Others have focused on specific topics such as museums, the press, street art, artists, etc. The thematic collections are intended to arouse the curiosity of the community, feeding the Arquivo.pt collection with content from their fields of study.

The selection of seeds was partly manual and, for some collections, the community participated. However, Arquivo.pt also uses an automatic methodology to obtain a large number of seeds, relying on services such as the Bing Search API.
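As an illustration of that automatic step (not Arquivo.pt's actual pipeline), a web search API can be queried for a topic and the result URLs collected as candidate seeds. The sketch below assumes the Bing Web Search API v7 and a valid subscription key; the query string is a hypothetical example:

```python
import requests

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"
BING_KEY = "YOUR-SUBSCRIPTION-KEY"  # assumption: a valid Azure key

def collect_seeds(query: str, count: int = 50) -> list[str]:
    """Collect candidate seed URLs for a thematic collection from web search."""
    resp = requests.get(
        BING_ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": BING_KEY},
        params={"q": query, "count": count},
        timeout=30,
    )
    resp.raise_for_status()
    pages = resp.json().get("webPages", {}).get("value", [])
    return [page["url"] for page in pages]

seeds = collect_seeds("artistas plásticos portugueses site:.pt")
```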

In October 2024, Arquivo.pt published its 51st set of open data, more than half of which consisted of seed lists on specific events and topics (on Dados.gov, the Portuguese open data portal, and at arquivo.pt/datasets).

The seed list was just a starting point for recording; it can also be a starting point for research, raising various questions. Is the recorded content representative of a particular theme? How many of these seeds were not successfully recorded? In which cases was this due to technologies that blocked access?

To illustrate the ideas in this presentation, a special collection on Portuguese visual artists is presented. This collection emerged from a collaboration with the artists' community. A PhD researcher included this material in her research project.

Among the lessons learnt, we would highlight the following: the seed lists are useful for a first approach by researchers, and it is useful to gather even more information about the selection and recording process. In this sense, the seed lists on themes and events in Arquivo.pt are a curious starting point for discovering a web archive. The challenge now remains for researchers to test their methodologies on these datasets.

14:00-15:30 Session 11C: Methods
14:00
Critical AI technography: Researching the material political economy and power of AI platforms

ABSTRACT. This paper proposes technography as a valuable methodology for conducting critical empirical and historical studies of the material political economy and power of artificial intelligence (AI) platforms. We argue that technography—a descriptive and interpretive approach to analysing the structural and operational aspects of technical systems (Bucher, 2016; Helmond and Van der Vlist, 2019)—can be applied to critically examine AI platforms like Azure OpenAI, Amazon SageMaker, Google's Vertex AI, and NVIDIA AI. This methodology is crucial for scrutinising how major technology companies, or "Big AI", are driving AI's "industrialisation" across various sectors and in everyday digital life.

Our adaptation of AI technography draws from existing research to investigate the material, evolutionary, and discursive components of AI systems (Van der Vlist et al., 2024; Luitse, 2024) and their broader platform infrastructures (Burkhardt, 2020; Helmond et al., 2019). It employs sources like technical platform documentation, corporate blogs, financial reports, and archived product pages from the Internet Archive to provide a historically grounded understanding of the workings and power structures underlying AI platforms (Helmond and Van der Vlist, 2019). These sources enable a critical evaluation of the objectives, functions, and claims made by these companies and reveal their evolving influence on AI development and deployment.

Additionally, this methodology allows researchers to examine the specific strategies employed by major technology companies to consolidate and exert economic, infrastructural, and symbolic power. For instance, Amazon's AWS and Google Cloud have become dominant by providing essential cloud infrastructure services that have become the backbone of the "datafied web" since the early 2010s, coinciding with the rise of data-driven "surveillance advertising" as its dominant business model (Crain, 2021; Van der Vlist and Helmond, 2021). In this context, our method offers a critical, empirical framework for analysing three key dimensions of "Big AI" and its political economy within the broader history of the datafied web.

First, it addresses AI’s deep industrialisation, where major technology companies drive economic and technological expansion across various sectors, consolidating market power and monopolisation dynamics (Van der Vlist et al., 2024). This reinforces existing power structures, with Big Tech leveraging control over AI infrastructure to limit competition, particularly concerning cloud-reliant large language models (LLMs) (Kak and Myers West, 2023; Luitse and Denkena, 2021; Narayan, 2022).

Second, it addresses the evolving infrastructural power of AI platforms. These platforms have evolved alongside large-scale cloud infrastructure dependencies, as major technology companies set new standards and shape the conditions for AI production and implementation across domains such as cultural production, healthcare, and security (Jacobides et al., 2021; Van der Vlist et al., 2024). This includes strategies of vertical integration, complementary innovation, and abstraction to obscure the complex operations and governance of AI platforms (Luitse, 2024).

Third, it addresses the evolving symbolic power of AI platforms. Companies use discursive strategies to influence dominant ideas about desirable AI types and promote notions of “openness” and “democratisation” (Burkhardt, 2020; Widder et al., 2023), or AI ethics (Aradau and Blanke, 2022).

Taken together, critical AI technography is oriented towards how companies like Microsoft, Google, Amazon, and NVIDIA shape contemporary AI trajectories within broader web history, through their converging economic, infrastructural, and symbolic power. As AI increasingly permeates economic sectors and digital life, it is essential for critical scholars, journalists, activists, policymakers, and regulators to trace and critique the forces driving AI’s evolution.

14:20
AI: A Lever for ‘Decolonizing’ Archives? Web Archives as a Datafield for Critical and Inclusive Uses of AI in History

ABSTRACT. Concluded in 2024, the European program Polyvocal Interpretation Of Contested Colonial Heritage (PICCH) aimed to explore how archival documents created from a colonial perspective could be reappropriated and reinterpreted to become an effective source for constructing an inclusive future society. In France, the term 'decolonization' has been heavily instrumentalized, losing the profound meaning attributed to it by historical thinkers like Achille Mbembe. In this project, decolonizing French television and web archives means making these materials from former colonial powers more inclusive and respectful towards populations still facing discrimination today, a challenge that has been driving archivists worldwide for years (Ghaddar & Caswell, 2019). One of the project's objectives was to refine the metadata of television archives as well as web data concerning narratives of events related to the colonial past or post-colonial issues. We scrutinized the media coverage of the 1983 March for Equality and Against Racism from a transmedia perspective, based on web video corpora and archived web pages from the INA. One of the goals was to examine the visibility accorded by the media to the marchers themselves: in 1983, they were young suburbanites, born to immigrant parents in French urban suburbs and perceived as Maghrebi or Black, which led to an essentialization of the discourse on this event in the media. The marchers were relegated to the periphery of the journalistic narrative from the 1980s until more recent commemorations, and they have used the web to reclaim the narrative of this event. Given the volume of data (archived web pages, voice-over text from videos, video metadata), we employed AI programs to automate the identification of the marchers, whether through text (names, nicknames) or through their faces in the videos.
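The abstract does not name the specific tools used; purely as an illustration of the textual side of such automation, a named-entity recognizer can surface person mentions in archived page text for matching against a reference list of marchers. A minimal sketch assuming spaCy's standard French model and a hypothetical reference list:

```python
import spacy

# Assumes the standard French model: python -m spacy download fr_core_news_md
nlp = spacy.load("fr_core_news_md")

# Hypothetical reference list of marchers' names and nicknames
MARCHERS = {"toumi djaïdja", "christian delorme"}

def marcher_mentions(page_text: str) -> set[str]:
    """Person entities in archived page text that match the reference list."""
    doc = nlp(page_text)
    persons = {ent.text.lower() for ent in doc.ents if ent.label_ == "PER"}
    return persons & MARCHERS
```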

Based on this case study, this paper will eschew the interpretation of the online media coverage of the march to concentrate on the methodological and hermeneutical questions raised by cultural biases when employing deep learning AI programs to analyze web data. It seeks to investigate under what conditions the application of AI programs to analyze archived web data can enhance the consideration of marginalized historical actors in the analysis of contemporary transmedia narratives.

Firstly, we will present the corpus and methodology used to study the media treatment of the marchers in the television and web archives of the Institut national de l’audiovisuel. Secondly, we will review the application of AI to these corpora, focusing on the significance of cultural biases in data processing through two examples: the thematization of text from HTML pages archived by the INA in 2013 and automated visual recognition in videos. Finally, we will consider the lessons learned from this experience and propose hermeneutic and ethical reflections for web historians confronted with hegemonic biases in the processing of web data.

14:40
Echoes of Dolly: isolating long-term political schemata by abstracting web archives as Zotero collections

ABSTRACT. Abstracting web archives as data at the document level and as metadata at the corpus level facilitates the systematic exploration of niche topics in massive collections. This paper shows how adapting a scientometrics software tool to web archives enabled both the abstraction of corpora of archived web pages as Zotero collections and a deeper document-level exploration, helping to unravel past and present controversies.

The live birth of Dolly, the first ever cloned large mammal, was a striking biotechnological performance which quickly became a commonplace of public life. What made it different from previous comparable events is that it happened in the age of the early internet, a social era that we can now partially access through web archives. This makes it a rare opportunity to evaluate the trajectory of very specific political schemata, namely those concerning what is politically at stake with developments in biotechnology.

Most of these schemata were defined and refined throughout the 20th century. Is progress in genome engineering the key to eternal life? An impending revival of Third Reich ideology? A sign that nothing would ever be sacred anymore? Every time a new technical milestone is reached, the sociopolitical imaginaries around eugenics, heredity, and what would constitute "fitness" in a person or a population are reactivated. Yet the debates over Dolly are peculiar because they were the first of that nature in the internet age. For the first time, netizens had the opportunity to discuss the announcements of scientists and provide their own perspectives on what the existence of that sheep would mean for the present and for the future.

This study focuses on mentions of Dolly in the French political debate during the 2002 presidential election. It was conducted by deploying a new methodology that abstracts extractions from a full-text indexed web archive as Zotero collections of documents. Using PANDORÆ, a software tool originally designed to perform scientometric analysis, the web archives are queried at the level of text content, abstracted as Zotero-compatible data to enable curation, and then re-imported and explored both at the corpus and at the document level through ad hoc data visualization algorithms. The exhaustive study of the relevant archived web pages shows that in the French political web of 2002, Dolly is constructed both as a creature symbolizing the hubris of humankind and as a symptom of new policy problems that policymakers are ill-equipped to handle.
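The pivotal step, abstracting full-text search hits as Zotero-compatible records, can be sketched as a simple mapping; the hit structure below is a hypothetical stand-in for the archive index's response, while the field names are standard Zotero "webpage" item fields:

```python
import json

def hit_to_zotero_item(hit: dict) -> dict:
    """Map a full-text search hit from a web archive index to a Zotero webpage item."""
    return {
        "itemType": "webpage",
        "title": hit.get("title", "Untitled archived page"),
        "url": hit.get("url"),
        "date": hit.get("crawl_date"),           # capture date of the snapshot
        "abstractNote": hit.get("snippet", ""),  # matched text, kept for curation
    }

# Hypothetical hit for a query such as "Dolly" restricted to 2002
hit = {"title": "Clonage : le débat", "url": "http://example.fr/dolly",
       "crawl_date": "2002-04-12", "snippet": "...la brebis Dolly..."}
print(json.dumps(hit_to_zotero_item(hit), ensure_ascii=False, indent=2))
```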

15:30-15:45Coffee Break
15:45-16:30 Session 12: My PhD in 5 Minutes
15:45
Before Web 2.0: A Cultural History of Early Web Practices in the Netherlands from 1994 until 2004

ABSTRACT. My PhD research aims to construct a cultural history of web practices in the Netherlands from 1994 to 2004, a pivotal period that predates the rise of Web 2.0. The project focuses on the transformative 1990s and early 2000s, an era of rapid evolution in internet culture that remains underexplored in media-historical scholarship (Verhoef, 2023, p. 1). Because this era is often overshadowed by the swift development of platforms and social media, the research seeks to fill a significant historiographical gap. By moving beyond conventional American-centric narratives in Internet Studies (Abbate, 2017), the project examines diverse contexts, particularly the role of amateur contributions, reinforcing the notion that "everyday people made the internet social" (Driscoll, 2022, p. 194) and highlighting the concept of the early "vernacular web" (cf. Howard, 2008).

Utilising an interdisciplinary methodology that integrates media history, digital humanities, web anthropology, and archaeology, the research unfolds in two phases. The first phase investigates the Netherlands' interpretation of the web, identifying dominant socio-technical imaginaries that illuminate both technological and societal developments. This analysis also explores how various actors leveraged the web to achieve broader social, economic, and political objectives. Complementing this historical narrative, a study of intellectual imaginaries draws from influential academic publications in Internet Studies, which have shaped local web initiatives.

In the second phase, the focus shifts to a bottom-up approach, utilising archived web collections to examine the practices of early adopters, particularly the XS4ALL homepages. Additionally, I aim to move beyond prominent initiatives, such as XS4ALL and DDS, by exploring other web localities in the northern Netherlands through oral history. By merging these two stages, the research addresses themes like small-scale web entrepreneurship and the creative practices of amateur users, while also engaging with critical archival studies through questions of digital heritage, canonisation, source criticism, and the ethics of working with personal archival materials.

15:50
Manifesting The Web: Network Imaginaries in Manifesto Writing Between the 1980s and the 2020s

ABSTRACT. "We are the mice living in the foundations of the Internet. If it needs doing, we do it ourselves. We voluntarily restrict our use of CPU, memory, disk space, and bandwidth. We prefer simple protocols like Gopher. We prefer simple formats like plain text." (Small Internet Manifesto, 2019)

This quote comes from one of the dozens of manifestos published online by tech movements in response to the increasing datafication and platformisation of the web. The manifesto has been an important genre in the history of computer networks for the last four decades. Despite the rhetorical inflation of internet myths, manifesto writing has been consistently present among tech activists and social movements. My dissertation focuses on the literary and epistemic history of this genre: manifestos written about the internet and published online. The main material is a corpus of web archives of 125 manifesto webpages.

The manifesto is one of the literary and rhetorical forms that have historically contributed to shaping what Paolo Bory (2020) calls "network imaginaries". Forms that shape network imaginaries have received considerable attention; they include metaphors (Wyatt, 2021; Markham, 2020), maps (Bory and Rikitianskaia, 2020), anecdotes (Natale, 2016), myths (Katz-Kimchi, 2015), and narratives of internet pioneers (Bory, Benecchi and Balbi, 2016). In contrast to those, internet manifestos are published on the web and make use of the formatting possibilities of webpages. This creates an interesting self-referentiality: the web materially influences the production of network imaginaries.

Drawing on the cultural history of the internet, (German) media studies, and electronic literature, I analyse changes in manifesto writing throughout internet history. The conference presentation will focus on manifestos that appeared as a direct reaction to the datafication of the web. In this respect, the talk speaks to the conference's interest in "histories of practices of resistance to datafication and platform economies".

15:55
Battlefield of Truth(s) on Investigative Frontlines: From Data Activism to OSINT Professionalism

ABSTRACT. Open Source Intelligence (OSINT), viewed through the lens of civic tech, covers various aspects of citizen engagement. OSINT practices interrogate datafication and democratic participation by instrumentalising data for good to create positive social impact (Daly, Devitt & Mann, 2019; Gutiérrez, 2022; Milan & van der Velden, 2016). I am particularly interested in the social impact and diffusion in application (cf. Rosenberg, 2013; Belghith, Venkatagiri & Luther, 2022) of intelligence practices during Russia's war against Ukraine (Brantly, 2024), rather than in the moment of invention or introduction of new tools and techniques; I thereby argue that human action shapes technology (Pinch & Bijker, 1984).

By integrating established academic research methods with the inventive strategies employed by OSINT collectives (Kazansky et al., 2019), my PhD project examines what role civilian OSINT practitioners play in producing OSINT-derived evidence through collaboration in a participative work environment. My research explores how OSINT hobbyists strengthen their analytical skills through training and build an OSINT profession. It also examines how a loosely organised, EU-based collective, my case study, evolves into a professionalised organisation with established hierarchies, a commitment to high standards of objectivity, accuracy and transparency as an oath of professionalisation, and stakeholder outreach strategies to produce larger volumes of actionable intelligence, faster.

Using collaborative digital team ethnography (Beneito-Montagut, Begueria & Cassián, 2017) in contemporary organisational settings (Akemu & Abdelnour, 2018), I explore the collection practices, organisational hierarchies and community strategies of the above-mentioned collective. My main research interest is how hierarchies and competition are overcome, but I also address the potential inequalities that may result: neglecting potential impacts on communities and individuals, disregarding the individual attribution and authorship of OSINT operatives by selling ready-made actionable intelligence reports to external stakeholders, and exploiting and amplifying the output of local actors and data producers to shape particular narratives about Russia's war against Ukraine (Müller and Wiik, 2023: 206).

16:30-16:45Coffee Break