RESAW 2025: THE DATAFIED WEB
PROGRAM FOR WEDNESDAY, JUNE 4TH

13:00-15:00 Session 1: Pre-conference 2
Towards an “Algorithmic Archive”: Developing Collaborative Approaches to Persistent Social and Algorithmic Data Services for Researchers

ABSTRACT. Proposed Duration: 120 minutes (the duration of the workshop could be adjusted to accommodate available time and conference organisation needs)

Social media platforms have become a fundamental means of communication, shaping the contemporary understanding of human behaviours, health and political crises as well as documenting historical events (Simon, 2012; van Dijck, 2011). However, social platforms are private organisations that impose strict limitations on access to, and the preservation of, data (Bruns, 2019; Thomson, 2016). Despite the importance of this data for research and as cultural heritage material, only a few memory institutions consistently archive it, posing a significant risk to its long-term availability. The Algorithmic Archive project is part of the Bodleian Libraries’ broader strategy to further unlock the potential of the existing born-digital collections. It seeks to develop a sustainable strategy for creating a persistent social and algorithmic data archive that can support research efforts in a wide range of disciplines. Members of the project will moderate the session.

The workshop will be divided into two corresponding sessions, each beginning with a brief introduction (5 minutes) outlining the session’s aims and the tasks for the audience, as follows:

1. Use Case Presentations and Breakout Discussions (60 minutes)

- Participants will be invited to share their experiences using social media and algorithmic data in their projects, highlighting research questions addressed, methodologies employed, and challenges encountered.

- Participants will then break into small groups to discuss specific themes, such as data access, tool reliability, data and metadata structures, and interdisciplinary approaches, fostering a collaborative environment for knowledge exchange. A set of questions and topics for discussions will be provided.

2. Building A Sustainable Infrastructure for A Persistent Social and Algorithmic Data Service (60 minutes)

- A guided session to gather participants’ insights on key aspects of the development of social data archive services, including issues and expectations surrounding short- and long-term access. This session will also offer the opportunity to identify potential partnerships for the development of standards to preserve and access social data.

The workshop will conclude with a brief (5 minutes) session to summarise key insights, outline action points, discuss how memory institutions can support researchers’ needs, and identify potential partners for the development of shared standards for the collection of social and algorithmic data.

This workshop welcomes insights and perspectives from researchers, data scientists, archivists, librarians, and anyone interested in the research implications of social media and algorithmic data. By bringing together these diverse perspectives, the workshop aims to foster discussions and partnerships to develop sustainable strategies for collecting data from social media platforms, and ultimately benefit both scholarship and society.

14:00-15:00 Session 2: Pre-conference 1
Demonstration of BelgicaWeb: Sustaining Access to Belgium’s Born-Digital Heritage

ABSTRACT. BelgicaWeb is an innovative web archiving project to preserve and provide sustainable access to Belgium's born-digital heritage, including websites and social media content. BelgicaWeb is a BRAIN-be 2.0 project funded by BELSPO (the Belgian Science Policy Office). The BelgicaWeb project brings together partners with different expertise. KBR (Royal Library of Belgium) is the project coordinator, CRIDS from the University of Namur provides expertise on the relevant legal frameworks and IDLab, GhentCDH and imec-mict-ugent of Ghent University work on data enrichment, user engagement and evaluation and outreach to the research community, respectively.

This session comprises a demo and a more interactive component. The demo will showcase the features of a user-friendly interface to KBR’s web archived content and of an API, both of which are being developed within the project. Both are optimised for archived websites and social media, enabling researchers and the public to explore these collections. During this demonstration, key features will be highlighted, including full-text search, multilingual functionalities, and data-level access through a robust API designed for big data analyses and digital humanities research. This demonstration aims to engage both technical and non-technical audiences.

The interactive part of the session is aimed at fine-tuning the results we obtained through a survey on user requirements that ran last year. By means of a live poll, participants will be invited to indicate preferences and share their opinions about certain features of the access platform and API, thereby directly influencing future technical developments within the BelgicaWeb project.

The dual approach of a demo and a live poll will provide the possibility to exchange best practices with researchers working with archived web material, thereby providing additional useful insights for the BelgicaWeb project.

15:00-15:30 Coffee Break
15:30-17:30 Session 3A: Pre-Conference 3
Mentorship for Early Career Scholars in Web Archive Studies

ABSTRACT. This session aims to create a space for open discussion and networking for early career scholars (PhD students and postdoctoral researchers). Organized by five advanced scholars with strong expertise in web archives, this 2-hour session will focus first on the role and place of web archives in research, including how case studies, close and distant reading, and different tools and methods may be used, refined, and presented in research. Ethical and legal issues will also be addressed, based on concrete needs (copyright, anonymization of research results, FAIR Data, and so on). The session may also move on to more general questions related to academic careers, keeping firmly in mind the research areas of participating scholars in web studies. The topics to be considered may include: opportunities for funding; relevant journals and strategies for publication; avenues for promoting and disseminating research to both academic and public audiences, including social networks, the main conferences related to web archives, and other forums; participation of the general public; and issues of inclusivity and diversity facing researchers in the field.

Collectively, the organizers of this session combine a wide range of disciplinary and professional perspectives, knowledge of different national and international contexts, and diverse skillsets and expertise in web archives. They have also had varying career trajectories, and in particular have come to working with web archives via different routes and through different experiences, partnerships and topics.

The session will focus strongly on the needs of early career scholars and adapt to their requests, in order to align closely with the challenges that they may face in our area. With this second session of mentorship (the first one was organized in Marseille for RESAW23), we hope to establish a regular session at the RESAW conferences. It may also facilitate the development of a peer network among the attendees, who may develop their research and build their professional networks alongside each other.

15:30-17:30 Session 3B: Pre-Conference 4
Empowering Data-Driven Research Through Digital Archives with Internet Archive’s ARCH

ABSTRACT. In this comprehensive 2-hour session, we will explore and discuss the latest advancements and innovations of the Internet Archive's ARCH platform.

ARCH (Archives Research Compute Hub) is a cutting-edge platform engineered to facilitate the building of research collections, enable computational analysis, and support the generation of datasets from terabytes and even petabytes of data. ARCH supports the open publication and preservation of user-generated datasets created from thousands of libraries, archives, and memory organizations worldwide, empowering researchers, students, and information professionals to study, analyze, and interpret digital collections in unprecedented ways.

Designed with a focus on curating research collections using primary digital sources such as web pages, texts, and images, ARCH enables users to effortlessly create over a dozen distinct datasets from these sources with a simple click. These datasets can be directly downloaded either through an in-browser interface or via an API, enhancing accessibility and user experience.

Moreover, ARCH facilitates the efficient use of these research-ready datasets by offering in-browser data previews and visualizations. More interactive analysis is supported through integration with computational tools such as Jupyter Notebooks, Google Colab, Gephi, and Voyant.

A significant feature of ARCH is its one-click publication mechanism on archive.org, allowing datasets to be easily accessed, shared, and preserved indefinitely. This feature not only promotes open access to information but also ensures the long-term preservation of valuable data.

To support and enhance user experience, ARCH provides comprehensive technical support, online training, and extensive help center documentation. These resources are designed to optimize the effective use of the platform, making sophisticated research processes more accessible to users who may not have advanced coding or scripting skills.

ARCH benefits from the robust, non-profit infrastructure of the Internet Archive and utilizes open-source tools to streamline the computational handling of digital collections. This enables librarians, collection managers, and educators to offer sophisticated research tools to their communities, thereby democratizing access to advanced research methodologies.

Recently, ARCH has integrated AI-powered tools that enhance the platform's capabilities. These tools are readily accessible on our dedicated computing cluster, equipped with GPU support, making advanced computational tasks more feasible for our users.

ARCH is available for both institutional and individual use, offering flexible access options for a diverse range of professionals including researchers, librarians, archivists, museum staff, journalists, and more.

This session provides a comprehensive overview of ARCH’s features, and we will also delve deeper into the technical details and underlying technologies. It will feature a combination of presentations, brief demonstrations, and interactive live sessions. Participants will have the opportunity to engage with the tools interactively, ask questions, and view actual datasets, making this an informative experience that offers participants a clear view into how the ARCH platform can enhance their research capabilities.

15:30-17:30 Session 3C: Pre-Conference 5
Qualitative Digital Methods Workshop: Mapping the Evolution of (AI) Content Generation Infrastructures

ABSTRACT. This workshop will explore YouTube (YT) as a source for recent historical data. Working with a case study exploring the evolution of "AI" tool ecosystems and infrastructures since 2022, we will walk participants through strategies and skills for working with YT data, including API tools, data cleaning, and methods and source criticism.

Conceptually, this workshop will explore why and when one might choose to use YT data, how to conceptualize and map "AI" as a sociotechnical phenomenon, quanti-qualitative approaches to historical research, and media research with sensitivity to chronology and periodization. Concretely, we will offer opportunities to build skills with specific tools and methods, including the YouTube API, periodization and query design, and natural language processing (NLP) for born-digital textual data.
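As an illustration of what query design with periodization can look like in practice, the following minimal sketch splits a date range into monthly windows and builds one set of search parameters per window. It is not the workshop's own material. The parameter names (`part`, `type`, `q`, `publishedAfter`, `publishedBefore`, `maxResults`, `order`) follow the YouTube Data API v3 `search.list` method; the function names, the monthly granularity, and the example query are assumptions made for illustration. No request is sent, and an API key would be needed to run the queries.

```python
from datetime import datetime, timezone


def month_windows(start_year, start_month, end_year, end_month):
    """Yield (start, end) UTC datetimes for each calendar month, inclusive."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        # The end of a window is the first instant of the following month.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield (
            datetime(year, month, 1, tzinfo=timezone.utc),
            datetime(next_year, next_month, 1, tzinfo=timezone.utc),
        )
        year, month = next_year, next_month


def build_search_params(query, start, end, max_results=50):
    """Build a parameter dict for one periodized search (hypothetical helper)."""
    rfc3339 = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format expected by the API
    return {
        "part": "snippet",
        "type": "video",
        "q": query,
        "publishedAfter": start.strftime(rfc3339),
        "publishedBefore": end.strftime(rfc3339),
        "maxResults": max_results,  # search.list accepts at most 50 per page
        "order": "date",
    }


def periodized_queries(query, start_year, start_month, end_year, end_month):
    """Return one parameter dict per month for the same query string."""
    return [
        build_search_params(query, start, end)
        for start, end in month_windows(start_year, start_month, end_year, end_month)
    ]


if __name__ == "__main__":
    # Illustrative query only; the workshop's actual queries are not given.
    for params in periodized_queries("AI image generator", 2022, 11, 2023, 2):
        print(params["publishedAfter"], "->", params["publishedBefore"])
```

Issuing the same query once per period, rather than once over the whole range, keeps each period's results separate and reduces the chance that recent uploads crowd out older ones in a capped result list.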

The workshop is designed to be an informal choose-your-own-adventure, where participants can follow at pace to get a quick overview of an entire workflow for building chronologically structured data visualizations based on textual data from YT, or break off to linger on one or two skills that they'd like to focus on in particular. For example, participants could choose to spend their time exploring query design with periodization (e.g., if you're new to digital methods) or comparing different NLP approaches to building visualizations for distant reading of historical data (e.g., if you're already familiar with digital methods). Participants can choose to build and explore their own datasets or work with pre-processed data to practice applying more advanced techniques. We have prepared datasets of 5k-50k YT datapoints, with varying degrees of filtering and pre-processing, including spaCy and LLM-driven named entity recognition (NER) on both metadata and transcription data.
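To give a rough sense of the step between textual YT data and a chronologically structured visualization, the sketch below counts how many video titles per month mention each term in a watch list. It is an assumption-laden stand-in, not the workshop's pipeline: plain case-insensitive substring matching replaces the spaCy and LLM-driven NER mentioned above, and the record fields (`published_at`, `title`) and sample terms are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def mentions_by_month(records, terms):
    """Count, per calendar month, how many titles mention each term.

    records: iterable of dicts with hypothetical keys "published_at"
             (e.g. "2022-12-05T10:00:00Z") and "title".
    terms:   list of strings to look for (case-insensitive substring match,
             used here as a simple stand-in for named entity recognition).
    """
    counts = defaultdict(Counter)
    for record in records:
        published = datetime.strptime(record["published_at"], "%Y-%m-%dT%H:%M:%SZ")
        period = published.strftime("%Y-%m")  # bucket label, e.g. "2022-12"
        title = record["title"].lower()
        for term in terms:
            if term.lower() in title:
                counts[period][term] += 1
    # Return plain dicts, with periods in chronological order.
    return {period: dict(counts[period]) for period in sorted(counts)}


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    sample = [
        {"published_at": "2022-12-05T10:00:00Z", "title": "ChatGPT tutorial for beginners"},
        {"published_at": "2023-01-02T12:00:00Z", "title": "Midjourney prompt tips"},
    ]
    print(mentions_by_month(sample, ["ChatGPT", "Midjourney"]))
```

The resulting period-by-term table can be opened in a spreadsheet or passed to a plotting tool; months with no matches are simply absent from the output.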

This is an informal, hands-on workshop where we will be trying things out and working through challenges and questions together. Participants will need a reliable laptop and some enthusiasm for working with spreadsheets. Bring lots of questions and curiosity to share with the whole group. Participants are welcome to work together or alone on prepared worksheets.