Demonstration of BelgicaWeb: Sustaining Access to Belgium’s Born-Digital Heritage
ABSTRACT. BelgicaWeb is an innovative web archiving project to preserve and provide sustainable access to Belgium's born-digital heritage, including websites and social media content. BelgicaWeb is a BRAIN-be 2.0 project funded by BELSPO (the Belgian Science Policy Office). The BelgicaWeb project brings together partners with different expertise. KBR (Royal Library of Belgium) is the project coordinator, CRIDS from the University of Namur provides expertise on the relevant legal frameworks and IDLab, GhentCDH and imec-mict-ugent of Ghent University work on data enrichment, user engagement and evaluation and outreach to the research community, respectively.
This demo will showcase the features of a user-friendly interface to KBR’s web archived content and API that are being developed within the project. Both are optimised for archived websites and social media, enabling researchers and the public to explore these collections in novel ways. By enriching metadata through techniques like Natural Language Processing and Linked Open Data, the project provides advanced search and data interaction capabilities.
The BelgicaWeb platform addresses the challenges of ephemeral born-digital content by creating new collections of archived web content, aggregating existing (meta)data, and ensuring that these collections are Findable, Accessible, Interoperable, and Reusable (FAIR). During this demonstration, we will highlight key features, including full-text search, multilingual functionalities, and data-level access through a robust API designed for big data analyses and digital humanities research.
This demonstration aims to engage both technical and non-technical audiences, providing insights into the development of the access platform and API. The possibility to exchange best practices with researchers working with archived web material during the demo can provide additional useful insights for the BelgicaWeb project and is therefore an added value.
Towards an “Algorithmic Archive”: Developing Collaborative Approaches to Persistent Social and Algorithmic Data Services for Researchers
ABSTRACT. Proposed Duration: 120 minutes (the duration of the workshop could be adjusted to accommodate available time and conference organisation needs)
Social media platforms have become a fundamental means of communication, shaping the contemporary understanding of human behaviours, health and political crises as well as documenting historical events (Simon, 2012; van Dijck, 2011). However, social platforms are private organisations that impose strict limitations on access to and preservation of data (Bruns, 2019; Thomson, 2016). Despite the importance of these data for research purposes and as cultural heritage material, only a few memory institutions consistently archive this important source of information, which poses a significant risk to its long-term availability. The Algorithmic Archive project is part of the Bodleian Libraries’ broader strategy to further unlock the potential of the existing born-digital collections. The project seeks to develop a sustainable strategy to create a persistent social and algorithmic data archive that can support research efforts in a wide range of disciplines. Members of the project will moderate the session.
The workshop will be divided into two sessions, each beginning with a brief introduction (5 minutes) outlining the session’s aims and the tasks for the audience, as follows:
1. Use Case Presentations and Breakout Discussions (60 minutes)
- Participants will be invited to share their experiences using social media and algorithmic data in their projects, highlighting research questions addressed, methodologies employed, and challenges encountered.
- Participants will then break into small groups to discuss specific themes, such as data access, tool reliability, data and metadata structures, and interdisciplinary approaches, fostering a collaborative environment for knowledge exchange. A set of questions and topics for discussions will be provided.
2. Building A Sustainable Infrastructure for A Persistent Social and Algorithmic Data Service (60 minutes)
- A guided session to gather participants’ insights on key aspects of developing social data archive services, including issues and expectations surrounding short- and long-term access. This session will also offer the opportunity to identify potential partnerships for the development of standards to preserve and access social data.
The workshop will conclude with a brief (5 minutes) session to summarise key insights, outline action points, and discuss how memory institutions can support researchers’ needs, as well as to identify potential partners for the development of shared standards for the collection of social and algorithmic data.
This workshop welcomes insights and perspectives from researchers, data scientists, archivists, librarians, and anyone interested in the research implications of social media and algorithmic data. By bringing together these diverse perspectives, the workshop aims to foster discussions and partnerships to develop sustainable strategies for collecting data from social media platforms, and ultimately benefit both scholarship and society.
Mentorship for Early Career Scholars in Web Archive Studies
ABSTRACT. This session aims to create a space for open discussion and networking for early career scholars (PhD students and postdoctoral researchers). Organized by five advanced scholars with strong expertise in web archives, this 2-hour session will focus first on the role and place of web archives in research, including how case studies, close and distant reading, and different tools and methods may be used, refined, and presented in research. Ethical and legal issues will also be addressed, based on concrete needs (copyright, anonymization of research results, FAIR Data, and so on). The session may also move on to more general questions related to academic careers, keeping firmly in mind the research areas of participating scholars in web studies. The topics to be considered may include: opportunities for funding; relevant journals and strategies for publication; avenues for promoting and disseminating research to both academic and public audiences and for involving the general public, including social networks, the main conferences related to web archives, and other forums; and issues of inclusivity and diversity facing researchers in the field.
Collectively, the organizers of this session combine a wide range of disciplinary and professional perspectives, knowledge of different national and international contexts, and diverse skillsets and expertise in web archives. They have also had varying career trajectories, and in particular have come to working with web archives via different routes and through different experiences, partnerships and topics.
The session will focus strongly on the needs of early career scholars and adapt to their requests, in order to align closely with the challenges that they may face in our area. With this second session of mentorship (the first one was organized in Marseille for RESAW23), we hope to establish a regular session at the RESAW conferences. It may also facilitate the development of a peer network among the attendees, who may develop their research and build their professional networks alongside each other.
Empowering Data-Driven Research Through Digital Archives with Internet Archive’s ARCH
ABSTRACT. In this comprehensive 2-hour session, we will explore and discuss the latest advancements and innovations of the Internet Archive's ARCH platform.
ARCH (Archives Research Compute Hub) is a cutting-edge platform engineered to facilitate the building of research collections, enable computational analysis, and support the generation of datasets from terabytes and even petabytes of data. ARCH supports the open publication and preservation of user-generated datasets created from thousands of libraries, archives, and memory organizations worldwide, empowering researchers, students, and information professionals to study, analyze, and interpret digital collections in unprecedented ways.
Designed with a focus on curating research collections using primary digital sources such as web pages, texts, and images, ARCH enables users to effortlessly create over a dozen distinct datasets from these sources with a simple click. These datasets can be directly downloaded either through an in-browser interface or via an API, enhancing accessibility and user experience.
Moreover, ARCH facilitates the efficient utilization of these research-ready datasets by offering in-browser data previews and visualizations. More interactive analysis is encouraged and supported by enabling the integration of computational tools such as Jupyter Notebooks, Google Colab, Gephi, and Voyant into the research process.
A significant feature of ARCH is its one-click publication mechanism on archive.org, allowing datasets to be easily accessed, shared, and preserved indefinitely. This feature not only promotes open access to information but also ensures the long-term preservation of valuable data.
To support and enhance user experience, ARCH provides comprehensive technical support, online training, and extensive help center documentation. These resources are designed to optimize the effective use of the platform, making sophisticated research processes more accessible to users who may not have advanced coding or scripting skills.
ARCH benefits from the robust, non-profit infrastructure of the Internet Archive and utilizes open-source tools to streamline the computational handling of digital collections. This enables librarians, collection managers, and educators to offer sophisticated research tools to their communities, thereby democratizing access to advanced research methodologies.
Recently, ARCH has integrated AI-powered tools that enhance the platform's capabilities. These tools are readily accessible on our dedicated computing cluster, equipped with GPU support, making advanced computational tasks more feasible for our users.
ARCH is available for both institutional and individual use, offering flexible access options for a diverse range of professionals including researchers, librarians, archivists, museum staff, journalists, and more.
This session will provide a comprehensive overview of ARCH’s features and will also delve deeper into the technical details and underlying technologies. It will feature a combination of presentations, brief demonstrations, and interactive live sessions. Participants will have the opportunity to engage with the tools interactively, ask questions, and view actual datasets, making this an informative experience that offers participants a clear view into how the ARCH platform can enhance their research capabilities.