Hacking the News: from digitised newspapers to the archived-web: an introductory workshop to text and data-mining
ABSTRACT. Libraries have been digitising historical newspapers in earnest since the early 2000s. Google also lent a hand for several years. The Europeana Newspapers project has been aggregating European efforts in this area since 2012. However, to what extent are these digitised newspaper archives being used in digital humanities research?
Within the last few years, a number of workshops have explored the use of digital tools and methods to analyse digital newspaper archives and data. Similarly, initiatives have been organised for the study of the archived web, and projects such as the Archives Unleashed Toolkit (AUT) and Cloud (AUK) have been developed in order to help researchers work with WARCs. Furthermore, a number of innovative projects are exploring the text and data mining of historical newspaper collections. However, the true potential of digital newspaper corpora and news in web archives is as yet underexploited. “Hacking the news: from digitised newspapers to the archived-web” is intended to help redress the balance.
This workshop builds on lessons learnt from the “Hacking the News” event organised in collaboration with the National Library of Finland at the 2018 DHN conference. Rather than focusing solely on the research use of web archives, our approach is to show how web archives can be combined with other digital resources in order to create a project around a specific topic such as news.
Primarily intended for, but not limited to, early career researchers, this workshop aims to provide a general introduction to a range of topics to consider when undertaking digital analysis of newspaper corpora or analysing news in web archives for historical research. The workshop is organised around four key themes: 1) Newspapers and their digitisation: how are digital newspaper corpora created? What legal issues need to be considered? 2) Newspapers and the archived web: this session will introduce workshop participants to the archived web for historical and digital humanities research, focusing in particular on online news. It will demonstrate how news on the archived web differs fundamentally from digitised newspapers (Brügger, 2016; Bødker and Brügger, 2017); 3) Defining and preparing your corpus: combining born-digital and digitised news; 4) Analysing your corpus: an introduction to text and data mining and the Archives Unleashed Toolkit.
Digging into the WARCs: Hands-on With the Archives Unleashed Toolkit
ABSTRACT. The Archives Unleashed Toolkit, or AUT, is an open-source platform for managing and analyzing web archives built on Apache Spark. With AUT, researchers are able to systematically track, visualize and analyze content within web archives as well as to analyze how change occurs over time within web archives. It is the result of a collaboration between computer scientists, librarians, and historians, who engaged in an iterative co-design process to build an analytics framework that is usable by humanities scholars and social scientists with no formal computer science training.
The FAAV Cycle
We introduce AUT primarily through the research process that we have developed to work with web archives. We call it the Filter-Analyze-Aggregate-Visualize (FAAV) cycle:
Filtering: focusing on a particular portion of the web archive, selected using both metadata and content. For example, the scholar is only interested in pages from 2012 from the Liberal Party that link to the Conservative Party and contain the phrase “Keystone XL”.
Analyzing: After selecting a subcollection of interest, the scholar would want to perform some type of analysis, which typically involves extracting some information of interest. For example, extracting links and associated anchor text.
Aggregating: Next, the scholar usually wishes to aggregate or summarize the output of the analysis from the previous step. For example, how often is a certain politician mentioned or a website linked to across a web archive.
Visualizing: Finally, the aggregate data are presented in some sort of visualization, which could be as simple as a table of results or as complex as requiring an external application. For example, using Gephi to show an interactive hyperlink network.
The workshop will be structured around these four steps, allowing researchers to use either our sample WARC files or their own and proceed iteratively through the cycle.
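The four steps can be sketched in plain Python. This is an illustration only and does not use AUT or its API: the in-memory `PAGES` records, their field names, the `.example` domains, and the helper functions are all hypothetical stand-ins for data that would normally be read from WARC files, and the regular expression is a simplification that only handles double-quoted `href` attributes.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical records standing in for pages read from WARC files.
# The field names (crawl_date, url, html) are assumptions for this sketch.
PAGES = [
    {"crawl_date": "20120315", "url": "http://liberal.example/news/1",
     "html": '<p>Keystone XL debate</p>'
             '<a href="http://conservative.example/a">their view</a>'},
    {"crawl_date": "20120702", "url": "http://liberal.example/news/2",
     "html": '<p>Keystone XL again</p>'
             '<a href="http://conservative.example/b">reply</a>'
             '<a href="http://greens.example/">greens</a>'},
    {"crawl_date": "20130110", "url": "http://liberal.example/news/3",
     "html": '<p>Budget</p>'
             '<a href="http://conservative.example/c">budget</a>'},
]

# Simplified link pattern: double-quoted href, anchor text up to </a>.
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)


def filter_pages(pages, year, phrase):
    """Filter: keep pages crawled in `year` whose content contains `phrase`."""
    return [p for p in pages
            if p["crawl_date"].startswith(year)
            and phrase.lower() in p["html"].lower()]


def analyze_links(pages):
    """Analyze: extract (source domain, target domain, anchor text) triples."""
    links = []
    for page in pages:
        source = urlparse(page["url"]).netloc
        for href, anchor in LINK_RE.findall(page["html"]):
            links.append((source, urlparse(href).netloc, anchor.strip()))
    return links


def aggregate(links):
    """Aggregate: count how often each source domain links to each target."""
    return Counter((source, target) for source, target, _ in links)


def visualize(counts):
    """Visualize: the simplest case, a plain-text table of link counts."""
    return "\n".join(f"{source} -> {target}: {n}"
                     for (source, target), n in counts.most_common())


if __name__ == "__main__":
    selected = filter_pages(PAGES, "2012", "Keystone XL")
    print(visualize(aggregate(analyze_links(selected))))
```

Each function maps onto one step, so a researcher can change the filter and rerun the later steps, which is the iterative use the cycle is meant to support.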
Workshop Rationale and Plan
Given the hands-on nature of AUT, and our desire to help researchers analyze web archive files themselves, we believe that a workshop format is ideal. We have run workshops on AUT before and bring a track record of engagement with the community.
In this workshop, aimed at researchers who have access to their own web archives, we will do the following:
Introduce the Archives Unleashed Toolkit (10-minute presentation);
Walk through the capabilities of the Toolkit (25-minute presentation):
Extracting basic statistics about the collection;
Learning how to extract plain text from the collection, based on filtering functions including date, keywords in webpages, and language;
Learning how to extract a hyperlink network from a collection and quickly visualize it using Gephi;
Divide into groups of no more than five or six people (10 minutes)
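As an illustration of the hyperlink-network step, the sketch below writes a weighted edge list as CSV with `Source`, `Target`, and `Weight` columns, a layout that Gephi's spreadsheet importer reads as an edge table. It is not AUT's own export routine: the function name `edge_list_csv` and the sample domain pairs are hypothetical, and the links are assumed to have been extracted already.

```python
import csv
import io
from collections import Counter


def edge_list_csv(links):
    """Return CSV text with one weighted edge per (source, target) pair.

    `links` is an iterable of (source_domain, target_domain) tuples;
    repeated pairs are collapsed into a single row with a Weight count.
    """
    counts = Counter(links)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(counts.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical link pairs, assumed to come from an earlier extraction.
    sample = [
        ("a.example", "b.example"),
        ("a.example", "b.example"),
        ("a.example", "c.example"),
    ]
    print(edge_list_csv(sample), end="")
```

Saving the returned text to a `.csv` file gives an edge table that can be loaded into Gephi for layout and exploration.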
Future Historians of the Internet: A Speculative Workshop
ABSTRACT. Current digital archives cover only a patchwork of websites, digital communication, and expressive culture. Professional organizations like the American Institute for Conservation and the Society of American Archivists have largely focused their digital preservation and conservation efforts on digital remediations of older expressive forms (such as music, video, and artwork) and their born-digital younger siblings. Everyday networked interactions, from email correspondence to GIF deployment to navigation patterns, have not received the same support. Attempts to preserve and maintain special collections that offer large-scale immersions into daily digital practices, such as the Library of Congress’s Twitter Archive, have not yet been successfully launched. And while sites like the Internet Archive capture static versions of a large but eclectic slice of the public internet, much has still fallen to link rot. Further, the proliferation of proprietary formats and networks has made it extremely difficult to maintain accessible, stable personal archives. There are already worrying omissions even in areas like government accountability, where a more comprehensive, longer-term approach might be expected. In the US, federal departments variously interpret the National Archives and Records Administration’s guidelines for the preservation of electronic media, at times deleting electronic materials after as little as three years. Underlying all of these choices are assumptions about what is important, what is collectable, and what future historians will want.
In this half-day workshop we flip this to ask: who are the future historians of the internet? From the printing press to the typewriter to the Web itself, the widespread adoption of new communication technologies has transformed roles, created new careers, and challenged gatekeepers. Who can do what changes. We argue that we need to expect this transformative trend to continue: as augmented reality, virtual reality, and artificial intelligence-powered automation and their various networked entanglements become even more firmly tied to our lives, who is a historian, and when and how they mobilize history, will change. In this workshop, we draw on speculative writing techniques to imagine future historians across scales and timeframes, writing everything from longue durée to microhistories, from the personal to the planetary. Through this focus on the people who trace, build, and solicit history we explore, too, future questions of internet history and future experiences of internet history.
Together, we will use speculative techniques to generate insights into the future that can guide present decisions. Our practice is simple: writing (or drawing) short responses to speculative prompts, sharing and analyzing our responses collaboratively, and iterating. During the workshop, we expect to complete 3-5 cycles. At the end of the workshop we will compile examples of imagined futures and the insights they have generated to be published in an accessible article or essay. Workshop participants do not need previous speculative writing experience and may choose not to include their responses in the workshop publication if they prefer.