DOCENG 2015: 15TH ACM SIGWEB INTERNATIONAL SYMPOSIUM ON DOCUMENT ENGINEERING
PROGRAM FOR THURSDAY, SEPTEMBER 10TH

09:00-10:00 Session 8: Keynote II: The Venice Time Machine

Frédéric Kaplan

09:00
The Venice Time Machine

ABSTRACT. The Venice Time Machine is an international scientific programme launched by the EPFL and the University Ca’ Foscari of Venice with the generous support of the Fondation Lombard Odier. It aims at building a multidimensional model of Venice and its evolution covering a period of more than 1000 years. The project's ambition is to construct a large open-access database that can be used for research and education. Thanks to a partnership with the Archivio di Stato in Venice, kilometers of archives are currently being digitized, transcribed and indexed, laying the foundation of the largest database ever created on Venetian documents. The State Archives of Venice contain a massive amount of hand-written documentation in languages evolving from medieval times to the 20th century. An estimated 80 km of shelves are filled with over a thousand years of administrative documents, from birth registrations, death certificates and tax statements all the way to maps and urban planning designs. These documents are often very delicate and occasionally in a fragile state of conservation. Complementing these primary sources, the content of thousands of monographs has been indexed and made searchable.

The documents digitised in the Venice Time Machine programme are intricately interwoven, telling a much richer story when they are cross-referenced. By combining this mass of information, it is possible to reconstruct large segments of the city’s past: complete biographies, political dynamics, or even the appearance of buildings and entire neighborhoods. The information extracted from the primary and secondary sources is organized in a semantic graph of linked data and unfolded in space and time in a historical geographical information system. The resulting platform can serve both research and education. About a hundred researchers and students already collaborate on this programme. A doctoral school is organised every year in Venice, and several bachelor and master courses currently use the data produced in the context of the Venice Time Machine. Through all these initiatives, the Venice Time Machine explores how “big data of the past” can change research and education in the historical sciences, hopefully paving the way towards a general methodology that could be applied to many other cities and archives.

10:00-10:30 Coffee Break
10:30-11:45 Session 9: Documents Made Accessible
10:30
Towards Mobile OCR: How To Take a Good Picture of a Document Without Sight
SPEAKER: unknown

ABSTRACT. The advent of mobile OCR (optical character recognition) applications on regular smartphones holds great promise for enabling blind people to access printed information. Unfortunately, these systems suffer from a problem: for OCR output to be meaningful, a well-framed image of the document needs to be taken, something that is difficult to do without sight. This contribution presents an experimental investigation of how blind people position and orient a camera phone while acquiring document images. We developed experimental software to investigate whether verbal guidance aids the acquisition of OCR-readable images without sight. We report on our participants' feedback and performance before and after assistance from our software.

11:00
MSoS: A Multi-Screen-Oriented Web Page Segmentation Approach
SPEAKER: unknown

ABSTRACT. In this paper we describe a multiscreen-oriented approach for segmenting web pages. The segmentation is an automatic, hybrid visual and structural method. It aims at creating coherent blocks whose different functions are determined by the multiscreen environment. It is also characterized by a dynamic adaptation to the page content. Experiments are conducted on a set of existing applications that contain multimedia elements, in particular YouTube and video player pages, and the results are evaluated against one segmentation method from the literature and against a manually created ground truth. With 81% precision, MSoS is a promising method capable of producing good results.
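
The paper's precise algorithm is not reproduced here. As a rough illustration of the flavor of hybrid visual-plus-structural segmentation, the sketch below groups pre-rendered DOM blocks into segments by merging vertically adjacent blocks that share a structural parent; the Block type, the node-path convention and the 20-pixel gap threshold are all our own assumptions, not details from the paper.

    from dataclasses import dataclass

    @dataclass
    class Block:
        node_path: str   # structural position in the DOM, e.g. "body/div[2]/section[1]"
        top: float       # rendered bounding-box edges, in pixels
        bottom: float

    def parent(path):
        return path.rsplit("/", 1)[0]

    def segment(blocks, max_gap=20.0):
        """Merge vertically adjacent blocks that share a DOM parent."""
        segments = []
        for b in sorted(blocks, key=lambda b: b.top):
            if segments:
                prev = segments[-1][-1]
                if parent(prev.node_path) == parent(b.node_path) and b.top - prev.bottom <= max_gap:
                    segments[-1].append(b)
                    continue
            segments.append([b])
        return segments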

11:15
Creating eBooks with Accessible Graphics Content
SPEAKER: unknown

ABSTRACT. We present a new model for presenting graphics in eBooks to blind readers. It is based on the GraViewer app which allows an accessible graphic embedded in an iBook to be explored on an iPad using speech and non-speech audio feedback. We introduce a web-based tool, GraAuthor, for creating such accessible graphics and describe the workflow for including these in an iBook. Unlike previous approaches our model provides an integrated digital presentation of both text and graphics and allows the general public to create accessible graphics.

11:30
Investigation of Ancient Manuscripts based on Multispectral Imaging
SPEAKER: unknown

ABSTRACT. This work is concerned with the digitization and analysis of historical documents. The investigation of the documents has been conducted in three successive interdisciplinary projects. The team involved in the projects consists of philologists, chemists and computer scientists specialized in the field of digital image processing. The manuscripts investigated are partially degraded: they have been infected by mold, are corrupted by background clutter, or contain faded-out or even erased writings. Since these degradations impede transcription by scholars and worsen the performance of automated document image analysis techniques, the documents have been imaged with a portable multispectral imaging system. By using this non-invasive investigation technique, the contrast of the faded-out characters can be increased compared to ordinary white-light illumination. Post-processing techniques, such as dimension-reduction tools, can be used to further increase legibility. The resulting images are used as a basis for further document analysis methods. These methods have been especially designed for the historical documents investigated and involve Optical Character Recognition and writer identification. This paper presents an overview of selected methods developed in the projects.
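
The abstract mentions dimension-reduction post-processing without naming a specific tool; principal component analysis is one common choice for multispectral document images, so the following numpy sketch is offered purely as an assumed illustration, not as the authors' pipeline. Each pixel's spectrum across the B bands is treated as one observation, and the strongest components tend to concentrate the ink/substrate contrast.

    import numpy as np

    def pca_bands(cube):
        """Project a multispectral cube (H, W, B) onto its principal components.

        The first few component images often show faded writing with higher
        contrast than any single band under white-light illumination.
        """
        h, w, b = cube.shape
        x = cube.reshape(-1, b).astype(np.float64)
        x -= x.mean(axis=0)                      # center each band
        cov = np.cov(x, rowvar=False)            # B x B band covariance
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]        # strongest components first
        comps = x @ eigvecs[:, order]
        return comps.reshape(h, w, b)            # component images, (H, W, B)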

12:15-14:00 Lunch
14:00-16:00 Session 11: Scholarly Papers Analysis and Authoring
14:00
Similarity-Based Support for Text Reuse in Technical Writing
SPEAKER: unknown

ABSTRACT. Technical writing in professional environments, such as user manual authoring for new products, is a task that relies heavily on reuse of content. Therefore, technical content is typically created following a strategy where modular units of text reference each other. One of the main challenges faced by technical authors is to avoid duplicating existing content, as this adds unnecessary effort, generates undesirable inconsistencies, and dramatically increases maintenance and translation costs. However, there are few computational tools available to support this activity. This paper presents an exploratory study on the use of different similarity methods for the task of identifying reuse opportunities in technical writing. We evaluated our results using existing ground truth as well as feedback from technical authors. Finally, we also propose a tool that combines text similarity algorithms with interactive visualizations to aid authors in understanding differences in a collection of topics and identifying reuse opportunities.
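
The abstract does not list the specific similarity methods compared; as one representative baseline of the kind such a tool might use (our illustration, not necessarily the authors' method), cosine similarity over TF-IDF vectors can flag topic pairs as reuse candidates.

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """Build TF-IDF weight dicts for a list of tokenized documents."""
        n = len(docs)
        df = Counter(t for doc in docs for t in set(doc))
        return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
                for doc in docs]

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

Topic pairs scoring above a tuned threshold would then be surfaced to the author as potential duplicates.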

14:30
Exploring scholarly papers through citations
SPEAKER: unknown

ABSTRACT. Bibliographies are fundamental components of academic papers, and both scientific research and its evaluation are organized around the correct examination and classification of scientific bibliographies. Currently, most digital libraries publish bibliographic information about their content for free, and many include the bibliographies (outgoing and, in some cases, even incoming) of the papers they manage. Unfortunately, little sophistication is applied to these lists: they are monolithic pieces of text in which it is difficult even to automatically tell the authors apart from the title or publication details, and users are provided with no mechanism to filter citations or access the full context of each citation. For instance, there is no way to know in which sentence a work was cited (the citation context) or why (the citation function). In this paper we introduce a novel environment for navigating, filtering and making sense of citations. The interface, called BEX, exploits data freely available in a Linked Open Dataset about scholarly papers; end-user testing demonstrated its efficacy and usability.
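
BEX itself consumes linked open data; purely to make the notion of a "citation context" concrete, the simplified sketch below (our own, not the BEX pipeline) extracts the sentence surrounding each numeric citation marker from a paper's plain text.

    import re

    def citation_contexts(text):
        """Map each citation marker like [12] to the sentences containing it."""
        contexts = {}
        # Naive sentence split; a real pipeline would use a proper tokenizer.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            for marker in re.findall(r"\[(\d+)\]", sentence):
                contexts.setdefault(marker, []).append(sentence)
        return contexts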

15:00
Filling the Gaps: Improving Wikipedia Stubs
SPEAKER: unknown

ABSTRACT. With only a limited number of contributors, Wikipedia cannot ensure consistent growth and improvement of the online encyclopedia. With information scattered across the web, our goal is to automate the generation of content for Wikipedia. In this work, we propose a technique for improving stubs on Wikipedia that do not contain comprehensive information. A classifier learns features from the existing comprehensive articles on Wikipedia and recommends content that can be added to stubs to improve their completeness. We conduct experiments using several classifiers: a Latent Dirichlet Allocation (LDA) based model, a deep-learning architecture (deep belief network) and a TF-IDF based classifier. Our experiments reveal that the LDA-based model outperforms the other models (by ~6% F-score). Our generation approach shows that this technique is capable of generating comprehensive articles. The ROUGE-2 scores of the articles generated by our system exceed those of the articles generated using the baseline. Content generated by our system has been appended to several stubs and successfully retained on Wikipedia.
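
ROUGE-2 measures bigram overlap between a generated article and a reference text; a minimal recall-oriented version of the metric (our simplification, omitting stemming and multi-reference handling) looks like this.

    from collections import Counter

    def rouge2_recall(candidate, reference):
        """Fraction of the reference's bigrams that also appear in the candidate."""
        def bigrams(tokens):
            return Counter(zip(tokens, tokens[1:]))
        cand, ref = bigrams(candidate.split()), bigrams(reference.split())
        overlap = sum(min(c, ref[b]) for b, c in cand.items() if b in ref)
        total = sum(ref.values())
        return overlap / total if total else 0.0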

15:15
BBookX: An Automatic Book Creation Framework
SPEAKER: unknown

ABSTRACT. As more educational resources become available online, it is possible to acquire more up-to-date knowledge and information. However, there has not been a tool that can automatically retrieve and organize these open resources for educational purposes. This paper introduces BBookX, a novel computer-facilitated system that automatically builds free open online books using publicly available educational resources such as Wikipedia. BBookX has two separate components: one that creates an open version of existing books by linking book chapters to Wikipedia articles, and another that supports interactive, real-time book creation through a user interface, during which users can modify a generated book with explicit feedback that is used to improve the ranking of the returned educational resources.
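
The abstract does not say how explicit feedback reshapes the ranking; one classical possibility is Rocchio-style query refinement over term-weight vectors, sketched below under that assumption (the function name and coefficients are ours, not from the paper).

    def rocchio(query_vec, liked, disliked, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query vector toward liked results and away from disliked ones.

        All vectors are {term: weight} dicts, e.g. TF-IDF weights.
        """
        terms = set(query_vec)
        for v in liked + disliked:
            terms |= set(v)
        refined = {}
        for t in terms:
            pos = sum(v.get(t, 0.0) for v in liked) / len(liked) if liked else 0.0
            neg = sum(v.get(t, 0.0) for v in disliked) / len(disliked) if disliked else 0.0
            w = alpha * query_vec.get(t, 0.0) + beta * pos - gamma * neg
            if w > 0:
                refined[t] = w   # keep only positively weighted terms
        return refined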

15:30
VEDD: A Visual Editor for Creation and Semi-Automatic Update of Derived Documents
SPEAKER: unknown

ABSTRACT. Document content is increasingly customised to a particular audience. Such customised documents are typically built by combining content from selected logical content modules and then editing this to create the custom document. A major difficulty is how to efficiently update these derived documents when the source documents are changed. Here we describe a web-based visual editing tool for both creating and semi-automatically updating derived documents from modules in a source library.

15:45
Madoko: Scholarly Documents for the Web
SPEAKER: Daan Leijen

ABSTRACT. Madoko is a novel authoring system for writing complex documents. The main design goal of Madoko is to enable lightweight creation of high-quality scholarly and industrial documents for the web and print, while maintaining John Gruber's Markdown philosophy of simplicity and focus on plain-text readability. In particular, it overcomes limitations of LaTeX and uses standard CSS to create both paginated PDF and rescalable, reflowable HTML.

16:00-16:30 Coffee Break
16:30-17:30 Session 12: Posters

Short Papers Presented as Posters:

  • Fine Grained Access Interactive Personal Health Records. Helen Balinsky (HP Laboratories), Nassir Mohammad (HP)
  • Does a Split-View Aid Navigation Within Academic Documents? Juliane Franze (Monash University and Fraunhofer), Kim Marriott (Monash University), Michael Wybrow (Monash University)
  • An Approach for Designing Proofreading Views in Publishing Chains. Léonard Dumas (Université de Technologie de Compiègne), Stéphane Crozat (Université de Technologie de Compiègne), Bruno Bachimont (Université de Technologie de Compiègne), Sylvain Spinelli (Kelis)
  • High-Quality Capture of Documents on a Cluttered Tabletop with a 4K Video Camera. Chelhwon Kim (University of California, Santa Cruz), Patrick Chiu (FXPAL), Henry Tang (FXPAL)
  • Segmentation of overlapping digits through the emulation of a hypothetical ball and physical forces. Alberto Nicodemus Lopes Filho (CIn - UFPE), Carlos Mello (Universidade Federal de Pernambuco)
  • AERO: An extensible framework for adaptive web layout synthesis. Rares Vernica (HP Labs), Niranjan Damera Venkata (HP Labs)
  • Automatic Text Document Summarization Based on Machine Learning. Gabriel Silva (Federal University of Pernambuco), Rafael Lins (Federal University of Pernambuco), Luciano Cabral (CIn-UFPE), Rafael Ferreira (Federal University of Pernambuco), Hilário Tomaz (Federal University of Pernambuco), Steven Simske (Hewlett-Packard Labs), Marcelo Riss (Hewlett-Packard)
  • Searching Live Meeting Documents "Show me the Action". Laurent Denoue (FXPAL), Scott Carter (FXPAL), Matthew Cooper (FXPAL)
  • Multimedia Document Structure for Distributed Theatre. Jack Jansen (CWI: Centrum Wiskunde & Informatica), Michael Frantzis (Goldsmiths), Pablo Cesar (CWI: Centrum Wiskunde & Informatica)
  • Change Classification in Graphics-Intensive Digital Documents. Jeremy Svendsen (University of Victoria), Alexandra Branzan Albu (University of Victoria)

ProDoc:

  • Automatic Content Generation for Wikipedia. Siddhartha Banerjee (Pennsylvania State University)
  • Sentiment Analysis for Web Documents. Fathima Sharmila Satthar (University of Brighton)