DOCENG 2015: 15TH ACM SIGWEB INTERNATIONAL SYMPOSIUM ON DOCUMENT ENGINEERING
PROGRAM FOR WEDNESDAY, SEPTEMBER 9TH

08:00-08:45 Registration / Welcome Coffee
09:00-10:00 Session 2: Keynote I: Documents as Data, Data as Documents. What we learned about Semi-Structured Information for our Open World of Cloud & Devices.

Jean Paoli

09:00
Documents as Data, Data as Documents. What we learned about Semi-Structured Information for our Open World of Cloud & Devices.
SPEAKER: Jean Paoli

ABSTRACT. Many of us have long believed in a single vision unifying documents and data through semantically rich, semi-structured information. This vision is even more critical today in our open, interconnected world of Clouds and Devices.

The last 20 years represent a real-life, worldwide experiment in this area that has fueled a massive set of market applications. In this talk, we review the history and trends of much of what enables today’s core interchanges on the Internet: from the initial research adding document user interfaces to data, to the specification of structured documents, to the generalization of document markup techniques, to the wide acceptance of document databases. We will also review our share of historical acronyms such as “Star”, “Grif”, “OpenDoc”, “WorldWideWeb/Nexus”, “Amaya”, “InfoPath”, “HTML”, “SGML”, “XML”, “JSON”, “YAML”, “Markdown”, “Schema”, “Semantics”, “MongoDB”, “Hadoop”, “DocumentDB” and many others.

We will then turn, cautiously and humbly, to the future and try to guess: what would the world need? And what do we need to think about to make it happen?

We truly believe in the potential of the open Internet. We see pieces of information (that we once called “Diamonds of the Internet”) being created, shared, re-shaped, re-routed, modified by users or tiny devices, understood through big data and machine learning, and processed by cloud services. We see the potential of fundamentally designing open platforms connected worldwide. By bridging technologies, we create higher-level abstractions and thus more complex organisms (software) that can help everyone. But at the core remains the need for semi-structured open information fundamentally unifying documents and data.

 

10:00-10:30 Coffee Break
10:30-11:45 Session 3: Layouts Improved
10:30
The browser as a document composition engine
SPEAKER: unknown

ABSTRACT. Printing has long been a neglected aspect of the Web, and the print function of browsers, when used on documents designed for on-screen consumption, often leads to a poor result. Whereas print CSS goes some way towards optimizing the paper experience, it still does not enable full control over the page layout, which is necessary to obtain a publication-quality print result. Furthermore, its use requires Web authors to invest additional resources for a feature that might only be used infrequently. This paper introduces a framework designed to alleviate these issues and improve the print experience on the Web. We describe the technologies that enable us to automatically compose and optimize the layout of a document, and generate a high quality PDF fully within the browser. This functionality can be offered to web publishers in the form of a print button, enabling content to be simultaneously delivered in screen and print formats, ensuring a publication-quality result that adheres to the publisher’s design guidelines.

11:00
Document Layout Optimization with Automated Paraphrasing
SPEAKER: unknown

ABSTRACT. We introduce a new concept in document layout optimization. In our approach, paraphrase-based layout optimization, layout issues (e.g. widows due to poor page breaking) are automatically fixed by rewording the neighboring sentences. To this end, paraphrasing techniques are borrowed from natural language processing; to our knowledge, this is the first such attempt in document engineering. We implemented a prototype TeX pre/post-processing system that includes two simple paraphrase generators. Experiments show that our approach is promising and effective for improving document layout.

11:15
Knuth-Plass revisited: Flexible line-breaking for automatic document layout
SPEAKER: unknown

ABSTRACT. There is an inherent flexibility in typesetting a block of text. Traditionally, line breaks would be manually chosen at strategic points in such a way as to minimize the amount of whitespace in each line. Hyphenation would only be used as a last resort. Knuth and Plass automated this optimization procedure, which has been used in various typesetting systems and DTP applications ever since. However, an optimal solution for the line-breaking problem does not necessarily lead us to an optimal document layout on the whole. The flexibility of choosing line breaks enables us, in many cases, to adjust the height of a paragraph by changing the number of lines, without having to make adjustments to font size, leading, etc. In many cases, the word spacing remains within the usual tolerances and visual quality does not noticeably suffer. This paper presents a modification to the Knuth-Plass algorithm to return several results for a given column of text, each corresponding to a different height, and describes steps to quantify the amount of expected flexibility in a given paragraph. We conclude with a discussion on how such "sub-optimal" results can lead to a better overall document layout, particularly in the context of mobile layouts, where flexibility is of key importance.
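The height flexibility described above can be illustrated with a small dynamic program. The sketch below is not the authors' implementation; it assumes simplified inputs (abstract word widths, a fixed line width, a cubic badness penalty) and merely shows how a Knuth-Plass-style optimizer can be extended to report the best break sequence for each feasible number of lines, so a pagination stage can choose among several paragraph heights.

```python
# Minimal sketch, not the paper's algorithm: a line-breaking dynamic program
# that records, for every feasible line count k, the cheapest way to break the
# paragraph into exactly k lines.

def badness(slack):
    """Cube of leftover space, TeX-style; infinite if the line is overfull."""
    return float("inf") if slack < 0 else slack ** 3

def breaks_by_line_count(word_widths, line_width, space=1.0):
    n = len(word_widths)
    # best[k][i] = (cost, previous break) for setting words[:i] in exactly k lines
    best = [{0: (0.0, None)}]
    for k in range(1, n + 1):
        layer = {}
        for j, (cost_j, _) in best[k - 1].items():
            width = 0.0
            for i in range(j + 1, n + 1):
                width += word_widths[i - 1] + (space if i - 1 > j else 0.0)
                cost = cost_j + badness(line_width - width)
                if cost < layer.get(i, (float("inf"), None))[0]:
                    layer[i] = (cost, j)
        best.append(layer)
    # Collect a break sequence for every line count that can hold all n words.
    results = {}
    for k in range(1, n + 1):
        if n in best[k] and best[k][n][0] < float("inf"):
            breaks, i = [], n
            for kk in range(k, 0, -1):
                breaks.append(i)
                i = best[kk][i][1]
            results[k] = list(reversed(breaks))
    return results

if __name__ == "__main__":
    widths = [3, 2, 4, 3, 5, 2, 3, 4]   # hypothetical word widths
    print(breaks_by_line_count(widths, line_width=12))
```

A downstream pagination step could then pick, say, the 3-line or 4-line variant of the same paragraph depending on which height better fits the column.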

11:30
Hiding information in multiple level-line moirés
SPEAKER: unknown

ABSTRACT. Secure documents often comprise an information layer that is hard to reproduce. Moiré techniques for the prevention of counterfeiting rely on the superposition of an array of transparent lines or microlenses on top of a base layer containing hidden information. Level-line moirés consist of shapes that appear to be beating upon relative translation of a revealing grating on top of a base, in which the desired information is encoded. Usually, the base only contains the information corresponding to one moiré. In order to increase the difficulty of counterfeiting, we use tessellations to incorporate two or more moirés within the same layer. With the method we propose, the information corresponding to up to seven level-line moirés can be embedded within a single base layer. The moirés are recovered with a revealer printed on a transparency or with an array of cylindrical lenses. This method is general and can be extended to other fabrication technologies.

12:00-14:00 Lunch (including BoF Session)
14:00-15:30 Session 5: Knowledge Extraction
14:00
TEXUS: A Task-based Approach for Table Extraction and Understanding
SPEAKER: unknown

ABSTRACT. In this paper, we propose a precise, comprehensive model of table processing which aims to remedy some of the problems in the discussion of table processing in the literature. The model targets application-independent, end-to-end table processing, and thus encompasses a large subset of the work in the area. The model can be used to aid the design of table processing systems (and we provide an example of such a system), can be considered as a reference framework for evaluating the performance of table processing systems, and can assist in clarifying terminological differences in the table processing literature.

14:30
Multi-oriented Text Extraction from Information Graphics
SPEAKER: unknown

ABSTRACT. Existing research on analyzing information graphics focuses on layout and other structural aspects, such as extracting a high-level message. These works assume perfect text detection and extraction from infographics, yet text extraction from information graphics is far from solved. To fill this gap, we propose a novel processing pipeline for multi-oriented text extraction from infographics. The pipeline applies a combination of data mining and computer vision techniques to identify text elements, cluster them into text lines, compute their orientation, and use a state-of-the-art open-source OCR engine to perform the text recognition. We evaluate our method on 121 infographics extracted from an open-access corpus of scientific publications. The results show that our approach is effective and significantly outperforms a state-of-the-art baseline.

14:45
Interlinking English and Chinese RDF Data Using BabelNet
SPEAKER: unknown

ABSTRACT. Linked data technologies enable the publication and linking of structured data on the Web. Although RDF is not a textual format, many RDF data providers publish their data in their own language. Cross-lingual interlinking consists of discovering links between identical resources across knowledge bases in different languages. In this paper, we present a method for interlinking RDF resources described in English and Chinese using the BabelNet multilingual lexicon. Resources are represented as vectors of identifiers, and the similarity between these resources is then computed. The method achieves an F-measure of 88%. The results are also compared to a translation-based method.
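As a rough illustration of the comparison step described in the abstract (not the paper's actual pipeline), the sketch below represents each resource as a bag of language-independent identifiers and scores candidate pairs with cosine similarity. The synset IDs, the example resources, and the 0.5 threshold are hypothetical placeholders rather than real BabelNet output.

```python
# Hedged sketch: compare two RDF resources via vectors of shared identifiers.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical identifier vectors for an English and a Chinese resource
# describing the same entity; shared synset IDs signal a likely link.
english_res = Counter({"bn:00015267n": 2, "bn:00046516n": 1})
chinese_res = Counter({"bn:00015267n": 1, "bn:00077742n": 1})

if cosine(english_res, chinese_res) > 0.5:   # the threshold is an assumption
    print("propose owl:sameAs link")
```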

15:00
Efficient Computation of Co-occurrence Based Word Relatedness
SPEAKER: unknown

ABSTRACT. Measuring document relatedness using unsupervised co-occurrence-based word relatedness methods is a time- and memory-consuming task. This paper introduces the application of compact data structures for efficient computation of word relatedness based on corpus statistics. The data structure is used to efficiently look up: (1) the corpus statistics for the Common Word Relatedness Approach, and (2) the pairwise word relatedness for the Algorithm Specific Word Relatedness Approach. These two approaches significantly accelerate the processing time of word relatedness methods and reduce the space cost of storing co-occurrence statistics in memory, making text mining tasks based on word relatedness, such as classification and clustering, practical.
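The kind of lookup being accelerated can be sketched as follows. This is an assumption-laden illustration, not the paper's data structure: it uses plain Python dictionaries and a standard PMI formula, whereas the paper's contribution is to replace such dictionaries with compact structures so the same corpus statistics fit in far less memory. All counts below are invented.

```python
# Illustrative co-occurrence-based word relatedness via pointwise mutual
# information (PMI); the lookup tables here stand in for compact structures.
from math import log

word_count = {"doctor": 1200, "nurse": 800, "guitar": 500}      # hypothetical counts
pair_count = {("doctor", "nurse"): 300, ("doctor", "guitar"): 5}
total_windows = 1_000_000

def pmi(w1, w2):
    pair = pair_count.get((w1, w2)) or pair_count.get((w2, w1)) or 0
    if not pair:
        return 0.0
    p_xy = pair / total_windows
    p_x = word_count[w1] / total_windows
    p_y = word_count[w2] / total_windows
    return log(p_xy / (p_x * p_y))

print(pmi("doctor", "nurse"), pmi("doctor", "guitar"))
```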

15:15
Automatic Extraction of Figures from Scholarly Documents
SPEAKER: unknown

ABSTRACT. Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple “figures” such as plots, flow charts and other images which are generated manually to symbolically represent and visually illustrate important concepts, findings and results. These figures can be analyzed for automated data extraction or automatic interpretation of the intended message. Surprisingly, large-scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges: how to build a heuristic-independent, trainable model for such an extraction task, and how to extract these figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
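A minimal sketch of how the figure-level metrics named above would typically be computed, assuming they follow the usual precision/recall pattern over matched figure regions; the paper's exact matching criterion (e.g. an overlap threshold between extracted and annotated regions) is not reproduced here, and the counts in the example are hypothetical.

```python
# Hedged sketch of figure-precision, figure-recall and figure-F1-score,
# assuming standard definitions over matched figure regions.

def figure_metrics(num_correct, num_extracted, num_ground_truth):
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. 150 correct extractions out of 170 detected, against 180 annotated figures
print(figure_metrics(150, 170, 180))
```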

15:30-16:00 Coffee Break
16:00-17:15 Session 6: Information Summarized
16:00
Generating Abstractive Summaries from Meeting Transcripts
SPEAKER: unknown

ABSTRACT. Summaries of meetings are very important as they convey the essential content of discussions in a concise form. Both participants and non-participants are interested in summaries of meetings to plan their future work. Generally, it is time-consuming to read and understand whole documents; summaries therefore play an important role, as readers are interested only in the important content of discussions. In this work, we address the task of meeting document summarization. Automatic summarization approaches developed so far for meeting conversations have been primarily extractive, resulting in unacceptable summaries that are hard to read. The extracted utterances contain disfluencies that affect the quality of the extractive summaries. To make summaries more readable, we propose an approach to generate abstractive summaries by fusing important content from several utterances. We first separate meeting transcripts into topic segments, and then identify the important utterances in each segment using a supervised learning approach. The important utterances are then combined to generate a one-sentence summary per segment. In the text generation step, the dependency parses of the utterances in each segment are combined into a directed graph. The most informative and well-formed sub-graph, obtained by integer linear programming (ILP), is selected to generate a one-sentence summary for each topic segment. The ILP formulation reduces disfluencies by leveraging grammatical relations that are more prominent in non-conversational text, and therefore generates summaries that are comparable to human-written abstractive summaries. Experimental results show that our method can generate more informative summaries than the baselines. In addition, readability assessments by human judges, as well as log-likelihood estimates obtained from the dependency parser, show that our generated summaries are readable and well-formed.

16:30
Enhancing Exploration with a Faceted Browser through Summarization
SPEAKER: unknown

ABSTRACT. An enhanced faceted browsing system has been developed to support users' exploration of large multi-tagged document collections. At each step of navigation it provides summary measures of document result sets, in the form of a set of representative terms and a diverse set of documents. These summaries are derived from pre-materialized views that allow quick calculation of centroids for various result sets. The utility and efficiency of the system are demonstrated on the New York Times Annotated Corpus.

16:45
A Quantitative and Qualitative Assessment of Automatic Text Summarization Systems
SPEAKER: unknown

ABSTRACT. Text summarization is the process of automatically creating a shorter version of one or more text documents. This paper presents a qualitative and quantitative assessment of 22 state-of-the-art extractive summarization systems using the CNN corpus, a dataset of 3,000 news articles.

17:00
Automatic Document Classification using Summarization Strategies
SPEAKER: unknown

ABSTRACT. Automatic text summarization, the task of creating a shorter text from one or several documents, may provide an efficient way to automatically classify documents. This paper presents an assessment of the 15 most widely used automatic text summarization methods from a text classification perspective. A naive Bayes classifier was used, showing that some of the tested methods are better suited to such a task.
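A hedged sketch of the kind of evaluation setup the abstract describes: documents are first reduced by a summarizer and the summaries are then classified with a naive Bayes model. The summarize() stub and the toy training data below are placeholders; the paper evaluates 15 real summarization methods rather than this first-sentences heuristic.

```python
# Sketch: classify summarized documents with naive Bayes (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def summarize(text: str, n_sentences: int = 2) -> str:
    """Placeholder extractive summarizer: keep the first n sentences."""
    return " ".join(text.split(". ")[:n_sentences])

train_docs = ["stock markets fell sharply today. investors reacted to rates.",
              "the team won the final match. fans celebrated all night."]
train_labels = ["finance", "sport"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit([summarize(d) for d in train_docs], train_labels)
print(clf.predict([summarize("shares dropped as markets opened lower.")]))
```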

17:15-18:15 Session 7: Posters (including ProDoc)

Short Papers Presented as Posters:

  • Fine Grained Access Interactive Personal Health Records. Helen Balinsky (HP Laboratories), Nassir Mohammad (HP)
  • Does a Split-View Aid Navigation Within Academic Documents? Juliane Franze (Monash University and Fraunhofer), Kim Marriott (Monash University), Michael Wybrow (Monash University)
  • An Approach for Designing Proofreading Views in Publishing Chains. Léonard Dumas (Université de Technologie de Compiègne), Stéphane Crozat (Université de Technologie de Compiègne), Bruno Bachimont (Université de Technologie de Compiègne), Sylvain Spinelli (Kelis)
  • High-Quality Capture of Documents on a Cluttered Tabletop with a 4K Video Camera. Chelhwon Kim (University of California, Santa Cruz), Patrick Chiu (FXPAL), Henry Tang (FXPAL)
  • Segmentation of overlapping digits through the emulation of a hypothetical ball and physical forces. Alberto Nicodemus Lopes Filho (CIn - UFPE), Carlos Mello (Universidade Federal de Pernambuco)
  • AERO: An extensible framework for adaptive web layout synthesis. Rares Vernica (HP Labs), Niranjan Damera Venkata (HP Labs)
  • Automatic Text Document Summarization Based on Machine Learning. Gabriel Silva (Federal University of Pernambuco), Rafael Lins (Federal University of Pernambuco), Luciano Cabral (CIn-UFPE), Rafael Ferreira (Federal University of Pernambuco), Hilário Tomaz (Federal University of Pernambuco), Steven Simske (Hewlett-Packard Labs), Marcelo Riss (Hewlett-Packard)
  • Searching Live Meeting Documents "Show me the Action". Laurent Denoue (FXPAL), Scott Carter (FXPAL), Matthew Cooper (FXPAL)
  • Multimedia Document Structure for Distributed Theatre. Jack Jansen (CWI: Centrum Wiskunde & Informatica), Michael Frantzis (Goldsmiths), Pablo Cesar (CWI: Centrum Wiskunde & Informatica)
  • Change Classification in Graphics-Intensive Digital Documents. Jeremy Svendsen (University of Victoria), Alexandra Branzan Albu (University of Victoria)

ProDoc:

  • Automatic Content Generation for Wikipedia. Siddhartha Banerjee (Pennsylvania State University)
  • Sentiment Analysis for Web Documents. Fathima Sharmila Satthar (University of Brighton)