DOCENG 2019: 19TH ACM SYMPOSIUM ON DOCUMENT ENGINEERING
PROGRAM FOR WEDNESDAY, SEPTEMBER 25TH

09:40-10:30 Session 15
09:40
Searching and Ranking Questionnaires: an Approach to Calculate Similarity Between Questionnaires

ABSTRACT. Questionnaires are useful research tools, generally used for collecting information about a population of interest with different intentions in mind. When designing a questionnaire, or when sharing data, it may be useful to check whether a questionnaire with the same intention already exists. Well-designed questions can induce respondents to provide better answers. However, examining research questionnaires is not a trivial task, since a question can be structured in different ways. In this paper, we propose a similarity measure to match questionnaires characterized by the heterogeneity of their questions, and a ranking method based on variations of a given query. We evaluated the effectiveness of this approach in an experimental study using recall, precision, F-measure, MAP and NDCG, obtaining more effective results than other proposals.
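
As a side note on the ranking metrics mentioned above, NDCG can be computed in a few lines of Python; this is a generic sketch with hypothetical relevance scores, not the authors' similarity measure:

import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    return dcg(ranked_relevances) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Hypothetical graded relevance of the questionnaires returned for one query,
# in the order the ranking method returned them.
print(ndcg([3, 2, 3, 0, 1, 2]))   # about 0.96 for this toy example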

10:05
Multi-Layered Edits for Meaningful Interpretation of Textual Differences

ABSTRACT. The way humans and algorithms look at and understand differences between versions and variants of the same text may be very different. While correctness and overall byte length are fundamental aspects of good outputs of diff algorithms, they do not usually provide immediately interesting values for humans trying to make sense of the events that lead from one version to another of a text.

In this paper we propose 3-edit, a layered model to group and organize individual differences (i.e., edits) between document versions into a conceptual, value-based scaffolding that provides an easier and more approachable characterization of the modifications that occurred to a text document. Through the structural and semantic classification of the individual edits, it becomes possible to differentiate between modifications, so that they can be shown differently, shown selectively, or emphasized, and the human reader can more easily identify the types of modifications that matter for their reading purpose.

An algorithm that provides structural and semantic grouping of basic mechanical INS/DEL edits is described as well.
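
As a rough, hypothetical illustration of the layering idea (not the 3-edit model or algorithm described in the paper), the following Python sketch derives basic mechanical edits with difflib and attaches a coarse second-layer label to each:

import difflib

def edits(old, new):
    """Basic mechanical edits between two token sequences (INS/DEL/REPL)."""
    sm = difflib.SequenceMatcher(a=old, b=new)
    return [op for op in sm.get_opcodes() if op[0] != "equal"]

def classify(op, old, new):
    """Hypothetical second layer: attach a coarse label to one edit."""
    tag, i1, i2, j1, j2 = op
    removed, added = old[i1:i2], new[j1:j2]
    if all(t in ".,;:!?" for t in removed + added):
        return "punctuation"
    if "\n\n" in added or "\n\n" in removed:
        return "structural"          # a paragraph boundary was touched
    return "wording"

old = "The quick brown fox .".split()
new = "The slow brown fox !".split()
for op in edits(old, new):
    print(classify(op, old, new), op)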

10:30-11:00 Coffee Break
11:00-12:30 Session 16
11:00
Predictable and Consistent Information Extraction

ABSTRACT. Information extraction programs (extractors) can be applied to documents to isolate structured versions of some content, that is, to create tabular records corresponding to facts found in the documents. If the data in an extracted table needs to be updated for any reason (for example, as a result of data cleaning), the source document will no longer be synchronized with the data. But documents are the principal medium for sharing information among humans. We therefore wish to ensure that changes to extracted tables are reflected correctly in their source documents.

In this work, we characterize extractors for which we are able to predict the effects that updates to source documents will have on extracted records. We introduce three general properties for extractors that, if satisfied, can guarantee that consistency will be maintained if the lineage of extracted records is respected when changing the documents. We propose a property verification process that uses static analysis for a substantial subset of JAPE, a well-established rule-based extraction language, and illustrate it through an example based on a freely-available extractor library.
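
A hypothetical, much-simplified sketch of span lineage, the ingredient that lets changes to extracted records be pushed back into source documents (not the JAPE-based verification process itself):

from dataclasses import dataclass

@dataclass
class Record:
    """One extracted fact, remembering where in the document it came from."""
    value: str
    start: int   # character offset of the extracted span in the source text
    end: int

def push_back(document, record, new_value):
    """Rewrite the source span so the document stays consistent with the
    updated record (assumes the record's lineage span is still valid)."""
    updated = document[:record.start] + new_value + document[record.end:]
    record.value = new_value
    record.end = record.start + len(new_value)
    return updated

doc = "Contact: alice@example.com, opened 2019."
rec = Record(value="alice@example.com", start=9, end=26)
print(push_back(doc, rec, "bob@example.com"))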

11:25
Prediction of Mathematical Expression Declarations Based on Spatial, Semantic, and Syntactic Analysis

ABSTRACT. Mathematical expressions (MEs) and words are carefully bonded together in most science, technology, engineering, and mathematics (STEM) documents. They respectively give quantitative and qualitative descriptions of the system model under discussion. This paper proposes a general model for finding the co-reference relations between words and MEs, based on which we developed a novel algorithm for predicting the natural language declarations of MEs (ME-Decs). The prediction algorithm is applied in a three-level framework. The first level is a customized tagger that identifies the syntactic roles of MEs and the part-of-speech (POS) tags of words in ME-word mixed sentences. The second level screens the ME-Dec candidates based on the hypothesis that most ME-Decs are noun phrases (NPs): a shallow chunker is trained with a fuzzy process mining algorithm, which uses the labeled POS tag series in the NTCIR-10 dataset as input to mine the frequent syntactic patterns of NPs. In the third level, using distance, word stem, and POS tag respectively as the spatial, semantic, and syntactic features, the bonding model between MEs and ME-Dec candidates is trained on the NTCIR-10 training set. The final predictions are made by the majority vote of an ensemble of Naïve Bayesian classifiers based on the three features. Evaluated on the NTCIR-10 test set, the proposed algorithm achieves average F1 scores of 75% and 71% for soft matching and strict matching, respectively, outperforming the state-of-the-art solutions by a margin of 5-18%.
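
A schematic Python sketch of the final ensemble step only, with synthetic features standing in for the spatial, semantic, and syntactic ones, and with one Naive Bayes classifier per feature family assumed:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-ins for the spatial, semantic, and syntactic feature vectors
# of ME / candidate-declaration pairs; y = 1 means "is the declaration".
X_spatial   = rng.normal(size=(200, 3))
X_semantic  = rng.normal(size=(200, 5))
X_syntactic = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

clfs = [GaussianNB().fit(X, y) for X in (X_spatial, X_semantic, X_syntactic)]

def predict_majority(parts):
    """Majority vote of the per-feature classifiers for one sample."""
    votes = [clf.predict(x.reshape(1, -1))[0] for clf, x in zip(clfs, parts)]
    return int(sum(votes) >= 2)

sample = (X_spatial[0], X_semantic[0], X_syntactic[0])
print(predict_majority(sample))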

11:50
An Algorithm for Extracting Shape Expression Schemas from Graphs

ABSTRACT. Unlike traditional data such as relational databases and XML documents, most graphs do not have their own schema. However, a schema is a concise representation of a graph, and if we can extract a “good” schema from a graph, we can take advantage of it for effective graph data management. In this paper, we focus on Shape Expression Schemas (ShEx) and consider extracting ShEx schemas from RDF/graph data. To balance the efficiency and the quality of the extracted schema, our algorithm consists of two schema extraction steps: (i) edge-label based clustering and (ii) a type-merge method for the target nodes of outgoing edges. We conducted preliminary experiments, whose results suggest that our algorithm can extract ShEx schemas appropriately.
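
A toy Python sketch of the first step, edge-label based clustering, on a hand-written triple list (the type-merge step and real RDF handling are omitted):

from collections import defaultdict

# Toy RDF-like graph as (subject, predicate, object) triples.
triples = [
    ("alice",  "name",   '"Alice"'),
    ("alice",  "knows",  "bob"),
    ("bob",    "name",   '"Bob"'),
    ("bob",    "knows",  "alice"),
    ("paper1", "title",  '"ShEx extraction"'),
    ("paper1", "author", "alice"),
]

# Cluster subject nodes by the set of labels on their outgoing edges:
# nodes with the same label signature are candidates for the same shape.
signature = defaultdict(set)
for s, p, _ in triples:
    signature[s].add(p)

clusters = defaultdict(list)
for node, labels in signature.items():
    clusters[frozenset(labels)].append(node)

for labels, nodes in clusters.items():
    print(sorted(labels), "->", nodes)
# ['knows', 'name'] -> ['alice', 'bob'];  ['author', 'title'] -> ['paper1']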

12:10
Multi-Context Information for Word Representation Learning

ABSTRACT. Word embedding techniques in the literature are mostly based on Bag-of-Words models, where words that co-occur with each other are considered to be related. However, similar or related words do not necessarily occur in the same context window. In this paper, we propose a new approach that combines different types of resources for training word embeddings. The lexical resources used in this work are Dependency Parse Trees and WordNet. Apart from the co-occurrence information, these additional resources help us include semantic and syntactic information from the text when learning the word representations. The learned representations are evaluated on multiple tasks such as Semantic Textual Similarity and Word Similarity. The results of the experimental analyses highlight the usefulness of the proposed methodology.
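
A toy sketch of how (target, context) training pairs from a plain co-occurrence window can be mixed with pairs from dependency edges; the edges below are hand-written stand-ins for a parser's output, and WordNet-derived pairs are omitted:

def window_pairs(tokens, size=2):
    """(target, context) pairs from a plain co-occurrence window."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - size), min(len(tokens), i + size + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

def dependency_pairs(edges):
    """(target, context) pairs from dependency edges (head, label, dependent)."""
    return [(h, f"{lab}_{d}") for h, lab, d in edges] + \
           [(d, f"inv_{lab}_{h}") for h, lab, d in edges]

tokens = ["the", "cat", "chased", "the", "mouse"]
# Hand-written dependency edges standing in for a parser's output.
edges = [("chased", "nsubj", "cat"), ("chased", "dobj", "mouse")]

training_pairs = window_pairs(tokens) + dependency_pairs(edges)
print(len(training_pairs), training_pairs[:4])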

12:30-14:00 Lunch Break (BoF Session)
14:00-15:30 Session 17
14:00
Searching Document Repositories Using 3D Model Reconstruction

ABSTRACT. A common representation of a three-dimensional object is a multi-view collection of two-dimensional images showing the object from multiple angles. This technique is often used with document repositories such as collections of engineering drawings and governmental repositories of design patents and 3D trademarks. It is rare for the original physical artifact to be available. When the original physical artifact is modeled as a set of images, the resulting multi-view collection of images may be indexed and retrieved using traditional image retrieval techniques. Consequently, massive repositories of multi-view collections exist. While these repositories are in use and easy to construct, the conversion of a physical object into multi-view images results in a degraded representation of both the original three-dimensional artifact and the resulting document repository. We propose an alternative approach where the archived multi-view representation of the physical artifact is used to reconstruct the 3D model, and the reconstructed model is used for retrieval against a database of 3D models. We demonstrate that document retrieval using the reconstructed 3D model achieves higher accuracy than document retrieval using a document image against a collection of degraded multi-view images. The Princeton Shape Benchmark 3D model database and the ShapeNet Core 3D model database are used as ground truth for the 3D image collection. Traditional indexing and retrieval is simulated using the multi-view images generated from the 3D models. A more accurate 3D model search is then considered using a reconstruction of the original 3D models from the multi-view archive, and this model is searched against the 3D model database.

14:25
Text Localization in Scientific Figures Using Fully Convolutional Neural Networks on Limited Training Data

ABSTRACT. Text extraction from scientific figures has been addressed in the past by different unsupervised approaches due to the limited amount of training data. Motivated by the recent advances in Deep Learning, we propose a two-step neural-network-based pipeline to localize and extract text using Fully Convolutional Networks. We improve the localization of the text bounding boxes by applying a novel combination of a Residual Network with the Region Proposal Network based on Faster R-CNN. The predicted bounding boxes are further pre-processed and used as input to the off-the-shelf optical character recognition engine Tesseract 4.0. We evaluate our improved text localization method on five different datasets of scientific figures and compare it with the best unsupervised pipeline. Since only limited training data is available, we further experiment with different data augmentation techniques for increasing the size of the training datasets and demonstrate their positive impact. We use Average Precision and F1 measure to assess the text localization results. In addition, we apply Gestalt Pattern Matching and Levenshtein Distance for evaluating the quality of the recognized text. Our extensive experiments show that our new pipeline based on neural networks outperforms the best unsupervised approach by a large margin of 19-20%.
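
The two text-quality measures used in the evaluation can be sketched in a few lines of Python (difflib's ratio implements Gestalt pattern matching); the example strings are hypothetical:

import difflib

def gestalt_ratio(a, b):
    """Gestalt pattern matching similarity, as implemented by difflib."""
    return difflib.SequenceMatcher(a=a, b=b).ratio()

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

ground_truth = "Flow rate (mL/min)"
recognized   = "Fl0w rate (mL/mn)"
print(gestalt_ratio(ground_truth, recognized), levenshtein(ground_truth, recognized))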

14:50
Automatic Identification and Normalisation of Physical Measurements in Scientific Literature

ABSTRACT. We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, aiming to understand and make unstructured information accessible, represent the building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6, 13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task. The models are trained using the CRF (Conditional Random Fields) algorithm [12] for extracting quantities (atomic values, intervals and lists), units (such as length, weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), for extracting semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science in Japan (NIMS), it is used in an ongoing project to discover new superconducting materials. Normalised materials characteristics (such as critical temperature and pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
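
A toy illustration of the normalisation step only, not Grobid-quantities' implementation: a parsed value-unit pair is mapped to SI base units via a small hand-written conversion table:

# Hand-written conversion factors to SI base units (a tiny illustrative subset).
TO_SI = {
    "cm": ("m", 0.01),
    "km": ("m", 1000.0),
    "mg": ("kg", 1e-6),
    "h":  ("s", 3600.0),
}

def normalise(value, unit):
    """Return the measurement expressed in SI base units."""
    base_unit, factor = TO_SI.get(unit, (unit, 1.0))
    return value * factor, base_unit

print(normalise(250.0, "cm"))   # (2.5, 'm')
print(normalise(4.2, "K"))      # already an SI base unit -> unchanged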

15:10
Generating Digital Libraries of M.Sc. and Ph.D. Theses

ABSTRACT. Postgraduate degrees are one of the most important propellers of all areas of science. M.Sc. and Ph.D. theses witness important developments and provide a solid and global account of research projects. This paper describes a platform developed with the aim of generating digital libraries of theses and dissertations. Printed theses have to be scanned and then processed by the platform for marginal border removal, skew and orientation correction, image segmentation and enhancement, compression, and PDF file generation. Both scanned and digitally generated theses are processed in the platform to extract relevant indexing information.
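
One common way to estimate and correct page skew, sketched here with OpenCV as a generic example rather than the platform's actual processing chain (the input file name is hypothetical):

import cv2
import numpy as np

def deskew(gray_page):
    """Estimate the dominant skew angle of the dark pixels and rotate to correct it."""
    # Binarise: ink becomes white (255) on a black background.
    _, binary = cv2.threshold(gray_page, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = cv2.findNonZero(binary)
    angle = cv2.minAreaRect(coords)[-1]   # angle of the tightest bounding box
    # OpenCV's angle convention varies across versions; fold into (-45, 45].
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    h, w = gray_page.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray_page, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderValue=255)

page = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file
corrected = deskew(page)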

15:30-16:00 Coffee Break
16:00-17:30 Session 18
16:00
PaperWork: Exploring the Potential of Electronic Paper on Office Work

ABSTRACT. Electronic paper (e-paper) is a display technology that aims to imitate conventional paper. Currently, most e-paper applications on handheld devices are restricted to digital book readers. Few studies explore the potential of e-paper for input-oriented applications. In this paper, we introduce PaperWork, a novel e-paper application that allows users to offload their commonly used office applications from a PC to an e-paper device remotely. There are considerable challenges when building a system for a resource-constrained e-paper device, which we highlight in this work. In addition to presenting a new e-paper system, we also conduct a usability study of it.

16:25
TRIVIR: a Visualization System to Support Document Retrieval with High Recall

ABSTRACT. In this paper, we propose TRIVIR, a novel interactive visualization tool powered by an Information Retrieval (IR) engine that implements an active learning protocol to support IR with high recall. The system integrates multiple graphical views in order to assist the user in identifying the relevant documents in a collection, including a content-based similarity map obtained with multidimensional projection techniques. Given representative documents as queries, users can interact with the views to label documents as relevant/not relevant, and this information is used to train a machine learning (ML) algorithm which suggests other potentially relevant documents on demand. TRIVIR offers two major advantages over existing visualization systems for IR. First, it merges the ML algorithm output into the visualization, while supporting several user interactions in order to enhance and speed up its convergence. Second, it tackles the problem of vocabulary mismatch by providing synonyms of terms and a view that conveys how the terms are used within the collection. In addition, TRIVIR has been developed as a flexible front-end interface that can be associated with distinct text representations and multidimensional projection techniques. We describe two use cases conducted with collaborators who are potential users of TRIVIR. Results show that the system simplified the search for relevant documents in large collections, based on the context in which terms occur.
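
The active-learning loop at the core of such a system can be sketched roughly as follows, using scikit-learn stand-ins and a toy collection rather than TRIVIR's actual models and interface:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy collection; labels arrive incrementally from the user's interactions.
docs = ["deep learning for text", "convolutional networks", "medieval history",
        "neural text classification", "baroque architecture", "word embeddings"]
labels = {0: 1, 2: 0}          # doc index -> relevant (1) / not relevant (0)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

def suggest(k=2):
    """Train on the labelled documents and suggest the k most likely relevant
    unlabelled ones (the user would then label them, and the loop repeats)."""
    idx = list(labels)
    model = LogisticRegression().fit(X[idx], [labels[i] for i in idx])
    unlabelled = [i for i in range(len(docs)) if i not in labels]
    scores = model.predict_proba(X[unlabelled])[:, 1]
    ranked = sorted(zip(unlabelled, scores), key=lambda p: -p[1])
    return ranked[:k]

print(suggest())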

16:50
Globally Optimal Page Breaking with Column Balancing – a Case Study

ABSTRACT. The paper presents a dynamic programming algorithm that finds the globally optimal sequence of page breaks for a book, avoiding widows and orphans, when the only source of variation is the possibility of breaking selected paragraphs into a varying number of lines by skillful selection of line breaks. The text is set in two columns, and on the last page of each chapter the columns must be balanced. We show how the balancing process can be included in the global optimization.

The algorithm is applied to a real-life problem of typesetting a small-format, two-column, 800-page dictionary. We analyze the typesetting process, including the proofing phase, where local changes in the text can globally influence page breaks.

This problem provides an ideal test-bed for global optimization since the typographic model involved is relatively simple – the material processed is merely a stream of paragraphs. On the other hand, breaking the book under these conditions is a very tedious and frustrating job for a human typesetter, as it typically requires hours of trial and error.
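
A drastically simplified, single-column sketch of the dynamic programming idea, with hypothetical paragraph variants and penalty values, and with column balancing omitted:

from functools import lru_cache

LINES_PER_PAGE = 40

# Hypothetical alternatives: each paragraph can be set in one of several
# line counts by re-running the line breaker with different parameters.
paragraphs = [(7, 8), (12, 13), (3,), (9, 10), (6, 7), (11, 12), (5,), (8, 9)]

WIDOW = ORPHAN = 100   # penalty for a lone paragraph line at the top/bottom of a page
LOOSE = 1              # mild penalty per extra line when a longer variant is chosen

@lru_cache(maxsize=None)
def best(i, fill):
    """Minimal total penalty for paragraphs i.. when `fill` lines of the
    current page are already occupied."""
    if i == len(paragraphs):
        return 0
    best_cost = float("inf")
    for lines in paragraphs[i]:
        cost = LOOSE * (lines - min(paragraphs[i]))
        first_chunk = min(lines, LINES_PER_PAGE - fill)
        remainder = (fill + lines) % LINES_PER_PAGE
        if lines > 1 and first_chunk == 1:
            cost += ORPHAN           # lone first line at the bottom of a page
        if lines > first_chunk and remainder == 1:
            cost += WIDOW            # lone last line at the top of a page
        best_cost = min(best_cost, cost + best(i + 1, remainder))
    return best_cost

print(best(0, 0))   # total penalty of the globally optimal choice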

17:10
A Vision for User-Defined Semantic Markup

ABSTRACT. Typesetting systems, such as LaTeX, permit users to define custom markup and corresponding formatting to simplify authoring, ensure the consistent presentation of domain-specific recurring elements and, potentially, enable further processing, such as the generation of an index of such elements. In XML-based and similar systems, the separation of content and form is also reflected in the processing pipeline: while document authors can define custom markup, they cannot define its semantics. This could be said to be intentional, to ensure the structural integrity of documents, but at the same time it limits the expressivity of markup. The latter is particularly true for so-called lightweight markup languages like Markdown, which only define very limited sets of generic elements. This vision paper sketches an approach to user-defined semantic markup that would permit authors to define the semantics of elements by formally describing the relations between their constituent parts and to other elements, and to define a formatting intent that ensures a default presentation is always available.
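
One way the vision could be prototyped, sketched here in plain Python rather than in a real markup pipeline; the registry, element name, and relation vocabulary below are all hypothetical:

# Hypothetical registry for user-defined semantic elements: each element names
# its constituent parts, the relations between them, and a default rendering.
ELEMENTS = {}

def define_element(name, parts, relations, default_format):
    ELEMENTS[name] = {"parts": parts, "relations": relations,
                      "format": default_format}

def render(name, **values):
    """Fall back to the element's default presentation."""
    element = ELEMENTS[name]
    missing = [p for p in element["parts"] if p not in values]
    if missing:
        raise ValueError(f"{name}: missing parts {missing}")
    return element["format"].format(**values)

# An author-defined "theorem-with-attribution" element.
define_element(
    "attributed_theorem",
    parts=["statement", "person"],
    relations=[("person", "statedBy", "statement")],   # machine-readable semantics
    default_format="Theorem ({person}). {statement}",
)

print(render("attributed_theorem",
             statement="Every even number greater than 2 is conjectured ...",
             person="Goldbach"))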