DOCENG '17: ACM SYMPOSIUM ON DOCUMENT ENGINEERING 2017
PROGRAM FOR WEDNESDAY, SEPTEMBER 6TH

09:30-10:30 Session 8

Keynote - John Collomosse

10:30-11:00 Session 9

Poster Lightning Talks

10:30
Towards a Model and a Textual Representation for Location-based Games

ABSTRACT. Location-Based Mobile Games (LBMGs) are a subclass of pervasive games that make use of location technologies to consider the players' geographic position in the game rules and mechanics. This research presents LEGaL, a language to model and represent the structure and multimedia contents (e.g., video, audio, 3D objects, etc.) of LBMGs. LEGaL is an extension of NCL (Nested Context Language) that allows the modelling and representation of mission-based games by supporting spatial and temporal relationships between game elements.

10:32
SketchTab3d: A hybrid sketch library using tablets and immersive 3d environments

ABSTRACT. This paper proposes a 2d sketching tool and an immersive 3d sketch library as an approach to easily create and access documents (i.e. sketches). The sketch library allows users to store, arrange and assemble their own sketches and those of others in theoretically unlimited space. Since the library is updated whenever changes are made, a user can get an idea of the general activities of all users. The system provides both 2d and 3d means to access the sketch library: whereas the 2d interface offers a standard dashboard, the 3d environment provides unrestricted spatial access. Furthermore, a 2d sketching interface is provided for creating sketch-based documents. Possible application areas are in the fields of engineering, design, public displays, shared knowledge applications, and art. The system was evaluated with eight participants regarding its pragmatic and hedonic qualities as well as searching performance. The results suggest that users appreciate the particular combination of 2d and 3d technologies in SketchTab3d but requested improvements to the 3d interaction technique. No significant differences were found in search performance; however, the physical demand during searching was perceived as significantly higher in the 3d condition than in the 2d condition.

10:34
A tool for mixing XML annotations

ABSTRACT. XML documents, in particular critical editions, are usually very heavily annotated. The annotations typically represent abbreviated parts of words, variant readings, edition operations, etc. Among such documents, a part of the PCDATA\footnote{Character contents of the text} is the actual edition of the text. Very often, one wants to run automatic tools on this ``simple'' text and thereafter re-embed the result into the original file.

The tool we present here is dedicated to this embedding of annotations.

In order to achieve this, the tool casts the problem as an ambiguous input and parses that input with the grammar of the XML language. It then proposes the solutions that are syntactically correct. In case there are none, the input is modified and reparsed until at least one solution is found.

The tool is available at https://github.com/bgaiffe/XMLMixer.

10:36
Authenticity in a digital era: Still a document process. The case of laboratory notebooks

ABSTRACT. Asymmetric cryptography brings the ability for anyone on earth to check the signature of a digital object (Diffie & Hellman, 1976). From there, trusted timestamping of a digital object provides very strong evidence of its author or inventor and of its integrity (Haber, 1991). Twenty-six years later, trusted timestamping could have replaced, for example, traditional paper laboratory notebooks a long time ago. It has not happened yet. In this paper, we argue that the reason is that authenticity is a document process: while trusted timestamping remains a necessary part of the process, a digital object must be involved in a sociotechnical process in order to become a document. We first point out the gap, intractable with paper, between the strict administrative workflow required to create strong evidence and the fluidity of collaborative authoring needed for creativity. This gap is relevant to laboratory notebooks, as they are commonly used by inventors to attest that they discovered elements at a specific time, in a specific context. Then we explain the design and implementation of our software system, according to document theory (Buckland, 1997), in order to reinvent the whole process to minimize the administrative burden while preserving its well-known and valuable properties.

10:38
Fast Binarization with Chebyshev Inequality

ABSTRACT. In order to enhance the binarization of degraded document images with smudged and bleed-through backgrounds, we present a fast binarization technique that applies Chebyshev's inequality during image preprocessing. We introduce the Chebyshev filter, which uses the Chebyshev inequality to segment objects from the background. Our results show that the Chebyshev filter is not only effective, but also simple, robust and easy to implement. Because of its simplicity, our method is efficient enough to process live image sequences in real time. We tested and evaluated our implementation on the Document Image Binarization Contest dataset (H-DIBCO 2014). The experimental outcomes demonstrate that the method achieves good results compared with the literature.
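
The abstract does not include code, but the underlying principle can be sketched. By Chebyshev's inequality, at most 1/k² of any distribution lies more than k standard deviations from its mean, so pixels that deviate strongly from the page statistics are unlikely to be background. The following is a minimal, hypothetical illustration of that idea using a single global threshold, not the authors' implementation (which operates as a filter during preprocessing):

```python
import numpy as np

def chebyshev_binarize(image, k=2.0):
    """Binarize a grayscale image by flagging pixels that deviate
    from the global mean by more than k standard deviations.
    By Chebyshev's inequality, at most 1/k^2 of any distribution
    lies that far from the mean, so such pixels are unlikely to be
    background. Returns True where a pixel is classified as ink."""
    img = np.asarray(image, dtype=np.float64)
    mu = img.mean()
    sigma = img.std()
    # Ink is darker than the page, so keep only low-side outliers.
    return img < (mu - k * sigma)

# Toy page: light background (200) with one dark stroke (30).
page = np.full((8, 8), 200.0)
page[3, 2:6] = 30.0
mask = chebyshev_binarize(page, k=2.0)
```

A practical variant would compute the statistics over a local window rather than the whole image, which is closer to a filter and more robust to uneven illumination.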

10:40
Post-Processing OCR Text using Web-Scale Corpora

ABSTRACT. We introduce a (semi-)automatic OCR post-processing system that utilizes a web-scale linguistic corpus to provide high-quality corrections. This paper is a comprehensive system overview, focusing on the computational procedure, the applied linguistic analysis, and batch-processing optimization.

10:42
Qqmbr and indentml: extensible mathematical publishing for web and paper
SPEAKER: Ilya Schurov

ABSTRACT. We present qqmbr, a novel publishing system aimed at the preparation of high-quality mathematical publications. One source can be converted into a single interactive webpage, a multipage website or a PDF (via LaTeX). The markup language behind qqmbr, called indentml, is designed to be both human-readable and machine-readable (easily parsable). It is possible to extend the basic qqmbr markup with custom tags that enrich its semantics, and to build plugins and applications that query qqmbr documents, extract information from them and process it in arbitrary ways without much effort.

10:44
The Common Fold: Utilizing the Four-Fold to Dewarp Printed Documents from a Single Image

ABSTRACT. Handheld cameras are currently the device of choice for document digitization, due to their convenience, ubiquity and high performance at low cost. Software methods process a captured image to rectify distortions and reconstruct the original document. Existing methods struggle to reconstruct a flattened version from a single image of a document distorted by folding. We propose a novel non-parametric page dewarping approach that works from a single image, using deep learning to identify the creases produced by folds in the paper. Our method then applies a 2D boundary method based on polynomial regression, together with a Coons patch, to obtain a flattened reconstruction. We found that our method improves OCR word accuracy by nearly 2.5 times compared to the original distorted image.
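
As background to the reconstruction step: a bilinearly blended Coons patch interpolates a surface from its four boundary curves, which is a standard way to fill a page interior once the boundary has been fitted (here, by polynomial regression). The sketch below is a generic illustration of the Coons construction, not the authors' code; the boundary curves are hypothetical:

```python
import numpy as np

def coons_patch(c0, c1, d0, d1, u, v):
    """Evaluate a bilinearly blended Coons patch at (u, v) in [0,1]^2.
    c0, c1: bottom/top boundary curves as functions u -> point.
    d0, d1: left/right boundary curves as functions v -> point.
    The curves must agree at the corners, e.g. c0(0) == d0(0)."""
    c0u, c1u = np.asarray(c0(u)), np.asarray(c1(u))
    d0v, d1v = np.asarray(d0(v)), np.asarray(d1(v))
    # Two ruled surfaces, each blending a pair of opposite boundaries...
    ruled_c = (1 - v) * c0u + v * c1u
    ruled_d = (1 - u) * d0v + u * d1v
    # ...minus the bilinear interpolation of the four corners,
    # which would otherwise be counted twice.
    corners = ((1 - u) * (1 - v) * np.asarray(c0(0)) +
               u * (1 - v) * np.asarray(c0(1)) +
               (1 - u) * v * np.asarray(c1(0)) +
               u * v * np.asarray(c1(1)))
    return ruled_c + ruled_d - corners

# Sanity check: unit-square boundaries reproduce the identity mapping.
c0 = lambda u: (u, 0.0)      # bottom edge
c1 = lambda u: (u, 1.0)      # top edge
d0 = lambda v: (0.0, v)      # left edge
d1 = lambda v: (1.0, v)      # right edge
p = coons_patch(c0, c1, d0, d1, 0.25, 0.5)
```

In a dewarping pipeline the boundary curves would instead come from the fitted page edges, and evaluating the patch over a grid yields the sampling map used to flatten the image.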

10:46
Improving Version-Aware Word Documents

ABSTRACT. Coakley et al. described how they developed Version Aware Word Documents, which is an enhanced document representation that includes a detailed version history that is self-contained and portable. However, they were not able to adopt the unique-ID-based techniques that have been shown to support efficient merging and differencing algorithms.

This application note describes how it is possible to adapt existing features of MS Word's OOXML representation to provide a system of unique element IDs suitable for those algorithms. This requires taking over Word's Revision Save ID (RSID) system and also defining procedures for specifying ID values for elements that do not support the RSID mechanism. Important limitations remain but appear surmountable.

10:48
Classification of MathML Expressions Using Multilayer Perceptron

ABSTRACT. MathML (Mathematical Markup Language) consists of two sets of elements: Presentation Markup and Content Markup. The former is more widely used to display math expressions in Web pages, while the latter is better suited to the calculation of math expressions. In this paper, we consider classifying math expressions in Presentation Markup. In general, a math expression in Presentation Markup cannot be uniquely converted into the corresponding expression in Content Markup. If the class of a given math expression can be identified automatically, such conversions can be done more appropriately. Moreover, identifying the class of a given math expression is useful for text-to-speech of math expressions. In this paper, we propose a method for classifying math expressions in Presentation Markup using a kind of deep learning: the multilayer perceptron. Experimental results show that our method classifies math expressions with high accuracy.
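
For readers unfamiliar with the classifier: a multilayer perceptron composes linear maps with nonlinearities, which lets it represent functions (such as XOR) that no single-layer perceptron can. The sketch below is a generic illustration with hand-set weights, not the authors' MathML classifier, whose features and architecture the abstract does not specify:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: h = relu(W1 @ x + b1), output = W2 @ h + b2."""
    h = relu(W1 @ x + b1)
    return W2 @ h + b2

# Hand-set weights computing XOR, the classic function that
# motivates hidden layers: out = h1 - 2*h2 with
# h1 = relu(x1 + x2) and h2 = relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, -2.0]])
b2 = np.array([0.0])

outputs = [float(mlp_forward(np.array(x, dtype=float), W1, b1, W2, b2)[0])
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

In practice the weights are learned by gradient descent rather than set by hand, and the inputs would be feature vectors extracted from the Presentation Markup tree.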

11:00-12:00 Coffee Break
12:00-12:45 Session 10

Document Analysis: classification and similarity

12:00
Learning before Learning: Reversing Validation and Training

ABSTRACT. In the world of ground truthing—that is, the collection of highly valuable labeled training and validation data—there is a tendency to follow the path of first training on a set of data, then validating, and then testing. However, in many cases the labeled training data is of non-uniform quality, and thus of non-uniform value for assessing the accuracy and other performance indicators of analytics algorithms, systems and processes. This means that one or more of the so-labeled classes is likely a mixture of two or more clusters or sub-classes. These data may inhibit our ability to assess the classifier to use for deployment. We argue that one must learn about the labeled data before it can be used for downstream machine learning; that is, we reverse the validation and training steps in building the classifier. This “learning before learning” is assessed using a CNN corpus (cnn.com) which was hand-labeled as comprising 12 classes. We show how the suspect classes are identified using the initial classification, and how pruning of low-quality classes proceeds through a simple but data-structurally-relevant figure of merit. We then apply this process to the CNN corpus and show that it consists of nine high-quality classes and three mixed-quality classes. The effects of this validation-training approach are then shown and discussed.

12:15
Detecting In-line Mathematical Expressions in Scientific Documents

ABSTRACT. One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and then apply a conditional random field (CRF) to math-span identification, using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from the PDF documents. Although our method is naive and trained on a small amount of annotated data, it achieved an 80.30% F-measure, compared to 20.34% for existing Math OCR software.

12:30
A High Performance Computational Framework for Phrase Relatedness

ABSTRACT. TrWP is a text relatedness measure that computes semantic similarity between words and phrases utilizing aggregated statistics from the Google Web-1T corpus. The phrase similarity computation in TrWP has significant overhead in time and memory cost, making TrWP impractical for real-world usage. In this work, we present an in-memory computational framework for TrWP, which optimizes the corpus search by perfect hash indexing and minimizes the required memory cost by variable length encoding. Using the Google Web 1T 5-gram corpus, we demonstrate that the computational speed of our framework outperforms a file-based implementation by several orders of magnitude.
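
The variable-length encoding mentioned above can be illustrated with a standard varint scheme (7 data bits per byte, continuation flag in the high bit), which stores small n-gram counts in a single byte instead of a fixed-width word. This is a generic sketch of the technique, not the TrWP implementation:

```python
def varint_encode(n):
    """Encode a non-negative integer as a variable-length byte string:
    7 data bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)

def varint_decode(data, pos=0):
    """Decode one varint starting at `pos`; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            return value, pos

# Counts below 128 take one byte instead of a fixed 4 or 8,
# which matters when millions of n-gram counts are held in memory.
encoded = b"".join(varint_encode(c) for c in [3, 300, 70000])
```

Pairing such an encoding with a perfect hash index, as the abstract describes, gives constant-time lookup over the compressed counts without decoding the whole table.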

13:00-14:15 Lunch Break
14:15-16:00 Session 12

Document Analysis: content analysis

14:15
Automatic Knowledge Base Construction from Scholarly Documents

ABSTRACT. The continuing growth of published scholarly content on the web ensures the availability of the most recent scientific findings to researchers. Scientific information extraction from these documents into a structured knowledge graph representation facilitates automated machine understanding of the documents. Traditional information extraction approaches, which require either training samples or a preexisting knowledge base to assist in the extraction, can be challenging when applied to such repositories. Labeled training examples are difficult to obtain at such a large scale. Also, most available knowledge bases are built from web data and do not have sufficient coverage to include concepts found in scientific articles. In this paper we aim to construct a knowledge graph from scholarly documents while addressing both these issues. We propose a fully automatic, unsupervised system for scientific information extraction that does not build on an existing knowledge base and avoids manually-tagged training data. We describe and evaluate the taxonomy graph that results from applying our approach to 10k documents.

14:30
Clinically Significant Information Extraction from Radiology reports

ABSTRACT. Radiology reports are among the most important medical documents that a diagnostician looks into, especially in emergency situations. They provide emergency physicians with critical information regarding the condition of the patient and help physicians take immediate action on urgent conditions. However, the reports are complex and unstructured.

We developed a machine learning system to efficiently extract the clinically significant parts of radiology reports and their level of importance. The system also classifies the overall report as critical or non-critical, which helps radiologists identify potential high-priority reports. As a starting point, the system uses de-identified chest X-ray reports of adults and provides doctors with three levels of medical phrases, namely highly critical conditions, critical conditions and non-critical conditions.

Our model uses a CRF to identify clinically significant phrases, with an average F1-score of 75.5%. The CRF model is used as a filter in a web interface that highlights the medical phrases and their criticality level for the emergency physician. The overall classification of the report is determined using stochastic gradient descent, with features drawn from the phrases extracted by the CRF model, yielding an average accuracy of 85%.

15:00
Towards a Transcription System of Sign Language Video Resources via Motion Trajectory Factorisation

ABSTRACT. Sign languages are visual languages used by the Deaf community for communication purposes. Whilst recent years have seen a high growth in the quantity of sign language video collections available online, much of this material is hard to access and process due to the lack of associated text-based tagging information and because `extracting' content directly from video is currently still a very challenging problem. Also limited is the support for the representation and documentation of sign language video resources in terms of sign writing systems. In this paper we start with a brief survey of existing sign language technologies and we assess their state of the art from the perspective of a sign language digital information processing system. We then introduce our work, focusing on vision-based sign language recognition. We apply the factorisation method to sign language videos in order to factor out the signer's motion from the structure of the hands. We then model the motion of the hands in terms of a weighted combination of linear trajectory basis and apply a set of classifiers on the basis weights for the purpose of recognising meaningful phonological elements of sign language. We demonstrate how these classification results can be used for transcribing sign videos into a written representation for annotation and documentation purposes. Results from our evaluation process indicate the validity of our proposed framework.

15:30
The intangible nature of drama documents: an FRBR view

ABSTRACT. As a pervasive form of artistic expression across ages and media, drama features a twofold nature: its tangible manifestations (theatrical performances, movies, books, etc.) and its intangible abstraction (the story of Cinderella underlying both Disney's movie and Perrault's fable). Encoding the intangible abstraction of drama documents is relevant for the preservation of cultural heritage and for didactics and research on drama documents. This paper addresses the task of encoding the notion of an intangible story abstraction from drama documents. The reference model is provided by a computational ontology that formally encodes the elements that characterize a drama, for purposes of semantic linking and inclusion in annotation schemata. By providing a formal expression situated between drama as work and its manifestations, the ontology-based representation is compliant with the model of Functional Requirements for Bibliographic Records (FRBR).

16:00-17:00 Coffee Break
17:00-17:30 Session 13

SIGWEB Presentation