DOCENG 2019: 19TH ACM SYMPOSIUM ON DOCUMENT ENGINEERING
PROGRAM FOR TUESDAY, SEPTEMBER 24TH

10:30-11:00 Coffee Break
11:00-12:30 Session 8
11:00
An Effective Scheme for Generating an Overview Report over a Very Large Corpus of Documents

ABSTRACT. Efficiently generating an accurate, well-structured overview report (ORPT) over thousands of documents is challenging. A well-structured ORPT is divided into sections of multiple levels (e.g., a two-level structure consists of sections and subsections). None of the existing multi-document summarization (MDS) algorithms is suitable for accomplishing this task. To overcome this obstacle, we devise NDORGS (Numerous Documents’ Overview Report Generation Scheme), which integrates text filtering, keyword scoring, single-document summarization (SDS), topic modeling, MDS, and title generation to generate a coherent, well-structured ORPT. We then present a multi-criteria evaluation method that uses text mining and multi-attribute decision making over a combination of human judgments, running time, information coverage, and topic diversity. We evaluate ORPTs generated by NDORGS on two large corpora of documents, one classified and the other unclassified. We show that, using Saaty’s pairwise comparison 9-point scale and TOPSIS, the ORPTs generated on SDSs whose length is 20% of the original documents are the best overall on both datasets.
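
The TOPSIS step mentioned above can be illustrated with a short sketch (not the authors' implementation; the criteria weights and scores below are invented): alternatives are ranked by their relative closeness to an ideal solution across weighted criteria.

    # Minimal TOPSIS sketch in Python; scores and weights are hypothetical.
    import numpy as np

    def topsis(matrix, weights, benefit):
        """Rank alternatives (rows) on criteria (columns).

        matrix  -- raw scores, shape (alternatives, criteria)
        weights -- criterion weights summing to 1
        benefit -- True where larger is better, False where smaller is better
        """
        m = matrix / np.linalg.norm(matrix, axis=0)   # vector-normalize columns
        v = m * weights                               # apply criterion weights
        best = np.where(benefit, v.max(axis=0), v.min(axis=0))
        worst = np.where(benefit, v.min(axis=0), v.max(axis=0))
        d_best = np.linalg.norm(v - best, axis=1)
        d_worst = np.linalg.norm(v - worst, axis=1)
        return d_worst / (d_best + d_worst)           # closeness: higher is better

    # Four ORPT variants scored on human judgment, coverage, and diversity
    # (benefit criteria) and running time in seconds (cost criterion).
    scores = np.array([[7.0, 0.82, 0.60, 120.0],
                       [8.0, 0.85, 0.64,  95.0],
                       [6.5, 0.80, 0.58, 200.0],
                       [7.5, 0.83, 0.61, 150.0]])
    ranking = topsis(scores, np.array([0.4, 0.3, 0.2, 0.1]),
                     np.array([True, True, True, False]))
    print(ranking.argsort()[::-1])                    # best alternative first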

11:25
The CNN-Corpus: a Large Textual Corpus for Single-Document Extractive Summarization

ABSTRACT. This paper details the features and the methodology adopted in the construction of the CNN-corpus, a test corpus for single-document extractive text summarization of news articles. The current version of the CNN-corpus encompasses 3,000 texts in English, each of which has an abstractive and an extractive summary. The corpus allows quantitative and qualitative assessments of extractive summarization strategies.

11:50
A Cell-Detection-Based Table Structure Recognition Method

ABSTRACT. If table structures are automatically recognized, the numerical values in tables can be extracted, and digital documents containing such tables can be augmented with graphs generated from them. In this paper, we propose a cell-detection-based table-structure recognition method for such automatic graph generation from tables. In detecting cells in a table, ruled lines are crucial but do not necessarily surround all cells. We therefore propose a method that detects cells by estimating implicit ruled lines, where necessary, to recognize the table structure. We demonstrate the effectiveness of the proposed method through experiments on the ICDAR 2013 table competition dataset.
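
The implicit-ruled-line idea can be pictured with a toy sketch (an assumed simplification, not the paper's method): project all word bounding boxes onto the x-axis and treat sufficiently wide whitespace runs in the joint projection as implicit vertical separators between columns.

    # Toy sketch: estimate implicit vertical ruled lines from word boxes.
    def implicit_vlines(word_boxes, min_gap=5):
        """word_boxes: (x0, y0, x1, y1) tuples; returns x-centres of column gaps."""
        x_min = int(min(b[0] for b in word_boxes))
        x_max = int(max(b[2] for b in word_boxes))
        covered = [False] * (x_max - x_min)
        for x0, _, x1, _ in word_boxes:
            for x in range(int(x0) - x_min, int(x1) - x_min):
                covered[x] = True                 # mark pixels under any word
        lines, start = [], None
        for x, c in enumerate(covered):
            if not c and start is None:
                start = x                         # a whitespace run begins
            elif c and start is not None:
                if x - start > min_gap:           # ignore small inter-word gaps
                    lines.append(x_min + (start + x) // 2)
                start = None
        return lines

    # Two rows, two columns; the shared gap around x = 70 becomes a separator.
    boxes = [(10, 0, 60, 10), (80, 0, 140, 10),
             (12, 20, 55, 30), (82, 20, 130, 30)]
    print(implicit_vlines(boxes))                 # -> [70]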

12:10
XLIndy: Interactive Recognition and Information Extraction in Spreadsheets

ABSTRACT. Over the years, spreadsheets have established their presence in many domains, including business, government, and science. However, challenges arise because spreadsheets are partially structured and carry implicit (visual and textual) information. This translates into a bottleneck when it comes to automatic analysis and extraction of information. Therefore, we present XLIndy, a Microsoft Excel add-in with a machine-learning back-end written in Python. It showcases our novel methods for layout inference and table recognition in spreadsheets. For a selected task and method, users can visually inspect the results, change configurations, and compare different runs. This enables iterative fine-tuning. Additionally, users can manually revise the predicted layout and tables, and subsequently save them as annotations. These annotations are used to measure performance and (re-)train classifiers. Finally, data in the recognized tables can be extracted for further processing. XLIndy supports several standard formats, such as CSV and JSON.

12:30-14:00 Lunch Break (Steering Committee Lunch)
14:00-15:30 Session 9
14:00
Augmenting Music Sheets with Harmonic Fingerprints

ABSTRACT. Common Music Notation (CMN) is the well-established foundation for the written communication of musical information, such as rhythm or harmony. CMN suffers from the complexity of its visual encoding and the need for extensive training to acquire proficiency and legibility. While alternative notations using additional visual variables (e.g., color to improve pitch identification) have been proposed, the community does not readily accept notation systems that vary widely from the CMN. Therefore, to support student musicians in understanding harmonic relationships, instead of replacing the CMN, we present a visualization technique that augments digital sheet music with a harmonic fingerprint glyph. Our design exploits the circle of fifths, a fundamental concept in music theory, as a visual metaphor. By attaching such glyphs to each bar of a composition, we provide additional information about the salient harmonic features of a musical piece. We conducted a user study to analyze the performance of experts and non-experts in identification and comparison tasks involving recurring patterns. The evaluation shows that the harmonic fingerprint supports these tasks without the need for close reading, compared to a non-annotated music sheet.
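
The data behind such a glyph can be pictured as a pitch-class histogram whose bins are ordered by the circle of fifths. The sketch below (a hypothetical reading of the idea, using MIDI note numbers) computes one fingerprint per bar.

    # Hypothetical per-bar fingerprint: pitch-class histogram in fifths order.
    FIFTHS = [(i * 7) % 12 for i in range(12)]     # C, G, D, A, E, B, F#, ...
    NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

    def fingerprint(midi_notes):
        """midi_notes: MIDI numbers of all notes occurring in one bar."""
        hist = [0] * 12
        for n in midi_notes:
            hist[n % 12] += 1                      # fold octaves to pitch class
        total = sum(hist) or 1
        return [hist[pc] / total for pc in FIFTHS] # reorder by fifths, normalize

    bar = [60, 64, 67, 72]                         # C major triad plus octave
    fp = fingerprint(bar)
    print([(NAMES[pc], round(w, 2)) for pc, w in zip(FIFTHS, fp) if w])
    # -> [('C', 0.5), ('G', 0.25), ('E', 0.25)]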

14:25
Writer Characterization and Identification in Short Modern and Historical Documents: Reconsidering Paleographic Tables

ABSTRACT. Handwriting is considered a unique “fingerprint” that characterizes a scribe (it is even used as evidence in modern forensics). In paleography (the study of ancient writing), it is presumed that each writer has one prototype for each letter in the alphabet. Commonly, for ancient inscriptions, letters are organized into paleographic tables (where the rows are the alphabet letters and the columns represent the examined inscriptions). These tables play a significant role in dating inscriptions based on their resemblance to columns in the table. In this paper, we argue that each scribe’s “fingerprint” is not represented by a single character prototype, but in fact by a distribution of characters. We introduce a framework for automatically identifying the writer style and constructing paleographic tables based on character histograms. Subsequently, we propose a method for comparing short documents utilizing letter distributions. We demonstrate the validity of the methods on two handwritten datasets: Modern and Ancient Hebrew pertaining to the First Temple period. Applied to the ancient dataset, our methodology enables us to provide additional evidence concerning the level of literacy in the kingdom of Judah ca. 600 BCE.
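
The distribution-comparison step can be sketched as follows (the features and distance here are illustrative stand-ins: plain letter frequencies compared by Jensen-Shannon divergence, not the paper's character histograms).

    # Compare two short documents by their letter distributions.
    from collections import Counter
    import math

    def letter_dist(text, alphabet):
        counts = Counter(c for c in text if c in alphabet)
        total = sum(counts.values()) or 1
        return [counts[a] / total for a in alphabet]

    def jsd(p, q):
        """Jensen-Shannon divergence; 0 means identical distributions."""
        def kl(a, b):
            return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
        m = [(x + y) / 2 for x, y in zip(p, q)]
        return (kl(p, m) + kl(q, m)) / 2

    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    p = letter_dist('send more troops to the fortress', alphabet)
    q = letter_dist('the harvest tax was paid in full', alphabet)
    print(f'divergence: {jsd(p, q):.3f}')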

14:45
Digital Degree Certificates for Higher Education in Brazil

ABSTRACT. Higher-education degree certificates in Brazil are a tool for social mobility. Access to higher education is still an issue in a developing economy of continental size with historic inequalities. Some people see this combination as an opportunity to exploit the system, producing fake degree certificates or issuing official degree certificates to people who did not enrol in courses. Degree certificates can easily be bought in the country, and they deliver the desired social ascension. To tackle this, the Brazilian Ministry of Education enacted a regulation instituting the Digital Degree Certificate for higher education. The regulation only specifies that the degree certificates must be digitally signed with the country’s official PKI; it does not provide the technical details of how this can be implemented. We discuss the problems of the degree-certificate black market in Brazil, its social consequences, and how a technical specification can be conceived and put into practice following the ministerial regulation. The outcome of this research is a proposal for implementing digitally signed degree certificates that fulfil the legal requirements in Brazil, can be easily integrated with computerized information systems, and can be maintained securely in the long term.
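
The core signing operation can be sketched with the Python cryptography package (illustrative only: a real deployment would use ICP-Brasil certified keys and a standardized signature container rather than a freshly generated key pair).

    # Sign and verify a certificate document; the key and payload are
    # placeholders, not a real ICP-Brasil workflow.
    from cryptography.hazmat.primitives.asymmetric import rsa, padding
    from cryptography.hazmat.primitives import hashes

    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    certificate_bytes = b'<degreeCertificate>...</degreeCertificate>'

    signature = private_key.sign(
        certificate_bytes,
        padding.PKCS1v15(),
        hashes.SHA256(),
    )

    # verify() raises InvalidSignature if the document was altered.
    private_key.public_key().verify(
        signature, certificate_bytes, padding.PKCS1v15(), hashes.SHA256())
    print('signature valid')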

15:10
An Exploratory Analysis of Precedent Relevance in the Brazilian Supreme Court Rulings

ABSTRACT. The new Brazilian Code of Civil Procedure (CPC) has elevated the importance of precedents in the legal decision-making process. This has increased the need to find relevant precedents for a given issue or dispute. Precedents play a central role in judicial thinking by providing information to judges about the legal relevance of particular facts and by establishing legal rules. Precedents are also an important argumentative tool, enabling lawyers to present arguments based on previous decisions. The automated search for relevant precedents is an unattended issue in the Brazilian scenario, partly due to the court’s massive production of decisions — in 2018 alone, the Brazilian Supreme Court (STF) produced more than 121,000 new rulings — and partly due to the technical challenges arising from the unstructured nature of the court’s practices. In this paper, we present a study of precedent relevance, taking into account the uniqueness of the Brazilian legal system and of the STF. To do so, we conducted an exploratory investigation over the precedent network extracted from 1,152,963 decisions published by the STF between 2008 and 2018. This exploratory analysis, although interesting in itself, reveals important challenges that future research must overcome before the technology can have its full impact on legal practice and academia. In our conclusion, we set out possible paths forward, briefly considering some of the most promising ways to separate the signal from the noise.
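
One common way to turn such a precedent network into relevance scores is PageRank over the citation graph, as in the sketch below (networkx; the ruling identifiers and edges are fabricated for illustration).

    # Rank rulings by citation structure: ruling A citing B adds edge A -> B.
    import networkx as nx

    citations = [('HC-101', 'RE-55'), ('HC-102', 'RE-55'),
                 ('RE-55', 'ADI-9'), ('HC-103', 'ADI-9')]

    g = nx.DiGraph(citations)
    relevance = nx.pagerank(g, alpha=0.85)
    for ruling, score in sorted(relevance.items(), key=lambda kv: -kv[1]):
        print(f'{ruling}: {score:.3f}')   # frequently cited rulings rank high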

15:30-16:00 Coffee Break
16:45-17:30 Session 12: Lightning Talks
16:45
Semi-Automatic LaTeX-Based Labeling of Mathematical Objects in PDF Documents: MOP Data Set

ABSTRACT. Identifying mathematical objects (MOs) in PDF documents is paramount to understanding the ontology and mathematical essence of published science, technology, engineering, and mathematics (STEM) documents. As of now, Marmot is the only publicly available data set for optimizing and evaluating MO-labeling models in PDF documents. Thus, this paper proposes a semi-automatic MO-labeling algorithm that uses PDF documents and their corresponding LaTeX source files to generate a new data set consisting of MO bounding boxes (Bboxes) in PDF documents, together with their LaTeX equation, topic, and subject. The first step in labeling each MO is to transform the LaTeX and PDF documents into a string format. Afterwards, a shortest-unique-string-matching technique is proposed to align PDF pages with LaTeX files. On each page, a similar shortest-string-matching technique is employed to align each LaTeX MO with its PDF counterpart. Once an MO is located, the PDF and LaTeX MOs are normalized in order to match symbols between their LaTeX and PDF representations. A number of filtering rules are set to eliminate matches that are considered exceedingly inconsistent. Matches that pass these rules have their MOs highlighted for final manual inspection. A total of 1,802 pages in the high-energy physics (hep-th) field were labelled.
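
The shortest-unique-match idea can be sketched as follows (an assumed reading of the technique): grow a query taken from one text until it occurs exactly once in the other, and use that occurrence as the alignment anchor.

    # Find the shortest prefix of `needle_src` that is unique in `haystack`.
    def shortest_unique_match(needle_src, haystack, min_len=4):
        """Return (position_in_haystack, query_length) or None."""
        for length in range(min_len, len(needle_src) + 1):
            query = needle_src[:length]
            first = haystack.find(query)
            if first == -1:
                return None                       # texts diverge; no anchor here
            if haystack.find(query, first + 1) == -1:
                return first, length              # unique occurrence found
        return None

    latex = 'alpha beta gamma alpha beta delta'
    print(shortest_unique_match('alpha beta delta', latex))   # -> (17, 12)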

16:45
A Hybrid AI Tool to Extract Key Performance Indicators from Financial Reports for Benchmarking

ABSTRACT. We present a tool that enables benchmarking of companies by means of automatic extraction of key performance indicators from publicly available financial reports. Our tool monitors companies of interest so that their reports are automatically downloaded as soon as they become available. After tables and paragraphs have been extracted from the documents using a table-detection module based on convolutional neural networks, relevant key performance indicators are stored in a central database. The extracted values are finally displayed in a user-friendly web application where the user can compare time series of key performance indicators across any of the available companies.
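
At its simplest, the value-extraction step can be pictured as pattern matching over text recovered from a report (the real pipeline works on the output of the CNN table-detection module; the pattern, names, and figures below are invented).

    # Toy KPI extraction from running text with a regular expression.
    import re

    text = 'Revenue: 1,250.3 mEUR. EBITDA: 310.5 mEUR. Equity ratio: 42.1 %.'
    pattern = re.compile(
        r'(?P<kpi>[A-Z][\w ]+?):\s*(?P<value>[\d,.]+)\s*(?P<unit>mEUR|%)')
    for m in pattern.finditer(text):
        print(m.group('kpi'), '->', m.group('value'), m.group('unit'))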

16:45
Combining Word Embeddings with Taxonomy Information for Multi-Label Document Classification

ABSTRACT. In business contexts, documents often need to be classified using company-specific taxonomies. Text-classification approaches based on word embeddings have become increasingly popular, as they enable words, documents, and tags to be represented in a semantically robust way (as distributed representations of their contexts) and make documents and tags processable in an algebraic vector space. However, these distributed representations of contexts have their shortcomings when used for multi-label classification tasks: the more similar the contexts of two tags, the more difficult they are to separate in classification. Intensified by poor training data, poor training, or inherent limitations of the word-embedding approach, in practice we find areas of indistinguishability, leading to false-positive predictions (typically in leaf tags of a taxonomy tree). We contribute an approach that tackles the problem of indistinguishable areas for multi-label classification tasks based on word embeddings by including taxonomy information during prediction.
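
One way to picture "including taxonomy information during prediction" is to blend a leaf tag's embedding similarity with that of its taxonomy parent, so that sibling leaves with near-identical contexts are pulled apart by their ancestors (a hypothetical simplification, not the paper's exact model).

    # Blend leaf-tag scores with parent-tag scores; vectors are random stand-ins.
    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def taxonomy_scores(doc_vec, tag_vecs, parent, alpha=0.7):
        """tag_vecs: tag -> embedding; parent: tag -> parent tag or None."""
        base = {t: cos(doc_vec, v) for t, v in tag_vecs.items()}
        return {t: alpha * s + (1 - alpha) * (base[parent[t]] if parent[t] else s)
                for t, s in base.items()}

    rng = np.random.default_rng(0)
    vecs = {t: rng.normal(size=50) for t in ('finance', 'tax', 'audit')}
    parents = {'finance': None, 'tax': 'finance', 'audit': 'finance'}
    doc = vecs['tax'] + 0.1 * rng.normal(size=50)   # a document about 'tax'
    print(taxonomy_scores(doc, vecs, parents))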

16:45
The CNN-Corpus in Spanish: a Large Corpus for Extractive Text Summarization in the Spanish Language

ABSTRACT. This paper details the development and features of the CNN-corpus in Spanish, possibly the largest test corpus for single-document extractive text summarization in the Spanish language. Its current version encompasses 1,117 well-written texts in Spanish, each of which has an abstractive and an extractive summary. The development methodology adopted allows high-quality qualitative and quantitative assessments of summarization strategies for tools developed for the Spanish language.

16:45
Enhancing Document-Camera Images

ABSTRACT. Document-camera digitization devices are low-cost and easy to use; they produce good-quality images and can digitize pages of bound books without damaging their spines. On the other hand, they may introduce two serious problems. The first appears if the document to be digitized is printed on glossy paper: the paper reflects the various illumination sources in the environment, producing specular noise in the document image. The second problem occurs when the document does not lie flat on the digitization surface. This paper presents solutions to both problems. The results obtained on almost 400 test images are satisfactory.
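
For the glossy-paper case, a standard mask-and-inpaint baseline in OpenCV looks like the sketch below (the thresholds and file names are illustrative; the paper's own algorithm is not reproduced here).

    # Suppress specular highlights: mask near-saturated pixels, then inpaint.
    import cv2
    import numpy as np

    img = cv2.imread('glossy_page.jpg')           # hypothetical input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Very bright, low-saturation pixels are likely glare from light sources.
    mask = ((gray > 230) & (hsv[:, :, 1] < 40)).astype(np.uint8) * 255
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8))

    restored = cv2.inpaint(img, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
    cv2.imwrite('restored_page.jpg', restored)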

16:45
The Next Millennium Document Format

ABSTRACT. Most of today's leading document formats have their roots in the eighties. Their design was built upon the requirements of those days: to represent the document state on a single machine, or to exchange a document by floppy disk or modem. They were often designed for a single purpose far narrower than their current usage, and new features were often accomplished by workarounds. For example, the change-tracking of any office format does not track a defined, interoperable change; only the earlier state of the changed area is stored, to be swapped back in case of rejection. Nowadays, with the rise of mobile devices, online collaboration is ubiquitous and creates challenges when dealing with documents designed for an environment from the eighties.

In this paper, we lay out a concept for how to evolve a new document format that allows not only collaboration but also responsiveness and interoperability by design.
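
A change recorded as a self-describing, replayable operation, rather than as a stored prior state, might look like this sketch (an illustrative data model, not a proposed standard).

    # Operation-based change tracking: edits are replayable and invertible.
    from dataclasses import dataclass, field
    import time

    @dataclass
    class Op:
        author: str
        kind: str                 # 'insert' or 'delete'
        pos: int                  # character offset in the document
        text: str                 # inserted or removed text
        stamp: float = field(default_factory=time.time)

    def apply(doc, op):
        if op.kind == 'insert':
            return doc[:op.pos] + op.text + doc[op.pos:]
        return doc[:op.pos] + doc[op.pos + len(op.text):]

    def invert(op):               # rejecting a change = applying its inverse
        kind = 'delete' if op.kind == 'insert' else 'insert'
        return Op(op.author, kind, op.pos, op.text, op.stamp)

    edit = Op('alice', 'insert', 5, ', brave')
    doc = apply('Hello world', edit)
    print(doc)                        # Hello, brave world
    print(apply(doc, invert(edit)))   # Hello world (change rejected)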

16:45
Towards Automated Auditing with Machine Learning

ABSTRACT. We present the Automated List Inspection (ALI) tool, which utilizes methods from machine learning and natural language processing, combined with domain-expert knowledge, to automate financial statement auditing. ALI is a content-based, context-aware recommender system that matches relevant text passages from the notes to the financial statements to specific law regulations. In this paper, we present the architecture of the recommender tool, which includes text mining, language modeling, and unsupervised and supervised methods that range from binary classification models to deep recurrent neural networks. Alongside our main findings, we present quantitative and qualitative comparisons of the algorithms, as well as concepts for how to further extend the functionality of the tool.
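
The matching step can be pictured as ranking regulations by textual similarity to a note passage, as in the scikit-learn sketch below (TF-IDF cosine similarity; the texts are invented placeholders, and ALI's actual models are richer).

    # Rank candidate regulations for one passage by TF-IDF cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    regulations = [
        'Deferred tax assets shall be disclosed in the notes.',
        'Provisions for pensions must state the actuarial method used.',
        'Inventories are measured at the lower of cost and net realisable value.',
    ]
    passage = ['The notes disclose deferred tax assets from loss carryforwards.']

    vec = TfidfVectorizer(stop_words='english')
    reg_matrix = vec.fit_transform(regulations)
    scores = cosine_similarity(vec.transform(passage), reg_matrix)[0]
    best = scores.argmax()
    print(f'best match ({scores[best]:.2f}): {regulations[best]}')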