DOCENG 2019: 19TH ACM SYMPOSIUM ON DOCUMENT ENGINEERING
PROGRAM FOR THURSDAY, SEPTEMBER 26TH

09:30-11:00 Session 21
09:30
On the Expressive Power of Declarative Constructs in Interactive Document Scripts

ABSTRACT. It is difficult to compare the succinctness of declarative and imperative programming in general, because source code size varies. In imperative programs, basic operations have constant cost, but they tend to be more verbose than declarative programs, which increases the potential for defects. This paper presents a novel approach for a generalized comparison by transforming the problem into comparing the executed code size of a benchmark imperative algorithm with that of a partially declarative variant of the same algorithm. This allows input size variation to substitute for source code size variation. For the implementation, we use a multiparadigm language called XForms that contains both declarative XPath expressions and imperative script actions for interacting with XML data within web and office documents. A novel partially declarative variant of quicksort is presented. Amortized analysis shows that only O(n) imperative actions are executed, so the expressive power of the declarative constructs is at least Ω(log n). In general, declarative constructs can have an order-of-magnitude expressive power advantage over basic imperative operations alone. The performance cost factor of the expressive power advantage was determined to be O(log² n), based on a novel dynamic projection from the generalized tree structure of XML data to a height-balanced binary tree.
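The executed-action comparison can be illustrated with a small sketch (plain Python standing in for XForms, with invented action counters; this is not the paper's code): partitioning is either performed element by element via imperative actions, or delegated to bulk declarative selections, analogous to XPath filter expressions, each counted as a single action.

```python
import random

def quicksort_imperative(xs, counter):
    """Quicksort where partitioning costs one imperative action per element."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    lo, hi = [], []
    for x in rest:                           # one imperative action per element
        counter["actions"] += 1
        (lo if x < pivot else hi).append(x)
    return quicksort_imperative(lo, counter) + [pivot] + quicksort_imperative(hi, counter)

def quicksort_declarative(xs, counter):
    """Quicksort where partitioning is two bulk declarative selections."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    counter["actions"] += 2                  # two declarative selections per call
    lo = [x for x in rest if x < pivot]      # declarative bulk selection
    hi = [x for x in rest if x >= pivot]
    return quicksort_declarative(lo, counter) + [pivot] + quicksort_declarative(hi, counter)

rng = random.Random(42)
data = list(range(200))
rng.shuffle(data)
c_imp, c_dec = {"actions": 0}, {"actions": 0}
sorted_imp = quicksort_imperative(data, c_imp)
sorted_dec = quicksort_declarative(data, c_dec)
```

On a shuffled 200-element list, the declarative variant records at most 2n counted actions (O(n), mirroring the abstract's amortized result), while the imperative variant records every comparison (Θ(n log n) on average).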

09:55
Modeling Multimodal-Multiuser Interactions in Declarative Multimedia Languages

ABSTRACT. Recent advances in hardware and software technologies have given rise to a new class of human-computer interfaces that both explore multiple modalities and allow for multiple collaborating users. When compared to the development of traditional single-user WIMP (windows, icons, menus, pointer)-based applications, however, applications supporting the seamless integration of multimodal-multiuser interactions bring new specification and runtime requirements. With the aim of assisting the specification of multimedia applications that integrate multimodal-multiuser interactions, this paper: (1) proposes the MMAM (Multimodal-Multiuser Authoring Model); (2) presents three different instantiations of it (in NCL, HTML, and a block-based syntax); and (3) evaluates the proposed model through a task-based user study. MMAM enables programmers to design and weigh different solutions for applications with multimodal-multiuser requirements. The proposed instantiations served as proofs of concept for the feasibility of implementing our model and provided the basis for practical experimentation, while the user study focused on capturing evidence of both user understanding and user acceptance of the proposed model. We asked developers to perform tasks using MMAM and then answer a TAM (Technology Acceptance Model)-based questionnaire focused on both the model and its instances. The study indicates that the participants easily understood the model (most of them performed the required tasks with minor or no errors) and found it both useful and easy to use. 94.47% of the participants gave positive answers to the block-based representation TAM questions, whereas 75.17% gave positive answers to the instance-related questions.

10:20
Sentiment Classification Improvement Using Semantically Enriched Information

ABSTRACT. The emergence of new and challenging text mining applications demands the development of novel text processing and knowledge extraction techniques. One important challenge of text mining is the proper treatment of text meaning, which may be addressed by incorporating different types of information (e.g., syntactic or semantic) into the text representation model. Sentiment classification is one such challenging application. It may be considered more complex than traditional topic classification since, although sentiment words are important, they may not be enough to correctly classify the sentiment expressed in a document. In this work, we propose a novel and straightforward method to improve sentiment classification performance using semantically enriched information derived from domain expressions, together with an improved scheme for generating these expressions. We conducted an experimental evaluation applying different classification algorithms to three datasets composed of reviews of different products and services. The results indicate that the proposed method improves classification accuracy when dealing with reviews from a narrow domain.
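A minimal sketch of the general idea (the expression list and tokens are invented for illustration; the paper's actual generation scheme is not shown): multiword domain expressions are fused into single features so a classifier can treat, say, "battery life" as one semantic unit rather than two unrelated tokens.

```python
# Invented toy list of domain expressions; a real system would generate these.
DOMAIN_EXPRESSIONS = {("battery", "life"), ("screen", "resolution")}

def enrich(tokens):
    """Fuse adjacent tokens that form a known domain expression."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in DOMAIN_EXPRESSIONS:
            out.append(tokens[i] + "_" + tokens[i + 1])  # fused domain expression
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

features = enrich("the battery life is great".split())
```

The enriched token stream can then feed any standard bag-of-words classifier.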

10:40
Impact of In-domain Vector Representations on the Classification of Disease-related Tweets: Avian Influenza Case Study

ABSTRACT. A number of methods have been proposed for constructing vector representations for natural language processing (NLP) tasks. These methods have been applied to various domains, and each has its own pros and cons. Despite their effectiveness, these approaches usually ignore the sentiment information relevant to specific tasks. In this paper, we examined various types of word vectors and their impact on the performance of a sentiment classification problem in the area of infectious diseases. The vectors were used in the embedding layer of a word-based convolutional neural network (CNN) to identify tweets pertaining to avian influenza. We propose a new approach to building effective word embeddings for the sentiment analysis task. Furthermore, the performance of the language model was compared across various corpus sizes and vector dimensions. Our experiments indicate that initializing the sentiment learning network with domain-specific word embeddings outperforms general-domain embeddings, and that the proposed method leads to a considerable improvement in classification performance.
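The initialization strategy can be sketched as follows (toy two-dimensional vectors, not trained embeddings; the lookup table and vocabulary are invented): words covered by the in-domain table get their domain vectors, while out-of-vocabulary words fall back to small random values, as is common for embedding layers.

```python
import random

# Invented in-domain vectors; a real table would come from training on a
# domain-specific corpus (e.g., disease-related tweets).
DOMAIN_VECTORS = {"influenza": [0.9, 0.1], "outbreak": [0.8, 0.2]}
DIM = 2

def build_embedding_matrix(vocab, rng):
    """Initialize embeddings from domain vectors, random fallback for OOV words."""
    matrix = {}
    for word in vocab:
        if word in DOMAIN_VECTORS:
            matrix[word] = list(DOMAIN_VECTORS[word])  # domain initialization
        else:
            matrix[word] = [rng.uniform(-0.05, 0.05) for _ in range(DIM)]
    return matrix

rng = random.Random(0)
emb = build_embedding_matrix(["influenza", "outbreak", "the"], rng)
```

This matrix would seed the CNN's embedding layer before fine-tuning.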

11:00-11:30 Coffee Break
11:30-13:00 Session 22
11:30
Using Knowledge Base Semantics in Context-Aware Entity Linking

ABSTRACT. Entity linking is a core task in textual document processing, which consists in identifying the entities of a knowledge base (KB) that are mentioned in a text. Approaches in the literature consider either independent linking of individual mentions or collective linking of all mentions. Regardless of this distinction, most approaches rely on the Wikipedia encyclopedic KB in order to improve the linking quality, by exploiting its entity descriptions (web pages) or its entity interconnections (hyperlink graph of web pages). In this paper, we devise a novel collective linking technique which departs from most approaches in the literature by relying on a structured RDF KB. This allows exploiting the semantics of the interrelationships that candidate entities may have at disambiguation time rather than relying on raw structural approximation based on Wikipedia’s hyperlink graph. The few approaches that also use an RDF KB simply rely on the existence of a relation between the candidate entities to which mentions may be linked. Instead, we weight such relations based on the RDF KB structure and propose an efficient decoding strategy for collective linking. Experiments on standard benchmarks show significant improvement over the state of the art.
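The collective scoring idea can be illustrated with a toy sketch (mentions, candidates, and weights are invented; exhaustive search here is only a stand-in for the paper's efficient decoding strategy): a joint assignment is scored by local mention-candidate compatibility plus weighted KB relations between the chosen entities.

```python
from itertools import product

# Invented toy data: per-mention candidate compatibilities and weighted
# relations between entities in an RDF knowledge base.
local = {
    "Paris": {"Paris_France": 0.6, "Paris_Hilton": 0.5},
    "Seine": {"Seine_River": 0.9},
}
relation_weight = {
    ("Paris_France", "Seine_River"): 0.8,
}

def score(assignment):
    """Local compatibility plus weighted relations among chosen entities."""
    s = sum(local[m][e] for m, e in assignment.items())
    ents = list(assignment.values())
    for a in ents:
        for b in ents:
            s += relation_weight.get((a, b), 0.0)
    return s

mentions = list(local)
best = max(
    (dict(zip(mentions, combo)) for combo in product(*(local[m] for m in mentions))),
    key=score,
)
```

Here the weighted relation tips the decision toward Paris_France even though Paris_Hilton has nearly the same local score.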

11:55
Multi-Objective GP Strategies for Topical Search Integrating Wikipedia Concepts

ABSTRACT. Genetic Programming techniques have demonstrated great potential in dealing with the problem of query generation. This work explores different Multi-Objective Genetic Programming strategies for evolving a collection of topic-based Boolean queries. It compares three approaches to building topical Boolean queries: using terms, incorporating Wikipedia semantics (Wikipedia concepts), and a hybrid approach using a combination of both terms and concepts. In addition, different fitness functions are combined, giving rise to seven multi-objective schemes. In particular, we investigate the use of the proposed strategies in conjunction with novel fitness functions aimed at attaining high diversity, based on the information-theoretic notion of entropy and on Jaccard similarity. Experiments were conducted using 25 topics from a dataset consisting of approximately 350,000 webpages classified into 448 topics. The results reveal that the use of Wikipedia concepts does not yield statistically significant improvements in precision, global recall, or diversity when compared to the term-based approaches. However, the use of concepts has a positive effect on query interpretability, since the use of terms leads to artificial queries that are hard for humans to interpret. At the same time, concept-based queries contain a smaller number of operands than term-based ones, resulting in better execution times without a loss in retrieval performance.
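The two diversity-oriented measures named in the abstract can be sketched directly (the result sets and topic counts are invented; how the paper combines these into fitness functions is not shown): Jaccard similarity between the result sets of two queries, and the Shannon entropy of how results distribute over topics.

```python
import math

def jaccard(a, b):
    """Jaccard similarity between two result sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def entropy(counts):
    """Shannon entropy (bits) of a distribution of results over topics."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

q1_results = {"d1", "d2", "d3"}     # invented result sets for two queries
q2_results = {"d2", "d3", "d4"}
sim = jaccard(q1_results, q2_results)   # overlap of 2 out of 4 documents
h = entropy([5, 5])                     # uniform over two topics -> 1 bit
```

Low pairwise Jaccard similarity and high entropy both indicate a more diverse query collection.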

12:20
Enhanced Automated Policy Enforcement eXchange Framework (eAPEX)

ABSTRACT. In this paper, we describe an enhancement of the Automated Policy Enforcement eXchange framework (APEX) called eAPEX. eAPEX uses version-control information as the basis for a more incremental approach to scanning document text to determine if security policies are being followed. Where APEX requires a full or deep scan of the document each time an exposure operation is invoked, eAPEX usually requires only a scan of changed elements (deltas). A new scanning approach was designed and implemented. eAPEX works by combining version-aware document technology with a policy database that functions as a cache of security policy results.

eAPEX is evaluated by testing an application that was created to simulate the behavior of the proposed scanning approaches. This evaluation suggests that an incremental approach to checking document security should yield noticeable performance benefits.
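The incremental idea can be sketched as follows (the policy predicate, document elements, and cache design are invented for illustration, not taken from eAPEX): per-element scan results are cached by content hash, so a later exposure operation rescans only the changed elements (deltas).

```python
import hashlib

def violates_policy(text):
    """Stand-in for a real security-policy scan of one document element."""
    return "SECRET" in text

class IncrementalScanner:
    def __init__(self):
        self.cache = {}    # element content hash -> cached scan result
        self.scanned = 0   # counts actual (non-cached) scans performed

    def scan(self, elements):
        results = []
        for text in elements:
            key = hashlib.sha256(text.encode()).hexdigest()
            if key not in self.cache:      # only changed/new elements are scanned
                self.scanned += 1
                self.cache[key] = violates_policy(text)
            results.append(self.cache[key])
        return results

s = IncrementalScanner()
doc_v1 = ["intro", "SECRET budget", "summary"]
first = s.scan(doc_v1)      # first exposure: all three elements scanned
doc_v2 = ["intro", "redacted budget", "summary"]
second = s.scan(doc_v2)     # second exposure: only the changed element scanned
```

After the edit, only one additional scan is performed, illustrating why delta scanning beats a full rescan on each exposure operation.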

12:40
Enhanced Document Retrieval and Discovery Based on a Combination of Implicit and Explicit Document Relationships

ABSTRACT. With the rapid increase in the amount of digital information we deal with in our daily work, we face significant document retrieval and discovery challenges. We present a novel document retrieval and discovery framework that addresses some of the limitations of existing solutions. An innovative aspect of our solution is the combination of implicit and explicit links between documents in the retrieval as well as the visualisation process, in order to improve document retrieval and discovery. Our framework exploits implicit relationships between documents, defined by the similarity of their content as well as their metadata, and explicit links (hyperlinks) defined between documents based on a third-party link service. Further, the software framework can be extended with arbitrary third-party visualisations. Last but not least, our search query interface offers advanced features not available in most existing document retrieval systems.
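A toy sketch of the combination idea (documents, term sets, links, and the weighting scheme are all invented; the framework's actual scoring is not described in the abstract): implicit links come from content similarity, explicit links from a hyperlink service, and both contribute to one relatedness score.

```python
# Invented toy corpus: term sets stand in for document content/metadata.
docs = {
    "a": {"document", "retrieval", "links"},
    "b": {"document", "discovery", "links"},
    "c": {"cooking", "recipes"},
}
hyperlinks = {("a", "c")}   # explicit link from a third-party link service

def implicit(d1, d2):
    """Implicit link strength: Jaccard similarity of term sets."""
    t1, t2 = docs[d1], docs[d2]
    return len(t1 & t2) / len(t1 | t2)

def related(d1, d2, alpha=0.5):
    """Blend implicit similarity and explicit hyperlink evidence."""
    explicit = 1.0 if (d1, d2) in hyperlinks or (d2, d1) in hyperlinks else 0.0
    return alpha * implicit(d1, d2) + (1 - alpha) * explicit

score_ab = related("a", "b")   # similar content, no hyperlink
score_ac = related("a", "c")   # dissimilar content, but explicitly linked
```

The explicit link lets document "c" surface for "a" even though their content does not overlap at all, which pure similarity-based retrieval would miss.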

13:00-14:00 Lunch Break