IDEAL 2018: 19TH INTERNATIONAL CONFERENCE ON INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING
PROGRAM FOR WEDNESDAY, NOVEMBER 21ST

10:30-11:30 Session 1: Tutorial
10:30
Tutorial: Nature-Inspired Optimization Algorithms

ABSTRACT. Many problems in optimization and computational intelligence are very challenging to solve, and some of these problems can be NP-hard, which means that there are often no known efficient algorithms to tackle them. In many cases, nature-inspired metaheuristic algorithms can be a good alternative; such algorithms include genetic algorithms (GA), particle swarm optimization (PSO), the firefly algorithm (FA) and many others. Over the last two decades, nature-inspired algorithms have become increasingly popular in solving large-scale, nonlinear, global optimization problems with many real-world applications. They have also become an important part of optimization and computational intelligence. This tutorial will provide a critical analysis of recent algorithms using mathematical theories such as Markov chains, dynamic systems, random walks and self-organization systems. This will provide some insight into these algorithms and their proper use in applications.

11:30-12:00 Coffee Break
12:00-13:00 Session 2: Plenary talk
12:00
Towards enriched Cognitive Security Systems

ABSTRACT. To solve the pressing security challenges of our era, we need more creative approaches capable of detecting connections between relations, events and concepts in an evolving context characterized by an explosive mixture of structured and unstructured data coming from multiple sensor-based and human-based networks.

In this talk we explore how to integrate Granular Computing and Computational Intelligence with security-based systems in order to enrich the cognitive reaction, in terms of decision-support system capability.

We present the evolution of a framework in which different application scenarios are described, highlighting the benefits arising from such an integration. The proposed approaches consider enabling technologies such as multi-agent systems and semantic modelling to provide a solution that copes with the complexity and heterogeneity of the monitored environment and offers the capability to represent, in a machine-understandable way, procedural, factual and other kinds of knowledge, along with all the memory facilities that could be required.

13:00-14:00 Lunch Break
14:10-16:10 Session 3A: Artificial Intelligence & Machine Learning: Theory
14:10
General Structure Preserving Network Embedding
SPEAKER: Caiyan Jia

ABSTRACT. Network embedding has attracted increasing attention in recent years since it represents large-scale networks in a low-dimensional space and provides an easier way to analyze networks. Existing embedding methods either focus on preserving the microscopic topology structure or incorporate the mesoscopic community structure of a network. However, in the real world, a network may not only contain community structure, but also have bipartite structure, star structure or other general structures, where nodes in each cluster have similar patterns of connections to other nodes. Empirically, general structure is important for describing the features of networks. In this paper, based on the nonnegative matrix factorization framework, we propose GS-NMF, which is capable of integrating topology structure and general structure into the embedding process. The experimental results show that GS-NMF overcomes the limitations of previous methods and achieves clear improvements on node clustering, node classification, and visualization.

14:25
CGLAD: using GLAD in crowdsourced large datasets

ABSTRACT. In this article, we propose an improvement over the GLAD algorithm that increases the efficiency and accuracy of the model when working on problems with large datasets. The GLAD algorithm allows practitioners to learn from instances labeled by multiple annotators, taking into account the quality of their annotations and the instance difficulty. However, due to the number of parameters of the model, it does not scale easily to problems with large datasets, especially when the execution time is limited. Our proposal, CGLAD, solves these problems by clustering vectors obtained from the factorization of the annotation matrix. This approach drastically reduces the number of parameters in the model, which makes the GLAD strategy easier to use and more efficient for solving multiple-annotator problems.

14:40
A first approach to face dimensionality reduction through denoising autoencoders

ABSTRACT. The problem of high dimensionality is a challenge when facing machine learning tasks. A high dimensional space has a negative effect on the predictive performance of many methods, specifically, classification algorithms. There are different proposals that arise to mitigate the effects of this phenomenon. In this sense, models based on deep learning have emerged.

In this work, denoising autoencoders are used to reduce dimensionality. To verify their performance, experiments are carried out in which the improvement obtained with different types of classifiers is measured. The classification methods used are kNN, SVM, C4.5 and MLP. The tests for kNN and SVM show a better predictive performance for all datasets. The executions for C4.5 and MLP reflect improvements only in some cases. The execution time is lower for all tests. The conclusions reached open up new lines of future work.

14:55
Compositional Stochastic Average Gradient for Machine Learning and Related Applications

ABSTRACT. Many machine learning, statistical inference, and portfolio optimization problems require minimization of a composition of expected value functions (CEVF). Of particular interest are the finite-sum versions of such compositional optimization problems (FS-CEVF). Compositional stochastic variance reduced gradient (C-SVRG) methods that combine stochastic compositional gradient descent (SCGD) and stochastic variance reduced gradient descent (SVRG) methods are the state-of-the-art methods for FS-CEVF problems. We introduce compositional stochastic average gradient descent (C-SAG), a novel extension of the stochastic average gradient method (SAG) to minimize compositions of finite-sum functions. C-SAG, like SAG, estimates the gradient by incorporating memory of previous gradient information. We present theoretical analyses of C-SAG which show that C-SAG, like SAG and C-SVRG, achieves a linear convergence rate when the objective function is strongly convex; however, C-SAG achieves lower oracle query complexity per iteration than C-SVRG. Finally, we present results of experiments showing that C-SAG converges substantially faster than full gradient (FG), as well as C-SVRG.
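
The gradient memory that this abstract attributes to SAG can be illustrated with plain SAG on a toy problem. This is only a minimal sketch, not C-SAG (the abstract does not specify its compositional updates); the function name `sag_least_squares`, the one-dimensional least-squares objective, the data and the step size 1/(16 L) are all illustrative assumptions.

```python
import random


def sag_least_squares(a, b, n_iters=2000, seed=0):
    """Plain SAG on f(x) = (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2, scalar x."""
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    grad_memory = [0.0] * n   # most recent gradient stored for each term
    grad_sum = 0.0            # running sum of the stored gradients
    lipschitz = max(ai * ai for ai in a)
    step = 1.0 / (16.0 * lipschitz)  # conservative step size (assumption)
    for _ in range(n_iters):
        i = rng.randrange(n)
        new_grad = a[i] * (a[i] * x - b[i])
        # Refresh only term i; the other stored gradients stay as they are.
        grad_sum += new_grad - grad_memory[i]
        grad_memory[i] = new_grad
        # Step along the average of the stored (possibly stale) gradients.
        x -= step * grad_sum / n
    return x


# Toy data chosen so that the exact minimizer is x = 2.
x_hat = sag_least_squares([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Each iteration evaluates a single term's gradient, yet the step uses information from all terms through the stored table, which is the trade-off between memory and per-iteration cost that SAG-style methods make.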

15:10
Data Set Partitioning in Evolutionary Instance Selection

ABSTRACT. In most cases, evolutionary instance selection outperforms non-evolutionary methods, including for the function approximation tasks considered in this work. However, as the number of instances encoded into the chromosome grows, finding the optimal subset becomes more difficult, especially since running the optimization for too long leads to over-fitting. A solution to that problem, which we evaluate in this work, is to reduce the search space by clustering the dataset, running the instance selection algorithm for each cluster and combining the results. We also address the issue of properly processing the instances close to the cluster boundaries, as this is where a drop in accuracy can appear. The method is experimentally verified on several regression datasets with thousands of instances.

15:25
Instance-based stacked generalization for transfer learning

ABSTRACT. We present a method for improving prediction accuracy using multiple predictive algorithms. Several techniques have been developed to tackle this issue, such as bagging, boosting and stacking. In contrast to the first two, which usually generate homogeneous ensembles of classifiers, stacking techniques have demonstrated success using heterogeneous ensembles. In our method, we adopt the stacking mechanism. Several models are generated using different learning algorithms. Forward stepwise selection is implemented to link each instance to its appropriate learning model. Experiments with three datasets benchmarked with seven base learner algorithms show that this novel method improves prediction accuracy and can serve as a bridge to transfer knowledge between tasks given the same feature space but different data distributions.

15:40
Multi-Dimensional Bayesian Network Classifier Trees

ABSTRACT. Multi-dimensional Bayesian network classifiers (MBCs) are probabilistic graphical models tailored to solving multi-dimensional classification problems, where an instance has to be assigned to multiple class variables. In this paper, we propose a novel multi-dimensional classifier that consists of a classification tree with MBCs in the leaves. We present a wrapper approach for learning this classifier from data. An experimental study carried out on randomly generated synthetic data sets shows encouraging results in terms of predictive accuracy.

15:55
Optimally Selected Minimal Learning Machine

ABSTRACT. This paper introduces a new approach to selecting reference points (RPs) for the minimal learning machine (MLM) in classification tasks. A critical issue related to the training process in MLM is the selection of RPs, from which the distances are taken. In its original formulation, the MLM selects the RPs randomly from the data. We propose a new method called optimally selected minimal learning machine (OS-MLM) to select the RPs. Our proposal relies on the multiresponse sparse regression (MRSR) ranking method, which is used to sort the patterns in terms of relevance. After doing so, the leave-one-out (LOO) criterion is used to select an appropriate number of reference points. Based on the simulations we carried out, our proposal achieved a lower number of reference points with an equivalent, or even superior, accuracy with respect to the original MLM and its variants.

14:10-16:10 Session 3B: Artificial Intelligence & Machine Learning: Applications
Location: Mixed room
14:10
Exploring the Perceived Usefulness and Attitude Towards using Tesys e-Learning Platform

ABSTRACT. In this paper we present a study that aims to explore two components of the Technology Acceptance Model. As the evaluation environment we used the Tesys e-Learning platform, for data processing the R programming language, and for data collection Google Forms. The study intended to explore whether there are significant differences between genders regarding the perceived usefulness of, and attitude towards, evaluation using online educational environments. The questionnaire that explored the above-mentioned components was administered to a group of students enrolled in full-time education who had never used e-Learning platforms. The results showed that the questionnaire was reliable and that there is no cross-gender difference regarding students' perceived usefulness of e-Learning platforms and their attitude towards using them. The overall feedback gathered from each of the explored items is positive, with a score above four on a one-to-five Likert scale.

14:25
Peak Alpha Based Neurofeedback Training within Survival Shooter Game

ABSTRACT. Neurofeedback has been proven to be a useful tool in games, music or video applications, both for treating various disorders and disabilities and for potentially enhancing cognitive functions of healthy individuals. However, most neurofeedback protocols are tedious to use and the usefulness of the results is difficult to validate. In this paper we present a framework that allows connecting various types of neurofeedback protocols inside a computer game in such a way that the user can gain the full benefits of neurofeedback-based training by simply playing a game. The paper outlines the full potential of cognitive training and its effectiveness when implemented in an entertaining scenario.

14:40
Support Vector Machine Based Method For High Impedance Fault Diagnosis in Power Distribution Networks
SPEAKER: Katleho Moloi

ABSTRACT. The detection of high impedance faults (HIFs) on a power distribution system has been a subject of concern for many decades. HIFs pose a unique challenge to protection engineers, as they appear to be undetectable by conventional protection schemes. The major concern about HIFs is that they pose a safety risk, as these faults are associated with arcing, which may be dangerous for the surroundings. In this work, we propose a technique which uses a feature extraction, classification and location algorithm. The discrete wavelet transform (DWT) is used to extract meaningful information, a support vector machine (SVM) is used as a classifier and a support vector regression (SVR) scheme is used as a fault estimator. The technique is tested on a real Eskom network.

14:55
Single-class bankruptcy prediction based on the data from annual reports

ABSTRACT. Companies in all areas of business and industry can, due to an unfavorable financial situation or inappropriate investments, face financial problems that result in bankruptcy. The ability to foresee imminent bankruptcy helps managers and stockholders to take corrective actions. In this paper, we analyze annual reports of thousands of limited liability companies and propose a bankruptcy prediction model. The available dataset is strongly imbalanced, which corresponds to the real-world situation where bankrupt companies constitute only a fraction of all companies. The proposed model is based on a single-class least-squares anomaly detection classifier, achieving as high as 91% prediction accuracy in some years.

15:10
On Application of Learning to Rank for Assets Management: Warehouses Ranking

ABSTRACT. This work is motivated by applications that can be derived from IoT devices through connected assets. One such application is resource requirements ranking. Multiple Criteria Decision Analysis (MCDA) has often been utilized to find a ranking by computing on collected data and predefined criteria. However, changes in an asset’s environment have a direct impact on its resource requirements ranking, and decisions on ranking must be constantly revised. This is a repetitive process in which managers are required to repeatedly make the final decisions on the ranking. With machine learning, such repetitiveness can be heavily reduced by teaching machines how to rank not by instruction but rather by examples of the task being done. In this paper, we present a Learning to Rank (LTR) machine learning framework in conjunction with MCDA for a warehouse resource requirements ranking application. A framework in which smart contracts renegotiate the resource allocation ranking for each asset on the blockchain is also discussed.

15:25
Towards the intelligent agents for blockchain e-voting system

ABSTRACT. There are many existing voting solutions, each with different benefits and issues. The most significant issues are lack of transparency and auditability. Recently developed blockchain technology may be a solution to these issues. The paper describes the use of intelligent agents and the multi-agent system concept for the Auditable Blockchain Voting System (ABVS), which integrates the e-voting process with blockchain technology into one supervised, non-remote internet voting system that is end-to-end verifiable.

15:40
Thermal Prediction for Immersion Cooling Data Centers Based on Recurrent Neural Networks
SPEAKER: Jaime Pérez

ABSTRACT. In the data center’s scope, current cooling techniques are not very efficient, either in terms of energy, consuming up to 40% of the total energy requirements, or in terms of occupied area. This is a critical problem for the development of new smart cities, which require the proliferation of numerous data centers in urban areas to reduce the latency and bandwidth of real-time data analytics applications. In this work, we propose a new disruptive solution developed to address this problem: submerging the computing infrastructure in a tank full of a dielectric liquid based on hydro-fluoro-ethers (HFE). Thus, we obtain a passive two-phase cooling system, achieving zero-energy cooling and reducing its area. However, to ensure the maximum heat transfer capacity of the HFE, it is necessary to ensure specific thermal conditions. A predictive model is crucial for any system that needs to work around the point of maximum efficiency. Therefore, this research focuses on the implementation of a predictive thermal model accurate enough to keep the temperature of the cooling system within the maximum efficiency region under real workload conditions. In this paper, we successfully obtained a predictive thermal model using a neural network architecture based on a Gated Recurrent Unit. This model makes accurate thermal predictions of a real system based on HFE immersion cooling, presenting an average error of 0.7 °C with a prediction window of 1 minute.

15:55
Suggesting Cooking Recipes Through Simulation and Bayesian Optimization

ABSTRACT. Cooking typically involves a plethora of decisions about ingredients and tools that need to be chosen in order to write a good cooking recipe. Cooking can be modelled in an optimization framework, as it involves a search space of ingredients, kitchen tools, cooking times or temperatures. If we model as an objective function the quality of the recipe, several problems arise. No analytical expression can model all the recipes, so no gradients are available. The objective function is subjective, in other words, it contains noise. Moreover, evaluations are expensive both in time and human resources. Bayesian Optimization (BO) emerges as an ideal methodology to tackle problems with these characteristics. In this paper, we propose a methodology to suggest recipe recommendations based on a Machine Learning (ML) model that fits real and simulated data and BO. We provide empirical evidence with two experiments that support the adequacy of the methodology.

14:10-16:10 Session 3C: Workshop on Methods for Interpretation of Industrial Event Logs (MIEL)
Location: Meeting room
14:10
Introductory talk: Computational Sensemaking in Industry 4.0: From Analysis to Interpretation
14:25
Automated, Nomenclature Based Data Point Selection For Industrial Event Log Generation

ABSTRACT. Within the automotive industry today, data collection for legacy manufacturing equipment largely relies on the data being pushed from the machine's PLCs to an upper system. Not only does this require programmers' efforts to collect and provide the data, but it is also prone to errors or even intentional manipulation. External monitoring is available through Open Platform Communication (OPC), but it is time-consuming to set up and requires expert knowledge of the system as well. A nomenclature-based methodology has been devised for the external monitoring of unknown control systems that adhere to a minimum set of rules regarding the naming and typing of the data points of interest; it can be deployed within minutes without human intervention. The validity of the concept will be demonstrated through implementation within an automotive body shop, and the quality of the created log will be evaluated. The impact of such a fine-grained monitoring effort on the communication infrastructure will also be measured within the manufacturing facility. It is concluded that, based on the methodology provided in this paper, it is possible to derive OPC groups and items from a PLC program without human intervention in order to obtain a detailed event log.

14:40
Mining Attributed Interaction Networks on Industrial Event Logs

ABSTRACT. In future Industry 4.0 manufacturing systems, reconfigurability and flexible material flows are key mechanisms. However, such dynamics require advanced methods for the reconstruction, interpretation and understanding of the general material flows and structure of the production system. This paper proposes a network-based computational sensemaking approach on attributed network structures that model the interactions in the event log. We apply descriptive community mining methods for detecting patterns in the structure of the production system. The proposed approach is evaluated using two real-world datasets.

14:55
A Taxonomy for Combining Activity Recognition and Process Discovery in Industrial Environments

ABSTRACT. Despite the increasing automation levels in an Industry 4.0 scenario, the tacit knowledge of highly skilled manufacturing workers remains of strategic importance. Retaining this knowledge by formally capturing it is a challenge for industrial organisations. This paper explores research on automatically capturing this knowledge by using methods from activity recognition and process mining on data obtained from sensorised workers and environments. Activity recognition lifts the abstraction level of sensor data to recognizable activities and process mining methods discover models of process executions. We classify the existing work, which largely neglects the possibility of applying process mining, and derive a taxonomy that identifies challenges and research gaps.

15:10
On the Opportunities for Using Mobile Devices for Activity Monitoring and Understanding in Mining Applications

ABSTRACT. In recent decades, a number of embedded and portable computer systems for monitoring the activities of miners and underground environmental conditions have been developed. However, their potential in terms of computing power and analytic capabilities is still underestimated. In this paper we elaborate on recent examples of the use of wearable devices in the mining industry. We identify challenges for high-level monitoring of mining personnel with the use of mobile and wearable devices. To address some of them we propose solutions based on our recent works, including a context-aware data acquisition framework, physiological data acquisition from wearables, methods for handling incomplete and imprecise data, an intelligent data processing and reasoning module, hybrid localization using semantic maps, and adaptive power management. We provide a basic use case to demonstrate the usefulness of this approach.

15:25
Causal rules detection in streams of unlabeled, mixed type values with finite domains
SPEAKER: Szymon Bobek

ABSTRACT. Knowledge discovery from data streams has in recent years become one of the most important research areas in the domain of data science. This is mainly due to the rapid development of mobile devices and Internet of Things solutions, which allow for obtaining petabytes of data within minutes. All of the modern approaches either use a representation that is flat in the time domain or follow the black-box model paradigm. This reduces the expressiveness of models and limits the intelligibility of the system. In this paper we present an algorithm for rule discovery that allows capturing temporal causalities between numeric and symbolic attributes.

15:40
Creation of an event log from machinery monitoring system on a selected example

ABSTRACT. In the paper we address challenges related to the creation of an event log based on sensor data from machinery monitoring systems. In our previous works we proposed a process-oriented approach for the analysis of this kind of data and for longwall machinery operation analysis with process mining techniques. There are two main challenges in event log creation from machinery sensor data: no clear definition of activities and, in our case, no clear trace identification. We present these problems on a selected example from a longwall monitoring system in an underground mine. In the paper we focus on case id identification for event log creation with a heuristic approach. We summarize our experiences in this area, showing the problems of real industrial data sets.

15:55
Monitoring Equipment Operation through Model and Event Discovery

ABSTRACT. Monitoring the operation of complex systems in real-time is becoming both required and enabled by current IoT solutions. Predicting faults and optimising productivity requires autonomous methods that work without extensive human supervision. One way to automatically detect deviating operation is to identify groups of peers, or similar systems, and evaluate how well each individual conforms with the group.

We propose a monitoring approach that can construct knowledge more autonomously and relies on human experts to a lesser degree: without requiring the designer to think of all possible faults beforehand; able to do the best possible with signals that are already available, without the need for dedicated new sensors; scaling up to "one more system and component" and multiple variants; and finally, one that will adapt to changes over time and remain relevant throughout the lifetime of the system.

16:10-16:40 Coffee Break
16:40-18:00 Session 4A: Machine Learning: Theory & Applications
16:40
A framework for form applications that use machine learning

ABSTRACT. Machine learning has been used efficiently in applications across multiple domains. As a consequence, there is a growing interest in techniques and artifacts that aid its use. However, most of the artifacts found are aimed at researchers and experienced users. In addition, few artifacts provide more than ready-made algorithms. In this work, we present a framework capable of delivering ready-to-use machine learning algorithms, as well as code to be reused by form applications that use machine learning algorithms. We used this framework to build two form applications. The results show that the framework is able to reduce the effort of building a new application by 51 percent. In addition, the framework makes it easy to include new state-of-the-art machine learning algorithms, and it provides a simple flow control that assists inexperienced users in the use of machine learning algorithms.

16:55
Overlap-Based Undersampling for Improving Imbalanced Data Classification

ABSTRACT. Classification of imbalanced data remains an important field in machine learning. Several methods have been proposed to address the class imbalance problem including data resampling, adaptive learning and cost adjusting algorithms. Data resampling methods are widely used due to their simplicity and flexibility. Most existing resampling techniques aim at rebalancing class distribution. However, class imbalance is not the only factor that impacts the performance of the learning algorithm. Class overlap has proved to have a higher impact on the classification of imbalanced datasets than the dominance of the negative class. In this paper, we propose a new undersampling method that eliminates negative instances from the overlapping region and hence improves the visibility of the minority instances. Testing and evaluating the proposed method using 36 public imbalanced datasets showed statistically significant improvements in classification performance.

17:10
Machine Learning Methods based Preprocessing to improve Categorical Data Classification

ABSTRACT. The following study is aimed at dealing with large volumes of data whose main characteristic is to contain a high number of variables, most of which are categorical in nature. In the knowledge extraction process, Knowledge Discovery in Databases (KDD), it is very common to deal with a stage of data pre-processing and dimensionality reduction. A key part of extracting information is having high quality data. This paper proposes the use of the Pairwise and Listwise methods as part of the dimensionality reduction process, when there is a high level of missing values present in one or more variables. As part of the pre-processing, we generate n-clusters using Kohonen Self-Organizing Maps (SOM) algorithm with H2O on R. A comparison of the performance and accuracy of classification algorithms is made with the complete subdata set and the algorithms are applied to each cluster. As a case study, we analyzed the characteristics that influence the level of schooling of women of childbearing age.

17:25
Combined classifier based on quantized subspace class distribution

ABSTRACT. This paper presents the Exposer Ensemble (EE), a combined classifier based on the original model of quantized subspace class distribution. It presents a method of establishing and processing the Planar Exposer, a base representation of the discrete class distribution over a given subspace, and a proposal for how to effectively fuse the discriminatory power of many Planar Exposers into a combined classifier. Natural properties of the representation used in this article are its resistance to the imbalance of training data, without the need to use over- or undersampling methods, and the constant computational complexity of prediction. The description of the proposed algorithm is complemented by a series of computer experiments conducted on a collection of balanced and imbalanced datasets with diverse imbalance ratios, demonstrating its usefulness in a supervised learning task.

17:40
A cluster-based prototype reduction for online k-NN classification

ABSTRACT. Data streams are a challenging research topic in which data can continuously arrive with a probability distribution that may change over time. Depending on the changes in the data distribution, different phenomena can occur, such as concept drift. A concept drift occurs when the concepts associated with a dataset change as new data arrive. We propose a method based on k-Nearest Neighbors that implements a sliding window requiring fewer instances stored for training than other methods. A clustering approach is used to cluster labeled instances that are similar, in order to summarize data. In addition, instances close to the uncertainty border of existing clusters are also stored in order to adapt the model to concept drift. Considering accuracy and time consumption, we compare our method against state-of-the-art classifiers from the data-stream literature. Our experimental evaluation shows that k-Nearest Neighbors can achieve better performance and lower time consumption when less information about the concepts is stored in a single sliding window.
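
The basic idea of a sliding-window k-NN classifier adapting to concept drift can be sketched as follows. This is a minimal illustration only: it omits the clustering and border-instance storage that the abstract describes, and the class name `SlidingWindowKNN`, its methods and the toy stream are hypothetical.

```python
import math
from collections import Counter, deque


class SlidingWindowKNN:
    """k-NN over the most recent labeled instances only (illustrative sketch)."""

    def __init__(self, k=3, window_size=50):
        self.k = k
        # A bounded deque discards the oldest instance when a new one arrives.
        self.window = deque(maxlen=window_size)

    def learn_one(self, x, y):
        self.window.append((tuple(x), y))

    def predict_one(self, x):
        if not self.window:
            return None
        point = tuple(x)
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], point))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]


model = SlidingWindowKNN(k=3, window_size=4)

# First concept: small values are labeled "a", large values "b".
for x, y in [((0.1,), "a"), ((0.2,), "a"), ((0.8,), "b"), ((0.9,), "b")]:
    model.learn_one(x, y)
before_drift = model.predict_one((0.15,))

# Concept drift: the labels are swapped; old instances fall out of the window.
for x, y in [((0.1,), "b"), ((0.2,), "b"), ((0.8,), "a"), ((0.9,), "a")]:
    model.learn_one(x, y)
after_drift = model.predict_one((0.15,))
```

Because the window holds only four instances here, the prediction for the same query point follows the new concept as soon as the old instances have been pushed out.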

16:40-17:40 Session 4B: Artificial Intelligence & Applied Mathematics
Location: Mixed room
16:40
PostProcessing in Constrained Role Mining

ABSTRACT. Constrained role mining aims to define a valid set of roles efficiently representing the organization of a company, easing the management of the security policies. Since the associated problems are NP-hard, usually some heuristics are defined to find some sub-optimal solutions. In this paper we define two heuristics for the Permission Distribution and Role Usage Cardinality Constraints in the post-processing framework, i.e. refining the roles produced by some other algorithm. We discuss the performance of the proposed heuristics applying them to some standard datasets showing the improvements w.r.t. previously available solutions.

16:55
Extended Min-Hash Focusing on Intersection Cardinality
SPEAKER: Hisashi Koga

ABSTRACT. Min-Hash is a reputable hashing technique which realizes set similarity search. Min-Hash assumes the Jaccard similarity |A ∩ B| / |A ∪ B| as the similarity measure between two sets A and B. Accordingly, Min-Hash is not optimal for applications which would like to measure the set similarity with the intersection cardinality |A ∩ B|, since the Jaccard similarity decreases irrespective of |A ∩ B|, as the gap between |A| and |B| becomes larger. This paper shows that, by modifying Min-Hash slightly, we can effectively settle the above difficulty inherent to Min-Hash. Our method is shown to be valid both by theoretical analysis and with experiments.
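
The effect described in this abstract can be reproduced with standard Min-Hash. The sketch below is a minimal illustration, assuming explicit random permutations of a small universe; it does not implement the modification proposed in the paper, and the helper names `jaccard`, `minhash_signature` and `estimate_jaccard` are hypothetical.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minhash_signature(items, universe, num_perm=200, seed=0):
    """Standard Min-Hash: per random permutation, keep the smallest rank in the set."""
    rng = random.Random(seed)  # same seed gives the same permutations for every set
    order = list(universe)
    signature = []
    for _ in range(num_perm):
        rng.shuffle(order)
        rank = {item: position for position, item in enumerate(order)}
        signature.append(min(rank[item] for item in items))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two min-ranks agree."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


universe = range(105)
a = set(range(10))            # |A| = 10
b_small = set(range(5, 15))   # |B| = 10,  |A ∩ B| = 5
b_large = set(range(5, 105))  # |B| = 100, |A ∩ B| = 5

j_small = jaccard(a, b_small)  # 5 / 15
j_large = jaccard(a, b_large)  # 5 / 105

sig_a = minhash_signature(a, universe)
est_small = estimate_jaccard(sig_a, minhash_signature(b_small, universe))
est_large = estimate_jaccard(sig_a, minhash_signature(b_large, universe))
```

Both pairs share exactly five elements, yet the Jaccard similarity drops from 1/3 to 1/21 as B grows, so a Jaccard-based estimate ranks the second pair far lower even though the intersection cardinality is unchanged.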

17:10
New Fuzzy Singletons Distance Measure by Convolution

ABSTRACT. This article proposes a new method to calculate the distances between fuzzy singleton variables. It uses a measure of generalized fuzzy numbers based on the center of gravity (COG). The fuzzy signals are transformed by applying convolution. To demonstrate the effectiveness of this method, it is applied to a pattern recognition problem that deals with stock markets. Comparison with other classical distance measures shows that this approach provides a consistent and reliable distance measure for the stock market scenario and can be generalized to any pattern recognition problem.

17:25
A fast Metropolis-Hastings method for generating random correlation matrices

ABSTRACT. We propose a novel Metropolis-Hastings algorithm to sample uniformly from the space of correlation matrices. Existing methods in the literature are based on elaborate representations of a correlation matrix, or on complex parametrizations of it. By contrast, our method is intuitive and simple, based on the classical Cholesky factorization of a positive definite matrix and Markov chain Monte Carlo theory. We perform a detailed convergence analysis of the resulting Markov chain, and show how it benefits from fast convergence, both theoretically and empirically. Furthermore, in numerical experiments our algorithm is shown to be significantly faster than the current alternative approaches, thanks to its simple yet principled approach.

16:40-18:00 Session 4C: Special Session on Data Selection in Machine Learning (DSML 1)
Location: Meeting room
16:40
Semi-Supervised Learning to Reduce Data Needs of Indoor Positioning Models

ABSTRACT. Indoor positioning systems answer the need for ubiquitous localisation systems. Frequently, indoor positioning relies on machine learning models developed based on the training data composed of WiFi received signal strength (RSS) vectors observed in different indoor locations. However, this requires expensive collection of RSS vectors in precisely measured locations. In this study, we propose a semi-supervised method, which can reduce the volume of the expensive labelled training data and exploit the availability of unlabelled signal strength measurements. The method relies, inter alia, on the measures of similarity among nearest neighbours of unlabelled vectors. Tests performed with a number of testbed areas confirm that the method improves the accuracy of random forest models used to estimate indoor location of mobile terminals.

16:55
EMnGA: Entropy Measure and Genetic Algorithms Based Method for Heterogeneous Ensembles Selection

ABSTRACT. Generating ensembles of classifiers increases performance in classification and prediction, but it also increases the storage space and the prediction time. Selection or simplification methods have been proposed to reduce space and time while maintaining or improving the performance of the initial ensemble. In this paper we propose a method called EMnGA that uses a diversity-based entropy measure and a genetic algorithm-based search strategy to simplify a heterogeneous ensemble of classifiers. The proposed method is evaluated on its prediction performance and is compared, in a sequential way, to the initial ensemble as well as to the selection methods for heterogeneous ensembles in the literature.

17:10
A Study of Fuzzy Clustering to Archetypal Analysis

ABSTRACT. Archetypes are extreme points that synthesize data representing "pure" individual types, and are assigned by the most discriminating features of data points. Archetypes are rather useful in real applications where "pure" representatives of the data serve as their models. A simulation study is conducted with a proper data generator (DG), the fuzzy proportional membership DG, whose goal is twofold: first, to analyse the ability of the archetypal clustering algorithm to recover archetypes from data of distinct dimensionalities; second, to show the robustness of archetypal clustering in the presence of outliers in data of low and high dimensionality. The archetypal clustering algorithm is also compared with a proportional membership fuzzy clustering algorithm using various benchmark data sets from the UCI machine learning repository. The evaluation conducted with five premier internal fuzzy validation indices shows the good quality of the results.

17:25
Different Approaches of Data and Attribute Selection on Headache Disorder
SPEAKER: Dragan Simić

ABSTRACT. Half of the general population experiences a headache during any given year, and more than 90% report a lifetime history of head pain. Medical data and information in turn provide knowledge on which physicians base their decisions and actions. It is not easy for a decision maker in data mining and decision-making processes to handle too much data and information. It becomes increasingly necessary to extract useful knowledge from the database and make scientific decisions for the diagnosis and treatment of a disease. This paper presents a comparison of the data and attributes selected by automatic machine learning methods and algorithms with those selected by diagnostic tools and an expert physician, all from the last decade.

17:40
Feature selection and interpretable feature transformation: a preliminary study on feature engineering for classification algorithms

ABSTRACT. This paper explores the limitations of consistency-based measures in the context of feature selection. These kinds of filters are not very widespread in high-dimensionality problems. Typically, the number of selected attributes is very small and their predictive ability is a drawback. The principal contribution of this work is the introduction of a new approach within feature engineering to create new attributes after the feature selection stage. Experimentation on multi-class problems with a feature space on the order of tens of thousands shows that the new proposal yields some improvements. As a final insight, some new relationships were discovered due to the combined application of feature selection and feature transformation. Additionally, a new measure for classification problems that relates the number of features to the number of classes or labels is also proposed.

19:30-21:30Cultural event