SMBD2018: INTERNATIONAL CONFERENCE ON STATISTICAL METHODS FOR BIG DATA
PROGRAM FOR FRIDAY, JUNE 8TH


09:30-11:00 Session 5
Chair:
Antoni Espasa (Universidad Carlos III de Madrid, Spain)
09:30
Nuno Crato (Universidade de Lisboa, Portugal)
Jorge Caiado (Universidade de Lisboa, Portugal)
Pilar Poncela (EC-JRC and Universidad Autónoma de Madrid, Spain)
A clustering procedure for studying big data financial time series

ABSTRACT. Building upon previous joint work with Daniel Peña, we propose and study a frequency-domain procedure for characterizing and comparing large sets of long financial time series. Instead of using all the information available in the data, which would be computationally very expensive, we propose some regularization rules to select and summarize the most relevant information for clustering purposes. Essentially, we propose to use the snipped periodogram around the driving seasonal components of interest and compare the estimates. This procedure is computationally simple, yet able to condense relevant second-order information on the volatility of the time series. We use this procedure to study the evolution of several stock market indices, extracting information on European financial integration. We further show the effect of the recent financial crisis on the behaviour of these indices.
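A minimal sketch of this kind of frequency-domain comparison (the band limits, log transform, and Ward linkage below are illustrative choices, not the authors' exact procedure):

```python
import numpy as np
from scipy.signal import periodogram
from scipy.cluster.hierarchy import linkage, fcluster

def band_log_periodogram(x, band=(0.0, 0.1)):
    """Periodogram ordinates restricted ('snipped') to a frequency band."""
    f, pxx = periodogram(x)
    mask = (f >= band[0]) & (f <= band[1])
    return np.log(pxx[mask] + 1e-12)  # log scale stabilizes the variance

def cluster_series(series, band=(0.0, 0.1), k=2):
    """Cluster equal-length series on their band-limited log-periodograms."""
    feats = np.array([band_log_periodogram(s, band) for s in series])
    return fcluster(linkage(feats, method="ward"), k, criterion="maxclust")
```

Restricting to a band keeps the feature vector short, so the pairwise comparison stays cheap even for large panels of long series.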

10:00
Marc Hallin (Université libre de Bruxelles, Belgium)
Matteo Barigozzi (London School of Economics, UK)
Stefano Soccorsi (University of Lancaster, UK)
Identifying Global and Local Shocks in International Financial Markets: a General Dynamic Factor Model Approach

ABSTRACT. We employ a two-stage general dynamic factor model method to analyze the comovements between returns and between volatilities of stocks from the US, European, and Japanese financial markets. We find evidence of two common shocks driving the dynamics of volatilities – one global (worldwide) shock and one US-European shock – and four local shocks in the panel of returns, but no global one. Co-movements in the returns and volatilities panels increased considerably in the 2007-2012 period, associated with the Great Financial Crisis and the European Sovereign Debt Crisis. We interpret this finding as the sign of a surge, during crises, of interdependencies across markets, as opposed to contagion. Finally, we introduce a new method for structural analysis in general dynamic factor models, which is applied to the identification of volatility shocks via natural timing assumptions. The global shock has homogeneous dynamic effects within each individual market but more heterogeneous effects across them, and also has good predictive power for aggregate realized volatilities.

10:30
Sylvia Frühwirth-Schnatter (Vienna University of Business and Economics, Austria)
Hedibert Freitas Lopes (Insper Institute of Education and Research, Brazil)
Recent Advances in Sparse Bayesian Factor Analysis

ABSTRACT. Factor analysis is a popular method to obtain a sparse representation of the covariance matrix of multivariate observations. This is particularly relevant for analyzing the correlation among high-dimensional data. The present talk reviews some recent research in the area of sparse Bayesian factor analysis that tries to achieve additional sparsity in a factor model through the use of point mass mixture priors.

However, despite the popularity of sparse factor models in many applied areas such as genetics, economics, and finance, little attention has been given to formally addressing the identifiability of these models beyond standard rotation-based identification such as the positive lower triangular constraint. In particular, identifiability issues that arise from introducing zeros in the factor loading matrix are largely ignored. In a recent paper [1] we tried to fill this gap.

Identifiability issues that arise from introducing zeros in the factor loading matrix are discussed in detail. We provide a counting rule on the number of nonzero factor loadings that is sufficient for achieving uniqueness of the variance decomposition in the factor representation. Furthermore, we introduce the generalized lower triangular representation to resolve rotational invariance and show that within this model class the unknown number of common factors can be recovered in an overfitting sparse factor model. By combining point-mass mixture priors with a highly efficient and customized MCMC scheme, we obtain posterior summaries regarding the number of common factors as well as the factor loadings via post-processing. The methodology is illustrated for monthly exchange rates of 22 currencies with respect to the euro over a period of eight years and for monthly log returns of 73 firms from the NYSE100 over a period of 20 years.

11:00-11:30 Coffee Break
11:30-13:30 Session 6
Chair:
Wenceslao González-Manteiga (Universidad de Santiago de Compostela, Spain)
11:30
J. S. Marron (UNC, United States)
High Dimension Low Sample Size Asymptotics

ABSTRACT. The asymptotics of growing sample size are the foundation of classical mathematical statistics. But modern big data challenges suggest consideration of growing dimension as well. A perhaps extreme case of this has fixed sample size. That context is seen to have some counter-intuitive mathematical structure. These non-standard ways of thinking about data are seen to be the key to understanding important aspects of real genomic data.

12:00
Ricardo Cao (Research Group MODES, Department of Mathematics, CITIC and ITMATI, Universidade da Coruña, Spain)
Laura Borrajo (Research Group MODES, Department of Mathematics, CITIC, Universidade da Coruña, Spain)
Nonparametric mean estimation for big-but-biased data

ABSTRACT. Crawford has recently warned about the risks of the claim "with enough data, the numbers speak for themselves". Some of the problems that come from ignoring sampling bias in big data statistical analysis have recently been reported by Cao. This work considers the problem of nonparametric statistical inference in big data under sampling bias. The mean estimation problem is studied in this setup, in a nonparametric framework, both when the biasing weight function is known (unrealistic) and when it is unknown (realistic). Two different scenarios are considered to remedy ignorance of the weight function: (i) having a small simple random sample from the real population and (ii) having observed a sample from a doubly biased distribution. In both cases the problem is related to nonparametric density estimation. Asymptotic expressions for the mean squared error of the proposed estimators are derived, leading to an asymptotic formula for the optimal smoothing parameter. Some simulations illustrate the performance of the proposed nonparametric methods.
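For the known-weight case, the idea of bias correction can be illustrated by inverse weighting; this sketch (the function name is mine) assumes the observed sample was drawn from the weight-biased density g(x) ∝ w(x) f(x):

```python
import numpy as np

def bias_corrected_mean(x, w):
    """Estimate the mean of f from a sample drawn from g(x) ∝ w(x) f(x).

    Each observation is inversely weighted by w, so over-represented
    values are down-weighted; w must be known and strictly positive.
    """
    inv = 1.0 / w(x)
    return np.sum(x * inv) / np.sum(inv)
```

For instance, under length bias (w(x) = x), down-weighting each observation by 1/x recovers the mean of the unbiased population.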

12:30
Pedro Delicado (Universitat Politècnica de Catalunya, Spain)
Variable relevance matrix in algorithmic models

ABSTRACT. We define a variable relevance matrix combining ideas from outlier detection and variable importance measures in random forests. We then derive an automatic procedure for assigning a relevance measure to each variable in a prediction problem fitted by a prediction algorithm.

13:00
Isabel Casas (BCAM and University of Southern Denmark, Spain)
Xiuping Mao (Zhongnan University of Economics and Law, China)
Helena Veiga (Universidad Carlos III de Madrid, Spain)
Reexamining financial and economic predictability with new estimators of realized variance and variance risk premium

ABSTRACT. This study explores the predictive power of new estimators of the equity variance risk premium and conditional variance for future excess stock market returns, economic activity, and financial instability, both during and after the last global financial crisis. These estimators are obtained from new parametric and semiparametric asymmetric extensions of the heterogeneous autoregressive model. Using these new specifications, we determine that the equity variance risk premium is a predictor of future excess stock returns, whereas conditional variance predicts them only for long horizons. Moreover, a comparison of the overall results reveals that the conditional variance gains predictive power during the global financial crisis period. Furthermore, both the variance risk premium and conditional variance predict future financial instability, whereas conditional variance is the only predictor of economic activity for all horizons. Before the global financial crisis period, the new parametric asymmetric specification of the heterogeneous autoregressive model gains predictive power in comparison to previous work in the literature. However, the new time-varying coefficient models are the ones showing considerably higher predictive power for stock market returns and financial instability during the financial crisis, suggesting that an extreme volatility period requires models that can adapt quickly to turmoil.

13:30-15:00 Lunch
15:00-16:00 Session Posters
15:00
Abel Guada-Azze (Universidad Carlos III de Madrid, Spain)
Bernardo D'Auria (Universidad Carlos III de Madrid, Spain)
Eduardo García-Portugués (Carlos III University of Madrid, Spain)
Inferring the optimal stopping boundary for a Brownian bridge

ABSTRACT. Optimal stopping theory has found several applications in financial mathematics, specifically in option trading, where the question of finding the optimal stopping time to exercise an option arises. Here we address this situation when the asset dynamics follow a Brownian bridge process with a certain volatility and the gain function is the identity. We present a methodology that does not rely on any previous knowledge about the explicit form of the optimal boundary. We obtain an integral equation defining the boundary function, which not only allows its numerical computation but also serves as a key component for effective inference on the boundary. This approach is likely to be successfully applied under more general diffusion processes and gain functions than the particular combination studied in this paper.

15:00
Juan C. Laria (Universidad Carlos III de Madrid, Spain)
Rosa Lillo (Universidad Carlos III de Madrid, Spain)
M. Carmen Aguilera-Morillo (Universidad Carlos III de Madrid, Spain)
A variable selection approach in high dimensional problems

ABSTRACT. In high-dimensional supervised learning problems, sparsity constraints in the solution often lead to better performance and interpretability of the results. For problems in which covariates are grouped and sparse structures are desired at both the group and within-group levels, the sparse-group lasso (SGL) regularization method has proved to be very efficient. Sometimes the group structure in the covariates is clear, e.g., when we have dummy variables corresponding to different levels of the same original categorical variable. However, many real problems lack an explicit configuration for the groups. In this work, we focus on a recent contribution, the iterative sparse-group lasso (iSGL). This algorithm automatically weights each group of variables based on cross-validation criteria, finding an optimal solution to the regression problem. We investigate properties of the iSGL when the group structure in the covariates is unknown, derive strategies to supply those missing groups, and compare iSGL to other state-of-the-art algorithms. We support our analysis using both real and synthetic data sets.
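As a rough illustration of the penalty the SGL family uses, here is a plain proximal-gradient sketch of sparse-group-lasso regression (not the iSGL algorithm itself; its automatic cross-validated group weighting is not shown):

```python
import numpy as np

def sgl_prox(beta, groups, lam, alpha, step):
    """Proximal operator of the sparse-group lasso penalty."""
    # elementwise soft-thresholding (lasso part)
    b = np.sign(beta) * np.maximum(np.abs(beta) - step * alpha * lam, 0.0)
    # groupwise soft-thresholding (group-lasso part)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(b[idx])
        shrink = step * (1.0 - alpha) * lam * np.sqrt(idx.sum())
        b[idx] = 0.0 if norm <= shrink else b[idx] * (1.0 - shrink / norm)
    return b

def sparse_group_lasso(X, y, groups, lam=0.1, alpha=0.5, n_iter=1000):
    """Proximal-gradient solver for SGL-penalized least squares."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = sgl_prox(beta - step * grad, groups, lam, alpha, step)
    return beta
```

The mixing parameter alpha trades off within-group sparsity (alpha = 1 is the plain lasso) against whole-group sparsity (alpha = 0 is the group lasso).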

15:00
Elisa Cabana (Universidad Carlos III de Madrid, Spain)
Rosa Elvira Lillo (Universidad Carlos III de Madrid, Spain)
Henry Laniado (Universidad EAFIT Colombia, Colombia)
Robust regression using a robust Mahalanobis distance based on shrinkage estimators

ABSTRACT. We propose a robust method for linear regression based on a robust estimation of the joint location and scatter matrix of the explanatory and response variables. The robust estimators are based on the notion of shrinkage. We investigate the robustness and goodness of the estimators via simulations and a real example. Furthermore, the resulting regression technique is computationally feasible and turns out to perform better than several popular robust regression methods, even in high dimensions.
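As one concrete way to combine shrinkage with Mahalanobis distances, the sketch below uses scikit-learn's Ledoit-Wolf shrinkage covariance as a stand-in for the authors' robust shrinkage estimators of location and scatter (which are not reproduced here):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_mahalanobis(X):
    """Mahalanobis-type distances using a shrinkage covariance estimate."""
    lw = LedoitWolf().fit(X)          # shrunk scatter matrix and location
    diff = X - lw.location_           # centred observations
    inv = np.linalg.inv(lw.covariance_)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv, diff))
```

Observations with large distances can then be flagged or down-weighted before fitting the regression.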

15:00
Diego Gómez (Universidad Complutense de Madrid, Spain)
Rosa Espínola (Universidad Complutense de Madrid, Spain)
Gaussian Bayesian networks and Gaussian mixture models applied to day-ahead forecasting of Spanish electricity market supply curves

ABSTRACT. Deregulation in electricity markets has increased interest in forecasting electricity prices among generation and supply companies, whose goal is to maximize their profits. This paper studies the probabilistic relationships between the Spanish electricity market supply curve of the first hour of the operating day and the available generation forecasts in the Spanish energy mix. We propose Gaussian Bayesian networks and Gaussian mixture models to capture these probabilistic relationships, using data from 2015 and 2016. The existence of power generation forecasts for some technologies lets us evaluate the response of the supply curve once the probabilistic models have been estimated. This methodology thus allows market agents to anticipate the probable behaviour of the supply curve and to define their buy or sell bids for power generation accordingly, in order to maximize their profits.

15:00
Javier Cara (Universidad Politécnica de Madrid, Spain)
Jesus Juan (Universidad Politécnica de Madrid, Spain)
Enrique Alarcón (Universidad Politécnica de Madrid, Spain)
Bootstrapping state space models in experimental modal analysis of a footbridge

ABSTRACT. Experimental modal analysis consists of estimating the modal parameters of a structural/mechanical system (a footbridge in this case) from sensor measurements. The process can be described as: 1) measure the vibrations of the footbridge at different points using accelerometers; 2) estimate a state space model from the multivariate time series of accelerations; 3) compute the modal parameters from the eigenvalues of the state space matrices.

This work analyses the application of the bootstrap to compute the standard error of the modal parameters. First, the method proposed in [1] is applied. The main problem observed with this approach is that the residuals are autocorrelated, so they cannot be resampled directly. Then, a sieve bootstrap [2] is applied to the residuals, and these resampled residuals are used to generate the bootstrap replicates. Therefore, the proposed method can be described as a two-step bootstrap.

15:00
Eduardo Caro (Universidad Politécnica de Madrid, Spain)
Jesus Juan (Universidad Politécnica de Madrid, Spain)
Javier Cara (Universidad Politécnica de Madrid, Spain)
Short-term forecasting hourly electricity load in Spain

ABSTRACT. This work presents a novel approach to forecasting the electric load, assembling the 24 hourly series in a periodic autoregressive moving-average model. The proposed methodology is partially based on previous work by Cancelo and Espasa.

The identification and estimation of a periodic model of order 24 is enormously complex. In this work, we propose to estimate the periodic model by taking advantage of existing implementations for estimating univariate ARIMA models, and we describe their application to the prediction of the next hours. The new methodology includes two additional contributions: (1) a very exhaustive and complex intervention system that reduces the prediction errors occurring during non-working days, and (2) a meticulous model of the non-linear temperature effect using regression spline techniques. The method is currently used by the Spanish System Operator (Red Eléctrica de España, REE) to make hourly forecasts of electricity demand from one to ten days ahead.

15:00
Eduardo García-Portugués (Carlos III University of Madrid, Spain)
Davy Paindaveine (Université libre de Bruxelles, Belgium)
Thomas Verdebout (Université libre de Bruxelles, Belgium)
On optimal tests for rotational symmetry against new classes of hyperspherical distributions

ABSTRACT. Motivated by the central role played by rotationally symmetric distributions in directional statistics, we consider the problem of testing rotational symmetry on the hypersphere. We adopt a semiparametric approach and tackle the situations where the location of the symmetry axis is either specified or unspecified. For each problem, we define two tests and study their asymptotic properties under very mild conditions. We introduce two new classes of directional distributions that extend the rotationally symmetric class and are of independent interest. We prove that each test is locally asymptotically maximin, in the Le Cam sense, for one kind of the alternatives given by the new classes of distributions, both for specified and unspecified symmetry axis. The tests, aimed to detect location-like and scatter-like alternatives, are combined into a convenient hybrid test that is consistent against both alternatives. A Monte Carlo study illustrates the finite-sample performances of the tests and corroborates empirically the theoretical findings. Finally, we apply the tests for assessing rotational symmetry in two real data examples coming from geology and proteomics.

15:00
Gabriel Antonio Valverde Castilla (Universidad Complutense de Madrid, Spain)
Jose Manuel Mira McWillians (E.T.S. de Ingenieros Industriales, Universidad Politécnica de Madrid, Spain)
A stochastic simulation experiment for outlier mixture component detection in one- and two-layer Self Organizing Maps (SOM)

ABSTRACT. The purpose of this paper is to apply stochastic simulation for a better understanding of the possibilities of outlier mixture component detection with one- and two-layer Self Organizing Maps (SOM).

SOM (Self Organizing Maps) were developed by Kohonen (1980) as a tool to represent structures such as cortical layers in the brain as two- or three-dimensional maps. In more abstract terms, SOM is a clustering technique which implies a reduction of dimensionality to one, two or three dimensions, thus providing a visual description of the clustering.

The essential feature of SOM, besides dimensionality reduction into a discrete map, is conservation of topology. In SOM, two forms of learning are applied:

competitive, by sequential allocation of sample observations to a winning node in the map, and

cooperative, by update of the weights of the winning node and its neighbours.

By means of the cooperative learning, conservation of topology from the original data space to the reduced (typically 2-d) map is achieved.
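The two learning steps can be sketched for a 1-d map as follows (a toy implementation; the learning-rate and neighbourhood schedules are illustrative choices):

```python
import numpy as np

def train_som(data, n_nodes=4, n_epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Train a 1-d SOM with competitive and cooperative updates."""
    rng = np.random.default_rng(seed)
    # initialize node weights from randomly chosen sample observations
    w = data[rng.choice(len(data), n_nodes, replace=False)].astype(float)
    pos = np.arange(n_nodes)  # node positions on the 1-d map
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            # competitive step: find the winning node
            win = np.argmin(np.linalg.norm(w - x, axis=1))
            # cooperative step: update winner and neighbours, with a
            # Gaussian neighbourhood kernel on map positions
            h = np.exp(-((pos - win) ** 2) / (2 * sigma**2))
            w += lr * h[:, None] * (x - w)
        lr *= 0.9  # decay the learning rate each epoch
    return w
```

The neighbourhood kernel h is what makes the update cooperative and preserves topology: nodes adjacent to the winner on the map move toward the same observation.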

Here we formulate a stochastic data-generating process by means of a mixture of Gaussian distributions, where one of the mixture components is a low-weight outlier. Since in many real problems it is interesting to keep a representation of the outlier in the map, we compare the performance of one- and two-layer SOMs in the outlier representation task.

Stratified sampling is applied for both the one-layer and two-layer SOMs, i.e., the same sample is used in both cases, in order to estimate the power to detect the outlying mixture component.

Node initialization was done in two ways:

a) each of the four nodes was initialized from the data in one of the four strata, i.e., one node per stratum; this gives equal weight in the initialization to all strata, regardless of their relative weights in the stochastic model.

b) initialization taking into account the weights of the strata, i.e., strata with double weight would have their data present in the initialization of twice the number of nodes.

The two-layer SOM allows for parallelization. The purpose of the study is to determine whether parallelization, which brings much greater computational efficiency, limits the power of the SOM analysis, or, in other words, under which conditions/parameters of the Gaussian-mixture data-generating process the approximation provided by the two-layer (parallel) SOM is reasonably good.

15:00
Hoang Nguyen (Universidad Carlos III de Madrid, Spain)
María Concepción Ausín (Universidad Carlos III de Madrid, Spain)
Pedro Galeano (Universidad Carlos III de Madrid, Spain)
Variational Inference for high dimensional structured factor copulas

ABSTRACT. Factor copula models have been recently proposed by [Krupskii and Joe(2013)], [Krupskii and Joe(2015)] for tackling the curse of dimensionality by describing the joint distribution of variables in terms of a few common latent factors. In this paper, we propose a Bayesian procedure to make inference for structured factor copulas and select the best bivariate copula links using Bayesian model selection criteria. To deal with the high-dimensional structure, we propose a Variational Bayesian approximation [Kucukelbir et al.(2016)] to estimate the different specifications of the factor copula models. Compared to the full Bayesian approach, the Variational Bayesian approximation is much faster and can handle a sizable problem in a few seconds. We also compare the posterior estimates from the VB approximation and the full Bayesian inference. We illustrate our proposed procedure with high-dimensional real data sets in different contexts.

15:00
Israel Martínez Hernández (King Abdullah University of Science and Technology, KAUST, Saudi Arabia)
Graciela González Farías (Centro de Investigación en Matemáticas, CIMAT, Mexico)
Long Memory Test for Nonlinear Time Series

ABSTRACT. The detection of persistence or long memory is interesting in several areas of science, especially in econometrics. It is important to distinguish between time series with short memory and time series with long memory, for both linear and non-linear series. The usual tests, such as the R/S statistic, have low power in the non-linear case. Other tests also fail to detect long memory in the non-linear case since they are based on the correlation measure, which cannot describe nonlinear dependence. However, many economic phenomena have non-linear dynamics. Hence, it is interesting to explore new measures with higher power in the nonlinear case. We propose to use Mutual Information (MI) as a dependence measure. MI does not depend on the linearity of the process. MI is estimated at different lags by using a nonparametric estimator. Then, we propose a functional envelope test for short memory in the time series, using a depth notion to order curves. The functional approach is used because MI is assumed to be a continuous function of the lag h. The test has shown high power in simulation studies.
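Estimating MI as a function of the lag can be sketched with a generic nonparametric (k-nearest-neighbour based) MI estimator; this is only an illustration of the lagged-MI curve, not the authors' estimator or their envelope test:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_by_lag(x, max_lag=20, seed=0):
    """Nonparametric MI between x_t and x_{t-h}, for h = 1..max_lag."""
    x = np.asarray(x, dtype=float)
    mis = np.empty(max_lag)
    for h in range(1, max_lag + 1):
        # MI between the series and its lag-h copy
        mis[h - 1] = mutual_info_regression(
            x[:-h].reshape(-1, 1), x[h:], random_state=seed)[0]
    return mis  # the curve h -> MI(h), to be compared against an envelope
```

Unlike the autocorrelation function, this curve also registers purely nonlinear dependence between x_t and its lags.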

16:00-17:30 Session 7
Chair:
Ricardo Cao (Research Group MODES, Department of Mathematics, CITIC and ITMATI, Universidade da Coruña, Spain)
16:00
Wenceslao González-Manteiga (Universidad de Santiago de Compostela, Spain)
A review of goodness-of-fit tests for models with functional data with recent results

ABSTRACT. Different goodness-of-fit tests for some null hypotheses of models with functional data are reviewed in this talk. The most recent tests are based on the residual marked empirical process indexed by random projections for testing the hypothesis of a functional linear model. Some simulations, applications to real data, and a comparative discussion are given for the statistical analysis of the behaviour of the different tests.

16:30
Fabio Nieto (Universidad Nacional de Colombia, Colombia)
Seasonal dynamic common factors: some new results and a Colombian climatology case study

ABSTRACT. Common factors for seasonal multivariate time series are usually obtained by first filtering the series to eliminate the seasonal component and then extracting the nonseasonal common factors. This approach has two drawbacks. First, we cannot detect common factors with seasonal structure; second, it is well known that a deseasonalized time series may exhibit spurious cycles that the original data do not contain, which can make the detection of the nonseasonal factors more difficult. This talk presents a procedure that uses the original data to estimate the dynamic common factors when some, or all, of the time series are seasonal. The procedure is based on the asymptotic behavior of the sequence of so-called sample generalized autocovariance matrices and of the sequence of canonical correlation matrices, and it includes a statistical test for detecting the total number of common factors. An application to Colombian climatology illustrates the statistical method.

17:00
Vanesa Guerrero (Universidad Carlos III de Madrid, Spain)
Emilio Carrizosa (Instituto de Matemáticas de la Universidad de Sevilla, Spain)
Dolores Romero Morales (Copenhagen Business School, Denmark)
Albert Satorra (Universitat Pompeu Fabra, Spain)
Enhancing interpretability in Factor Analysis by means of Mathematical Optimization

ABSTRACT. A natural approach to interpreting the latent variables arising in an Exploratory Factor Analysis consists of measuring explanatory variables over the same samples and assigning (groups of) them to the factors. Each latent variable is then explained by means of its assigned explanatory variables, yielding a straightforward way to give meaning to the factors. Whereas such an assignment is usually made by the user based on his/her expertise, we propose an optimization-based procedure that seeks the transformation of the latent variables yielding the best assignment between them and given (groups of) explanatory variables.

20:30-22:30 Tribute Dinner