SMBD2018: INTERNATIONAL CONFERENCE ON STATISTICAL METHODS FOR BIG DATA
PROGRAM FOR THURSDAY, JUNE 7TH

09:45-10:45 Session 1
Chair:
Pedro Galeano (Universidad Carlos III de Madrid, Spain)
09:45
Ruben Zamar (The University of British Columbia, Canada)
A (Very) Partial Review of Cluster Analysis

ABSTRACT. I will present a very partial review of cluster analysis to showcase my work on this topic and to highlight my joyful and lasting collaboration with Daniel Peña. First, I'll briefly describe four different approaches to cluster analysis: agglomerative, divisive, model-based and mode-climbing algorithms. Second, I'll present procedures that search for clusters around lower-dimensional spaces such as lines, planes, 3-dimensional subspaces and so on. Third, I'll present some current research on robust K-centers with application to the search for small targets in colored pictures, such as a metal piece dropped by NASA's Mars rover Curiosity during its exploration of Mars and a ship lost in the ocean. Finally, I'll describe a novel approach to model-based clustering and the EM algorithm.
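
As a point of reference for one of the four approaches listed above, the following is a minimal sketch (not the speaker's code) of model-based clustering fitted by the EM algorithm, using scikit-learn and synthetic two-cluster data; all names and settings are illustrative assumptions.

```python
# Minimal sketch of model-based clustering via EM (Gaussian mixture); synthetic data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(200, 2)),   # cluster 1
    rng.normal(loc=[5, 5], scale=1.5, size=(200, 2)),   # cluster 2
])

# EM-fitted Gaussian mixture: each component plays the role of one cluster.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)          # hard cluster assignments
print(gmm.means_)                # estimated cluster centres
print(np.bincount(labels))       # cluster sizes
```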

10:15
Ruey Tsay (University of Chicago, United States)
Zhaoxing Gao (University of Chicago, United States)
A Structural-Factor Approach to Modeling High-Dimensional Time Series

ABSTRACT. This paper considers a structural-factor approach to modeling high-dimensional time series. We decompose individual series into trend, seasonal, and irregular components and consider common factors for the irregular component series. For ease in analyzing many time series, we employ a time polynomial for the trend and a linear combination of trigonometric series for the seasonal component. A new factor model is then proposed for the irregular components to simplify the modeling process and to achieve parsimony in parameterization. We propose a Bayesian Information Criterion (BIC) to consistently determine the order of the polynomial trend and the number of trigonometric functions. A test statistic is used to determine the number of common factors. The convergence rates for the estimators of the trend and seasonal components and the limiting distribution of the test statistic are established under the setting that the number of time series tends to infinity with the sample size, but at a slower rate. We use simulation to study the performance of the proposed analysis in finite samples and apply the proposed approach to model weekly PM2.5 data observed at 15 monitoring stations in the southern region of Taiwan.
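
A minimal sketch of the decomposition described above, not the authors' code: each series is regressed on a time polynomial plus trigonometric seasonal terms, and common factors are then extracted from the irregular components by principal components. The polynomial order, number of harmonics and number of factors are illustrative choices here; the paper selects them with a BIC and a test statistic.

```python
import numpy as np

def structural_factor_sketch(Y, period=52, poly_order=1, n_harmonics=2, n_factors=1):
    """Y: (T, N) panel of time series; returns fitted trend+seasonal and PC factors."""
    T, N = Y.shape
    t = np.arange(T)
    # Deterministic regressors: polynomial trend + trigonometric seasonal terms.
    cols = [t**p for p in range(poly_order + 1)]
    for k in range(1, n_harmonics + 1):
        cols += [np.sin(2 * np.pi * k * t / period), np.cos(2 * np.pi * k * t / period)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)      # series-by-series OLS
    irregular = Y - X @ beta                           # irregular components
    # Principal-component factors of the irregular panel.
    Z = irregular - irregular.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    factors = Z @ Vt[:n_factors].T                     # (T, n_factors) common factors
    return X @ beta, factors
```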

10:45-11:15 Coffee Break
11:15-13:15 Session 2
Chair:
Helena Veiga (Universidad Carlos III de Madrid, Spain)
11:15
Antoni Espasa (Universidad Carlos III de Madrid, Spain)
Top-down and bottom-up disaggregation approaches for econometric analysis of macro variables

ABSTRACT. One important phenomenon that has been taking place for several decades now is the steadily growing flow of information available to analysts. In particular, the data published by national official statistics are increasingly available at a higher degree of disaggregation, at the regional, temporal and sectoral levels. Nevertheless, econometric work on economic aggregates usually does not consider the information contained in the disaggregates. However, the data generation process (DGP) of a macro variable refers to the data of all its components, and this information should not be ignored in an efficient analysis. Moreover, the exclusive use of macro data usually implies the imposition of invalid restrictions on that DGP.

The modelling of the components is of great interest in itself. Decision makers need to analyse disaggregates to gain specific knowledge of them, to grasp a better understanding of the aggregate and, eventually, to make better forecasts and decisions. Identical or similar growth values of a macro variable at two moments in time could represent very different situations depending on the growth momentum of the components, which respond differently to the factors causing growth. The advantage of having a great amount of information when using the components of a macro variable could dissipate with the greater estimation uncertainty that derives from the need for higher-dimensional models. The curse of dimensionality could, however, be alleviated by discovering good disaggregation structures that really increase the information and/or allow the introduction of valuable restrictions when modelling the disaggregates.

The disaggregation structure that should be employed in an econometric analysis is not given at the outset. For many macro variables, disaggregation is available by sectors and regions. Restricting attention to the former, one can use the components at the maximum level of disaggregation (we denote them basic components) or disaggregations whose components are sub-aggregates of the basic components. For instance, national statistical offices publish CPI data according to the COICOP classification: the overall index can be disaggregated into several subgroups, each subgroup into classes and, finally, each class into several subclasses. This can be referred to as a disaggregation tree. In this example the subclasses correspond to the basic components mentioned above and usually number more than a hundred. Other breakdowns can also be built by the analyst from the basic components, as we discuss in the paper. For instance, carefully designed disaggregated approaches were built over the 22 years, 1994-2016, in which the Bulletin of EU & US Inflation and Macroeconomic Analysis (BIAM, the acronym of the Spanish name of the publication) was in force. We therefore need criteria that could lead us to a useful disaggregation. From the work in Lütkepohl (1987) it can be argued that disaggregate analysis is more relevant when the distributional properties of the components are quite different or when there are cross-restrictions between them.

With these hints, in this paper we analyse two types of disaggregation approaches. One is the top-down approach followed in the BIAM, discussed in Espasa and Senra (2017). It emphasizes obtaining progressive disaggregations in which the components have very different distributional properties in terms of trend, seasonality, breaks, persistence, non-linearity, etc., and in terms of having different explanatory variables in the formulation of the conditional means. The other is the bottom-up approach, initially proposed by Espasa and Mayo-Burgos (2013) and extended by Carlomagno and Espasa (2016, 2017), who provide a more precise procedure and study its statistical properties in large and small samples. In this case the analysis starts from the basic components. The aim is to group the basic components into subsets such that the elements of each subset share a single trend, a single cycle or both. The basic components can then be modelled with single-equation models, taking into account, where applicable, the restrictions coming from the common features. Besides ending up with individual models for the components, this approach also yields a very informative grouping of the basic components into subsets whose elements share a single trend, cycle or both; Espasa and Mayo-Burgos (2013) called it a 'disaggregation map'. The study of disaggregation maps of different variables, e.g. production and employment, and their comparison across national economies could be highly illustrative. For a procedure that uses the forecasts of the N basic components in nowcasting, see Castle and Hendry (2010).

These two disaggregation approaches are related to the hierarchical approach proposed by Hyndman and co-authors in several papers; see, for instance, Hyndman et al. (2011). In that literature the disaggregation tree is, in principle, fixed. In the top-down approach mentioned above, a useful disaggregation is discovered by applying economic, institutional and statistical properties of the disaggregates, as discussed in Espasa and Senra (2017). In the bottom-up method, the main interest lies in finding common features among the components in order, among other things, to improve their forecasts. In this paper the disaggregation scheme for modelling and forecasting is something to be discovered, whereas in the hierarchical methods it is something given, even when it is possible to generate an optimal forecast from different disaggregation trees. The paper discusses how to implement the two disaggregation approaches and how to use their results. Finally, it analyses how to combine them to produce more precise disaggregation maps and more accurate forecasts of the components and, indirectly, of the aggregate.

References.
BIAM 266. 2016. Bulletin of EU & US Inflation and Macroeconomic Analysis. Universidad Carlos III de Madrid. http://www.uc3m.es/biam.
Carlomagno, Guillermo, and Antoni Espasa. 2016. Discovering common trends in a large set of disaggregates: statistical procedures and their properties. Working paper Statistics and Econometrics 15-19, Universidad Carlos III de Madrid. http://hdl.handle.net/10016/21574.
Carlomagno, Guillermo, and Antoni Espasa. 2017. Discovering pervasive and non-pervasive common cycles. Working paper Statistics and Econometrics 17-16, Universidad Carlos III de Madrid. http://hdl.handle.net/10016/25392.
Castle, Jennifer L., and David F. Hendry. 2010. Nowcasting from disaggregates in the face of location shifts. Journal of Forecasting 29(1-2): 200-214.
Espasa, Antoni, and Iván Mayo-Burgos. 2013. Forecasting aggregates and disaggregates with common features. International Journal of Forecasting 29(4): 718-732.
Espasa, Antoni, and Eva Senra. 2017. Twenty-two Years of Inflation Assessment and Forecasting Experience at the Bulletin of EU & US Inflation and Macroeconomic Analysis. Econometrics 5, 44. doi:10.3390/econometrics5040044.
Hyndman, Rob J., Roman Ahmed, George Athanasopoulos, and Han Lin Shang. 2011. Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis 55(9): 2579-2589.
Lütkepohl, Helmut. 1987. Forecasting Aggregated Vector ARMA Processes. Berlin: Springer-Verlag.
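
A schematic illustration of the bottom-up idea described in the abstract, not the authors' procedure: model each basic component separately, forecast it, and obtain the aggregate forecast indirectly as the sum of the component forecasts. The paper's method additionally groups components sharing common trends/cycles and imposes those restrictions; here each component just gets an illustrative AR(1).

```python
import numpy as np

def ar1_forecast(x, steps=1):
    """One-parameter AR(1) fitted by least squares on a demeaned series."""
    mu = x.mean()
    z = x - mu
    phi = (z[1:] @ z[:-1]) / (z[:-1] @ z[:-1])
    forecasts, last = [], z[-1]
    for _ in range(steps):
        last = phi * last
        forecasts.append(last + mu)
    return np.array(forecasts)

def bottom_up_forecast(components, steps=1):
    """components: (T, N) array of basic component series; returns aggregate forecasts."""
    comp_fc = np.column_stack([ar1_forecast(components[:, i], steps)
                               for i in range(components.shape[1])])
    return comp_fc.sum(axis=1)   # indirect forecast of the aggregate
```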

11:45
Esther Ruiz (Universidad Carlos III de Madrid, Spain)
Gloria Gonzalez (University of California, Riverside, United States)
Javier Vicente (Universidad Carlos III de Madrid, Spain)
Macroeconomic Value in Stress

ABSTRACT. In this paper, we propose a new risk index to measure growth in stressed scenarios of underlying economic factors. The proposed methodology is based on predictive regressions of each country's output growth augmented with common factors as predictors. The factors are extracted using principal components (PC) from a large set of macroeconomic growth rates modeled using Dynamic Factor Models (DFMs), with the factors' uncertainty computed using a subsampling procedure. The new index, denoted Macroeconomic Value in Stress (MViS), is computed for 87 countries using data on macroeconomic growth observed annually from 1985 to 2015.
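
A loose sketch of the ingredients described above: principal-component factors extracted from a panel of growth rates and a factor-augmented predictive regression for one country's output growth. The actual MViS construction (DFM estimation, subsampling for factor uncertainty, stressed scenarios) is richer; the function names and the Gaussian quantile approximation below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def pc_factors(G, n_factors=2):
    """G: (T, N) panel of macroeconomic growth rates; returns (T, n_factors) PC factors."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_factors].T

def predictive_regression(y, F):
    """Regress country growth y_{t+1} on factors F_t (one-step-ahead predictive regression)."""
    X = np.column_stack([np.ones(len(F) - 1), F[:-1]])
    beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    resid = y[1:] - X @ beta
    return beta, resid.std(ddof=X.shape[1])

def growth_under_stress(beta, sigma, stressed_factors, alpha=0.05):
    """Lower alpha-quantile of predicted growth under a stressed factor scenario
    (Gaussian approximation, purely illustrative)."""
    mean = beta[0] + stressed_factors @ beta[1:]
    return mean + norm.ppf(alpha) * sigma
```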

12:15
Charles Bos (Dept. of Econometrics & Operations Research, Vrije Universiteit Amsterdam, Netherlands)
Estimability of endogenous selection models

ABSTRACT. In an age where information is apparently plentiful, the question arises as to how much information really is contained in the data. The present article studies the endogenous selection model, where a binary outcome variable is only observed if a (binary) decision variable comes out as a 1.

Such a case appears naturally with questionnaires ('Will you participate? If yes, did you go through an unemployment spell?'), but also in large-scale high-frequency financial databases ('Are stocks A and B moving? If yes, do they move in the same direction?'). The existing literature focuses on the question of whether the parameters in such models are theoretically estimable, or whether in specific cases the parameter governing the correlation between the two decisions may be driven towards a corner solution.

The present article intends to shed light on the information content of such data sets. Is it possible to estimate the regression parameters of the two equations together with a possible correlation? What size of dataset is needed to allow for precise estimation, especially of the correlation between the decisions? What is the main cause of a possible lack of information: is it the missing observations (when the decision variable is 0), or the binary nature of the data? What improvement is found if a Tobit-like outcome equation is observed, and how does this relate to the information in a full-information bivariate regression situation?

Results indicate how the information content is indeed lowered when moving from the full regression via Tobit and Probit towards the binary selection model. In the latter case, information is extremely scarce, and the asymptotic information matrix is numerically no longer of full rank. In practice, it is found that samples of several million data points would be necessary before the correlation can be estimated confidently. This result effectively rules out the use of the binary selection model for questionnaire-type data, unless prior information on the strength of the correlation is imposed.
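
A small simulation sketch of the endogenous binary selection setting discussed above: a binary outcome y is observed only when a binary decision d equals 1, with correlated latent errors. The log-likelihood below is a textbook bivariate-probit-with-selection likelihood written for illustration, not the author's code; all parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
n, rho = 5_000, 0.5
x = rng.normal(size=n)                                   # selection-equation regressor
w = rng.normal(size=n)                                   # outcome-equation regressor
eps = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
d = (0.5 + 1.0 * x + eps[:, 0] > 0).astype(int)          # decision: observe or not
y_latent = (-0.2 + 1.0 * w + eps[:, 1] > 0).astype(int)  # underlying binary outcome
y = np.where(d == 1, y_latent, np.nan)                   # observed only if d == 1

def loglik(params):
    """Log-likelihood of the bivariate-probit selection model (illustrative)."""
    a0, a1, b0, b1, r = params
    s = a0 + a1 * x                                      # selection index
    o = b0 + b1 * w                                      # outcome index
    bvn = multivariate_normal(mean=[0, 0], cov=[[1, r], [r, 1]])
    p11 = bvn.cdf(np.column_stack([s, o]))               # P(d=1, y=1)
    p10 = np.clip(norm.cdf(s) - p11, 1e-300, None)       # P(d=1, y=0)
    ll = np.where(d == 0, norm.logcdf(-s),
                  np.where(y == 1, np.log(np.clip(p11, 1e-300, None)), np.log(p10)))
    return ll.sum()

# In principle this can be maximized numerically (with |r| < 1 enforced); the talk's
# point is how weakly identified r is in samples of this kind.
```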

12:45
Szabolcs Blazsek (School of Business, Guatemala)
Alvaro Escribano (Universidad Carlos III de Madrid, Spain)
Adrian Licht (School of Business, Guatemala)
Score-Driven Multivariate Dynamic Location Models

ABSTRACT. In this paper, we introduce a new model by extending the dynamic conditional score (DCS) model of the multivariate t-distribution and name it the quasi-vector autoregressive (QVAR) model. QVAR is a score-driven nonlinear multivariate dynamic location model, in which the conditional score vector of the log-likelihood (LL) updates the location of the dependent variables. For QVAR, we present the details of the econometric formulation, the computation of the impulse response function, and the maximum likelihood (ML) estimation and the related conditions for consistency and asymptotic normality. As an illustration, we use quarterly data for the period 1987:Q1 to 2013:Q2 on the following variables: the quarterly percentage change in the real price of crude oil, the quarterly United States (US) inflation rate, and quarterly US real gross domestic product (GDP) growth. We find that the statistical performance of QVAR is superior to that of VAR and VARMA. Interestingly, stochastic annual cyclical effects with decreasing amplitude are found for QVAR, whereas such cyclical effects are not found for VAR or VARMA.

The full version of the paper, "Score-driven nonlinear multivariate dynamic location models", is available as a 2017 working paper of the Department of Economics at https://e-archivo.uc3m.es/handle/10016/25739#preview
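
A one-dimensional illustration of the score-driven location idea behind QVAR, not the paper's multivariate model: the conditional location is updated by the score of a Student-t log-likelihood, so outlying observations are automatically downweighted relative to a Gaussian/VAR update. Parameter values below are arbitrary assumptions.

```python
import numpy as np

def dcs_t_location_filter(y, omega=0.0, phi=0.95, kappa=0.1, nu=5.0, sigma=1.0):
    """Filtered locations mu_t for observations y under a DCS-t-type recursion."""
    mu = np.empty_like(y, dtype=float)
    mu[0] = y[0]
    for t in range(len(y) - 1):
        e = y[t] - mu[t]
        # Score of the Student-t log-density with respect to the location:
        # large |e| is damped, unlike the linear update of a Gaussian model.
        score = (nu + 1.0) * e / (nu * sigma**2 + e**2)
        mu[t + 1] = omega + phi * mu[t] + kappa * score
    return mu
```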

13:30-15:30 Lunch
15:30-17:30 Session 3
Chair:
Esther Ruiz (Universidad Carlos III de Madrid, Spain)
15:30
Qiwei Yao (London School of Economics, UK)
Testing for High-dimensional White Noise

ABSTRACT. Testing for white noise is a fundamental problem in statistical inference, as many testing problems in linear modelling can be transformed into a white noise test. While the celebrated Box-Pierce test and its variants are often applied for model diagnostics, their relevance in the context of high-dimensional modelling is not well understood, as their asymptotic null distributions are established for fixed dimensions. Furthermore, those tests typically lose power when the dimension of the time series is relatively large in relation to the sample size. In this talk we introduce two new omnibus tests for high-dimensional time series.

The first method uses the maximum absolute autocorrelations and cross-correlations of the component series as the test statistic. Based on an approximation by the L-infinity norm of a normal random vector, the critical value of the test can be evaluated by bootstrapping from a multivariate normal distribution. In contrast to the conventional white noise test, the new method is proved to be valid for testing departure from white noise that is not independent and identically distributed.

The second test statistic is defined as the sum of the squared singular values of the first q lagged sample autocovariance matrices. It therefore encapsulates all the serial correlations (up to time lag q) within and across all component series. Using tools from random matrix theory, we derive the normal limiting distributions when both the dimension and the sample size diverge to infinity.

(Joint work with Jinyuan Chang, Clifford Lam, Zeng Li, Jeff Yao and Wen Zhou.)
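
A plain-numpy sketch of the two test statistics described above, computed from a (T, p) panel y; the null distributions and critical values (normal-vector bootstrap for the first, random-matrix limit for the second) are not reproduced here, and the lag choice q is an illustrative assumption.

```python
import numpy as np

def lagged_autocov(y, k):
    """Sample autocovariance matrix at lag k for a (T, p) panel."""
    z = y - y.mean(axis=0)
    T = z.shape[0]
    return z[k:].T @ z[:-k] / T if k > 0 else z.T @ z / T

def max_abs_corr_stat(y, q=5):
    """Maximum absolute auto-/cross-correlation over lags 1..q (first test statistic)."""
    s = np.sqrt(np.diag(lagged_autocov(y, 0)))
    return max(np.max(np.abs(lagged_autocov(y, k) / np.outer(s, s)))
               for k in range(1, q + 1))

def sum_sq_singular_stat(y, q=5):
    """Sum of squared singular values of the first q lagged autocovariance matrices
    (second test statistic); equals the sum of their squared Frobenius norms."""
    return sum(np.sum(lagged_autocov(y, k) ** 2) for k in range(1, q + 1))
```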

16:00
Agustín Maravall (Bank of Spain, Spain)
Some notes on TRAMO-SEATS and its saga

ABSTRACT. Seasonal adjustment (SA) has experienced a golden age over the last 25 years. Producers of SA series are typically institutions such as statistical institutes, central banks, or large companies, and the number of series routinely adjusted can often reach hundreds of thousands. Since 1964 a program (X11) of the US Bureau of the Census (USBC) had dominated the SA world; the method was based on a-priori-designed fixed filters. As the years went by, the number of series to be adjusted grew exponentially, and the limits of X11 became more and more apparent. The European Statistical System (ESS), which includes National Statistical Offices, Eurostat, Central Banks and the European Central Bank (ECB), decided to search for an SA procedure that could improve upon X11 and also foster harmonization of the procedures among European countries. The effort of the ESS turned out to be contagious and many national and international agencies and institutions joined the search. This, in turn, fostered basic research on SA.

The USBC had been involved for some years in improving X11 and in 1993 presented the improved version, X12-ARIMA (X12A). It coincided in time with the appearance of two programs, TRAMO and SEATS (TS), which applied to SA a model-based approach using regression-ARIMA type models; the approach was drastically different from X12A. The recommendation of TS upset statisticians in charge of SA at many institutions. It implied a drastic change of paradigm: from the fixed filters of X11 to an entirely model-based method, in which the model for seasonality is derived from the structure of the series. The new approach had to be learned and understood, so heavy teaching would be needed, and few teachers were available. Besides, adapting TS to operating systems and databases was not trivial. Furthermore, if TS were to be adopted and applied massively, doubts and problems would inevitably pop up; institutional support behind TS would be badly needed. TS was, to a large extent, the result of academic research at the European University Institute (EUI), a very small university centered on the Social Sciences; this uncertainty about support led to the addition of X12A as an acceptable method.

The Bank of Spain (BS) agreed to provide basic support and TS moved to Spain in 1996. It took 5 years for the support to materialize; in 1999, one of the two developers of TS gave up out of frustration. By then, the ESS and the USBC were already working on two interfaces, JDemetra+ (JD+) and X13-ARIMA-SEATS (X13A-S), respectively. (The two interfaces became fully operational in 2014. Because their main interest was SA, parts of TS were omitted.) In 2001 the BS support finally arrived, mostly due to ECB pressure. The TS situation improved and, in 2007, the National Statistical Institute of Hungary summarized for Eurostat the SA procedures in 32 countries: in the EU countries about 80% of the institutions used TS, 20% of them in combination with X12A. In 2011 the United Nations ("Practical Guide to SA") recommended TS and/or X12A. Then, in December 2014 the remaining developer of TS retired; support from the BS ended and the Time Series Unit disappeared. Given that programs have to be updated when operating environments (such as FAME or SAS) change, TS users (including the BS) have been forced to move to JD+ or X13A-S.

The presentation will briefly summarize the peculiar TS saga. Then, the SEATS model-based structure is reviewed, and it is shown how it affects diagnostics and inference. Some comparisons with alternative approaches (such as X11 or X12) are made, and relevant features of TS that have not been included in the ESS and USBC interfaces (examples are business-cycle estimation and data editing) are described. Finally, relevant features of the TS method that affect some standard macroeconometric procedures are illustrated.

16:30
Victor M. Guerrero (Instituto Tecnologico Autonomo de Mexico (ITAM), Mexico)
Francisco Corona (Instituto Nacional de Estadística y Geografía (INEGI), Mexico)
Optimal retropolation of Mexico's three Grand Economic Activities

ABSTRACT. This paper is concerned with the estimation of past values of the quarterly time series of the three Grand Economic Activities for each of the 32 Mexican states, backwards from 2002 to 1980. The resulting figures should satisfy several temporal and contemporaneous accounting constraints imposed by the System of National Accounts. The estimated database is then joined to its recorded official counterpart produced by the National Institute of Statistics and Geography (INEGI) from 2003 onwards to obtain a homogeneous quarterly dataset covering the years 1980 to 2016, with a breakdown by state and Grand Economic Activity, at constant prices of 2013. The literature on related subjects presents many statistical procedures that analysts usually apply at different stages of similar projects, some of them clearly automatic and heuristic, others model-based. However, those procedures do not use the underlying models appropriately, since they do not consider empirical validation formally. The available methods are: (i) Conversion Techniques, to change base years of the same variable expressed at different prices, including a new industrial classification of economic activities; (ii) Temporal and/or Contemporaneous Disaggregation, to estimate observations of a variable originally observed without the required temporal, sectoral or geographical disaggregation; (iii) Restricted Retropolation, to extend the record of the database backwards beyond the years originally recorded in the official databases; (iv) Reconciliation, so that estimated state figures are made compatible with the national ones, according to the rules imposed by the System of National Accounts. Here, we propose using formal and efficient statistical time series methods to carry out most of the required tasks. The basic tools employed are linear time series models expressed as univariate Auto-Regressive (AR) models or as multivariate Vector Auto-Regressive (VAR) models. We checked the empirical validity of such models with the data at hand by verifying the fulfillment of their underlying assumptions, particularly stationarity and the absence of significant residual autocorrelation. This part of the model-building process was very time consuming, but it gave us the confidence to say that the methods were correctly applied, allowing us to claim that the results are statistically optimal. The most important result, applied to combine the only five different official and publicly available databases at INEGI, is the Basic Combining Rule established by [1]. This paper illustrates the results obtained for Mexico City; a longer version of this document, available at INEGI, contains the detailed results for the remaining 31 states, as well as for Mexico as a whole. The most important conclusion of this work is that we were able to estimate a homogeneous database by combining all the publicly available official databases in an optimal statistical manner.
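
A loose illustration of the retropolation (backcasting) step only: fit an AR model to the time-reversed series and "forecast" it, which yields backcasts under the time-reversibility of stationary AR processes. The paper's procedure additionally validates the models empirically and imposes the temporal and contemporaneous accounting constraints of the System of National Accounts, none of which is reproduced here.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

def backcast(y, n_back=8, lags=4):
    """Backcast n_back values before the start of series y using a reversed AR fit."""
    rev = np.asarray(y, dtype=float)[::-1]     # reverse time
    res = AutoReg(rev, lags=lags).fit()
    fc = res.forecast(steps=n_back)            # forecasts of the reversed series
    return np.asarray(fc)[::-1]                # re-reverse: earliest backcast first
```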

17:00
Antonio Garcia-Ferrer (Universidad Autonoma de Madrid, Spain)
Marcos Bujosa (Universidad Complutense de Madrid, Spain)
Arancha de Juan (Universidad Autonoma de Madrid, Spain)
Antonio Martin-Arroyo (Universidad Autonoma de Madrid, Spain)
Evaluating early warning and coincident indicators of business cycles using trends. Do forecast combinations help?

ABSTRACT. In this paper, we present two composite indicators, a coincident and a leading one, designed respectively to capture the state of the Spanish economy and to provide reliable statistical forecasting power. Our approach, based on trends, guarantees that the resulting indicators are reasonably smooth and issue stable signals, reducing the uncertainty associated with the issuance of false signals. The coincident indicator has been assessed by comparing it with the index recently proposed by the Spanish Economic Association. Both indices show similar behavior, and ours captures very well the beginning and end of the official recession and expansion periods. Our coincident indicator also tracks very well alternative mass-media indicators typically used in the political science literature. On the other hand, our leading indicator systematically predicts the peaks and troughs of the business cycle and provides significant help in forecasting annual GDP growth rates for the period 2001-2016. Using only data available at the beginning of each forecast period, our leading indicator's one-step-ahead forecasts show sizable improvements over other alternatives, including panels of professional forecasters and different forecast combinations.
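
A small sketch of the forecast-combination idea evaluated in the paper ("Do forecast combinations help?"): equal-weight averaging versus weights inversely proportional to each forecaster's past mean squared error. Purely illustrative, not the authors' combination scheme.

```python
import numpy as np

def combine_forecasts(F, y_past=None, F_past=None):
    """F: (h, m) matrix of m competing forecasts over h periods.
    If past outcomes y_past and past forecasts F_past are given, use inverse-MSE weights;
    otherwise fall back to the equal-weight average."""
    if y_past is None or F_past is None:
        return F.mean(axis=1)                                 # equal-weight combination
    mse = ((F_past - y_past[:, None]) ** 2).mean(axis=0)      # past MSE per forecaster
    w = (1.0 / mse) / (1.0 / mse).sum()                       # inverse-MSE weights
    return F @ w
```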

17:30-18:00 Coffee Break
18:00-19:30 Session 4
Chair:
Ignacio Cascos (Universidad Carlos III de Madrid, Spain)
18:00
Carolina Euan (King Abdullah University of Science and Technology, Saudi Arabia)
Ying Sun (KAUST, Saudi Arabia)
Hernando Ombao (KAUST, Saudi Arabia)
Coherence-based Time Series Clustering for Brain Connectivity Visualization

ABSTRACT. We develop the hierarchical cluster coherence (HCC) method for brain signals, a procedure for characterizing connectivity in a network by clustering nodes or groups of channels that display a high level of coordination as measured by "cluster coherence." While the most common approach to measuring dependence between clusters is through pairs of single time series, our method proposes cluster coherence, which measures dependence between whole clusters rather than between single elements. Thus, it takes into account both the dependence between clusters and the dependence among channels within a cluster. The identified clusters contain time series that exhibit high cross-dependence in the spectral domain; that is, these clusters correspond to connected brain regions with synchronized oscillatory activity. In simulation studies, we show that the proposed HCC outperforms commonly used clustering algorithms, such as average-coherence and minimum-coherence based methods. To study clustering in a network of multichannel electroencephalograms (EEG) during an epileptic seizure, we applied the HCC method and identified connectivity in the alpha (8-12 Hz) and beta (16-30 Hz) bands at different phases of the recording: before the epileptic seizure and during the early and middle phases of the seizure episode.
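
A simplified sketch of coherence-based hierarchical clustering of channels using pairwise band-averaged coherence; the HCC method in the talk goes further by measuring coherence between whole clusters rather than between pairs of channels. The sampling rate, band edges and linkage choice below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import coherence
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coherence_clusters(X, fs=100.0, band=(8.0, 12.0), n_clusters=3):
    """X: (n_channels, T) array of signals; clusters channels by alpha-band coherence."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            f, Cxy = coherence(X[i], X[j], fs=fs, nperseg=256)
            in_band = (f >= band[0]) & (f <= band[1])
            D[i, j] = D[j, i] = 1.0 - Cxy[in_band].mean()   # dissimilarity in the band
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```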

18:30
Kalliopi Mylona (Universidad Carlos III de Madrid/King's College London, Spain)
Emily S. Matthews (University of Southampton, UK)
David C. Woods (University of Southampton, UK)
Supersaturated split-plot experiments and industrial applications

ABSTRACT. Screening is a key step in early industrial and scientific experimentation to identify those factors that have a substantive impact on the response. Practical screening experiments often have to be performed with limited resources and in the presence of restrictions on the randomisation of the design due to, for example, the need to incorporate hard-to-change factors via a split-plot design. Moreover, a massive increase in data size often arises from varying many more factors or features (the "large p, small n" problem). Supersaturated designs, with fewer runs than the number of potential individual and joint factor effects, are now a common tool for screening experiments. We present theoretical methodology that generalises this class of design to split-plot experiments. A linear mixed effects model is used to describe the response from such experiments, and methods for optimal design, model selection and variance-component estimation are developed and presented. Industrial examples from the materials and pharmaceutical sciences are used to demonstrate new approaches to both the design and analysis of such supersaturated split-plot experiments.
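
A minimal sketch of the linear mixed-effects model used to describe responses from a split-plot experiment: fixed effects for (a subset of) factors and a random intercept for each whole plot, capturing the restricted randomisation. Factor names and data are invented for illustration; the paper's design construction, model selection and variance-component estimation methods are not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_wp, n_sp = 8, 4                                   # whole plots x subplots per plot
df = pd.DataFrame({
    "wholeplot": np.repeat(np.arange(n_wp), n_sp),
    "A": rng.choice([-1, 1], n_wp).repeat(n_sp),    # hard-to-change (whole-plot) factor
    "B": rng.choice([-1, 1], n_wp * n_sp),          # easy-to-change (subplot) factor
})
wp_effect = rng.normal(scale=0.5, size=n_wp)[df["wholeplot"]]
df["y"] = 1.0 + 2.0 * df["A"] + 0.5 * df["B"] + wp_effect + rng.normal(scale=0.3, size=len(df))

# Random intercept per whole plot reflects the split-plot error structure.
model = smf.mixedlm("y ~ A + B", df, groups=df["wholeplot"]).fit()
print(model.summary())
```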

19:00
Eliud Silva (Universidad Anahuac México, Mexico)
Identification of main variables to predict the homicides in Mexico

ABSTRACT. In Mexico, since the so-called war on drug trafficking began in 2006, the lifestyle and mortality dynamics of both men and women have changed. In fact, life expectancy has shown an atypical evolution. Previous results, for example [1] and [2], show the effects of this situation on Mexican life expectancy. This phenomenon is heterogeneous and increasing in almost all parts of the country. In this presentation, official mortality datasets from selected years are employed to identify the main registered variables that predict the occurrence of deaths by homicide. For this purpose, I focus on machine learning classification with the C5.0 decision tree algorithm and the R statistical software. Several strategies are suggested to make the estimates and to assess the robustness of the results: random seeds, partition percentages, and the parameters required by the algorithm. The results give a perspective on the main variables that affect the incidence of homicides in Mexico and, thereby, one possibility to support and implement public health policies for its prevention.
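
The talk uses the C5.0 algorithm in R; the sketch below shows an analogous workflow with a CART-style decision tree in Python (scikit-learn), including the seed and partition-percentage variations mentioned as robustness checks. The data and column names are synthetic, hypothetical stand-ins for registered mortality variables, not the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for registered mortality records (hypothetical schema).
rng = np.random.default_rng(3)
n = 2_000
df = pd.DataFrame({
    "sex": rng.choice(["M", "F"], n),
    "age_group": rng.choice(["15-29", "30-44", "45-64", "65+"], n),
    "place_of_death": rng.choice(["home", "public_road", "hospital"], n),
})
df["is_homicide"] = ((df["place_of_death"] == "public_road") &
                     (rng.random(n) < 0.6)).astype(int)

X = pd.get_dummies(df.drop(columns="is_homicide"))
y = df["is_homicide"]

# Vary the seed and the partition percentage, as the abstract suggests, to check robustness.
for seed, test_share in [(1, 0.2), (2, 0.3)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_share,
                                              random_state=seed, stratify=y)
    tree = DecisionTreeClassifier(max_depth=5, random_state=seed).fit(X_tr, y_tr)
    print(seed, test_share)
    print(classification_report(y_te, tree.predict(X_te), zero_division=0))
```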

20:30-22:30 Dinner