High-dimensional data in economics and their (robust) analysis

Jan Kalina

doi:10.5937/sjm12-10778

Jan Kalina, Institute of Computer Science of the Czech Academy of Sciences

Abstract

This work is devoted to statistical methods for the analysis of economic data with a large number of variables. It reviews references documenting that such data are increasingly available in various theoretical and applied economic problems and that their analysis can hardly be performed with standard econometric methods. The paper focuses on high-dimensional data with a small number of observations and gives an overview of recently proposed methods for their analysis in the context of econometrics, particularly in the areas of dimensionality reduction, linear regression, and classification analysis. Further, the performance of various methods is illustrated on a publicly available benchmark data set on credit scoring. Unlike in the work of other authors, robust methods designed to be insensitive to the presence of outlying measurements are also used. Their strength is revealed after artificially contaminating the original data with noise. In addition, the performance of various methods for a prior dimensionality reduction of the data is compared.

Author biography

Jan Kalina, Institute of Computer Science of the Czech Academy of Sciences

Head of the Department of Medical Informatics and Biostatistics
