High-dimensional data in economics and their (robust) analysis
Abstract
This work is devoted to statistical methods for the analysis of economic data with a large number of variables. The authors review references documenting that such data are increasingly available in various theoretical and applied economic problems and that their analysis can hardly be performed with standard econometric methods. The paper focuses on high-dimensional data with a small number of observations and gives an overview of recently proposed methods for their analysis in the context of econometrics, particularly in the areas of dimensionality reduction, linear regression and classification analysis. Further, the performance of various methods is illustrated on a publicly available benchmark data set on credit scoring. Unlike in the work of other authors, robust methods, designed to be insensitive to the presence of outlying measurements, are also used. Their strength becomes apparent after the original data are artificially contaminated by noise. In addition, various methods for prior dimensionality reduction of the data are compared in terms of their performance.