How to reduce dimensionality of data: Robustness point of view
Abstract
Data analysis in management applications often requires handling data with a large number of variables. Dimensionality reduction is therefore a common and important step in the analysis of multivariate data by methods of both statistics and data mining. This paper gives an overview of robust dimensionality reduction procedures, which are resistant to the presence of outlying measurements. A simulation study is the main contribution of the paper. It compares various standard and robust dimensionality reduction procedures in combination with standard and robust methods of classification analysis. Standard methods turn out to perform acceptably on data that are only slightly contaminated by outliers; for highly contaminated data, we give practical recommendations on the choice of a suitable robust dimensionality reduction method. In particular, the highly robust principal component analysis based on the projection pursuit approach yields the most satisfactory results across four different simulation studies. We also give recommendations on the choice of a suitable robust classification method.