How to reduce dimensionality of data: Robustness point of view

  • Jan Kalina Institute of Computer Science of the Academy of Science of the Czech Republic
  • Dita Rensová Charles University in Prague, Faculty of Mathematics and Physics

Abstract


Data analysis in management applications often requires handling data with a large number of variables. Dimensionality reduction is therefore a common and important step in the analysis of multivariate data by methods of both statistics and data mining. This paper gives an overview of robust dimensionality reduction procedures, which are resistant to the presence of outlying measurements. The main contribution of the paper is a simulation study that compares various standard and robust dimensionality reduction procedures in combination with standard and robust methods of classification analysis. Standard methods turn out not to perform too badly on data that are only slightly contaminated by outliers; for highly contaminated data, we give practical recommendations concerning the choice of a suitable robust dimensionality reduction method. Namely, the highly robust principal component analysis based on the projection pursuit approach yields the most satisfactory results across four different simulation studies. We also give recommendations on the choice of a suitable robust classification method.
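To make the idea of projection pursuit principal component analysis concrete, the following is a minimal sketch in Python and not the implementation used in the paper. It assumes the median absolute deviation (MAD) as the robust scale, the coordinatewise median as the robust center, and the directions through the centered data points as the only candidate directions; the function names `mad` and `pp_robust_pca` and the simulated data are hypothetical and chosen purely for illustration.

```python
import numpy as np

def mad(x):
    """Median absolute deviation, rescaled to be consistent at the normal model."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def pp_robust_pca(X, n_components=2):
    """Sketch of projection pursuit PCA: each component is the candidate
    direction that maximizes a robust scale (here MAD) of the projections."""
    X = np.asarray(X, dtype=float)
    # Robust centering by the coordinatewise median (an assumption of this sketch).
    Z = X - np.median(X, axis=0)
    components = []
    for _ in range(n_components):
        norms = np.linalg.norm(Z, axis=1)
        keep = norms > 1e-12
        # Candidate directions: unit vectors pointing at the (deflated) observations.
        candidates = Z[keep] / norms[keep, None]
        scales = np.array([mad(Z @ a) for a in candidates])
        best = candidates[np.argmax(scales)]
        components.append(best)
        # Deflation: remove the found direction so the next one is orthogonal to it.
        Z = Z - np.outer(Z @ best, best)
    return np.array(components)

# Simulated data (hypothetical): most variability along the first coordinate,
# plus 10 % of outliers shifted far along the second coordinate.
rng = np.random.default_rng(0)
clean = rng.normal(size=(270, 3)) * np.array([5.0, 1.0, 1.0])
outliers = rng.normal(size=(30, 3)) + np.array([0.0, 40.0, 0.0])
X = np.vstack([clean, outliers])

robust = pp_robust_pca(X, n_components=2)

# Classical PCA for comparison: leading eigenvector of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
classical_first = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order

print("robust first component:   ", np.round(robust[0], 3))
print("classical first component:", np.round(classical_first, 3))
```

In this contaminated example the classical first component is pulled toward the outliers, while the robust direction stays close to the axis carrying most of the variability of the clean data. Restricting the search to directions through the data points keeps the sketch short; a finer search over directions would be needed in practice.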

Author Biography

Jan Kalina, Institute of Computer Science of the Academy of Science of the Czech Republic

Head, Department of Medical Informatics and Biostatistics

Published
2014/09/26
Section
Review Paper