On robust information extraction from high-dimensional data

  • Jan Kalina, Institute of Computer Science of the Academy of Sciences of the Czech Republic

Abstract


Information extraction from high-dimensional data is an important problem in current applications in management or econometrics. From a practical point of view, a major difficulty is the sensitivity of machine learning methods to outlying data values; numerical stability is another important aspect of data mining from high-dimensional data. This paper gives an overview of various types of data mining, discusses their suitability for high-dimensional data, and critically examines their properties from the robustness point of view, explaining that robustness itself is perceived differently in different contexts. Moreover, we investigate properties of a robust nonlinear regression estimator of Kalina (2013).

Author Biography

Jan Kalina, Institute of Computer Science of the Academy of Sciences of the Czech Republic
Department of Medical Informatics and Biostatistics, Chair

References

Belloni A., Chernozhukov V., & Hansen C. (2011). Inference for high-dimensional sparse econometric models. Centre for Microdata Methods and Practice working paper 41/11. [Online] Available: http://arxiv.org/pdf/1201.0220.pdf (February 12, 2014)

Blankertz B., Tangermann M., Popescu F., Krauledat M., Fazli S., Dónaczy M., Curio G., & Müller K.R. (2008). The Berlin brain-computer interface. Lecture Notes in Computer Science, 5050, 79-101.

Bobrowski L., & Łukaszuk T. (2011). Relaxed linear separability (RLS) approach to feature (gene) subset selection. In X. Xia (Ed.), Selected Works in Bioinformatics (pp. 103-118). Rijeka: InTech.

Bobrowski L., & Łukaszuk T. (2012). Prognostic modeling with high dimensional and censored data. Lecture Notes in Computer Science, 7377, 178-193.

Brandl B., Keber C., & Schuster M. (2006). An automated econometric decision support system: Forecasts for foreign exchange trades. Central European Journal of Operations Research, 14, 401-415.

Christmann A., & Van Messem A. (2008). Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9, 915-936.

Dai J.J., Lieu L., & Rocke D. (2006). Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology, 5 (1), Article 6.

Duan N., & Li K.C. (1991). Slicing regression: A link-free regression method. Annals of Statistics, 19, 505-530.

Fernandez G. (2003). Data mining using SAS applications. Boca Raton: Chapman & Hall/CRC.

Funk M.J., Westreich D., Wiesen C., Stürmer T., Brookhart M.A., & Davidian M. (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology, 173 (7), 761-767.

Furlanello C., Serafini M., Merler S., & Jurman G. (2003). Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4, Article 54.

Greenland S. (2000). When should epidemiologic regressions use random coefficients? Biometrics, 56, 915-921.

Guo Y., Hastie T., & Tibshirani R. (2007). Regularized discriminant analysis and its application in microarrays. Biostatistics, 8 (1), 86-100.

Hall P., Marron J.S., & Neeman A. (2005). Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society B, 67 (3), 427-444.

Harrell F.E. (2001). Regression modeling strategies with applications to linear models, logistic regression, and survival analysis. New York: Springer.

Hastie T., Tibshirani R., & Friedman J. (2009). The elements of statistical learning. Data mining, inference, and prediction. New York: Springer.

Hesterberg T., Choi N.H., Meier L., & Fraley C. (2008). Least angle and l1 penalized regression: A review. Statistics Surveys, 2, 61-93.

Hubert M., Rousseeuw P.J., & Van Aelst S. (2008). High-breakdown robust multivariate methods. Statistical Science, 23, 92-119.

Jeng J.T., Chuang C.T., & Chuang C.C. (2011). Least trimmed squares based CPBUM neural networks. Proceedings International Conference on System Science and Engineering ICSSE 2011, Washington: IEEE Computer Society Press, 187-192.

Jurczyk T. (2012). Outlier detection under multicollinearity. Journal of Statistical Computation and Simulation, 82 (2), 261-278.

Jurečková J., & Picek J. (2006). Robust statistical methods with R. Boca Raton: Chapman & Hall/CRC.

Jurečková J., & Sen P.K. (2006). Robust multivariate location estimation, admissibility, and shrinkage phenomenon. Statistics & Decisions, 24, 273-290.

Kainen P.C., Kůrková V., & Sanguineti M. (2009). Complexity of Gaussian-radial-basis networks approximating smooth functions. Journal of Complexity, 25, 63-74.

Kalina J. (2014). Classification analysis methods for high-dimensional genetic data. Biocybernetics and Biomedical Engineering, 34 (1). In press.

Kalina J. (2013). Highly robust methods in data mining. Serbian Journal of Management, 8 (1), 9-24.

Kalina J., Seidl L., Zvára K., Grünfeldová H., Slovák D., & Zvárová J. (2013). Selecting relevant information for medical decision support with application to cardiology. European Journal for Biomedical Informatics, 9 (1), 2-6.

Kalina J. (2012). On multivariate methods in robust econometrics. Prague Economic Papers, 21, 69-82.

Kohonen T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59-69.

Liu X., Krishnan A., & Modry A. (2005). An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics, 6, Article 76.

Martinez W.L., Martinez A.R., & Solka J.L. (2011). Exploratory data analysis with MATLAB (2nd ed.). London: Chapman & Hall/CRC.

McFerrin L. (2013). Package HDMD. [Online] Available: http://cran.r-project.org/web/packages/HDMD/HDMD.pdf (June 14, 2013)

Mosteller F., & Tukey J.W. (1968). Data analysis, including statistics. In G. Lindzey, E. Aronson (Eds.), Handbook of Social Psychology, Vol. 2 (pp. 80-203). New York: Addison-Wesley.

Nisbet R., Elder J., & Miner G. (2009). Handbook of statistical analysis and data mining applications. Burlington: Elsevier.

Murtaza N., Sattar A.R., & Mustafa T. (2010). Enhancing the software effort estimation using outlier elimination methods for agriculture in Pakistan. Pakistan Journal of Life and Social Sciences, 8, 54-58.

Osuna E., Freund R., & Girosi F. (1997). Training support vector machines: An application to face detection. Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition CVPR 1997, Los Alamitos: IEEE Computer Society Press, 130-136.

Penn B.S. (2005). Using self-organizing maps to visualize high-dimensional data. Computers & Geosciences, 31 (5), 531-544.

Rousseeuw P.J., & van Driessen K. (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12, 29-45.

Rowley H., Baluja S., & Kanade T. (1998). Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 23-38.

Rusiecki A. (2008). Robust MCD-based backpropagation learning algorithm. Lecture Notes in Computer Science, 5097, 154-163.

Šebesta V., & Tučková J. (2005). The extraction of markers for the training of neural network dedicated for the speech prosody control. In S. Lecoeuche, D. Tsaptsinos (Eds.), Novel Applications of Neural Networks in Engineering International Conference on Engineering Applications of Neural Networks EANN’05, 245-250.

Smyth G.K. (2005). Limma: Linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, & W. Huber (Eds.), Bioinformatics and computational biology solutions using R and Bioconductor (pp. 397-420). New York: Springer.

Suzuki T., & Sugiyama M. (2010). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25, 725-758.

Turchi M., Perrotta D., Riani M., & Cerioli A. (2013). Robustness issues in text mining. Advances in Intelligent Systems and Computing, 190, 263-272.

Vanden Branden K., & Hubert M. (2005). Robust classification in high dimensions based on the SIMCA method. Chemometrics and Intelligent Laboratory Systems, 79, 10-21.

Vapnik V.N. (1995). The nature of statistical learning theory. New York: Springer.

Víšek J.Á. (2000). On the diversity of estimates. Computational Statistics & Data Analysis, 34, 67-89.

Xanthopoulos P., Pardalos P.M., & Trafalis T.B. (2013). Robust data mining. New York: Springer.

Zimmermann H.-G., Grothmann R., & Neuneier R. (2001). Multi-agent FX-market modeling by neural networks. Operations Research Proceedings, 2001, 413-420.

Zuber V., & Strimmer K. (2011). High-dimensional regression and variable selection using CAR scores. Statistical Applications in Genetics and Molecular Biology, 10 (1), Article 34.

Published
2013/12/14
Section
Review