On robust information extraction from high-dimensional data
Abstract
Information extraction from high-dimensional data is an important problem in current applications in management or econometrics. From a practical point of view, a major concern is the sensitivity of machine learning methods to outlying data values; numerical stability is another important aspect of data mining from high-dimensional data. This paper gives an overview of various types of data mining, discusses their suitability for high-dimensional data, and critically examines their properties from the robustness point of view, explaining that robustness itself is perceived differently in different contexts. Moreover, we investigate properties of a robust nonlinear regression estimator of Kalina (2013).
References
Belloni A., Chernozhukov V., & Hansen C. (2011). Inference for high-dimensional sparse econometric models. Centre for Microdata Methods and Practice working paper 41/11. [Online] Available: http://arxiv.org/pdf/1201.0220.pdf (February 12, 2014)
Blankertz B., Tangermann M., Popescu F., Krauledat M., Fazli S., Dónaczy M., Curio G., & Müller K.R. (2008). The Berlin brain-computer interface. Lecture Notes in Computer Science, 5050, 79-101.
Bobrowski L., & Łukaszuk T. (2011). Relaxed linear separability (RLS) approach to feature (gene) subset selection. In X. Xia (Ed.), Selected Works in Bioinformatics (pp. 103-118). Rijeka: InTech.
Bobrowski L., & Łukaszuk T. (2012). Prognostic modeling with high dimensional and censored data. Lecture Notes in Computer Science, 7377, 178-193.
Brandl B., Keber C., & Schuster M. (2006). An automated econometric decision support system: Forecasts for foreign exchange trades. Central European Journal of Operations Research, 14, 401-415.
Christmann A., & Van Messem A. (2008). Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9, 915-936.
Dai J.J., Lieu L., & Rocke D. (2006). Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology, 5 (1), Article 6.
Duan N., & Li K.C. (1991). Slicing regression: A link-free regression method. Annals of Statistics, 19, 505-530.
Fernandez G. (2003). Data mining using SAS applications. Boca Raton: Chapman & Hall/CRC.
Funk M.J., Westreich D., Wiesen C., Stürmer T., Brookhart M.A., & Davidian M. (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology, 173 (7), 761-767.
Furlanello C., Serafini M., Merler S., & Jurman G. (2003). Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4, Article 54.
Greenland S. (2000). When should epidemiologic regressions use random coefficients? Biometrics, 56, 915-921.
Guo Y., Hastie T., & Tibshirani R. (2007). Regularized discriminant analysis and its application in microarrays. Biostatistics, 8 (1), 86-100.
Hall P., Marron J.S., & Neeman A. (2005). Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society B, 67 (3), 427-444.
Harrell F.E. (2001). Regression modeling strategies with applications to linear models, logistic regression, and survival analysis. New York: Springer.
Hastie T., Tibshirani R., & Friedman J. (2009). The elements of statistical learning. Data mining, inference, and prediction. New York: Springer.
Hesterberg T., Choi N.H., Meier L., & Fraley C. (2008). Least angle and l1 penalized regression: A review. Statistics Surveys, 2, 61-93.
Hubert M., Rousseeuw P.J., & Van Aelst S. (2008). High-breakdown robust multivariate methods. Statistical Science, 23, 92-119.
Jeng J.T., Chuang C.T., & Chuang C.C. (2011). Least trimmed squares based CPBUM neural networks. Proceedings International Conference on System Science and Engineering ICSSE 2011, Washington: IEEE Computer Society Press, 187-192.
Jurczyk T. (2012). Outlier detection under multicollinearity. Journal of Statistical Computation and Simulation, 82 (2), 261-278.
Jurečková J., & Picek J. (2006). Robust statistical methods with R. Boca Raton: Chapman & Hall/CRC.
Jurečková J., & Sen P.K. (2006). Robust multivariate location estimation, admissibility, and shrinkage phenomenon. Statistics & Decisions, 24, 273-290.
Kainen P.C., Kůrková V., & Sanguineti M. (2009). Complexity of Gaussian-radial-basis networks approximating smooth functions. Journal of Complexity, 25, 63-74.
Kalina J. (2014). Classification analysis methods for high-dimensional genetic data. Biocybernetics and Biomedical Engineering, 34 (1). In press.
Kalina J. (2013). Highly robust methods in data mining. Serbian Journal of Management, 8 (1), 9-24.
Kalina J., Seidl L., Zvára K., Grünfeldová H., Slovák D., & Zvárová J. (2013). Selecting relevant information for medical decision support with application to cardiology. European Journal for Biomedical Informatics, 9 (1), 2-6.
Kalina J. (2012). On multivariate methods in robust econometrics. Prague Economic Papers, 21, 69-82.
Kohonen T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59-69.
Liu X., Krishnan A., & Modry A. (2005). An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics, 6, Article 76.
Martinez W.L., Martinez A.R., & Solka J.L. (2011). Exploratory data analysis with MATLAB. (2nd ed.). London: Chapman & Hall/CRC.
McFerrin L. (2013). Package HDMD. [Online] Available: http://cran.r-project.org/web/packages/HDMD/HDMD.pdf (June 14, 2013)
Mosteller F., & Tukey J.W. (1968). Data analysis, including statistics. In G. Lindzey, E. Aronson (Eds.), Handbook of Social Psychology, Vol. 2 (pp. 80-203). New York: Addison-Wesley.
Nisbet R., Elder J., & Miner G. (2009). Handbook of statistical analysis and data mining applications. Burlington: Elsevier.
Murtaza N., Sattar A.R., & Mustafa T. (2010). Enhancing the software effort estimation using outlier elimination methods for agriculture in Pakistan. Pakistan Journal of Life and Social Sciences, 8, 54-58.
Osuna E., Freund R., & Girosi F. (1997). Training support vector machines: An application to face detection. Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition CVPR 1997, Los Alamitos: IEEE Computer Society Press, 130-136.
Penn B.S. (2005). Using self-organizing maps to visualize high-dimensional data. Computers & Geosciences, 31 (5), 531-544.
Rousseeuw P.J., & van Driessen K. (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12, 29-45.
Rowley H., Baluja S., & Kanade T. (1998). Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 23-38.
Rusiecki A. (2008). Robust MCD-based backpropagation learning algorithm. Lecture Notes in Computer Science, 5097, 154-163.
Šebesta V., & Tučková J. (2005). The extraction of markers for the training of neural network dedicated for the speech prosody control. In S. Lecoeuche, D. Tsaptsinos (Eds.), Novel Applications of Neural Networks in Engineering International Conference on Engineering Applications of Neural Networks EANN’05, 245-250.
Smyth G.K. (2005). Limma: linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (Eds.), Bioinformatics and computational biology solutions using R and Bioconductor (pp. 397-420). New York: Springer.
Suzuki T., & Sugiyama M. (2010). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25, 725-758.
Turchi M., Perrotta D., Riani M., & Cerioli A. (2013). Robustness issues in text mining. Advances in Intelligent Systems and Computing, 190, 263-272.
Vanden Branden K., & Hubert M. (2005). Robust classification in high dimensions based on the SIMCA method. Chemometrics and Intelligent Laboratory Systems, 79, 10-21.
Vapnik V.N. (1995). The nature of statistical learning theory. New York: Springer.
Víšek J.Á. (2000). On the diversity of estimates. Computational Statistics & Data Analysis, 34, 67-89.
Xanthopoulos P., Pardalos P.M., & Trafalis T.B. (2013). Robust data mining. New York: Springer.
Zimmermann H.-G., Grothmann R., & Neuneier R. (2001). Multi-agent FX-market modeling by neural networks. Operations Research Proceedings, 2001, 413-420.
Zuber V., & Strimmer K. (2011). High-dimensional regression and variable selection using CAR scores. Statistical Applications in Genetics and Molecular Biology, 10 (1), Article 34.
