Highly Robust Methods in Data Mining

  • Jan Kalina, Institute of Computer Science of the Academy of Sciences of the Czech Republic

Abstract


This paper is devoted to highly robust methods for information extraction from data, with special attention paid to methods suitable for management applications. The sensitivity of available data mining methods to outlying measurements in the observed data is discussed as one of their major drawbacks. The paper proposes several new highly robust methods for data mining, which are based on the idea of implicit weighting of individual data values. In particular, it proposes a novel robust method of hierarchical cluster analysis, a popular data mining method of unsupervised learning. Further, a robust method for estimating parameters in logistic regression is proposed, and this idea is extended to robust multinomial logistic classification analysis. Finally, the sensitivity of neural networks to noise and outlying measurements in the data is discussed, and a method for robust training of neural networks for the task of function approximation is proposed, which has the form of a robust estimator in nonlinear regression.
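The idea of implicit weighting mentioned in the abstract can be illustrated with a small sketch. It is not the paper's method: the function names, the trimmed linear weight sequence, the iterative reweighting heuristic, and the toy data are all assumptions chosen for illustration. Each observation receives a weight determined by the rank of its squared residual rather than by its value, so the observations that fit worst are downweighted automatically.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rank_weighted_fit(xs, ys, n_iter=20):
    """Heuristic sketch: weights depend on the ranks of squared residuals."""
    n = len(xs)
    # Assumed weight sequence: linearly decreasing in the rank and cut to
    # zero for roughly the quarter of observations with the largest residuals.
    rank_weights = [max(0.0, 1.0 - k / (0.75 * n)) for k in range(n)]
    # Start from ordinary least squares (all weights equal).
    a, b = weighted_line_fit(xs, ys, [1.0] * n)
    for _ in range(n_iter):
        sq_res = [(y - a - b * x) ** 2 for x, y in zip(xs, ys)]
        # Order observations from smallest to largest squared residual.
        order = sorted(range(n), key=lambda i: sq_res[i])
        ws = [0.0] * n
        for rank, i in enumerate(order):
            ws[i] = rank_weights[rank]
        a, b = weighted_line_fit(xs, ys, ws)
    return a, b


# Toy data: points on the line y = 1 + 2x, with one gross outlier at the end.
xs = [float(x) for x in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[9] = 100.0

a_ols, b_ols = weighted_line_fit(xs, ys, [1.0] * len(xs))
a_rob, b_rob = rank_weighted_fit(xs, ys)
print("least squares fit:  intercept", a_ols, "slope", b_ols)
print("rank-weighted fit:  intercept", a_rob, "slope", b_rob)
```

On this toy data the single outlier pulls the least squares slope far away from 2, while the rank-weighted fit assigns the outlier zero weight and recovers the underlying line. The reweighting loop only approximates the minimization of a weighted sum of ordered squared residuals and does not by itself guarantee a high breakdown point.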

Author Biography

Jan Kalina, Institute of Computer Science of the Academy of Sciences of the Czech Republic
Department of Medical Informatics, Research Fellow

Published
2013/01/18
Section
Original Scientific Paper