Highly Robust Methods in Data Mining
Abstract
This paper is devoted to highly robust methods for extracting information from data, with special attention paid to methods suitable for management applications. The sensitivity of available data mining methods to the presence of outlying measurements in the observed data is discussed as their major drawback. The paper proposes several new highly robust methods for data mining based on the idea of implicit weighting of individual data values. In particular, it proposes a novel robust method of hierarchical cluster analysis, a popular data mining method of unsupervised learning. Further, a robust method for estimating the parameters of logistic regression is proposed, and the idea is extended to a robust multinomial logistic classification analysis. Finally, the sensitivity of neural networks to the presence of noise and outlying measurements in the data is discussed, and a method for robust training of neural networks for the task of function approximation, which has the form of a robust estimator in nonlinear regression, is proposed.
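To illustrate the idea of implicit weighting referred to in the abstract, the following sketch (in Python, not part of the original paper) shows a simple iteratively reweighted linear regression fit in the spirit of the least weighted squares estimator: each observation receives a weight determined by the rank of its squared residual, so that potential outliers are downweighted automatically without being identified beforehand. The weight function, the iterative scheme, and all names in the code are illustrative assumptions, not the exact algorithm proposed in the paper.

```python
# A minimal sketch of implicit weighting in linear regression, in the spirit of
# least weighted squares: weights are assigned according to the ranks of squared
# residuals, so outlying observations are downweighted automatically.
# The weight function and the iterative refit are illustrative choices only.
import numpy as np

def implicit_weight_fit(X, y, n_iter=20):
    """Iteratively reweighted least squares with rank-based (implicit) weights."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])         # design matrix with intercept
    # linearly decreasing weights: the largest residual gets the smallest weight
    base_weights = (n - np.arange(n)) / n
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]  # start from the ordinary LS fit
    for _ in range(n_iter):
        res2 = (y - Xd @ beta) ** 2
        ranks = np.argsort(np.argsort(res2))      # rank 0 = smallest squared residual
        w = base_weights[ranks]
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
    return beta

# small demonstration with a few gross outliers
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=100)
y[:5] += 25.0                                     # contaminate five responses
print(implicit_weight_fit(X, y))                  # close to [1, 2, -1] despite the outliers
```

The same implicit-weighting principle is what the paper carries over to hierarchical cluster analysis, logistic and multinomial classification, and the robust training of neural networks formulated as a robust estimator in nonlinear regression.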