Approximate search for Big Data with applications in information security - A survey

  • Slobodan Petrović Gjøvik University College

Abstract


This paper is a survey of approximate search techniques in very large data sets (so-called Big Data). After a short introduction, some techniques for speeding up approximate search in such data sets based on exploitation of inherent bit-parallelism in computers are described. It then reviews the applications in search related to information security problems (digital forensics, malware detection, intrusion detection) are reviewed. Finally, the need for constraints in approximate search regarding the number of so-called elementary edit operations and the run lengths of particular elementary edit operations is explained and the status of on-going research on efficient implementation of approximate search algorithms with various constraints is given.

References

Baeza-Yates, R., & Gonnet, G.H. (1992). A new approach to text searching. Communications of the ACM, 35(10), 74-82. doi:10.1145/135239.135243

Berre, A.J. (2015). BigData Value PPP in Horizon 2020. Retrieved from http://www.sintef.no/globalassets/sintef-ikt/arrangementer/2015-02-23_workshop_big_data/2---bdva-ppp-2402-berre.pdf.

Ďurian, B., Holub, J., Peltola, H., & Tarhio, J. (2009). Tuning BNDM with q-Grams. In: J. Hershberger & I. Finocchi (Eds.), Proceedings of the Eleventh Workshop on Algorithm Engineering and Experiments (ALENEX). doi:10.1137/1.9781611972894.3.

Eyrich, J. (2011). Drinking from a Firehose: How to get traffic to your Bro cluster. Retrieved from https://www.bro.org/bro-workshop-2011/slides/drinking-from-fire-hose.pdf

Faro, S., & Lecroq, T. (2012). Twenty Years of Bit-Parallelism in String Matching. In J. Holub, B.W. Watson, & J. Zdarek (Ed.), Festschrift for Borivoj Melichar. (pp. 72-101). Prague: Czech Technical University.

Frigessi, A. (2015). Big Insight. Retrieved from http://www.sintef.no/globalassets/sintef-ikt/arrangementer/2015-02-23_workshop_big_data/4---biginsight_frigessi.pdf.

Gorodetsky, V., Karsaev, O., & Samoilov, V. (2004). On-Line Update of Situation Assessment Based on Asynchronous Data Streams. In: Knowledge-Based Intelligent Information and Engineering Systems: 8th International Conference, KES. Berlin: Springer Verlag.1136-1142. doi:10.1007/978-3-540-30132-5_154.

IBM. (2015). What is big data. Retrieved from http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

Layer, C. (2007). A Coprocessor for Fast Searching in Large Databases: Associative Computing Engine. University of Ulm.

Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physical Doklady, 10(8), 707-710.

Naor, D., & Brutlag, D. (1993). On Suboptimal Alignments in Biological Sequences. In: A. Apostolico (Ed.),Combinatorial pattern matching: 4th annual symposium, CPM 93, Padova, Italy, June 2-4, Proceedings. Berlin: Springer Verlag.179-196.

Navarro, G., & Raffinot, M. (2002). Flexible Pattern Matching in Strings: Practical on-line search algorithms for texts and biological sequences. Cambridge: Cambridge University Press.

Petrovic, S., & Fuster-Sabater, A. (2007). Reconstruction of Suboptimal Paths in the Constrained Edit Distance Array with Application in Cryptanalysis. In: O. Gervasi & M.L. Gavrilova (Ed.), Computational Science and Its Applications - ICCSA International Conference, Kuala Lumpur, Malaysia, Proceedings, Part III. Berlin: Springer Verlag.597-610. doi:10.1007/978-3-540-74484-9_52.

Pries, K., & Dunnigan, R. (2015). Big Data Analytics: A practical guide for managers. Boca Raton: CRC Press.

Sankoff, D., & Kruskal, J. (2000). Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Cambridge: Cambridge University Press.

Sommer, R. (2011). Broverview. Retrieved from https://www.bro.org/bro-workshop-2011/slides/broverview.pdf

Published
2015/04/30
Section
Original Scientific Paper