Efficient Parallel Processing of k-Nearest Neighbor Queries by Using a Centroid-based and Hierarchical Clustering Algorithm
DOI: https://doi.org/10.30564/aia.v4i1.4668
Abstract: The k-Nearest Neighbor method is one of the most popular techniques for both classification and regression. Because of the way it operates, its application can be limited to problems with a modest number of instances, particularly when run time is a consideration. However, classifying large amounts of data has become a fundamental task in many real-world applications, so it is natural to scale the k-Nearest Neighbor method to large datasets. This paper proposes a new k-Nearest Neighbor classification method (KNN-CCL) that uses a parallel centroid-based and hierarchical clustering algorithm to partition the training dataset into multiple parts. The introduced clustering algorithm proceeds in four stages of successive refinement and generates high-quality clusters, which the k-Nearest Neighbor approach subsequently uses to classify the test datasets. Finally, a set of experiments is conducted on UCI datasets. The experimental results confirm that the proposed k-Nearest Neighbor classification method performs well with regard to both classification accuracy and run-time performance.
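The abstract describes a cluster-then-classify pipeline: the training set is partitioned by a centroid-based clustering algorithm, and each test instance is then classified by running kNN only inside the most relevant partition. The sketch below illustrates that general idea only; it is not the paper's KNN-CCL algorithm. A plain k-means stands in for the four-stage centroid-based and hierarchical clustering, and the function names, parameters, and toy data are illustrative assumptions.

```python
# Minimal sketch of the cluster-then-classify idea, assuming a simple
# k-means partitioning in place of the paper's four-stage clustering.
import numpy as np

def kmeans(X, n_clusters, n_iter=20, seed=0):
    """Naive k-means, standing in for the paper's clustering stage."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        for c in range(n_clusters):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return centroids, labels

def knn_predict(x, X, y, k):
    """Majority vote among the k nearest training points to x."""
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return np.bincount(y[idx]).argmax()

def cluster_knn_classify(x, centroids, parts, k=3):
    """Route the query to the partition with the nearest centroid,
    then run kNN only inside that partition."""
    c = np.argmin(np.linalg.norm(centroids - x, axis=1))
    X_c, y_c = parts[c]
    return knn_predict(x, X_c, y_c, k)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two Gaussian blobs as toy training data (hypothetical example).
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    centroids, labels = kmeans(X, n_clusters=4)
    parts = {c: (X[labels == c], y[labels == c]) for c in range(4)}
    print(cluster_knn_classify(np.array([4.8, 5.2]), centroids, parts))
```

Routing each query to the cluster with the nearest centroid is what makes this style of approach scale: each kNN search scans a single partition rather than the full training set, and the partitions can be built and searched in parallel.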