Volume 2, Issue 4, December 2017, Page: 125-132
Unsupervised Dimensionality Reduction for High-Dimensional Data Classification
Hany Yan, School of Mathematics, Jilin University, Changchun, China
Hu Tianyu, School of Mathematics, Jilin University, Changchun, China
Received: Jul. 20, 2017; Accepted: Aug. 9, 2017; Published: Aug. 31, 2017
DOI: 10.11648/j.mlr.20170204.13
This paper studies how dimensionality reduction affects the performance of machine learning classifiers. First, it builds an analysis framework for classification after dimensionality reduction, pairing two unsupervised dimensionality reduction methods, locally linear embedding (LLE) and principal component analysis (PCA), with five machine learning classification methods: Gradient Boosting Decision Tree (GBDT), Random Forest, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Logistic Regression. It then uses a handwritten digit recognition dataset to compare the classification performance of these five methods on data reduced to different numbers of dimensions by each reduction method. The analysis shows that (1) choosing an appropriate dimensionality reduction method before classification can effectively improve classification accuracy; (2) the nonlinear reduction method (LLE) generally yields better classification results than the linear method (PCA); and (3) the classification algorithms differ significantly in their sensitivity to the number of dimensions.
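The framework described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn and its bundled 8x8 digits dataset (the abstract does not say which handwritten digit dataset was used), and the function name `evaluate`, the target dimension, the neighbor count, the train/test split, and all classifier settings are illustrative assumptions. Each reducer is fitted on the training split only and then applied to the test split, and the reduced features are standardized because LLE coordinates are very small in magnitude.

```python
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate(n_components=10, seed=0):
    """Return {(reducer_name, classifier_name): test accuracy}.

    All hyperparameters here are assumptions chosen for illustration.
    """
    # Assumed dataset: scikit-learn's digits (1797 samples, 64 features).
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )

    # The two unsupervised reducers named in the abstract.
    reducers = {
        "PCA": PCA(n_components=n_components),
        "LLE": LocallyLinearEmbedding(
            n_neighbors=30, n_components=n_components, eigen_solver="dense"
        ),
    }
    # The five classifiers named in the abstract.
    classifiers = {
        "GBDT": GradientBoostingClassifier(n_estimators=50, random_state=seed),
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }

    results = {}
    for red_name, reducer in reducers.items():
        # Fit the reducer on training data only, then map the test data.
        Z_train = reducer.fit_transform(X_train)
        Z_test = reducer.transform(X_test)
        # Standardize the reduced features using training statistics.
        scaler = StandardScaler().fit(Z_train)
        Z_train, Z_test = scaler.transform(Z_train), scaler.transform(Z_test)
        for clf_name, clf in classifiers.items():
            model = clone(clf).fit(Z_train, y_train)
            results[(red_name, clf_name)] = model.score(Z_test, y_test)
    return results


if __name__ == "__main__":
    for (red_name, clf_name), acc in sorted(evaluate().items()):
        print(f"{red_name:>3} + {clf_name:<18} accuracy = {acc:.3f}")
```

Calling `evaluate` with several values of `n_components` gives one way to probe how sensitive each classifier is to the reduced dimension. Results from this sketch depend on the assumed settings and should not be read as reproducing the paper's figures.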
Dimensionality Reduction, Machine Learning, Classification Problem, Handwritten Numeral Recognition
To cite this article
Hany Yan, Hu Tianyu, Unsupervised Dimensionality Reduction for High-Dimensional Data Classification, Machine Learning Research. Vol. 2, No. 4, 2017, pp. 125-132. doi: 10.11648/j.mlr.20170204.13
Copyright © 2017 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.