The rapid growth of textual data across domains has increased the need for efficient clustering techniques capable of handling large-scale datasets. Traditional clustering methods often fail to capture semantic relationships and struggle with high-dimensional, sparse data. The present study proposes an improved document clustering technique, WEClustering++, which enhances the existing WEClustering framework by integrating fine-tuned BERT-based word embeddings. The proposed model incorporates advanced dimensionality reduction techniques and optimized clustering algorithms to improve clustering accuracy. In the present work, a BERT-large model fine-tuned on domain-specific datasets is utilized. Seven benchmark datasets spanning various domains and sizes are considered, including collections of research articles, news articles, and other domain-specific texts. Experimental evaluations on these benchmark datasets demonstrate significant performance improvements in clustering metrics, including silhouette score, purity, and adjusted Rand index (ARI). Results show increases of 45% and 67% in median silhouette score for the WEClustering_K++ (K-means-based) and WEClustering_A++ (agglomerative-based) models, respectively. Median purity increases of 0.4% and 0.8%, and median ARI increases of 7% and 11%, are obtained for the proposed WEClustering_K++ and WEClustering_A++ models compared to the state-of-the-art model. These findings highlight the potential of fine-tuned word embeddings in bridging the gap between statistical clustering robustness and semantic understanding. The proposed approach is expected to contribute to advancements in large-scale text mining applications, including document organization, topic modelling, and information retrieval.
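The abstract describes a pipeline of BERT-based document embeddings, dimensionality reduction, and K-means/agglomerative clustering evaluated with silhouette score, purity, and ARI. The Python sketch below illustrates one way such a pipeline can be wired together; it is not the paper's implementation. The checkpoint name (bert-large-uncased), mean pooling of token states, PCA with 50 components, and the scikit-learn estimators are assumptions made purely for illustration.

```python
# Illustrative sketch of a BERT-embeddings + clustering pipeline in the spirit of
# WEClustering++. Model checkpoint, pooling, reduction method, and hyperparameters
# below are assumptions for demonstration, not the paper's exact configuration.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score


def embed_documents(docs, model_name="bert-large-uncased"):
    """Mean-pool the last hidden states of a BERT model to get one vector per document."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    vectors = []
    with torch.no_grad():
        for doc in docs:
            inputs = tokenizer(doc, truncation=True, max_length=512, return_tensors="pt")
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
            vectors.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.vstack(vectors)


def purity(true_labels, pred_labels):
    """Purity: fraction of documents falling in the majority true class of their cluster."""
    _, true_ids = np.unique(true_labels, return_inverse=True)
    total = 0
    for cluster in np.unique(pred_labels):
        members = true_ids[pred_labels == cluster]
        total += np.bincount(members).max()
    return total / len(true_ids)


def cluster_and_evaluate(docs, true_labels, n_clusters, n_components=50):
    """Embed, reduce, cluster with K-means++ and agglomerative clustering, then score."""
    X = embed_documents(docs)
    # Reduce the high-dimensional embeddings before clustering (PCA used here as a stand-in
    # for whichever reduction technique the paper applies).
    X_red = PCA(n_components=min(n_components, X.shape[0], X.shape[1])).fit_transform(X)
    results = {}
    for name, algo in [
        ("kmeans_pp_sketch", KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=0)),
        ("agglomerative_sketch", AgglomerativeClustering(n_clusters=n_clusters)),
    ]:
        pred = algo.fit_predict(X_red)
        results[name] = {
            "silhouette": silhouette_score(X_red, pred),
            "purity": purity(true_labels, pred),
            "ARI": adjusted_rand_score(true_labels, pred),
        }
    return results
```

As a usage example, calling cluster_and_evaluate(docs, labels, n_clusters=20) on a labelled corpus such as 20 Newsgroups returns the three metrics reported in the abstract for both clustering variants.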
Published in: Machine Learning Research (Volume 10, Issue 1)
DOI: 10.11648/j.mlr.20251001.14
Page(s): 32-43
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright: Copyright © The Author(s), 2025. Published by Science Publishing Group
Keywords: Deep Learning, Word Embedding, Large Text Data, Silhouette Score, Clustering Technique
APA Style
Sutrakar, V. K., & Mogre, N. (2025). An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets. Machine Learning Research, 10(1), 32-43. https://doi.org/10.11648/j.mlr.20251001.14
ACS Style
Sutrakar, V. K.; Mogre, N. An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets. Mach. Learn. Res. 2025, 10(1), 32-43. doi: 10.11648/j.mlr.20251001.14