Research Article | Peer-Reviewed

An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets

Received: 14 March 2025     Accepted: 31 March 2025     Published: 29 April 2025
Abstract

The rapid growth of textual data across domains has increased the need for efficient clustering techniques capable of handling large-scale datasets. Traditional clustering methods often fail to capture semantic relationships and struggle with high-dimensional, sparse data. The present study proposes an improved document clustering technique, WEClustering++, which enhances the existing WEClustering framework by integrating fine-tuned BERT-based word embeddings. The proposed model incorporates advanced dimensionality reduction techniques and optimized clustering algorithms to improve clustering accuracy. In this work, a BERT-large model fine-tuned on domain-specific datasets is used. Seven benchmark datasets spanning various domains and sizes are considered, including collections of research articles, news articles, and other domain-specific texts. Experimental evaluations on these datasets demonstrate significant improvements in clustering metrics, including silhouette score, purity, and adjusted Rand index (ARI). Results show increases of 45% and 67% in median silhouette score for the WEClustering_K++ (K-means-based) and WEClustering_A++ (agglomerative-based) models, respectively. Compared to the state-of-the-art model, median purity increases by 0.4% and 0.8%, and median ARI by 7% and 11%, for WEClustering_K++ and WEClustering_A++, respectively. These findings highlight the potential of fine-tuned word embeddings in bridging the gap between statistical clustering robustness and semantic understanding. The proposed approach is expected to contribute to advances in large-scale text mining applications, including document organization, topic modelling, and information retrieval.
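As an illustration of the pipeline the abstract describes (contextual embeddings, dimensionality reduction, then k-means++ or agglomerative clustering, evaluated by silhouette, purity, and ARI), the following is a minimal, hypothetical Python sketch, not the authors' implementation: it substitutes the public bert-large-uncased checkpoint for the fine-tuned BERT-large model, assumes PCA for the unspecified reduction step, and implements purity from its standard definition.

import numpy as np
import torch
from collections import Counter
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical stand-in for the domain-fine-tuned BERT-large checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased")
model.eval()

def embed(docs):
    # One mean-pooled 1024-d vector per document from the last hidden layer.
    vecs = []
    with torch.no_grad():
        for doc in docs:
            enc = tokenizer(doc, truncation=True, max_length=512,
                            return_tensors="pt")
            vecs.append(model(**enc).last_hidden_state.mean(dim=1)
                        .squeeze(0).numpy())
    return np.vstack(vecs)

def purity(y_true, y_pred):
    # Each cluster contributes the count of its most frequent gold label.
    hits = sum(Counter(t for t, p in zip(y_true, y_pred) if p == c)
               .most_common(1)[0][1] for c in set(y_pred))
    return hits / len(y_true)

# Toy corpus; the paper's experiments use seven benchmark datasets instead.
docs = [
    "Deep learning models improve text clustering accuracy.",
    "Word embeddings capture semantic similarity between documents.",
    "The stock market fell sharply after the rate decision.",
    "Investors sold shares as interest rates rose.",
]
y_true = [0, 0, 1, 1]

X = PCA(n_components=2).fit_transform(embed(docs))  # reduction step (assumed)

models = {
    "k-means++ (WEClustering_K++-style)":
        KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0),
    "agglomerative (WEClustering_A++-style)":
        AgglomerativeClustering(n_clusters=2),
}
for name, algo in models.items():
    y_pred = algo.fit_predict(X)
    print(f"{name}: silhouette={silhouette_score(X, y_pred):.3f} "
          f"purity={purity(y_true, y_pred):.3f} "
          f"ARI={adjusted_rand_score(y_true, y_pred):.3f}")

On a real corpus, the number of PCA components and clusters would be tuned per dataset; the toy corpus above only demonstrates how the stages fit together.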

Published in Machine Learning Research (Volume 10, Issue 1)
DOI 10.11648/j.mlr.20251001.14
Page(s) 32-43
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2025. Published by Science Publishing Group

Keywords

Deep Learning, Word Embedding, Large Text Data, Silhouette Score, Clustering Technique

Cite This Article
  • APA Style

    Sutrakar, V. K., & Mogre, N. (2025). An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets. Machine Learning Research, 10(1), 32-43. https://doi.org/10.11648/j.mlr.20251001.14


  • ACS Style

    Sutrakar, V. K.; Mogre, N. An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets. Mach. Learn. Res. 2025, 10(1), 32-43. doi: 10.11648/j.mlr.20251001.14


  • AMA Style

    Sutrakar VK, Mogre N. An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets. Mach Learn Res. 2025;10(1):32-43. doi: 10.11648/j.mlr.20251001.14


  • @article{10.11648/j.mlr.20251001.14,
      author = {Vijay Kumar Sutrakar and Nikhil Mogre},
  title = {An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets},
      journal = {Machine Learning Research},
      volume = {10},
      number = {1},
      pages = {32-43},
      doi = {10.11648/j.mlr.20251001.14},
      url = {https://doi.org/10.11648/j.mlr.20251001.14},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.mlr.20251001.14},
  year = {2025}
    }
    


  • TY  - JOUR
    T1  - An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets
    
    AU  - Vijay Kumar Sutrakar
    AU  - Nikhil Mogre
    Y1  - 2025/04/29
    PY  - 2025
    N1  - https://doi.org/10.11648/j.mlr.20251001.14
    DO  - 10.11648/j.mlr.20251001.14
    T2  - Machine Learning Research
    JF  - Machine Learning Research
    JO  - Machine Learning Research
    SP  - 32
    EP  - 43
    PB  - Science Publishing Group
    SN  - 2637-5680
    UR  - https://doi.org/10.11648/j.mlr.20251001.14
    VL  - 10
    IS  - 1
    ER  - 

