A Baseline Extractive Summarization System for Hindi Language Using TF-IDF and ROUGE Evaluation

Atul Kumar; Shashi Kant Gupta; Atul Kumar

Date Published : 5 May 2026

Contributors

Atul Kumar

Lincoln University College, 47301, Petaling Jaya, Selangor Darul Ehsan, Malaysia

Author

Shashi Kant Gupta

Lincoln University College, 47301, Petaling Jaya, Selangor Darul Ehsan, Malaysia

Author

Atul Kumar

Chandigarh University, Unnao, Uttar Pradesh, India

Author

Keywords

Text Summarization; Hindi NLP; TF-IDF; Extractive Summarization; ROUGE Evaluation

Proceeding

Vol. 1 No. 3 (2026): LGPR Batch 2 Conference 3

Track

Engineering and Sciences

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Abstract

The exponential growth of digital content in Indian languages, especially Hindi, has created an urgent need for effective automatic text summarization. This paper presents an extractive text summarization system developed specifically for Hindi and based on Term Frequency-Inverse Document Frequency (TF-IDF). The proposed system processes raw Hindi text through a pipeline of tokenization, stop-word removal, and TF-IDF vectorization to assign an importance score to each sentence. The highest-scoring sentences are then selected to form the summary. The system was implemented using the Python standard library and evaluated on a dataset derived from the Hindi Article Summarization (HAS) dataset. On ROUGE metrics, the TF-IDF model achieves a ROUGE-1 F1 of 0.52 and a ROUGE-L F1 of 0.46, outperforming a simple Lead-3 baseline. The findings show that TF-IDF is a good, computationally cheap baseline for Hindi summarization. However, limitations were observed in capturing semantic meaning and in handling redundancy. This work provides a foundation for more sophisticated Hindi summarization methods that incorporate semantic analysis and deep learning.
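
The pipeline described in the abstract (tokenization, stop-word removal, TF-IDF sentence scoring, top-sentence selection) can be sketched with the Python standard library alone. This is a minimal illustration and not the authors' implementation: the stop-word list, the choice to treat each sentence as a "document" for IDF, the smoothed IDF formula, the mean-weight sentence score, and all function names are assumptions made for the example. The `rouge1_f1` helper is likewise a simplified unigram-overlap F1 on whitespace tokens, not the official ROUGE toolkit.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the paper's list is not given).
STOP_WORDS = {"है", "हैं", "का", "की", "के", "में", "और", "से", "को", "पर", "यह", "था"}


def split_sentences(text):
    """Split on the Hindi danda and common sentence-final punctuation."""
    return [p.strip() for p in re.split(r"[।!?]+", text) if p.strip()]


def tokenize(sentence):
    """Extract Devanagari tokens (danda excluded) and drop stop words."""
    tokens = re.findall(r"[\u0900-\u0963\u0966-\u097F]+", sentence)
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_sentence_scores(sentences):
    """Score each sentence by the mean TF-IDF weight of its tokens.

    Each sentence is treated as one document for the IDF statistics
    (an assumption), with smoothed IDF: log((1 + N) / (1 + df)) + 1.
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        score = sum(
            (count / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        )
        scores.append(score)
    return scores


def summarize(text, k=3):
    """Return the k highest-scoring sentences, in original document order."""
    sentences = split_sentences(text)
    scores = tfidf_sentence_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A Lead-3 baseline of the kind mentioned in the abstract would correspond to `split_sentences(text)[:3]`, so both systems can be compared by passing their joined output and a reference summary to `rouge1_f1`. Because sentences are scored independently, this sketch does nothing to avoid selecting near-duplicate sentences, which matches the redundancy limitation the abstract notes.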

How to Cite

Kumar, A., Gupta, S. K., & Kumar, A. (2026). A Baseline Extractive Summarization System for Hindi Language Using TF-IDF and ROUGE Evaluation. Sustainable Global Societies Initiative, 1(3). https://vectmag.com/sgsi/paper/view/394