A Baseline Extractive Summarization System for Hindi Language Using TF-IDF and ROUGE Evaluation
Contributors
Atul Kumar
Shashi Kant Gupta
Keywords
Proceeding
Track
Engineering and Sciences
License
Copyright (c) 2026 Sustainable Global Societies Initiative

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
The exponential growth of digital content in Indian languages, especially Hindi, has created an urgent need for effective automatic text summarization systems. This paper presents an extractive text summarization system designed specifically for Hindi and based on the Term Frequency-Inverse Document Frequency (TF-IDF) method. The proposed system processes raw Hindi text through a pipeline of tokenization, stop-word removal, and TF-IDF vectorization to assign an importance score to each sentence. The highest-scoring sentences are then selected to form the summary. The system was implemented and evaluated using the Python standard library and tested on a dataset derived from the Hindi Article Summarization (HAS) dataset. In the ROUGE evaluation, the TF-IDF model achieves a ROUGE-1 F1 of 0.52 and a ROUGE-L F1 of 0.46, outperforming a simple Lead-3 baseline. These findings indicate that TF-IDF is a strong, computationally inexpensive baseline for Hindi summarization. Nonetheless, limitations were observed in capturing semantic meaning and in handling redundancy. This work provides a foundation for more sophisticated Hindi summarization methods that incorporate semantic analysis and deep learning.
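The pipeline and metrics described in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' implementation: the function names, the tiny stop-word list, the smoothed IDF formula, the per-sentence score (sum of length-normalized TF-IDF weights), and the use of a plain F1 for ROUGE-L are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"है", "हैं", "का", "की", "के", "में", "और", "से", "को", "पर", "यह", "भी"}


def split_sentences(text):
    # Hindi sentences usually end with the danda "।"; "?" and "!" also occur.
    return [s.strip() for s in re.split(r"[।?!]", text) if s.strip()]


def tokenize(sentence):
    # Replace common punctuation with spaces, split on whitespace, drop stop words.
    words = re.sub(r"[,;:\"'()\-]", " ", sentence).split()
    return [w for w in words if w not in STOP_WORDS]


def summarize(text, k=3):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    if n <= k:
        return sentences
    # Document frequency: in how many sentences each word appears.
    df = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Sum of TF-IDF weights, with TF normalized by sentence length
        # and a smoothed IDF (assumed variant).
        score = sum(
            (c / len(d)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
            for w, c in tf.items()
        )
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def _f1(overlap, cand_len, ref_len):
    if overlap == 0 or cand_len == 0 or ref_len == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f1(candidate, reference):
    # Clipped unigram overlap between candidate and reference token lists.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    return _f1(overlap, len(candidate), len(reference))


def lcs_length(a, b):
    # Dynamic-programming longest common subsequence, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    # ROUGE-L computed as an F1 over the longest common subsequence.
    return _f1(lcs_length(candidate, reference), len(candidate), len(reference))
```

In this sketch, a generated summary would be scored by tokenizing both the system summary and the reference summary and passing the two token lists to `rouge_1_f1` and `rouge_l_f1`. A Lead-3 baseline corresponds to simply taking `split_sentences(text)[:3]`.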