Research Article

Abstractive Summarization of Spoken Language: A Comparative Evaluation of BART and T5 on Podcast and Conversational Speech Transcripts

by Ashish Joshi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 97
Published: April 2026
DOI: 10.5120/ijca78706a352984

Ashish Joshi. Abstractive Summarization of Spoken Language: A Comparative Evaluation of BART and T5 on Podcast and Conversational Speech Transcripts. International Journal of Computer Applications. 187, 97 (April 2026), 11-19. DOI=10.5120/ijca78706a352984

@article{10.5120/ijca78706a352984,
  author    = {Ashish Joshi},
  title     = {Abstractive Summarization of Spoken Language: A Comparative Evaluation of BART and T5 on Podcast and Conversational Speech Transcripts},
  journal   = {International Journal of Computer Applications},
  year      = {2026},
  volume    = {187},
  number    = {97},
  pages     = {11--19},
  doi       = {10.5120/ijca78706a352984},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
Abstract

The exponential growth of long-form audio content, particularly podcasts and lectures, creates an urgent need for effective summarization systems capable of condensing hours of speech into concise, coherent summaries. This study presents a comprehensive comparative evaluation of two transformer-based architectures, BART and T5, for abstractive summarization of spoken language transcripts. Unlike prior work that relies on written dialogue datasets, the author fine-tunes and evaluates both models on three speech-specific datasets: PodcastSum (12,345 podcast episodes), How2 (12,987 instructional videos), and the AMI Meeting Corpus (137 hours of meetings). A multi-faceted evaluation framework is employed, combining automated metrics (ROUGE, BLEU, BERTScore, METEOR) with human judgments across five quality dimensions (coherence, fluency, factual consistency, conciseness, and speaker attribution). Statistical significance testing confirms observed differences, and qualitative analysis reveals model-specific strengths and failure patterns. Results demonstrate that BART significantly outperforms T5 across all automated metrics (p < 0.01) and receives higher human ratings for factual consistency and structural cohesion. However, T5 generates more lexically diverse summaries and better handles extended dialogue contexts. Complementary strengths are identified that suggest hybrid approaches may be beneficial. To support reproducibility, the evaluation framework and human-annotated test samples are released. The findings provide actionable guidance for deploying summarization systems in real-world speech applications.
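The automated metrics named in the abstract are largely overlap-based: ROUGE-1 counts shared unigrams, while ROUGE-L scores the longest common subsequence between a generated summary and a reference. As an illustration only (this is not the authors' released evaluation framework, which is not shown here), a minimal pure-Python sketch of ROUGE-1 and ROUGE-L F1 might look like:

```python
from collections import Counter

def rouge_1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Clipped overlap: each reference token can be matched at most
    # as many times as it occurs in the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def _lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: F-measure over the longest common subsequence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = _lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)
```

In practice one would use an established implementation (e.g. the `rouge-score` package, which adds stemming and bootstrap confidence intervals), but the sketch shows why these metrics reward surface overlap rather than factual consistency, motivating the human evaluation dimensions the study adds.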

References
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
  • Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
  • Karim Benharrak, Puyuan Peng, and Amy Pavel. Talkless: Blending extractive and abstractive summarization for editing speech to preserve content and style. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, pages 1–19, 2025.
  • Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. The AMI meeting corpus: A pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction (MLMI), pages 28–39, 2007.
  • Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A Smith. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pages 7282–7296, 2021.
  • Ann Clifton, Sravana Reddy, Yongze Yu, Aasish Pappu, Rezvaneh Rezapour, Hamed Bonab, Maria Eskevich, Gareth Jones, Jussi Karlgren, Ben Carterette, et al. 100,000 podcasts: A spoken English document corpus. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5903–5917, 2020.
  • David Demeter, Oshin Agarwal, Simon Ben Igeri, Marko Sterbentz, Neil Molino, John M Conroy, and Ani Nenkova. Summarization from leaderboards to practice: Choosing a representation backbone and ensuring robustness. arXiv preprint arXiv:2306.10555, 2023.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
  • Alexander R Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409, 2021.
  • Sadaoki Furui. Speech recognition technology in multimodal/ubiquitous computing environments. In International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 1–8, 2004.
  • Iona Gessinger, Erfan A Shams, and Julie Carson-Berndsen. Under the hood: Phonemic restoration in transformer-based automatic speech recognition. Computer Speech & Language, page 101893, 2025.
  • Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wróblewska. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization (EMNLP), pages 70–79, 2019.
  • Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Ryo Fukuda, William Chen, and Shinji Watanabe. Pick and summarize: Integrating extractive and abstractive speech summarization. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pages 281–285. International Speech Communication Association, 2025.
  • Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 388–395, 2004.
  • Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9339–9346, 2020.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, 2020.
  • Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In ACL Text Summarization Branches Out, pages 74–81, 2004.
  • Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, and Saab Mansour. MDSEval: A meta-evaluation benchmark for multimodal dialogue summarization. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14707–14727, 2025.
  • Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, 2020.
  • Florian Metze, Zaid Sheikh, Alex Waibel, Jonas Gehring, Kevin Kilgour, Quoc Bao Nguyen, and Viet Huy Nguyen. Models of tone and intonation for speech summarization. Computer Speech & Language, 45:280–295, 2017.
  • Gabriel Murray, Steve Renals, and Jean Carletta. Extractive summarization of meeting recordings. In Interspeech, pages 593–596, 2005.
  • Ani Nenkova and Kathleen McKeown. A survey of text summarization techniques, pages 43–76. Springer, 2012.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, 2002.
  • Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations (ICLR), 2018.
  • Picovoice. Complete guide to summarization APIs & SDKs (2026). https://picovoice.ai/blog/guide-to-summarization-apis/, 2026. Accessed: 2026-03-08.
  • Podcast Insights. Podcast statistics 2026: Global market analysis. https://www.podcastinsights.com/podcast-statistics/, 2026. Accessed: 2026-03-08.
  • Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (ICML), pages 28492–28518, 2023.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  • Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. How2: A large-scale dataset for multimodal language understanding. In NeurIPS Workshop on Visually Grounded Interaction and Language, 2018.
  • Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083, 2017.
  • Elizabeth Shriberg. Spontaneous speech: How people really talk and why engineers should care. In Interspeech, pages 1781–1784, 2005.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020.
  • Xiaodan Zhu, Gerald Penn, and Frank Rudzicz. Multicriteria-based strategy to stop active learning for text classification. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 373–381, 2009.
Index Terms
Computer Science
Information Sciences
Keywords

Abstractive Summarization, Spoken Language Understanding, BART, T5, Transformer Models, Podcast Transcription, Human Evaluation
