Research Article

Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks

by Anurag Shrivastava, Shivang Agrawal, Sanjana Keshari, Mohd. Taukeer, Krishna Vishwakarma
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 105
Published: May 2026
10.5120/ijcac86083229a5f

Anurag Shrivastava, Shivang Agrawal, Sanjana Keshari, Mohd. Taukeer, Krishna Vishwakarma. Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks. International Journal of Computer Applications. 187, 105 (May 2026), 51-60. DOI=10.5120/ijcac86083229a5f

@article{10.5120/ijcac86083229a5f,
  author    = {Anurag Shrivastava and Shivang Agrawal and Sanjana Keshari and Mohd. Taukeer and Krishna Vishwakarma},
  title     = {Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks},
  journal   = {International Journal of Computer Applications},
  year      = {2026},
  volume    = {187},
  number    = {105},
  pages     = {51-60},
  doi       = {10.5120/ijcac86083229a5f},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2026
%A Anurag Shrivastava
%A Shivang Agrawal
%A Sanjana Keshari
%A Mohd. Taukeer
%A Krishna Vishwakarma
%T Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks
%J International Journal of Computer Applications
%V 187
%N 105
%P 51-60
%R 10.5120/ijcac86083229a5f
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Heritage tourism in India occupies a curious position: the sites themselves are extraordinary, yet the information infrastructure surrounding them remains thin, fragmented, and predominantly English-language, which excludes most of the domestic visitors who come to them. This paper describes Vision Bridge, a serverless multimodal chatbot for heritage tourists that operates entirely through the Telegram messaging platform. The system accepts photographs of architectural features and returns contextually accurate multilingual descriptions, as both text and synthesized audio, within two seconds on standard mobile connections, with no application installation required. The authors introduce three original contributions beyond the prior text-only serverless heritage chatbot architecture on which this work builds. The first is the Adaptive Confidence-Gated Visual Query Module (ACVQM), a CLIP ViT-B/32 embedding retrieval system augmented with an image quality pre-filter and a query-adaptive threshold mechanism that adjusts matching confidence requirements based on estimated query ambiguity, improving identification robustness under real outdoor tourism conditions. The second is the Split-Horizon Delivery Protocol (SHDP), a formally defined two-phase asynchronous pipeline that decouples initial text delivery from background audio synthesis, achieving 620 ms perceived response latency while full audio narration completes within 2.0 seconds. The third is a theoretical grounding of the design in Cognitive Load Theory and Information Foraging Theory, which provides a principled framework for understanding why multimodal, audio-visual delivery of heritage information outperforms text-only presentation for tourists navigating unfamiliar architectural environments. Experimental evaluation across more than 500 interaction cycles at the Residency Complex, Lucknow, demonstrates 87.4% top-1 visual identification accuracy with sub-500 ms inference on CPU-only cloud hardware.
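As an illustration of the confidence-gated retrieval idea behind ACVQM, the following is a minimal sketch and not the authors' implementation. It assumes that image embeddings have already been computed (short toy vectors stand in for CLIP ViT-B/32 embeddings), that a sharpness score in [0, 1] is available for the quality pre-filter, and that query ambiguity can be estimated from the margin between the two best matches. The function name `identify` and all parameter values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Match:
    label: Optional[str]  # matched gallery label, or None if rejected
    score: float          # best cosine similarity found
    threshold: float      # threshold actually applied to this query
    reason: str           # "accepted", "low_quality", "below_threshold", ...


def identify(
    query: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    sharpness: float,
    min_sharpness: float = 0.3,    # assumed quality cut-off
    base_threshold: float = 0.75,  # assumed baseline confidence
    max_penalty: float = 0.15,     # assumed extra confidence when ambiguous
    margin_scale: float = 0.2,     # assumed margin treated as unambiguous
) -> Match:
    # Quality pre-filter: reject poor images before any retrieval work.
    if sharpness < min_sharpness:
        return Match(None, 0.0, base_threshold, "low_quality")
    ranked = sorted(
        ((cosine(query, emb), label) for label, emb in gallery.items()),
        reverse=True,
    )
    if not ranked:
        return Match(None, 0.0, base_threshold, "empty_gallery")
    best_score, best_label = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else -1.0
    # Assumed ambiguity estimate: a small top-1/top-2 margin means high ambiguity.
    ambiguity = max(0.0, 1.0 - (best_score - runner_up) / margin_scale)
    # Query-adaptive threshold: ambiguous queries must clear a higher bar.
    threshold = base_threshold + max_penalty * ambiguity
    if best_score >= threshold:
        return Match(best_label, best_score, threshold, "accepted")
    return Match(None, best_score, threshold, "below_threshold")
```

In this sketch a query that sits equally close to two gallery entries is rejected rather than guessed, while a clear match passes at the baseline threshold. A deployed system would presumably respond to a rejection by asking the visitor for another photograph.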
A seven-day field pilot with 120 participants yielded a Cronbach's alpha of 0.89 for the TAM instrument, a mean visual utility score of 4.71/5 (SD = 0.39), and a statistically significant improvement over text-only baseline scores (t(119) = 3.47, p < 0.001, Cohen's d = 0.63). These results position Vision Bridge as a practically viable, replicable architectural blueprint for inclusive multimodal heritage information systems in resource-constrained deployments.
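The two-phase delivery idea behind SHDP can be sketched with Python's `asyncio`. This is an illustrative sketch under stated assumptions, not the protocol as formally defined in the paper: `synthesize_audio` is a hypothetical stand-in for a neural TTS call, and `send_text` and `send_audio` are hypothetical callbacks that a real bot would map to its messaging API.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def synthesize_audio(text: str) -> bytes:
    """Hypothetical TTS stand-in: waits briefly, then returns placeholder bytes."""
    await asyncio.sleep(0.05)  # simulated synthesis latency, not a measured value
    return text.encode("utf-8")


async def split_horizon_reply(
    text: str,
    send_text: Callable[[str], Awaitable[None]],
    send_audio: Callable[[bytes], Awaitable[None]],
) -> asyncio.Task:
    # Phase 1: deliver the text description right away, so perceived
    # latency does not include audio synthesis time.
    await send_text(text)

    # Phase 2: synthesize and deliver the audio narration in the background.
    async def audio_phase() -> None:
        audio = await synthesize_audio(text)
        await send_audio(audio)

    return asyncio.create_task(audio_phase())


async def demo() -> Tuple[List[str], List[str]]:
    log: List[str] = []

    async def send_text(message: str) -> None:
        log.append("text")

    async def send_audio(audio: bytes) -> None:
        log.append("audio")

    task = await split_horizon_reply("Sample description", send_text, send_audio)
    after_phase_one = list(log)  # snapshot taken before the audio phase finishes
    await task                   # wait for the background audio phase
    return after_phase_one, log
```

Running `asyncio.run(demo())` shows that only the text has been delivered when phase 1 returns and that the audio follows once the background task completes. In a serverless setting the background phase would likely need its own invocation or a queue, since a function instance may not stay alive after responding; that detail is outside this sketch.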

Index Terms
Computer Science
Information Sciences
Keywords

Vision-Language Models, Serverless Architecture, Heritage Tourism, Multimodal Chatbot, Cognitive Load Theory, CLIP, Telegram API, Neural TTS, Information Foraging
