Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks

Anurag Shrivastava; Shivang Agrawal; Sanjana Keshari; Mohd. Taukeer; Krishna Vishwakarma

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Issue 105

Published: May 2026

10.5120/ijcac86083229a5f

Anurag Shrivastava, Shivang Agrawal, Sanjana Keshari, Mohd. Taukeer, Krishna Vishwakarma. Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks. International Journal of Computer Applications. 187, 105 (May 2026), 51-60. DOI=10.5120/ijcac86083229a5f

@article{10.5120/ijcac86083229a5f,
  author    = {Anurag Shrivastava and Shivang Agrawal and Sanjana Keshari and Mohd. Taukeer and Krishna Vishwakarma},
  title     = {Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks},
  journal   = {International Journal of Computer Applications},
  year      = {2026},
  volume    = {187},
  number    = {105},
  pages     = {51-60},
  doi       = {10.5120/ijcac86083229a5f},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2026
%A Anurag Shrivastava
%A Shivang Agrawal
%A Sanjana Keshari
%A Mohd. Taukeer
%A Krishna Vishwakarma
%T Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks
%J International Journal of Computer Applications
%V 187
%N 105
%P 51-60
%R 10.5120/ijcac86083229a5f
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Heritage tourism in India occupies a curious position: the sites themselves are extraordinary, yet the information infrastructure surrounding them remains thin, fragmented, and predominantly English-language, which excludes most of the domestic visitors who use them. This paper describes Vision Bridge, a serverless multimodal chatbot for heritage tourists that operates entirely through the Telegram messaging platform. The system accepts photographs of architectural features and returns contextually accurate multilingual descriptions, as both text and synthesized audio, within two seconds on standard mobile connections, with no application installation required. The authors introduce three original contributions beyond the prior text-only serverless heritage chatbot architecture on which this work builds. The first is the Adaptive Confidence-Gated Visual Query Module (ACVQM), a CLIP ViT-B/32 embedding retrieval system augmented with an image quality pre-filter and a query-adaptive threshold mechanism that adjusts matching confidence requirements based on estimated query ambiguity, improving identification robustness under real outdoor tourism conditions. The second is the Split-Horizon Delivery Protocol (SHDP), a formally defined two-phase asynchronous pipeline that decouples initial text delivery from background audio synthesis, achieving 620 ms perceived response latency while full audio narration completes within 2.0 seconds. The third is a theoretical grounding of the design in Cognitive Load Theory and Information Foraging Theory, which provides a principled framework for understanding why multimodal, audio-visual delivery of heritage information outperforms text-only presentation for tourists navigating unfamiliar architectural environments. Experimental evaluation across more than 500 interaction cycles at the Residency Complex, Lucknow, demonstrates 87.4% top-1 visual identification accuracy with sub-500 ms inference on CPU-only cloud hardware.
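The confidence-gated retrieval idea behind ACVQM can be illustrated with a minimal sketch. The abstract does not specify the quality filter, the ambiguity estimate, or any threshold values, so everything below is an assumption made for illustration: the function names, the Laplacian-variance sharpness check, the top-1/top-2 margin used as an ambiguity proxy, and the constants `base_thr`, `boost`, `ref_margin`, and `min_sharpness` are all hypothetical. Embeddings are assumed to be precomputed (for example by a CLIP ViT-B/32 image encoder) and are passed in as plain lists of floats.

```python
import math
from statistics import pvariance


def sharpness(gray):
    """Variance of a 4-neighbour Laplacian over a 2D grayscale grid.

    Values near zero suggest a blank or blurred frame (assumed quality proxy).
    """
    h, w = len(gray), len(gray[0])
    lap = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(lap)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def adaptive_threshold(sims, base_thr=0.25, boost=0.5, ref_margin=0.1):
    """Raise the acceptance threshold when the two best matches are close.

    Hypothetical rule: ambiguity is 1 when top-1 and top-2 tie and falls to 0
    once their margin reaches ref_margin. Needs at least two gallery entries.
    """
    ranked = sorted(sims, reverse=True)
    margin = ranked[0] - ranked[1]
    ambiguity = 1.0 - min(max(margin / ref_margin, 0.0), 1.0)
    return base_thr + boost * ambiguity


def identify(query_emb, gallery_embs, labels, gray=None, min_sharpness=1e-3):
    """Return the best-matching label, or None if the query is gated out."""
    # Stage 1: image quality pre-filter (skipped when no pixel data is given).
    if gray is not None and sharpness(gray) < min_sharpness:
        return None
    # Stage 2: nearest-neighbour retrieval over precomputed embeddings.
    sims = [cosine(query_emb, g) for g in gallery_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    # Stage 3: accept only if the best score clears the adaptive threshold.
    return labels[best] if sims[best] >= adaptive_threshold(sims) else None


if __name__ == "__main__":
    gallery = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    names = ["feature_a", "feature_b", "feature_c"]
    print(identify([1.0, 0.05, 0.0, 0.0], gallery, names))  # clear match
    print(identify([1.0, 1.0, 0.0, 0.0], gallery, names))   # ambiguous query
```

In this sketch a query that sits midway between two gallery entries faces a stricter threshold and is rejected rather than guessed, which is one plausible reading of "query-adaptive" gating; the published module may estimate ambiguity differently.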
A seven-day field pilot with 120 participants yielded a Cronbach's alpha of 0.89 for the TAM instrument, a mean visual utility score of 4.71/5 (SD = 0.39), and a statistically significant improvement over text-only baseline scores (t(119) = 3.47, p < 0.001, Cohen's d = 0.63). These results position Vision Bridge as a practically viable, replicable architectural blueprint for inclusive multimodal heritage information systems in resource-constrained deployments.
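The two-phase delivery pattern that the abstract attributes to SHDP can be sketched with Python's `asyncio`. This is an illustrative sketch only, not the published protocol: `synthesize_audio`, `handle_query`, and the `send` callback are hypothetical names, the text-to-speech call is replaced by a short sleep, and no Telegram API calls are made.

```python
import asyncio


async def synthesize_audio(text: str) -> bytes:
    """Stand-in for a neural TTS request (hypothetical; just sleeps briefly)."""
    await asyncio.sleep(0.05)
    return text.encode("utf-8")


async def handle_query(description: str, send) -> asyncio.Task:
    """Deliver text first, then finish audio synthesis in the background."""
    # Phase 1: the text reply goes out as soon as the description is ready,
    # so perceived latency does not include audio synthesis time.
    await send("text", description)

    # Phase 2: audio is synthesized and sent without blocking phase 1.
    async def _deliver_audio():
        audio = await synthesize_audio(description)
        await send("audio", audio)

    return asyncio.create_task(_deliver_audio())


async def demo():
    """Record the order in which the two phases reach the user."""
    events = []

    async def send(kind, payload):
        events.append(kind)

    task = await handle_query("Sample description of an architectural feature.", send)
    after_phase_one = list(events)  # snapshot taken before audio completes
    await task
    return after_phase_one, events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning the background task lets the caller await or cancel the audio phase, which matters in a serverless setting where a function instance may be frozen once the handler returns; how the authors' deployment handles that is not stated in the abstract.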

References

A. Shrivastava, S. Agrawal, S. Keshari, K. Vishwakarma, and M. Taukeer, "SAFARSETU: An AI-Powered Multilingual Tourist Guide Chatbot for Cultural Heritage Exploration," International Journal for Research in Applied Science and Engineering Technology (IJRASET), vol. 13, no. XII, pp. 3126-3135, Dec. 2025. DOI: 10.22214/ijraset.2025.76656.
K. Sathiyabamavathy and K. P. Anju, "Role of Chatbots in Cultural Heritage Tourism: An Empirical Study on Ancient Forts and Palaces," Journal of Heritage Management, vol. 9, no. 1, pp. 9-28, 2024.
D. Deepa, A. K. Archana, and K. Karthik, "Heritage Information Chatbot," International Journal of Emerging Technologies and Innovative Research (JETIR), vol. 12, no. 7, pp. 290-293, Jul. 2025.
P. Reddy and A. Kumar, "Enhancing Visitor Experience Through a Chatbot for Historical Places in India Using Google Dialogflow," Journal of Engineering Sciences, vol. 15, no. 4, pp. 222-230, 2024.
F. Nafis, A. Yahyaouy, and B. Aghoutane, "Chatbots for Cultural Heritage: A Real Added Value," in Proc. 2nd Int. Conf. Big Data, Modelling and Machine Learning (BML), 2021, pp. 502-506.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal et al., "Learning Transferable Visual Models from Natural Language Supervision," in Proc. ICML, 2021, pp. 8748-8763.
J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models," in Proc. ICML, 2023, pp. 19730-19742.
H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual Instruction Tuning (LLaVA)," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024.
M. Deng, "Machine Learning Advances in Technology Applications: Cultural Heritage Tourism Trends in Experience Design," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 16, no. 4, pp. 186-196, 2025.
J. Hu et al., "DeepServe: Serverless Large Language Model Serving at Scale," in Proc. USENIX Annual Technical Conference (ATC), Boston, MA, USA, Jul. 2025.
Ministry of Tourism, Government of India, "India Tourism Statistics 2024," Market Research Division, New Delhi, 2024.
S. Alekseev et al., "Telegram Bot Development Using Python: An Educational Architecture," International Journal of Emerging Technologies, vol. 11, no. 4, pp. 30-35, 2024.
R. Boboc, E. Bautu, and F. Girbacia, "Augmented Reality and AI in Cultural Heritage: An Overview of the Last Decade," Applied Sciences, vol. 12, no. 19, p. 9859, 2022.
T. K. Gireesh Kumar, "A Study on Digital Preservation Methods for Cultural Heritage Sites in India," Asian Journal of Information Science and Technology, vol. 14, no. 2, pp. 45-52, 2024.
D. Harisanty et al., "Cultural Heritage Preservation in the Digital Age: Harnessing Artificial Intelligence," Digital Library Perspectives, vol. 40, no. 4, pp. 609-625, 2024.
J. Sweller, "Cognitive Load During Problem Solving: Effects on Learning," Cognitive Science, vol. 12, no. 2, pp. 257-285, 1988.
R. E. Mayer and R. Moreno, "Nine Ways to Reduce Cognitive Load in Multimedia Learning," Educational Psychologist, vol. 38, no. 1, pp. 43-52, 2003.
T. D. Wilson, "Models in Information Behaviour Research," Journal of Documentation, vol. 55, no. 3, pp. 249-270, 1999.
E. M. Rogers, Diffusion of Innovations, 5th ed. New York, NY, USA: Free Press, 2003.
F. D. Davis, "Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology," MIS Quarterly, vol. 13, no. 3, pp. 319-340, 1989.

Index Terms

Computer Science

Information Sciences

Keywords

Vision-Language Models, Serverless Architecture, Heritage Tourism, Multimodal Chatbot, Cognitive Load Theory, CLIP, Telegram API, Neural TTS, Information Foraging