International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 105
Published: May 2026
Authors: Anurag Shrivastava, Shivang Agrawal, Sanjana Keshari, Mohd. Taukeer, Krishna Vishwakarma
10.5120/ijcac86083229a5f
Anurag Shrivastava, Shivang Agrawal, Sanjana Keshari, Mohd. Taukeer, Krishna Vishwakarma. Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks. International Journal of Computer Applications. 187, 105 (May 2026), 51-60. DOI=10.5120/ijcac86083229a5f
@article{ 10.5120/ijcac86083229a5f,
author = { Anurag Shrivastava and Shivang Agrawal and Sanjana Keshari and Mohd. Taukeer and Krishna Vishwakarma },
title = { Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks },
journal = { International Journal of Computer Applications },
year = { 2026 },
volume = { 187 },
number = { 105 },
pages = { 51-60 },
doi = { 10.5120/ijcac86083229a5f },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2026
%A Anurag Shrivastava
%A Shivang Agrawal
%A Sanjana Keshari
%A Mohd. Taukeer
%A Krishna Vishwakarma
%T Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks
%J International Journal of Computer Applications
%V 187
%N 105
%P 51-60
%R 10.5120/ijcac86083229a5f
%I Foundation of Computer Science (FCS), NY, USA
Heritage tourism in India occupies a curious position: the sites themselves are extraordinary, yet the information infrastructure surrounding them remains thin, fragmented, and predominantly English-language, which excludes most of the domestic visitors who use them. This paper describes Vision Bridge, a serverless multimodal chatbot for heritage tourists that operates entirely through the Telegram messaging platform. The system accepts photographs of architectural features and returns contextually accurate multilingual descriptions, as both text and synthesized audio, within two seconds on standard mobile connections, with no application installation required. The authors introduce three original contributions beyond the prior text-only serverless heritage chatbot architecture on which this work builds. First, the Adaptive Confidence-Gated Visual Query Module (ACVQM) is a CLIP ViT-B/32 embedding retrieval system augmented with an image quality pre-filter and a query-adaptive threshold mechanism that adjusts matching confidence requirements based on estimated query ambiguity, improving identification robustness under real outdoor tourism conditions. Second, the Split-Horizon Delivery Protocol (SHDP) is a formally defined two-phase asynchronous pipeline that decouples initial text delivery from background audio synthesis, achieving 620 ms perceived response latency while full audio narration completes within 2.0 seconds. Third, the design is grounded in Cognitive Load Theory and Information Foraging Theory, which together provide a principled framework for understanding why multimodal, audio-visual delivery of heritage information outperforms text-only presentation for tourists navigating unfamiliar architectural environments. Experimental evaluation across 500+ interaction cycles at the Residency Complex, Lucknow, demonstrates 87.4% top-1 visual identification accuracy with sub-500 ms inference on CPU-only cloud hardware. A seven-day field pilot with 120 participants yielded a Cronbach's alpha of 0.89 for the Technology Acceptance Model (TAM) instrument, a visual utility mean score of 4.71/5 (SD = 0.39), and a statistically significant improvement over text-only baseline scores (t(119) = 3.47, p < 0.001, Cohen's d = 0.63). These results position Vision Bridge as a practically viable, replicable architectural blueprint for inclusive multimodal heritage information systems in resource-constrained deployments.
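The two-phase idea behind the Split-Horizon Delivery Protocol can be illustrated with a minimal sketch. This is not the authors' implementation: the names `split_horizon_reply`, `synthesize_audio`, and `send` are hypothetical, the text-to-speech step is a short sleep standing in for a real synthesis call, and no Telegram API is used. The sketch only shows the ordering the abstract describes: text is delivered first, and audio narration follows from a background task.

```python
import asyncio
import time
from typing import Awaitable, Callable

# Hypothetical transport callback: send(kind, payload) delivers one message.
Send = Callable[[str, object], Awaitable[None]]


async def synthesize_audio(text: str) -> bytes:
    """Hypothetical stand-in for a text-to-speech call (assumption)."""
    await asyncio.sleep(0.05)  # mimics synthesis latency
    return text.encode("utf-8")


async def split_horizon_reply(description: str, send: Send) -> asyncio.Task:
    """Deliver text at once, then narrate in the background."""
    # Phase 1: the text reply goes out immediately, so perceived
    # latency does not depend on audio synthesis.
    await send("text", description)

    # Phase 2: audio synthesis and delivery run as a background task.
    async def narrate() -> None:
        audio = await synthesize_audio(description)
        await send("audio", audio)

    return asyncio.create_task(narrate())


async def demo() -> list:
    """Record the order and timing of deliveries for one reply."""
    events = []
    start = time.perf_counter()

    async def send(kind: str, payload: object) -> None:
        events.append((kind, time.perf_counter() - start))

    task = await split_horizon_reply("Sample monument description.", send)
    await task  # awaited here only so the demo can observe phase 2
    return events


if __name__ == "__main__":
    for kind, elapsed in asyncio.run(demo()):
        print(f"{kind:5s} delivered after {elapsed * 1000:.0f} ms")
```

In a real handler the task returned by `split_horizon_reply` would typically be left to finish on its own rather than awaited, which is what decouples the first response from narration.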