International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 77
Published: January 2026
Authors: Jundi Yang, Heng Yao
DOI: 10.5120/ijca2026926252
Jundi Yang, Heng Yao. A Knowledge-Graph–Driven Multimodal Large Model for Semantic Understanding and Controllable Generation of Intangible Cultural Heritage. International Journal of Computer Applications. 187, 77 (January 2026), 1-8. DOI=10.5120/ijca2026926252
@article{10.5120/ijca2026926252,
  author    = {Jundi Yang and Heng Yao},
  title     = {A Knowledge-Graph–Driven Multimodal Large Model for Semantic Understanding and Controllable Generation of Intangible Cultural Heritage},
  journal   = {International Journal of Computer Applications},
  year      = {2026},
  volume    = {187},
  number    = {77},
  pages     = {1-8},
  doi       = {10.5120/ijca2026926252},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2026
%A Jundi Yang
%A Heng Yao
%T A Knowledge-Graph–Driven Multimodal Large Model for Semantic Understanding and Controllable Generation of Intangible Cultural Heritage
%J International Journal of Computer Applications
%V 187
%N 77
%P 1-8
%R 10.5120/ijca2026926252
%I Foundation of Computer Science (FCS), NY, USA
Intangible Cultural Heritage (ICH) encompasses complex layers of symbolic meaning expressed through motifs, crafts, rituals, and regional traditions. Contemporary multimodal generative models frequently overlook such domain-specific semantics, leading to visually appealing but culturally inaccurate outputs. To address this limitation, this paper introduces a unified knowledge-graph–driven multimodal generation framework that couples a structured ICH Knowledge Graph (KG), a domain-adapted Large Language Model (LLM), and a controllable diffusion-based text-to-image generator. The KG organizes motifs, techniques, symbolic associations, and regional contexts into a structured semantic space, which the LLM leverages to interpret user queries and retrieve culturally grounded constraints. These constraints are injected into the diffusion model through a multi-stage semantic fusion mechanism, enabling culturally faithful and controllable image synthesis. Experimental results across three curated ICH datasets demonstrate that the proposed framework outperforms representative baselines in cultural semantic accuracy, text–image alignment, and robustness to linguistic variation. The proposed approach provides a principled pathway for integrating symbolic cultural knowledge with modern generative models, supporting large-scale preservation, computational interpretation, and creative revitalization of intangible cultural heritage.
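To make the retrieve-then-ground pipeline in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation: a toy dictionary stands in for the structured ICH Knowledge Graph, a template function stands in for the domain-adapted LLM, and prompt-level conditioning stands in for the multi-stage semantic fusion. All names, the example motif, and its attributes are hypothetical assumptions.

# Illustrative sketch of the KG -> LLM -> diffusion pipeline described in
# the abstract. The toy graph, the motif entry, and all function names are
# hypothetical assumptions for illustration only.

# 1. A toy ICH knowledge graph: a motif linked to its technique,
#    symbolic association, and regional context.
ICH_KG = {
    "cloud-and-thunder motif": {
        "technique": "blue-and-white underglaze porcelain painting",
        "symbolism": "continuity and good fortune",
        "region": "Jingdezhen, Jiangxi",
    },
}

def retrieve_constraints(motif: str) -> dict:
    """KG retrieval step: fetch culturally grounded attributes for a motif."""
    return ICH_KG[motif]

def compose_prompt(query: str, c: dict) -> str:
    """Stand-in for the domain-adapted LLM: fold the retrieved constraints
    into the text prompt so the generator receives explicit cultural
    semantics rather than the bare user query."""
    return (f"{query}, rendered with the {c['technique']} technique, "
            f"symbolizing {c['symbolism']}, in the regional style of "
            f"{c['region']}")

if __name__ == "__main__":
    query = "a ceramic vase decorated with the cloud-and-thunder motif"
    prompt = compose_prompt(query, retrieve_constraints("cloud-and-thunder motif"))
    print(prompt)
    # The grounded prompt would then condition a controllable text-to-image
    # generator, e.g. with the Hugging Face diffusers package (an assumption;
    # the paper's own generator and fusion mechanism are not shown here):
    #   from diffusers import StableDiffusionPipeline
    #   pipe = StableDiffusionPipeline.from_pretrained(
    #       "runwayml/stable-diffusion-v1-5")
    #   image = pipe(prompt).images[0]

Note that the paper's multi-stage semantic fusion injects constraints inside the diffusion process itself; the prompt-level injection above is only the simplest entry point and is shown solely to clarify the flow from KG retrieval to conditioned generation.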