Research Article

Synthetic Medical Data Generation using Transformer-based Generative AI: A Performance Comparison with Faker and CTGAN

by  Srinivas Suresh Sikhakolli, Asha Kiran Sikhakolli
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 106
Published: May 2026
10.5120/ijca372b529d7c61

Srinivas Suresh Sikhakolli, Asha Kiran Sikhakolli. Synthetic Medical Data Generation using Transformer-based Generative AI: A Performance Comparison with Faker and CTGAN. International Journal of Computer Applications. 187, 106 (May 2026), 22-26. DOI=10.5120/ijca372b529d7c61

@article{ 10.5120/ijca372b529d7c61,
  author    = { Srinivas Suresh Sikhakolli and Asha Kiran Sikhakolli },
  title     = { Synthetic Medical Data Generation using Transformer-based Generative AI: A Performance Comparison with Faker and CTGAN },
  journal   = { International Journal of Computer Applications },
  year      = { 2026 },
  volume    = { 187 },
  number    = { 106 },
  pages     = { 22-26 },
  doi       = { 10.5120/ijca372b529d7c61 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2026
%A Srinivas Suresh Sikhakolli
%A Asha Kiran Sikhakolli
%T Synthetic Medical Data Generation using Transformer-based Generative AI: A Performance Comparison with Faker and CTGAN
%J International Journal of Computer Applications
%V 187
%N 106
%P 22-26
%R 10.5120/ijca372b529d7c61
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Access to medical data is essential for health care research and advanced analytics. However, strict privacy regulations significantly limit data availability, hindering machine learning applications. Because of these limitations, the use of synthetic data is rising across the world. Prior studies focused on building synthetic data using rule-based tools such as Faker and deep learning models such as CTGAN. In recent years, ChatGPT, a transformer-based generative AI model, has emerged with advanced capabilities to generate a wide variety of synthetic data on demand. The aim of this research is to show that the transformer-based generative AI model produces quality synthetic data that yields better predictive performance than the Faker and CTGAN models. The synthetic data was generated with reference to the UCI Cleveland Heart data, and a Random Forest algorithm was used to evaluate each generator's output. The results of the experiment show that synthetic data generated by the transformer-based GenAI model, ChatGPT, yields better performance than data generated by the Faker and CTGAN models. They also show that the performance metrics of the ChatGPT-based synthetic data are close to those of the actual Cleveland heart medical data. Our findings suggest that the ChatGPT model effectively captured clinical relationships and offers practical insights for researchers without compromising privacy in the synthetic data. This type of experiment is useful for non-clinical research.
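
The evaluation idea in the abstract, scoring a Random Forest trained on each data source, can be illustrated with a minimal sketch. It assumes a "train on each source, test on real held-out rows" protocol, which the abstract does not confirm, and it uses scikit-learn. The data are placeholders produced by `make_classification`, sized at 303 rows and 13 features only to resemble the commonly used Cleveland subset. The "synthetic" set is a noise-perturbed copy of the training rows standing in for a generator's output, not data from Faker, CTGAN, or ChatGPT. The function name `evaluate_training_source` and the choice of accuracy and F1 as metrics are likewise assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_training_source(X_train, y_train, X_test, y_test, seed=0):
    """Fit a Random Forest on one training source; score it on real held-out rows."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
    }


# Placeholder "real" data (assumption): NOT the Cleveland dataset itself.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)
X_real_train, X_real_test, y_real_train, y_real_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical stand-in for a generator's output: real training rows plus noise.
rng = np.random.default_rng(0)
X_synth = X_real_train + rng.normal(scale=0.5, size=X_real_train.shape)
y_synth = y_real_train.copy()

# Every model is scored on the same real held-out rows, so the metrics
# are directly comparable across training sources.
results = {
    "real": evaluate_training_source(
        X_real_train, y_real_train, X_real_test, y_real_test
    ),
    "synthetic_stand_in": evaluate_training_source(
        X_synth, y_synth, X_real_test, y_real_test
    ),
}

for source, metrics in results.items():
    print(source, {name: round(value, 3) for name, value in metrics.items()})
```

In a comparison like the one described, each generator's output would take the place of the stand-in, and the gap between its metrics and the "real" row would indicate how much predictive signal the synthetic data preserved.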

References
  • Bayrem Kaabachi, Jérémie Despraz, Thierry Meurers, Karen Otte, Mehmed Halilovic, Bogdan Kulynych, Fabian Prasser, and Jean Louis Raisaro (2023). Scoping review: “Privacy and utility in synthetic healthcare data.” PubMed. Available at https://pubmed.ncbi.nlm.nih.gov/39870798/
  • DataIntelo. (2024). Synthetic data in healthcare market outlook 2025–2033 (Market report). https://dataintelo.com/report/synthetic-data-in-healthcare-market
  • Fang, M. L., Dhami, D. S., & Kersting, K. (2022). DP-CTGAN: Differentially private tabular GAN. In M. Michalowski, S. S. R. Abidi, & S. Abidi (Eds.), Artificial Intelligence in Medicine: 20th International Conference on Artificial Intelligence in Medicine (AIME 2022) – Proceedings (pp. 178–188). Springer. https://doi.org/10.1007/978-3-031-09342-5_17
  • Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1989). Heart disease [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X
  • Z. Zhao, A. Kunar, R. Birke, H. Van der Scheer and L. Y. Chen, “CTAB-GAN+: Enhancing tabular data synthesis,” Frontiers in Big Data, vol. 6, p. 1296508, Jan. 2024, doi: 10.3389/fdata.2023.1296508.
  • Umesh, C., Mahendra, M., Bej, S., Wolkenhauer, O., & Wolfien, M. (2024). Challenges and applications in generative AI for clinical tabular data in physiology. Pflügers Archiv – European Journal of Physiology.
  • Ahmed, H. A., Nepomuceno, J. A., Vega‑Márquez, B., et al. (2025). “Synthetic Data Generation for Healthcare: Exploring Generative Adversarial Networks Variants for Medical Tabular Data.” International Journal of Data Science and Analytics, Springer.
  • Ghosheh, M., Murtaza, S., & others (2025). “A Systematic Review of Privacy‑Preserving Techniques for Synthetic Tabular Health Data.” Discover Data (Springer).
  • Karmakar, A., Shaw, A., Rakshit, S., Chakraborty, S., Biswas, S., Sahoo, S., & Biswas, S. (2025). The role of generative AI in medical image synthesis: A review. Discover Applied Sciences, 7, Article 714. https://doi.org/10.1007/s42452-025-07714-7
  • Zhang, W., Liu, R., Zhu, X., et al. (2025). “Enhancing Privacy Protection of Physical Examination Data through Synthetic Algorithms Based on Differential Privacy.” BMC Medical Informatics and Decision Making (Springer Nature).
Index Terms
Computer Science
Information Sciences
Keywords

Synthetic Medical Data, Generative AI, Privacy-Preserving Data, Random Forest, Model Performance
