Estimating Re-identification Risk with Greater Accuracy: A Sample–Population Uniqueness Approach

P.L.M.K. Bandara

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Issue 40

Published: September 2025

10.5120/ijca2025925684

P.L.M.K. Bandara. Estimating Re-identification Risk with Greater Accuracy: A Sample–Population Uniqueness Approach. International Journal of Computer Applications. 187, 40 (September 2025), 1-7. DOI=10.5120/ijca2025925684

Abstract

The increasing availability of microdata for research and policy analysis raises critical concerns about the risk of re-identification, particularly when datasets contain quasi-identifying attributes. This paper proposes a novel model for estimating re-identification risk through the joint analysis of sample and population uniqueness, and evaluates its performance against the conventional log-linear approach. The methodology combines repeated sampling with aggregation of uniqueness measures to estimate population-level risk, with precision, recall, and F1-score employed to validate accuracy. Empirical evaluation was conducted on three real-world datasets: student performance, insurance claims, and car purchasing inquiries. The results demonstrate that re-identification risk is strongly dataset-dependent. The insurance dataset exhibited near-total uniqueness at both sample and population levels, highlighting an elevated probability of re-identification and the urgent need for robust disclosure controls. In contrast, the student performance and car purchasing datasets showed lower, though still considerable, proportions of unique records. Across all datasets, the proposed model closely aligned with true population counts and consistently outperformed the log-linear model in terms of accuracy. The findings underscore the inadequacy of traditional risk estimation methods for modern, high-dimensional datasets. The proposed model provides a more accurate and reliable framework for disclosure risk assessment, offering valuable guidance for data custodians and policymakers in balancing data utility with privacy protection.
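To make the notions of sample and population uniqueness concrete, the following is a minimal Python sketch and not the model proposed in the paper. It assumes records are dictionaries, that the quasi-identifier columns are known in advance, and that the full population is available for comparison. The function names (`unique_keys`, `precision_recall_f1`, `estimate_uniqueness`), the toy records, and the default parameters are all hypothetical. It draws repeated random samples, counts quasi-identifier combinations that occur exactly once, and averages how often a sample-unique combination is also unique in the population.

```python
import random
from collections import Counter


def key_of(record, quasi_identifiers):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[q] for q in quasi_identifiers)


def unique_keys(records, quasi_identifiers):
    """Return the quasi-identifier combinations that occur exactly once."""
    counts = Counter(key_of(r, quasi_identifiers) for r in records)
    return {k for k, n in counts.items() if n == 1}


def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and F1 (illustrative helper)."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def estimate_uniqueness(population, quasi_identifiers,
                        sample_fraction=0.5, repeats=200, seed=0):
    """Repeatedly sample and aggregate uniqueness measures.

    Assumption: the true population is known, so sample uniques can be
    checked against population uniques directly.
    """
    rng = random.Random(seed)
    pop_unique = unique_keys(population, quasi_identifiers)
    size = max(1, round(len(population) * sample_fraction))
    precisions, sample_rates = [], []
    for _ in range(repeats):
        sample = rng.sample(population, size)
        samp_unique = unique_keys(sample, quasi_identifiers)
        # Precision here: share of sample uniques that are population uniques.
        p, _, _ = precision_recall_f1(samp_unique, pop_unique)
        precisions.append(p)
        sample_rates.append(len(samp_unique) / size)
    return {
        "population_unique_rate": len(pop_unique) / len(population),
        "mean_sample_unique_rate": sum(sample_rates) / repeats,
        "mean_precision": sum(precisions) / repeats,
    }


# Hypothetical toy population with two quasi-identifiers.
toy_population = [
    {"age_band": "20-29", "region": "A"},
    {"age_band": "20-29", "region": "A"},
    {"age_band": "30-39", "region": "A"},
    {"age_band": "30-39", "region": "B"},
    {"age_band": "40-49", "region": "B"},
    {"age_band": "40-49", "region": "B"},
]

if __name__ == "__main__":
    print(estimate_uniqueness(toy_population, ["age_band", "region"]))
```

In this sketch a low `mean_precision` would indicate that many records look unique in a sample only because their population twins were not drawn, which is the gap a population-level estimator has to correct for when the full population is not observable.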

Index Terms

Computer Science

Information Sciences

Keywords

Population Uniqueness, Sample Uniqueness, Reidentification Risk Estimation, Reidentification Example, Reidentification Attack