International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 40
Published: September 2025
Authors: P.L.M.K. Bandara
P.L.M.K. Bandara. Estimating Re-identification Risk with Greater Accuracy: A Sample–Population Uniqueness Approach. International Journal of Computer Applications. 187, 40 (September 2025), 1-7. DOI=10.5120/ijca2025925684
@article{ 10.5120/ijca2025925684,
  author = { P.L.M.K. Bandara },
  title = { Estimating Re-identification Risk with Greater Accuracy: A Sample–Population Uniqueness Approach },
  journal = { International Journal of Computer Applications },
  year = { 2025 },
  volume = { 187 },
  number = { 40 },
  pages = { 1-7 },
  doi = { 10.5120/ijca2025925684 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2025
%A P.L.M.K. Bandara
%T Estimating Re-identification Risk with Greater Accuracy: A Sample–Population Uniqueness Approach
%J International Journal of Computer Applications
%V 187
%N 40
%P 1-7
%R 10.5120/ijca2025925684
%I Foundation of Computer Science (FCS), NY, USA
The increasing availability of microdata for research and policy analysis raises critical concerns about the risk of re-identification, particularly when datasets contain quasi-identifying attributes. This paper proposes a novel model for estimating re-identification risk through the joint analysis of sample and population uniqueness, and evaluates its performance against the conventional log-linear approach. The methodology combines repeated sampling with aggregation of uniqueness measures to estimate population-level risk, with precision, recall, and F1-score employed to validate accuracy. Empirical evaluation was conducted on three real-world datasets: student performance, insurance claims, and car purchasing inquiries. The results demonstrate that re-identification risk is strongly dataset-dependent. The insurance dataset exhibited near-total uniqueness at both sample and population levels, highlighting an elevated probability of re-identification and the urgent need for robust disclosure controls. In contrast, the student performance and car purchasing datasets showed lower, though still considerable, proportions of unique records. Across all datasets, the proposed model closely aligned with true population counts and consistently outperformed the log-linear model in terms of accuracy. The findings underscore the inadequacy of traditional risk estimation methods for modern, high-dimensional datasets. The proposed model provides a more accurate and reliable framework for disclosure risk assessment, offering valuable guidance for data custodians and policymakers in balancing data utility with privacy protection.
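As a rough illustration of the ideas named in the abstract (sample uniqueness, population uniqueness, repeated sampling, and precision/recall/F1), the following Python sketch may help. It is not the paper's model: the function names, the toy records, the quasi-identifiers `age` and `zip`, and the particular definitions of precision and recall used here are all assumptions made for the example. A record counts as unique when its combination of quasi-identifier values occurs exactly once.

```python
import random
from collections import Counter


def unique_keys(records, quasi_identifiers):
    """Return the quasi-identifier combinations that occur exactly once."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {key for key, n in counts.items() if n == 1}


def uniqueness_scores(population, quasi_identifiers, sample_fraction,
                      n_repeats, seed=0):
    """Average precision, recall, and F1 over repeated random samples.

    Assumed definitions (illustrative only, not taken from the paper):
      precision = share of sample uniques that are also population uniques
      recall    = share of population uniques that appear as sample uniques
    """
    rng = random.Random(seed)
    pop_unique = unique_keys(population, quasi_identifiers)
    sample_size = max(1, round(sample_fraction * len(population)))
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    for _ in range(n_repeats):
        # Draw a simple random sample without replacement.
        sample = rng.sample(population, sample_size)
        samp_unique = unique_keys(sample, quasi_identifiers)
        true_pos = len(samp_unique & pop_unique)

        precision = true_pos / len(samp_unique) if samp_unique else 0.0
        recall = true_pos / len(pop_unique) if pop_unique else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        totals["precision"] += precision
        totals["recall"] += recall
        totals["f1"] += f1

    # Aggregate the per-sample measures by averaging across repetitions.
    return {name: value / n_repeats for name, value in totals.items()}


# Hypothetical toy population with two quasi-identifiers.
toy_population = [
    {"age": 30, "zip": "A"},
    {"age": 30, "zip": "A"},
    {"age": 41, "zip": "B"},
    {"age": 52, "zip": "C"},
]

if __name__ == "__main__":
    print(unique_keys(toy_population, ["age", "zip"]))
    print(uniqueness_scores(toy_population, ["age", "zip"],
                            sample_fraction=0.5, n_repeats=100))
```

In this sketch, a precision well below 1 would indicate that many records look unique in the sample only because their matching records were not drawn, which is the gap between sample and population uniqueness that risk estimators try to account for.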