
AI Models Specialized in Code Do Not Leak All Personal Data Equally Easily

A new study examines how the risk of data leakage varies across different types of personal information when code-generating language models are trained on open-source repositories.

Large language models that assist with programming are trained on vast collections of code, which often contain developers' names, email addresses, and other personal information. Previous studies have shown that commercial models can reproduce such identifiers, but they have treated personal information as a single, uniform risk category.

Now, work by Hua Yang, Alejandro Velasco, Sen Fang, Bowen Xu, and Denys Poshyvanyk digs deeper into whether models learn some types of personal information more easily than others and leak them more frequently. Personally identifiable information (PII) ranges from individual usernames to addresses and identification numbers, and the risks associated with these categories can vary significantly.

To this end, the researchers built a dataset containing various types of personal information drawn from real code. They fine-tuned several code language models of different sizes on this dataset and tracked so-called training dynamics: how, and with what confidence, the models learn different pieces of information during training.
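As a rough illustration of what tracking such dynamics can look like in practice, the sketch below fine-tunes a small causal language model on toy code snippets and records, after each epoch, the average probability the model assigns to an embedded email address. The model name, the samples, and the PII spans are placeholders chosen for the example, not the authors' actual setup.

```python
# Illustrative sketch only: monitor how confidently a model predicts PII
# tokens across fine-tuning epochs. gpt2 stands in for a code model; the
# "code" samples and email addresses are made up for the example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for a code-specialized LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy training samples; the email address is the PII span we monitor.
samples = [
    '# author: jane.doe@example.com\ndef add(a, b):\n    return a + b\n',
    '# maintainer: john.roe@example.org\ndef sub(a, b):\n    return a - b\n',
]
pii_strings = ["jane.doe@example.com", "john.roe@example.org"]

def pii_confidence(text: str, pii: str) -> float:
    """Mean probability the model assigns to the PII tokens inside `text`."""
    enc = tok(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    start_char = text.index(pii)
    end_char = start_char + len(pii)
    # token positions whose character span overlaps the PII span
    pii_positions = [i for i, (s, e) in enumerate(offsets)
                     if s < end_char and e > start_char]
    ids = enc["input_ids"][0]
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits[0], dim=-1)
    # probability of each PII token given the preceding context
    token_probs = [probs[pos - 1, ids[pos]].item()
                   for pos in pii_positions if pos > 0]
    return sum(token_probs) / len(token_probs)

for epoch in range(3):
    model.train()
    for text in samples:
        enc = tok(text, return_tensors="pt")
        loss = model(**enc, labels=enc["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
    confs = [pii_confidence(t, p) for t, p in zip(samples, pii_strings)]
    print(f"epoch {epoch}: mean PII confidence = {sum(confs) / len(confs):.4f}")
```

A curve of this confidence over epochs, computed separately per PII category, is one simple way to see which types of information a model memorizes fastest.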

Additionally, they developed a structural causal model, a statistical method aimed at distinguishing mere correlation from actual causation. This allowed them to assess whether the risk of data leaks is specifically due to the type of personal information or, for example, the size of the model or other background factors.
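To give a flavor of the underlying idea, the toy sketch below simulates leak outcomes for different PII categories and model sizes, then fits a logistic regression that adjusts for model size. This is a simplified, adjustment-based stand-in for that intuition, not the structural causal model the authors actually use; all variables and numbers are invented for the example.

```python
# Illustrative sketch only: a toy estimate of whether PII type affects leak
# probability once model size is held fixed. Data are simulated; the paper
# relies on a full structural causal model rather than this shortcut.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
# simulated background factor: model size in billions of parameters
model_size = rng.choice([0.35, 1.3, 2.7], size=n)
# simulated "treatment": the category of personal information in each sample
pii_type = rng.choice(["email", "username", "id_number"], size=n)
# simulated outcome: whether the fine-tuned model regurgitated the PII,
# driven here by both the PII type and the model size
base = {"email": -1.0, "username": -2.0, "id_number": -2.5}
logit = np.array([base[t] for t in pii_type]) + 0.6 * model_size
leaked = rng.random(n) < 1 / (1 + np.exp(-logit))

df = pd.DataFrame({"leaked": leaked.astype(int),
                   "pii_type": pii_type,
                   "model_size": model_size})

# Logistic regression adjusting for model size: the pii_type coefficients
# approximate its effect on leakage with model size held constant.
result = smf.logit("leaked ~ C(pii_type) + model_size", data=df).fit(disp=0)
print(result.summary())
```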

The study does not focus solely on whether a model leaks sensitive information, but on how and why certain types of information are absorbed by the model more readily than others. This can help in developing more precise and targeted safeguards for training AI models specialized in code.

Source: Understanding Privacy Risks in Code Models Through Training Dynamics: A Causal Approach, ArXiv (AI).

This text was generated with AI assistance and may contain errors. Please verify details from the original source.

Original research: Understanding Privacy Risks in Code Models Through Training Dynamics: A Causal Approach
Publisher: ArXiv (AI)
Authors: Hua Yang, Alejandro Velasco, Sen Fang, Bowen Xu, Denys Poshyvanyk
December 26, 2025