The rise of large language models (LLMs) poses challenges to the applicability of data protection rules, and there is considerable debate about whether LLMs themselves store personal data. Recently, the Hamburg Commissioner for Data Protection and Freedom of Information (Hamburg DPA) published a discussion paper on large language models and personal data. The paper puts forward three basic theses:
1. The mere storage of an LLM does not constitute processing within the meaning of article 4 (2) GDPR. This is because no personal data is stored in LLMs. Insofar as personal data is processed in an LLM-supported AI system, the processing must comply with the requirements of the GDPR. This applies in particular to the output of such an AI system.
2. Given that no personal data is stored in LLMs, data subject rights as defined in the GDPR cannot relate to the model itself. However, claims for access, erasure or rectification can certainly relate to the input and output of an AI system of the responsible provider or deployer.
3. The training of LLMs using personal data must comply with data protection regulations. Throughout this process, data subject rights must also be upheld. However, potential violations during the LLM training phase do not affect the lawfulness of using such a model within an AI system.
The above three theses make it clear that, according to the Hamburg DPA, no personal data is stored in LLMs. After the release of the discussion paper, several arguments for and against this position were published. Below, I'll outline the arguments on the question of whether LLMs store personal data, and also explain why this question is relevant.
Please read the following articles and posts that discuss the topic of whether LLMs store personal data (some of these articles specifically respond to the position of the Hamburg DPA):
- David Rosenthal (Vischer): Part 19: Language models with and without personal data [excellent article on the subject, which provides a very clear and detailed assessment of the arguments regarding the storage of personal data by LLMs and also deals with the discussion paper of the Hamburg DPA in a separate section] (beyond this topic, it's also worth checking other resources from David Rosenthal & Vischer about artificial intelligence),
- David Vasella: Mutige “Hamburger Thesen zum Personenbezug in Large Language Models” (16.07.2024),
- Mikolaj Barczentewicz: AI and EU privacy law: June 2024 state of play (18.07.2024),
- Flemming Moos: Personenbezug von Large Language Models (CR 2024, 442),
- It's also worth checking the respective LinkedIn posts of Dr. Markus Wünschelbaum and Frederico Marengo and the comments under such posts.
Why don't LLMs store personal data?
First, it's worth discussing why the Hamburg DPA concludes that LLMs do not store personal data. The Hamburg DPA builds its arguments on a technical evaluation of LLMs, namely on how so-called "tokenization" works. The Hamburg DPA sets out that
[...] within LLMs texts are no longer stored in their original form, or only as fragments in the form of these numerical tokens. They are further processed into “embeddings”. These embeddings capture learned correlations by positioning tokens in relation to each other, i.e. assigning them according to probability weights. This describes the core "training" of an LLM. Furthermore, this mathematical representation of the trained input is used for the inference of a prompt. The embeddings represent the learned "knowledge" of the LLM. (Point II.1, p. 3)
Tokenization, the vectorial relationships between tokens and the probabilistic nature of those relationships allow LLMs to produce meaningful and useful outputs, even though the data is not stored in the same format as in a "normal" database.
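To make tokenization and embeddings more tangible, here is a minimal, purely illustrative Python sketch. The vocabulary, the token splits and the vector values are invented for demonstration; real LLMs learn sub-word vocabularies with tens of thousands of entries and embedding vectors with thousands of dimensions.

```python
# Illustrative only: a toy "tokenizer" and "embedding table".
# The splits and numbers below are invented for demonstration.

toy_vocab = {
    "M": 17,      # token IDs are simply indices into the vocabulary
    "ia": 305,
    " Mü": 1021,
    "ller": 88,
}

# Each token ID maps to a vector of numbers (an "embedding").
# In a trained model these values encode statistical relationships
# between tokens, not facts about a specific person.
toy_embeddings = {
    17:   [0.12, -0.40, 0.07],
    305:  [0.55,  0.21, -0.33],
    1021: [-0.08, 0.44, 0.19],
    88:   [0.30, -0.11, 0.25],
}

def tokenize(text: str, vocab: dict) -> list[int]:
    """Greedily split the text into the longest known vocabulary pieces."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(vocab[piece])
                i += len(piece)
                break
        else:
            i += 1  # skip characters not covered by the toy vocabulary
    return ids

token_ids = tokenize("Mia Müller", toy_vocab)
vectors = [toy_embeddings[t] for t in token_ids]
print(token_ids)   # [17, 305, 1021, 88] -- the name is not stored as such
print(vectors)     # the name has become a sequence of numeric vectors
```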
It's important to emphasize that the training dataset may contain personal data, but, as the Hamburg DPA argues, the training process transforms such data so that it is no longer stored in the form of personal data:
[...] When training data contains personal data, it undergoes a transformation during machine learning process, converting it into abstract mathematical representations. This abstraction process results in the loss of concrete characteristics and references to specific individuals. Instead, the model captures general patterns and correlations derived from the training data as a whole. (Point II.2, p. 4, emphasis added)
The conclusion of the Hamburg DPA regarding the nature of the data stored in LLMs and the "creation" of the outputs is as follows:
[...] LLMs process these training texts in a very specific manner based on contextual relationships, which enables the generation of similar and often useful output texts. However, everything that LLMs produce is "created" in the sense that it is not a simple reproduction of something stored (such as an entry in a database or a text document), but rather something newly produced. This probabilistic generation capability fundamentally differs from conventional data storage and data retrieval. (Point II.2, p. 4, emphasis added)
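The distinction the DPA draws between retrieving a stored record and probabilistically generating text can be illustrated with a deliberately simplified Python sketch; the database entry, the probability values and the sampling step are all invented for illustration and do not reflect any real model:

```python
import random

# Conventional storage: the record is retrieved exactly as it was stored.
database = {"employee_42": "Mia Müller, born 1985, lives in Hamburg"}
print(database["employee_42"])  # deterministic lookup of a stored entry

# Probabilistic generation: the model only holds a probability
# distribution over possible next tokens and samples from it.
# The numbers below are invented for illustration.
next_token_probs = {" Müller": 0.62, " Meier": 0.21, " Schmidt": 0.17}

def sample_next_token(probs: dict) -> str:
    """Draw one token according to its probability weight."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Given the prompt "Mia", the model may often produce " Müller", but the
# output is generated anew each time rather than read back from a stored
# record -- which is the difference the DPA relies on.
print("Mia" + sample_next_token(next_token_probs))
```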
The Hamburg DPA makes it clear that, despite the lack of clear Court of Justice jurisprudence in this respect, "... an LLM does not store personal data within the meaning of article 4 (1), (2) GDPR in conjunction with Recital 26."
The referenced provisions of the GDPR read as follows:
Art. 4 (1) ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;
Art. 4 (2) ‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;
Recital (26) The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.
The DPA also summarizes some of the relevant CJEU judgments (e.g. Cases C‑434/16 [Nowak], C‑582/14 [Breyer], C‑319/22 [Gesamtverband Autoteile-Handel], etc.) concerning personal identifiers that relate to individuals (e.g. IP addresses). However, tokens and embeddings differ from such identifiers:
Unlike these identifiers addressed in CJEU case law, individual tokens as language fragments ("M", "ia", " Mü" and "ller") lack individual information content and do not function as placeholders for such. Even the embeddings, which represent relationships between these tokens, are merely mathematical representations of the trained input. [...] LLMs store highly abstracted and aggregated data points from training data and their relationships to each other, without concrete characteristics or references that “relate“ to individuals. Unlike the identifiers addressed in CJEU case law, which directly link to specific individuals, neither individual tokens nor their embeddings in LLMs contain such information about natural persons from the training dataset. Therefore, according to the standards set by CJEU jurisprudence, the question of whether personal data is stored in LLMs does not depend on the means available to a potential controller. In LLMs, the stored information already lacks the necessary direct, targeted association to individuals that characterizes personal data in CJEU jurisprudence: the information "relating" to a natural person. (Point III.1, p. 6, emphasis added)
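The DPA's point that individual tokens lack individual information content can be illustrated with a small, hypothetical Python example: the same sub-word fragments are reused across many unrelated words and names, so a token taken in isolation points to no particular person (the words and splits below are invented for demonstration):

```python
# Illustration: the same sub-word fragments recur in many unrelated
# words and names, so an individual token carries no reference to a
# specific person. The splits below are invented for demonstration.

token_usage = {
    "ller": ["Mü" + "ller", "Ke" + "ller", "Schi" + "ller", "Fü" + "ller"],
    "ia":   ["M" + "ia", "Austral" + "ia", "med" + "ia", "Soph" + "ia"],
}

for token, words in token_usage.items():
    print(f'Token "{token}" appears in: {", ".join(words)}')

# Taken in isolation, "ller" or "ia" identifies no one; only the
# combination of tokens and their learned statistical relationships
# lets the model reproduce something that reads like a name.
```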
Why do LLMs store personal data?
The Hamburg DPA's discussion paper is probably the most detailed and well-argued document published by a DPA on this subject to date. The conclusion of the Hamburg DPA was also welcomed as a very pragmatic approach that could help meet compliance requirements for LLMs in a reasonable manner. However, there are a number of publications that contradict (at least partially) the conclusion of the Hamburg DPA. It is worth considering these as well, as this is still an evolving topic.
The main arguments why LLMs store personal data include the following (see especially Rosenthal's post, which gives very clear arguments as to why the storage of personal data in LLMs should not be fully excluded):
- Even if personal data is not "memorised" by the LLM at a large scale, some (personal) data that appears frequently in the training dataset (e.g. data regarding famous persons) can be stored in the model itself.
- There are also opinions that some pieces of the training data might be memorised by the model (see the CNIL's opinion on this, as also referred to in Mikolaj Barczentewicz's post). (According to the position of the Hamburg DPA, memorization of data cannot be proved convincingly: "First of all, the mere presence of plausible personal information in LLM output is not conclusive evidence that personal data has been memorized, as LLMs are capable of generating texts that coincidentally matches training data.", see Point III.2, p. 7, emphasis added)
- Data may not be stored in an easily understandable* format in the LLM, but such data can be retrieved without complex or complicated operations: it is enough to use appropriate inputs/prompts (of course, the process of retrieving information based on inputs/prompts can be very technical and complex, but users of the AI system do not need to understand such processes). (*According to the TechSonar blog of the EDPS' Office, "LLMs store the data they learn in the form of the value of billions or trillions of parameters, rather than in a traditional database.", see Xabier Lareo: Large Language Models (LLM)) (The Hamburg DPA argues that the reproduction of information typically occurs only through targeted attacks on LLMs. See Point III.2, p. 7.)
- There are also further technical arguments, such as multi-level storage of and access to information within the model. The Hamburg DPA's paper covers tokenization and embeddings, but as Rosenthal writes, "[a]n LLM establishes relationships not only via vectors in the embedding space, but also via non-linear functions (neural networks) and functions such as the attention mechanism. In this way, information is stored across numerous levels." (A rough sketch of such an attention step follows after this list.)
- It is also disputed whether it really makes a difference that the information (personal data) is not stored in the model but can be generated by it. The argument is that the mere ability to produce the requested data (even if it is not fully clear how this happens) can lead to the conclusion that personal data is processed within the model. (It can of course be questioned whether such a processing activity amounts to storage or something else, but the main point of these arguments is that personal data processing takes place.)
- It is also argued that, based on the definition of personal data, it is not necessary for an identifier to link the information to a specific person; this is only one possibility, and other, more complex techniques might also be applied (e.g. with a suitably trained model, the prompt might itself become a means of identification, see Rosenthal's post).
(The above summary of the arguments may be a simplification; the main point, however, is that both from a legal and a technical point of view, there are strong arguments that LLMs may contain personal data.)
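To give a rough, illustrative sense of the kind of mechanism Rosenthal refers to when he says that relationships are established not only in the embedding space but also via the attention mechanism, the following sketch computes a single toy attention step over three token vectors using NumPy. The numbers are invented; a real model stacks many such layers with learned weight matrices and many attention heads.

```python
import numpy as np

# Minimal single-head attention sketch with invented numbers.
# Three token embeddings (one row per token), e.g. for "M", "ia", " Mü".
X = np.array([[0.12, -0.40, 0.07],
              [0.55,  0.21, -0.33],
              [-0.08, 0.44, 0.19]])

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))  # stand-ins for learned matrices

Q, K, V = X @ W_q, X @ W_k, X @ W_v           # project the embeddings
scores = Q @ K.T / np.sqrt(K.shape[-1])        # similarity between token positions
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = weights @ V                           # each token becomes a weighted mixture

print(weights.round(2))  # how strongly each token "attends" to the others
print(output.round(2))   # information is now spread across token positions
```

Each output row mixes information from all token positions, which is one way of seeing why information in an LLM ends up distributed across several levels of the model rather than sitting in a single retrievable record.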
Why is this topic relevant?
Applying data protection principles and requirements to the use of artificial intelligence can be challenging. It is important to understand how general requirements can be 'translated' into AI-specific requirements that ensure compliance with applicable rules, but in a way that really works in practice (also from a technical point of view). As LLMs do not always operate in an easy-to-manage manner from a data protection perspective (e.g. in relation to the exercise of data subject rights), it is important to find an understanding of the applicable requirements that can also be applied to LLMs. It is also important to distinguish between the large language model and the AI system based on the model, as this affects the applicability of requirements (see the practical implications part in the discussion paper of the Hamburg DPA, Point IV, pp. 8-10).
We can see that there are at least three distinct phases in the use of AI systems that include an LLM, each of which may be relevant from a data protection perspective:
- model training where personal data can be used for the training (of course, data protection requirements are fully applicable to such data processing),
- the storage of the LLM (the above discussions mainly relate to this part),
- using an AI system based on an LLM, where both inputs and outputs can contain personal data, and in this case personal data is processed (in arguments against the conclusion of the Hamburg DPA's discussion paper, this phase is sometimes considered together with the storage of the model, to explain how the inputs and the information stored in the model can be combined to find information within the model that might be classified as personal data).
Hopefully, the discussions and debates on the specific topics relating to the interplay between AI and data protection will continue. It would be very good to see further practical approaches from public authorities, similar to the discussion paper published by the Hamburg DPA, which provides a solid basis for further assessments and practical evaluations.