EU Launches New AI Transparency Template

The European Union, as part of its Artificial Intelligence Act (AI Act), has introduced significant obligations for providers of general-purpose AI models. One of the key novelties is the obligation to publish a “sufficiently detailed public summary of the content used for the training of the model”, referred to here as the “Summary”. This measure aims to increase transparency and protect the rights of stakeholders.

What is the AI Act and when do these rules come into force?

Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024, known as the “AI Act”, entered into force on August 1, 2024. However, the rules concerning providers of general-purpose AI models, including the obligation to publish the Summary, will apply from August 2, 2025.

Who must prepare and publish the Summary?

The obligation applies to all providers of general-purpose AI models placed on the Union market. This includes providers of models released under free and open-source licenses, insofar as these models fall within the scope of the AI Act.

What is the objective of publishing the Summary?

The main objective of the Summary is to increase transparency on the content used for training general-purpose AI models, including text and data protected by copyright law. This transparency helps parties with legitimate interests, including rightsholders, to exercise and enforce their rights under Union law.

The Summary serves a wide range of legitimate interests:

Intellectual property rights, including copyright and related rights, by helping rightsholders obtain relevant information on the content used in training, thus facilitating the exercise of their fundamental right to intellectual property and to an effective remedy. It also contributes to ensuring compliance with Union law on copyright and related rights.
Data subjects’ rights and, more broadly, the enforcement of Union data protection rules. This can be done by summarizing relevant information, such as data scraped from the internet or collected through user interactions, without replacing other data protection information.
The interests of consumers and the protection of their consumer rights under Union law.
Assisting providers integrating these models into downstream applications to assess the diversity of the data. This, in turn, allows them to implement mitigating measures to ensure that the fundamental rights to non-discrimination and language and cultural diversity are respected.
Facilitating the fundamental right to receive and impart information and allowing researchers and academic institutions to exercise their freedom of science to conduct scientific research and critically evaluate implications and limitations of AI models.
Contributing to more transparent and competitive markets. For example, information about whether publicly available general-purpose AI models have been used to train other models (e.g., through model distillation) or if a model was trained on user data collected from the provider’s own products may help users and companies understand data usage and avoid lock-in effects.

What must the Summary contain and how detailed must it be?

The Summary must cover data used in all stages of model training, from pre-training to post-training, including model alignment and fine-tuning. This includes all sources and types of data, regardless of whether they are protected by intellectual property rights. The Summary does not need to be technically detailed, but it must be “generally comprehensive” in scope.

The Template provided by the AI Office is structured into three main sections:

General information: This section requires information allowing identification of the provider and the model, information on the modalities of the training data, the approximate size of each modality within broad ranges (e.g., for text: less than 1 billion tokens, 1 billion to 10 trillion tokens, or more than 10 trillion tokens), and the general characteristics of the training data.
List of data sources: This section requires disclosure of the main datasets used to train the model, such as large private or public databases, and a comprehensive narrative description of data scraped online by or on behalf of the provider (including a summary of the most relevant domain names scraped). It also requires a narrative description of all other data sources used (e.g., user data or synthetic data) to ensure completeness of the summary regarding the content used for training.
- Publicly available datasets: Requires naming and links to “large” datasets (those where any modality exceeds 3% of the total size of all publicly available datasets for that modality used for training) and a general description of other publicly available datasets.
- Private, non-publicly available datasets from third parties: Differentiates between commercially licensed data from rightsholders and other private datasets obtained from third parties (listing them only if publicly known, otherwise providing a general description).
- Data crawled and scraped from online sources: Describes the crawlers used, their purpose and behavior, the period of data collection, and a comprehensive description of the type of content and online sources scraped (e.g., news, blogs, social media). It requires a summary list of the most relevant domain names (top 10% by content size, or for SMEs, top 5% or 1000 domains, whichever is lower).
- User data: Information on whether data from user interactions with the AI model or the provider’s other services/products was used.
- Synthetic data: Information about AI models used to generate synthetic data for training, especially for model distillation, including general-purpose AI models placed on the market or the provider’s own models.
- Other sources of data: Description of any other sources not falling into previous categories (e.g., offline collected data, human-labeled datasets).
Relevant data processing aspects: This section requires disclosure of certain data processing aspects relevant for the exercise of rights under Union law. This includes:
- Respect of reservation of rights from the text and data mining (TDM) exception or limitation: Measures implemented to identify and comply with TDM opt-outs.
- Removal of illegal content: Measures taken to remove illegal content from training data (e.g., blacklists, filtering to mitigate risks of reproduction and dissemination of illegal content such as child sexual abuse material or terrorist content).
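
The two numeric thresholds in the list above (the 3% cut-off for “large” public datasets and the domain-name quota for scraped data) can be sketched in a few lines of Python. This is only an illustration, not an official calculation method: the function names are hypothetical, sizes are assumed to be measured in one consistent unit such as tokens, and rounding the domain count up is an assumption, since the text above does not say how fractions are handled.

```python
def is_large_public_dataset(dataset_size: int, total_public_size: int) -> bool:
    """Return True if a public dataset exceeds 3% of all public data
    used for the same modality (hypothetical helper, integer sizes)."""
    if total_public_size <= 0:
        raise ValueError("total_public_size must be positive")
    # Integer comparison avoids floating-point edge cases at exactly 3%.
    return dataset_size * 100 > 3 * total_public_size


def domains_to_list(total_domains: int, is_sme: bool) -> int:
    """Return how many top domains (ranked by scraped content size) to list.

    Non-SMEs: top 10%. SMEs: top 5% or 1,000 domains, whichever is lower.
    Rounding up is an assumption made for this sketch.
    """
    if total_domains < 0:
        raise ValueError("total_domains must not be negative")
    percent = 5 if is_sme else 10
    quota = (total_domains * percent + 99) // 100  # ceiling division
    return min(quota, 1000) if is_sme else quota
```

For example, a provider that scraped 50,000 domains would list the top 5,000, while an SME in the same position would list 1,000.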

Protection of trade secrets and confidential information

The Summary is designed to strike a careful balance between transparency and the protection of trade secrets and confidential business information. The level of detail required by the Template varies with the data source, so that sensitive information is preserved. For example, limited disclosure is required for commercially licensed data, while more detail is required for publicly available datasets. For data scraped from online sources, a summary of domain names is required, which aims to provide meaningful information in a non-technical form while still protecting trade secrets. The Summary does not require disclosure of the exact mix and composition of data sources, only high-level information about training data size per modality within broad ranges.

Simple, uniform, and effective reporting

Information should be provided in a narrative, simple, and effective form. Providers must ensure the information is accurate and comprehensive. The Template includes clear instructions for easy and uniform reporting. The AI Office may verify the correctness of the Template submission and request corrective measures. Non-compliance may result in fines of up to 3% of the provider’s annual total worldwide turnover or EUR 15,000,000, whichever is higher.
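
The “whichever is higher” rule for the maximum fine can be illustrated with a short Python sketch. The function name is hypothetical, amounts are assumed to be whole euros, and the result is only the upper limit described above, not a prediction of any actual fine.

```python
FIXED_CAP_EUR = 15_000_000   # fixed amount quoted in the text above
TURNOVER_PERCENT = 3         # share of annual total worldwide turnover


def max_fine_eur(annual_worldwide_turnover_eur: int) -> int:
    """Return the upper limit of the fine: 3% of turnover or
    EUR 15,000,000, whichever is higher (hypothetical helper)."""
    if annual_worldwide_turnover_eur < 0:
        raise ValueError("turnover must not be negative")
    turnover_based = annual_worldwide_turnover_eur * TURNOVER_PERCENT // 100
    return max(turnover_based, FIXED_CAP_EUR)
```

Under this reading, the fixed EUR 15,000,000 figure is the higher of the two for any provider with an annual turnover below EUR 500 million.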

Updating and publishing the Summary

The Summary must be made publicly available no later than when the model is placed on the Union market. It should be published on the provider’s official website in a clearly visible and accessible manner, and together with the model across all its public distribution channels.

The Summary must be updated when the provider further trains the model on new data in a way that changes the content of the Summary. Such updates should be made every six months, or sooner if the change is materially significant. If another entity modifies a model, its Summary should be limited to the training content used for that specific modification. The same Summary may be used for multiple models or versions if their content is identical.

Special rules for models placed on the market before August 2, 2025

Although the obligation enters into application on August 2, 2025, for models that were already on the market before that date, providers have a longer deadline – until August 2, 2027 – to publish the corresponding Summary. If a provider, despite their best efforts, cannot provide parts of the required information, they must clearly state and justify these information gaps in the Summary. Supervision and enforcement by the AI Office for compliance with these rules will begin on August 2, 2026.
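
The publication deadlines described in this article can be summarized in a small Python sketch. The function name is hypothetical, and the logic only encodes the dates stated here: models placed on the market before August 2, 2025 have until August 2, 2027, while later models need a Summary no later than their placement on the market.

```python
from datetime import date

APPLICATION_DATE = date(2025, 8, 2)   # obligation starts to apply
LEGACY_DEADLINE = date(2027, 8, 2)    # deadline for models already on the market


def summary_due_date(placed_on_market: date) -> date:
    """Return the latest date by which the Summary should be public
    (hypothetical helper based on the dates given in this article)."""
    if placed_on_market < APPLICATION_DATE:
        return LEGACY_DEADLINE
    # Newer models: the Summary is due when the model is placed on the market.
    return placed_on_market
```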

The Commission will monitor the implementation of the Template and this explanatory notice and will review and update them as necessary, taking into account experience gained and technological developments.
