The hidden dataset problem in AI music
Artists cannot meaningfully consent, opt out or negotiate if they cannot see whether their recordings were used for training.
On this page
- Why companies keep datasets secret
- Why rightsholders need audit trails
- What disclosure could realistically look like
Introduction
One of the most contentious questions in AI music is not whether copyrighted recordings were used for training, but whether anyone outside the companies building the models can find out. Artists, labels, publishers and performers cannot meaningfully consent to, object to, or negotiate over the use of their work if they do not know whether it was included in a training dataset in the first place. The result is a transparency problem: large music-generation systems may depend on enormous collections of audio, yet the contents of those collections are often treated as confidential business information. This lack of visibility sits at the centre of many copyright disputes because it makes it difficult to verify rights, establish licences, assess infringement claims, or build compensation systems. [OUP Academic]
The debate is not simply about disclosure for its own sake. It concerns whether there can be a functioning market for AI training licences, creator consent and copyright enforcement when the underlying training material remains hidden. [UK Music]
Why companies keep datasets secret
AI developers often argue that training datasets are commercially sensitive assets. Revealing exactly which recordings, catalogues or sources were used could expose competitive advantages, enable rivals to replicate a model-development strategy, or reveal business relationships that companies regard as proprietary. These concerns have become increasingly visible in litigation.
A recent example emerged in the copyright dispute involving Udio, where the company sought to keep information about the size of its training dataset out of the public record, arguing that disclosure could cause competitive harm. The dispute illustrates how training-data information is frequently treated as a trade secret rather than a public accountability issue. [Music Business Worldwide]
The secrecy extends beyond court filings. Commercial music-generation systems have often declined to publish detailed lists of the recordings, songs or databases used to train their models. Analysts, researchers and rights organisations have repeatedly noted that some of the most prominent music AI services have not disclosed their training datasets, making independent verification difficult. [WIPO]
From the companies’ perspective, there are several reasons for caution:
- Detailed dataset disclosure may reveal acquisition strategies and model-development techniques.
- Large datasets may contain material gathered from numerous sources with complex ownership histories.
- Public disclosure could increase legal exposure by making rights claims easier to pursue.
- Maintaining secrecy can preserve bargaining power during licensing negotiations. [OUP Academic]
These arguments are not necessarily frivolous. Many technology firms genuinely regard training datasets as core intellectual property. However, the more valuable and influential AI music systems become, the harder it is to justify a situation where affected rightsholders cannot determine whether their work contributed to those systems at all. [UK Parliament]
Why rightsholders need audit trails
The consent problem in music AI is fundamentally an information problem. Consent requires knowledge. A songwriter cannot license the use of a composition, and a label cannot negotiate terms for a recording, if neither knows whether the work was used.
This issue has appeared repeatedly in disputes over AI training. When record companies sued Suno and Udio in 2024, the lawsuits focused on alleged unlicensed use of copyrighted recordings for model training. The broader controversy was intensified by the fact that the companies had not publicly disclosed the recordings on which their systems were trained. [RIAA] [Wired]
Without audit trails, rightsholders face several practical obstacles:
- Verifying use. A creator may suspect that a model was trained on their recordings but have no reliable method of proving it.
- Negotiating licences. Licensing markets depend on knowing what material is being used and by whom.
- Calculating compensation. If training contributions cannot be identified, it becomes difficult to determine who should be paid and on what basis.
- Enforcing rights. Copyright law is difficult to enforce when the underlying evidence is hidden. [UK Music]
The problem affects independent musicians particularly strongly. Major labels may have the resources to investigate potential use of their catalogues, but individual artists often lack the legal and technical means to determine whether their recordings were included in large-scale datasets. The information asymmetry favours model developers because only they possess complete knowledge of the training process. [Independent Society of Musicians]
Recent academic work suggests that technical auditing may become possible even without company cooperation. Researchers have demonstrated methods for “membership inference” against generative music models, attempting to determine whether a particular recording was likely included in training data. Such research remains experimental, but it reflects growing demand for independent verification tools when direct disclosure is unavailable. [arXiv]
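As a rough illustration of the general idea, and not the method of the cited research, the sketch below shows one common shape of a membership-inference test: calibrate a score threshold on recordings known to be absent from training, then flag candidates that score above it. The scoring signal itself (here, a hypothetical number where higher means the model reproduces an excerpt more closely), the function names and all sample values are assumptions made for this example.

```python
import math


def calibrate_threshold(non_member_scores, max_false_positive_rate=0.05):
    """Return a cut-off that at most the given share of known non-members exceed."""
    if not non_member_scores:
        raise ValueError("need at least one known non-member score")
    ranked = sorted(non_member_scores)
    # How many known non-members we tolerate being wrongly flagged.
    allowed = math.floor(max_false_positive_rate * len(ranked))
    allowed = min(allowed, len(ranked) - 1)
    # Scores strictly above this value will be flagged.
    return ranked[len(ranked) - allowed - 1]


def flag_likely_members(candidate_scores, threshold):
    """Return the names of candidate recordings scoring above the threshold."""
    return sorted(
        name for name, score in candidate_scores.items() if score > threshold
    )


# Hypothetical scores: higher means the model reproduces the excerpt more closely.
reference = [0.10, 0.20, 0.30, 0.40]  # recordings known to be absent from training
candidates = {"track_a": 0.55, "track_b": 0.25, "track_c": 0.41}

threshold = calibrate_threshold(reference, max_false_positive_rate=0.0)
print(flag_likely_members(candidates, threshold))
```

A flag from a test like this is only statistical evidence that a recording was likely seen in training. It is not proof, which is one reason such tools are better viewed as a complement to disclosure than a replacement for it.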
The hidden cost of opacity
Dataset secrecy creates risks beyond copyright litigation. It also undermines trust in AI music systems.
When creators cannot identify the source material behind a model, rumours and speculation tend to fill the gap. Some musicians assume their catalogues were used without permission. Some users assume all models are trained on stolen music. Others assume that commercial systems must already be fully licensed. In many cases, none of these assumptions can be verified because the relevant information is unavailable. [WIPO]
Opacity also complicates discussions about ethical AI. A company may claim that it respects artists’ rights, but outsiders cannot independently assess that claim if training records remain inaccessible. Conversely, a company that has invested heavily in licensing may receive little public credit if it does not disclose meaningful information about its data sources. Transparency therefore affects not only enforcement but also credibility. [GOV.UK]
The issue has become significant enough that policy debates in multiple jurisdictions now treat transparency as a separate governance question rather than merely a copyright side issue. Government consultations, parliamentary reviews and legal scholarship increasingly frame disclosure as a prerequisite for any workable system of consent and licensing. [UK Parliament] [OUP Academic]
What disclosure could realistically look like
The transparency debate is often presented as a choice between complete secrecy and publishing every file used for training. In practice, many proposed solutions fall somewhere in between.
One approach is dataset source disclosure. Rather than listing every recording individually, developers could identify the databases, catalogues, platforms or collections from which training material was obtained. This would provide a basic level of accountability while limiting the release of commercially sensitive details. [GOV.UK]
A second approach is rightsholder-access systems. Under this model, detailed training records would not necessarily be public but could be made available to verified copyright owners seeking to determine whether their works were used. This would focus disclosure on those with a legitimate legal interest. [OUP Academic]
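A minimal sketch of how such a gated lookup might behave is shown below. It assumes recordings are matched by ISRC (the standard recording identifier) and that some separate process has already verified which codes each rightsholder owns. The record fields, function names and sample codes are all hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRecord:
    isrc: str      # identifier of the recording (assumed matching key)
    source: str    # where the developer obtained the audio
    licence: str   # licence status recorded at ingestion


def normalise_isrc(isrc):
    """Strip separators and upper-case so differently formatted codes compare equal."""
    return isrc.replace("-", "").replace(" ", "").upper()


def lookup_for_rightsholder(records, verified_catalogues, rightsholder_id):
    """Return only the training records for works the requester is verified to own."""
    if rightsholder_id not in verified_catalogues:
        raise PermissionError("rightsholder is not verified")
    owned = {normalise_isrc(code) for code in verified_catalogues[rightsholder_id]}
    return [r for r in records if normalise_isrc(r.isrc) in owned]


# Made-up sample data.
records = [
    TrainingRecord("QZ-ABC-24-00001", "licensed catalogue", "licensed"),
    TrainingRecord("QZ-ABC-24-00002", "web crawl", "unknown"),
    TrainingRecord("QZ-XYZ-23-00009", "web crawl", "unknown"),
]
catalogues = {"label-a": ["qzabc2400002", "QZ-ABC-24-00777"]}

for match in lookup_for_rightsholder(records, catalogues, "label-a"):
    print(match.isrc, match.source, match.licence)
```

The requester sees only rows for their own works, so the rest of the dataset stays confidential. Matching on identifiers alone would miss audio ingested without metadata, so a real system would probably also need some form of audio matching.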
A third possibility is auditable training logs. Developers could maintain standardised records documenting when material entered a dataset, under what licence it was obtained, and whether restrictions applied. Such records would create a chain of evidence that could support both licensing and dispute resolution. Rights organisations in the music sector have argued that clear records are essential for managing agreements and monitoring unauthorised use. [UK Music]
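One way to make such a log auditable is to chain entries together with hashes, so that later edits to an earlier entry become detectable. The sketch below is an illustration under assumptions, not a description of any existing standard: the field names, the hash-chain design and the sample entries are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(fields, prev_hash):
    """Hash an entry's fields together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, **fields}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, work_id, source, licence, restrictions, ingested_at):
    """Append a record of when a work entered the dataset and on what terms."""
    fields = {
        "work_id": work_id,
        "source": source,
        "licence": licence,
        "restrictions": restrictions,
        "ingested_at": ingested_at,
    }
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({**fields, "prev": prev_hash, "hash": _digest(fields, prev_hash)})
    return log[-1]


def verify_log(log):
    """Recompute every hash in order; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        fields = {k: v for k, v in entry.items() if k not in ("prev", "hash")}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(fields, prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, "work-001", "licensed catalogue", "direct licence", "no voice cloning", "2025-01-10")
append_entry(log, "work-002", "public archive", "CC BY 4.0", "attribution required", "2025-01-11")
print(verify_log(log))
```

A chain like this only shows that the log has not been altered after the fact. It cannot show that the log was complete or truthful when written, which is why proposals of this kind are usually paired with some form of external audit.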
Emerging regulation is beginning to test these ideas. The EU AI Act includes transparency-related obligations for certain AI systems, while newer transparency laws and policy proposals in jurisdictions such as California and the United Kingdom have focused on requiring at least high-level disclosure of training data sources and copyright-relevant information. The precise scope remains contested, especially where companies argue that mandatory disclosure threatens trade secrets. [Reuters] [Ulster University] [Davis+Gilbert LLP]
The central tension
The hidden dataset problem is ultimately a conflict between two legitimate interests. AI developers want to protect commercially valuable information about how their models are built. Rightsholders want enough visibility to exercise copyright, negotiate licences and grant or withhold consent.
As music AI becomes more commercially important, the practical question is no longer whether transparency matters. It is how much transparency is necessary for creators to know when their work has been used, while still allowing companies to protect genuinely sensitive business information. The future shape of licensing, compensation and consent in music AI may depend less on the models themselves than on whether that balance can be achieved. [UK Parliament] [OUP Academic]
Further Reading
Books for deeper reading beyond this article:
- Atlas of AI: examines datasets, data extraction and power structures behind AI development.
- Weapons of Math Destruction: explains transparency, accountability and auditability problems in opaque data systems.
- The Black Box Society: directly addresses secrecy, transparency and accountability in data-driven systems.
- The Coming Wave: provides broad context for governance and disclosure issues surrounding AI models.
Endnotes
- Source: academic.oup.com
  Title: Copyright and AI training data - transparency to the rescue? (A Buick, 2025)
  Link: https://academic.oup.com/jiplp/article/20/3/182/7922541
  Snippet: AI developers to be required by l...
- Source: wipo.int
  Title: Royalties in the age of AI: paying artists for AI-generated...
  Link: https://www.wipo.int/en/web/wipo-magazine/articles/royalties
  Snippet: Commercial models such as Suno and Udio have not disclosed their train...
- Source: publications.parliament.uk
  Title: AI, copyright and the creative industries (UK Parliament)
  Link: https://publications.parliament.uk/pa/ld5901/ldselect/ldcomm/267/267.pdf
  Published: 6 Mar 2026
  Snippet: These must give creators and performers clear control over commercial...
- Source: reuters.com
  Link: https://www.reuters.com/legal/legalindustry/trade-secrets-training-data-transparency-act–pracin-2026-05-18/
  Snippet: This includes details on dataset sources, size, types, intellectual property status, commercial arrangements, personal information involv...
- Source: GOV.UK
  Title: Report on copyright and artificial intelligence
  Link: https://www.gov.uk/government/publications/report-and-impact-assessment-on-copyright-and-artificial-intelligence/report-on-copyright-and-artificial-intelligence
  Published: 18 Mar 2026
  Snippet: Some countries have introduced transparency regulations that require AI developers to disclose sources of training data, wi...
- Source: riaa.com
  Title: Record Companies Bring Landmark Cases for Responsible AI Against Suno and...
  Link: https://www.riaa.com/record-companies-bring-landmark-cases-for-responsible-ai-againstsuno-and-udio-in-boston-and-new-york-federal-courts-respectively/
  Published: 24 Jun 2024
- Source: wired.com
  Link: https://www.wired.com/story/ai-music-generators-suno-and-udio-sued-for-copyright-infringement
  Snippet: The lawsuits, seeking up to $150,000 per infringed work, were filed in Massachusetts and New York. The labels argue that the AI generator...
- Source: arxiv.org
  Title: Auditing Training Data in Generative Music Models via Black-Box Membership Inference
  Link: https://arxiv.org/abs/2605.29202
  Published: May 28, 2026
- Source: dglaw.com
  Title: California's AI Training Data Transparency Law Takes Effect (Davis+Gilbert LLP)
  Link: https://www.dglaw.com/ai-legal-updates-californias-ai-training-data-transparency-law-takes-effect/
  Published: Jan 23, 2026
  Snippet: The TDTA requires developers of generative AI...
- Source: copyright.gov
  Title: Part 3: Generative AI Training pre-publication version
  Link: https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf
  Published: May 6, 2025
  Snippet: This Part of the Copyright Office's Report on Copyright and Artificial Intelligence addresses the use of copyrighted works...
- Source: help.suno.com
  Title: CA AB 2013 Disclosure
  Link: https://help.suno.com/en/articles/9709569
  Published: 1 Jan 2026
  Snippet: Intended purpose: Suno uses the collected data to train its music generative AI models, which are inten...
- Source: help.suno.com
  Title: AB 2013 Disclosure (Text to Speech) - Knowledge Base
  Link: https://help.suno.com/en/articles/9710273
  Published: Jan 1, 2026
  Snippet: Dataset sources: Suno's text-to-speech generative AI models (e.g., Bark...
- Source: suno.com
  Title: Terms of service
  Link: https://suno.com/terms-of-service
  Published: 26 Mar 2026
  Snippet: By using the Service, you consent to our collection, use and disclosure of personal data and other data as outlined therein...
- Source: terms.law
  Title: Can You Sell Suno AI Music?
  Link: https://terms.law/ai-output-rights/suno/
  Published: June 2024
  Snippet: Commercial Rights Guide...UMG, Sony Music, and Warner Music sued Suno in June 2024 for alleged copyright infringement in training data...
- Source: ukmusic.org
  Title: Copyright and Artificial Intelligence Consultation (UK Music)
  Link: https://www.ukmusic.org/wp-content/uploads/2025/05/UK-Music-Copyright-and-Artificial-Intelligence-Consultation-Response-For-Submission.pdf
  Published: February 25, 2025
  Snippet: Clear records of training data allow creators...
- Source: musicbusinessworldwide.com
  Title: After Suno, Udio asks court to seal the size of its AI training...
  Link: https://www.musicbusinessworldwide.com/after-suno-udio-asks-court-to-seal-the-size-of-its-ai-training-data-in-sony-musics-copyright-case-also-citing-competitive-harm/
- Source: ism.org
  Title: Copyright & AI consultation: ISM submission (Independent Society of Musicians)
  Link: https://www.ism.org/news/copyright-ai-consultation-ism-submission/
  Published: February 27, 2025
  Snippet: The ISM advocates for musici...
- Source: pure.ulster.ac.uk
  Title: Copyright and AI training data - transparency to the rescue? (A Buick, 2024)
  Link: https://pure.ulster.ac.uk/ws/portalfiles/portal/217378593/jpae102.pdf
  Snippet: Generative Artific...
- Source: musicbusinessworldwide.com
  Title: Music industry backs new 'TRAIN Act' requiring...
  Link: https://www.musicbusinessworldwide.com/music-industry-backs-new-train-act-requiring-transparency-in-materials-used-to-train-ai/
  Published: Nov 26, 2024
  Snippet: A proposed new US law that would require AI developers to disclose the m...