The hidden dataset problem in AI music

Artists cannot meaningfully consent, opt out or negotiate if they cannot see whether their recordings were used for training.


Introduction

One of the most contentious questions in AI music is not whether copyrighted recordings were used for training, but whether anyone outside the companies building the models can find out. Artists, labels, publishers and performers cannot meaningfully consent to, object to, or negotiate over the use of their work if they do not know whether it was included in a training dataset in the first place. The result is a transparency problem: large music-generation systems may depend on enormous collections of audio, yet the contents of those collections are often treated as confidential business information. This lack of visibility sits at the centre of many copyright disputes because it makes it difficult to verify rights, establish licences, assess infringement claims, or build compensation systems. [OUP Academic]

The debate is not simply about disclosure for its own sake. It concerns whether there can be a functioning market for AI training licences, creator consent and copyright enforcement when the underlying training material remains hidden. [UK Music]

Why companies keep datasets secret

AI developers often argue that training datasets are commercially sensitive assets. Revealing exactly which recordings, catalogues or sources were used could expose competitive advantages, enable rivals to replicate a model-development strategy, or reveal business relationships that companies regard as proprietary. These concerns have become increasingly visible in litigation.

A recent example emerged in the copyright dispute involving Udio, where the company sought to keep information about the size of its training dataset out of the public record, arguing that disclosure could cause competitive harm. The dispute illustrates how training-data information is frequently treated as a trade secret rather than a public accountability issue. [Music Business Worldwide]

The secrecy extends beyond court filings. Commercial music-generation systems have often declined to publish detailed lists of the recordings, songs or databases used to train their models. Analysts, researchers and rights organisations have repeatedly noted that some of the most prominent music AI services have not disclosed their training datasets, making independent verification difficult. [WIPO]

From the companies’ perspective, there are several reasons for caution:

  • Detailed dataset disclosure may reveal acquisition strategies and model-development techniques.
  • Large datasets may contain material gathered from numerous sources with complex ownership histories.
  • Public disclosure could increase legal exposure by making rights claims easier to pursue.
  • Maintaining secrecy can preserve bargaining power during licensing negotiations. [OUP Academic]

These arguments are not necessarily frivolous. Many technology firms genuinely regard training datasets as core intellectual property. However, the more valuable and influential AI music systems become, the harder it is to justify a situation where affected rightsholders cannot determine whether their work contributed to those systems at all. [UK Parliament]

Why rightsholders need audit trails

The consent problem in music AI is fundamentally an information problem. Consent requires knowledge. A songwriter cannot license the use of a composition, and a label cannot negotiate terms for a recording, if neither knows whether the work was used.

This issue has appeared repeatedly in disputes over AI training. When record companies sued Suno and Udio in 2024, the lawsuits focused on alleged unlicensed use of copyrighted recordings for model training. The broader controversy was intensified by the fact that the companies had not publicly disclosed the recordings on which their systems were trained. [RIAA] [Wired]

Without audit trails, rightsholders face several practical obstacles:

Verifying use. A creator may suspect that a model was trained on their recordings but have no reliable method of proving it.

Negotiating licences. Licensing markets depend on knowing what material is being used and by whom.

Calculating compensation. If training contributions cannot be identified, it becomes difficult to determine who should be paid and on what basis.

Enforcing rights. Copyright law is difficult to enforce when the underlying evidence is hidden. [UK Music]

The problem hits independent musicians particularly hard. Major labels may have the resources to investigate potential use of their catalogues, but individual artists often lack the legal and technical means to determine whether their recordings were included in large-scale datasets. The information asymmetry favours model developers because only they possess complete knowledge of the training process. [Independent Society of Musicians]

Recent academic work suggests that technical auditing may become possible even without company cooperation. Researchers have demonstrated methods for “membership inference” against generative music models, attempting to determine whether a particular recording was likely included in training data. Such research remains experimental, but it reflects growing demand for independent verification tools when direct disclosure is unavailable. [arXiv]
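To make the idea of membership inference concrete, the sketch below shows one generic heuristic: if a model scores a recording as far more "familiar" than audio it is known not to have seen, the recording is flagged as a likely training member. This is not the method of the cited research. It is a minimal loss-threshold illustration, and the scores, track names and the `z` cut-off are all hypothetical values chosen for the example.

```python
from statistics import mean, stdev

def calibrate_threshold(nonmember_scores, z=2.0):
    """Return a score below which a recording looks unusually familiar.

    nonmember_scores: scores (e.g. a reconstruction loss) for audio that is
    known NOT to be in the training set. Lower score = better fit.
    z: how many standard deviations below the mean counts as suspicious
    (an assumed value, not an established standard).
    """
    return mean(nonmember_scores) - z * stdev(nonmember_scores)

def likely_member(score, threshold):
    """Flag a recording whose score falls below the calibrated threshold."""
    return score < threshold

# Hypothetical scores for recordings released after the model was trained.
nonmember_scores = [0.92, 0.88, 0.95, 0.90, 0.85, 0.91]
threshold = calibrate_threshold(nonmember_scores)

# Hypothetical scores for two recordings a rightsholder wants to check.
candidates = {"track_a": 0.41, "track_b": 0.89}
flags = {name: likely_member(score, threshold) for name, score in candidates.items()}
print(flags)
```

A flag of this kind is statistical evidence, not proof: a low score can also mean the recording simply resembles material the model did see.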

The hidden cost of opacity

Dataset secrecy creates risks beyond copyright litigation. It also undermines trust in AI music systems.

When creators cannot identify the source material behind a model, rumours and speculation tend to fill the gap. Some musicians assume their catalogues were used without permission. Some users assume all models are trained on stolen music. Others assume that commercial systems must already be fully licensed. In many cases, none of these assumptions can be verified because the relevant information is unavailable. [WIPO]

Opacity also complicates discussions about ethical AI. A company may claim that it respects artists’ rights, but outsiders cannot independently assess that claim if training records remain inaccessible. Conversely, a company that has invested heavily in licensing may receive little public credit if it does not disclose meaningful information about its data sources. Transparency therefore affects not only enforcement but also credibility. [GOV.UK]

The issue has become significant enough that policy debates in multiple jurisdictions now treat transparency as a separate governance question rather than merely a copyright side issue. Government consultations, parliamentary reviews and legal scholarship increasingly frame disclosure as a prerequisite for any workable system of consent and licensing. [UK Parliament] [OUP Academic]

What disclosure could realistically look like

The transparency debate is often presented as a choice between complete secrecy and publishing every file used for training. In practice, many proposed solutions fall somewhere in between.

One approach is dataset source disclosure. Rather than listing every recording individually, developers could identify the databases, catalogues, platforms or collections from which training material was obtained. This would provide a basic level of accountability while limiting the release of commercially sensitive details. [GOV.UK]
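As a rough illustration of source-level disclosure, the sketch below collapses a per-track internal manifest into a summary that names sources and totals but never individual recordings. The manifest fields, source names and track IDs are hypothetical, and no existing regulation is assumed to prescribe this exact format.

```python
from collections import defaultdict

# Hypothetical internal manifest: one record per training item.
manifest = [
    {"track_id": "trk-001", "source": "Licensed label catalogue", "licence": "licensed", "seconds": 1800},
    {"track_id": "trk-002", "source": "Licensed label catalogue", "licence": "licensed", "seconds": 1800},
    {"track_id": "trk-003", "source": "Public-domain archive", "licence": "public-domain", "seconds": 3600},
]

def summarise_sources(records):
    """Aggregate a manifest by source, dropping per-track identifiers."""
    totals = defaultdict(lambda: {"tracks": 0, "seconds": 0, "licences": set()})
    for rec in records:
        entry = totals[rec["source"]]
        entry["tracks"] += 1
        entry["seconds"] += rec["seconds"]
        entry["licences"].add(rec["licence"])
    # Convert to plain, publishable values (hours rounded, licences sorted).
    return {
        source: {
            "tracks": entry["tracks"],
            "hours": round(entry["seconds"] / 3600, 2),
            "licences": sorted(entry["licences"]),
        }
        for source, entry in totals.items()
    }

summary = summarise_sources(manifest)
for source, stats in summary.items():
    print(source, stats)
```

The published summary reveals where material came from and on what basis, while the track-level manifest stays internal.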

A second approach is rightsholder-access systems. Under this model, detailed training records would not necessarily be public but could be made available to verified copyright owners seeking to determine whether their works were used. This would focus disclosure on those with a legitimate legal interest. [OUP Academic]
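One way such a lookup might work is sketched below, assuming recordings are identified by ISRC (the standard recording identifier) and that the developer stores only hashed identifiers. The `TrainingIndex` class, the `verified` flag and the ISRC values are hypothetical; how a requester is actually verified is outside the scope of the sketch.

```python
import hashlib

def digest(isrc: str) -> str:
    """Hash a normalised ISRC so the index need not store raw identifiers."""
    normalised = isrc.replace("-", "").upper()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

class TrainingIndex:
    """Hypothetical lookup service over the identifiers in a training set."""

    def __init__(self, isrcs):
        self._digests = {digest(isrc) for isrc in isrcs}

    def check(self, claimed_isrcs, verified: bool):
        """Report, per submitted ISRC, whether it appears in the index."""
        if not verified:
            raise PermissionError("requester is not a verified rightsholder")
        return {isrc: digest(isrc) in self._digests for isrc in claimed_isrcs}

# Hypothetical training-set identifiers held by the developer.
index = TrainingIndex(["GB-ABC-24-00001", "GB-ABC-24-00002"])

# A verified rightsholder submits their own catalogue for checking.
result = index.check(["gbabc2400001", "US-XYZ-23-00099"], verified=True)
print(result)
```

Identifier matching only works where metadata survived ingestion; audio scraped without metadata would need fingerprint-based matching instead.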

A third possibility is auditable training logs. Developers could maintain standardised records documenting when material entered a dataset, under what licence it was obtained, and whether restrictions applied. Such records would create a chain of evidence that could support both licensing and dispute resolution. Rights organisations in the music sector have argued that clear records are essential for managing agreements and monitoring unauthorised use. [UK Music]
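The "chain of evidence" idea can be sketched as a hash-chained log, in which each entry includes the hash of the previous one so that later alteration is detectable. The field names and example entries below are hypothetical, and the sketch covers tamper evidence only, not who is allowed to write to the log.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def entry_hash(body: dict) -> str:
    """Hash an entry's fields in a stable (sorted-key) JSON encoding."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def append_entry(log, item_id, licence, restrictions, added_on):
    """Append a record that is linked to the previous entry's hash."""
    body = {
        "item_id": item_id,
        "licence": licence,
        "restrictions": restrictions,
        "added_on": added_on,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    log.append({**body, "hash": entry_hash(body)})

def verify(log) -> bool:
    """Recompute every hash and link; False if any entry was altered."""
    prev = GENESIS
    for entry in log:
        body = {key: value for key, value in entry.items() if key != "hash"}
        if body["prev_hash"] != prev or entry_hash(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, "trk-001", "licensed", "no voice cloning", "2025-01-10")
append_entry(log, "trk-002", "unlicensed", "none recorded", "2025-01-11")
print(verify(log))
```

If someone later edits an earlier entry, for example changing "unlicensed" to "licensed", `verify` returns `False` because the stored hashes no longer match.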

Emerging regulation is beginning to test these ideas. The EU AI Act includes transparency-related obligations for certain AI systems, while newer transparency laws and policy proposals in jurisdictions such as California and the United Kingdom have focused on requiring at least high-level disclosure of training data sources and copyright-relevant information. The precise scope remains contested, especially where companies argue that mandatory disclosure threatens trade secrets. [Reuters] [Ulster University] [Davis+Gilbert LLP]

The central tension

The hidden dataset problem is ultimately a conflict between two legitimate interests. AI developers want to protect commercially valuable information about how their models are built. Rightsholders want enough visibility to exercise copyright, negotiate licences and grant or withhold consent.

As music AI becomes more commercially important, the practical question is no longer whether transparency matters. It is how much transparency is necessary for creators to know when their work has been used, while still allowing companies to protect genuinely sensitive business information. The future shape of licensing, compensation and consent in music AI may depend less on the models themselves than on whether that balance can be achieved. [UK Parliament] [OUP Academic]


Endnotes

  1. Source: academic.oup.com
    Link: https://academic.oup.com/jiplp/article/20/3/182/7922541
    Source snippet

    OUP Academic: Copyright and AI training data - transparency to the rescue? by A Buick · 2025 · Cited by 92 - AI developers to be required by l...

  2. Source: wipo.int
    Link: https://www.wipo.int/en/web/wipo-magazine/articles/royalties
    Source snippet

    WIPO: Royalties in the age of AI: paying artists for AI-generated... Commercial models such as Suno and Udio have not disclosed their train...

  3. Source: publications.parliament.uk
    Link: https://publications.parliament.uk/pa/ld5901/ldselect/ldcomm/267/267.pdf
    Source snippet

    UK Parliament: AI, copyright and the creative industries, 6 Mar 2026 - These must give creators and performers clear control over commercial...

  4. Source: reuters.com
    Link: https://www.reuters.com/legal/legalindustry/trade-secrets-training-data-transparency-act–pracin-2026-05-18/
    Source snippet

    This includes details on dataset sources, size, types, intellectual property status, commercial arrangements, personal information involv...

  5. Source: GOV.UK
    Title: report on copyright and artificial intelligence
    Link: https://www.gov.uk/government/publications/report-and-impact-assessment-on-copyright-and-artificial-intelligence/report-on-copyright-and-artificial-intelligence
    Source snippet

    18 Mar 2026 — Some countries have introduced transparency regulations that require AI developers to disclose sources of training data, wi...

  6. Source: riaa.com
    Link: https://www.riaa.com/record-companies-bring-landmark-cases-for-responsible-ai-againstsuno-and-udio-in-boston-and-new-york-federal-courts-respectively/
    Source snippet

    RIAA: Record Companies Bring Landmark Cases for... 24 Jun 2024 - Record Companies Bring Landmark Cases for Responsible AI Against Suno and...

  7. Source: wired.com
    Link: https://www.wired.com/story/ai-music-generators-suno-and-udio-sued-for-copyright-infringement
    Source snippet

    The lawsuits, seeking up to $150,000 per infringed work, were filed in Massachusetts and New York. The labels argue that the AI generator...

  8. Source: arxiv.org
    Link: https://arxiv.org/abs/2605.29202
    Source snippet

    arXiv: Auditing Training Data in Generative Music Models via Black-Box Membership Inference, May 28, 2026...

    Published: May 28, 2026

  9. Source: dglaw.com
    Title: ai legal updates californias ai training data transparency law takes effect
    Link: https://www.dglaw.com/ai-legal-updates-californias-ai-training-data-transparency-law-takes-effect/
    Source snippet

    Davis+Gilbert LLP: California's AI Training Data Transparency Law Takes Effect, Jan 23, 2026 - The TDTA requires developers of generative AI...

  10. Source: copyright.gov
    Title: Part 3: Generative AI Training pre-publication version
    Link: https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf
    Source snippet

    May 6, 2025 — This Part of the Copyright Office's Report on Copyright and Artificial Intelligence addresses the use of copyrighted works...

    Published: May 6, 2025

  11. Source: help.suno.com
    Link: https://help.suno.com/en/articles/9709569
    Source snippet

    CA AB 2013 Disclosure, 1 Jan 2026 - Intended purpose: Suno uses the collected data to train its music generative AI models, which are inten...

  12. Source: help.suno.com
    Link: https://help.suno.com/en/articles/9710273
    Source snippet

    AB 2013 Disclosure (Text to Speech) - Knowledge Base, Jan 1, 2026 - Dataset sources: Suno's text-to-speech generative AI models (e.g., Bark...

  13. Source: suno.com
    Title: terms of service
    Link: https://suno.com/terms-of-service
    Source snippet

    26 Mar 2026 — By using the Service, you consent to our collection, use and disclosure of personal data and other data as outlined therein...

  14. Source: terms.law
    Title: Can You Sell Suno AI Music?
    Link: https://terms.law/ai-output-rights/suno/
    Source snippet

    Commercial Rights Guide... UMG, Sony Music, and Warner Music sued Suno in June 2024 for alleged copyright infringement in training data...

    Published: June 2024

  15. Source: ukmusic.org
    Link: https://www.ukmusic.org/wp-content/uploads/2025/05/UK-Music-Copyright-and-Artificial-Intelligence-Consultation-Response-For-Submission.pdf
    Source snippet

    UK Music: Copyright and Artificial Intelligence Consultation, February 25, 2025 - 30 May 2025 - Clear records of training data allow creators...

    Published: February 25, 2025

  16. Source: musicbusinessworldwide.com
    Link: https://www.musicbusinessworldwide.com/after-suno-udio-asks-court-to-seal-the-size-of-its-ai-training-data-in-sony-musics-copyright-case-also-citing-competitive-harm/
    Source snippet

    Music Business Worldwide: After Suno, Udio asks court to seal the size of its AI training... 1 day ago - After Suno, Udio asks court to sea...

  17. Source: ism.org
    Title: Independent Society of Musicians Copyright & AI consultation: ISM submission
    Link: https://www.ism.org/news/copyright-ai-consultation-ism-submission/
    Source snippet

    Independent Society of Musicians: Copyright & AI consultation: ISM submission, 27 Feb 2025 - The ISM advocates for musici...

    Published: February 27, 2025

  18. Source: pure.ulster.ac.uk
    Title: Ulster University Copyright and AI training data
    Link: https://pure.ulster.ac.uk/ws/portalfiles/portal/217378593/jpae102.pdf
    Source snippet

    Ulster University: Copyright and AI training data - transparency to the rescue? Today - by A Buick · 2024 · Cited by 94 - Generative Artific...

  19. Source: musicbusinessworldwide.com
    Link: https://www.musicbusinessworldwide.com/music-industry-backs-new-train-act-requiring-transparency-in-materials-used-to-train-ai/
    Source snippet

    Music industry backs new 'TRAIN Act' requiring... Nov 26, 2024 - A proposed new US law that would require AI developers to disclose the m...

