AI’s Next Growth Challenge: Transparency and Ethically Sourced Training Data

Language models are more than just text generators. They represent the deeper structures of human knowledge and culture. This is why ethically sourced training data has never been more important for AI companies.

Back in 1963, Noam Chomsky made a point that still resonates today:

“A computer program that succeeded in generating sentences of a language would be, in itself, of no scientific interest unless it also shed some light on the kinds of structural features that distinguish languages from arbitrary, recursively enumerable sets.”

In simpler terms, it’s not groundbreaking for AI to sound human. The real value lies in the structures that shape language, which a model learns from its training data. If that data is stolen or biased, the model is untrustworthy.

In 2025, as lawsuits from media giants like Disney challenge the legality of scraped content, the question is no longer if AI companies will change how they source data, but how. This post breaks down why ethically sourced training data may define the next phase of AI growth, and what it means for developers, content creators, and businesses.

TL;DR

  • Lawsuits like Disney’s indicate a maturing market, one driven toward AI training rooted in licensing, consent, and compensation.
  • “Ethically trained” models could soon become a competitive advantage for brands looking to invest in further AI adoption.
  • Platforms that connect brands with licensed, high-quality content (like nDash) point to an already growing appetite for original, ethically sourced content.

From Chomsky to Chatbots: Why Training Data Provenance Matters

Noam Chomsky’s quote aged well, even if he wasn’t focused on artificial intelligence. His argument that a language machine is only meaningful if it reflects the deep structures of human language is highly relevant.

Beyond Words: Structural Knowledge in AI

It’s not impressive for a computer to generate sentences; that has been possible for years. What’s new is that AI tools like Claude and ChatGPT now approximate human-like reasoning, learning behaviors, logic, and tone.

If a language model is going to be useful and meaningful, the way it generates language needs to reflect those deeper logical, cultural, and semantic structural patterns. These patterns come directly from the data it’s trained on. That’s why data provenance matters.

The Value of Data Provenance

Data provenance and AI training ethics go far beyond legal compliance. Unethical sourcing has ripple effects across creator consent, fairness, brand trust, and model accuracy.

Rachel Brooks captured this tension in her 2024 blog post on the NaNoWriMo controversy, where she critiques the rise of AI-generated writing prompts: “Without human intervention, AI-assisted prompts become oversimplified or generically automated in the interest of time and savings…Generic automation reduces written content to informational nutrition. One could compare this to stripping the value of content from a three-course meal to saturated lettuce.”

When AI is trained on stolen, biased, or low-quality data, those same flaws live within its outputs. The deeper structures become warped. But when the training material is licensed, diverse, and consent-based, the result is more reliable and human-forward.

👉🏻 Key takeaway: The real value of AI lies beneath the surface. To build better outputs, we need to start with better inputs and ethical, transparent AI models.

The Market Matures: Disney’s Lawsuit and the End of Scraping

In June 2025, Disney filed a landmark lawsuit against Midjourney, accusing the AI image-generation giant of using copyrighted material without authorization to train its models.

TIME Magazine notes that the lawsuit “…challenges one of the AI industry’s fundamental assumptions: that it should be allowed to train upon copyrighted materials under the principle of fair use.”

The “Wild West” of AI Data Collection

As on many new frontiers, a lack of regulation has allowed questionable practices to run rampant.

“Fair use” has turned into a “scrape now, apologize later” defense for many AI companies. And it’s not just Disney. A study published on Cornell University’s arXiv showed that newer LLMs, such as GPT-4o, display strong recognition of paywalled, copyrighted books from O’Reilly Media.

This indicates the model was trained on non-public, copyrighted content without consent.

Now, as generative models start to produce outputs that mirror recognizable characters, written works, and visuals, rights holders are taking notice.

What Disney vs. AI Companies Represents for the Industry

The lawsuits by Disney and other major production companies reflect growing pressure on AI companies to answer basic questions:

  • Where did your data come from?
  • Did the original creators consent?
  • Were they compensated?

Ed Newton-Rex, the CEO of nonprofit organization Fairly Trained, told TIME: “I really think the only thing that can stop AI companies doing what they’re doing is the law. If these lawsuits are successful, that is what will hopefully stop AI companies from exploiting people’s life’s work.”

👉🏻 Key takeaway: These lawsuits herald the end of the murky “ask forgiveness later” training methods. Transparent AI models and licensing are an absolute must for the next generation of AI.

Lessons from Other Industries: Ethical Sourcing as a Differentiator

What do your engagement ring and your favorite cup of coffee have in common? Ethical concerns.

In the early aughts, the coffee and diamond industries came under intense scrutiny over where they sourced their materials. What was once merely a question of “is the product good?” evolved into “was it made ethically and sourced responsibly?”

How Coffee, Fashion, and Diamonds Reframed Ethics

Coffee, jewelry, and fashion brands changed their business models and positioning to differentiate themselves from competitors that didn’t adapt.

Coffee brands began tracing beans back to fair-trade farms, even switching to more ethical suppliers. Jewelry brands promoted conflict-free sourcing and turned it into a core value proposition.

Additionally, the anti-fast-fashion movement of recent years has had an impact on the clothing we wear.

Brands like SHEIN and Temu position themselves as low-cost and easily accessible, while brands like Patagonia and Reformation position themselves as leaders in environmental activism. Socially, opinion is divided between those who buy the cheapest and fastest and those who choose the well-made and long-lasting.

Applying Ethical Sourcing Principles to AI Training Data

So what does “ethical sourcing” actually look like for AI?

It starts with treating training data like any other supply chain: one that is traceable and fairly exchanged. That means AI developers must begin building partnerships with content creators, platforms, and publishers who can offer:

  • Licensed material with clear usage rights
  • Clear creator consent that respects intellectual property
  • Compensation structures that reward those whose work trains the models

👉🏻 Key takeaway: AI training is following a path familiar from our favorite coffee brands, with a rising consumer appetite for transparency. Ethical sourcing starts as damage control but can end up being a brand advantage for companies that use only licensed AI content.

Defining Ethical Data in Practice

It’s easy to discuss “ethically sourced training data,” but what does that look like in real life? The answer is simple: consent, proper licensing (not just “fair use”), compensation, and accountability.

Brands that want to develop more responsible AI tools must clearly document their data practices. Transparency can’t be a vague aspiration; it needs to be a set of operational decisions.

Licensed Content and Creator Consent

  • Only use content provided intentionally for training purposes
  • Consent must be documented in black and white

Compensation Structures for Ethical AI Training

  • Writers and other experts should be compensated for contributing to AI training.
  • Content creators should be treated as contributors to the value chain, not as free inputs.

Platforms that Enable Ethical Data Sharing

  • Content marketing platforms like nDash offer licensed, high-quality, original content from real humans.
  • This kind of platform offers author transparency, usage rights, and brand-safe content.

Platforms like nDash could form the foundation for ethically sourced training data. The framework is already there.
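To make this concrete, the pieces above (licensed content, documented consent, compensation) can be sketched as a minimal provenance record. This is a hypothetical illustration, not an existing industry standard; every field and function name below is an assumption for the sake of the example.

```python
from dataclasses import dataclass

# Hypothetical schema -- field names are illustrative, not an industry standard.
@dataclass
class TrainingDataRecord:
    source: str               # where the content came from
    creator: str              # the original author
    license_id: str           # a negotiated license reference, not just "fair use"
    consent_documented: bool  # consent recorded in black and white
    compensation_usd: float   # what the creator was paid for this use

def is_ethically_sourced(record: TrainingDataRecord) -> bool:
    """A record passes only if licensing, consent, and compensation all check out."""
    return (
        record.license_id != ""
        and record.consent_documented
        and record.compensation_usd > 0
    )

# A licensed article with documented consent and payment passes the check.
article = TrainingDataRecord(
    source="ndash.com/example-post",
    creator="Jane Writer",
    license_id="LIC-2025-0042",
    consent_documented=True,
    compensation_usd=150.0,
)
print(is_ethically_sourced(article))  # prints True
```

The point of a sketch like this is that each question regulators and rights holders are asking (Where did the data come from? Did creators consent? Were they paid?) becomes a concrete field that can be audited, rather than an after-the-fact legal argument.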

👉🏻 Key takeaway: Ethical data isn’t theoretical. It’s a real framework, and the pieces already exist.

The Road Ahead: Competing on Performance vs. Provenance

Thus far, AI development has favored speed, scale, and performance. But the next chapter might look different. These lawsuits, along with creator feedback, will undoubtedly pave the way for tighter regulations.

Last year, we unpacked a McKinsey survey on AI usage trends. One data point showed businesses allocating a larger share of their digital marketing budgets to AI tools and initiatives. We predict that this growing investment will go hand in hand with more scrutiny around how these tools are trained.

Provenance, consent, and clarity will become a defining selling feature, just like fair trade coffee beans did.

Transparency as a Market Advantage

For brands, transparency in AI sourcing is about to become a priority. Choosing ethically trained models may soon be just as important as selecting carbon-neutral shipping partners or accessible design vendors. When you know where your tools are getting their training data, you can:

  • Protect your intellectual property
  • Make smarter bets on AI vendors
  • Avoid legal or ethical issues down the line
  • Align your tech stack with DEI and sustainability goals

Could “Ethically Trained” Become the Standard?

History is repeating itself. Think about how sourcing claims became essential brand messaging:

  • “Organic” became a label
  • “Fair trade” became a selling point
  • “Green” became part of brand identity

“Ethically trained AI” could follow the same path: a signal of quality and values that is easy for customers to spot. This could be a particularly large boon for enterprise brands and the public sector.

👉🏻 Key takeaway: Performance will always matter, but in a saturated AI market, provenance may be the real differentiator.

Why Ethically Sourced Training Data Matters for AI’s Future

Generative AI is not going anywhere. As workflows across marketing, product, HR, and operations become more reliant on AI support, the pressure for transparent AI models will only grow.

Features are important, but brands now need to look to their values when purchasing AI tools. Are they ethically trained? Inclusive? Traceable? A new standard is taking hold.

We already care about how our products are made and what’s in our food and drink. AI should be no different.


About the Author

Katherine Major

Katie Major is a versatile marketing professional with a passion for content creation and strategic storytelling. She’s the Founder at Major Marketing, where her clients range from home services to wedding venues. To learn more about Katie — and to have her write for your brand — be sure to check out her nDash profile page.