AI and copyright: the legal landscape

As an expert in web scraping and proxies, I’m often asked by clients about the complex copyright issues surrounding AI-generated content. Can an AI system hold a copyright? Can AI models freely use copyrighted works for training data? I’ll explore these pressing questions and provide guidance from my uniquely informed perspective.

The million-dollar question: can AI systems hold copyright?

The straightforward legal answer is no. Copyright law in most countries requires human authorship, so an AI system generating content autonomously cannot hold copyright in its own creations. This was confirmed in the recent US case of Thaler v Perlmutter. Here, Stephen Thaler sought to register a work that named an AI system he created as its author. The courts held that human authorship is a basic requirement of copyright under current US law.

However, an interesting follow-up question arises – what if an AI just had a little help from a human? How much human involvement is required for copyright to apply? This issue becomes very topical as AI models like DALL-E 2 produce eerily lifelike images and art simply from short text prompts.

Generally, the more selective creative judgment a human applies, the more likely copyright can subsist in the output. For example, say an artist carefully curates a collection of AI-generated images around a theme. The personalized selection and arrangement decisions likely make it a human-authored work eligible for copyright.

But on the other end of the spectrum, if a human merely types “a painting of a fox” to instantly produce an AI image, that likely doesn’t meet the originality threshold for copyright under current law.

This notion was supported by a 2023 US Copyright Office decision on a comic book called Zarya of the Dawn. Here, the author, Kris Kashtanova, wrote the text and arranged images generated with Midjourney from text prompts. The Office allowed registration of the text and of the selection and arrangement of the comic’s elements, but excluded the AI-generated images themselves. The decision has been influential: works made with AI can attract copyright where there is sufficient human authorship, but prompting alone was not enough to protect the images.

So in summary: while an AI can’t hold copyright alone, applying personal creative choices to select, arrange or modify AI output can meet the threshold for human authorship. But drawing that line will only get more complex as AI capabilities rapidly advance.

Using copyrighted data to train AI systems

The other huge legal dilemma around AI is whether copyrighted works can be copied and used without permission to train commercial models. This issue sits at the heart of various high-profile lawsuits in the US right now.

On one side, tech firms argue that scraping copyrighted data from the internet to train AI should qualify as fair use. They contend that their models only extract general statistical patterns rather than reproducing the protected expression itself. Plus, they say licensing all training data is wholly impractical: DALL-E 2 alone was reportedly trained on roughly 650 million image-text pairs.

On the other side, content creators feel cheated as their hard work contributes to profitable AI systems without compensation or consent. The legal basis is debatable too. Reproducing millions of copyrighted works in their entirety to commercialize derived AI models is hardly what the fair use doctrine was intended to permit.

This battleground is playing out in cases like Getty Images suing Stability AI over alleged unauthorized copying of millions of photos to train the Stable Diffusion tool. Lawsuits have also been filed against GitHub, Microsoft and OpenAI over Copilot’s code suggestions, and against Google over general web scraping to train AI models. These cases will be hugely influential in interpreting how copyright law extends to AI training data.

Personally, I expect to see some practical compromise emerge. An outright ban on scraping any copyrighted online data for AI training risks stifling useful innovation. But leaving creators without any control also feels ethically questionable.

A licensing regime would be administratively complex, although big creative sectors like music have figured this out through collective licensing. Opt-out mechanisms giving creators a say over use of their data also hold promise as a middle ground if adopted widely. But this relies heavily on establishing technical protocols that crawlers actually respect.
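One opt-out channel that already exists is robots.txt, where a site can disallow specific crawler user agents. The Python sketch below shows how a scraper might check that file before fetching, using the standard library’s urllib.robotparser. The robots.txt content, the example URL and the "MyResearchBot" agent name are hypothetical, and "GPTBot" is used only as an example of an AI crawler user agent a site owner might block. Honoring robots.txt is a voluntary convention, so treat this as an illustration rather than a compliance guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site opts out of one AI crawler
# while still allowing every other user agent.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/gallery/fox.png"  # hypothetical URL
    for agent in ("GPTBot", "MyResearchBot"):
        print(agent, may_fetch(EXAMPLE_ROBOTS_TXT, agent, url))
```

In a real pipeline the robots.txt text would be fetched from the target site and cached per host, and the check would run before each request is queued.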

Ultimately, the law will provide more certainty down the track. But for now, the legal risks around AI training data remain ambiguous for technology companies, despite the compelling incentives to forge ahead.

Quantifying the legal risks around AI and copyright

As an expert in facilitating data harvesting from the web, I’m often asked by clients to quantify the tangible legal risks around using copyrighted data to build AI systems. But in truth, it’s still very hard to put a dollar figure on the risks given all the uncertainties.

Nonetheless, I can provide guidance on factors that likely increase legal exposure:

Wholesale copying of vast volumes of copyrighted work without filtering.
Making high commercial profits directly derived from the models.
Targeting data from sources known to actively enforce copyright claims.
Processing the data in ways that don’t sufficiently transform the copyrighted elements.
Publicly releasing datasets or models that expose infringement.

Mitigating factors that reduce risks include:

Following best practices like minimizing data use, de-identifying sources, securing consent where feasible, and not redistributing source data.
Avoiding targeting data sources that are highly unique or creative.
Having a takedown process if rightsholders object to specific uses.
Ensuring humans extensively transform any source data before commercial use.
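Two of these mitigations, avoiding certain sources and honoring takedown requests, are straightforward to build into a data pipeline. The following Python sketch filters scraped records against a takedown list. Everything in it is hypothetical: the Record type, the function names and the example domains are my own, and a production system would also need to purge already-stored copies and log each request.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Record:
    """A scraped item: where it came from and what was collected."""
    url: str
    text: str


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def is_blocked(host: str, blocked_domains: set[str]) -> bool:
    """True if the host or any of its parent domains is on the blocklist."""
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocked_domains for i in range(len(parts)))


def apply_takedowns(
    records: list[Record],
    blocked_domains: set[str],
    blocked_urls: set[str],
) -> list[Record]:
    """Drop records whose exact URL or source domain has been taken down."""
    return [
        r
        for r in records
        if r.url not in blocked_urls
        and not is_blocked(host_of(r.url), blocked_domains)
    ]


if __name__ == "__main__":
    records = [
        Record("https://img.example.org/a.png", "fox painting"),
        Record("https://example.net/post/1", "blog post"),
        Record("https://example.net/post/2", "another post"),
    ]
    kept = apply_takedowns(
        records,
        blocked_domains={"example.org"},
        blocked_urls={"https://example.net/post/2"},
    )
    print([r.url for r in kept])
```

Matching on parent domains means a single takedown entry for a rightsholder’s domain also covers its subdomains, which keeps the list short as requests arrive.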

Relying on fair use arguments also brings risks. Fair use is uncertain at the best of times. US courts weigh four statutory factors case by case: the purpose and character of the use, the nature of the copied work, the amount and substantiality of what was copied, and the effect on the market for the original work. Fair use risks inflate further when copyrighted data is used to build profitable AI, which is hardly the strongest case that comes to mind.

Of course, I can’t advise on an acceptable level of legal risk, as this depends on organizations’ appetites. But I hope these pointers help frame the issues for clients to weigh up. Fundamentally the law is playing catch-up, so risks exist predominantly in untested grey areas for now.

Emerging responses to AI copyright concerns

How are regulators responding to this complex interplay between AI systems and copyright law? Progress is slow, but momentum is building to address the issues more meaningfully.

The US Copyright Office launched a major initiative in early 2023 focused specifically on AI and copyright policy, followed by a formal notice of inquiry later that year. It promises to provide “carefully considered guidance” on AI issues, which will be influential, although not legally binding.

Across the pond, the UK’s Intellectual Property Office is also kicking off consultations in 2024 around updating copyright rules for the age of artificial intelligence. I expect we’ll see similar explorations in the EU and other countries soon too.

But of course, the more defining outcome will be legal precedents set through test cases currently before the courts. Alongside the AI copyright lawsuits I mentioned earlier targeting GitHub, Google and Stability AI, I monitor several more in early stages, such as legal action against Microsoft over rights in AI-generated content.

It remains very unclear how these cases will play out. But they represent crucial testing grounds for interpreting copyright law on issues like fair use protections and the consistency of legal frameworks across jurisdictions. I’ll be following closely for any guidance useful for advising proxy clients.

Lawmakers are also floating more radical ideas like denying copyright protection to works unless authors opt-in explicitly. However, fully revamping statutory rules will require international coordination and long lead times.

More immediately, some private sector initiatives are emerging to support responsible AI development. For instance, DeviantArt’s “noai” and “noimageai” meta tag directives let creators flag their art as opted out of AI model training. And the Adobe-led Content Authenticity Initiative promotes attaching tamper-evident provenance metadata to media through the creative process for improved attribution.
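To make the opt-out tagging idea concrete, here is a minimal Python sketch of how a scraper might check a page for such a tag before collecting it. It assumes the opt-out is expressed as a robots meta tag containing a noai or noimageai directive, the convention DeviantArt introduced. The class and function names are my own, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser

# Directives treated as an AI-training opt-out (assumed convention).
AI_OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def opts_out_of_ai_training(html: str) -> bool:
    """Return True if the page carries a noai or noimageai robots directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & AI_OPT_OUT_DIRECTIVES)


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
    print(opts_out_of_ai_training(page))
```

A scraper could run this check on each fetched page and skip or discard anything that opts out. Some sites send the same directives in an X-Robots-Tag response header instead, which this sketch does not inspect.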

Such technical measures may prove valuable interim compromises while regulation plays catch-up. However, truly resolving the tensions around copyright and AI requires rethinking legal frameworks for the modern paradigm of autonomous content generation.

The future of AI and copyright – where do we go from here?

Looking ahead, what is the future of copyright in an AI-driven world? I predict a gradual legal evolution rather than radical revolution.

For the foreseeable future, we will continue seeing high-profile test cases that inch towards firmer precedent, but inconsistencies across courts and jurisdictions will persist for years. Eventually, certain “safe harbor” provisions that reasonably enable AI training without punishing technical infringement may be codified. But this will require careful balancing of interests.

Attribution and consent mechanisms for copyrighted source material are also likely to mature through both voluntary industry adoption and eventual regulation. There are still big open questions around whether consent frameworks could meaningfully scale, how rights might expire over time, and if viable technological solutions will emerge.

Most importantly, future copyright law needs to promote human creativity and expression while responsibly enabling AI innovation. But policymaking here demands deep legal and technical expertise. We must be wary of reactionary regulations stifling progress in generative AI, which promises immense societal benefits.

Of course, risks remain until the law catches up. But by assessing exposures and staying abreast of legal developments, we can strategically guide responsible AI system development. There are always unknowns when charting new technological frontiers like AI. For now, I advise proxy clients to proactively minimize risks where possible until firmer legal contours emerge.

The next few years promise dramatic evolution in how copyright accommodates transformative AI systems. While uncertainties persist, maintaining optimism around human ingenuity provides the most constructive path ahead. After all, we have adapted copyright frameworks to changing technologies throughout history. And with insight from legal and technical experts, I believe we can chart a prudent course once again.
