This article, which is about the New York Times v Microsoft and OpenAI litigation, and also about the US Supreme Court’s inability to handle copyright issues intelligently, appeared in the May 2024 edition of the European Intellectual Property Review.
Footnotes are at the end.
HERE COMES CONFUSION.
New York Times v Microsoft and OpenAI.
Oh irony of ironies!
In 1976, Bill Gates, then General Manager of Micro-Soft (sic), wrote his famous An Open Letter to Hobbyists. In that letter, he inveighed against people who were using, without paying, the version of BASIC that he (and others) had developed.
Why is this? As the majority.… must be aware, most of you steal your software. Hardware must be paid for, but software is something to share. Who cares if the people who worked on it get paid?
Is this fair? …. One thing you do do is prevent good software from being written. Who can afford to do professional work for nothing? What hobbyist can put 3-man years into programming, finding all bugs, documenting his product and distribute for free? The fact is, no one besides us has invested a lot of money in … software. … Most directly, the thing you do is theft.
On 27th December 2023, 47 years later, the New York Times filed a complaint against OpenAI and Microsoft (Microsoft having invested substantially in OpenAI’s LLM, having provided OpenAI with its computing infrastructure, and having deployed OpenAI as part of Bing) that makes exactly the same point. OpenAI and Microsoft are using the NYT’s hard work - ingesting 200 years’ worth of NYT publishing - without paying for it.
The NYT’s complaint is made up of the following main claims. In copyright:
· As part of training its LLM, OpenAI copied and ingested several million NYT articles. Moreover, because OpenAI wanted to train the LLM on high-quality content, the NYT material was disproportionately represented.
· OpenAI’s LLM has memorized many NYT articles. The NYT was able to prompt the LLM to reproduce NYT articles verbatim for sequences of approximately 200 words at a time.
· Similarly, through a series of prompts, the NYT was able to get the LLM to reproduce verbatim the first paragraph of an NYT article, then the second paragraph, then the third, and so on.
· The reproduction of NYT articles is not limited to the training of the LLM. The NYT’s prompts got the LLM to reproduce verbatim parts of NYT articles which had been published after the cut-off date for the LLM’s training. In other words, the LLM continued to access and use current NYT material on an ad hoc basis.
In unfair competition:
· The NYT has a magazine called Wirecutter which, amongst other things, produces recommended lists of products (e.g. top 10 office chairs, top 10 washing machines) and includes clickable links which, in turn, generate revenue for the NYT. OpenAI’s LLM, when prompted, reproduced the Wirecutter lists (sometimes inaccurately) but without the links, and thereby deprived the NYT of revenue.
The NYT’s complaint does not claim a specific amount in damages. Instead, it seeks to hold Microsoft and OpenAI responsible for billions of dollars in statutory and actual damages.
OpenAI’s public response to date has been a post on its website stating that a) regurgitation of content is a bug which it is working to fix, and b) using copyright material for training purposes is fair use.
Taking a step back
It’s commonplace to remark that existing concepts of intellectual property (particularly copyright and related rights) are creaking under the strain of new technology, but this particular dispute (if not settled) is going to take the problem to a whole new level.[i]
Taking a step back, here’s how I think we should be looking at this.
Firstly, it’s important to distinguish between rules, and the reasons behind the rules. Usually we can ignore the reason behind the rules because the rule is the practical application of its underlying reason. But in difficult situations, particularly when new factors (such as new technology) come into play, the underlying reasons become more important. And sometimes the reasoning is more important than the rules themselves.
Applying that approach: what is the reason for having property? What are we trying to achieve by having property? One of the primary functions of any form of property is to prevent others taking a free ride on what one has created or invested in. In other words – and this applies both to intellectual and to physical property - property is there to secure and protect revenue streams.
Does it make any difference if the revenue stream is new and has never existed before (for example, charging a fee to allow text and data mining)? No, it does not. We do not protect rightsholders if they are unlucky enough to have their works fall out of fashion, nor should we penalise them if they are lucky enough to find that their works become capable of generating new revenue streams.[ii]
If a rightsholder is lucky enough to discover that their works can now generate a new revenue stream, what role should the law play? With very few exceptions (anti-trust being one of them) the law should play no role at all. Let the buyer and seller agree, or not, on price. It’s a free market after all.
It is notable that OpenAI has managed to negotiate a number of deals with publishers, and was in negotiations with the NYT over access to the NYT’s materials. The NYT and OpenAI could not reach a commercial agreement, which is why OpenAI is having to rely on fair use to justify its actions.
But existing laws on fair use (and their national equivalents) were designed for uses at the level of a cottage industry. They were not designed for uses at an industrial scale. OpenAI’s LLM will have ingested several million NYT articles. According to Microsoft (which provided OpenAI’s cloud infrastructure), OpenAI’s computing operated as “a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server.” According to the NYT, this made it one of the top five most powerful publicly known supercomputing systems in the world.
The hope of many commentators is that this case will make its way to the US Supreme Court, and there receive a definitive ruling which will provide a solid foundation for copyright in the future. This is unlikely to happen. Not just because the chances are that OpenAI and the NYT will settle, but because the US Supreme Court (together with other supreme courts) has shown itself incapable of reasoning coherently in relation to copyright, most notably in Feist[iii] and more recently in Google v. Oracle[iv] (Clarence Thomas’ excoriating minority judgement serving to highlight the poor reasoning of the majority).
The problem is not the copyright rules themselves (though they are a problem) but the incoherence of the underlying reasons. The origins of copyright lie in the granting of privileges to the well-connected few. That then evolved into copyright as a reward for the creators of a meritorious work. What copyright reasoning has - to date - failed to do is a) recognise that intellectual property is no longer an exception in our societies, but commonplace, and b) fully assimilate the primary function of copyright, which is to allow the work in question to become property and so become easily tradeable (and protected) on the market.
If the reasoning that underlies copyright is flawed, then there is little hope that the US Supreme Court can give a sensible answer on a difficult case of fair use.
Here are a few examples of the flawed reasoning.
Copyright is a monopoly right (and therefore needs to be restricted). Except that there is no correlation between an exclusive right and a monopoly. The latter is an assessment of market dominance: an exclusive right has, of itself, no relationship to dominance in a market.[v]
Copyright and anti-trust. A company that had market dominance as a result of its ownership rights in physical property (for example, ownership of a port, or aircraft landing rights), and was abusing its dominance by refusing competitors access to its physical property, would typically find (as a result of the law on anti-trust) that it had to give its competitors access at a fair price. Yes, its use of its property rights is trumped by the application of anti-trust, but it would still be able to derive revenue from its property. However, where copyright is seen as giving an unfair advantage then, rather than applying an anti-trust approach and allowing the competitor to get access in return for paying a fair price, copyright itself is modified (by a finding of fair use) so as to give free access. Google v. Oracle provides a good example of this.[vi]
Copyright as a meritorious work. Creativity is treated as a qualifying condition for copyright, on the basis that only creative works merit being copyright. This approach allows the law to usurp the function of the market. It is the function of the law to determine what is and is not property: it is the function of the market to determine whether or not a property has value. Using the law to pre-empt the judgment of the market makes no sense.
Excluding compilations of facts from copyright on ideological grounds. Compilations of facts are intellectual property just as much as other forms of copyright. Creating a new category of copyright and giving it a different name (as happened in the EU with the Database Directive), solely because of an ideology that is incoherent in the first place, is not a smart way to go.
Propertisation on the basis of investment (investment being a proxy for merit). Tying propertisation to levels of investment again mistakes the function of property. The EU’s Database Directive provided that only those databases that were created using substantial investment benefited from the database right. The primary consequence of this approach is that the more efficient the producer, the smaller the investment, and therefore the less likely she is to get the benefit of the database right. Propertisation on the basis of inefficiency of production is not optimal.
Propertisation on the basis of the means of production. In most copyright systems, a human author is required for a work to be copyright so that, for example, a work created by a non-human (for example by an AI machine or, more unlikely still, a poem-creating tree) cannot be copyright. What is the rational basis for regulating propertisation (ie. whether something is property or not) according to the means of production?
The idea/expression dichotomy. It has become increasingly accepted that copyright does not protect collections of facts, Feist being the case in point in the US. How does this apply to businesses, like the NYT, that make their living from investigative reporting? Investigative reporting is research which produces a collection of facts: those facts then get written up into an article. If people can use AI to keep the same facts, but to rewrite the article using a different expression, then there is no infringement of copyright. But it is free-riding. If we want investigative reporting to continue (which we must, if we want our democracies to continue), then this approach to intellectual property is not sustainable. The idea/expression dichotomy works well enough for fiction but, in an AI world, not for factual reporting.
[i] It is worth noting that section 29A of the UK’s Copyright, Designs and Patents Act 1988 allows web scraping for research for a non-commercial purpose. OpenAI is unlikely to have met that test. In the EU, Copyright in the Digital Single Market Directive 201 allows limited web scraping for scientific research purposes, but also allows “reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining” without a restriction as to purpose. However, rightsholders are entitled to override this provision. The forthcoming EU AI Act also provides: “Any provider placing a general purpose AI model on the EU market should comply with this obligation [ie. right holder consent], regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of these general purpose AI models take place. This is necessary to ensure a level playing field among providers of general purpose AI models where no provider should be able to gain a competitive advantage in the EU market by applying lower copyright standards than those provided in the Union.”
[ii] When moving picture technology first arrived, novelists were lucky enough to get a whole new revenue stream – film rights! No-one argues that novelists should not be entitled to take advantage of these rights either because they were new, or because the change from book to film was transformative.
[iii] Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340, 345 (1991).
[iv] Google LLC v. Oracle America, Inc., 593 U.S. ___ (2021).
[v] The purpose of fair use is “providing a context-based check that can help to keep a copyright monopoly within its lawful bounds”. [emphasis added]. Google LLC v. Oracle America, Inc.
[vi] “…..to allow enforcement of Oracle’s copyright here would risk harm to the public. Given the costs and difficulties of producing alternative APIs with similar appeal to programmers, allowing enforcement here would make of the Sun Java API’s declaring code a lock limiting the future creativity of new programs. Oracle alone would hold the key”. [emphasis added]. Google LLC v. Oracle America, Inc.