What is ChatGPT? How does it work?
To help demystify the copyright implications of ChatGPT – the chatbot that we cannot stop hearing about – we asked ChatGPT itself to explain what it is. Its answer (in italics) has been analysed for readers’ reference.
The prompt was: “What is ChatGPT?”.
“ChatGPT is an AI-based language model developed by OpenAI, built on the GPT-3.5 architecture. The GPT-3.5 architecture is an advanced deep learning architecture that utilises transformers, a type of neural network, to process and generate natural language text. ChatGPT is pre-trained on a massive amount of text data, such as books, articles, and web pages, using unsupervised learning techniques. This pre-training enables ChatGPT to understand the nuances of natural language and generate human-like responses to a wide range of language-related tasks, such as answering questions, generating text, and translating languages.”
Simply put, ChatGPT is built on a Large Language Model (LLM). LLMs are a type of machine-learning Natural Language Processing (NLP) model that processes vast amounts of text data and identifies connections between the words in that text. LLMs acquire better capabilities as the size of their input datasets and their parameter space increase. This can be understood in terms of the following analogies: the more food one eats, the more energy one has (input datasets); the bigger the capacity of one’s stomach, the more food one can consume (parameter space).
At a fundamental level, a language model is first taught to predict a word in a sequence of words. To do so, today’s LLMs employ “transformers”, which can process all parts of the input in parallel rather than one word at a time. This is why, even when one submits a prompt several paragraphs long, ChatGPT is able to produce a relatively accurate answer in a short span of time.
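To make the idea of predicting a word in a sequence concrete, here is a minimal sketch in Python. It is not how ChatGPT is implemented: it simply counts which word follows which in a tiny, made-up training text and predicts the most frequent follower. The function names and the sample text are assumptions chosen purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text; a real LLM learns from billions of words.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept ."
model = train_bigram_model(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" most often in this text
print(predict_next(model, "sat"))  # "on" is the only word that follows "sat"
```

A real LLM replaces these simple counts with a neural network holding billions of parameters, but the training objective is of the same kind: given the words so far, predict what comes next.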
GPT stands for “Generative Pre-trained Transformer”. GPT-1, GPT-2, GPT-3, and GPT-3.5 use the transformer architecture. The original transformer design has an “encoder”, which processes the input, and a “decoder”, which generates the output; GPT models are generally described as using only the decoder part. The decoder has a “multi-head self-attention mechanism” that permits the GPT model to weigh parts of the input against one another to derive their meaning according to the context. GPT models are trained to predict the next word from the words that precede it (“causal” language modelling), whereas “masked-language-modelling”, in which the goal is to predict “masked” words in between a sequence of words, is associated with encoder-based models such as BERT. This training allows a GPT to identify and understand connections between words, and provide tailored responses.
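The weighing performed by self-attention can be sketched in a few lines of Python. The example below is a heavily simplified assumption, not OpenAI’s code: it uses a single attention “head”, skips the learned projection matrices that a real transformer applies, and works on made-up two-number word vectors. It also assumes the GPT-style restriction that each position may only look at itself and at earlier positions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(vectors):
    """Single-head attention without learned projections (illustrative only).

    Position i attends only to positions 0..i, so later words are hidden.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # the word itself and everything before it
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible vectors.
        output = [
            sum(w * value[d] for w, value in zip(weights, visible))
            for d in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical two-dimensional "embeddings" for a three-word input.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(toy_embeddings)
for position, row in enumerate(weights):
    print(position, [round(w, 2) for w in row])
```

Each printed row shows how much weight one word places on itself and on the words before it. The first word can only attend to itself, while the third word spreads its weight across all three.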
Could the way we perceive copyright law change because of GPT models?
This chatbot is able to create a variety of content of a kind that is protectable under IP law, such as articles, song lyrics, program code, or translations of texts.
“Content” comprises the input that a user provides in the prompt, as well as the output that ChatGPT generates in response to that input. Currently, the user owns all input, subject to applicable law. With regard to the output, OpenAI appears to “assign” to the user all its rights, title, and interest in and to the output. “Assignment” refers to the copyright owner transferring ownership of the work to another party; once the work is assigned, the original owner has no control over its use. The assignee has all the rights that the initial owner had.
Nevertheless, the terms do not specify the details of the assignment as required by Section 19 of the Indian Copyright Act (the Act). The details must include the term of assignment, territory, royalty percentage, and payable consideration. Since the mode of operation is online, the territory would constitute the whole world. This raises problems for takedown if the assignment is deemed invalid in the future, especially since users who have allegedly been assigned ownership of certain output have already published that output under their own names.
It must be noted that ChatGPT gathers publicly-available information from the internet that is fed to its own databases – based on which its training occurs – and creates its own new answers. It is likely that the publicly-available information that it derived its training data from was copyrighted, although some information may have been uncopyrightable material such as facts, discoveries, and works whose term of copyright protection has expired.
When ChatGPT was asked whether it used copyrighted material during its training, the first answer was as follows:
When asked the same question once more, it provided a different, if not conflicting, answer.
However, ChatGPT’s argument that its training data has been pre-processed to remove all copyrighted material is questionable, since a vast majority of publicly-available resources are either copyrighted or released under Creative Commons licences whose scope of use appears to be ambiguous where ChatGPT is concerned. Since some training data would in all likelihood be copyrighted, and ChatGPT uses it to write its own answers, would the resulting answers to our questions constitute derivative works? How can OpenAI assign these hypothetical derivative works to its users if they are unauthorised? If it is confirmed that some training data is copyrighted, and ChatGPT creates its own answers from that training data, is that process transformative enough to constitute fair use? (Italy currently has a ban on ChatGPT unless OpenAI agrees to disclose information on the process.) If someone uses this unauthorised derivative work, would it constitute vicarious infringement or contributory infringement? In either case, would the vicarious or contributory infringer be OpenAI or the users of ChatGPT? If the argument that nobody except natural persons and corporations can own IP is negated, how would the question of whether originality subsists in the output generated by ChatGPT be answered by each of the three tests (modicum of creativity, skill and judgement, and sweat of the brow)?
The answers to the last three questions belong to a separate discussion, since the objective of this article is only to enlighten readers on the possible implications that ChatGPT could have on copyright law. Further, the copyright laws of several jurisdictions do not confer ownership of AI-generated content on the AI itself, since AI cannot be recognised as an author and does not have legal or natural personality, which is a precondition to owning intellectual property.
However, ownership of the content varies depending on the sources of the training database, the database’s owner, and the degree of similarity between the database and the output generated. If ChatGPT is deemed capable of creating original literary work, it can be regarded as an author, and if it is so regarded, the issue of copyright infringement liability arises. However, OpenAI’s terms seem to indicate that infringement liability shifts onto the subsequent owner of the generated output, namely, the user (remember that OpenAI assigns ownership of the generated output to the user who entered the respective input).
When one compares the way ChatGPT works with the way graphic artists create works using drawing software (ownership of which is vested in the artists and not the creator of the software), it is clear that ChatGPT does not allow human intervention in the creation of its output, whereas drawing software is used by a human being as a tool to create the resultant artwork. Thus, copyright infringement cases against users of ChatGPT appear unlikely to succeed, since courts require proof that one party has copied another’s work, and the user theoretically has no part in any such copying.
Another limitation, as OpenAI’s terms acknowledge, is that, owing to the nature of machine learning, ChatGPT may generate similar or identical output for different users. In this event, determining the owner of copyright in such identical or similar output can be complex, since several users may each claim to have been assigned the same text.
The Indian Scenario
Upon reading Section 16 and Section 17 of the Copyright Act along with the requirement of Form-XIV to disclose the applicant’s name, nationality, and address to register a copyright, it can be inferred that there is an implied human authorship condition for registration. Section 2(d)(vi) of the Act also acknowledges the human precondition through its use of the pronoun “who” in attributing authorship to an individual. Thus, while the question of whether a human can be assigned ChatGPT’s output is out of the Act’s scope, it is clear that a human being cannot be the author of the content generated by ChatGPT.
Ironically, the Copyright Office’s records show that an artistic AI application, RAGHAV (Robust Artificially Intelligent Graphics and Art Visualizer), has been named as a co-author and continues to remain so (as of 17th April 2023).