Training Data and Copyright: Is Scraping the Internet to Train AI Models Infringement?

Navigating Section 52 of the Copyright Act, 1957 and the Case for a Specialised Text and Data Mining Framework in India

Abstract

The mass use of copyrighted works to train large-scale generative artificial intelligence systems has created a major intellectual property controversy in contemporary India. This article examines whether such ingestion constitutes infringement under the Copyright Act, 1957, with special focus on the fair dealing exception under Section 52 of the Act. Section 52 lays down an exhaustive list that is incapable of accommodating commercial AI training, and the Copyright (Amendment) Act, 2012 did not establish any text and data mining (TDM) provisions. Drawing on a comparative analysis of the EU DSM Directive 2019/790, the TRIPS Agreement, MeitY’s 2019 Report on Platforms and Data, the DPDP Act, 2023 and the US fair use doctrine, this article proposes a balanced, TRIPS-compliant TDM exception to close the gaps in the Indian statutory and structural framework.

Keywords: GenAI training, text and data mining, Section 52, fair dealing, Copyright Act, 1957, DPDP Act, 2023, TRIPS Agreement, DSM Directive, fair use doctrine, MeitY Committee-A Report.

Introduction

The training of Generative AI [GenAI] is, in essence, an act of mass reproduction. Language models and image generators function by processing datasets compiled from publicly accessible expression that is protected by copyright. The global stakes of this issue are illustrated by The New York Times Company v. Microsoft Corporation and OpenAI, Inc., where millions of articles were alleged to have been ingested without authorization, as well as by class actions filed by visual artists against Stability AI and Midjourney. India has not yet seen litigation of equivalent scale, owing to the limited mobilization of domestic rights holders.

This article focuses on whether India’s fair dealing framework under Section 52 of the Copyright Act, 1957 [hereinafter referred to as “the Act”] can shield commercial GenAI training from infringement liability. The core argument of the paper is that it cannot. The Copyright (Amendment) Act, 2012, the last revision made to the Act, did not introduce any TDM [Text and Data Mining] provisions. Legislative action is required immediately, before the AI ambitions set out in the IndiaAI Mission approved in 2024 [vi] come into conflict with the rights of Indian copyright holders.

How AI Training Engages Indian Copyright Law

Under Section 13(1), the Act protects original literary, dramatic, musical, and artistic works. The standard of originality was set by the Supreme Court in Eastern Book Company v. D.B. Modak, which required a modicum of intellectual creativity, something more than mere mechanical reproduction. [viii] Moreover, because unregistered works also carry copyright, it is arduous for GenAI developers to determine the copyright status of the data they scrape for training purposes.

GenAI training engages copyright in two ways. Firstly, building GenAI training datasets requires scraping and storing copies of copyrighted works. Reproduction under Section 14(a)(i) of the Act includes the storing of a work in any medium by electronic means. Allegations that such reproduction infringes copyright are compounded by a lack of transparency: AI developers have been widely criticized for not disclosing the datasets used for training, which diminishes accountability.

Secondly, the parameters of the trained model, shaped by patterns derived from protected works, may constitute an adaptation under Section 14(a)(vi). Scholars argue that ‘AI generated works often closely mimic the style of specific artists or reproduce substantial portions of existing texts, blurring the line between learning and copying,’ and may be classified as ‘unauthorized derivative works, a clear infringement under both Indian and international copyright laws.’ The foundational principle that copyright protects expressions rather than ideas, laid down by the Supreme Court in R.G. Anand v. Delux Films, does not automatically shield GenAI training, since extracting expression from entire works differs intrinsically from drawing inspiration from ideas.

Section 52 Fair Dealing: Framework and Limitations

Section 52 of the Act sets out an exhaustive list of acts that do not amount to infringement. The section’s foundational principle, as established in Wiley Eastern Ltd. v. Indian Institute of Management, is ‘to protect the freedom of expression under Article 19(1) of the Constitution of India, so that private research and study, criticism or review could be protected’. Unlike Section 107 of the US Copyright Act, which allows courts to weigh any relevant factor in deciding whether a use is fair, Section 52 is a closed list: an act cannot be excused if it does not fall within one of its enumerated categories. This distinction is often ignored when the US fair use framework is applied to the Indian statutory context. The most plausible defence for AI developers under fair dealing is Section 52(1)(a), which allows fair dealing for the purposes of (i) private or personal use, including research; and (ii) criticism or review.

However, the above provision cannot permit commercial GenAI training, for at least three reasons:

Firstly, the research must be ‘private or personal.’ In Civic Chandran v. C. Ammini Amma, the Kerala HC held that courts have to consider: ‘(1) the quantum and value of the matter taken in relation to the comments or criticism; (2) the purpose for which it is taken; (3) the likelihood of competition between the two works.’ The court also adopted Lord Denning’s principle from Hubbard v. Vosper that a use which ‘convey[s] the same information as the author, for a rival purpose, may be unfair.’ Commercial AI training by companies like OpenAI and Google DeepMind is undertaken for the explicit purpose of developing commercially valuable products, and therefore serves a ‘rival purpose’. The Delhi HC in Super Cassettes Industries Ltd. v. Chintamani Rao likewise indicated that a commercial purpose countervails a claim of fair dealing.

Secondly, the dealing must be ‘fair’. The SC in R.G. Anand held that ‘one of the surest and safest tests to determine whether there has been violation of copyright is to see if the reader, spectator or viewer gets an unmistakable impression that the subsequent work appears to be a copy of the original’. GenAI training, in essence, involves reproducing not merely extracts of works but works in their entirety. This directly undermines the licensing markets that rights holders are developing. The Delhi HC in The Chancellor, University of Oxford v. Narendra Publishing House laid down that ‘the amount and value of the content taken from the original work would be considered to decide its effect on the market share of the original work’, a test that GenAI training clearly fails to meet.

Thirdly, Section 52(1)(a) expressly excludes computer programmes from its scope. This exclusion reflects Parliament’s deliberate caution about extending fair dealing to the commercial technology sector. The absence of any kind of TDM provision in Section 52 is the most pivotal gap. As Chauhan concludes, ‘the existing regime is insufficient to handle the copyright issues associated with TDM’, and ‘all these factors require legislature or executive intervention’. The exhaustive nature of the list in Section 52 indicates that a TDM exception cannot be created by judicial interpretation alone; it requires parliamentary action.

The DPDP Act, 2023: Another Layer of Legal Exposure

GenAI developers’ legal exposure is not confined to the Copyright Act alone. The interaction of the Digital Personal Data Protection Act, 2023 [DPDP Act] with the Copyright Act introduces what scholars describe as the central dilemma: ‘what may be considered permissible under the principles of copyright may simultaneously constitute a breach of privacy when the data involved is personal in nature and obtained without consent’.

GenAI training inevitably involves personally identifiable information [PII] embedded in scraped web content. The DPDP Act requires that personal data be processed only with the valid consent of the Data Principal. In Justice K.S. Puttaswamy v. Union of India, the Supreme Court affirmed the right to privacy under Article 21 of the Constitution and recognized informational privacy as a facet of that right. The defence that people who post content online have thereby consented to its use for commercial AI training is therefore untenable.

Scholars have further observed that existing privacy regulations ‘focus predominantly on identifiable personal data,’ while ‘advancements in re-identification techniques mean even anonymized datasets can be reverse-engineered to reveal sensitive patterns, identities, or behaviours, creating a legal grey zone.’ Gaps therefore remain in the legal structure governing GenAI training on personal data in India, which creates uncertainty for developers about the permissible scope of data usage.

Comparative Research and Reform Proposal

The EU’s DSM Directive created two TDM exceptions: (i) Article 3 provides a mandatory exception for research organizations and cultural heritage institutions; (ii) Article 4 creates a default exception for all users, including commercial developers, subject to right holders’ ability to opt out ‘in an appropriate manner, such as machine readable means in cases of publicly available content’. Japan’s approach, by contrast, is more permissive. Its amended copyright law permits TDM in all cases, both commercial and non-commercial, without prior consent, unless the exploitation ‘unreasonably jeopardizes the interests of the copyright holder’. Chauhan supports Japan’s liberal TDM exception and argues that incorporating it into Indian statute would serve the country well given its cultural, linguistic and economic peculiarities. This is especially important in India, as AI models need to be trained across India’s 121 languages.

The US relies heavily on the fair use doctrine. In Authors Guild v. Google, Inc., the Second Circuit held that large-scale digitisation was transformative where it enabled new functionality without substituting for the original market. Whether AI training constitutes a similarly transformative use remains heavily disputed, and it falls to the courts to decide. The US approach thus produces legal uncertainty, which multi-billion-dollar litigation has made manifest.

India has not adopted any of the above approaches to date. While the 2024 MeitY Advisory covered bias and unlawful content, it did not address the copyright issues associated with TDM and AI, leaving AI companies that scrape protected Indian works in a position of prima facie infringement without any statutory defence. Articles 7 and 8 of the TRIPS Agreement, which recognize that IP protection ‘should promote technological innovation’ and permit measures ‘necessary to protect public interest in sectors of vital importance to socio-economic and technological development’, provide a clear treaty basis for India to enact a regulated TDM exception. The MeitY Committee-A Report of 2019 recognized that ‘the increasing access to quality data set is key to creating a competitive and equitable AI ecosystem’ and further recommended the development of a National AI Resource Platform sanctioned by the government data framework.

This article suggests that a Copyright Amendment Bill should incorporate two main provisions:

(i) A mandatory non-commercial TDM exception for research institutions and educational establishments, modelled on Article 3 of the DSM Directive.

(ii) A commercial TDM exception subject to a machine-readable rights-holder opt-out, modelled on Article 4(3) of the Directive but adapted to the Indian socio-economic context, with particular regard to India’s linguistic diversity and its AI sector.

Alongside these provisions, the DPDP Act should introduce dataset documentation requirements and a workable machine-learning standard for GenAI training, addressing the legal gaps that scholars have identified.
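
To make the idea of a ‘machine-readable’ opt-out concrete, the following is a minimal sketch of how a crawler could check a rights holder’s reservation before mining a page. It assumes, purely for illustration, that the opt-out is expressed through a robots.txt file; neither the DSM Directive nor the proposal above prescribes this mechanism. The crawler names (ExampleAITrainingBot, GenericSearchBot), the helper may_mine and the example.org URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a rights holder: the (made-up)
# AI-training crawler is excluded from the whole site, while every
# other crawler remains permitted.
ROBOTS_TXT = """\
User-agent: ExampleAITrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt does not reserve url against user_agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.org/article/1"
print(may_mine(ROBOTS_TXT, "ExampleAITrainingBot", page))  # False: opt-out applies
print(may_mine(ROBOTS_TXT, "GenericSearchBot", page))      # True: no reservation
```

A check of this kind shows only that the reservation is technically detectable at the point of scraping; whether honouring it should be a condition of the statutory exception is the policy question for the legislature.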

Conclusion

The Indian Copyright Act, 1957 does not permit commercial AI training on copyrighted works. Section 52’s fair dealing exceptions do not shelter the large-scale, automated ingestion of protected works that GenAI requires. The 2012 Amendment Act did not introduce any TDM provisions. The DPDP Act, 2023 further added data protection obligations that AI developers have largely failed to honour in practice.

The comparative analysis shows that this legal and structural gap stems from both ignorance and policy choice. As Chauhan concludes, ‘India needs to draw a balance between the protection of the interests of right holders and promoting the R&D of AI…the development of generative AI is going to happen either with a liberal copyright exception with fewer risks and more opportunities and inclusive development, or with strict copyright exceptions and comparatively more risks and challenges. The choice is ours’. The Amendment Bill proposed above underscores the need for the legislature to act. A TRIPS-compliant TDM exception that is calibrated to India’s linguistic diversity and economic conditions, while respecting copyright holders’ rights, is both feasible and constitutionally sound. Continued legislative silence will only entrench impunity for the infringement of protected works and work against the interests of every Indian copyright holder whose work is being scraped even at this very moment.

This article is written by Sudhiksha from O.P. Jindal Global University.

References

i Bedanta De, “Training AI Models on Copyrighted and Personal Data: Reconciling Fair Use and Privacy Rights,” Indian Journal of Law and Legal Research 7, no. 3: 8870.

ii Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” Journal of Intellectual Property Rights 30 (2025): 77–85, 77.

iii The New York Times Company v. Microsoft Corporation and OpenAI, Inc., No. 1:23-cv-11195 (S.D.N.Y., filed December 27, 2023).

iv Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal., filed January 13, 2023).

v Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8874.

vi Ministry of Electronics and Information Technology, “Union Cabinet Approves IndiaAI Mission,” Press Information Bureau, March 7, 2024, https://www.pib.gov.in/PressReleaseIframePage.aspx?PRID=2012355&reg=3&lang=2.

vii The Copyright Act, 1957, s. 13(1).

viii Eastern Book Company v. D.B. Modak, (2008) 1 SCC 1.

ix Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 78.

x The Copyright Act, 1957, s. 14(a)(i).

xi Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8873.

xii The Copyright Act, 1957, s. 14(a)(vi).

xiii Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8875.

xiv R.G. Anand v. Delux Films, (1978) 4 SCC 118.

xv The Copyright Act, 1957, s. 52.

xvi Wiley Eastern Ltd. v. Indian Institute of Management, 61 (1996) DLT 281, para. 19.

xvii 17 U.S.C. § 107.

xviii Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 78.

xix The Copyright Act, 1957, s. 52(1)(a).

xx Civic Chandran v. C. Ammini Amma, 1996 SCC OnLine Ker 63 : (1996) 1 KLJ 454 : (1996) 16 PTC 670.

xxi Hubbard v. Vosper [1972] 2 WLR 389.

xxii Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 79.

xxiii Super Cassettes Industries Ltd. v. Chintamani Rao, 2011 SCC OnLine Del 4712 : (2012) 49 PTC 1.

xxiv R.G. Anand v. Delux Films, (1978) 4 SCC 118.

xxv Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8875.

xxvi The Chancellor, Masters & Scholars of the University of Oxford v. Rameshwari Photocopy Services, 2016 SCC OnLine Del 6229 : (2016) 235 DLT 409 (DB) : (2017) 69 PTC 123.

xxvii The Copyright Act, 1957, s. 52(1)(a).

xxviii Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 79.

xxix Akshat Agrawal and Sneha Jain, “Indian Copyright Law and Generative AI,” SSRN, accessed 2025, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5028835.

xxx Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8879.

xxxi Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8876.

xxxii Digital Personal Data Protection Act, 2023, ss. 4 and 6.

xxxiii Justice K.S. Puttaswamy (Retd.) v. Union of India, (2017) 10 SCC 1.

xxxiv Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8880.

xxxv Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8877.

xxxvi Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on Copyright and Related Rights in the Digital Single Market, arts. 3 and 4(3).

xxxvii Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 80.

xxxviii Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 80–81.

xxxix Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015).

xl Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8874.

xli Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 78.

xlii Agreement on Trade-Related Aspects of Intellectual Property Rights, arts. 7 and 8, April 15, 1994.

xliii Ministry of Electronics and Information Technology, Report of Committee-A on Platforms and Data on Artificial Intelligence (New Delhi: Government of India, 2019), 9, 14.

xliv Directive (EU) 2019/790, arts. 3 and 4(3).

xlv Bedanta De, “Training AI Models on Copyrighted and Personal Data,” 8882–83.

xlvi Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 78.

xlvii Digital Personal Data Protection Act, 2023, ss. 4–13.

xlviii Kailash Chauhan, “Generative AI, Text & Data Mining and the Fair Dealing Doctrine,” 83.