Why does this model have an Apache 2.0 license?

by Stealcase - opened Jun 21

Discussion

Stealcase

Jun 21

This model is based on datasets that are not "clean" from a Copyright perspective.

uonlp/CulturaX is a subset of mC4 and OSCAR, which are both in turn subsets of Common Crawl, a web-crawl based dataset. This means the dataset is guaranteed to contain data under licenses that preclude LLM training. Most of the text is under Copyright, and you did not ask the creators for permission or license the text for this purpose. The only reason these datasets can exist on Huggingface from a Copyright perspective is the set of exceptions that apply explicitly to research institutions, together with the idea that the datasets would primarily be used for research.
HPLT is again, you guessed it: Common Crawl and Internet Archive. This means the data is again scraped from the web and redistributed under the assumption that it will be used as part of research, not creative endeavours.
NCC Corpus is the only partially clean dataset used to train this model. I say "partially", because you have not documented any filtering on this dataset.
Did you filter out "newspapers_online_nb, newspapers_online_nn"? These are under a CC BY-NC 2.0 license, which is explicitly for non-commercial use.
Did you filter out "opensubtitles, wikipedia"? These are under CC BY-SA 3.0, a share-alike license that implies you should share your resulting model under the same license.

You did not document this filtering anywhere, so I'm going to assume you did not.
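
To make the question concrete, here is a minimal sketch of the kind of filtering I mean. It assumes each NCC record is a dict with a `doc_type` field; the function name, the set names and the sample records are hypothetical and not taken from your pipeline.

```python
# Hypothetical sketch: drop records whose source license restricts reuse.
# The `doc_type` field and the record layout are assumptions.
NON_COMMERCIAL = {"newspapers_online_nb", "newspapers_online_nn"}  # CC BY-NC 2.0
SHARE_ALIKE = {"opensubtitles", "wikipedia"}  # CC BY-SA 3.0


def filter_by_license(records, excluded=NON_COMMERCIAL | SHARE_ALIKE):
    """Yield only records whose doc_type is not in the excluded set."""
    for record in records:
        if record.get("doc_type") not in excluded:
            yield record


# Made-up sample records, for illustration only.
sample = [
    {"doc_type": "newspapers_online_nb", "text": "..."},
    {"doc_type": "wikipedia", "text": "..."},
    {"doc_type": "government_nb", "text": "..."},
]
kept = list(filter_by_license(sample))
```

Documenting a step like this, along with the list of excluded sources, would be enough to answer the two questions above.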

Consequences

Your documentation and licensing around these models have produced reporting that completely and inappropriately claims that NoraLLM is trained on "legal data".

https://aiavisen.no/nora-llm/
"The models are open and transparent, and use only legal data." (translated from Norwegian)

Why take the risk of licensing this model with a commercial license if you're basing your actions on exceptions awarded for researchers?

LLMs are known to reproduce text found in their datasets.
Norway does not have "Fair Use" as a legal exception, so there is no real ambiguity around using Copyrighted material commercially under Norwegian law. The usual ambiguities and legal gray areas don't apply.
Did you consult the legal department of UiO before publishing these models under commercial licenses?
What need does this serve when these models are marketed as "research artifacts" yet licensed for commercial use? Could you not advance Norwegian AI research without licensing these AI models commercially?

The NET effect is that UiO becomes part of a copyright laundering scheme which uses international Copyright protections and exceptions meant to give researchers access to data, in order to make the results of those efforts commercially available for anyone to exploit without consent from, credit to, or compensation for the original authors.

As I'm sure this is not the intention of the creators of this model and that they are simply following the norms of Huggingface in how they licensed this model, I urge them to relicense this with a different license that actually reflects the data and the purpose of the model: a research artifact meant to be used by researchers to improve Norwegian LLM capabilities.

ltgoslo

Norwegian Large Language Models org Jul 5

Thanks for your comment.

We are in ongoing communication with legal experts on this topic.
However, our current understanding is that large language models do not necessarily directly inherit licensing limitations of the training data. No data is re-distributed by us in any way.

Stealcase

Jul 7

•

edited Jul 7

You are misunderstanding my legal argument and not addressing my most salient points. This does not give me confidence that you are looking into the matter, since you are instead dismissing my worries and not addressing my key claims.

Please consider this

There is no Fair Use in Norway, there is no concept of "transformative use", and you NEED an EXCEPTION to copyright to use Copyrighted material in ANY capacity in Norway.

✅ You trained on Licensed copyrighted data, and the only legal basis you have for this act is "research" as a Copyright exception. This is fine, you are doing research. You have an exception.
❌ The fact that you then released the result of that research with a license that allows commercial use is tantamount to copyright laundering through an educational institution with an explicit copyright exception that exists to allow researchers to do their work. This is NOT fine.

Why respect OpenAI's copyright, but not respect BY-NC 2.0?

Even though your response is short, there is hypocrisy in your statement on your legal beliefs regarding how you treat these models.
"our current understanding is that large language models do not necessarily directly inherit licensing limitations of the training data"
If this were true, why would you respect the copyright of OpenAI but not that of authors and writers?
To quote the README of this very model:
"This is a model instruction-tuned on open datasets released under the most permissive apache-2.0 licence (in other words, we don't use any datasets generated by ChatGPT) thus we can release this model under the same license and make it openly available for commercial applications."

The entire legal basis that OpenAI is using to defend the outputs of ChatGPT from being used to train a competing model is rooted in Copyright Law and intellectual property law. If they weren't using Copyright, they would have no legal ability to defend this claim.
Specifically in this case, they are benefiting from Berne Convention Article 9.2:
"It shall be a matter for legislation in the countries of the Union to permit the reproduction of such works in certain special cases, provided that such reproduction does not conflict with a normal exploitation of the work and does not unreasonably prejudice the legitimate interests of the author."
OpenAI is saying that using their material to train a competing service would unreasonably prejudice their legitimate interests.
If OpenAI was not using Copyright, they would need to invent a whole extra-judicial legal paradigm to defend their claim. Instead, they are relying on Copyright.

Your most recent statement is directly contradictory to your stated motivations for not using text generated by OpenAI. You imply that the LICENSING is the main reason for not using ChatGPT generated data, implying that COPYRIGHT is the basis.
This indicates a respect of Terms and Conditions rooted in Copyright for one of the largest AI companies in the world, but not for individual authors who have had their works scraped from the web.

Why do you respect one and not the other? Is it because of the possible legal consequences for ignoring copyright? Never mind that OpenAI actually doesn't have a copyright claim to material generated by ChatGPT per the Copyright Office; they are nevertheless basing their ToS on the assumption that they do. Why does their Copyright weigh more heavily, even though they have less of a claim to Copyright than authors?

What are the consequences?

The consequence of this is a Privatized Copyright: respecting the imaginary corporate Copyright of multibillion-dollar conglomerates based entirely on their ability to pursue legal action and how big their war-chest is, while ignoring ACTUAL state legislated Copyright for everyone else. If this isn't a mockery of the rule of law and bending to the will of corporations, I don't know what is.

If I added a Terms of Service to my own website that stated "you may not use my material for deep learning or machine learning", would you respect this? I really doubt it, because your approach to Copyright is quite apparently not well considered.

Final points

"No data is re-distributed by us in any way."
This is false. You are distributing an AI model. Though the individual bytes that you are distributing might not match the input data bit-for-bit, it is widely acknowledged in the ML field that ML models compress their dataset as part of training. Otherwise, how could models retain information from the training data?

I can convert a PNG file to a JPG and change EVERY SINGLE BYTE, yet you would recognise the image as the same if shown side by side. The bytes do not matter. Compressing the data into a new form does not invalidate copyright law.

Even assuming your statement is true, this does not protect you.
Reproduction of Copyrighted text using an LLM is not a pre-requisite to claim copyright infringement (though it is certainly possible to do so with most LLMs and the right prompts).
Please note the NYT lawsuit against OpenAI is based on INPUTS, rather than outputs. The USE of Copyrighted works in a commercial LLM. The outputs are not key to their lawsuit, they only serve as proof of the training, while the training is the alleged "original sin".
The same is true for the record labels' lawsuits against Suno and Udio.

EDIT: Spelling fixes
