this post was submitted on 27 Dec 2023

Technology


The New York Times is suing OpenAI and Microsoft for copyright infringement, claiming the two companies built their AI models by “copying and using millions” of the publication’s articles and now “directly compete” with its content as a result.

As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” This “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”

The complaint also argues that these AI models “threaten high-quality journalism” by hurting the ability of news outlets to protect and monetize content. “Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the lawsuit states.

The full text of the lawsuit can be found here

[–] [email protected] 12 points 8 months ago (40 children)

How so?

The trained model includes vast swathes of copyrighted material. It's the rights holders who get to decide whether someone can use it.

Making it inconvenient or harder for someone to train an AI model does not justify wholesale stealing.

A lot of models are even trained on large amounts of pirated material, such as books downloaded from pirate sites. I guarantee you OpenAI and others didn't even buy a lot of the material they use to train their AI models.

[–] [email protected] 3 points 8 months ago (29 children)

No it doesn't, the training data isn't inside the LLM.

So firstly, even if those claims are true, you are suing the wrong business; you would need to sue whoever built the training dataset. They, however, are usually protected by laws covering scientific use, because they count as "non-profit research".

Therefore this is completely ridiculous.

Btw, the copyright part is only a thing if it's a significant portion of the thing... which it clearly isn't in this case (it's below 1% of it), making it even more ridiculous.

Also, if you can get the information on the internet, you are again suing the wrong party; you should be after the provider, not the automatic data grabbing system... as they can and will argue that they can't control what their crawler takes. There is a way to mark content as "don't use" for machines, but most people don't do that and will lose in court because they don't understand it...

Lastly, the training wouldn't be harder; the problem is the gathering of data. You can't manually look through all of it, and it's idiotic to think that it's reasonable to demand such a thing.
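The "don't use" marker mentioned above is most likely robots.txt (an assumption; the comment does not name the mechanism). As a minimal sketch, this is roughly how a well-behaved crawler could check it with Python's standard library. The rules in `ROBOTS_TXT` and the helper name `may_crawl` are made up for illustration, and robots.txt is only advisory: it relies on the crawler choosing to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve (illustrative rules only):
# block one AI crawler by its user-agent token, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/2023/12/27/article.html"
    print("GPTBot allowed:", may_crawl(ROBOTS_TXT, "GPTBot", page))
    print("Browser allowed:", may_crawl(ROBOTS_TXT, "Mozilla/5.0", page))
```

Whether ignoring such a file has any legal weight is a separate question that this sketch does not address.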

[–] [email protected] 10 points 8 months ago (25 children)

No it doesn’t, the training data isn’t inside the LLM.

This is factually incorrect. You can extract the data. How do you think the legal cases are being brought?

For example

The model has to contain the data in order to produce works.

Wholesale commercial copyright infringement, where you're profiting off of others' work on a large scale, is a whole different ball game.

They're training their models on large amounts of pirated content and profiting off it.

Of course the rights holders are going to say "wait a minute, why are you making money off my content without my permission? And how much of my work did you pirate to use?"

You cannot hand-wave away the mass piracy used to train these models and then distribute models built on that act of mass copyright infringement.

Do you not understand the basics of the law?

it's idiotic to think that it's reasonable to demand such a thing.

Again, the law is the law. If they mass-pirate a bunch of media and the model then contains chunks of it, they are breaking the law.

I can't believe this is a hard concept for someone to understand.

[–] [email protected] 2 points 8 months ago (1 children)

The model has to contain the data in order to produce works.

As far as I understand, this isn't true. Can you elaborate on why it needs to contain the data?

[–] [email protected] 1 points 8 months ago (1 children)

It contains large parts of the data in order to create output. The link I provided shows that the models do contain chunks of the original works.

Otherwise, how would it create the words, etc.?

I am amazed that we now have people on the level of crypto coin idiocy going on about AI models who don't understand this.

[–] [email protected] 1 points 8 months ago (1 children)

You would probably claim I don’t deserve my job given my level of technical illiteracy, however you think you are inferring that. Anyway, they do make reasonable efforts to design models that generalize rather than memorize. This is quite basic, even fundamental, in machine learning in general.

Previous models had semantic reasoning capacity without memorization, e.g. word2vec.

You should also realize that the fact that current models memorize despite efforts to prevent it doesn’t mean that models need to memorize. Like I said initially, they are actually designed to work without needing to memorize.

[–] [email protected] 1 points 8 months ago (2 children)

You're contradicting yourself.

In one sentence you say it doesn't memorize (with "reasonable effort") then in the next you admit it does.

"Reasonable effort" is weasel wording.

Make up your mind.

[–] [email protected] 1 points 8 months ago (1 children)

?? Are you trolling? If you design a car to combust gasoline without burning the lubricants, but you still end up burning them, it doesn’t mean that the lubricants are needed for the combustion itself. Conversely, you have not made any nuanced argument explaining why memorization is necessary. I gave you an example where we know there is no memorization, and you ignored it.

“Otherwise how would it create the words” is just saying you wouldn’t know.

[–] [email protected] 1 points 8 months ago (1 children)

So, me pointing out the flaw in your argument is trolling?

What?

If you choose to use weasel wording to try and get out of something, that is your call.

[–] [email protected] 1 points 8 months ago (1 children)

OK, I believe that you believe that. It’s OK. I have professional experience in this space, so you’re either not reading carefully or you don’t understand much about the topic.

Perhaps you might want to reconsider this in more abstract terms. The engine example you ignored could help you with that.

We have language models that don’t memorize and are simple enough that we can know this for certain. Do you really think that isn’t all we need to show that language models don’t necessarily have to memorize? You keep repeating the same (illogical) argument and ignoring the simpler arguments that disprove your claim.

[–] [email protected] 1 points 8 months ago (1 children)

So now it's gone from "reasonable effort" to you being able to say, without any doubt, that none of the trained models contain any copyrighted data at all?

Come on. Make up your mind.

[–] [email protected] 1 points 8 months ago (1 children)

You still haven’t backed up your claim. Once again, just because you don’t know how something is done doesn’t mean it’s not possible.

[–] [email protected] 1 points 8 months ago (2 children)

My man, now you're just trying to put the onus on me.

Which is it?

Is it that they don't retain the data, or that they do?

You made the claim. 🤷‍♂️

[–] [email protected] 1 points 8 months ago (1 children)

Lol. You already forgot that you were the one who first claimed they need to retain the training data.

[–] [email protected] 1 points 8 months ago

Pointing out your argument's inconsistency is forgetting?

Are you okay?

[–] [email protected] 1 points 8 months ago (1 children)

Lol. You already forgot that you were the one who first claimed they need to retain the training data.

[–] [email protected] 1 points 8 months ago

Oh, I've broken you.
