The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates

FatCat@lemmy.world · 13 days ago

The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates

lightnsfw@reddthat.com · 11 days ago

If ChatGPT was free I might see their point but it’s not so no. If you’re making money from someone’s work you should pay them.

Drewelite@lemmynsfw.com · 11 days ago

You’re making an indie movie on your iPhone with friends. You sell one ticket. You now owe: Apple, Joseph Nicéphore Niépce’s estate (inventor of the camera), every cinematographer who first devised the type of shots you’re using, the writers since the beginning of time that created the types of story elements in the script, the mathematicians and scientists that developed lens technology, the car manufacturers that aided your ability to transport you to the set, the guy whose YouTube tutorial you watched to figure out lighting, etc, etc, etc.

Your black and white framing appears to provide a clear ethical framework until you dig a millimeter into it. The reality is that society only exists because of the work that all of the individuals within it produce. Things like copyright are an adapter to our capitalistic economy to ensure that people’s work that can be copied is protected enough that they have the opportunity to make money off of it. It exists so somebody else can’t immediately turn around and sell the same book someone else wrote, or just change a few words and do the same. This protection was meant to last 15 to 20 years. Then the work would enter the public domain for anyone to copy and rewrite as they please.

Current copyright is an utter bastardization of its intended use. Massive corporations are trying to act like they’re fighting for the little guy to own their IP forever. But they buy up all that IP for pennies compared to how they turn around and commoditize it. Then they own all of what society produces in perpetuity. They can sit on their dragon hoards and laugh as they gobble up any new creation that strays too close. And people wonder why everything is a sequel of a sequel of a sequel owned by massive corporations.

lightnsfw@reddthat.com · 11 days ago

I was trying to keep it simple.

I would have paid them by purchasing the iPhone and whatever software I used. I paid for the car that transported me. I would have paid for my education. People can also give their work away for free if they want, or be compensated by ads as in the case of YouTube or FOSS.

Current copyright is an utter bastardization of its intended use. Massive corporations are trying to act like they’re fighting for the little guy to own their IP forever. But they buy up all that IP for pennies compared to how they turn around and commoditize it. Then they own all of what society produces in perpetuity. They can sit on their dragon hoards and laugh as they gobble up any new creation that strays too close. And people wonder why everything is a sequel of a sequel of a sequel owned by massive corporations.

What do you think ChatGPT is trying to do? It’s already being used to churn out shitloads of garbage content. They’re not making things better.

Drewelite@lemmynsfw.com · edit-2 11 days ago

By that rationalization, OpenAI is paying their Internet bill, and for a copy of Dune, so they’re free to use any content they acquired to make their product better. Your original argument wasn’t akin to, “Shouldn’t someone using an iPhone pay for one?” It was “Shouldn’t Apple get a cut of everything made with the iPhone?”

You could make the argument that people use ChatGPT to churn out garbage content, sure, but a lot of cinephiles would accuse your proverbial indie movie of being the same and blame Apple for creating the iPhone and enabling it. If you want to make that argument, go ahead. But don’t pretend it has anything to do with people getting paid fairly for what they made.

ChatGPT is enabling people to make more things, easier, to get paid. And people, as always, are relying on everything that was created before them as a basis for their work. Same as when I go to school and the professor shows me lots of different works to learn from. The thousands of students in that class didn’t pay for any of that stuff. The professor distilled it and presented it and I paid him to do it.

lightnsfw@reddthat.com · 11 days ago

The problem is that they didn’t pay for the content they’ve acquired and they’re selling it to others. The creators are not being compensated and may not want to participate in AI development at all. If the creators agree to it then fine but most do not. Just look at what’s happening with art. People are scraping all of an artist’s work to create AI pictures in their style and impersonate them. That’s not okay.

scottywh@lemmy.world · 12 days ago

Look… All I have to say is… Support the Internet Archive!

(please)

General_Effort@lemmy.world · 11 days ago

Heh. Funny that this comment is uncontroversial. The Internet Archive supports Fair Use because, of course, it does.

This is from a position paper explicitly endorsed by the IA:

Based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.

By

Library Copyright Alliance
American Library Association
Association of Research Libraries

mm_maybe@sh.itjust.works · 12 days ago

The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works. This has been suppressed by OpenAI in a rather brute force kind of way, by prohibiting the prompts that have been found so far to do this (e.g. the infamous “poetry poetry poetry…” ad infinitum hack), but the possibility is still there, no matter how much they try to plaster over it. In fact there are some people, much smarter than me, who see technical similarities between compression technology and the process of training an LLM, calling it a “blurry JPEG of the Internet”… the point being, you wouldn’t allow distribution of a copyrighted book just because you compressed it in a ZIP file first.

cashew@lemmy.world · 12 days ago

I agree. You can’t just dismiss the problem saying it’s “just data represented in vector space” and on the other hand not be able to properly censor the models and require AI safety research. If you don’t know exactly what’s going on inside, you also can’t claim that copyright is not being violated.

Hackworth@lemmy.world · 12 days ago

It honestly blows my mind that people look at a neural network that’s even capable of recreating short works it was trained on without having access to that text during generation… and choose to focus on IP law.

fruitycoder@sh.itjust.works · 12 days ago

Right! Like if we could honestly further enhance that feature, it’s an incredible increase in compression tech!

Hackworth@lemmy.world · 12 days ago

Equating LLMs with compression doesn’t make sense. Model sizes are larger than their training sets. If it requires “hacking” to extract text of sufficient length to break copyright, and the platform is doing everything they can to prevent it, that just makes them like every platform. I can download © material from YouTube (or wherever) all day long.

beebarfbadger@lemmy.world · 12 days ago

The issue isn’t that you can coax AI into giving away unaltered copyrighted books out of their trunk, the issue is that if you were to open the hood, you’d see that the entire engine is made of unaltered copyrighted books.

All those “anti hacking” measures are just there to obfuscate the fact that the unaltered works are in use and recallable at all times.

Hackworth@lemmy.world · edit-2 12 days ago

This is an inaccurate understanding of what’s going on. Under the hood is a neural network with weights and biases, not a database of copyrighted work. That neural network was trained on a HEAVILY filtered training set (as mentioned above, 45 terabytes was reduced to 570 GB for GPT3). Getting it to bug out and generate full sections of training data from its neural network is a fun parlor trick, but you’re not going to use it to pirate a book. People do that the old-fashioned way by just adding type:pdf to their common web search.
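For anyone who wants to sanity-check the size argument in this sub-thread, here is a rough back-of-the-envelope sketch. The 45 TB and 570 GB figures are the ones quoted in the comment above; the 175 billion parameter count is the published GPT-3 size; the bytes-per-parameter values are assumptions about storage precision, not something stated in the thread. It only compares raw sizes and says nothing about what the weights actually encode.

```python
# Back-of-the-envelope size comparison for GPT-3.
# 570 GB (filtered text) and 45 TB (raw text) are the figures quoted in the thread.
# 175 billion parameters is the published GPT-3 size.
# Bytes per parameter depends on storage precision, which is an assumption here.

GB = 10**9

N_PARAMS = 175 * 10**9
FILTERED_TEXT_GB = 570
RAW_TEXT_GB = 45_000


def model_size_gb(n_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in gigabytes for a given precision."""
    return n_params * bytes_per_param / GB


sizes = {
    "fp32": model_size_gb(N_PARAMS, 4),  # 4 bytes per weight
    "fp16": model_size_gb(N_PARAMS, 2),  # 2 bytes per weight
}

for precision, size in sizes.items():
    vs_filtered = size / FILTERED_TEXT_GB
    vs_raw = size / RAW_TEXT_GB
    print(
        f"{precision}: {size:.0f} GB of weights, "
        f"{vs_filtered:.2f}x the filtered text, {vs_raw:.3f}x the raw text"
    )
```

Under these assumptions the weights come out larger than the filtered text at 32-bit precision and smaller at 16-bit, so which side of the "model vs. training set" comparison you land on depends on the precision you pick.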

beebarfbadger@lemmy.world · 12 days ago

Again: nobody is complaining that you can make AI spit out its training data because AI is the only source of that training data. That is not the issue and nobody cares about AI as a delivery source of pirated material. The issue is that next to the transformed output, the not-transformed input is in use in a commercial product.

ClamDrinker@lemmy.world · 12 days ago

This would be a good point, if this is what the explicit purpose of the AI was. Which it isn’t. It can quote certain information verbatim despite not containing that data verbatim, through the process of learning, for the same reason we can.

I can ask you to quote famous lines from books all day as well. That doesn’t mean that you knowing those lines means you infringed on copyright. Now, if you were to put those to paper and sell them, you might get a cease and desist or a lawsuit. Therein lies the difference. Your goal would be explicitly to infringe on the specific expression of those words. Any human that would explicitly try to get an AI to produce infringing material… would be infringing. And unknowing infringement… well there are countless court cases where both sides think they did nothing wrong.

You don’t even need AI for that, if you followed the Infinite Monkey Theorem and just happened to stumble upon a work falling under copyright, you still could not sell it even if it was produced by a purely random process.

Another great example is the Mona Lisa. Most people know what it looks like and if they had sufficient talent could mimic it 1:1. However, there are numerous adaptations of the Mona Lisa that are not infringing (by today’s standards), because they transform the work to the point where it’s no longer the original expression, but a re-expression of the same idea. Anything less than that is pretty much completely safe infringement wise.

You’re right though that OpenAI tries to cover their ass by implementing safeguards. Which is to be expected because it’s a legal argument in court that once they became aware of situations they have to take steps to limit harm. They can indeed not prevent it completely, but it’s the effort that counts. Practically none of that kind of moderation is 100% effective. Otherwise we’d live in a pretty good world.

mm_maybe@sh.itjust.works · 12 days ago

Y’all should really stop expecting people to buy into the analogy between human learning and machine learning i.e. “humans do it, so it’s okay if a computer does it too”. First of all there are vast differences between how humans learn and how machines “learn”, and second, it doesn’t matter anyway because there is lots of legal/moral precedent for not assigning the same rights to machines that are normally assigned to humans (for example, no intellectual property right has been granted to any synthetic media yet that I’m aware of).

That said, I agree that “the model contains a copy of the training data” is not a very good critique–a much stronger one would be to simply note all of the works with a Creative Commons “No Derivatives” license in the training data, since it is hard to argue that the model checkpoint isn’t derived from the training data.

FatCrab@lemmy.one · 12 days ago

ML techniques have been very useful in compression, yes, but it’s sort of nuts to say that a data structure that encodes only relationships between values (sometimes overly so for certain regions of its latent space/embedding space/semantics space/whatever you want to call it right now), rather than the value sequences themselves, is storing contiguous copyright-protected works, i.e. storing particularized creative works in a particularly identifiable manner.

GiveMemes@jlai.lu · edit-2 12 days ago

Except that, again, as is literally written in the comment you’re directly replying to, it has been shown that AI can reproduce copyrightable works word for word, showing that it objectively and necessarily is storing particular creative works in a particularly identifiable manner, whether or not that manner is yet known to humans.

FatCrab@lemmy.one · 12 days ago

No, it isn’t storing that information in that sequence. What is happening is that it is overly encoding those particular sequential relationships along some arbitrary but tightly mapped semantic concepts represented by dimensions in a massive vector space. It is storing copies of the information in the way that inadvertent copying of music might be based on “memorized” music listened to by the infringing artist in the past.

GiveMemes@jlai.lu · edit-2 12 days ago

Not what I said. I used the exact language the above commenter used because it was specific and accurate. Also, inadvertent copyright violation is still copyright violation under US law. I’m not the biggest fan of every application of that law, but the ability to keep large corporations from ripping off small artists and creators is one that I think is good and useful under the global economic system we live under currently.

FatCrab@lemmy.one · 12 days ago

Yes, inadvertent copying is still copying, but it would be copying in the output and is not evidence of copying happening in the creation of the model. That was why I used the music example, because it is rather probative of where there could be grounds for copyright infringement related to these model architectures. This may not seem an important distinction, but it has significant consequences on who is ultimately liable and how.

Hackworth@lemmy.world · 12 days ago

It’s called learning, and I wish people did more of it.

sugar_in_your_tea@sh.itjust.works · 12 days ago

You don’t learn by memorizing and reproducing works, you learn by understanding the concepts in various works and producing new works that are combinations of the ideas in those other works. AI doesn’t understand, and it has been shown to be able to reproduce works, so I think it’s fair to say that it’s doing a lot of “memorizing” and therefore plagiarism.

Hackworth@lemmy.world · edit-2 12 days ago

Calling what attention transformers do memorization is wildly inaccurate.

*Unless we’re talking about semantic memory.

sugar_in_your_tea@sh.itjust.works · edit-2 12 days ago

Is it though? People memorize things very differently than computers do, but the actual mechanism of storage isn’t particularly important. What’s important is the net result. Whether it uses Bayesian networks (what we used in class for small-scale NLP), neural networks (what I assume LLMs use), or something else doesn’t particularly matter.

For example, a search engine typically only stores keywords and relationships, so there’s no way for it to reproduce an entire work (ignoring, of course, the “caching” features some search engines have). All it does is associate keywords with source material, so there’s a strong argument that it falls under fair use.

LLMs, on the other hand, process entire works and keep more than just keywords, and they store it in such a way that entire works can be recovered if coaxed. My understanding is that they break up words into something like sets of phonemes, and then queries do a similar break-up as input to the neural network to produce an output, which is then reassembled into text. But that’s my relatively naive understanding of how it all works (I’ve only done university level NLP, and that was years ago), but again, that’s really not the point here. The point is that it uses a lot more of the work than the typical understanding of “fair use,” and if copyrighted works can be reproduced by it, then the copyrighted work is “stored” in some fashion, so it can be thought of as a really complex form of compression, with tricky retrieval mechanisms. So in layman’s terms, it’s “memorizing” entire works in a way not entirely unlike a “mind palace”, and to reproduce a given work, you need the right input to follow the right steps, but a slightly different input will lead to a very different output (i.e. maybe something with similar content, but no copyright violations).

What’s at issue isn’t whether the LLM is likely to reproduce entire works, but whether it can and does, which would mean it’s violating fair use standards.
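The "break up words into pieces" step described in the comment above is usually called tokenization, and the pieces are subword chunks of text rather than phonemes. A minimal sketch of the idea follows, assuming a made-up vocabulary and a simple greedy longest-match rule. Real models learn their vocabulary from data (for example with byte-pair encoding), so both `TOY_VOCAB` and `tokenize` here are hypothetical and only illustrate the general shape of the step.

```python
# Toy subword tokenizer: greedy longest-match against a fixed vocabulary.
# TOY_VOCAB is invented for illustration; real vocabularies are learned from data.

TOY_VOCAB = {"un", "copy", "right", "ed", "token", "s", " "}


def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text into the longest vocabulary pieces, left to right.

    Any character that matches no vocabulary entry becomes its own token.
    """
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


pieces = tokenize("uncopyrighted tokens", TOY_VOCAB)
print(pieces)
```

Joining the pieces back together gives the original string, which is the "reassembled into text" half of the description. What the network does between those two steps is a separate question that this sketch does not touch.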

lettruthout@lemmy.world · 13 days ago

If they can base their business on stealing, then we can steal their AI services, right?

LibertyLizard@slrpnk.net · 13 days ago

Pirating isn’t stealing but yes the collective works of humanity should belong to humanity, not some slimy cabal of venture capitalists.

General_Effort@lemmy.world · 13 days ago

Yes, that’s exactly the point. It should belong to humanity, which means that anyone can use it to improve themselves. Or to create something nice for themselves or others. That’s exactly what AI companies are doing. And because it is not stealing, it is all still there for anyone else. Unless, of course, the copyrightists get their way.

masterspace@lemmy.ca · 13 days ago

How do you feel about Meta and Microsoft who do the same thing but publish their models open source for anyone to use?

lettruthout@lemmy.world · 13 days ago

Well how long do you think that’s going to last? They are for-profit companies after all.

sentientity@lemm.ee · edit-2 11 days ago

Disagree. These companies are exploiting an unfair power dynamic they created that people can’t say no to, to make an ungodly amount of money for themselves without compensating people whose data they took without telling them. They are not creating a cool creative project that collaboratively comments on or remixes what other people have made, they are seeking to gobble up and render irrelevant everything that they can, for short term greed. That’s not the scenario these laws were made for. AI hurts people who have already been exploited and industries that have already been decimated. Copyright laws were not written with this kind of thing in mind. There are potentially cool and ethical uses for AI models, but OpenAI and Google are just greed machines.

Edited * THRICE because spelling. oof.

infinite_ass@leminal.space · 11 days ago

AI has ideas? That’s a bit of a philosophical stretch.

TommySoda@lemmy.world · edit-2 13 days ago

Here’s an experiment for you to try at home. Ask an AI model a question, copy a sentence or two of what they give back, and paste it into a search engine. The results may surprise you.

And stop comparing AI to humans but then giving AI models more freedom. If I wrote a paper I’d need to cite my sources. Where the fuck are your sources ChatGPT? Oh right, we’re not allowed to see that but you can take whatever you want from us. Sounds fair.

fmstrat@lemmy.nowsci.com · 12 days ago

This is the catch with OP’s entire statement about transformation. Their premise is flawed, because the next most likely token is usually the same word the author of a work chose.
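The "next most likely token" point can be illustrated with a deliberately tiny stand-in. The sketch below is a word-level bigram counter, which is far simpler than an LLM and is used here only as an assumption-laden toy: it is trained on a single made-up sentence and always picks the most frequent next word. The names `train_bigram` and `greedy_continue` are hypothetical helpers, not anything from a real library.

```python
# Toy next-word predictor: count which word follows which, then always pick
# the most frequent follower (greedy decoding). This is a bigram model, not an
# LLM; it only shows why "most likely next word" can echo the source text.

from collections import Counter, defaultdict


def train_bigram(words: list[str]) -> dict[str, Counter]:
    """Map each word to a Counter of the words that follow it."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def greedy_continue(counts: dict[str, Counter], start: str, max_words: int) -> list[str]:
    """Extend `start` by repeatedly appending the most frequent next word."""
    out = [start]
    while len(out) < max_words and out[-1] in counts:
        most_likely = counts[out[-1]].most_common(1)[0][0]
        out.append(most_likely)
    return out


corpus = "you would not download a car".split()
model = train_bigram(corpus)
generated = greedy_continue(model, "you", len(corpus))
print(" ".join(generated))
```

With one short training sentence, the most likely continuation is the sentence itself. How often that happens with a large model trained on a huge, varied corpus is exactly what the thread is arguing about, and this toy does not settle it.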

TommySoda@lemmy.world · 12 days ago

And that’s kinda my point. I understand that transformation is totally fine but these LLMs literally copy and paste shit. And that’s still if you are comparing AI to people which I think is completely ridiculous. If anything these things are just more complicated search engines with half the usefulness. If I search online about how to change a tire I can find some reliable sources to do so. If I ask AI how to change a tire it would just spit something out that might not even be accurate and I’d have to search again afterwards just to make sure what it told me was even accurate.

It’s just a word calculator based on information stolen from people without their consent. It has no original thought process so it has no way to transform anything. All it can do is copy and paste in different combinations.

BarqsHasBite@lemmy.ca · 13 days ago

Can you just give us the TLDR?

superkret@feddit.org · 12 days ago

AI Chat bots copy/paste much of their “training data” verbatim.

azuth@sh.itjust.works · 13 days ago

It’s not a breach of copyright or other IP law not to cite sources on your paper.

Getting your paper rejected for lacking sources is also not infringing on your freedom. Being forced to pay damages and delete your paper from any public space would be infringement of your freedom.

TommySoda@lemmy.world · 13 days ago

I mean, you’re not necessarily wrong. But that doesn’t change the fact that it’s still stealing, which was my point. Just because laws haven’t caught up to it yet doesn’t make it any less of a shitty thing to do.

Octopus1348@lemy.lol · 12 days ago

When I analyze a melody I play on a piano, I see that it reflects the music I heard that day or sometimes, even music I heard and liked years ago.

Having parts similar or a part that is (coincidentally) identical to a part from another song is not stealing and does not infringe upon any law.

takeda@lemmy.world · 12 days ago

You guys are missing a fundamental point. Copyright was created to protect an author for a specific amount of time so somebody else doesn’t profit from their work, essentially stealing their deserved revenue.

LLM AI was created to do exactly that.

ContrarianTrail@lemm.ee · edit-2 12 days ago

The original source material is still there. They just made a copy of it. If you think that’s stealing then online piracy is stealing as well.

TommySoda@lemmy.world · 12 days ago

Well they make a profit off of it, so yes. I have nothing against piracy, but if you’re reselling it that’s a different story.

EldritchFeminity@lemmy.blahaj.zone · 13 days ago

The argument that these models learn in a way that’s similar to how humans do is absolutely false, and the idea that they discard their training data and produce new content is demonstrably incorrect. These models can and do regurgitate their training data, including copyrighted characters.

And these things don’t learn styles, techniques, or concepts. They effectively learn statistical averages and patterns and collage them together. I’ve gotten to the point where I can guess what model of image generator was used based on the same repeated mistakes that they make every time. Take a look at any generated image, and you won’t be able to identify where a light source is because the shadows come from all different directions. These things don’t understand the concept of a shadow or lighting, they just know that statistically lighter pixels are followed by darker pixels of the same hue and that some places have collections of lighter pixels. I recently heard about an AI that scientists had trained to identify pictures of wolves that was working with incredible accuracy. When they went in to figure out how it was telling wolves from dogs like huskies so well, they found that it wasn’t even looking at the wolves at all. 100% of the images of wolves in its training data had snowy backgrounds, so it was simply searching for concentrations of white pixels (and therefore snow) in the image to determine whether or not a picture was of wolves.
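The wolf-versus-husky story is an example of a model picking up a shortcut from its training data. A minimal sketch of what such a shortcut looks like is below. It is a hand-written rule, not a trained model, and the tiny grayscale "images", the brightness cutoff, and the threshold are all invented for illustration.

```python
# Toy "snow shortcut" classifier: it never looks at the animal, only at how
# much of the image is bright. The images are small grids of grayscale values
# (0 = black, 255 = white) made up for this example.

WHITE_CUTOFF = 200  # assumed brightness above which a pixel counts as "snow"


def white_fraction(image: list[list[int]]) -> float:
    """Fraction of pixels at or above the brightness cutoff."""
    pixels = [value for row in image for value in row]
    return sum(1 for value in pixels if value >= WHITE_CUTOFF) / len(pixels)


def snow_shortcut_classifier(image: list[list[int]], threshold: float = 0.5) -> str:
    """Label an image 'wolf' if it is mostly bright background, else 'dog'."""
    return "wolf" if white_fraction(image) > threshold else "dog"


# A dark animal in the middle of a bright, snowy frame.
husky_on_snow = [
    [255, 255, 255, 255],
    [255, 90, 80, 255],
    [255, 255, 255, 255],
]

# A similar animal on a dark, grassy background.
wolf_on_grass = [
    [40, 60, 50, 45],
    [55, 120, 110, 50],
    [45, 50, 60, 40],
]

print(snow_shortcut_classifier(husky_on_snow))
print(snow_shortcut_classifier(wolf_on_grass))
```

On data where every wolf photo has snow, a rule like this would look highly accurate, and it mislabels both examples here as soon as the background stops matching the animal.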

Riccosuave@lemmy.world · 13 days ago

Even if they learned exactly like humans do, like so fucking what, right!? Humans have to pay EXORBITANT fees for higher education in this country. Arguing that your bot gets socialized education before the people do is fucking absurd.

Eatspancakes84@lemmy.world · 12 days ago

I am also not really getting the argument. If I as a human want to learn a subject from a book I buy it (or I go to a library that paid for it). If it’s similar to how humans learn, it should cost equally much.

The issue is of course that it’s not at all similar to how humans learn. It needs VASTLY more data to produce something even remotely sensible. Develop AI that’s truly transformative, by making it as efficient as humans are in learning, and the cost of paying for copyright will be negligible.

stephen01king@lemmy.zip · 12 days ago

If I as a human want to learn a subject from a book I buy it (or I go to a library that paid for it). If it’s similar to how humans learn, it should cost equally much.

You’re on Lemmy where people casually say “piracy is morally the right thing to do”, so I’m not sure this argument works on this platform.

Eatspancakes84@lemmy.world · edit-2 12 days ago

I know my way around the Jolly Roger myself. At the same time using copyrighted materials in a commercial setting (as OpenAI does) shouldn’t be free.

Blaster M@lemmy.world · 12 days ago

Imagine if you had blinders and earmuffs on for most of the day, and only once in a while were you allowed to interact with certain people and things. Your ability to communicate would be truncated to only what you were allowed to absorb.

Dran@lemmy.world · 13 days ago

Devil’s Advocate:

How do we know that our brains don’t work the same way?

Why would it matter that we learn differently than a program learns?

Suppose someone has a photographic memory, should it be illegal for them to consume copyrighted works?

EldritchFeminity@lemmy.blahaj.zone · 12 days ago

Because we’re talking pattern recognition levels of learning. At best, they’re the equivalent of parrots mimicking human speech. They take inputs and output data based on the statistical averages from their training sets - collaging pieces of their training into what they think is the right answer. And I use the word think here loosely, as this is the exact same process that the Gaussian blur tool in Photoshop uses.

This matters in the context of the fact that these companies are trying to profit off of the output of these programs. If somebody with an eidetic memory is trying to sell pieces of works that they’ve consumed as their own - or even somebody copy-pasting bits from CliffsNotes - then they should get in trouble; the same as these companies.

Given A and B, we can understand C. But an LLM will only be able to give you AB, A(b), and B(a). And they’ve even been just spitting out A and B wholesale, proving that they retain their training data and will regurgitate the entirety of copyrighted material.

nek0d3r@lemmy.world · 12 days ago

Generative AI does not work like this. They’re not like humans at all, it will regurgitate whatever input it receives, like how Google can’t stop Gemini from telling people to put glue in their pizza. If it really worked like that, there wouldn’t be these broad and extensive policies within tech companies about using it with company sensitive data like protection compliances. The day that a health insurance company manager says, “sure, you can feed ChatGPT medical data” is the day I trust genAI.

helenslunch@feddit.nl · 13 days ago

Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology.

Or maybe they’re not talking about copyright law. They’re talking about basic concepts. Maybe copyright law needs to be brought into the 21st century?

kibiz0r@midwest.social · 13 days ago

Not even stealing cheese to run a sandwich shop.

Stealing cheese to melt it all together and run a cheese shop that undercuts the original cheese shops they stole from.

TheKMAP@lemmynsfw.com · 12 days ago

Whatever happened to copying isn’t stealing?

I think the crux of the conversation is whether or not the world is better with ChatGPT. I say yes. We can tackle the disinformation in another effort.

calcopiritus@lemmy.world · 12 days ago

When you copy to consume yourself it’s way different than when you copy to sell the copy for a lower price.

TheKMAP@lemmynsfw.com · 12 days ago

They’re not selling the copy, bruh. They’re selling a technology that very few understand. Smart people pretend they get it, but they don’t. That’s how rare the math is.

rainynight65@feddit.org · edit-2 12 days ago

Generative AI is not ‘influenced’ by other people’s work the way humans are. A human musician might spend years covering songs they like and copying or emulating the style, until they find their own style, which may or may not be a blend of their influences, but crucially, they will usually add something. AI does not do that. The idea that AI functions the same as human artists, by absorbing influences and producing their own result, is not only fundamentally false, it is dangerously misleading. To portray it as ‘not unethical’ is even more misleading.

31337@sh.itjust.works · 12 days ago

Production AI is highly tuned by training data selection and human feedback. Every model has its own style that many people helped tune. In the open model world there are thousands of different models targeting various styles. Waifu Diffusion and GPT-4chan, for example.

rainynight65@feddit.org · edit-2 12 days ago

Sure, training data selection impacts the output. If you feed an AI nothing but anime, the images it produces will look like anime. If all it knows is K-pop, then the music it puts out will sound like K-pop. Tweaking a computational process through selective input is not the same as a human being actively absorbing stimuli and forming their own, unique response.

AI doesn’t have an innate taste or feeling for what it likes. It won’t walk into a second hand CD store, browse the boxes, find something that’s intriguing and check it out. It won’t go for a walk and think “I want to take a photo of that tree there in the open field”. It won’t see or hear a piece of art and think “I’d like to learn how to paint/write/play an instrument like that”. And it will never make art for the sake of making art, for the pure enjoyment that is the process of creating something, irrespective of who wants to see or hear the result. All it is designed to do is regurgitate an intersection of what it knows that best suits the parameters of a given request (aka prompt). Actively learning, experimenting, practicing techniques, trying to emulate specific techniques of someone else - making art for the sake of making art - is a key component to humans learning from others and being influenced by others.

So the process of human learning and influencing, and the selective feeding of data to an AI to ‘tune’ its output are entirely different things that cannot and should not be compared.

HereIAm@lemmy.world · 12 days ago

“This process is akin to how humans learn… The AI discards the original text, keeping only abstract representations…”

Now I sail the high seas myself, but I don’t think Paramount Studios would buy anyone’s defence that they were only pirating its movies to learn the general content so they could produce their own knockoff.

Yes artists learn and inspire each other, but more often than not I’d imagine they consumed that art in an ethical way.

mriormro@lemmy.world · 12 days ago

You know, those obsessed with pushing AI would do a lot better if they dropped the patronizing tone in every single one of their comments defending them.

It’s always fun reading “but you just don’t understand”.

bitchkat@lemmy.world · 12 days ago

I absolutely would download a car.