FaceDeer

FaceDeer@fedia.io · 2 days ago

Especially because seeing the same information in different contexts helps mapping the links between the different contexts and helps dispel incorrect assumptions.

Yes, but this is exactly the point of deduplication - you don’t want identical inputs, you want variety. If you want the AI to understand the concept of cats you don’t keep showing it the same picture of a cat over and over, all that tells it is that you want exactly that picture. You show it a whole bunch of different pictures whose only commonality is that there’s a cat in it, and then the AI can figure out what “cat” means.

They need to fundamentally change big parts of how learning happens and how the algorithm learns to fix this conflict.

Why do you think this?

FaceDeer@fedia.io · 3 days ago

There actually isn’t a downside to de-duplicating data sets, overfitting is simply a flaw. Generative models aren’t supposed to “memorize” stuff - if you really want a copy of an existing picture there are far easier and more reliable ways to accomplish that than giant GPU server farms. These models don’t derive any benefit from drilling on the same subset of data over and over. It makes them less creative.

I want to normalize the notion that copyright isn’t an all-powerful fundamental law of physics like so many people seem to assume these days, and if I can get big companies like Meta to throw their resources behind me in that argument then all the better.

FaceDeer@fedia.io · edit-2 3 days ago

Remember when piracy communities thought that the media companies were wrong to sue switch manufacturers because of that?

It baffles me that there’s such an anti-AI sentiment going around that it would cause even folks here to go “you know, maybe those litigious copyright cartels had the right idea after all.”

We should be cheering that we’ve got Meta on the side of fair use for once.

look up sample recover attacks.

Look up “overfitting.” It’s a flaw in generative AI training that modern AI trainers have done a great deal to resolve, and even in the cases of overfitting it’s not all of the training data that gets “memorized.” Only the stuff that got hammered into the AI thousands of times in error.

FaceDeer@fedia.io · edit-2 4 days ago

Training an AI does not involve copying anything so why would you think that fair use is even a factor here? It’s outside of copyright altogether. You can’t copyright concepts.

Downloading pirated books to your computer does involve copyright violation, sure, but it’s a violation by the uploader. And look at what community we’re in, are we going to get all high and mighty about that?

FaceDeer@fedia.io · 7 days ago

Does that mean that a country that imports 100% of the oil it burns should be counted as having no emissions?

FaceDeer@fedia.io · 7 days ago

What did I say that implied that? I’m pointing out a contradiction in kilgore’s comment, I’m not adding anything of my own here.

FaceDeer@fedia.io · 8 days ago

Their distribution of books is completely legal.

Corporations just have more money to warp the laws in their favour.

You just contradicted yourself in two sentences.

FaceDeer@fedia.io · 9 days ago

But I think the law is pretty clear, and a precedent calling their use case fair use would be mind blowing. You need new, much more common sense IP legislation that redefines consumer rights in a digital world.

Indeed. I’m a big supporter of IA’s mission, and I’m a big supporter of piracy (copyright has gone insane over the years), but this outcome was obvious from the moment IA did this and it was a mistake for them to fight this fight. They should focus on preservation. Let the EFF handle the lawsuits, and let Library Genesis handle the illegal distribution of books. Everyone focus on what they’re best at.

FaceDeer@fedia.io · 9 days ago

They’re appealing the decision so there’s still opportunity for IA to throw good money after bad on this.

FaceDeer@fedia.io · 11 days ago

Widely accepted, but not legally enforced.

FaceDeer@fedia.io · 11 days ago

Oh now suddenly you care about leaks?

FaceDeer@fedia.io · 11 days ago

A material witness can be prevented from leaving the country.

Note also that I said housed, not imprisoned.

FaceDeer@fedia.io · 11 days ago

Never said that you could. Watching a movie and learning from it are not copying it.

FaceDeer@fedia.io · 11 days ago

There is a middle ground between these two extremes. I don’t see why they can’t be housed on shore. Give them a temporary visa or somesuch.

FaceDeer@fedia.io · 11 days ago

If it’s public posts then what’s the privacy concern? This is stuff that people are deliberately and explicitly making available for all to see.

Also, the last paragraph of the article says that Meta is pausing this initiative after a request from Ireland’s Data Protection Commission. It seems a bit clickbaity to me to be hiding that down at the very end of the article.

FaceDeer@fedia.io · 12 days ago

There’s an ActivityPub plugin, though I haven’t tried it - just Googled it up now. Looks very new and rough.

FaceDeer@fedia.io · 13 days ago

The “glue on pizza” thing wasn’t a result of the AI’s training, the AI was working fine. It was the search result that gave it a goofy answer to summarize.

The problem here is that it seems people don’t really understand what goes into training an LLM or how the training data is used.

FaceDeer@fedia.io · 13 days ago

Netanyahu doing scummy things to cling to power no matter what is indeed a widely expected move.

FaceDeer@fedia.io · 13 days ago

The Internet Archive was distributing unlimited copies of ebooks whose rights were held by major publishers.

The major publishers sued them for distributing copies of ebooks whose rights were held by them.

Yeah, totally unrelated.

FaceDeer@fedia.io · edit-2 13 days ago

Reddit is actually extremely good for AI. It’s a vast trove of examples of people talking to each other.

When it comes to factual data then there are better sources, sure, but factual data has never been the key deficiency of AI. We’ve long had search engines for that kind of thing. What AIs had trouble with was human interaction, which is what Reddit and Facebook are all about. These datasets train the AI to be able to communicate.

If the Fediverse was larger we’d be a significant source of AI training material too. Would be surprised if it’s not being collected already.