ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • pntha@lemmy.world
    1 year ago

    How do we know the ChatGPT models weren’t trained on crawls of the publicly accessible breach forums where private data is known to leak? I imagine the crawlers have some kind of ‘follow webpage attachments, then crawl those too’ function. Surely they have picked up all sorts of leaked data online. Genuine question, though, because I haven’t done any research on this.

    • d3Xt3r@lemmy.nz
      1 year ago

      We don’t, but from what I’ve seen in the past, those sorts of forums either require registration or payment to access the data, and/or some special means to download it (e.g. a BitTorrent link, often hidden behind URL forwarders and captchas so that the uploader can earn some bucks). A simple web crawler wouldn’t be able to access such data.