Credit goes to lemmy.world

  • 43 Posts
  • 6.09K Comments
Joined 2 years ago
Cake day: March 22nd, 2024

  • The “entire internet” is not even that big these days. The Internet Archive, for instance, is on the order of hundreds of petabytes. That’s 10K (or at least fewer than 100K) spinning disks, which is almost trivial for Azure, which has many millions deployed.

    And the actual training runs for text models are in the trillions of tokens; again, chump change data-wise.

    On the other hand, they’d lose a ton of ephemeral data if they scraped fresh for every training run instead of just saving the good stuff. I suppose it’s possible they mass rescrape and filter the content redundantly, but… that seems like a colossal waste?

    Hmm, could be what they do, I guess.
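
For anyone who wants to sanity-check the disk and token numbers above, here is a rough back-of-envelope sketch. The 20 TB per disk, 4 bytes per token, and 10 trillion token figures are assumptions picked for illustration, not numbers from Azure, OpenAI, or the Internet Archive, and the disk count ignores replication and overhead.

```python
# Back-of-envelope storage math. Every constant below is an assumption
# chosen for illustration, not a measured or published figure.
PB = 10**15  # bytes in a petabyte (decimal)
TB = 10**12  # bytes in a terabyte (decimal)

ASSUMED_DISK_BYTES = 20 * TB   # hypothetical capacity of one spinning disk
ASSUMED_BYTES_PER_TOKEN = 4    # rough guess for plain text
ASSUMED_TOKENS = 10 * 10**12   # "trillions of tokens", taken as 10 trillion


def disks_needed(archive_bytes: int, disk_bytes: int) -> int:
    """Raw disk count (ceiling division), ignoring replication and overhead."""
    return -(-archive_bytes // disk_bytes)


def corpus_bytes(tokens: int, bytes_per_token: float) -> float:
    """Approximate raw size of a text corpus in bytes."""
    return tokens * bytes_per_token


if __name__ == "__main__":
    # "Hundreds of petabytes", bracketed here as 100 PB and 500 PB.
    for archive_pb in (100, 500):
        n = disks_needed(archive_pb * PB, ASSUMED_DISK_BYTES)
        print(f"{archive_pb} PB archive: about {n:,} disks")

    size_tb = corpus_bytes(ASSUMED_TOKENS, ASSUMED_BYTES_PER_TOKEN) / TB
    print(f"{ASSUMED_TOKENS:,} tokens: about {size_tb:.0f} TB of text")
```

Under those assumptions the archive works out to 5,000 to 25,000 disks, and the training text to roughly 40 TB, which would fit on a handful of the same disks.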


  • That’s so bizarre to me.

    Why does OpenAI need to crawl your site more than once? Unless it’s fetching search results for some question, can’t they just copy it into their training archive and be done?

    Based on a few insider nuggets and some outside views, I’m increasingly convinced that these huge AI houses are efficiency shitshows. They do not care about internal overutilization, and they don’t optimize or check for rogue bots and bugs. They literally run busywork to keep up the appearance of busy GPUs. And it’s going to catch up to them when the Chinese models have the same capabilities and run on peanuts.