Super interesting post. I'm a bit late to the party, but I work in the space and the obvious reason nobody will say specifically what data they used is that their lawyers won't let them. I've literally had conversations with product counsel about how specific we can be in a paper / blog post about our data sources.
Copyright and privacy law are approximately 3 centuries behind the state of technology, and there are no clear rules about what data you're allowed to use for what. Someone like Google or Microsoft won't just say "we scraped Quora for a Q&A dataset" because scraping Quora is technically against Quora's terms of service, and it's all user-generated data, so there's probably some PII buried in there somewhere. By admitting they used Quora data, they might get sued by Quora, or by Quora users, or EU regulators might decide that since there are EU users on Quora, the dataset is subject to GDPR wipeout rules and no model trained on data that's been deleted by a user can be stored for more than 7 days... etc. etc. I don't think this is a scientific community failing nearly so much as it's a regulatory failing.