Why No Automated Plagiarism Detection For Past Papers?

post by Lao Mein (derpherpize) · 2023-12-12T17:24:31.544Z · LW · GW · 4 comments

This is a question post.


Automated plagiarism detection software is common. But cases like the recent incident with Harvard administrator Gay have shown that egregious cases of plagiarism are still being uncovered. Why would this be the case? Is it really so hard to run plagiarism checks on every paper on Sci-Hub? Has anyone tried?

I am curious since I am currently upskilling for the purposes of technical alignment research and it seems like an interesting project to pursue.

Answers

answer by Shankar Sivarajan · 2023-12-12T21:18:59.497Z · LW(p) · GW(p)

"The total size of Sci-Hub database is about 100 TB."

↑ comment by faul_sname · 2023-12-13T01:41:56.870Z · LW(p) · GW(p)

i.e. $1000-$2000 in drive space, or $20 / day to store on Backblaze if you don't anticipate needing it for more than a couple of months tops.
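The arithmetic behind those figures can be sketched as below. The per-TB prices are assumptions chosen for illustration (roughly $10 to $20 per TB for consumer hard drives, and about $6 per TB per month for object storage); they are not quotes, so check current pricing before relying on them.

```python
# Back-of-the-envelope storage cost for a ~100 TB corpus.
# All prices are ASSUMPTIONS for illustration, not vendor quotes.
CORPUS_TB = 100

DRIVE_USD_PER_TB = (10, 20)     # assumed low/high price per TB of hard drive space
CLOUD_USD_PER_TB_MONTH = 6      # assumed object-storage rate per TB per month
DAYS_PER_MONTH = 30


def drive_cost_range(tb, usd_per_tb=DRIVE_USD_PER_TB):
    """One-off cost range (low, high) in USD for buying drives."""
    low, high = usd_per_tb
    return tb * low, tb * high


def cloud_cost_per_day(tb, usd_per_tb_month=CLOUD_USD_PER_TB_MONTH):
    """Approximate daily cost in USD of renting cloud storage."""
    return tb * usd_per_tb_month / DAYS_PER_MONTH


if __name__ == "__main__":
    print("drives (USD):", drive_cost_range(CORPUS_TB))
    print("cloud (USD/day):", cloud_cost_per_day(CORPUS_TB))
```

Under these assumptions, renting stays cheaper than buying for the first couple of months, which is why the short-term option is worth mentioning at all.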

Replies from: shankar-sivarajan

↑ comment by Shankar Sivarajan (shankar-sivarajan) · 2023-12-13T02:10:35.395Z · LW(p) · GW(p)

You're correct that simply storing the entire database isn't infeasible. But as I understand it, that's large enough that training a model on that is too expensive for most hobbyists to do just for kicks.

Replies from: faul_sname

↑ comment by faul_sname · 2023-12-13T03:12:52.238Z · LW(p) · GW(p)

Depends on how big of a model you're trying to train, and how you're trying to train it.

I was imagining something along the lines of: download the full 100TB torrent, which includes 88M articles, and extract the text of each article ("extract text from a given PDF" isn't super reliable, but it should be largely doable). That should leave you somewhere in the ballpark of 4TB of uncompressed plain text. If you're using a BPE tokenizer, that would leave you with ~1T tokens.
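As a rough check on those numbers, here is the estimate written out. The 88M articles and ~4TB of plain text come from the figures above; the 4 bytes per token is an assumption (a common rule of thumb for BPE tokenizers on English text), and real scientific text with equations and references may tokenize less efficiently.

```python
# Rough size estimate for the extracted text corpus.
ARTICLES = 88_000_000            # from the torrent description above
PLAIN_TEXT_BYTES = 4 * 10**12    # ~4 TB of extracted plain text (ballpark)
BYTES_PER_TOKEN = 4              # ASSUMPTION: rule-of-thumb average for BPE on English


def bytes_per_article(total_bytes, n_articles):
    """Average plain-text size of one article, in bytes."""
    return total_bytes / n_articles


def estimated_tokens(total_bytes, bytes_per_token):
    """Approximate token count for the whole corpus."""
    return total_bytes / bytes_per_token


if __name__ == "__main__":
    kb = bytes_per_article(PLAIN_TEXT_BYTES, ARTICLES) / 1000
    print(f"~{kb:.0f} KB of text per article")
    print(f"~{estimated_tokens(PLAIN_TEXT_BYTES, BYTES_PER_TOKEN):.1e} tokens")
```

The per-article figure (tens of kilobytes of text) is a useful sanity check: it is about what you would expect from a typical paper of several thousand words.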

If you're trying to do the Chinchilla optimality thing, I fully agree that there's no way you're going to be able to do that with the compute budget available to mere mortals. If you're trying to do the "generate embeddings for every paragraph of every paper, do similarity searches, and then on matches calculate edit distance to see if it was literally copy-pasted" thing, I think that'd be entirely doable on a hobbyist budget.

I personally think it'd be a great learning project.
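A minimal sketch of the embed, search, then verify idea described above, using only the standard library. The `embed` function here is a toy stand-in (hashed character trigram counts), not a real embedding model, and the brute-force comparison loop would need to be replaced by an approximate nearest-neighbour index at Sci-Hub scale. The function names and both thresholds are hypothetical choices for illustration.

```python
import math
import zlib

DIM = 256  # size of the toy embedding vector (arbitrary)


def embed(text, n=3, dim=DIM):
    """Toy stand-in for an embedding model: L2-normalized hashed n-gram counts."""
    vec = [0.0] * dim
    normalized = " ".join(text.lower().split())
    for i in range(len(normalized) - n + 1):
        gram = normalized[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_copied(queries, corpus, sim_threshold=0.8, max_edit_fraction=0.1):
    """Return (query_index, corpus_index, distance) for near-verbatim matches.

    Step 1 is a cheap similarity filter; step 2 runs the expensive edit
    distance only on pairs that pass it.
    """
    corpus_vecs = [embed(p) for p in corpus]
    hits = []
    for qi, q in enumerate(queries):
        qv = embed(q)
        for ci, (p, pv) in enumerate(zip(corpus, corpus_vecs)):
            if cosine(qv, pv) < sim_threshold:
                continue
            dist = edit_distance(q, p)
            if dist <= max_edit_fraction * max(len(q), len(p)):
                hits.append((qi, ci, dist))
    return hits
```

The two-stage design is the point: similarity search keeps the number of candidate pairs small, so the quadratic-time edit distance only runs on paragraphs that already look alike. A hit here only means the text is nearly identical; whether it was quoted and cited properly still needs a human to check.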

answer by Dagon · 2023-12-13T00:56:02.271Z · LW(p) · GW(p)

I think there are two reasons it's not more common to retroactively analyze papers and publications for copied or closely-paraphrased segments.

First, it's not actually easy to automate. Current solutions are rife with false positives and require human judgement to reach final conclusions.

Second, and perhaps more importantly, nobody really cares, outside of graded work where the organization is basing your credentials on doing original work (but usually not even that, just semi-original presentation of other works).

It would probably be a minor scandal if any significant papers were discovered to be based on uncredited/un-footnoted other work, but unless it were egregious (in which case it probably would have already been noticed), it's just not that big a deal.

↑ comment by Orual · 2023-12-13T03:05:15.049Z · LW(p) · GW(p)

Distinguishing between a properly cited paraphrase and taking someone's work as your own without sufficient attribution is not trivial even for people. There's a lot of grey area in terms of how closely you can mimic the original before it becomes problematic, and it comes down to a judgement call in the edge cases. (This is largely what I've seen Rufo trying to hang the Harvard admin woman with: paraphrases that maintained a lot of the original wording but were nonetheless clearly cited, which at least to me seem like bad practice but not actually plagiarism in the sense it is generally meant.)

A professor I know fell afoul of an automated plagiarism detector because it pinged on some of her own previous papers on the same subject, and the journal refused to reconsider. It felt very silly, like they were asking her to go through and arbitrarily change the wording she thought was best, just because she had used it before and the computer said so. I think she ultimately ended up submitting to a different journal and it got accepted there.

4 comments

Comments sorted by top scores.

comment by MondSemmel · 2023-12-12T20:24:33.774Z · LW(p) · GW(p)

Typo: plagerism -> plagiarism (4x, incl. in the title)

comment by Buck · 2023-12-12T20:08:28.234Z · LW(p) · GW(p)

I was just thinking about this. I think this would be a good learning experience!

comment by rotatingpaguro · 2023-12-12T17:57:38.005Z · LW(p) · GW(p)

Is plagiarism considered bad everywhere in the world, or is it an American foible? I vaguely recall reading years ago that in China it was not considered bad per se, and that this occasionally caused Chinese students and academics some problems with American academic institutions. However, I did not check the sources at the time or quantify the effect; I was a naive newspaper-reader.

Replies from: derpherpize

↑ comment by Lao Mein (derpherpize) · 2023-12-13T07:04:24.607Z · LW(p) · GW(p)

Standards have been going up over time, so grad students are unironically subjected to higher standards than university professors. I know of professors who have used Google Translate on English papers and published them in Chinese-language journals.
