
Communications of the ACM

ACM News

Did You Really Write That?


Online tools and services use complex machine learning algorithms to help educators, administrators, and others identify unattributed or misattributed works.


Plagiarism, the use of words, thoughts, and ideas without attributing them to their original sources, has been an issue for the academic and scientific communities since the dawn of the written word. To combat this problem, more than a dozen tools and services, powered by complex machine learning algorithms, have been developed over the past two decades to help educators, administrators, and others identify instances of unattributed or misappropriated work.

Indeed, companies such as Viper, iThenticate, Antiplagiarist, DocCop, Grammarly, and Copyscape, among many others, offer solutions to academic institutions, research organizations, and writers. The largest and best-known service is Turnitin, which was founded in 1998, has a repository of 45+ billion pages of digital content (including archived Internet content), and is used by more than 15,000 universities around the globe.

While each service works slightly differently, the operating principle is largely the same: an algorithm compares the text against a content repository made up of published Web content, academic papers, articles and, in some cases, previously submitted student work.

Most of these algorithms are designed to account for mosaic or piecemeal writing (switching the order of the words in a phrase) and completely ignore common articles (‘a’ and ‘the’) and stop words (commonly used words), to ensure relevant text strings can be compared against source material.
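As a rough illustration of that normalization step, here is a minimal Python sketch. It is not any vendor's actual method: the stop-word list, the `normalize` function, and the `same_phrase_ignoring_order` helper are all hypothetical names chosen for this example, and a real service would presumably use a far larger word list and more sophisticated matching.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}


def normalize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def same_phrase_ignoring_order(a, b):
    """Return True when two phrases share the same content words in any order."""
    return sorted(normalize(a)) == sorted(normalize(b))


# Reordering the words or changing articles does not hide the match.
print(same_phrase_ignoring_order("The quick brown fox", "a brown fox, quick"))
print(same_phrase_ignoring_order("The quick brown fox", "The quick red fox"))
```

Sorting the remaining words is the simplest way to make the comparison insensitive to word order; it is a stand-in for whatever technique a given service uses to handle mosaic writing.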

Once this process has been completed, the software will return a similarity match, often expressed as a percentage of the content that matches source material. However, these services are quick to point out their tools simply identify similar content, and cannot and should not be used to definitively determine whether plagiarism has occurred.
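One simple way to arrive at such a percentage can be sketched in Python as follows. This is an assumption-laden illustration, not how Turnitin or any other service computes its score: it breaks the submission into overlapping three-word sequences (often called shingles) and reports the share of them that also appear in a small in-memory list of sources. The function names `shingles` and `similarity_percent`, and the choice of three-word windows, are hypothetical.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping n-word sequences found in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity_percent(submission, sources, n=3):
    """Percentage of the submission's n-word sequences that appear in any source."""
    submitted = shingles(submission, n)
    if not submitted:
        return 0.0  # text too short to compare
    repository = set()
    for source in sources:
        repository |= shingles(source, n)
    return 100.0 * len(submitted & repository) / len(submitted)


sources = ["the cat sat on the mat"]
print(similarity_percent("the cat sat on the mat today", sources))  # 80.0
```

A score like this only says how much text overlaps with known sources. A properly quoted and cited passage would raise it just as much as a copied one, which is why the number alone cannot settle whether plagiarism occurred.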

"There’s no service out there that’s going to be able to catch plagiarism for you," says Jason Chu, education director at Turnitin.  "At the end of the day, the process of actually identifying whether content has been used appropriately or not actually falls on the shoulders of the teacher, the instructor, or the administrator.  It’s really up to them to make the determination as to whether content that’s been identified, what we call matched content, in student work, whether that content has been used appropriately or not."

Nevertheless, Shan Wu, an attorney with Washington, D.C.-based Wu, Grohovsky & Whipple PLLC who focuses on student defense issues, says some instructors at institutions of higher learning are simply relying on the similarity scores to determine whether academic dishonesty has taken place.

"The drawback is the way it’s being used, and I don’t know if it’s not understanding what the software is doing, or maybe it’s laziness on the part of the academic world, where they don’t seem to be properly utilizing it," Wu says.  "I think the critical thing to emphasize is that professors and universities are over-relying just on the data that they get from applications like that."

Scott Siddell, founder of Vericite, another provider of string-matching software, agrees, citing the need to have human educators or administrators review the output of the software to determine whether or not plagiarism has occurred.

"We detect string matches as the initial step, but subjective analysis is absolutely required," Siddell says.

Chu, Siddell, and Wu each emphasized the most valuable way to utilize string-matching software is as a diagnostic tool to help students, researchers, and writers improve their writing, by highlighting instances of missing citations or non-original content, rather than as a punitive tool.

"When you ask the institutions that use our service, particularly on a secondary level, they will tell you that they’ve found our service has helped students to become more mindful on how they’re using information, and how they share that information," Chu says. "Ideally what we’d like to see is students starting to think more critically about how they use source material."

The other main criticism of online string-matching software is based around the process of archiving student and academic papers to build up a repository of content.  Siddell says this process of archiving papers from students and others around the world can be fraught with pitfalls, and students have raised questions on Internet message boards about the ethics of software companies using student-submitted works to generate fees.

"There were even plagiarism services that were actually front ends to paper mills (companies that resell student-submitted papers for a profit)," Siddell says, noting that many early services granted the software company the right to reuse and sell submitted work six months after it was captured. As a result, Vericite only captures and compares papers from each university, rather than across the entire universe of submitted works.

However, Turnitin—which boasts it indexed 137 million papers in 2015 and sees between 400,000 and 600,000 submissions per day—considers its breadth of content a major strength, and says authors retain the copyright to work they have created and submitted.  

Keith Kirkpatrick is principal of 4K Research & Consulting, LLC, based in Lynbrook, NY.

