Attrition.org Errata Project: Detecting Plagiarism

Saturday, November 19, 2011

security curmudgeon

With the recent rash of plagiarism exposure, one of the most frequent questions we get is "how do you find plagiarism?"

Our methodology is home-grown and very simple. We assume that we catch only some of the plagiarism out there, and that our methodology causes us to miss some cases.

Rather than read our layman views on the matter, we encourage you to read the NYU Ethics Handbook written by Professor Adam Penenberg. The entire handbook is worth reading, but you can jump to section 9, "Cardinal Sins", to read about plagiarism.

Before we get into the "how", we want to address a second question and concern: what is plagiarism, and how can I avoid it?

Plagiarism, Infringement and Copyright

First, we must clarify that a lot of our work uncovers copyright violation, not plagiarism. There is a big difference between the two, although both are unethical and unacceptable in academic and professional circles.

Drawing on Wikipedia, here are four important definitions and concepts to understand before discussing the topic:

What is plagiarism?

http://en.wikipedia.org/wiki/Plagiarism:
Plagiarism is defined in dictionaries as the "wrongful appropriation," "close imitation," or "purloining and publication" of another author's "language, thoughts, ideas, or expressions," and the representation of them as one's own original work...

What is copyright?

http://en.wikipedia.org/wiki/Copyright:
A copyright is a set of exclusive rights granted by a state to the creator of an original work or their assignee for a limited period of time upon disclosure of the work. This includes the right to copy, distribute and adapt the work.

What is copyright infringement?

http://en.wikipedia.org/wiki/Copyright_infringement:
Copyright infringement is the unauthorized or prohibited use of works under copyright, infringing the copyright holder's exclusive rights, such as the right to reproduce or perform the copyrighted work, or to make derivative works.

What is 'fair use'?

http://en.wikipedia.org/wiki/Fair_use_doctrine:
In United States copyright law, fair use is a doctrine that permits limited use of copyrighted material without acquiring permission from the rights holders. Examples of fair use include commentary, criticism, news reporting, research, teaching, library archiving and scholarship.

In layman's terms, the difference between plagiarism and copyright violation is that a person who plagiarizes will typically make small edits to the original text or clip pieces from multiple sources to form a new work, without contributing their own original thoughts or material.

A person who takes a large block of text, reprints it in full, and does not take measures to change the text is likely violating copyright. The easiest way to avoid issues like plagiarism and copyright infringement is to cite your sources.

Even if the material is your own, citing the source(s) that influenced your work is a good practice and gives your audience insight into your background as well as additional material if they are interested in the topic.

If the positive aspect of citing sources doesn't appeal to you, let us appeal to the negative: detecting plagiarism and copyright infringement is pretty simple. If someone bothers to check, they will find it if it is there.

How We Find the "Bad Stuff" (TM)

The methodology that we have developed is home-grown and evolved through repetition. Over time, we have picked up several tips and tricks that make plagiarism easier to identify.

In summary, we use basic Google searches for blocks or groups of text, and compare the results to the text being reviewed. Yes, it is rather slow and repetitive, but ultimately very effective. We do not use commercial services such as iThenticate for several reasons, both technical and financial.
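The comparison step described above is done by eye, but it can be approximated in a few lines. The following is a minimal sketch, not our actual tooling: it assumes you have already pasted the text under review and one candidate source into two strings. The function name `shared_passages` and the `min_words` threshold are made up for illustration.

```python
from difflib import SequenceMatcher


def shared_passages(reviewed: str, candidate: str, min_words: int = 8) -> list[str]:
    """Return runs of at least min_words words that appear verbatim in both texts.

    Comparison is case-insensitive and word-based. Punctuation stays attached
    to the words, so a changed comma will break a run.
    """
    a = reviewed.lower().split()
    b = candidate.lower().split()
    # autojunk=False keeps common words from being discarded on long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a : m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


if __name__ == "__main__":
    reviewed = ("In our view the policy should be developed, implemented, "
                "and maintained by the security team every year.")
    candidate = ("Experts agree the policy should be developed, implemented, "
                 "and maintained by the security team at all times.")
    for passage in shared_passages(reviewed, candidate):
        print(passage)
```

A long shared run is only a lead. It says nothing about which text came first or whether the passage was properly quoted and cited, so the manual checks still apply.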

The following list is in no particular order, and contains the basics, tricks, and hurdles in detecting plagiarism.

Googling:

  • Look for distinct phrases that are the least likely to appear elsewhere. Search for a group of technical terms rather than a conversational passage. Keep in mind that exact-phrase searches are sensitive to small edits: searching for "should be developed, implemented, and maintained" will not find original text that used a different tense, such as "should develop, implement, and maintain".
  • Check the author! There are times where material in a book is found to have been published earlier, but by the same person. The re-use of material in books is more common than you may realize.
  • Check the dates! This is perhaps one of the most important aspects of plagiarism review. Finding text from a book in a blog is extremely common. If the book was published in 2008, and the blogs containing the text are posted in 2011, you have found plagiarism or copyright infringement, but in the wrong direction. Google search results often include the date of the article, and become the best way to quickly filter results. You are primarily looking for the same text published before the material in question. However, there are many cases where a plagiarism review will result in finding dozens of blogs or web sites that use the material you are reviewing.
  • Trick: We have found that checking text at the end of a paragraph or toward the end of a chapter is better. We speculate that an author starts out with noble intentions, begins writing their own material, and then gets lazy. A couple paragraphs of their own work to start, a couple paragraphs of plagiarized text to end.
  • Trick: If Google returns results from books.google.com or scribd.com, weigh those more heavily. We have found that people frequently plagiarize from print books rather than online resources. This may be due to a book being more convenient, or it may be due to the writer thinking that their plagiarism will not be detected.
  • Trick: Before looking at blocks of text, look at the formatting and virtual "typesetting". When someone plagiarizes a significant amount of material, it is frequently simple cut and paste. However, cut and paste leads to extra whitespace (e.g., double spaces between words), carriage return issues (e.g., too many line breaks, or too few, causing extremely long paragraphs), incorrect symbols (e.g., a period instead of a dash or quotes), footnote symbols that are never referenced later, etc. In short, any typo should be investigated.
  • Trick: Look for a change in grammar and style. When a person changes from poor, chopped "Engrish" to a well-spoken master of the English language, there is a reason. This is particularly true for writers with English as a second language (ESL). Similarly, a change in the language and understanding surrounding technical terms can indicate something is wrong.
  • Trick: If the material being reviewed contains any type of code (e.g., C, Perl), the lack of any comments in the code is suspicious. Almost every program will have a header with credit, usage instructions, or something. The lack of these comments frequently means the writer stripped them out to obscure who originally wrote the code.
  • Trick: Search Google for two phrases at a time, not one. For example, searching for "business analytics" "certification accreditation" is more likely to find an exact match to an original source than one phrase alone. If you search for one phrase at a time, leave the Google results up in different tabs, then quickly compare them. Which source appears on both searches? That is more likely to be the original source.
  • Hurdle: The older the text you are examining, the harder it is to find plagiarism. We have examined a few books from around 2003 that very likely have plagiarized material in them. The change of style, formatting errors, change in grammar and other signs mentioned above are present. However, without fail, Google returns dozens or hundreds of results where the material was taken from the book, and the original source cannot be found or is too buried in the search results. Google's filter by date has been largely uncooperative thus far.
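A few of the items above are mechanical enough to sketch in code. This is a rough illustration under stated assumptions, not a description of our process: nothing here queries Google, search results and publication dates are assumed to be copied in by hand, and every function name, threshold, and example.com URL is hypothetical.

```python
import re
from collections import Counter
from datetime import date


def formatting_red_flags(text: str, max_words: int = 300) -> list[str]:
    """Flag two kinds of cut-and-paste residue; max_words is an arbitrary cutoff."""
    flags = []
    # Two or more spaces between non-space characters. This also fires on
    # intentional double spacing after a period.
    if re.search(r"\S {2,}\S", text):
        flags.append("double space between words")
    # Paragraphs are assumed to be separated by blank lines.
    if any(len(para.split()) > max_words for para in text.split("\n\n")):
        flags.append("unusually long paragraph")
    return flags


def paired_query(phrase_a: str, phrase_b: str) -> str:
    """Build a search string with two exact phrases, each in double quotes."""
    return f'"{phrase_a}" "{phrase_b}"'


def common_sources(*result_lists: list[str]) -> list[tuple[str, int]]:
    """Count how many separate searches each URL appeared in (keep 2 or more)."""
    counts = Counter(url for results in result_lists for url in set(results))
    return [(url, n) for url, n in counts.most_common() if n >= 2]


def published_before(hits: dict[str, date], reviewed: date) -> list[str]:
    """Keep only hits dated earlier than the material under review."""
    return sorted(url for url, published in hits.items() if published < reviewed)


if __name__ == "__main__":
    print(paired_query("business analytics", "certification accreditation"))
    print(formatting_red_flags("text that was  pasted in"))
    print(common_sources(
        ["http://example.com/a", "http://example.com/b"],
        ["http://example.com/b", "http://example.com/c"],
    ))
    print(published_before(
        {"http://example.com/b": date(2006, 5, 1),
         "http://example.com/c": date(2011, 3, 9)},
        reviewed=date(2008, 1, 1),
    ))
```

The date filter is only as good as the dates fed into it. As noted in the hurdle above, reliable publication dates are often the hardest part to obtain.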

As you can see, it isn't rocket science. Our method requires being meticulous and persistent, with a healthy dose of booze to maintain sanity. As always, we encourage you to check books and articles for plagiarism.

It typically takes a couple of minutes to find evidence of plagiarism. If you are pressed for time, mail us the details and we can help out.

Cross-posted from Attrition.org
