Who wrote that book? Magda might be able to tell you
Authorship analysis is a research field concerned with a specific type of text analysis: the analysis of an author’s writing style. It addresses various questions about the authorship of a text, including:
“Who among candidate authors wrote a given text?”
“Were given documents written by the same person?”
“Is this document a plagiarism of another text?”
“What is the gender of the author?”
Authorship analysis has applications in forensics, security, plagiarism detection, business intelligence, and literary research.
An interesting example of authorship analysis that recently received media attention was the case of The Cuckoo’s Calling, a detective novel by a first-time writer named Robert Galbraith. Following an anonymous tip, reporters at The Sunday Times investigated whether the real author of the book was J. K. Rowling, the author of the Harry Potter series. Part of the investigation consisted of two automatic stylistic analyses by two experts in author attribution, Patrick Juola of Duquesne University and Peter Millican of Oxford University. Their results indicated that the book was stylistically more similar to Rowling’s previous work than to detective books by three other British female authors. As a consequence, The Sunday Times confronted J. K. Rowling, who admitted that she wrote The Cuckoo’s Calling.
Authorship analysis is the context of most of the doctoral research performed by Magda Jankowska, a third-year PhD student in the Faculty of Computer Science and a member of the Visual Text Analytics group. Together with her supervisors, Dr. Evangelos Milios and Dr. Vlado Keselj, she proposed a visual analytic tool called RNG-Sig for comparing text documents at the sub-word level, and applied it to the analysis of authorship style in literary works.
The tool is based on the Common N-Gram (CNG) classifier, a classification algorithm proposed by Dr. Keselj and his colleagues that has been successfully applied to author detection tasks. The classifier relies on character n-grams: overlapping strings of n consecutive characters from a text. Despite being so low-level and simple, such character strings have proved to be very powerful features for authorship attribution. They also have the merit of being language-independent. RNG-Sig provides insight into the inner workings of the CNG classifier. It presents visually the “reasons” behind the algorithm’s decision, visualizes the dissimilarities between documents at the level of character n-grams, and links these strings of characters (which are often not easy to interpret) to the words or phrases in which they occur. The tool also enables a user to manually adapt the classification process.
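To make the idea of comparing documents by character n-grams more concrete, here is a minimal Python sketch of a CNG-style attribution step. It is an illustration under stated assumptions, not the RNG-Sig implementation: the function names, the n-gram length (3), the profile size (500), and the relative-difference dissimilarity formula are choices made for this example, since the article does not specify them.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of a text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, n=3, size=500):
    """Map the `size` most frequent n-grams to their normalized frequencies.

    n and size are illustrative defaults, not values taken from the article.
    """
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(size)}


def dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over both profiles.

    Assumed formula: an n-gram missing from one profile has frequency 0 there.
    """
    total = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def attribute(unknown_text, candidates, n=3, size=500):
    """Pick the candidate author whose profile is least dissimilar.

    `candidates` maps an author name to a sample of that author's writing.
    """
    unknown = profile(unknown_text, n, size)
    scores = {
        author: dissimilarity(unknown, profile(sample, n, size))
        for author, sample in candidates.items()
    }
    return min(scores, key=scores.get), scores
```

Calling `attribute(text, {"Author A": sample_a, "Author B": sample_b})` returns the closest candidate together with all scores. The per-n-gram terms of the sum are the kind of low-level evidence that a visual tool could surface to explain a decision.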
Magda has also been working on an automatic method for a task called authorship verification. In this task, one is given a small set of texts by a single author and asked whether the same person wrote another document. Magda and her supervisors approached the problem by proposing a measure of similarity between the “unknown” document and the entire set of documents of known authorship. This measure uses the same character n-gram based dissimilarity between documents that the CNG classifier uses.
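The verification setting can be sketched in the same way. The article does not give the actual measure, so everything specific below is an assumption made for illustration: the dissimilarity formula, the way the unknown document is compared against the whole known set (each unknown-to-known dissimilarity is divided by the largest dissimilarity between that known document and the other known documents, and the ratios are averaged), and the threshold of 1.0. All function and parameter names are hypothetical.

```python
from collections import Counter


def profile(text, n=3, size=500):
    """Normalized frequencies of the most frequent character n-grams."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(size)}


def dissimilarity(p1, p2):
    """Assumed CNG-style dissimilarity: squared relative differences."""
    total = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def verification_score(unknown_text, known_texts, n=3, size=500):
    """Average, over known documents, of how far the unknown document is
    from each one relative to the spread within the known set.

    This aggregation is an illustrative assumption, not the published measure.
    """
    if len(known_texts) < 2:
        raise ValueError("need at least two known documents")
    unknown = profile(unknown_text, n, size)
    known = [profile(text, n, size) for text in known_texts]
    ratios = []
    for i, doc in enumerate(known):
        # Largest dissimilarity between this known document and the others.
        spread = max(
            dissimilarity(doc, other)
            for j, other in enumerate(known)
            if j != i
        )
        if spread == 0:
            raise ValueError("known documents have identical profiles")
        ratios.append(dissimilarity(unknown, doc) / spread)
    return sum(ratios) / len(ratios)


def same_author(unknown_text, known_texts, threshold=1.0):
    """Answer 'yes' when the unknown document is no farther from the known
    documents than they are from each other (threshold is hypothetical)."""
    return verification_score(unknown_text, known_texts) <= threshold
```

A score well below the threshold means the unknown document sits inside the stylistic spread of the known set. In practice a threshold like this would have to be tuned on documents with known answers.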
Their method has been submitted to the PAN 2013 competition in author verification. Magda is looking forward to further work on character n-gram based classification and on combining automatic methods with visualizations that enable human insight and interaction.