When Words Won’t Talk, Sentence Structures Spill the Truth

Tags: content
DATE POSTED: March 6, 2025

:::info Authors:

(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;

(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.

:::

Table of Links

Abstract and 1 Introduction and Background

2 Statistical Parsing and Extracted Features

3 Parse Tree Features

4 Classifier

5 Dimension Reduction

6 The Federalist Papers

6.1 Sanditon

7 Conclusions, Discussion, and Future Work

A. A Brief Introduction to Statistical Parsing

B. Dimension Reduction: Some Mathematical Details

References

7 Conclusions, Discussion, and Future Work

As this paper has demonstrated, information drawn from the statistical parsing of a text can be used to distinguish between authors. Several sets of features were considered (all subtrees, rooted subtrees, POS, and POS by level), with varying performance among them. To the authors' knowledge, these features, other than POS, have not been previously considered, including in the large set of features examined in [16]. This suggests that these tree-based features, especially those based on all subtrees, may be beneficially included alongside other features.
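To make the subtree features concrete, here is a minimal sketch of counting rooted subtrees, one of the feature families above. The nested-tuple tree representation and the toy parse are assumptions for illustration; the paper's actual extraction pipeline may differ.

```python
from collections import Counter

def canonical(tree):
    """Serialize a tree (label, child, child, ...) to a bracketed string."""
    if isinstance(tree, str):          # leaf: a word
        return tree
    label, children = tree[0], tree[1:]
    return "(" + label + " " + " ".join(canonical(c) for c in children) + ")"

def subtree_counts(tree, counts=None):
    """Count the rooted subtree hanging below every internal node."""
    if counts is None:
        counts = Counter()
    if isinstance(tree, str):          # words themselves are not subtrees
        return counts
    counts[canonical(tree)] += 1
    for child in tree[1:]:
        subtree_counts(child, counts)
    return counts

# Toy parse of "the dog barks": (S (NP (DT the) (NN dog)) (VP (VBZ barks)))
tree = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBZ", "barks")))
counts = subtree_counts(tree)
```

Aggregating such counts over all sentences in a document yields the (very high-dimensional) feature vector whose growth motivates the dimension-reduction step discussed earlier.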

It appears that the Sanditon texts are easier to classify than The Federalist Papers. Even without dimension reduction, which generally enhances performance, Sanditon classifies well, even using the POS feature vectors, which are not as strong when applied to The Federalist Papers. This is amusing, since the completer of Sanditon attempted to write in an imitative style; it suggests that these structural features are not easily faked.

The methods examined here do not supersede the excellent work on author identification that has previously been done, which usually relies on more obvious features of the document (such as counts of words drawn from some appropriate set). Those features make previous methods easier to compute, but they may also make the author identification easier to spoof. Grammatical parsing provides more subtle features that should be more difficult to spoof.

Another tradeoff is the amount of data needed to extract a statistically meaningful feature vector. The number of trees, and hence the number of feature elements, quickly becomes very large. To be statistically significant, a feature element should have multiple counts. (Recall that for the chi-squared test in classical statistics, a rule of thumb is that at least five counts are needed.) This need to accumulate counts over many features indicates that the method is best applied to large documents.
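The count-threshold heuristic above can be sketched as a simple filter over the raw subtree counts. The subtree names, counts, and threshold variable here are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

# Hypothetical raw counts of subtree types from one document.
raw_counts = Counter({
    "(NP (DT) (NN))": 120,            # common subtree: statistically usable
    "(VP (VBZ) (NP))": 47,
    "(PP (IN) (NP))": 33,
    "(NP (DT) (JJ) (JJ) (NN))": 2,    # rare subtree: too few counts to trust
})

# Rule-of-thumb threshold borrowed from classical chi-squared tests.
MIN_COUNT = 5

reliable = {t: c for t, c in raw_counts.items() if c >= MIN_COUNT}
```

Short documents leave most subtree types below any such threshold, which is one way to see why the method favors large documents.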

In light of these considerations, the method described here may be considered supplemental to more traditional author identification methods.

The method is naturally agnostic to the particular content of a document: it does not require selecting some subset of words to use for comparisons, and so it should be applicable to documents across different styles and genres. The analysis could be applied to any document amenable to statistical parsing. (Documents with a lot of specialized notation, such as mathematical or chemical notation, would likely require adaptation of the parser.)

This paper suggests many possibilities for future work. There is, of course, the question of how the method compares with other work in author identification. It is curious that dimension reduction behaves so differently for the Federalist and Sanditon texts: the Federalist classifies best in smaller dimensions, while Sanditon works better in larger dimensions. Given the recent furor over machine learning, it would be interesting to see whether the features extracted by the grammatical parser correspond in any way to features that would be extracted by an ML tool. (Our suspicion is that training current ML tools does not extract grammatical information applicable to the author identification problem.)

Example rules for a PCFG (see Figure 14.1 of [26]). S = start symbol (sentence); NP = noun phrase; VP = verb phrase; PP = prepositional phrase.
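A PCFG of the kind the caption describes can be written down as a small rule table. The rules and probabilities below are made up for this sketch, not taken from Figure 14.1 of [26]; the key property, that the probabilities of all rules sharing a left-hand side sum to one, is checked at the end.

```python
from collections import defaultdict

# (left-hand side, right-hand side, probability) triples.
rules = [
    ("S",  ("NP", "VP"),  1.0),
    ("NP", ("DT", "NN"),  0.6),
    ("NP", ("NP", "PP"),  0.4),
    ("VP", ("VBZ", "NP"), 0.7),
    ("VP", ("VP", "PP"),  0.3),
    ("PP", ("IN", "NP"),  1.0),
]

totals = defaultdict(float)
for lhs, rhs, p in rules:
    totals[lhs] += p

# In a valid PCFG, the rule probabilities for each left-hand side sum to 1,
# so that the grammar defines a proper distribution over derivations.
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())
```

A statistical parser scores a parse tree by multiplying the probabilities of the rules used in its derivation and returns the highest-scoring tree.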


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
