
Implement Writeprints faithfully, finely, and easily.


Pywriteprints

Pywriteprints aims to provide an API for extracting Writeprints feature sets in several variants (Brennan et al., 2012; McDonald et al., 2012; Overdorf & Greenstadt, 2016). The API mimics that of scikit-learn's text feature extraction classes (e.g., CountVectorizer).

Requirements

  • Python 3.8 or higher
  • spacy 2.3.2 or higher (the en_core_web_sm English model is also needed for POS tagging; see the note after this list)
  • chardet 3.0.4 or higher
  • numpy 1.19.1 or higher
  • langdetect 1.0.8 or higher
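
Pywriteprints tags parts of speech with spaCy's en_core_web_sm model (see the POS_tags attribute below). If the model is not already on your machine, the following sketch downloads it from Python; this assumes the model is neither bundled with pywriteprints nor downloaded automatically.

import spacy

# Make sure the small English model used for POS tagging is available.
# (Assumption: pywriteprints neither bundles nor auto-downloads it.)
try:
    spacy.load("en_core_web_sm")
except OSError:
    from spacy.cli import download
    download("en_core_web_sm")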

Getting Started

from pywriteprints import Writeprints  
  
wp = Writeprints()  
x = wp.process("While winter reigns the earth reposes but these colourless green ideas sleep furiously.")  

# if you want to inspect the feature-value pairs
X = x.todict()  

# if you want to feed X into some ML algorithm
X = x.toarray()

Documentation

Pywriteprints has two classes: the Writeprints class holds the specification and does the heavy lifting, while
the X class handles the output. If you are familiar with scikit-learn, this split will feel familiar.

class Writeprints(input='content',
                  encoding='utf-8',
                  decode_error='strict',
                  feature_set='static',
                  max_ngram=50,
                  stop_words='default',
                  tagset='Penn')

Parameters:

  • input: string {'file', 'content'}, default='content'. If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory. Otherwise the input is expected to be of type str or bytes.
  • encoding: string, default='utf-8'. If bytes or files are given to analyze, this encoding is used to decode them.
  • decode_error: {'strict', 'ignore', 'replace'}, default='strict'. Instruction on what to do if the byte sequence to analyze contains characters not of the given encoding. By default it is 'strict', meaning that a UnicodeDecodeError will be raised; other values are 'ignore' and 'replace'.
  • feature_set: {'static', 'expanded', 'new', 'combined'}, default='static'. The Writeprints variant to extract: 'static' stands for the Writeprints-static feature set used by Brennan et al. (2012), 'expanded' for the Writeprints-expanded feature set used by JStylo (McDonald et al., 2012), 'new' for the Writeprints version used by Overdorf & Greenstadt (2016), and 'combined' for a version comprising all feature subsets of the above variants. Differences between variants are detailed in the Attributes section.
  • max_ngram: int (natural number lower than 20,000), default=50. The maximum number of POS/character/word n-grams to return. max_ngram is capped at 20,000, and a warning is raised if the specified number exceeds the number of possible n-grams of a given type (e.g., there are at most 676 character bigrams). The most frequent POS/character/word n-grams are ranked by their frequency in the Brown corpus (Francis & Kucera, 1979).
  • stop_words: 'default' or a list of str, default='default'. A list of words (str) to be counted as stop words. By default, pywriteprints uses the 512 stop words from Koppel et al. (2005).
  • tagset: {'Penn', 'Universal'}, default='Penn'. Specifies which POS tagset to apply: 'Penn' stands for the Penn Treebank tagset of OntoNotes Release 5.0 (https://catalog.ldc.upenn.edu/LDC2013T19), 55 tags in total; 'Universal' for Universal POS tags version 2, 17 tags in total. (An illustrative constructor call follows this list.)
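
As an illustration, the call below overrides several of the defaults documented above; the settings and the file path are hypothetical, chosen only to sketch how the parameters fit together, and are not taken from the package's own examples.

from pywriteprints import Writeprints

# Extract the Writeprints-expanded feature set from a file-like object,
# keep the 100 most frequent n-grams of each type, and silently drop
# undecodable bytes. (Illustrative settings only.)
wp = Writeprints(input='file',
                 encoding='utf-8',
                 decode_error='ignore',
                 feature_set='expanded',
                 max_ngram=100,
                 stop_words='default')

with open('sample.txt', 'rb') as fh:   # 'sample.txt' is a placeholder path
    x = wp.process(fh)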

Methods:

  • process(input): take raw text or a file-like object (depending on what has been specified in the Writeprints instance) and return an X instance.
  • get_params(): return all the settings of the Writeprints instance (see the usage sketch after this list).
  • get_stop_words(): return the list of stop words used in the Writeprints instance.
  • get_tagset(): return the tagset used in the Writeprints instance.
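
A minimal sketch of inspecting an instance's configuration with these methods; the exact return formats printed here are not guaranteed.

from pywriteprints import Writeprints

wp = Writeprints(feature_set='static')

print(wp.get_params())            # the settings of this instance
print(wp.get_tagset())            # the Penn/OntoNotes tagset by default
print(len(wp.get_stop_words()))   # 512 with the default stop-word list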

Attributes:

  • word_count: total words in a given text, as a part of Writeprints-static/-new/-combined.
  • average_characters_per_word: average number of characters per word in a given text, as a part of all four Writeprints variants.
  • short_word_count: total words shorter than four characters in a given text, as a part of Writeprints-static/-combined.
  • character_count: total number of characters in a given text, as a part of all four Writeprints variants.
  • percentage_digits: percentage of digits over total characters, as a part of Writeprints-static/-combined.
  • uppercase_letters_percentage: percentage of uppercase letters over total characters in a given text, as a part of Writeprints-expanded/-static/-combined.
  • special_characters: frequencies of special characters ("~", "@", "#", "$", "%", "^", "&", "*", "-", "_", "=", "+", ">", "<", "[", "]", "{", "}", "/", "\", "|"), 21 in total, as a part of Writeprints-expanded/-static/-combined.
  • letters: frequency of letters (a-z, case insensitive), 26 in total, as a part of Writeprints-expanded/-static/-combined.
  • digits: frequency of digits in a given text (0,1,...,9), 10 in total, as a part of all four Writeprints variants.
  • hapax_legomena_ratio: hapax legomena over all word tokens, as a part of Writeprints-static/-combined.
  • dis_legomena_ratio: dis legomena over all word tokens, as a part of Writeprints-static/-combined.
  • function_words: frequency of function words in a given text, as a part of all four Writeprints variants. By default, 512 common function words used by Koppel et al. (2005) will be applied.
  • POS_tags: part-of-speech tags extracted by spaCy "en_core_web_sm" model in OntoNotes 5 paradigm, 55 in total, as a part of all four Writeprints variants.
  • punctuation: frequency of punctuation symbols ("...", ".", "!", "?", ",", ";", ":", "'", '"', '“', '”', '‘', '’'), 13 in total, as a part of all four Writeprints variants.
  • digits_percentage: percentage of digits over total characters in a given text, as a part of Writeprints-expanded/-combined.
  • letters_percentage: percentage of letters over total characters in a given text, as a part of Writeprints-expanded/-combined.
  • two_digit_numbers: frequencies of 2 digit numbers (e.g., 11, 99, etc.), as a part of Writeprints-expanded/-new/-combined.
  • three_digit_numbers: frequencies of 3 digit numbers (e.g., 100, 209, etc.), as a part of Writeprints-expanded/-new/-combined.
  • word_lengths: frequency of words of 1-20 letters in length (excluding punctuation), 20 in total, as a part of Writeprints-expanded/-new/-combined.
  • misspelled_words: frequencies of misspelled words out of a list of 5,513 common misspellings, 5,513 in total, as a part of Writeprints-expanded/-new/-combined.
  • top_letter_bigrams: most common letter bigrams (e.g., aa, ab, etc.), case insensitive. Bigrams are taken only within words. 50 in total by default, as a part of all four Writeprints variants.
  • top_letter_trigrams: most common letter trigrams (e.g., aac, abc, etc.), case insensitive. Trigrams are taken only within words. 50 in total by default, as a part of all four Writeprints variants.
  • POS_bigrams: part-of-speech tag bigrams extracted by spaCy "en_core_web_sm" model with OntoNotes 5 paradigm. 50 in total by default, as a part of Writeprints-expanded/-new/-combined.
  • POS_trigrams: part-of-speech tag trigrams extracted by spaCy "en_core_web_sm" model in OntoNotes 5 paradigm. 50 in total by default, as a part of Writeprints-expanded/-new/-combined.
  • words: frequencies of various words in a given text, case insensitive, excluding punctuation, and not crossing sentence boundaries. 50 in total by default, as a part of Writeprints-expanded/-new/-combined.
  • word_bigrams: frequencies of various word bigrams in a given text, case insensitive and excluding punctuation. 50 in total by default, as a part of Writeprints-expanded/-new/-combined.
  • word_trigrams: frequencies of various word trigrams in a given text, case insensitive and excluding punctuation. 50 in total by default, as a part of Writeprints-expanded/-new/-combined.
  • word_lengths_distribution: percentage of word tokens of each length from 1 to 20 characters (excluding punctuation), 20 in total, as a part of Writeprints-new/-combined.
  • vocabulary_richness: unique word types divided by the number of word tokens ("type-token ratio"), as a part of Writeprints-new/-combined (see the illustrative computation after this list).
  • character_percentage: percentage of characters in a given text, as a part of Writeprints-new/-combined.
  • letter_count: count of letters in a given text, as a part of Writeprints-new/-combined.
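
To make a few of these definitions concrete, the sketch below computes hapax_legomena_ratio, dis_legomena_ratio, and vocabulary_richness by hand on a toy text. It only illustrates the definitions above with a naive whitespace tokenizer; pywriteprints' own tokenization may differ.

from collections import Counter

text = "the cat sat on the mat and the dog sat too"
tokens = text.lower().split()      # naive whitespace tokenization
counts = Counter(tokens)

# hapax legomena occur exactly once; dis legomena occur exactly twice
hapax_legomena_ratio = sum(1 for c in counts.values() if c == 1) / len(tokens)
dis_legomena_ratio = sum(1 for c in counts.values() if c == 2) / len(tokens)

# vocabulary richness: unique word types divided by the number of word tokens
vocabulary_richness = len(counts) / len(tokens)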

Class X is a convenience class that renders the result to the user's liking.
Methods:

  • todict(): return a dict keyed by feature name and valued by the corresponding value.
  • toarray(): return a numpy.ndarray instance containing float.
  • get_feature_name(): return a dict of feature names corresponding to X.toarray() (see the example below).
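
For example, the array form can be stacked into a document-by-feature matrix for downstream machine learning, with get_feature_name() keeping track of the column layout; a minimal sketch, assuming one X instance per document.

import numpy as np
from pywriteprints import Writeprints

wp = Writeprints()
docs = ["While winter reigns the earth reposes.",
        "Colourless green ideas sleep furiously."]

# One feature vector per document, stacked into a 2-D matrix.
features = np.vstack([wp.process(doc).toarray() for doc in docs])

# Feature names corresponding to the array layout (returned as a dict).
names = wp.process(docs[0]).get_feature_name()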

References

  • Brennan, M., Afroz, S., & Greenstadt, R. (2012). Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3), 1-22.
  • Francis, W. N., & Kucera, H. (1979). Brown Corpus Manual. Department of Linguistics, Brown University, Providence, Rhode Island, US.
  • Koppel, M., Schler, J., & Zigdon, K. (2005, May). Automatically determining an anonymous author's native language. In International Conference on Intelligence and Security Informatics (pp. 209-217). Springer, Berlin, Heidelberg.
  • McDonald, A. W., Afroz, S., Caliskan, A., Stolerman, A., & Greenstadt, R. (2012, July). Use fewer instances of the letter “i”: Toward writing style anonymization. In International Symposium on Privacy Enhancing Technologies Symposium (pp. 299-318). Springer, Berlin, Heidelberg. (corresponding specification curated at https://github.com/psal/jstylo/blob/bdc5a9e79adb35795819de147bb21ce2908ae45d/jsan_resources/feature_sets/writeprints_expanded.xml)
  • Overdorf, R., & Greenstadt, R. (2016). Blogs, twitter feeds, and reddit comments: Cross-domain authorship attribution. Proceedings on Privacy Enhancing Technologies, 2016(3), 155-171.
