
This package contains the code used for the article https://arxiv.org/abs/1901.00519

Project description

Each part of the code can be used independently as long as the model parameters are correctly set up.

Modules:

  1. get cache

  2. get texts (filtering on criteria happens here)

  3. enrich texts (feature engineering; see the sketch after this list)

    1. raw sequences of punctuation (Fig 1)
    1. heatmaps with punctuation (Fig 1 and Fig 4)
    1. corpus overall info (*)
    1. Histogram of number of documents per author (Fig 2) (*)
    1. comparison of two books using different metrics (*)
    1. Histogram of features f1 and f5 (Fig 3) (*)
    1. scatter plots f1 for two books (Fig 5) (*)
    1. heatmaps of pairwise distances between books (KL divergence or another distance) (Fig 6) (*)
    1. consistency within authors/genre (Fig 7 & Fig 8) (*)
    1. cross correlation between authors (Fig X)
    1. prediction using closest author (Table 2) (*)
    1. prediction using neural net (Table 3 & Table 4) (*)
  4. Genre analysis (summary statistics) (*)

  5. Temporal analysis

    1. Distribution of author dates over time in our corpus (Fig 9)
    1. Mean frequency of punctuation marks versus the middle years of authors. (Fig 10)
    1. Temporal evolution of mean number of words between two consecutive punctuation marks (Fig 11)
    1. Mean frequency of punctuation marks versus publication date for works by (a) Herbert George Wells, (b) Agnes May Fleming, and (c) Charles Dickens. (Fig 12)
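
To make the feature-engineering step concrete, here is a minimal illustrative sketch (not the package's own API; the function names are ours) of extracting the raw sequence of punctuation marks from a text and turning it into relative frequencies, using the punctuation-vector defined in the configuration below.

```python
# Illustrative sketch only, not the package's API: extract the raw sequence of
# punctuation marks from a text and compute their relative frequencies -- the
# kind of features built in the "enrich texts" step. The marks mirror the
# punctuation-vector from conf/punctuation.ini.
from collections import Counter

PUNCTUATION_VECTOR = ['!', '"', '(', ')', ',', '.', ':', ';', '?', '^']

def punctuation_sequence(text):
    """Return the ordered sequence of punctuation marks appearing in `text`."""
    return [ch for ch in text if ch in PUNCTUATION_VECTOR]

def punctuation_frequencies(text):
    """Return the relative frequency of each punctuation mark in `text`."""
    seq = punctuation_sequence(text)
    counts = Counter(seq)
    total = len(seq) or 1
    return {mark: counts[mark] / total for mark in PUNCTUATION_VECTOR}

sample = 'He paused; then, smiling, he said: "Well, well! Shall we go?"'
print(punctuation_sequence(sample))   # [';', ',', ',', ':', '"', ',', '!', '?', '"']
print(punctuation_frequencies(sample))
```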

Set up:

  1. Model Parameters.
  • Set up the punctuation.ini file in conf/punctuation.ini (see the configuration sketch after this list)
    • empirical-nb-words = 40
    • empirical-nb-sentences = 200
    • punctuation-vector = [!, ", (, ), ,, ., :, ;, ?, ^]
    • punctuation-end = [!, ?, ., ^]
    • alpha = abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890
    • punctuation-quotes = ["'","“", "”"]
  1. Training Data.
  • To use the Gutenberg Database

  • To use Digilibraries

    • run data/download_data_digilibraries.py from the data folder
  • To use your own data

    • place the folder containing the documents you would like to use.
    • make sure that you have a pickle file storing a pandas DataFrame in which each row describes one document. Required columns (see the DataFrame sketch after this list):
      • for the author recognition module: 'title', 'book_id', 'author'
      • for the genre recognition module: 'title', 'book_id', 'genre'
      • for the temporal analysis module: 'title', 'book_id', and either 'author_birthdate' and 'author_deathdate' or 'book_date'
  1. Apply Module.
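
For reference, here is a minimal sketch of creating and reading conf/punctuation.ini with the parameters listed above. The section name [punctuation] and the exact list syntax are assumptions, not taken from the package; only the keys and values come from this README.

```python
# Minimal sketch, not the package's own loader: write an example
# conf/punctuation.ini with the parameters listed above and read it back.
# The "[punctuation]" section name and list syntax are assumptions.
import os
from configparser import ConfigParser

EXAMPLE_INI = """[punctuation]
empirical-nb-words = 40
empirical-nb-sentences = 200
punctuation-vector = [!, ", (, ), ,, ., :, ;, ?, ^]
punctuation-end = [!, ?, ., ^]
alpha = abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890
punctuation-quotes = ["'", "\u201c", "\u201d"]
"""

os.makedirs("conf", exist_ok=True)
with open("conf/punctuation.ini", "w", encoding="utf-8") as fh:
    fh.write(EXAMPLE_INI)

config = ConfigParser()
config.read("conf/punctuation.ini", encoding="utf-8")
print(config.getint("punctuation", "empirical-nb-words"))   # 40
print(config.get("punctuation", "punctuation-vector"))      # [!, ", (, ), ,, ., :, ;, ?, ^]
```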
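
If you use your own data, the metadata pickle described above might be built along these lines. This is a sketch, not the package's loader: the file name metadata.pkl and the example row are illustrative; only the required column names come from this README.

```python
# Sketch of building the metadata pickle for your own corpus (assumptions:
# pandas DataFrame, hypothetical file name "metadata.pkl"; only the required
# column names come from this README).
import pandas as pd

metadata = pd.DataFrame(
    [
        {
            "title": "A Tale of Two Cities",
            "book_id": "98",
            "author": "Charles Dickens",
            "genre": "historical fiction",
            "author_birthdate": 1812,
            "author_deathdate": 1870,
            # "book_date": 1859,  # alternative to the author dates for the temporal module
        },
    ]
)

# One row per document; the columns needed depend on the module you run:
#   author recognition: 'title', 'book_id', 'author'
#   genre recognition:  'title', 'book_id', 'genre'
#   temporal analysis:  'title', 'book_id', 'author_birthdate'/'author_deathdate' or 'book_date'
metadata.to_pickle("metadata.pkl")
```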


