This package contains the code used for the article https://arxiv.org/abs/1901.00519
Each part of the code can be used independently as long as the model parameters are correctly set up.
get texts (filtering on criteria happens here)
0. enrich texts (feature engineering)
- raw sequences of punctuation (Fig 1)
- heatmaps with punctuation (Fig 1 and Fig 4)
- corpus overall info (*)
- Histogram of number of documents per author (Fig 2) (*)
- comparing two books different metrics (*)
- Histogram of features f1 and f5 (Fig 3) (*)
- scatter plots f1 for two books (Fig 5) (*)
- heatmaps (KL or other distance?) (Fig 6) (*)
- consistency within authors/genre (Fig 7 & Fig 8) (*)
- cross correlation between authors (Fig X)
- prediction using closest author (Table 2) (*)
- prediction using neural net (Table 3 & Table 4) (*)
Genre analysis (give stats) (*)
- Distribution of author dates over time in our corpus (Fig 9)
- Mean frequency of punctuation marks versus the middle years of authors. (Fig 10)
- Temporal evolution of mean number of words between two consecutive punctuation marks (Fig 11)
- Mean frequency of punctuation marks versus publication date for works by (a) Herbert George Wells, (b) Agnes May Fleming, and (c) Charles Dickens. (Fig 12)
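Several of the items above rely on two simple quantities: the raw sequence of punctuation marks in a text and the number of words between consecutive marks. The following is a minimal sketch of how these could be computed. It is not the package's own implementation; the function names are hypothetical, and the list of marks is copied from the punctuation-vector parameter listed under Model Parameters.

```python
from statistics import mean

# Marks copied from the punctuation-vector parameter in this README.
PUNCTUATION = ["!", '"', "(", ")", ",", ".", ":", ";", "?", "^"]


def punctuation_sequence(text, marks=PUNCTUATION):
    """Return the punctuation marks of `text` in order of appearance."""
    return [ch for ch in text if ch in marks]


def words_before_marks(text, marks=PUNCTUATION):
    """For each mark, count the words since the previous mark (or text start)."""
    counts = []
    current = []  # characters seen since the last mark
    for ch in text:
        if ch in marks:
            counts.append(len("".join(current).split()))
            current = []
        else:
            current.append(ch)
    return counts


sample = 'Well, yes. "No!"'
sequence = punctuation_sequence(sample)
counts = words_before_marks(sample)
average = mean(counts)  # mean number of words per punctuation-delimited span
```

Words after the last mark are ignored in this sketch, and adjacent marks (such as a closing quote right after an exclamation mark) produce a count of zero.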
- Model Parameters.
- Set up the configuration file conf/punctuation.ini
- empirical-nb-words = 40
- empirical-nb-sentences = 200
- punctuation-vector = [!, ", (, ), ,, ., :, ;, ?, ^]
- punctuation-end = [!, ?, ., ^]
- alpha = abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890
- punctuation-quotes = ["'","“", "”"]
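As an illustration of how such a file might be read, here is a minimal sketch using Python's standard configparser. The section name [punctuation] and the helper parse_marks are assumptions made for this example; the actual layout of conf/punctuation.ini may differ.

```python
import configparser

# Hypothetical file contents: the [punctuation] section name is an assumption.
SAMPLE_INI = '''
[punctuation]
empirical-nb-words = 40
empirical-nb-sentences = 200
punctuation-vector = [!, ", (, ), ,, ., :, ;, ?, ^]
punctuation-end = [!, ?, ., ^]
'''


def parse_marks(raw):
    """Turn a bracketed list such as '[!, ?, ., ^]' into ['!', '?', '.', '^'].

    Splitting on comma-plus-space keeps a lone ',' intact as a list item.
    """
    inner = raw.strip()[1:-1]
    return [item.strip() for item in inner.split(", ")]


# interpolation=None keeps values verbatim (no special handling of '%').
config = configparser.ConfigParser(interpolation=None)
config.read_string(SAMPLE_INI)  # for a real file: config.read("conf/punctuation.ini")

section = config["punctuation"]
settings = {
    "nb_words": section.getint("empirical-nb-words"),
    "nb_sentences": section.getint("empirical-nb-sentences"),
    "vector": parse_marks(section["punctuation-vector"]),
    "end": parse_marks(section["punctuation-end"]),
}
```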
- Training Data.
To use Gutenberg Database
To use Digilibraries
- run data/download_data_digilibraries.py from the folder data
To use your own data
- place your folder with the list of documents you would like to use.
- make sure that you have a pickle file storing a Python DataFrame
in which each row describes one document.
- for author recognition module: 'title', 'book_id', 'author'
- for genre recognition module: 'title', 'book_id', 'genre'
- for genre temporal module: 'title', 'book_id', and either 'author_birthdate' and 'author_deathdate' or 'book_date'
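Before running a module on your own data, it can help to check that the required columns are present. The sketch below is one way to do this with the standard library only. The module keys ("author", "genre", "temporal") and the function name are hypothetical, and the temporal rule assumes that either both author dates or a book date is sufficient.

```python
# Each module maps to one or more acceptable column sets (assumed reading
# of the requirements listed above).
REQUIRED_COLUMNS = {
    "author": [{"title", "book_id", "author"}],
    "genre": [{"title", "book_id", "genre"}],
    "temporal": [
        {"title", "book_id", "author_birthdate", "author_deathdate"},
        {"title", "book_id", "book_date"},
    ],
}


def missing_columns(columns, module):
    """Return the columns still needed to satisfy `module`.

    When a module accepts several column sets, the set that is closest to
    being satisfied is used. An empty list means the requirement is met.
    """
    present = set(columns)
    gaps = [required - present for required in REQUIRED_COLUMNS[module]]
    return sorted(min(gaps, key=len))


author_gap = missing_columns(["title", "book_id", "author"], "author")
temporal_gap = missing_columns(["title", "book_id"], "temporal")
```

With a DataFrame loaded from the pickle file, the column names (for example `df.columns` in pandas) can be passed directly as the first argument.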
- Apply Module.