Analyze characters in fictional texts.
Project description
Buskin
Buskin is a Python package for analyzing various attributes of characters in fictional texts. It was developed as part of a project for the terrific Computational Humanities course at UC Berkeley. Buskin's pipeline uses state-of-the-art NLP techniques to extract emotions, characters, character arcs, agent/patient/predicative relations, part-of-speech tags, etc.
We created this package to study character arcs in novels, but we hope it will reduce the effort needed to get started analyzing fictional text for any purpose. We hope Buskin makes it easier to peel open any novel and the characters within it, in all their idiosyncrasies. Over time, we intend to add more features in pursuit of that goal. This is very much a work in progress, and we appreciate any feedback or contributions to the project!
“Plot is no more than footprints left in the snow after your characters have run by on their way to incredible destinations.” ― Ray Bradbury, Zen in the Art of Writing
Contributors : nuwandavek, Dmacracy
Usage
Buskin needs PyTorch (>=1.4), which can be installed by following the instructions at pytorch.org. Once that's done, Buskin can be installed with pip:
pip install buskin
Buskin depends on spacy, torch, and Hugging Face's transformers, among other packages, so installation might take a while.
Many examples can be found in the example_notebooks directory.
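As a minimal quick-start sketch (assuming parse_book is exposed at the top level of the buskin namespace; mobydick.txt is a placeholder path):

```python
from buskin import parse_book

# Parse a plain-text novel with the default models and settings.
book = parse_book("mobydick.txt")

# Characters are ranked by how often they are mentioned (rank 1 = most mentioned).
for character in book.characters[:5]:
    print(character.rank, character.name, len(character.mentions))
```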
Functions
parse_book
parse_book(book_path, batch_size=None, threads=None, max_chunk_size=None, pipeline=None, model=None, tokenizer=None)
Description : Parse a fictional text.
Parameters :
- book_path : str : Path to the `.txt` file of the book
- batch_size : int, optional : Batch size of sentences for emotion classification (default = 8)
- threads : int, optional : Number of threads used to process chunks (default = 5). More threads means faster processing, but can exhaust memory, since neural coreference is memory-heavy
- max_chunk_size : int, optional : Maximum size of the chunks the text is divided into (default = 10,000). Larger chunks improve coreference resolution, but memory is a constraint
- pipeline : Spacy Pipeline, optional : Used to process the text tokens and obtain POS tags, etc. If not provided, a default pipeline is initialized
- model : HuggingFace BertForSequenceClassification model, optional : Model used to predict an emotion for each sentence. If not provided, a default model is initialized
- tokenizer : HuggingFace BertTokenizer, optional : Tokenizer used with the emotion model. If not provided, a default tokenizer is initialized

Returns :
- Book : An instance of the Book class
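On memory-constrained machines, the batching parameters can be tuned down; a hedged sketch (same placeholder path as above):

```python
from buskin import parse_book

# Trade speed for memory: fewer threads and smaller chunks keep the
# memory-heavy neural coreference step manageable.
book = parse_book(
    "mobydick.txt",
    batch_size=4,         # smaller emotion-classification batches (default 8)
    threads=2,            # fewer parallel chunk workers (default 5)
    max_chunk_size=5000,  # smaller coreference chunks (default 10k)
)
```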
load_default_models
load_default_models()
Description : Explicitly initialize the pipeline, model, and tokenizer, for when a batch of books is parsed and you want to avoid re-initializing them for each book.
Returns :
- nlp : Spacy Pipeline : Used to process the text tokens and obtain POS tags, etc.
- model : HuggingFace BertForSequenceClassification model : Model used to predict an emotion for each sentence
- tokenizer : HuggingFace BertTokenizer : Tokenizer used with the emotion model
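A sketch of the intended batch workflow, assuming the three objects are returned as a tuple in the order listed above:

```python
from buskin import load_default_models, parse_book

# Load the Spacy pipeline, emotion model, and tokenizer once...
nlp, model, tokenizer = load_default_models()

# ...and reuse them across a batch of books (placeholder paths).
paths = ["book1.txt", "book2.txt", "book3.txt"]
books = [
    parse_book(path, pipeline=nlp, model=model, tokenizer=tokenizer)
    for path in paths
]
```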
Classes
Book
Book(book_path=None, sentences=None, characters=None)
Attribute | Type | Description |
---|---|---|
book_path | str | Path to the book text file |
sentences | List of Sentence | List of all sentences in the fictional text |
characters | List of Character | List of all characters in the fictional text |
Sentence
Sentence(sentence_id=None, cluster_id=None, global_token_start=None, text=None, token_tags=None, emotion_tags=None)
Attribute | Type | Description |
---|---|---|
sentence_id | int | ID of the sentence |
cluster_id | int | ID of the sentence cluster used for coreference resolution |
global_token_start | int | Global ID of the first token in the sentence |
text | str | Text of the sentence |
token_tags | List of TokenTags | List of tags for each token in the sentence |
emotion_tags | List of Emotion | List of emotion predictions for the sentence |
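For instance, a sketch that prints each sentence alongside its emotion prediction, assuming the first entry of emotion_tags is the top-scoring one:

```python
# `book` comes from parse_book() as in the Usage section.
for sentence in book.sentences[:3]:
    if sentence.emotion_tags:
        top = sentence.emotion_tags[0]  # assumed: highest-probability emotion first
        print(sentence.sentence_id, sentence.text[:60], top.emotion, top.probability)
```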
Character
Character(rank=None, name=None, mentions=None, agents=None, patients=None, predicatives=None)
Attribute | Type | Description |
---|---|---|
rank | int | Rank of the character (1 = most mentioned character) |
name | str | Name of the character |
mentions | List of Occurrence | All occurrences of mentions of the character |
agents | List of Occurrence | All occurrences of verbs with the character as agent |
patients | List of Occurrence | All occurrences of verbs with the character as patient |
predicatives | List of Occurrence | All occurrences of predicatives describing the character |
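As a sketch, the agent and patient verb lists can be contrasted to see whether a character acts or is acted upon (attribute layout as documented above):

```python
# Pick the most-mentioned character via the documented rank field.
protagonist = min(book.characters, key=lambda c: c.rank)

agent_verbs = [occ.text for occ in protagonist.agents]      # verbs the character performs
patient_verbs = [occ.text for occ in protagonist.patients]  # verbs done to the character

print(protagonist.name)
print("does:", agent_verbs[:10])
print("has done to them:", patient_verbs[:10])
```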
TokenTags
TokenTags(token_id=None, token_global_id=None, token=None, lemma=None, pos=None, tag=None, dep=None, head_global_id=None)
Attribute | Type | Description |
---|---|---|
token_id | int | ID of the token within the sentence |
token_global_id | int | Global ID of the token |
token | str | Text of the token |
lemma | str | Lemma of the token |
pos | str | Coarse part-of-speech label of the token |
tag | str | Fine-grained POS tag of the token |
dep | str | Dependency-parse label of the token |
head_global_id | int | Global ID of the token's parse head |
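Since pos holds Spacy's coarse labels, simple corpus-wide filters are easy to sketch:

```python
from collections import Counter

# Count coarse part-of-speech categories across the whole book.
pos_counts = Counter(
    t.pos for sentence in book.sentences for t in sentence.token_tags
)

# Pull out every adjective (Spacy's universal POS label "ADJ").
adjectives = [
    t.token
    for sentence in book.sentences
    for t in sentence.token_tags
    if t.pos == "ADJ"
]
```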
Emotion
Emotion(emotion=None, mini_emotion=None, probability=None)
Attribute | Type | Description |
---|---|---|
emotion | str | Emotion of the sentence (one of 28 values) |
mini_emotion | str | Reduced emotion of the sentence (one of 3 values) |
probability | float | Probability of the emotion, in [0, 1] |
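One way to sketch a character-arc-style signal is to read the reduced emotion of each sentence in order (again assuming the first emotion_tags entry is the top prediction):

```python
# Sentence-by-sentence emotion trajectory; smooth or plot this to study arcs.
trajectory = [
    (s.sentence_id, s.emotion_tags[0].mini_emotion)
    for s in book.sentences
    if s.emotion_tags
]
```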
Occurrence
Occurrence(text=None, sentence_id=None, cluster_id=None, start=None, end=None)
Attribute | Type | Description |
---|---|---|
text | str | Text of the occurrence |
sentence_id | int | ID of the sentence containing the occurrence |
cluster_id | int | ID of the cluster containing the occurrence |
start | int | Start token ID of the occurrence |
end | int | End token ID of the occurrence |
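An occurrence can be mapped back to its surrounding sentence through sentence_id; a sketch that builds an explicit lookup rather than assuming IDs index book.sentences directly:

```python
# Index sentences by ID, then locate the protagonist's first mention in context.
by_id = {s.sentence_id: s for s in book.sentences}

protagonist = min(book.characters, key=lambda c: c.rank)
first_mention = protagonist.mentions[0]

print(first_mention.text, "->", by_id[first_mention.sentence_id].text)
```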
Icon made by Freepik from www.flaticon.com