Tools to parse and search across http://www.cs.cmu.edu/~dbamman/latin.html
What?
This software is intended to be used with the 11K Latin Texts corpus produced by David Bamman ( http://www.cs.cmu.edu/~dbamman/latin.html ). It supports only the plain text formats and the metadata CSV file from the corpus's GitHub repository. It has been tested with Python 3 only. Contributions of new functions or backward-compatibility support are welcome.
How to install?
- From the development version:
Clone the repository:
git clone https://github.com/ponteineptique/archives_org_latin_toolkit.git
Go to the directory:
cd archives_org_latin_toolkit
Install from source (or run python setup.py develop instead for an editable development install):
python setup.py install
- With pip:
Install from pip:
pip install archives_org_latin_toolkit
Example
The following example should run with the data in tests/test_data. The example can be run with python example.py
# We import the main classes from the module
from archives_org_latin_toolkit import Repo, Metadata
from pprint import pprint

# We instantiate a Metadata object and a Repo object
metadata = Metadata("./test/test_data/latin_metadata.csv")
# We want the text to be set in lowercase
repo = Repo("./test/test_data/archive_org_latin/", metadata=metadata, lowercase=True)

# We define the list of tokens we want to search for
tokens = ["ecclesiastico", "ecclesia", "ecclesiis"]

# We create a list to store the results
results = []

# We iterate over the texts containing those tokens.
# Note that we need to unpack the list with *
for text_matching in repo.find(*tokens):
    # For each text, we iterate over the embeddings found in the text.
    # We want 3 words left, 3 words right,
    # and we want to keep the original token (default behaviour)
    for embedding in text_matching.find_embedding(*tokens, window=3, ignore_center=False):
        # We add it to the results
        results.append(embedding)

# We print the result (list of lists of strings)
pprint(results)
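
To make the window and ignore_center parameters more concrete, here is a minimal standalone sketch of what a windowed context search can look like. It is not the toolkit's implementation: the function name window_embeddings, the whitespace tokenization, and the sample sentence are all assumptions made for illustration, and the real find_embedding may tokenize or handle text boundaries differently.

```python
from typing import List


def window_embeddings(text: str, *tokens: str, window: int = 3,
                      ignore_center: bool = False) -> List[List[str]]:
    """Hypothetical stand-in for a windowed search (not the toolkit's code).

    Returns, for every occurrence of one of `tokens`, up to `window` words
    on each side, with or without the matched token itself.
    """
    words = text.split()  # assumption: naive whitespace tokenization
    targets = set(tokens)
    found = []
    for i, word in enumerate(words):
        if word in targets:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            center = [] if ignore_center else [word]
            found.append(left + center + right)
    return found


# Made-up sample sentence, used only to exercise the function
sample = "in concilio de statu ecclesiastico et de pace ecclesia tractatum est"
for context in window_embeddings(sample, "ecclesiastico", "ecclesia", window=3):
    print(context)
# ['concilio', 'de', 'statu', 'ecclesiastico', 'et', 'de', 'pace']
# ['et', 'de', 'pace', 'ecclesia', 'tractatum', 'est']
```

Near the end of a text the right-hand side is simply shorter, as the second result shows. With ignore_center=True the matched token is left out and only the surrounding words are returned.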
Source Distribution
Hashes for archives_org_latin_toolkit-0.0.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 130077d8d151059c19b076c09acdbbd22055d614dfbab7e46cc3df34d0d368fe
MD5 | 88dc885a5eee93bb1563c775e3efe29d
BLAKE2b-256 | a9b3ba968b2bba1712c88303c0e0f67ef015f07ff330e71a532d1efe007dd586
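
If you download the source distribution by hand, you can compare it against the SHA256 digest listed above. The sketch below uses only the Python standard library. The helper names sha256_of_file and matches_published_hash are illustrative, and the path you pass in depends on where you saved the archive.

```python
import hashlib

# SHA256 digest published above for archives_org_latin_toolkit-0.0.2.tar.gz
EXPECTED_SHA256 = "130077d8d151059c19b076c09acdbbd22055d614dfbab7e46cc3df34d0d368fe"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str) -> bool:
    """Check a downloaded archive against the published SHA256 digest."""
    return sha256_of_file(path) == EXPECTED_SHA256
```

For example, matches_published_hash("archives_org_latin_toolkit-0.0.2.tar.gz") should return True for an intact download saved under that name in the current directory.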