
API to access Portuguese Literary Corpus


litcorpt

LITerary CORpus in PorTuguese is an API for accessing a literary corpus in the Portuguese language.

The API provides access to the corpus without the fuss of downloading the data and writing a loader for each type of data source. The corpus is exposed as a simple document database.

How to install

Simply:

pip install litcorpt

Getting started

After installation, in your Python session just run:

import litcorpt
from pprint import pprint as pp
corpus_db = litcorpt.corpus_load()
print(f'There are {len(corpus_db)} documents in corpus')

This loads the whole corpus. On the first run, the library downloads the data from the internet, then processes it and builds the whole dataset.

The download is around 600 MB and is handled automatically by the library. It happens only the first time you load the corpus; after that, the data is loaded from local disk. Loading from disk takes around 6 seconds. This value was measured on my own computer (your mileage may vary).
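The build-once, load-from-disk behaviour described above can be pictured with a small generic sketch. This is not litcorpt's implementation: `load_or_build`, `fake_build` and the JSON cache file are hypothetical names chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_file, build):
    """Return cached data if cache_file exists, otherwise build and store it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        # Later runs: read from local disk only.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    # First run: do the expensive work (download, process) exactly once.
    data = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data


build_calls = []


def fake_build():
    # Hypothetical stand-in for the download-and-process step.
    build_calls.append(1)
    return [{"title": "Example"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "corpus.json"
    first = load_or_build(cache, fake_build)   # builds and writes the cache
    second = load_or_build(cache, fake_build)  # served from the cache file

print(len(build_calls))  # the expensive step ran only once
```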

Basic Usage

Most of the time you just want to retrieve the whole corpus as a list of text documents. You can do that with this one-liner.

corpus = litcorpt.corpus(corpus_db)

This operation simply appends the contents of all documents to a single list, since a document may have more than one content.

Advanced usage

Besides fetching everything at once, many custom queries are possible. You can search by exact matches, regular expressions, and fields.

Book metadata

To retrieve the metadata of all books:

metadata = litcorpt.metadata(corpus_db)

Metadata will be a list of books (model.Book). The i-th element holds the metadata for the i-th text in the corpus variable from the previous example.

You can convert this book metadata to a dictionary with

metadata[0].dict(exclude_none=True, exclude_defaults=True)

or to a JSON object

metadata[0].json(exclude_none=True, exclude_defaults=True)

All book titles of an author (Eça de Queirós)

We are ignoring documents where Queirós is an editor.

As a regular for loop

q = litcorpt.Query()
result = corpus_db.search(q.creator.any(q.name == 'Eça de Queirós'))

titles = []
for document in result:
  titles.append(document['title'])

pp(titles)

As a list comprehension: shorter, but harder to read.

q = litcorpt.Query()
titles = [ document['title'] for document in corpus_db.search(q.creator.any(q.name == 'Eça de Queirós'))]
pp(titles)
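To make the matching rule concrete, here is a plain-Python sketch of what the creator query above selects, under the assumption that `any` means "at least one creator satisfies the condition". The documents are hypothetical stand-ins, not entries taken from corpus_db, and `has_creator` is an illustrative helper, not part of litcorpt.

```python
# Hypothetical stand-in documents; real entries come from corpus_db.
documents = [
    {"title": "Os Maias", "creator": [{"name": "Eça de Queirós"}]},
    {"title": "Dom Casmurro", "creator": [{"name": "Machado de Assis"}]},
    {"title": "Untitled ballad"},  # no creator field at all
]


def has_creator(document, name):
    """True if at least one creator of the document has the given name."""
    creators = document.get("creator") or []
    return any(creator.get("name") == name for creator in creators)


titles = [doc["title"] for doc in documents if has_creator(doc, "Eça de Queirós")]
print(titles)
```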

Building a corpus with Eça de Queirós

q = litcorpt.Query()
search = (q.creator.any(q.name == 'Eça de Queirós'))
queiros_corpus = litcorpt.corpus(corpus_db, search)
pp(queiros_corpus)

Building a bibliography

Here we handle the case where there is no author.

bibliography = []
for document in iter(corpus_db):
    creators = []
    for creator in document.get('creator', [{'name': 'Anonymous'}]):
        creators.append(creator['name'])
    bibliography.append(f'{" and ".join(creators)}. {document["title"].strip()}.')

pp(bibliography)
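The same formatting logic can be tried without loading the corpus. The sketch below uses hypothetical stand-in documents and a hypothetical helper, `bibliography_entry`. It falls back to 'Anonymous' with `or` rather than a `.get` default, so a creator field that is present but empty or None is handled as well.

```python
# Hypothetical stand-in documents; real entries come from corpus_db.
documents = [
    {"title": " Os Maias ", "creator": [{"name": "Eça de Queirós"}]},
    {"title": "Shared work", "creator": [{"name": "First Author"},
                                         {"name": "Second Author"}]},
    {"title": "Untitled ballad"},                 # creator field missing
    {"title": "Old chronicle", "creator": None},  # creator field empty
]


def bibliography_entry(document):
    """Format one entry as 'Creator and Creator. Title.'"""
    creators = document.get("creator") or [{"name": "Anonymous"}]
    names = [creator["name"] for creator in creators]
    return f'{" and ".join(names)}. {document["title"].strip()}.'


entries = [bibliography_entry(document) for document in documents]
for entry in entries:
    print(entry)
```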

Count documents by Author

Here we use Python's Counter to count creator names, and a dict comprehension to keep only the authors that occur at least 5 times. The full counts remain available in the names variable.

As a list comprehension

q = litcorpt.Query()
from collections import Counter
names = Counter([ creator['name'] for document in corpus_db.search(q.creator.exists()) for creator in document['creator'] ])
most_common_names = {name: count for name, count in names.items() if count >= 5}

print(most_common_names)

Unrolling the comprehension

q = litcorpt.Query()
from collections import Counter

names = []

for document in corpus_db.search(q.creator.exists()):
  for creator in document['creator']:
    names.append(creator['name'])

names = Counter(names)

most_common_names = {}
for name, count in names.items():
  if count >= 5:
    most_common_names[name] = count

Extra: Sorting by decreasing frequency, then alphabetically.

sorted(most_common_names.items(), key=lambda item: (-item[1], item[0]))
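A small self-contained run shows how the composite sort key behaves. The names and counts below are made up for illustration. Negating the count sorts it in descending order, while the name breaks ties in ascending order.

```python
from collections import Counter

# Made-up sample; in the README the names come from the corpus.
names = Counter(["Assis", "Queirós", "Assis", "Alencar", "Queirós", "Macedo"])

# Key is (-count, name): highest count first, ties broken alphabetically.
ranked = sorted(names.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
# [('Assis', 2), ('Queirós', 2), ('Alencar', 1), ('Macedo', 1)]
```

Counter.most_common() also orders by count, but it leaves ties in first-seen order, which is why the explicit key is used when an alphabetical tie-break is wanted.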

Display all Subjects

First we group all subjects

q = litcorpt.Query()
subjects = [subject
            for document in iter(corpus_db)
            if document.get('subject') is not None
            for subject in document['subject']]

Then we can count and sort by descending frequency (dicts preserve insertion order in Python 3.7+, and in CPython 3.6 as an implementation detail).

from collections import Counter
subject_frequency = Counter(subjects)
subject_frequency = dict(sorted(subject_frequency.items(), key=lambda item: -item[1]))

And also group the unique items for reference.

subject_list = list(subject_frequency.keys())
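The three steps above can be checked end to end on a few hypothetical stand-in documents. This sketch uses Counter.most_common(), which already returns pairs ordered by descending count, in place of the explicit sorted() call.

```python
from collections import Counter

# Hypothetical stand-in documents; real entries come from corpus_db.
documents = [
    {"title": "A", "subject": ["drama", "women"]},
    {"title": "B", "subject": None},  # subject present but empty
    {"title": "C", "subject": ["drama"]},
    {"title": "D"},                   # subject missing
]

# Step 1: flatten all subjects, skipping documents without any.
subjects = [subject
            for document in documents
            if document.get("subject") is not None
            for subject in document["subject"]]

# Step 2: count, ordered by descending frequency.
subject_frequency = dict(Counter(subjects).most_common())

# Step 3: unique subjects, most frequent first.
subject_list = list(subject_frequency)

print(subject_frequency)
print(subject_list)
```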

Building a corpus given a list of Subjects

First we pick a list of subjects (this is just an example with a few valid entries and one invalid entry).

subjects = [ 'portuguese drama',
             'france',
             'drama',
             'women',
             '<INVALID SUBJECT>' ]

Then we proceed with the search and build the corpus:

q = litcorpt.Query()
result = corpus_db.search(q.subject.any(subjects))
drama_corpus = [book['contents'] for book in result]

If we want, we can easily list the titles in our new drama_corpus:

titles = [ document['title'] for document in result ]
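The effect of passing a list to `any` can be pictured in plain Python, under the assumption that a document matches when it shares at least one subject with the list. Entries that match nothing, such as the invalid one, are then simply ignored. The documents and the `matches_any` helper are hypothetical and not part of litcorpt.

```python
subjects = ["portuguese drama", "france", "drama", "women", "<INVALID SUBJECT>"]

# Hypothetical stand-in documents; real entries come from corpus_db.
documents = [
    {"title": "A", "subject": ["drama", "comedy"]},
    {"title": "B", "subject": ["poetry"]},
    {"title": "C", "subject": ["france", "women"]},
    {"title": "D"},  # no subject field
]


def matches_any(document, wanted):
    """True if the document has at least one subject in wanted."""
    doc_subjects = document.get("subject") or []
    return any(subject in wanted for subject in doc_subjects)


wanted = set(subjects)  # set membership tests are cheap
selected = [d["title"] for d in documents if matches_any(d, wanted)]
print(selected)
```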

Of course, the same can be done with any other field in a document.

Retrieving a document by ID

q = litcorpt.Query()
search  = q.creator.any(q.name == 'Joaquim Manuel de Macedo')
doc_ids = litcorpt.doc_id(corpus_db, search)

for doc_id in doc_ids:
  print(corpus_db.get(doc_id=doc_id)['title'])

Extra

See the 'tests' directory in the source code for more examples.

The structure of a document

The corpus database is a list of documents. A document usually corresponds to a literary work (book, text, play, etc.) and contains the following fields:

Field              Explanation
index              A unique string that identifies the entry internally.
title              The title associated with the entry.
subtitle           Document subtitle (if any).
creator            A list of creators. Each creator contains:
  role             Creator's relationship with the book entry.
  name             Creator's name.
  birth            Creator's birth year.
  death            Creator's death year.
  place            Creator's birthplace.
language           A list of ISO language codes; pt_BR and pt are the most common here. A document can contain several languages, but most have just one.
published          Date of first publication. Multiple editions should use the date of the first edition, except when the document changed substantially (for example, a change of translator or of orthography).
identifier         A globally unique identifier, often an ISBN-13 for books.
original_language  Original language of the document, as an ISO language code.
subject            A list of subjects. For example: fantasy, science-fiction, kids. Always use lowercase.
genre              A list of literary genres: novel, poetry, lyrics, theater. Always use lowercase.
ortography         Reference to the Portuguese orthography used in the document.
abstract           The book's abstract or summary.
notes              Any noteworthy remarks.
contents           Book contents. The text itself.

Caution: Date fields must contain a datetime.date or a string in the format "YYYY-MM-DD".
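As a minimal sketch of the two accepted forms, a date value can be normalized with the standard library before it is stored in a document. The helper name `to_date` is hypothetical and not part of litcorpt.

```python
from datetime import date


def to_date(value):
    """Return a datetime.date from a date object or a 'YYYY-MM-DD' string."""
    if isinstance(value, date):
        return value
    # Raises ValueError for strings that are not valid ISO dates.
    return date.fromisoformat(value)


published = to_date("1888-06-01")
unchanged = to_date(date(1900, 1, 1))

try:
    to_date("01/06/1888")  # wrong format
    rejected = False
except ValueError:
    rejected = True

print(published, unchanged, rejected)
```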

Customizing

By default, the corpus is stored at

${HOME}/litcorpt_data

If you wish to store it somewhere else, set the CORPUS_DATAPATH environment variable in your system configuration. For example, for bash, add this to your ~/.bashrc:

export CORPUS_DATAPATH="/whatever/place/you/want"

Alternatively, create a .env file in your project root with

CORPUS_DATAPATH="/whatever/place/you/want"

This file will be loaded by litcorpt.

Then run your programs that use litcorpt, or start your IPython session, as usual.
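The lookup order described above (environment variable first, then the default under the home directory) can be sketched as follows. This only illustrates the documented behaviour; `resolve_datapath` is a hypothetical helper, it does not read .env files, and litcorpt's own resolution code may differ.

```python
import os
from pathlib import Path


def resolve_datapath(environ=os.environ):
    """Return CORPUS_DATAPATH if set, else the default ~/litcorpt_data."""
    custom = environ.get("CORPUS_DATAPATH")
    if custom:
        return Path(custom)
    return Path.home() / "litcorpt_data"


# Passing a plain dict makes the behaviour easy to check without
# touching the real environment.
custom_path = resolve_datapath({"CORPUS_DATAPATH": "/whatever/place/you/want"})
default_path = resolve_datapath({})

print(custom_path)
print(default_path)
```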

TODO

  • Rewrite DOCSTRINGS using numpy style: https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard

  • From an iterator of books (such as those returned by the new package search functions) return: a. contents only (litcorpt._c) b. metadata only (litcorpt._m)

  • Add the new module search functions to README documentation

  • Check if tests still work with the new book model.

  • Check if functions are returning generators instead of lists.

