Corpora downloader and reader for Spanish sources

Project description

Averell is a Python library and command-line interface that facilitates working with existing repositories of annotated poetry. It can download an annotated corpus and reconcile different TEI entities to provide a unified JSON output at the desired granularity. For their investigations, some researchers might need the entire poem, poems split line by line, or even word by word where that is available. Averell lets you specify the granularity of the final generated dataset, which is a combined JSON with all the entities in the selected corpora. Each corpus in the catalog must specify the parser that produces the expected data format.

  • Free software: Apache Software License 2.0

Available corpora (version 1.0.3)

  id  name                size      docs    words  granularity    license
----  ------------------  ------  ------  -------  -------------  -----------
   1  Disco V2            22M       4088   381539  stanza         CC-BY
                                                   line
   2  Disco V3            28M       4080   377978  stanza         CC-BY
                                                   line
   3  Sonetos Siglo       6.8M      5078   466012  stanza         CC-BY-NC
      de Oro                                       line           4.0
   4  ADSO 100            128K       100     9208  stanza         CC-BY-NC
      poems corpus                                 line           4.0
   5  Poesía Lírica       3.8M       475   299402  stanza         CC-BY-NC
      Castellana Siglo                             line           4.0
      de Oro                                       word
                                                   syllable
   6  Gongocorpus         9.2M       481    99079  stanza         CC-BY-NC-ND
                                                   line           3.0
                                                   word           FR
                                                   syllable
   7  Eighteenth Century  2400M     3084  2063668  stanza         CC
      Poetry Archive                               line           BY-SA
                                                   word           4.0
   8  For Better          39.5M      103    41749  stanza         Unknown
      For Verse                                    line
   9  Métrique en         183M      5081  1850222  stanza         Unknown
      Ligne                                        line

Documentation

https://averell.readthedocs.io/

Installation

To install averell, run this command in your terminal:

pip install averell

This is the preferred method to install averell, as it will always install the most recent stable release.

If you don’t have pip installed, the Python installation guide can walk you through the process.

Usage

To show averell help:

averell --help

To list all available corpora:

averell list

Example of the listing for one of the available corpora:

  id  name              size      docs    words  granularity    license
----  ----------------  ------  ------  -------  -------------  ---------
   1  Disco V2          22M       4088   381539  stanza         CC-BY
                                                 line

Download the desired corpora into the “my_corpora” folder:

averell download 2 3 --corpora-folder my_corpora

Example of poem in TEI format obtained from one of the corpora:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title> Spanish Metrical Patterns Bank: Golden Age Sonnets.</title>
                <principal>Borja Navarro Colorado</principal>
                <respStmt>
                    <name>María Ribes Lafoz</name>
                    <name>Noelia Sánchez López</name>
                    <name>Borja Navarro Colorado</name>
                    <resp>Metrical patterns annotation</resp>
                </respStmt>
            </titleStmt>
            <publicationStmt>
                <publisher>Natural Language Processing Group. Department of Software and Computing Systems. University of Alicante (Spain)</publisher>
            </publicationStmt>
            <sourceDesc>
                <bibl><title>Sonetos</title> de <author>Garcilaso de La Vega</author>. <publisher>Biblioteca Virtual Miguel de Cervantes</publisher>, edición de <editor role="editor">Ramón García González</editor>.</bibl>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <metDecl xml:id="bncolorado" type="met" pattern="((\+|\-)+)*">
                <metSym value="+">stressed syllable</metSym>
                <metSym value="-">unstressed syllable</metSym>
            </metDecl>
            <metDecl>
                <p>All metrical patterns have been manually checked.</p>
            </metDecl>
        </encodingDesc>
    </teiHeader>
    <text>
        <body>
            <head>
                <title>-XX-</title>
            </head>
            <lg type="cuarteto">
                <l n="1" met="-++--++--+-">Con tal fuerza y vigor son concertados</l>
                <l n="2" met="-----+-+-+-">para mi perdición los duros vientos,</l>
                <l n="3" met="--+--+---+-">que cortaron mis tiernos pensamientos</l>
                <l n="4" met="+----++--+-">luego que sobre mí fueron mostrados.</l>
            </lg>
            <lg type="terceto">
                <l n="5" met="-++--+---+-">El mal es que me quedan los cuidados</l>
                <l n="6" met="---+-----+-">en salvo de estos acontecimientos,</l>
                <l n="7" met="-++--+---+-">que son duros, y tienen fundamentos</l>
            </lg>
        </body>
    </text>
</TEI>
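
As a rough illustration of how the `<lg>` and `<l>` elements above map to stanzas and lines, here is a minimal sketch using only the Python standard library. It is not averell's own parser: the helper name `read_stanzas` and the trimmed sample document are assumptions made for this example, and only the namespace URI and attribute names come from the TEI file shown above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the <TEI> root element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Trimmed copy of the example poem (header and some lines omitted for brevity).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <text>
        <body>
            <head><title>-XX-</title></head>
            <lg type="cuarteto">
                <l n="1" met="-++--++--+-">Con tal fuerza y vigor son concertados</l>
                <l n="2" met="-----+-+-+-">para mi perdición los duros vientos,</l>
            </lg>
            <lg type="terceto">
                <l n="5" met="-++--+---+-">El mal es que me quedan los cuidados</l>
            </lg>
        </body>
    </text>
</TEI>"""


def read_stanzas(tei_xml):
    """Hypothetical helper: return one dict per <lg>, with its <l> children."""
    root = ET.fromstring(tei_xml)
    stanzas = []
    # Stanzas are numbered in document order, starting at 1.
    for number, lg in enumerate(root.iterfind(".//tei:lg", TEI_NS), start=1):
        lines = [
            {
                "line_number": line_el.get("n"),
                "line_text": (line_el.text or "").strip(),
                "metrical_pattern": line_el.get("met"),
            }
            for line_el in lg.iterfind("tei:l", TEI_NS)
        ]
        stanzas.append({
            "stanza_number": str(number),
            "stanza_type": lg.get("type"),
            "lines": lines,
            "stanza_text": "\n".join(line["line_text"] for line in lines),
        })
    return stanzas


stanzas = read_stanzas(SAMPLE_TEI)
for stanza in stanzas:
    print(stanza["stanza_number"], stanza["stanza_type"], len(stanza["lines"]))
```

Every element lookup has to be qualified with the namespace prefix, because the sample document declares a default TEI namespace on its root element.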

Example JSON file generated from the input XML/TEI poem and written to my_corpora/{corpus}/averell/parser/{author_name}/{poem_name}.json:

{
    "manually_checked": true,
    "poem_title": "-XX-",
    "author": "Garcilaso de La Vega",
    "stanzas": [
        {
            "stanza_number": "1",
            "stanza_type": "cuarteto",
            "lines": [
                {
                    "line_number": "1",
                    "line_text": "Con tal fuerza y vigor son concertados",
                    "metrical_pattern": "-++--++--+-"
                },
                {
                    "line_number": "2",
                    "line_text": "para mi perdición los duros vientos,",
                    "metrical_pattern": "-----+-+-+-"
                },
                {
                    "line_number": "3",
                    "line_text": "que cortaron mis tiernos pensamientos",
                    "metrical_pattern": "--+--+---+-"
                },
                {
                    "line_number": "4",
                    "line_text": "luego que sobre mí fueron mostrados.",
                    "metrical_pattern": "+----++--+-"
                }
            ],
            "stanza_text": "Con tal fuerza y vigor son concertados\npara mi perdición los duros vientos,\nque cortaron mis tiernos pensamientos\nluego que sobre mí fueron mostrados."
        },
        {
            "stanza_number": "2",
            "stanza_type": "terceto",
            "lines": [
                {
                    "line_number": "5",
                    "line_text": "El mal es que me quedan los cuidados",
                    "metrical_pattern": "-++--+---+-"
                },
                {
                    "line_number": "6",
                    "line_text": "en salvo de estos acontecimientos,",
                    "metrical_pattern": "---+-----+-"
                },
                {
                    "line_number": "7",
                    "line_text": "que son duros, y tienen fundamentos",
                    "metrical_pattern": "-++--+---+-"
                }
            ],
            "stanza_text": "El mal es que me quedan los cuidados\nen salvo de estos acontecimientos,\nque son duros, y tienen fundamentos"
        }
    ]
}
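
To make the idea of granularity concrete, the sketch below flattens a poem with the structure shown above into one record per line, copying the stanza-level and poem-level fields onto each line. This only illustrates the data shape, under the assumption that the nested fields are simply merged. It is not averell's implementation, and `flatten_to_lines` is a hypothetical helper name.

```python
# Shortened copy of the poem above (one stanza, two lines).
poem = {
    "manually_checked": True,
    "poem_title": "-XX-",
    "author": "Garcilaso de La Vega",
    "stanzas": [
        {
            "stanza_number": "1",
            "stanza_type": "cuarteto",
            "lines": [
                {
                    "line_number": "1",
                    "line_text": "Con tal fuerza y vigor son concertados",
                    "metrical_pattern": "-++--++--+-",
                },
                {
                    "line_number": "2",
                    "line_text": "para mi perdición los duros vientos,",
                    "metrical_pattern": "-----+-+-+-",
                },
            ],
            "stanza_text": "Con tal fuerza y vigor son concertados\n"
                           "para mi perdición los duros vientos,",
        }
    ],
}


def flatten_to_lines(poem):
    """Hypothetical helper: one flat record per line, with stanza and poem fields."""
    # Poem-level fields, excluding the nested stanza list.
    poem_fields = {k: v for k, v in poem.items() if k != "stanzas"}
    records = []
    for stanza in poem["stanzas"]:
        # Stanza-level fields, excluding the nested line list.
        stanza_fields = {k: v for k, v in stanza.items() if k != "lines"}
        for line in stanza["lines"]:
            records.append({**line, **stanza_fields, **poem_fields})
    return records


records = flatten_to_lines(poem)
for record in records:
    print(record["line_number"], record["stanza_type"], record["author"])
```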

Now we can combine and join these corpora by selecting a “granularity”:

averell export 2 3 --granularity line --corpora-folder my_corpora

It produces a single JSON file with information about all the lines in those corpora. Example of two random lines in the file my_corpora/corpus_2_3.json:

{
    "line_number": "5",
    "line_text": "¿Has visto que en el mismo lugar donde",
    "metrical_pattern": "++---+--++-",
    "stanza_number": "2",
    "manually_checked": false,
    "poem_title": " - II - ",
    "author": "Mira de Amescua",
    "stanza_text": "¿Has visto que en el mismo lugar donde\nbordado estuvo el cristalino velo\nun bordado terliz de escarcha y hielo\nhace que el campo de verdor se monde?",
    "stanza_type": "cuarteto"
}
{
    "line_number": "10",
    "line_text": "el que a lo cierto no a lo incierto mira,",
    "metrical_pattern": "---+-+-+-+-",
    "stanza_number": "3",
    "manually_checked": false,
    "poem_title": "- VIII - Considerando un sepulcro y los que están en él ",
    "author": "Lope de Zarate",
    "stanza_text": "De aquí si que consigue el ser dichoso\nel que a lo cierto no a lo incierto mira,\npues le adorna lo eterno fastuoso;",
    "stanza_type": "terceto"
}
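
Once the line records are loaded into Python, they can be analysed with ordinary dictionary operations. The sketch below works on an in-memory list holding shortened copies of the two records above, since the exact top-level layout of the exported file is not shown here. It reads "+" as a stressed syllable, following the `metSym` declarations in the TEI example. `stressed_positions` is a hypothetical helper, not part of averell.

```python
from collections import Counter

# Shortened copies of the two exported line records shown above.
lines = [
    {
        "line_text": "¿Has visto que en el mismo lugar donde",
        "metrical_pattern": "++---+--++-",
        "author": "Mira de Amescua",
        "stanza_type": "cuarteto",
    },
    {
        "line_text": "el que a lo cierto no a lo incierto mira,",
        "metrical_pattern": "---+-+-+-+-",
        "author": "Lope de Zarate",
        "stanza_type": "terceto",
    },
]


def stressed_positions(pattern):
    """Hypothetical helper: 1-based positions of the stressed ("+") syllables."""
    return [i for i, symbol in enumerate(pattern, start=1) if symbol == "+"]


# Number of lines per stanza type.
by_stanza_type = Counter(record["stanza_type"] for record in lines)

# Syllable count and stressed positions for each line.
stress_summary = [
    (
        record["author"],
        len(record["metrical_pattern"]),
        stressed_positions(record["metrical_pattern"]),
    )
    for record in lines
]

for author, syllables, stresses in stress_summary:
    print(author, syllables, stresses)
```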

Development

To run all the tests, run:

tox

Note: to combine the coverage data from all the tox environments, run:

Windows

set PYTEST_ADDOPTS=--cov-append
tox

Other

PYTEST_ADDOPTS=--cov-append tox

Changelog

1.0.3 (2020-09-03)

  • Added export ‘filename’ option

  • Added two new readers:

    • For better for verse

    • Métrique en ligne

1.0.2 (2020-06-23)

  • Added two new readers:

    • ECPA corpus

    • Gongocorpus

  • Minor bug fixes

1.0.1 (2020-05-18)

  • Setting up bumpversion

  • Integration with Zenodo

1.0.0 (2020-04-29)

  • Remove commits-since code block

  • Adding automated deployments to PyPI on tag releases

  • Added menu

  • Remove comments and cleaner code fixes

  • Fix sorted output of tests

  • Added proper documentation and coverage tests

  • Added tests for export function

  • Added ‘export’ function and test

  • Added TEI_NAMESPACE as a constant

  • Fixed docs. Fixed loads with Path. Fixed logging errors

  • Added tests

0.0.1 (2020-01-08)

  • First release on PyPI.
