Corpora downloader and reader for Spanish sources

Project description

Averell is a Python library and command-line interface that facilitates working with existing repositories of annotated poetry. Averell can download an annotated corpus and reconcile different TEI entities to provide a unified JSON output at the desired granularity. For their investigations, some researchers might need the entire poem, poems split line by line, or even word by word where that is available. Averell lets you specify the granularity of the final generated dataset, which is a combined JSON file with all the entities in the selected corpora. Each corpus in the catalog must specify the parser that produces the expected data format.

  • Free software: Apache Software License 2.0

Available corpora (version 1.0.4)

  id  name                                         lang  size    docs    words  granularity                   license
----  -------------------------------------------  ----  -----  -----  -------  ----------------------------  ------------------
   1  Disco V2.1 (disco2_1)                        es    22M      4088   381539  stanza, line                  CC-BY
   2  Disco V3 (disco3)                            es    28M      4080   377978  stanza, line                  CC-BY
   3  Sonetos Siglo de Oro (adso)                  es    6.8M     5078   466012  stanza, line                  CC-BY-NC 4.0
   4  ADSO 100 poems corpus (adso100)              es    128K      100     9208  stanza, line                  CC-BY-NC 4.0
   5  Poesía Lírica Castellana Siglo de Oro (plc)  es    3.8M      475   299402  stanza, line, word, syllable  CC-BY-NC 4.0
   6  Gongocorpus (gongo)                          es    9.2M      481    99079  stanza, line, word, syllable  CC-BY-NC-ND 3.0 FR
   7  Eighteenth Century Poetry Archive (ecpa)     en    2400M   3084  2063668  stanza, line, word            CC BY-SA 4.0
   8  For Better For Verse (4b4v)                  en    39.5M    103    41749  stanza, line                  Unknown
   9  Métrique en Ligne (mel)                      fr    183M     5081  1850222  stanza, line                  Unknown
  10  Biblioteca Italiana (bibit)                  it    242M    25341  7121246  stanza, line, word            Unknown

Installation

To install averell, run this command in your terminal:

pip install averell

This is the preferred method to install averell, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can guide you through the process.

Usage

To show averell help:

averell --help

To list all available corpora:

averell list

Example output, showing one of the available corpora:

  id  name                 lang    size      docs    words  granularity    license
----  -------------------  ------  ------  ------  -------  -------------  -----------
   1  Disco V2.1           es      22M       4088   381539  stanza         CC-BY
                                                            line

download

Download the desired corpora into the my_corpora folder:

averell download 2 3 --corpora-folder my_corpora

Example of poem in TEI format obtained from one of the corpora:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title> Spanish Metrical Patterns Bank: Golden Age Sonnets.</title>
                <principal>Borja Navarro Colorado</principal>
                <respStmt>
                    <name>María Ribes Lafoz</name>
                    <name>Noelia Sánchez López</name>
                    <name>Borja Navarro Colorado</name>
                    <resp>Metrical patterns annotation</resp>
                </respStmt>
            </titleStmt>
            <publicationStmt>
                <publisher>Natural Language Processing Group. Department of Software and Computing Systems. University of Alicante (Spain)</publisher>
            </publicationStmt>
            <sourceDesc>
                <bibl><title>Sonetos</title> de <author>Garcilaso de La Vega</author>. <publisher>Biblioteca Virtual Miguel de Cervantes</publisher>, edición de <editor role="editor">Ramón García González</editor>.</bibl>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <metDecl xml:id="bncolorado" type="met" pattern="((\+|\-)+)*">
                <metSym value="+">stressed syllable</metSym>
                <metSym value="-">unstressed syllable</metSym>
            </metDecl>
            <metDecl>
                <p>All metrical patterns have been manually checked.</p>
            </metDecl>
        </encodingDesc>
    </teiHeader>
    <text>
        <body>
            <head>
                <title>-XX-</title>
            </head>
            <lg type="cuarteto">
                <l n="1" met="-++--++--+-">Con tal fuerza y vigor son concertados</l>
                <l n="2" met="-----+-+-+-">para mi perdición los duros vientos,</l>
                <l n="3" met="--+--+---+-">que cortaron mis tiernos pensamientos</l>
                <l n="4" met="+----++--+-">luego que sobre mí fueron mostrados.</l>
            </lg>
            <lg type="terceto">
                <l n="5" met="-++--+---+-">El mal es que me quedan los cuidados</l>
                <l n="6" met="---+-----+-">en salvo de estos acontecimientos,</l>
                <l n="7" met="-++--+---+-">que son duros, y tienen fundamentos</l>
            </lg>
        </body>
    </text>
</TEI>
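
As a rough illustration of what reading such a file involves, the following sketch uses only Python's standard library to pull stanzas, lines and metrical patterns out of a TEI fragment. It is not averell's own parser: the function name read_stanzas, the constant TEI_NS and the trimmed sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# TEI namespace, as declared on the root element of the example above.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Trimmed sample containing only the first two lines of the poem.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <lg type="cuarteto">
      <l n="1" met="-++--++--+-">Con tal fuerza y vigor son concertados</l>
      <l n="2" met="-----+-+-+-">para mi perdición los duros vientos,</l>
    </lg>
  </body></text>
</TEI>"""


def read_stanzas(tei_xml):
    """Return a list of stanzas, each with its type and annotated lines."""
    root = ET.fromstring(tei_xml)
    stanzas = []
    # Every <lg> element is treated as one stanza, numbered in document order.
    for number, lg in enumerate(root.iterfind(".//tei:lg", TEI_NS), start=1):
        lines = [
            {
                "line_number": line_el.get("n"),
                "line_text": (line_el.text or "").strip(),
                "metrical_pattern": line_el.get("met"),
            }
            for line_el in lg.iterfind("tei:l", TEI_NS)
        ]
        stanzas.append({
            "stanza_number": str(number),
            "stanza_type": lg.get("type"),
            "lines": lines,
        })
    return stanzas


stanzas = read_stanzas(SAMPLE)
print(stanzas[0]["stanza_type"], len(stanzas[0]["lines"]))
```

Because the TEI root declares a default namespace, element lookups need the namespace mapping; a bare search for lg would find nothing.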

Example JSON file generated from the input XML/TEI poem and saved to my_corpora/{corpus}/averell/parser/{author_name}/{poem_name}.json:

{
    "manually_checked": true,
    "poem_title": "-XX-",
    "author": "Garcilaso de La Vega",
    "stanzas": [
        {
            "stanza_number": "1",
            "stanza_type": "cuarteto",
            "lines": [
                {
                    "line_number": "1",
                    "line_text": "Con tal fuerza y vigor son concertados",
                    "metrical_pattern": "-++--++--+-"
                },
                {
                    "line_number": "2",
                    "line_text": "para mi perdición los duros vientos,",
                    "metrical_pattern": "-----+-+-+-"
                },
                {
                    "line_number": "3",
                    "line_text": "que cortaron mis tiernos pensamientos",
                    "metrical_pattern": "--+--+---+-"
                },
                {
                    "line_number": "4",
                    "line_text": "luego que sobre mí fueron mostrados.",
                    "metrical_pattern": "+----++--+-"
                }
            ],
            "stanza_text": "Con tal fuerza y vigor son concertados\npara mi perdición los duros vientos,\nque cortaron mis tiernos pensamientos\nluego que sobre mí fueron mostrados."
        },
        {
            "stanza_number": "2",
            "stanza_type": "terceto",
            "lines": [
                {
                    "line_number": "5",
                    "line_text": "El mal es que me quedan los cuidados",
                    "metrical_pattern": "-++--+---+-"
                },
                {
                    "line_number": "6",
                    "line_text": "en salvo de estos acontecimientos,",
                    "metrical_pattern": "---+-----+-"
                },
                {
                    "line_number": "7",
                    "line_text": "que son duros, y tienen fundamentos",
                    "metrical_pattern": "-++--+---+-"
                }
            ],
            "stanza_text": "El mal es que me quedan los cuidados\nen salvo de estos acontecimientos,\nque son duros, y tienen fundamentos"
        }
    ]
}
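
A file in this layout can be processed with the standard json module. The sketch below is only an illustration: it embeds a one-line excerpt of the poem instead of reading a file, and the helper stressed_positions is a hypothetical name, not part of averell. It relies on the TEI header's declaration that "+" marks a stressed syllable and "-" an unstressed one.

```python
import json

# One-line excerpt of the example poem, in the JSON layout shown above.
POEM_JSON = """
{
    "poem_title": "-XX-",
    "author": "Garcilaso de La Vega",
    "stanzas": [
        {
            "stanza_number": "1",
            "stanza_type": "cuarteto",
            "lines": [
                {
                    "line_number": "1",
                    "line_text": "Con tal fuerza y vigor son concertados",
                    "metrical_pattern": "-++--++--+-"
                }
            ]
        }
    ]
}
"""


def stressed_positions(pattern):
    """Return the 1-based positions of stressed syllables ('+') in a pattern."""
    return [i for i, symbol in enumerate(pattern, start=1) if symbol == "+"]


poem = json.loads(POEM_JSON)
for stanza in poem["stanzas"]:
    for line in stanza["lines"]:
        pattern = line["metrical_pattern"]
        # One symbol per metrical syllable, so the length is the syllable count.
        print(line["line_number"], len(pattern), stressed_positions(pattern))
```

For a file on disk, json.load on an open file object would replace the json.loads call.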

export

Now we can combine these corpora by selecting a granularity:

averell export 2 3 --granularity line --corpora-folder my_corpora --filename export_1

It produces a single JSON file with information about all the lines in those corpora. Example of two random lines in the file my_corpora/export_1.json:

{
    "line_number": "5",
    "line_text": "¿Has visto que en el mismo lugar donde",
    "metrical_pattern": "++---+--++-",
    "stanza_number": "2",
    "manually_checked": false,
    "poem_title": " - II - ",
    "author": "Mira de Amescua",
    "stanza_text": "¿Has visto que en el mismo lugar donde\nbordado estuvo el cristalino velo\nun bordado terliz de escarcha y hielo\nhace que el campo de verdor se monde?",
    "stanza_type": "cuarteto"
}
{
    "line_number": "10",
    "line_text": "el que a lo cierto no a lo incierto mira,",
    "metrical_pattern": "---+-+-+-+-",
    "stanza_number": "3",
    "manually_checked": false,
    "poem_title": "- VIII - Considerando un sepulcro y los que están en él ",
    "author": "Lope de Zarate",
    "stanza_text": "De aquí si que consigue el ser dichoso\nel que a lo cierto no a lo incierto mira,\npues le adorna lo eterno fastuoso;",
    "stanza_type": "terceto"
}
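
To make the relation between the per-poem JSON and these line-level records concrete, here is a minimal sketch of such a flattening. It is an assumption about how the two layouts correspond, not averell's actual export code, and to_line_records is a hypothetical name. The sample reuses the first record above, trimmed to a single line.

```python
def to_line_records(poem):
    """Flatten a poem dictionary into one record per line."""
    records = []
    for stanza in poem["stanzas"]:
        for line in stanza["lines"]:
            # Start from the line's own fields (number, text, pattern) ...
            record = dict(line)
            # ... and attach the stanza-level and poem-level context.
            record.update(
                stanza_number=stanza["stanza_number"],
                stanza_text=stanza["stanza_text"],
                stanza_type=stanza["stanza_type"],
                manually_checked=poem["manually_checked"],
                poem_title=poem["poem_title"],
                author=poem["author"],
            )
            records.append(record)
    return records


# Sample poem holding only the single line whose annotation is shown above.
poem = {
    "manually_checked": False,
    "poem_title": " - II - ",
    "author": "Mira de Amescua",
    "stanzas": [
        {
            "stanza_number": "2",
            "stanza_type": "cuarteto",
            "stanza_text": (
                "¿Has visto que en el mismo lugar donde\n"
                "bordado estuvo el cristalino velo\n"
                "un bordado terliz de escarcha y hielo\n"
                "hace que el campo de verdor se monde?"
            ),
            "lines": [
                {
                    "line_number": "5",
                    "line_text": "¿Has visto que en el mismo lugar donde",
                    "metrical_pattern": "++---+--++-",
                }
            ],
        }
    ],
}

records = to_line_records(poem)
print(records[0]["author"], records[0]["line_number"])
```

Each record repeats the stanza and poem fields, which is why the exported lines above can be read on their own.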

By default, export will download corpora if needed. To avoid this behaviour, the flag --no-download can be passed in.

Exported corpora can be easily loaded into Pandas:

averell export adso100 --filename adso100.json

Then, in Python:

import pandas as pd

# read_json accepts a file path directly, so no open file handle is left behind
adso100 = pd.read_json("adso100.json")

A note on IDs

IDs can be the numeric identifiers shown in the averell list output, corpus shortcodes (shown in parentheses), the special literal all to refer to all corpora, or two-letter ISO language codes to refer to the available corpora in a specific language.

For example, the command averell export 1 bibit fr will export Disco V2.1, the Biblioteca Italiana poetry corpus, and all corpora tagged with the French language tag into a single file, splitting poems line by line.
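
The resolution rules can be sketched as follows. This is a hypothetical illustration, not averell's implementation: CATALOG is copied by hand from the table of available corpora, and resolve_ids is an invented name.

```python
# Catalog excerpt from the table of available corpora:
# (numeric id, shortcode, language).
CATALOG = [
    (1, "disco2_1", "es"),
    (2, "disco3", "es"),
    (3, "adso", "es"),
    (4, "adso100", "es"),
    (5, "plc", "es"),
    (6, "gongo", "es"),
    (7, "ecpa", "en"),
    (8, "4b4v", "en"),
    (9, "mel", "fr"),
    (10, "bibit", "it"),
]


def resolve_ids(ids):
    """Map numeric IDs, shortcodes, language codes and 'all' to sorted numeric IDs."""
    selected = set()
    for token in ids:
        token = str(token).lower()
        if token == "all":
            matches = [number for number, _, _ in CATALOG]
        else:
            # Exact match against the numeric id, the shortcode or the language.
            matches = [
                number
                for number, slug, lang in CATALOG
                if token in (str(number), slug, lang)
            ]
        if not matches:
            raise ValueError(f"Unknown corpus identifier: {token!r}")
        selected.update(matches)
    return sorted(selected)


print(resolve_ids(["1", "bibit", "fr"]))  # [1, 9, 10]
```

Collecting matches in a set means a corpus named twice, for instance by both its shortcode and its language, is only selected once.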

Changelog

1.1.0 (2020-09-18)

  • Added Biblioteca Italiana (bibit) reader
  • Added Archivio Metrico Italiano info to Biblioteca Italiana reader
  • Reduced fixtures file size
  • Adding a tmp file to git ignore
  • Adding languages and some other cosmetic changes
  • Fixing an error with the expected output of the averell list command
  • Adding slugs, langs, and ‘all’ to download and export
  • Fixing coverage
  • Adding documentation and fixing a test

1.0.3 (2020-09-03)

  • Added export --filename option
  • Added two new readers:
    • For better for verse
    • Métrique en ligne

1.0.2 (2020-06-23)

  • Added two new readers:
    • ECPA corpus
    • Gongocorpus
  • Minor bug fixes

1.0.1 (2020-05-18)

  • Setting up bumpversion
  • Integration with Zenodo

1.0.0 (2020-04-29)

  • Remove commits-since code block
  • Adding automated deployments to PyPI on tag releases
  • Added menu
  • Remove comments and cleaner code fixes
  • Fix sorted output of tests
  • Added proper documentation and coverage tests
  • Added tests for export function
  • Added export function
  • Added TEI_NAMESPACE as a constant
  • Fixed docs. Fixed loads with Path. Fixed logging errors
  • Added tests

0.0.1 (2020-01-08)

  • First release on PyPI.
