A library for reading corpora.

corpona

corpona is a library for processing corpus file formats (e.g. XML and JSON). Install it with pip: pip install -U corpona.

Examples

Reading NewsML XML format

from corpona import XML
d = XML.parse_xml('2660341.xml')
print(f"Guid: {d.guid}") # access tag attributes as Python attributes
print(f"Language: {d.attributes['xml:lang']}") # in case of special characters, access them directly

contentMeta = d['contentMeta'][0]
print(f"Urgency: {contentMeta['urgency']}")
print(f"Headline: {contentMeta['headline']}")
print(f"Subject: {contentMeta['subject'][0]['name']}")
print("Genres: {}".format(", ".join(g['name'].text for g in contentMeta['genre'])))
print()
content_body = d['contentSet'][0]['inlineXML'][0]['html'][0]['body'][0]
print("Content: ")
for p in content_body['p']:
    print(p)
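
For comparison, here is a minimal sketch of the same kind of lookup using only the standard library's xml.etree.ElementTree. It is not corpona's implementation. The sample document is hypothetical and heavily simplified (real NewsML files declare a default namespace, which this sample omits). It shows why namespaced attributes such as xml:lang need special handling: ElementTree exposes them under a `{uri}local` key rather than under the prefixed name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified NewsML-like document (not taken from a real feed).
SAMPLE = """<newsItem guid="urn:example:1" xml:lang="en">
  <contentMeta>
    <urgency>3</urgency>
    <headline>Example headline</headline>
    <genre><name>Current</name></genre>
    <genre><name>Analysis</name></genre>
  </contentMeta>
</newsItem>"""

# The "xml" prefix is always bound to this URI by the XML specification.
XML_NS = "http://www.w3.org/XML/1998/namespace"

root = ET.fromstring(SAMPLE)

# Plain attributes are looked up by name.
guid = root.get("guid")
# Namespaced attributes are stored as "{namespace-uri}local-name".
lang = root.get(f"{{{XML_NS}}}lang")

meta = root.find("contentMeta")
urgency = meta.findtext("urgency")
headline = meta.findtext("headline")
genres = [g.findtext("name") for g in meta.findall("genre")]

print(f"Guid: {guid}")
print(f"Language: {lang}")
print(f"Urgency: {urgency}")
print(f"Headline: {headline}")
print("Genres: " + ", ".join(genres))
```

The corpona example above reads the same attribute through d.attributes['xml:lang'], which keeps the prefixed name instead of the expanded URI form.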

Getting a Summary of an XML/JSON

from corpona import XML
from corpona import summarize
from pprint import pprint

d = XML.parse_xml('data.xml', namespaces={'http://www.w3.org/XML/1998/namespace': 'xml'})
pprint(summarize(d), indent=4)

pprint(summarize([
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]), indent=4)
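
The output format of summarize is not shown here, so the following is only a sketch of one plausible kind of structural summary for a list of records, not corpona's actual behavior. The function name summarize_records and the shape of its result (value types and occurrence count per key) are assumptions made for illustration.

```python
from collections import defaultdict
from pprint import pprint


def summarize_records(records):
    """Hypothetical helper: for each key, report value types and occurrence count."""
    types = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for key, value in record.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    return {
        key: {"types": sorted(types[key]), "count": counts[key]}
        for key in types
    }


records = [
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]
summary = summarize_records(records)
pprint(summary, indent=4)
```

A summary of this kind describes the structure of the data (which keys exist and what they hold) rather than its content, which is useful when exploring an unfamiliar corpus file.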

Find children

from corpona import find_child

data = {"key":["list_item", {"key2":"oo"}, {"key2":"bbb"}]}
print(find_child(data, ["key", "key2"]))
print(find_child(data, ["key", "key3"], default_value="ok"))

>> ['oo', 'bbb']
>> ['ok']
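
The two outputs above suggest that find_child follows the key path, looks inside lists along the way, skips items that lack the key, and falls back to a single default value when nothing matches. A minimal pure-Python sketch that reproduces these two documented results follows. The name find_child_sketch is hypothetical, and the library's real implementation and its behavior in other cases (for example when no default is given) may differ.

```python
def find_child_sketch(node, path, default_value=None):
    """Collect all values reachable by following `path`, fanning out over lists."""
    current = [node]
    for key in path:
        next_level = []
        for item in current:
            # Treat a list as transparent: inspect each of its elements.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                # Non-dict items (e.g. plain strings) cannot hold the key; skip them.
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    # Assumption: when nothing matched, return a one-element list with the default.
    return current if current else [default_value]


data = {"key": ["list_item", {"key2": "oo"}, {"key2": "bbb"}]}
found = find_child_sketch(data, ["key", "key2"])
missing = find_child_sketch(data, ["key", "key3"], default_value="ok")
print(found)
print(missing)
```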

Cite

If you use the library in an academic paper, please cite it:

Alnajjar, K. & Hämäläinen, M. (2021). Corpona – The Pythonic Way of Processing Corpora. In Hämäläinen, M., Partanen, N. & Alnajjar, K. (eds.), Multilingual Facilitation. University of Helsinki, pp. 25-30.

@inbook{3bd164164c8648b986cb14a4a8524423,
title = "Corpona – The Pythonic Way of Processing Corpora",
author = "Khalid Alnajjar and Mika H{\"a}m{\"a}l{\"a}inen",
year = "2021",
language = "English",
pages = "25--30",
editor = "Mika H{\"a}m{\"a}l{\"a}inen and Niko Partanen and Khalid Alnajjar",
booktitle = "Multilingual Facilitation",
publisher = "University of Helsinki",
address = "Finland",
}

Need NLP solutions for your business?

Our company, Rootroo, offers consulting on multilingual NLP tasks. We have a strong academic background in state-of-the-art AI solutions for every NLP need. Just contact us; we won't bite.

Release history

1.0.1 (Mar 24, 2021)
1.0.0 (Dec 17, 2020)

Download files

Source Distribution

corpona-1.0.1.tar.gz (5.8 kB)

Uploaded Mar 24, 2021 Source

File details

Details for the file corpona-1.0.1.tar.gz.

File metadata

Download URL: corpona-1.0.1.tar.gz
Upload date: Mar 24, 2021
Size: 5.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.3

File hashes

Hashes for corpona-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`800d5aa1e10e5f865902b674e3c5c8d4e9be29e8071dc7669961343aee78e2a5`
MD5	`dc9b908a65e5efaa014c2f8dbf7d1050`
BLAKE2b-256	`3a5c26d53061ac3938d734f6d96329e37d2b837b5c6b21598fc74ef662044871`
