A library for reading corpora.

corpona

corpona is a library for processing corpus formats (e.g. XML and JSON). It can be installed with pip: pip install -U corpona

Examples

Reading NewsML XML format

from corpona import XML
d = XML.parse_xml('2660341.xml')
print(f"Guid: {d.guid}") # access tag attributes as Python attributes
print(f"Language: {d.attributes['xml:lang']}") # in case of special characters, access them directly

contentMeta = d['contentMeta'][0]
print(f"Urgency: {contentMeta['urgency']}")
print(f"Headline: {contentMeta['headline']}")
print(f"Subject: {contentMeta['subject'][0]['name']}")
print("Genres: {}".format(", ".join(g['name'].text for g in contentMeta['genre'])))
print()
content_body = d['contentSet'][0]['inlineXML'][0]['html'][0]['body'][0]
print("Content: ")
for p in content_body['p']:
    print(p)

Getting a summary of XML/JSON data

from corpona import XML
from corpona import summarize
from pprint import pprint

d = XML.parse_xml('data.xml', namespaces={'http://www.w3.org/XML/1998/namespace': 'xml', })
pprint(summarize(d), indent=4)

pprint(summarize([
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]), indent=4)
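
For background on the namespaces argument used above: it maps a namespace URI to a prefix. The sketch below uses only the standard library's xml.etree.ElementTree, not corpona, to show why such a mapping is useful. ElementTree stores a prefixed attribute like xml:lang under its expanded "{uri}lang" name, and a URI-to-prefix mapping turns it back into the readable form. The helper prefixed_attributes and the sample element are hypothetical and are not part of corpona's API.

```python
import xml.etree.ElementTree as ET

# URI that the reserved "xml" prefix always refers to.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def prefixed_attributes(element, namespaces):
    """Return the element's attributes with '{uri}name' keys rewritten as 'prefix:name'.

    Hypothetical helper for illustration only; `namespaces` maps URI -> prefix,
    mirroring the shape of the mapping passed to XML.parse_xml above.
    """
    result = {}
    for key, value in element.attrib.items():
        if key.startswith("{"):
            uri, _, local = key[1:].partition("}")
            prefix = namespaces.get(uri)
            # Unknown URIs fall back to the bare local name (a choice made for this sketch).
            key = f"{prefix}:{local}" if prefix else local
        result[key] = value
    return result


# Made-up element; a real NewsML file would carry many more attributes.
root = ET.fromstring('<newsItem guid="example-guid" xml:lang="en"/>')
print(root.attrib)                                  # expanded '{uri}lang' key
print(prefixed_attributes(root, {XML_NS: "xml"}))   # readable 'xml:lang' key
```

Whether corpona performs this rewriting in the same way internally is not stated here; the sketch only illustrates the general idea behind a URI-to-prefix mapping.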

Find children

from corpona import find_child

data = {"key":["list_item", {"key2":"oo"}, {"key2":"bbb"}]}
print(find_child(data, ["key", "key2"]))
print(find_child(data, ["key", "key3"], default_value="ok"))

>> ['oo', 'bbb']
>> ['ok']
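
To make the lookup behaviour in the example above concrete, here is a minimal pure-Python sketch that reproduces the two outputs shown. It is not corpona's implementation. The name find_child_sketch is hypothetical, and the handling of a missing path without a default value is an assumption, since the example does not show that case.

```python
def find_child_sketch(node, path, default_value=None):
    """Follow `path` through nested dicts and lists, collecting every match.

    Illustrative re-implementation only; not taken from corpona's source.
    """
    current = [node]
    for key in path:
        next_level = []
        for item in current:
            # A list is searched element by element; anything else is searched as-is.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                # Non-dict entries (e.g. the string "list_item") cannot hold the key.
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    if current:
        return current
    # Assumption: with no match, return the default wrapped in a list,
    # or an empty list when no default was supplied.
    return [] if default_value is None else [default_value]


data = {"key": ["list_item", {"key2": "oo"}, {"key2": "bbb"}]}
print(find_child_sketch(data, ["key", "key2"]))                      # ['oo', 'bbb']
print(find_child_sketch(data, ["key", "key3"], default_value="ok"))  # ['ok']
```

The sketch walks the path one key at a time and flattens lists along the way, which is why a single path can return several values.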

Cite

If you use the library in an academic paper, please cite it:

Alnajjar, K. & Hämäläinen, M. (2021). Corpona – The Pythonic Way of Processing Corpora. In Hämäläinen, M., Partanen, N. & Alnajjar, K. (eds.), Multilingual Facilitation. University of Helsinki, pp. 25–30.

@inbook{3bd164164c8648b986cb14a4a8524423,
title = "Corpona – The Pythonic Way of Processing Corpora",
author = "Khalid Alnajjar and Mika H{\"a}m{\"a}l{\"a}inen",
year = "2021",
language = "English",
pages = "25--30",
editor = "Mika H{\"a}m{\"a}l{\"a}inen and Niko Partanen and Khalid Alnajjar",
booktitle = "Multilingual Facilitation",
publisher = "University of Helsinki",
address = "Finland",
}

Need NLP solutions for your business?

Our company, Rootroo, offers consulting on multilingual NLP tasks. We have a strong academic background in state-of-the-art AI solutions for every NLP need. Just contact us, we won't bite.

Source Distribution

corpona-1.0.1.tar.gz (5.8 kB)
