A library for reading corpora.

corpona

corpona is a library for processing corpus formats (e.g. XML and JSON). It can be installed with pip: pip install -U corpona.

Examples

Reading NewsML XML format

from corpona import XML
d = XML.parse_xml('2660341.xml')
print(f"Guid: {d.guid}") # access tag attributes as Python attributes
print(f"Language: {d.attributes['xml:lang']}") # in case of special characters, access them directly

contentMeta = d['contentMeta'][0]
print(f"Urgency: {contentMeta['urgency']}")
print(f"Headline: {contentMeta['headline']}")
print(f"Subject: {contentMeta['subject'][0]['name']}")
print("Genres: {}".format(", ".join(g['name'].text for g in contentMeta['genre'])))
print()
content_body = d['contentSet'][0]['inlineXML'][0]['html'][0]['body'][0]
print("Content: ")
for p in content_body['p']:
    print(p)

Getting a summary of XML/JSON data

from corpona import XML
from corpona import summarize
from pprint import pprint

d = XML.parse_xml('data.xml', namespaces={'http://www.w3.org/XML/1998/namespace': 'xml', })
pprint(summarize(d), indent=4)

pprint(summarize([
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]), indent=4)
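
The namespaces argument above maps a namespace URI to a short prefix. As background, Python's standard-library ElementTree reports namespaced names in the expanded {uri}name form, which is presumably what such a mapping shortens. The sketch below uses only the standard library, not corpona; the prefixed helper and the sample document are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical sample document, not taken from a real NewsML file.
doc = '<newsItem guid="urn:example:1" xml:lang="en"><headline>Hi</headline></newsItem>'
root = ET.fromstring(doc)

# ElementTree stores xml:lang under its expanded name.
lang = root.attrib[f"{{{XML_NS}}}lang"]


def prefixed(name, namespaces):
    """Turn '{uri}local' into 'prefix:local' when the URI is in the mapping.

    Hypothetical helper for illustration; not part of corpona.
    """
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        if uri in namespaces:
            return f"{namespaces[uri]}:{local}"
    return name


short_names = {prefixed(k, {XML_NS: "xml"}): v for k, v in root.attrib.items()}
print(lang)
print(short_names)
```

With a URI-to-prefix mapping in place, the attribute can be addressed by its familiar name, xml:lang, instead of the full expanded form.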

Find children

from corpona import find_child

data = {"key":["list_item", {"key2":"oo"}, {"key2":"bbb"}]}
print(find_child(data, ["key", "key2"]))
print(find_child(data, ["key", "key3"], default_value="ok"))

>> ['oo', 'bbb']
>> ['ok']
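
To make the lookup behaviour above concrete, here is a minimal stand-alone sketch that reproduces the two results shown. It is not corpona's implementation: the name find_nested and the rule of returning the default wrapped in a list when nothing matches are assumptions inferred from the example output.

```python
def find_nested(data, path, default=None):
    """Collect every value reachable by following `path` through dicts and lists.

    Hypothetical re-implementation for illustration; not corpona's code.
    """
    current = [data]
    for key in path:
        next_level = []
        for node in current:
            # A list is searched element by element; anything else is one item.
            items = node if isinstance(node, list) else [node]
            for item in items:
                # Non-dict items (such as the string "list_item") are skipped.
                if isinstance(item, dict) and key in item:
                    next_level.append(item[key])
        current = next_level
    # Assumption: when nothing matches, return the default wrapped in a list.
    return current if current else [default]


data = {"key": ["list_item", {"key2": "oo"}, {"key2": "bbb"}]}
found = find_nested(data, ["key", "key2"])
missing = find_nested(data, ["key", "key3"], default="ok")
print(found)
print(missing)
```

The sketch walks the path one key at a time and flattens lists along the way, which is why a single path can yield several values.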

Cite

If you use the library in an academic paper, please cite it:

Alnajjar, K. & Hämäläinen, M. (2021). Corpona – The Pythonic Way of Processing Corpora. In Hämäläinen, M., Partanen, N. & Alnajjar, K. (eds.) Multilingual Facilitation. University of Helsinki, pp. 25-30.

@inbook{3bd164164c8648b986cb14a4a8524423,
title = "Corpona – The Pythonic Way of Processing Corpora",
author = "Khalid Alnajjar and Mika H{\"a}m{\"a}l{\"a}inen",
year = "2021",
language = "English",
pages = "25--30",
editor = "Mika H{\"a}m{\"a}l{\"a}inen and Niko Partanen and Khalid Alnajjar",
booktitle = "Multilingual Facilitation",
publisher = "University of Helsinki",
address = "Finland",
}

Need NLP solutions for your business?

Our company, Rootroo, offers consulting on multilingual NLP tasks. We have a strong academic background in state-of-the-art AI solutions for every NLP need. Just contact us; we won't bite.

