A library for reading corpora.
corpona
corpona is a library for processing corpus file formats such as XML and JSON.
The library is installable via pip: pip install -U corpona
Examples
Reading NewsML XML format
from corpona import XML
d = XML.parse_xml('2660341.xml')
print(f"Guid: {d.guid}") # access tag attributes as Python attributes
print(f"Language: {d.attributes['xml:lang']}") # attribute names with special characters (here ':') are read from d.attributes
contentMeta = d['contentMeta'][0]
print(f"Urgency: {contentMeta['urgency']}")
print(f"Headline: {contentMeta['headline']}")
print(f"Subject: {contentMeta['subject'][0]['name']}")
print("Genres: {}".format(", ".join(g['name'].text for g in contentMeta['genre'])))
print()
content_body = d['contentSet'][0]['inlineXML'][0]['html'][0]['body'][0]
print("Content: ")
for p in content_body['p']:
    print(p)
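As background on why `xml:lang` is read differently from `guid`: standard XML parsers expand the `xml:` prefix into its full namespace URI. The sketch below uses only the standard library's `xml.etree.ElementTree`, not corpona, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The reserved namespace that the "xml:" prefix always refers to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical, heavily simplified NewsML-like document (not a real file).
sample = (
    '<newsItem guid="urn:example:1" xml:lang="en">'
    "<contentMeta><urgency>4</urgency></contentMeta>"
    "</newsItem>"
)

root = ET.fromstring(sample)

# Plain attribute: looked up by its name.
guid = root.get("guid")

# Prefixed attribute: ElementTree stores it as "{namespace-uri}localname".
lang = root.get(f"{{{XML_NS}}}lang")

# Child element text, addressed with a simple path.
urgency = root.find("contentMeta/urgency").text

print(guid, lang, urgency)
```

This is presumably the kind of bookkeeping that a prefix mapping such as `namespaces={'http://www.w3.org/XML/1998/namespace': 'xml'}` spares you from, though that reading is an assumption rather than something the README states.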
Getting a Summary of an XML/JSON
from corpona import XML
from corpona import summarize
from pprint import pprint
d = XML.parse_xml('data.xml', namespaces={'http://www.w3.org/XML/1998/namespace': 'xml', })
pprint(summarize(d), indent=4)
pprint(summarize([
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]), indent=4)
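The README does not show what `summarize` returns. For a rough intuition of what a structural summary of records can look like, here is a standard-library sketch. `summarize_keys` is a hypothetical helper written for this example; it is not corpona's implementation, and its output format should not be taken as corpona's.

```python
from collections import Counter, defaultdict


def summarize_keys(records):
    """Count, for every key, how often each value type occurs across the records."""
    summary = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            summary[key][type(value).__name__] += 1
    # Convert to plain dicts so the result prints cleanly.
    return {key: dict(counts) for key, counts in summary.items()}


rows = [
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]

key_summary = summarize_keys(rows)
print(key_summary)  # {'key1': {'str': 4}, 'key2': {'int': 4}}
```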
Find children
from corpona import find_child
data = {"key":["list_item", {"key2":"oo"}, {"key2":"bbb"}]}
print(find_child(data, ["key", "key2"]))
print(find_child(data, ["key", "key3"], default_value="ok"))
>> ['oo', 'bbb']
>> ['ok']
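To make the lookup behaviour in the example above concrete, the following is a minimal sketch that reproduces the two printed results. `find_child_sketch` is a hypothetical reimplementation inferred only from that example; corpona's actual `find_child` may handle other inputs differently.

```python
from typing import Any, List, Sequence


def find_child_sketch(data: Any, path: Sequence[str], default_value: Any = None) -> List[Any]:
    """Follow `path` key by key, descending into lists along the way."""
    nodes = [data]
    for key in path:
        next_nodes = []
        for node in nodes:
            # A list is searched item by item; anything else is treated as one candidate.
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                # Non-dict items (such as the string "list_item") are skipped.
                if isinstance(candidate, dict) and key in candidate:
                    next_nodes.append(candidate[key])
        nodes = next_nodes
    # Assumption: with no match, the default comes back wrapped in a list,
    # mirroring the ['ok'] output shown above.
    return nodes if nodes else [default_value]


data = {"key": ["list_item", {"key2": "oo"}, {"key2": "bbb"}]}
found = find_child_sketch(data, ["key", "key2"])
missing = find_child_sketch(data, ["key", "key3"], default_value="ok")
print(found)    # ['oo', 'bbb']
print(missing)  # ['ok']
```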
Cite
If you use the library in an academic paper, please cite it:
Alnajjar, K. & Hämäläinen, M. (2021). Corpona – The Pythonic Way of Processing Corpora. In Hämäläinen, M., Partanen, N. & Alnajjar, K. (eds.), Multilingual Facilitation. University of Helsinki, pp. 25–30.
@inbook{3bd164164c8648b986cb14a4a8524423,
title = "Corpona – The Pythonic Way of Processing Corpora",
author = "Khalid Alnajjar and Mika H{\"a}m{\"a}l{\"a}inen",
year = "2021",
language = "English",
pages = "25−30",
editor = "Mika H{\"a}m{\"a}l{\"a}inen and Niko Partanen and Khalid Alnajjar",
booktitle = "Multilingual Facilitation",
publisher = "University of Helsinki",
address = "Finland",
}
Need NLP solutions for your business?
Our company, Rootroo, offers consulting on multilingual NLP tasks. We have a strong academic background in state-of-the-art AI solutions for every NLP need. Just contact us, we won't bite.