A library for reading corpora.
corpona
corpona is a library for processing corpora formats (e.g. XML and JSON).
The library can be installed with pip: pip install -U corpona
Examples
Reading NewsML XML format
from corpona import XML
d = XML.parse_xml('2660341.xml')
print(f"Guid: {d.guid}") # access tag attributes as Python attributes
print(f"Language: {d.attributes['xml:lang']}") # attribute names containing special characters (such as ':') are read from d.attributes
contentMeta = d['contentMeta'][0]
print(f"Urgency: {contentMeta['urgency']}")
print(f"Headline: {contentMeta['headline']}")
print(f"Subject: {contentMeta['subject'][0]['name']}")
print("Genres: {}".format(", ".join(g['name'].text for g in contentMeta['genre'])))
print()
content_body = d['contentSet'][0]['inlineXML'][0]['html'][0]['body'][0]
print("Content: ")
for p in content_body['p']:
    print(p)
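For comparison, here is a minimal sketch of reading the same kind of attributes with Python's standard library. It is not part of corpona, and the sample document is a hypothetical stand-in for a NewsML file. It shows why a name such as xml:lang needs special handling: ElementTree expands the xml prefix into its full namespace URI.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the reserved "xml" prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical, heavily simplified stand-in for a NewsML document.
sample = (
    '<newsItem guid="urn:example:1" xml:lang="en">'
    "<contentMeta><urgency>4</urgency></contentMeta>"
    "</newsItem>"
)

root = ET.fromstring(sample)

# Plain attributes are looked up by their name.
guid = root.attrib["guid"]

# Prefixed attributes are stored under "{namespace-uri}local-name".
language = root.attrib[f"{{{XML_NS}}}lang"]

# Child elements are reached with a path expression.
urgency = root.find("contentMeta/urgency").text

print(f"Guid: {guid}")
print(f"Language: {language}")
print(f"Urgency: {urgency}")
```

With the standard library the caller has to spell out the namespace URI; the corpona example above reads the same attribute under its prefixed name, xml:lang.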
Getting a Summary of an XML/JSON
from corpona import XML
from corpona import summarize
from pprint import pprint
d = XML.parse_xml('data.xml', namespaces={'http://www.w3.org/XML/1998/namespace': 'xml', })
pprint(summarize(d), indent=4)
pprint(summarize([
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]), indent=4)
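The output format of summarize is not shown here. As a rough illustration of what summarizing a list of records can mean, the sketch below counts how often each key occurs and which value types it holds. The helper name summarize_records_sketch is hypothetical, and this is neither corpona's implementation nor its output format.

```python
from collections import Counter, defaultdict


def summarize_records_sketch(records):
    """Hypothetical helper: per key, count occurrences and value types."""
    summary = defaultdict(lambda: {"count": 0, "types": Counter()})
    for record in records:
        for key, value in record.items():
            summary[key]["count"] += 1
            summary[key]["types"][type(value).__name__] += 1
    # Convert to plain dicts so the result prints cleanly.
    return {
        key: {"count": info["count"], "types": dict(info["types"])}
        for key, info in summary.items()
    }


records = [
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]
result = summarize_records_sketch(records)
print(result)
```

A structural overview of this kind is useful for checking which fields a corpus actually contains before writing code that depends on them.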
Find children
from corpona import find_child
data = {"key":["list_item", {"key2":"oo"}, {"key2":"bbb"}]}
print(find_child(data, ["key", "key2"]))
print(find_child(data, ["key", "key3"], default_value="ok"))
>> ['oo', 'bbb']
>> ['ok']
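The behaviour shown above can be approximated in plain Python. The following is a minimal sketch under stated assumptions, not corpona's implementation: the function name find_child_sketch is hypothetical, lists are assumed to be searched item by item, non-dict items are skipped, and the default is returned once when nothing matches.

```python
def find_child_sketch(data, path, default_value=None):
    """Hypothetical re-creation of the lookup shown above (not corpona's code)."""
    nodes = [data]
    for key in path:
        next_nodes = []
        for node in nodes:
            # Treat a list as a collection of candidate children.
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                # Only dicts that contain the key contribute a value.
                if isinstance(candidate, dict) and key in candidate:
                    next_nodes.append(candidate[key])
        nodes = next_nodes
    # Assumption: when nothing matched, return the default wrapped in a list.
    return nodes if nodes else [default_value]


data = {"key": ["list_item", {"key2": "oo"}, {"key2": "bbb"}]}
print(find_child_sketch(data, ["key", "key2"]))
print(find_child_sketch(data, ["key", "key3"], default_value="ok"))
```

On the sample data this sketch reproduces the two results listed above; how the library treats other inputs may differ.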
Cite
If you use the library in an academic paper, please cite it:
Alnajjar, K. & Hämäläinen, M., (2021) Corpona – The Pythonic Way of Processing Corpora. In Hämäläinen, M., Partanen, N. & Alnajjar, K. (eds.) Multilingual Facilitation. University of Helsinki, p. 25−30
@inbook{3bd164164c8648b986cb14a4a8524423,
title = "Corpona – The Pythonic Way of Processing Corpora",
author = "Khalid Alnajjar and Mika H{\"a}m{\"a}l{\"a}inen",
year = "2021",
language = "English",
pages = "25−30",
editor = "Mika H{\"a}m{\"a}l{\"a}inen and Niko Partanen and Khalid Alnajjar",
booktitle = "Multilingual Facilitation",
publisher = "University of Helsinki",
address = "Finland",
}
Need NLP solutions for your business?
Our company, Rootroo, offers consulting related to multilingual NLP tasks. We have a strong academic background in state-of-the-art AI solutions for every NLP need. Just contact us; we won't bite.