nafigator

Python package to convert spaCy and Stanza documents to NLP Annotation Format (NAF)

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

nafigator

DISCLAIMER - BETA PHASE

This NAF parser is currently in a beta phase.

Python package to convert text documents to NLP Annotation Format (NAF)

Free software: MIT license
Documentation: https://nafigator.readthedocs.io.

Features

Nafigator allows you to store the NLP output of custom-made spaCy and Stanza pipelines, including (intermediate) results and all processing steps, in one format.

Convert text files to .naf files that satisfy the NLP Annotation Format (NAF)
- Supported input media types: application/pdf (.pdf), text/plain (.txt)
- Supported output format: .naf (xml)
- Supported NLP pipelines: spaCy, stanza
- Supported NAF layers: raw, text, terms, entities, deps
Read .naf documents and access data as Python lists and dicts
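
The supported input media types listed above can be checked before a file is handed to the converter. The following is a minimal sketch using only the Python standard library; `is_supported_input` is a hypothetical helper for illustration and is not part of nafigator.

```python
import mimetypes

# Media types listed as supported input in the feature list above.
SUPPORTED_MEDIA_TYPES = {"application/pdf", "text/plain"}


def is_supported_input(path):
    """Return True if the file name maps to a supported media type.

    Hypothetical helper: it only inspects the file extension,
    it does not open or validate the file itself.
    """
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type in SUPPORTED_MEDIA_TYPES


for name in ("data/example.pdf", "notes.txt", "image.png"):
    print(name, is_supported_input(name))
```

Guessing from the extension is cheap but not authoritative; a file with a wrong extension would still need to be caught by the parser.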

The NAF format

Key features:

Multilayered extensible annotations;
Reproducible NLP pipelines;
NLP processor agnostic;
Compatible with RDF

Installation

To install the package from PyPI:

pip install nafigator

To install the package from GitHub:

pip install -e git+https://github.com/wjwillemse/nafigator.git#egg=nafigator

How to run

Command line interface

To parse a PDF or a txt file, run the following in the root of the project:

python -m nafigator.parse

Function calls

Example:

from nafigator.parse import generate_naf

doc = generate_naf(input="../data/example.pdf",
                   engine="stanza",
                   language="en",
                   naf_version="v3.1",
                   dtd_validation=False,
                   params={'fileDesc': {'author': 'W.J.Willemse'}},
                   nlp=None)

input: text document to convert to a NAF document
engine: pipeline processor, i.e. ‘spacy’ or ‘stanza’
language: ‘en’ or ‘nl’
naf_version: ‘v3’ or ‘v3.1’
dtd_validation: True or False (default = False)
params: dictionary with parameters (default = {})
nlp: custom-made pipeline object from spaCy or Stanza (default = None)
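
The documented choices for engine, language and naf_version can be checked before the (potentially slow) conversion starts. The sketch below is an illustration only: `check_generate_naf_args` is a hypothetical helper, not a nafigator function, and the accepted values are taken from the parameter list above (other releases may accept more).

```python
# Values documented in the parameter list above (assumed to be exhaustive
# for this release only).
ENGINES = {"spacy", "stanza"}
LANGUAGES = {"en", "nl"}
NAF_VERSIONS = {"v3", "v3.1"}


def check_generate_naf_args(engine, language, naf_version):
    """Hypothetical pre-flight check: return a list of problems (empty if none)."""
    problems = []
    if engine not in ENGINES:
        problems.append(f"unknown engine: {engine!r}")
    if language not in LANGUAGES:
        problems.append(f"unsupported language: {language!r}")
    if naf_version not in NAF_VERSIONS:
        problems.append(f"unknown naf_version: {naf_version!r}")
    return problems


print(check_generate_naf_args("stanza", "en", "v3.1"))
print(check_generate_naf_args("stanza", "de", "v2"))
```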

Get the document and processors metadata via:

doc.header

Output of doc.header of processed data/example.pdf:

{
        'fileDesc': {
                'author': 'W.J.Willemse',
                'creationtime': '2021-04-25T11:28:58UTC',
                'filename': 'data/example.pdf',
                'filetype': 'application/pdf',
                'pages': '2'},
        'public': {
                '{http://purl.org/dc/elements/1.1/}uri': 'data/example.pdf',
                '{http://purl.org/dc/elements/1.1/}format': 'application/pdf'},
                ...
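
The keys in the 'public' part of the header carry an XML namespace in braces (here the Dublin Core elements namespace). If plain key names are more convenient, the namespace can be split off. A minimal sketch, assuming the header is an ordinary dict as printed above; `split_qualified_name` is a hypothetical helper and the `public` dict below is copied from the excerpt.

```python
import re

# Excerpt of the 'public' part of doc.header shown above.
public = {
    "{http://purl.org/dc/elements/1.1/}uri": "data/example.pdf",
    "{http://purl.org/dc/elements/1.1/}format": "application/pdf",
}


def split_qualified_name(key):
    """Split '{namespace}local' into (namespace, local).

    Keys without a namespace prefix are returned as (None, key).
    """
    match = re.fullmatch(r"\{([^}]*)\}(.*)", key)
    if match is None:
        return None, key
    return match.group(1), match.group(2)


# Drop the namespace part and keep only the local names.
plain_public = {split_qualified_name(key)[1]: value for key, value in public.items()}
print(plain_public)
```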

Get the raw layer output via:

doc.raw

Output of doc.raw of processed data/example.pdf:

The cat sat on the mat. Matt was his name.

Get the text layer output via:

doc.text

Output of doc.text of processed data/example.pdf:

[
        {'text': 'The', 'page': '1', 'sent': '1', 'id': 'w1', 'length': '3', 'offset': '0'},
        {'text': 'cat', 'page': '1', 'sent': '1', 'id': 'w2', 'length': '3', 'offset': '4'},
        {'text': 'sat', 'page': '1', 'sent': '1', 'id': 'w3', 'length': '3', 'offset': '8'},
        {'text': 'on', 'page': '1', 'sent': '1', 'id': 'w4', 'length': '2', 'offset': '12'},
        {'text': 'the', 'page': '1', 'sent': '1', 'id': 'w5', 'length': '3', 'offset': '15'},
        {'text': 'mat', 'page': '1', 'sent': '1', 'id': 'w6', 'length': '3', 'offset': '19'},
        {'text': '.', 'page': '1', 'sent': '1', 'id': 'w7', 'length': '1', 'offset': '22'},
        {'text': 'Matt', 'page': '1', 'sent': '2', 'id': 'w8', 'length': '4', 'offset': '24'},
        {'text': 'was', 'page': '1', 'sent': '2', 'id': 'w9', 'length': '3', 'offset': '29'},
        {'text': 'his', 'page': '1', 'sent': '2', 'id': 'w10', 'length': '3', 'offset': '33'},
        {'text': 'name', 'page': '1', 'sent': '2', 'id': 'w11', 'length': '4', 'offset': '37'},
        {'text': '.', 'page': '1', 'sent': '2', 'id': 'w12', 'length': '1', 'offset': '41'}
]
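
Each entry of the text layer points back into the raw layer through 'offset' and 'length'. The sketch below illustrates that relationship on a few of the tokens shown above; it works on literal copies of the output, so nafigator itself is not needed, and `token_from_raw` is a hypothetical helper. Note that the numeric values are strings in the output and must be converted first.

```python
# Raw layer and a few text-layer entries, copied from the output above.
raw = "The cat sat on the mat. Matt was his name."
text_layer = [
    {"text": "The", "page": "1", "sent": "1", "id": "w1", "length": "3", "offset": "0"},
    {"text": "cat", "page": "1", "sent": "1", "id": "w2", "length": "3", "offset": "4"},
    {"text": "Matt", "page": "1", "sent": "2", "id": "w8", "length": "4", "offset": "24"},
    {"text": ".", "page": "1", "sent": "2", "id": "w12", "length": "1", "offset": "41"},
]


def token_from_raw(raw_text, word):
    """Slice the token out of the raw text using its offset and length."""
    start = int(word["offset"])
    end = start + int(word["length"])
    return raw_text[start:end]


for word in text_layer:
    print(word["id"], repr(token_from_raw(raw, word)))
```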

Get the terms layer output via:

doc.terms

Output of doc.terms of processed data/example.pdf:

[
        {'id': 't1', 'lemma': 'the', 'pos': 'DET', 'targets': ['w1']},
        {'id': 't2', 'lemma': 'cat', 'pos': 'NOUN', 'targets': ['w2']},
        {'id': 't3', 'lemma': 'sit', 'pos': 'VERB', 'targets': ['w3']},
        {'id': 't4', 'lemma': 'on', 'pos': 'ADP', 'targets': ['w4']},
        {'id': 't5', 'lemma': 'the', 'pos': 'DET', 'targets': ['w5']},
        {'id': 't6', 'lemma': 'mat', 'pos': 'NOUN', 'targets': ['w6']},
        {'id': 't7', 'lemma': '.', 'pos': 'PUNCT', 'targets': ['w7']},
        {'id': 't8', 'lemma': 'Matt', 'pos': 'PROPN', 'targets': ['w8']},
        {'id': 't9', 'lemma': 'be', 'pos': 'AUX', 'targets': ['w9']},
        {'id': 't10', 'lemma': 'he', 'pos': 'PRON', 'targets': ['w10']},
        {'id': 't11', 'lemma': 'name', 'pos': 'NOUN', 'targets': ['w11']},
        {'id': 't12', 'lemma': '.', 'pos': 'PUNCT', 'targets': ['w12']}
]
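
The 'targets' of a term are ids of entries in the text layer, so the two layers can be joined to see the surface form next to the lemma. A minimal sketch on literal excerpts of the output above; `surface_form` is a hypothetical helper, not a nafigator function.

```python
# Excerpts from the text and terms layers shown above.
text_layer = [
    {"text": "sat", "id": "w3"},
    {"text": "was", "id": "w9"},
    {"text": "his", "id": "w10"},
]
terms_layer = [
    {"id": "t3", "lemma": "sit", "pos": "VERB", "targets": ["w3"]},
    {"id": "t9", "lemma": "be", "pos": "AUX", "targets": ["w9"]},
    {"id": "t10", "lemma": "he", "pos": "PRON", "targets": ["w10"]},
]

# Index the words by id so that term targets can be resolved directly.
words_by_id = {word["id"]: word["text"] for word in text_layer}


def surface_form(term, words_by_id):
    """Join the texts of all words a term points to."""
    return " ".join(words_by_id[word_id] for word_id in term["targets"])


pairs = [(surface_form(term, words_by_id), term["lemma"], term["pos"]) for term in terms_layer]
for surface, lemma, pos in pairs:
    print(f"{surface} -> {lemma} ({pos})")
```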

Get the entities layer output via:

doc.entities

Output of doc.entities of processed data/example.pdf:

[
        {'id': 'e1', 'type': 'PERSON', 'targets': ['t8']}
]
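
Entities point to terms, and terms point to words, so recovering the text of an entity takes two lookups. The sketch below follows that chain for the single entity in the output above; the data is copied from the layers shown earlier and `entity_text` is a hypothetical helper.

```python
# Excerpts of the three layers involved, copied from the outputs above.
words_by_id = {"w8": "Matt"}
terms_by_id = {
    "t8": {"id": "t8", "lemma": "Matt", "pos": "PROPN", "targets": ["w8"]},
}
entities = [
    {"id": "e1", "type": "PERSON", "targets": ["t8"]},
]


def entity_text(entity, terms_by_id, words_by_id):
    """Resolve entity -> terms -> words and join the word texts."""
    word_ids = [
        word_id
        for term_id in entity["targets"]
        for word_id in terms_by_id[term_id]["targets"]
    ]
    return " ".join(words_by_id[word_id] for word_id in word_ids)


for entity in entities:
    print(entity["id"], entity["type"], entity_text(entity, terms_by_id, words_by_id))
```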

Get the deps layer output via:

doc.deps

Output of doc.deps of processed data/example.pdf:

[
        {'from': 't2', 'to': 't1', 'rfunc': 'det'},
        {'from': 't3', 'to': 't2', 'rfunc': 'nsubj'},
        {'from': 't6', 'to': 't4', 'rfunc': 'case'},
        {'from': 't3', 'to': 't6', 'rfunc': 'obl'},
        {'from': 't6', 'to': 't5', 'rfunc': 'det'},
        {'from': 't3', 'to': 't7', 'rfunc': 'punct'},
        {'from': 't11', 'to': 't8', 'rfunc': 'nsubj'},
        {'from': 't11', 'to': 't9', 'rfunc': 'cop'},
        {'from': 't11', 'to': 't10', 'rfunc': 'nmod:poss'},
        {'from': 't11', 'to': 't12', 'rfunc': 'punct'}
]
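
The deps layer is a flat list of edges. In the output above, 'from' appears to be the head and 'to' the dependent (for example 'cat' to 'The' with 'det'); the sketch below relies on that reading, which is an assumption. It groups dependents per head and finds the sentence roots, i.e. terms that are never a dependent.

```python
from collections import defaultdict

# The deps layer, copied from the output above.
deps = [
    {"from": "t2", "to": "t1", "rfunc": "det"},
    {"from": "t3", "to": "t2", "rfunc": "nsubj"},
    {"from": "t6", "to": "t4", "rfunc": "case"},
    {"from": "t3", "to": "t6", "rfunc": "obl"},
    {"from": "t6", "to": "t5", "rfunc": "det"},
    {"from": "t3", "to": "t7", "rfunc": "punct"},
    {"from": "t11", "to": "t8", "rfunc": "nsubj"},
    {"from": "t11", "to": "t9", "rfunc": "cop"},
    {"from": "t11", "to": "t10", "rfunc": "nmod:poss"},
    {"from": "t11", "to": "t12", "rfunc": "punct"},
]

# Group the dependents of every head, keeping the order of the layer.
children = defaultdict(list)
for dep in deps:
    children[dep["from"]].append((dep["to"], dep["rfunc"]))

# A root is a head that never occurs as a dependent.
heads = {dep["from"] for dep in deps}
dependents = {dep["to"] for dep in deps}
roots = sorted(heads - dependents, key=lambda term_id: int(term_id[1:]))

print("roots:", roots)
for root in roots:
    print(root, children[root])
```

For the example document this yields one root per sentence ('sat' and 'name').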

Get the formats layer output via:

doc.formats

Output of doc.formats:

[
        {'length': '45', 'offset': '0', 'textboxes': [
                {'textlines': [
                        {'texts': [
                                {'font': 'CIDFont+F1',
                                 'size': '12.000',
                                 'length': '42',
                                 'offset': '0',
                                 'text': 'The cat sat on the mat. Matt was his name.'}]
                        }
                ]}
        ]}
]
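
The formats layer is nested: each top-level entry holds textboxes, which hold textlines, which hold the individual text runs with font and size. A minimal sketch for flattening that nesting into a list of runs, using a literal copy of the output above with balanced brackets; `iter_text_runs` is a hypothetical helper, and treating each top-level entry as one page is an assumption.

```python
# The formats layer, copied from the output above.
formats = [
    {"length": "45", "offset": "0", "textboxes": [
        {"textlines": [
            {"texts": [
                {"font": "CIDFont+F1",
                 "size": "12.000",
                 "length": "42",
                 "offset": "0",
                 "text": "The cat sat on the mat. Matt was his name."}
            ]}
        ]}
    ]}
]


def iter_text_runs(formats_layer):
    """Yield every text run, walking textboxes -> textlines -> texts."""
    for entry in formats_layer:  # assumed to be one entry per page
        for textbox in entry.get("textboxes", []):
            for textline in textbox.get("textlines", []):
                yield from textline.get("texts", [])


# Sizes are strings in the output, so convert them before comparing.
runs = [(run["font"], float(run["size"]), run["text"]) for run in iter_text_runs(formats)]
for font, size, text in runs:
    print(font, size, repr(text))
```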

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.1.0 (2021-03-13)

First release on PyPI.

Release history

- 0.1.64: Aug 31, 2023
- 0.1.63: Apr 11, 2023
- 0.1.62: Mar 29, 2023
- 0.1.61: Jan 9, 2023
- 0.1.60: Dec 22, 2022
- 0.1.59: Dec 8, 2022
- 0.1.58: Dec 8, 2022
- 0.1.57: Nov 30, 2022
- 0.1.55: Nov 21, 2022
- 0.1.54: Nov 17, 2022
- 0.1.53: Nov 14, 2022
- 0.1.52: Oct 19, 2022
- 0.1.51: Oct 11, 2022
- 0.1.50: Sep 5, 2022
- 0.1.49: Sep 2, 2022
- 0.1.48: Aug 30, 2022
- 0.1.47: Aug 24, 2022
- 0.1.46: Jul 26, 2022
- 0.1.45: Jul 14, 2022
- 0.1.44: Jun 17, 2022
- 0.1.43: Apr 29, 2022
- 0.1.42: Apr 6, 2022
- 0.1.41: Mar 28, 2022
- 0.1.40: Mar 1, 2022
- 0.1.39: Feb 13, 2022
- 0.1.38: Jan 30, 2022
- 0.1.37: Nov 4, 2021
- 0.1.36: Oct 25, 2021
- 0.1.35: Oct 14, 2021
- 0.1.34: Oct 4, 2021
- 0.1.33: Sep 6, 2021
- 0.1.32: Sep 6, 2021
- 0.1.31: Aug 20, 2021
- 0.1.30: Aug 11, 2021
- 0.1.29: Aug 9, 2021
- 0.1.28: Aug 9, 2021
- 0.1.27: Aug 2, 2021
- 0.1.26: Jul 29, 2021
- 0.1.25: Jul 25, 2021
- 0.1.24: Jul 16, 2021
- 0.1.23: May 29, 2021
- 0.1.22: May 29, 2021
- 0.1.21: May 25, 2021
- 0.1.20: May 12, 2021
- 0.1.19: May 11, 2021
- 0.1.18: May 9, 2021
- 0.1.17: May 9, 2021
- 0.1.16: May 4, 2021
- 0.1.15: May 3, 2021
- 0.1.14: May 3, 2021
- 0.1.13: May 2, 2021
- 0.1.12: Apr 30, 2021
- 0.1.11: Apr 29, 2021
- 0.1.10: Apr 29, 2021
- 0.1.9: Apr 29, 2021
- 0.1.8: Apr 29, 2021
- 0.1.7: Apr 28, 2021 (this version)
- 0.1.6: Apr 26, 2021
- 0.1.5: Apr 26, 2021
- 0.1.4: Apr 25, 2021
- 0.1.3: Apr 22, 2021
- 0.1.2: Apr 20, 2021
- 0.1.1: Apr 12, 2021
- 0.1.0: Apr 8, 2021

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

nafigator-0.1.7-py2.py3-none-any.whl (24.7 kB)

Uploaded Apr 28, 2021 (Python 2, Python 3)

File details

Details for the file nafigator-0.1.7-py2.py3-none-any.whl.

File metadata

Download URL: nafigator-0.1.7-py2.py3-none-any.whl
Upload date: Apr 28, 2021
Size: 24.7 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.6.10

File hashes

Hashes for nafigator-0.1.7-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`da35b5fcf0e0e65c6aa6b973b50be96fabbb7f80e00a5dcd3eb9e718f7bf2244`
MD5	`ca96cd4a2fa792aa905a19ba2b7acc81`
BLAKE2b-256	`b13256f308efad47a589c24d659547878ecb05b20895968634fdba2d03ed0ef5`
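
The SHA256 digest listed above can be used to check a downloaded wheel before installing it. The following is a minimal sketch using only the standard library; `sha256_of_file` and `matches_expected` are hypothetical helper names, and the path in the usage comment assumes the wheel was saved under its original file name in the current directory.

```python
import hashlib

# SHA256 digest of nafigator-0.1.7-py2.py3-none-any.whl, as listed above.
EXPECTED_SHA256 = "da35b5fcf0e0e65c6aa6b973b50be96fabbb7f80e00a5dcd3eb9e718f7bf2244"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare the digest of the file with the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


# Usage, assuming the wheel is in the current directory:
# matches_expected("nafigator-0.1.7-py2.py3-none-any.whl")
```

pip can perform the same check itself when a requirements file pins the hash, which is usually preferable to a manual comparison.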
