Python parser for USFM files, based on tree-sitter-usfm3

USFM-Grammar

A Python library that facilitates:

  • Parsing and validation of USFM files using tree-sitter-usfm3
  • Conversion of USFM files to other formats (USX, dict, list, etc.)
  • Extraction of specific contents from USFM files, such as scripture alone (clean verses) or notes (footnotes, cross-references)

Built on Python 3.10

Installation

pip install usfm-grammar

This requires a C compiler. On Windows, Microsoft Visual C++ 14.0 or above is required. It is recommended that you update pip, setuptools and wheel.

Usage

By importing the library in Python code

from usfm_grammar import USFMParser, Filter

# input_usfm_str = open("sample.usfm","r", encoding='utf8').read()
input_usfm_str = '''
\\id GEN
\\c 1
\\p
\\v 1 test verse
'''

my_parser = USFMParser(input_usfm_str)

errors = my_parser.errors
print(errors)
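
The snippet above only prints `errors`. A script may instead want to stop before converting a file that failed to parse. The following is a minimal sketch: `ensure_clean` is a hypothetical helper, not part of usfm-grammar, and it assumes only that `errors` is empty (or None) when parsing succeeded. The shape of each error entry is not specified here, so entries are formatted with `repr()`.

```python
def ensure_clean(errors):
    """Raise ValueError if the parser reported any errors.

    Hypothetical helper: assumes `errors` is an empty sequence (or None)
    on success. Each entry is shown with repr() since its shape may vary.
    """
    if errors:
        details = "\n".join(repr(err) for err in errors)
        raise ValueError(f"USFM input has {len(errors)} error(s):\n{details}")


# Usage with the parser created above:
# ensure_clean(my_parser.errors)
```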

To convert to USX

from lxml import etree

usx_elem = my_parser.to_usx() # default filter=ALL
print(etree.tostring(usx_elem, encoding="unicode", pretty_print=True))

To convert to Dict

output = my_parser.to_usj() # default all markers

# filters out specified markers from output
# output = my_parser.to_usj(exclude_markers=['s1','h', 'toc1','toc2','mt'])

# retains only specified contents from output
# output = my_parser.to_usj(include_markers=['id', 'c', 'v']) 

# use predefined marker groups instead of listing them one by one
# output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

# for a flattened JSON removing nesting brought in by paragraphs, lists, quotes, tables and character level markups
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)

# To NOT concatenate text extracted from different markers
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS, combine_texts=False) 

print(output)

To understand how exclude_markers, include_markers, combine_texts and Filter work, refer to the section on filtering on USJ.
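
Once converted, the dict can be walked recursively to pull out its text. This is a sketch under assumptions: it supposes a USJ-like layout in which each object has a "marker" and an optional "content" list that mixes plain strings with nested objects. `sample_usj` is a hand-written stand-in, not actual parser output, and `iter_text` is a hypothetical helper.

```python
def iter_text(node):
    """Yield every plain-text string found in a USJ-like structure, in order."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for child in node.get("content", []):
            yield from iter_text(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_text(child)


# Hand-written stand-in for my_parser.to_usj() (assumed shape, not real output).
sample_usj = {
    "type": "USJ",
    "content": [
        {"type": "book", "marker": "id", "code": "GEN", "content": []},
        {"type": "chapter", "marker": "c", "number": "1"},
        {
            "type": "para",
            "marker": "p",
            "content": [
                {"type": "verse", "marker": "v", "number": "1"},
                "test verse",
            ],
        },
    ],
}

texts = list(iter_text(sample_usj))
print(texts)  # ['test verse']
```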

To save as JSON

import json
dict_output = my_parser.to_usj()
with open("file_path.json", "w", encoding='utf-8') as fp:
	json.dump(dict_output, fp)
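
Scripture text is often in non-Latin scripts, which `json.dump` escapes as `\uXXXX` sequences by default. The sketch below writes readable JSON and loads it back using only the standard library. `sample_usj` stands in for the output of `my_parser.to_usj()`, and the file name is arbitrary.

```python
import json
import os
import tempfile

# Stand-in for my_parser.to_usj(); any JSON-serializable dict works here.
sample_usj = {"type": "USJ", "content": ["आदि में परमेश्वर"]}

path = os.path.join(tempfile.mkdtemp(), "sample-usj.json")

# ensure_ascii=False keeps non-ASCII characters readable in the file;
# indent=2 pretty-prints the nested structure.
with open(path, "w", encoding="utf-8") as fp:
    json.dump(sample_usj, fp, ensure_ascii=False, indent=2)

# Read the file back to confirm nothing was lost.
with open(path, "r", encoding="utf-8") as fp:
    loaded = json.load(fp)

print(loaded == sample_usj)
```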

To convert to a list or table-like format

list_output = my_parser.to_list() 
# list_output = my_parser.to_list(include_markers=Filter.BCV+Filter.TEXT)

table_output = "\n".join(["\t".join(row) for row in list_output])
print(table_output)
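
The join above assumes that every cell is a string and that no cell contains a tab or a newline. Where that is not guaranteed, the standard `csv` module can do the same job with quoting. This is a sketch: `rows` is a stand-in for `my_parser.to_list()`, and its column layout is hypothetical.

```python
import csv
import io

# Stand-in for my_parser.to_list(); the column layout is hypothetical.
rows = [
    ["Book", "Chapter", "Verse", "Text"],
    ["GEN", 1, 1, "test verse"],
    ["GEN", 1, 2, "text with\ttab"],
]

buffer = io.StringIO()
# csv.writer converts non-string cells with str() and quotes any cell
# that contains the delimiter, so the table stays parseable.
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

table_output = buffer.getvalue()
print(table_output)
```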

To round trip with USJ

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

Warning: The generated USFM may differ from the original USFM in three ways: (1) spaces and line breaks, (2) default attributes will be given their names, and (3) closing markers may be newly added.
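
When checking a round trip, whitespace differences can be ignored by normalizing both strings before comparing them. This is a sketch only: `normalize_whitespace` is a hypothetical helper, the `regenerated` string is invented for illustration, and the approach covers only the first kind of difference (spaces and lines), not attribute names or added closing markers.

```python
def normalize_whitespace(usfm):
    """Collapse every run of spaces, tabs and newlines into one space."""
    return " ".join(usfm.split())


original = "\\id GEN\n\\c 1\n\\p\n\\v 1 test verse\n"
# Invented example of regenerated text with different spacing and line breaks.
regenerated = "\\id GEN\n\\c 1\n\\p \\v 1 test   verse"

same = normalize_whitespace(original) == normalize_whitespace(regenerated)
print(same)  # True
```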

To remove unwanted markers from USFM

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

USJ to USX or Table

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.to_usx())
# print(my_parser2.to_list())

From CLI

usage: usfm-grammar [-h] [--in_format {usfm,usj}]
                    [--out_format {usj,table,syntax-tree,usx,markdown,usfm}]
                    [--include_markers {book_headers,titles,...}]
                    [--exclude_markers {book_headers,titles,...}]
                    [--csv_col_sep CSV_COL_SEP] [--csv_row_sep CSV_ROW_SEP]
                    [--ignore_errors] [--combine_text]
                    infile

Uses the tree-sitter-usfm grammar to parse and convert USFM to Syntax-tree,
JSON, CSV, USX etc.

positional arguments:
  infile                input usfm or usj file

options:
  -h, --help            show this help message and exit
  --in_format {usfm,usj}
                        input file format
  --out_format {usj,table,syntax-tree,usx,markdown,usfm}
                        output format
  --include_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of contents to be included
  --exclude_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of contents to be excluded
  --csv_col_sep CSV_COL_SEP
                        column separator or delimiter. Only useful with
                        format=table.
  --csv_row_sep CSV_ROW_SEP
                        row separator or delimiter. Only useful with
                        format=table.
  --ignore_errors       to get some output from successfully parsed portions
  --combine_text        to be used along with exclude_markers or
                        include_markers, to concatenate the consecutive text
                        snippets, from different components, or not

Examples

$ python3 -m usfm_grammar sample.usfm --out_format usx

$ usfm-grammar sample.usfm

$ usfm-grammar sample.usfm --out_format usx

$ usfm-grammar sample.usfm --include_markers bcv --include_markers text --include_markers s

$ usfm-grammar sample-usj.json --out_format usfm
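
The same commands can be assembled and run from a Python script. The sketch below uses only the standard library. `build_cli_args` and `run_cli` are hypothetical helpers, and `run_cli` assumes that the converted output is written to standard output.

```python
import shutil
import subprocess


def build_cli_args(infile, out_format=None, include_markers=()):
    """Build the argument list for the usfm-grammar command."""
    args = ["usfm-grammar", infile]
    if out_format:
        args += ["--out_format", out_format]
    for marker in include_markers:
        # The flag is repeated once per marker, as in the examples above.
        args += ["--include_markers", marker]
    return args


def run_cli(args):
    """Run the command and return its standard output as text.

    Assumes the converted output goes to stdout; raises if the command
    is missing or exits with a non-zero status.
    """
    if shutil.which(args[0]) is None:
        raise FileNotFoundError(f"{args[0]} is not on PATH")
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


args = build_cli_args("sample.usfm", include_markers=["bcv", "text", "s"])
print(args)

# usx_text = run_cli(build_cli_args("sample.usfm", out_format="usx"))
```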

Filtering on USJ

Filtering on USJ, the JSON output, is a feature that allows data extraction, markup cleaning, etc. The exclude_markers and include_markers arguments of USFMParser.to_usj() make this possible. USFMParser.to_list() accepts the same arguments and performs similar operations. The CLI provides equivalent options for the same filtering.

  • include_markers

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when the output format is usj or table. Defaults to None. When provided, only the markers listed will be included in the output. include_markers is applied before exclude_markers.

  • exclude_markers

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when the output format is usj or table. Defaults to None. When provided, all markers except those listed will be included in the output.

  • combine_texts

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when the output format is usj or table. Defaults to True. After filtering out markers like paragraphs and characters, we are left with the texts from within them, provided 'text-in-excluded-parent' is not also excluded. These text snippets may come as separate components in the contents list. When this option is True, consecutive text snippets are concatenated. The concatenation is done in a punctuation- and space-aware manner. If users need more control over space handling, or for any other reason prefer the text snippets as separate components in the output, this can be set to False.

  • usfm_grammar.Filter

    This class provides a set of enums that can be passed to exclude_markers and include_markers, so that users need not list individual markers. The class has the following options:

      BOOK_HEADERS : identification and introduction markers
      TITLES : section headings and associated markers
      COMMENTS : comment markers like \rem
      PARAGRAPHS : paragraph markers like \p, poetry markers, list and table markers
      CHARACTERS : all character-level markups like \em, \w, \wj etc. and their nested versions with +
      NOTES : footnote, cross-reference and their content markers
      STUDY_BIBLE : \esb and \cat
      BCV : \id, \c and \v
      TEXT : 'text-in-excluded-parent'
    

    To inspect which markers are in each of these options, print the option, for example print(Filter.TITLES). The options can be used individually or concatenated to get the desired filtering of markers and data:

    output = my_parser.to_usj(include_markers=Filter.BCV)
    output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)
    output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)
    
  • Inner contents of excluded markers

    For markers like \p, \q etc., excluding them only removes them from the hierarchy and retains the inner contents, like \v and text, that come inside them. But for certain other markers like \f, \x, \esb etc., if they are excluded, their inner contents are also excluded. The following is the set of all marker groups whose inner contents are discarded if they are mentioned in exclude_markers or not included in include_markers:

    BOOK_HEADERS, TITLES, COMMENTS, NOTES, STUDY_BIBLE
    

    Warning: Generally, it is recommended NOT to use exclude_markers and include_markers together, as it could lead to unexpected behaviours and data loss. For instance, if include_markers has \fk and exclude_markers has \f, the output will not contain \fk, as all inner contents of \f will be discarded.
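
The rules described in this section can be illustrated with a toy filter over a USJ-like content list. This is a sketch of the described behaviour, not the library's implementation: `filter_content`, `is_kept` and the tiny `DISCARD_WITH_CONTENTS` set are hypothetical, and the real Filter groups contain many more markers.

```python
# Toy group: markers that are dropped together with their inner contents.
DISCARD_WITH_CONTENTS = {"f", "x", "esb"}
TEXT = "text-in-excluded-parent"


def is_kept(marker, include, exclude):
    """Apply include_markers first, then exclude_markers."""
    if include is not None and marker not in include:
        return False
    return exclude is None or marker not in exclude


def filter_content(content, include=None, exclude=None, parent_kept=True):
    """Return a filtered copy of a USJ-like content list."""
    output = []
    for item in content:
        if isinstance(item, str):
            # Text under a kept parent stays. Text that lost its parent
            # survives only if the TEXT pseudo-marker is kept.
            if parent_kept or is_kept(TEXT, include, exclude):
                output.append(item)
            continue
        marker = item["marker"]
        kept = is_kept(marker, include, exclude)
        if not kept and marker in DISCARD_WITH_CONTENTS:
            continue  # drop the marker and everything inside it
        children = filter_content(item.get("content", []), include, exclude, kept)
        if kept:
            node = {k: v for k, v in item.items() if k != "content"}
            if children:
                node["content"] = children
            output.append(node)
        else:
            output.extend(children)  # unwrap: keep only the inner contents
    return output


# Hand-written sample: a paragraph with a verse, text and a footnote.
sample = [
    {"marker": "p", "content": [
        {"marker": "v", "number": "1"},
        "In the beginning",
        {"marker": "f", "content": [{"marker": "fk", "content": ["beginning"]}]},
    ]},
]

flattened = filter_content(sample, exclude={"p"})
verses_only = filter_content(sample, include={"v", TEXT})
conflict = filter_content(sample, include={"fk"}, exclude={"f"})
print(verses_only)  # [{'marker': 'v', 'number': '1'}, 'In the beginning']
print(conflict)     # []
```

In the last call \fk is listed in the include set, yet the result is empty because \f is discarded together with everything inside it, which is the data loss the warning describes.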

Built Distributions

  • usfm_grammar-3.0.0b3-cp311-cp311-win_amd64.whl (259.8 kB): CPython 3.11, Windows x86-64
  • usfm_grammar-3.0.0b3-cp311-cp311-win32.whl (262.6 kB): CPython 3.11, Windows x86
  • usfm_grammar-3.0.0b3-cp311-cp311-musllinux_1_1_x86_64.whl (259.4 kB): CPython 3.11, musllinux: musl 1.1+ x86-64
  • usfm_grammar-3.0.0b3-cp311-cp311-musllinux_1_1_i686.whl (268.5 kB): CPython 3.11, musllinux: musl 1.1+ i686
  • usfm_grammar-3.0.0b3-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (259.1 kB): CPython 3.11, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64
  • usfm_grammar-3.0.0b3-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (268.2 kB): CPython 3.11, manylinux: glibc 2.17+ i686, manylinux: glibc 2.5+ i686
  • usfm_grammar-3.0.0b3-cp311-cp311-macosx_10_9_x86_64.whl (252.8 kB): CPython 3.11, macOS 10.9+ x86-64
  • usfm_grammar-3.0.0b3-cp310-cp310-win_amd64.whl (259.8 kB): CPython 3.10, Windows x86-64
  • usfm_grammar-3.0.0b3-cp310-cp310-win32.whl (262.6 kB): CPython 3.10, Windows x86
  • usfm_grammar-3.0.0b3-cp310-cp310-musllinux_1_1_x86_64.whl (259.4 kB): CPython 3.10, musllinux: musl 1.1+ x86-64
  • usfm_grammar-3.0.0b3-cp310-cp310-musllinux_1_1_i686.whl (268.5 kB): CPython 3.10, musllinux: musl 1.1+ i686
  • usfm_grammar-3.0.0b3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (259.1 kB): CPython 3.10, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64
  • usfm_grammar-3.0.0b3-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (268.2 kB): CPython 3.10, manylinux: glibc 2.17+ i686, manylinux: glibc 2.5+ i686
  • usfm_grammar-3.0.0b3-cp310-cp310-macosx_10_9_x86_64.whl (252.8 kB): CPython 3.10, macOS 10.9+ x86-64
