Python parser for USFM files, based on tree-sitter-usfm3

USFM-Grammar

A Python library that facilitates:

  • Parsing and validation of USFM files using tree-sitter-usfm3
  • Conversion of USFM files to other formats (USX, dict, list, etc.)
  • Extraction of specific contents from USFM files, such as scripture alone (clean verses) or notes (footnotes, cross-references)

Built on Python 3.10

Installation

pip install usfm-grammar

This requires a C compiler. On Windows, Microsoft Visual C++ 14.0 or above is required. It is recommended that you update pip, setuptools and wheel.

Usage

By importing the library in Python code

from usfm_grammar import USFMParser, Filter

# input_usfm_str = open("sample.usfm","r", encoding='utf8').read()
input_usfm_str = '''
\\id GEN
\\c 1
\\p
\\v 1 test verse
'''

my_parser = USFMParser(input_usfm_str)

errors = my_parser.errors
print(errors)

To convert to USX

from lxml import etree

usx_elem = my_parser.to_usx() # default filter=ALL
print(etree.tostring(usx_elem, encoding="unicode", pretty_print=True))

To convert to Dict/USJ

output = my_parser.to_usj() # default all markers

# filters out specified markers from output
# output = my_parser.to_usj(exclude_markers=['s1','h', 'toc1','toc2','mt'])

# retains only specified contents from output
# output = my_parser.to_usj(include_markers=['id', 'c', 'v']) 

# use predefined marker groups instead of listing them one by one
# output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

# for a flattened JSON removing nesting brought in by paragraphs, lists, quotes, tables and character level markups
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)

# To NOT concatenate text extracted from different markers
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS, combine_texts=False) 

print(output)
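The dict returned by to_usj() can be inspected with ordinary Python. The sketch below assumes a USJ-like layout in which every node is either a plain text string or a dict with a "marker" key and an optional "content" list. The sample dict is hand-written for illustration (it is not actual library output), and collect_markers is a hypothetical helper, not part of usfm_grammar.

```python
from collections import Counter

# Hand-written sample in the assumed USJ shape (not real library output).
sample_usj = {
    "type": "USJ",
    "version": "3.1",
    "content": [
        {"type": "book", "marker": "id", "code": "GEN", "content": []},
        {"type": "chapter", "marker": "c", "number": "1"},
        {
            "type": "para",
            "marker": "p",
            "content": [
                {"type": "verse", "marker": "v", "number": "1"},
                "test verse",
            ],
        },
    ],
}


def collect_markers(node, counts=None):
    """Count how often each marker occurs in a USJ-like tree."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        if "marker" in node:
            counts[node["marker"]] += 1
        # Plain strings have no children, so only dicts are descended into.
        for child in node.get("content", []):
            collect_markers(child, counts)
    return counts


marker_counts = collect_markers(sample_usj)
print(dict(marker_counts))
```

A count like this may be a quick way to check which markers survive a given include_markers or exclude_markers setting.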

To understand how exclude_markers, include_markers, combine_texts and Filter work, refer to the section "Filtering on USJ".

To save as json

import json
dict_output = my_parser.to_usj()
with open("file_path.json", "w", encoding='utf-8') as fp:
    json.dump(dict_output, fp)

To convert to List or table like format

list_output = my_parser.to_list()
# list_output = my_parser.to_list(include_markers=Filter.BCV+Filter.TEXT)

table_output = "\n".join(["\t".join(row) for row in list_output])
print(table_output)
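Joining cells with "\t" works as long as every cell is a string that contains no tab or newline. Where that cannot be guaranteed, the standard csv module is a possible alternative, since it converts and quotes cells as needed. The rows below are hand-written in an assumed shape (a header row followed by data rows), not actual to_list() output, and rows_to_tsv is a hypothetical helper.

```python
import csv
import io

# Hand-written rows in an assumed table shape (not real library output).
list_output = [
    ["Book", "Chapter", "Verse", "Text"],
    ["GEN", 1, 1, "test verse"],
]


def rows_to_tsv(rows):
    """Serialise rows as tab-separated text, quoting cells only where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        # str() guards against non-string cells such as integers.
        writer.writerow([str(cell) for cell in row])
    return buffer.getvalue()


table_output = rows_to_tsv(list_output)
print(table_output)
```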

To round trip with USJ

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

Warning: There will be differences between the original USFM and the generated one:

  1. Spaces and line breaks may differ
  2. Default attributes will be given their names
  3. Closing markers may be newly added
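Because spacing and line breaks can change in a round trip, a plain string comparison of the two USFM texts is likely to fail even when the content is the same. One possible approach is to collapse whitespace before comparing. The sketch below uses only the standard library; normalize_usfm_whitespace is a hypothetical helper and the "regenerated" string is a made-up example, not actual library output. It does not account for the other two kinds of difference (named default attributes and added closing markers).

```python
import re


def normalize_usfm_whitespace(usfm_text):
    """Collapse every run of spaces, tabs and newlines into a single space."""
    return re.sub(r"\s+", " ", usfm_text).strip()


original = "\\id GEN\n\\c 1\n\\p\n\\v 1 test verse\n"
# Made-up stand-in for a regenerated USFM string with different line breaks.
regenerated = "\\id GEN\n\\c 1\n\\p \\v 1 test verse"

same_content = (
    normalize_usfm_whitespace(original) == normalize_usfm_whitespace(regenerated)
)
print(same_content)
```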

To remove unwanted markers from USFM

from usfm_grammar import USFMParser, Filter, USFMGenerator

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

USJ to USX or Table

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.to_usx())
# print(my_parser2.to_list())

USX to USFM, USJ or Table

from usfm_grammar import USFMParser, Filter
from lxml import etree

test_xml_file = "sample_usx.xml"
with open(test_xml_file, 'r', encoding='utf-8') as usx_file:
    usx_str = usx_file.read()
    usx_obj = etree.fromstring(usx_str)

    my_parser = USFMParser(from_usx=usx_obj)
    print(my_parser.usfm)
    # print(my_parser.to_usj())
    # print(my_parser.to_list())

From CLI

usage: usfm-grammar [-h] [--in_format {usfm,usj,usx}]
                    [--out_format {usj,table,syntax-tree,usx,markdown,usfm}]
                    [--include_markers {book_headers,titles,...}]
                    [--exclude_markers {book_headers,titles,...}]
                    [--csv_col_sep CSV_COL_SEP] [--csv_row_sep CSV_ROW_SEP]
                    [--ignore_errors] [--combine_text]
                    infile

Uses the tree-sitter-usfm grammar to parse and convert USFM to Syntax-tree,
JSON, CSV, USX etc.

positional arguments:
  infile                input usfm or usj file

options:
  -h, --help            show this help message and exit
  --in_format {usfm,usj}
                        input file format
  --out_format {usj,table,syntax-tree,usx,markdown,usfm}
                        output format
  --include_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of of contents to be included
  --exclude_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of of contents to be included
  --csv_col_sep CSV_COL_SEP
                        column separator or delimiter. Only useful with
                        format=table.
  --csv_row_sep CSV_ROW_SEP
                        row separator or delimiter. Only useful with
                        format=table.
  --ignore_errors       to get some output from successfully parsed portions
  --combine_text        to be used along with exclude_markers or
                        include_markers, to concatinate the consecutive text
                        snippets, from different components, or not

Example

>>> python3 -m usfm_grammar sample.usfm --out_format usx

>>> usfm-grammar sample.usfm

>>> usfm-grammar sample.usfm --out_format usx

>>> usfm-grammar sample.usfm --include_markers bcv --include_markers text --include_markers s

>>> usfm-grammar sample-usj.json --out_format usfm
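When the CLI is driven from a script, the command can be assembled as an argument list. The sketch below only builds the list, repeating --include_markers once per marker as in the example above; build_usfm_grammar_command is a hypothetical helper, and actually running the command requires usfm-grammar to be installed.

```python
import shlex


def build_usfm_grammar_command(infile, out_format=None,
                               include_markers=(), exclude_markers=()):
    """Assemble a usfm-grammar command line as a list of arguments."""
    command = ["usfm-grammar", infile]
    if out_format is not None:
        command += ["--out_format", out_format]
    # Each marker gets its own flag, mirroring the CLI example above.
    for marker in include_markers:
        command += ["--include_markers", marker]
    for marker in exclude_markers:
        command += ["--exclude_markers", marker]
    return command


command = build_usfm_grammar_command(
    "sample.usfm", include_markers=["bcv", "text", "s"]
)
print(shlex.join(command))

# To execute it (needs usfm-grammar installed):
# import subprocess
# subprocess.run(command, check=True)
```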

Filtering on USJ

Filtering on USJ, the JSON output, is a feature that allows data extraction, markup cleaning etc. The arguments exclude_markers and include_markers of the method USFMParser.to_usj() make this possible. USFMParser.to_list() also accepts these inputs and performs similar operations. There are CLI versions of these arguments as well, to replicate the filtering feature there.

  • include_markers

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to None. When provided, only the listed markers are included in the output. include_markers is applied before exclude_markers.

  • exclude_markers

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to None. When provided, all markers except those listed are included in the output.

  • combine_texts

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to True. After filtering out markers like paragraphs and characters, we are left with the texts from within them, provided 'text-in-excluded-parent' is not also excluded. These text snippets may appear as separate components in the contents list. When this option is True, consecutive text snippets are concatenated. The concatenation is done in a punctuation- and space-aware manner. Users who need more control over space handling, or who for any other reason prefer the text snippets as separate components in the output, can set this to False.

  • usfm_grammar.Filter

    This class provides a set of enums that can be passed to exclude_markers and include_markers, so that users do not have to list individual markers. The class has the following options:

      BOOK_HEADERS : identification and introduction markers
      TITLES : section headings and associated markers
      COMMENTS : comment markers like \rem
      PARAGRAPHS : paragraph markers like \p, poetry markers, list and table markers
      CHARACTERS : all character-level markup like \em, \w, \wj etc. and their nested versions with +
      NOTES : footnote, cross-reference and their content markers
      STUDY_BIBLE : \esb and \cat
      BCV : \id, \c and \v
      TEXT : 'text-in-excluded-parent'
    

    To inspect which markers are in each of these options, simply print it, for example print(Filter.TITLES). The options can be used individually or concatenated to get the desired filtering of markers and data:

    output = my_parser.to_usj(include_markers=Filter.BCV)
    output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)
    output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)
    
  • Inner contents of excluded markers

    For markers like \p, \q etc., excluding them only removes them from the hierarchy; the inner contents that come inside them, such as \v and text, are retained. But for certain other markers like \f, \x, \esb etc., excluding them also excludes their inner contents. The following is the set of all markers whose inner contents are discarded if they are mentioned in exclude_markers or not included in include_markers:

    BOOK_HEADERS, TITLES, COMMENTS, NOTES, STUDY_BIBLE
    

    Warning: Generally, it is recommended NOT to use exclude_markers and include_markers together, as this can lead to unexpected behaviour and data loss. For instance, if include_markers has \fk and exclude_markers has \f, the output will not contain \fk, as all inner contents of \f are discarded.
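The two exclusion behaviours described above (hoisting inner contents versus discarding them) and the effect of combine_texts can be pictured with a toy model. This is only a sketch under simplifying assumptions, not the library's implementation: nodes are plain dicts with "marker" and optional "content", only \f, \x and \esb are treated as discard-with-contents markers, and text is joined with a single space rather than in the punctuation-aware way the library describes. All function names are hypothetical.

```python
# Toy model only: the real marker groups and algorithm live inside usfm_grammar.
DISCARD_WITH_CONTENTS = {"f", "x", "esb"}  # markers named in the text above


def exclude_markers_toy(nodes, exclude):
    """Return a filtered copy of a list of nodes (dicts or plain text strings)."""
    kept = []
    for node in nodes:
        if isinstance(node, str):
            kept.append(node)
            continue
        children = exclude_markers_toy(node.get("content", []), exclude)
        if node["marker"] not in exclude:
            new_node = dict(node)
            if "content" in node:
                new_node["content"] = children
            kept.append(new_node)
        elif node["marker"] not in DISCARD_WITH_CONTENTS:
            kept.extend(children)  # drop only the wrapper, keep what was inside
        # Otherwise the marker is note-like: its inner contents are dropped too.
    return kept


def combine_texts_toy(nodes):
    """Merge consecutive text snippets, joining them with a single space."""
    combined = []
    for node in nodes:
        if isinstance(node, str) and combined and isinstance(combined[-1], str):
            combined[-1] = combined[-1].rstrip() + " " + node.lstrip()
        else:
            combined.append(node)
    return combined


sample = [
    {"marker": "p", "content": [
        {"marker": "v", "number": "1"},
        "In the beginning",
        {"marker": "f", "content": ["a footnote"]},
        {"marker": "nd", "content": ["God"]},
        "created.",
    ]},
]

filtered = exclude_markers_toy(sample, exclude={"p", "f", "nd"})
print(filtered)
print(combine_texts_toy(filtered))
```

In this toy run the \p and \nd wrappers disappear while their text stays, whereas the footnote text is removed together with \f.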
