
Python parser for USFM files, based on tree-sitter-usfm3

USFM-Grammar

A Python library that facilitates:

  • Parsing and validation of USFM files using tree-sitter-usfm3
  • Conversion of USFM files to other formats (USX, dict, list, etc.)
  • Extraction of specific contents from USFM files, such as scripture alone (clean verses) or notes (footnotes, cross-references)

Built on Python 3.10

Installation

pip install usfm-grammar

This requires a C compiler. On Windows, Microsoft Visual C++ 14.0 or above is required. It is recommended that you update pip, setuptools and wheel.

Usage

By importing the library in Python code

from usfm_grammar import USFMParser, Filter

# input_usfm_str = open("sample.usfm","r", encoding='utf8').read()
input_usfm_str = '''
\\id GEN
\\c 1
\\p
\\v 1 test verse
'''

my_parser = USFMParser(input_usfm_str)

errors = my_parser.errors
print(errors)
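
Conversions are usually only meaningful for input that parsed cleanly, so it can help to check errors before converting. The sketch below assumes that errors is empty (falsy) when parsing succeeds. The helper name to_usj_or_raise is hypothetical and not part of the library. It accepts any object exposing errors and to_usj(), so a stand-in class is used in place of a real USFMParser to keep the example self-contained.

```python
def to_usj_or_raise(parser):
    """Return parser.to_usj(), or raise if the parser reported errors.

    Assumption: `parser.errors` is empty (falsy) when the input parsed cleanly.
    """
    if parser.errors:
        raise ValueError(f"USFM input has errors: {parser.errors}")
    return parser.to_usj()


class FakeParser:
    """Stand-in for USFMParser so this sketch runs without the library."""

    def __init__(self, errors):
        self.errors = errors

    def to_usj(self):
        return {"type": "USJ", "content": []}


# Clean input: the conversion goes ahead.
print(to_usj_or_raise(FakeParser(errors=[])))

# Input with errors: the helper refuses to convert.
try:
    to_usj_or_raise(FakeParser(errors=["hypothetical error entry"]))
except ValueError as exc:
    print("refused:", exc)
```

With the real library, my_parser from the example above would be passed in place of FakeParser.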

To convert to USX

from lxml import etree

usx_elem = my_parser.to_usx() # default filter=ALL
print(etree.tostring(usx_elem, encoding="unicode", pretty_print=True))

To convert to Dict/USJ

output = my_parser.to_usj() # default all markers

# filters out specified markers from output
# output = my_parser.to_usj(exclude_markers=['s1','h', 'toc1','toc2','mt'])

# retains only specified contents from output
# output = my_parser.to_usj(include_markers=['id', 'c', 'v']) 

# use predefined marker groups instead of listing them one by one
# output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

# for a flattened JSON removing nesting brought in by paragraphs, lists, quotes, tables and character level markups
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)

# To NOT concatenate text extracted from different markers
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS, combine_texts=False) 

print(output)

To understand more about how exclude_markers, include_markers, combine_texts and Filter work, refer to the section on filtering on USJ.

To save as JSON

import json
dict_output = my_parser.to_usj()
with open("file_path.json", "w", encoding='utf-8') as fp:
	json.dump(dict_output, fp)
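
Scripture text is often written in non-Latin scripts, and json.dump escapes non-ASCII characters by default. The following is a small sketch of writing human-readable JSON with the standard json module. The dictionary is a hand-written stand-in for parser output, not real to_usj() output, and the file location is arbitrary.

```python
import json
import os
import tempfile

# Hand-written stand-in for the dict returned by to_usj(); the shape is illustrative only.
dict_output = {"type": "USJ", "content": ["आदि में परमेश्वर"]}

out_path = os.path.join(tempfile.mkdtemp(), "file_path.json")
with open(out_path, "w", encoding="utf-8") as fp:
    # ensure_ascii=False keeps the original characters instead of \uXXXX escapes;
    # indent=2 makes the file easier to read and to diff.
    json.dump(dict_output, fp, ensure_ascii=False, indent=2)

with open(out_path, encoding="utf-8") as fp:
    raw = fp.read()

print(raw)
```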

To convert to List or table-like format

list_output = my_parser.to_list() 
# list_output = my_parser.to_list(include_markers=Filter.BCV+Filter.TEXT)

table_output = "\n".join(["\t".join(row) for row in list_output])
print(table_output)
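
The join above works when every cell is a string without tabs or newlines. As a hedged alternative, the standard csv module converts non-string cells and quotes cells that contain the separator. The rows below are hypothetical stand-ins for to_list() output, and the real columns may differ.

```python
import csv
import io

# Hypothetical rows standing in for to_list() output; real columns may differ.
list_output = [
    ["Book", "Chapter", "Verse", "Text"],
    ["GEN", 1, 1, "test verse"],
    ["GEN", 1, 2, "text with a\ttab"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
# Non-string cells are converted with str(); cells containing the delimiter are quoted.
writer.writerows(list_output)

table_output = buffer.getvalue()
print(table_output)
```

Reading the result back with csv.reader and the same delimiter recovers the original cell boundaries, which a plain split on tabs would not.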

To round trip with USJ

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

Warning: The generated USFM will differ from the original in three ways: (1) spaces and line breaks may change, (2) default attributes will be given their names, and (3) closing markers may be newly added.
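
When checking a round trip, a direct string comparison will therefore usually fail. The sketch below only addresses the first kind of difference (spaces and lines) by collapsing whitespace before comparing. Named default attributes and added closing markers would still show up as differences. The helper normalize_ws and the regenerated string are hypothetical, written by hand for illustration.

```python
import re


def normalize_ws(usfm: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return re.sub(r"\s+", " ", usfm).strip()


original = "\\id GEN\n\\c 1\n\\p\n\\v 1 test verse\n"
# Hand-written example of output whose line breaks differ from the input.
regenerated = "\\id GEN\n\\c 1\n\\p \\v 1 test verse"

print(original == regenerated)                              # exact comparison
print(normalize_ws(original) == normalize_ws(regenerated))  # whitespace-insensitive comparison
```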

To remove unwanted markers from USFM

from usfm_grammar import USFMParser, Filter, USFMGenerator

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

USJ to USX or Table

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.to_usx())
# print(my_parser2.to_list())

USX to USFM, USJ or Table

from usfm_grammar import USFMParser, Filter
from lxml import etree

test_xml_file = "sample_usx.xml"
with open(test_xml_file, 'r', encoding='utf-8') as usx_file:
    usx_str = usx_file.read()
    usx_obj = etree.fromstring(usx_str)

    my_parser = USFMParser(from_usx=usx_obj)
    print(my_parser.usfm)
    # print(my_parser.to_usj())
    # print(my_parser.to_list())

From CLI

usage: usfm-grammar [-h] [--in_format {usfm,usj,usx}]
                    [--out_format {usj,table,syntax-tree,usx,markdown,usfm}]
                    [--include_markers {book_headers,titles,...}]
                    [--exclude_markers {book_headers,titles,...}]
                    [--csv_col_sep CSV_COL_SEP] [--csv_row_sep CSV_ROW_SEP]
                    [--ignore_errors] [--combine_text]
                    infile

Uses the tree-sitter-usfm grammar to parse and convert USFM to Syntax-tree,
JSON, CSV, USX etc.

positional arguments:
  infile                input usfm or usj file

options:
  -h, --help            show this help message and exit
  --in_format {usfm,usj,usx}
                        input file format
  --out_format {usj,table,syntax-tree,usx,markdown,usfm}
                        output format
  --include_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of contents to be included
  --exclude_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of contents to be excluded
  --csv_col_sep CSV_COL_SEP
                        column separator or delimiter. Only useful with
                        format=table.
  --csv_row_sep CSV_ROW_SEP
                        row separator or delimiter. Only useful with
                        format=table.
  --ignore_errors       to get some output from successfully parsed portions
  --combine_text        to be used along with exclude_markers or
                        include_markers, to concatenate the consecutive text
                        snippets, from different components, or not

Example

>>> python3 -m usfm_grammar sample.usfm --out_format usx

>>> usfm-grammar sample.usfm

>>> usfm-grammar sample.usfm --out_format usx

>>> usfm-grammar sample.usfm --include_markers bcv --include_markers text --include_markers s

>>> usfm-grammar sample-usj.json --out_format usfm
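
To drive the CLI from a script, the argument list can be assembled programmatically. The sketch below uses only flags that appear in the help text above. The helper build_command is hypothetical, and the command is only executed when both the usfm-grammar executable and a sample.usfm file are present.

```python
import os
import shlex
import shutil
import subprocess


def build_command(infile, out_format=None, include_markers=()):
    """Assemble a usfm-grammar argument list from options shown in the help text."""
    cmd = ["usfm-grammar", infile]
    if out_format:
        cmd += ["--out_format", out_format]
    for marker in include_markers:
        # The flag is repeated once per marker, as in the example above.
        cmd += ["--include_markers", marker]
    return cmd


cmd = build_command("sample.usfm", out_format="usj", include_markers=["bcv", "text"])
print(shlex.join(cmd))

# Run the command only if the tool and the input file actually exist.
if shutil.which("usfm-grammar") and os.path.exists("sample.usfm"):
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    print(result.stdout)
```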

Filtering on USJ

Filtering on USJ, the JSON output, is a feature that allows data extraction, markup cleaning and similar tasks. The arguments exclude_markers and include_markers of the method USFMParser.to_usj() make this possible. USFMParser.to_list() also accepts these inputs and performs similar operations. CLI versions of these arguments replicate the filtering feature there.

  • include_markers

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to None. When provided, only the markers listed will be included in the output. include_markers is applied before exclude_markers.

  • exclude_markers

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to None. When provided, all markers except those listed will be included in the output.

  • combine_texts

    Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to True. After filtering out markers like paragraphs and characters, the texts from within them remain, provided 'text-in-excluded-parent' is not also excluded. These text snippets may appear as separate components in the contents list. When this option is True, consecutive text snippets are concatenated. The concatenation is done in a punctuation- and space-aware manner. Users who need more control over space handling, or who for any other reason prefer the text snippets as separate components in the output, can set this to False.

  • usfm_grammar.Filter

    This class provides a set of enums that can be passed as the exclude_markers and include_markers inputs, so that users need not list individual markers. The class has the following options:

      BOOK_HEADERS : identification and introduction markers
      TITLES : section headings and associated markers
      COMMENTS : comment markers like \rem
      PARAGRAPHS : paragraph markers like \p, poetry markers, list and table markers
      CHARACTERS : all character-level markups like \em, \w, \wj etc. and their nested versions with +
      NOTES : footnote, cross-reference and their content markers
      STUDY_BIBLE : \esb and \cat
      BCV : \id, \c and \v
      TEXT : 'text-in-excluded-parent'
    

    To inspect which markers each option contains, print it, for example print(Filter.TITLES). The options can be used individually or concatenated to get the desired filtering of markers and data:

    output = my_parser.to_usj(include_markers=Filter.BCV)
    output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)
    output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)
    
  • Inner contents of excluded markers

    For markers like \p and \q, excluding them only removes them from the hierarchy; the inner contents that would come inside them, such as \v and text, are retained. For certain other markers, such as \f, \x and \esb, excluding them also excludes their inner contents. The following is the set of all marker groups whose inner contents are discarded if they are listed in exclude_markers or not listed in include_markers.

    BOOK_HEADERS, TITLES, COMMENTS, NOTES, STUDY_BIBLE
    

    Warning: Generally, it is recommended NOT to use exclude_markers and include_markers together, as this could lead to unexpected behaviour and data loss. For instance, if include_markers has \fk and exclude_markers has \f, the output will not contain \fk, as all inner contents of \f will be discarded.
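
The two exclusion behaviours and the effect of combine_texts can be illustrated with a toy model. This is a sketch under stated assumptions, not the library's implementation or its exact USJ structure. Nodes are plain dicts with hypothetical "marker" and "content" keys, the set DISCARD_WITH_CONTENTS holds just the three example markers named above, and text is joined with single spaces instead of the punctuation-aware logic the library describes.

```python
# Toy model only: nodes are {"marker": ..., "content": [...]} dicts or plain strings.
DISCARD_WITH_CONTENTS = {"f", "x", "esb"}  # markers whose inner contents are dropped too


def exclude(content, excluded):
    """Drop excluded markers; hoist their children unless the marker discards contents."""
    result = []
    for item in content:
        if isinstance(item, str):
            result.append(item)
        elif item["marker"] in excluded:
            if item["marker"] not in DISCARD_WITH_CONTENTS:
                # Marker removed from the hierarchy, inner contents retained.
                result.extend(exclude(item.get("content", []), excluded))
        else:
            kept = dict(item)
            if "content" in item:
                kept["content"] = exclude(item["content"], excluded)
            result.append(kept)
    return result


def combine_texts(content):
    """Merge runs of consecutive strings (simplified: joined with single spaces)."""
    merged = []
    for item in content:
        if isinstance(item, str) and merged and isinstance(merged[-1], str):
            merged[-1] = f"{merged[-1].strip()} {item.strip()}"
        else:
            merged.append(item)
    return merged


para = {"marker": "p", "content": [
    {"marker": "v", "number": "1"},
    "In the beginning ",
    {"marker": "f", "content": ["a footnote"]},
    {"marker": "em", "content": ["God created"]},
    " the heavens.",
]}

flat = exclude([para], excluded={"p", "em", "f"})
print(flat)                 # text snippets as separate components
print(combine_texts(flat))  # consecutive snippets merged
```

In this model, excluding \p and \em keeps their text while excluding \f drops the footnote text as well, which mirrors the distinction described above.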


