Python parser for USFM files, based on tree-sitter-usfm3

USFM-Grammar

A Python library that facilitates:

  • Parsing and validation of USFM files using tree-sitter-usfm3
  • Conversion of USFM files to other formats (USX, dict, list etc)
  • Extraction of specific contents from USFM files, such as scripture alone (clean verses) or notes (footnotes, cross-references)

Built on Python 3.10

Installation

pip install usfm-grammar

This requires a C compiler. On Windows, Microsoft Visual C++ 14.0 or above is required. It is recommended that you update pip, setuptools and wheel.

Usage

By importing the library in Python code

from usfm_grammar import USFMParser, Filter

# input_usfm_str = open("sample.usfm","r", encoding='utf8').read()
input_usfm_str = '''
\\id GEN
\\c 1
\\p
\\v 1 test verse
'''

my_parser = USFMParser(input_usfm_str)

errors = my_parser.errors
print(errors)
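
It is often useful to stop before converting when the parse was not clean. The following is a minimal sketch, assuming that `errors` is empty (falsy) when the input parsed without problems. The helper `require_clean_parse` and the stand-in objects are hypothetical and are used only so that the snippet runs without the library installed.

```python
from types import SimpleNamespace

def require_clean_parse(parser):
    """Raise ValueError if the parser reported any errors, else return the parser."""
    if parser.errors:
        raise ValueError(f"USFM input has {len(parser.errors)} error(s): {parser.errors}")
    return parser

# Hypothetical stand-ins for USFMParser instances (assumption: .errors is a list)
clean = SimpleNamespace(errors=[])
broken = SimpleNamespace(errors=["(hypothetical) unexpected marker at line 3"])

require_clean_parse(clean)  # passes silently
try:
    require_clean_parse(broken)
except ValueError as exc:
    message = str(exc)
    print(message)
```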

To convert to USX

from lxml import etree

usx_elem = my_parser.to_usx() # default filter=ALL
print(etree.tostring(usx_elem, encoding="unicode", pretty_print=True))

To convert to Dict/USJ

output = my_parser.to_usj() # default all markers

# filters out specified markers from output
# output = my_parser.to_usj(exclude_markers=['s1','h', 'toc1','toc2','mt'])

# retains only specified contents from output
# output = my_parser.to_usj(include_markers=['id', 'c', 'v']) 

# use predefined marker groups instead of listing them one by one
# output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

# for a flattened JSON removing nesting brought in by paragraphs, lists, quotes, tables and character level markups
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)

# To NOT concatenate text extracted from different markers
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS, combine_texts=False) 

print(output)

To understand more about how exclude_markers, include_markers, combine_texts and Filter work, refer to the section on filtering on USJ.

To save as JSON

import json
dict_output = my_parser.to_usj()
with open("file_path.json", "w", encoding='utf-8') as fp:
	json.dump(dict_output, fp)
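
By default `json.dump` escapes non-ASCII characters, which makes text in non-Latin scripts hard to read in the saved file. The sketch below assumes the USJ output is a plain dict of JSON-serialisable values; the sample dict is a hypothetical stand-in, not real parser output.

```python
import json
import os
import tempfile

# Hypothetical stand-in for my_parser.to_usj(); the real structure may differ.
dict_output = {
    "type": "USJ",
    "content": [{"type": "verse", "marker": "v", "number": "1"}, "आदि में"],
}

out_path = os.path.join(tempfile.mkdtemp(), "file_path.json")
with open(out_path, "w", encoding="utf-8") as fp:
    # ensure_ascii=False writes non-ASCII text as-is; indent=2 pretty-prints it
    json.dump(dict_output, fp, ensure_ascii=False, indent=2)

# Read the file back to confirm nothing was lost
with open(out_path, "r", encoding="utf-8") as fp:
    reloaded = json.load(fp)
```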

To convert to List or table like format

list_output = my_parser.to_list() 
# list_output = my_parser.to_list(include_markers=Filter.BCV+Filter.TEXT)

table_output = "\n".join(["\t".join(row) for row in list_output])
print(table_output)
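
Joining cells with "\t" breaks if a cell is not a string or itself contains a tab. As an alternative sketch, the standard `csv` module can write the rows and quote cells where needed. The rows below are hypothetical and only stand in for the output of `to_list()`; the real columns may differ.

```python
import csv
import io

# Hypothetical rows standing in for my_parser.to_list(); real columns may differ.
list_output = [
    ["Book", "Chapter", "Verse", "Text"],
    ["GEN", "1", "1", "test verse"],
    ["GEN", "1", "2", 'text with a\ttab and "quotes"'],
]

def rows_to_tsv(rows):
    """Serialise rows as tab-separated text, quoting cells that need it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        # None becomes an empty cell; everything else is converted with str()
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buffer.getvalue()

tsv_text = rows_to_tsv(list_output)
print(tsv_text)
```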

To round trip with USJ

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

Warning: There will be differences between the original USFM and the generated one in: 1. spaces and line breaks; 2. default attributes, which will be given their names; 3. closing markers, which may be newly added.
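
When checking a round trip, the first kind of difference (spaces and lines) can be ignored by normalising whitespace before comparing. This is only a sketch: the `regenerated` string is a hypothetical example, and the comparison does not account for named default attributes or newly added closing markers.

```python
import re

def normalize_whitespace(usfm_text):
    """Collapse every run of spaces, tabs and newlines into a single space."""
    return re.sub(r"\s+", " ", usfm_text).strip()

original = "\\id GEN\n\\c 1\n\\p\n\\v 1 test verse\n"
# Hypothetical regenerated text that differs only in line breaks
regenerated = "\\id GEN\n\\c 1\n\\p \\v 1 test verse"

same_ignoring_space = normalize_whitespace(original) == normalize_whitespace(regenerated)
print(same_ignoring_space)
```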

To remove unwanted markers from USFM

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

USJ to USX or Table

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.to_usx())
# print(my_parser2.to_list())

USX to USFM, USJ or Table

from usfm_grammar import USFMParser, Filter
from lxml import etree

test_xml_file = "sample_usx.xml"
with open(test_xml_file, 'r', encoding='utf-8') as usx_file:
    usx_str = usx_file.read()
    usx_obj = etree.fromstring(usx_str)

    my_parser = USFMParser(from_usx=usx_obj)
    print(my_parser.usfm)
    # print(my_parser.to_usj())
    # print(my_parser.to_list())

From CLI

usage: usfm-grammar [-h] [--in_format {usfm,usj,usx}]
                    [--out_format {usj,table,syntax-tree,usx,markdown,usfm}]
                    [--include_markers {book_headers,titles,...}]
                    [--exclude_markers {book_headers,titles,...}]
                    [--csv_col_sep CSV_COL_SEP] [--csv_row_sep CSV_ROW_SEP]
                    [--ignore_errors] [--combine_text]
                    infile

Uses the tree-sitter-usfm grammar to parse and convert USFM to Syntax-tree,
JSON, CSV, USX etc.

positional arguments:
  infile                input usfm, usj or usx file

options:
  -h, --help            show this help message and exit
  --in_format {usfm,usj,usx}
                        input file format
  --out_format {usj,table,syntax-tree,usx,markdown,usfm}
                        output format
  --include_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of contents to be included
  --exclude_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of contents to be excluded
  --csv_col_sep CSV_COL_SEP
                        column separator or delimiter. Only useful with
                        format=table.
  --csv_row_sep CSV_ROW_SEP
                        row separator or delimiter. Only useful with
                        format=table.
  --ignore_errors       to get some output from successfully parsed portions
  --combine_text        to be used along with exclude_markers or
                        include_markers, to concatenate the consecutive text
                        snippets from different components

Example

>>> python3 -m usfm_grammar sample.usfm --out_format usx

>>> usfm-grammar sample.usfm

>>> usfm-grammar sample.usfm --out_format usx

>>> usfm-grammar sample.usfm --include_markers bcv --include_markers text --include_markers s

>>> usfm-grammar sample-usj.json --out_format usfm
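
When the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below only builds the command and does not run it. The helper name `build_usfm_grammar_command` is hypothetical, and it repeats `--include_markers` once per value, as in the example above.

```python
import shlex

def build_usfm_grammar_command(infile, out_format=None,
                               include_markers=(), exclude_markers=()):
    """Return the argv list for a usfm-grammar call; nothing is executed here."""
    command = ["usfm-grammar", infile]
    if out_format:
        command += ["--out_format", out_format]
    # Each marker gets its own flag, mirroring the CLI example
    for marker in include_markers:
        command += ["--include_markers", marker]
    for marker in exclude_markers:
        command += ["--exclude_markers", marker]
    return command

command = build_usfm_grammar_command("sample.usfm", include_markers=["bcv", "text", "s"])
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the package is installed.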

Filtering on USJ

Filtering on USJ, the JSON output, is a feature that supports data extraction, markup cleaning and similar tasks. The exclude_markers and include_markers arguments of USFMParser.to_usj() make this possible. USFMParser.to_list() accepts the same arguments and performs similar operations. The CLI has equivalent options for these arguments, so the same filtering is available there.

  • include_markers

    Optional input parameter to to_usj() and to_list() in the Python library, and also available in the CLI when the output format is usj or table. Defaults to None. When provided, only the markers listed will be included in the output. include_markers is applied before exclude_markers.

  • exclude_markers

    Optional input parameter to to_usj() and to_list() in the Python library, and also available in the CLI when the output format is usj or table. Defaults to None. When provided, all markers except those listed will be included in the output.

  • combine_texts

    Optional input parameter to to_usj() and to_list() in the Python library, and also available in the CLI when the output format is usj or table. Defaults to True. After filtering out markers like paragraphs and characters, we are left with the texts from within them, provided 'text-in-excluded-parent' is not also excluded. These text snippets may come as separate components in the contents list. When this option is True, consecutive text snippets are concatenated. The concatenation is done in a punctuation- and space-aware manner. Users who need more control over space handling, or who for any other reason prefer the text snippets as separate components in the output, can set this to False.

  • usfm_grammar.Filter

    This class provides a set of enums that can be passed as the exclude_markers and include_markers inputs, so that users need not list individual markers. The class has the following options:

      BOOK_HEADERS : identification and introduction markers
      TITLES : section headings and associated markers
      COMMENTS : comment markers like \rem
      PARAGRAPHS : paragraph markers like \p, poetry markers, list and table markers
      CHARACTERS : all character level markups like \em, \w, \wj etc and their nested versions with +
      NOTES : footnote, cross-reference and their content markers
      STUDY_BIBLE : \esb and \cat
      BCV : \id, \c and \v
      TEXT : 'text-in-excluded-parent'
    

    To inspect which markers each of these options contains, print it, for example print(Filter.TITLES). The options can be used individually or concatenated to get the desired filtering of markers and data:

    output = my_parser.to_usj(include_markers=Filter.BCV)
    output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)
    output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)
    
  • Inner contents of excluded markers

    For markers like \p, \q etc, excluding them only removes them from the hierarchy and retains the inner contents like \v, text etc that come inside them. But for certain other markers like \f, \x, \esb etc, excluding them also excludes their inner contents. The following is the set of all markers whose inner contents are discarded if they are mentioned in exclude_markers or not included in include_markers.

    BOOK_HEADERS, TITLES, COMMENTS, NOTES, STUDY_BIBLE
    

    Warning: Generally, it is recommended NOT to use exclude_markers and include_markers together, as this could lead to unexpected behaviours and data loss. For instance, if include_markers has \fk and exclude_markers has \f, the output will not contain \fk, as all inner contents of \f will be discarded.
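
The filtering rules described in this section can be illustrated with a toy model. This is a simplified sketch and not the library's implementation. The node shape (dicts with "marker" and "content"), the function names and the three-marker DISCARDS_CONTENT set are assumptions made for illustration, as is the reading that text under a kept parent always stays with that parent.

```python
# Illustrative subset of markers whose inner contents are discarded with them
DISCARDS_CONTENT = {"f", "x", "esb"}
TEXT = "text-in-excluded-parent"

def is_kept(marker, include_markers, exclude_markers):
    """include_markers is applied first, then exclude_markers."""
    if include_markers is not None and marker not in include_markers:
        return False
    if exclude_markers is not None and marker in exclude_markers:
        return False
    return True

def filter_content(content, include_markers=None, exclude_markers=None, parent_kept=True):
    """Return a filtered copy of a list of toy nodes (dicts) and text strings."""
    result = []
    for item in content:
        if isinstance(item, str):
            # Text under a removed parent survives only if TEXT passes the filter
            if parent_kept or is_kept(TEXT, include_markers, exclude_markers):
                result.append(item)
            continue
        kept = is_kept(item["marker"], include_markers, exclude_markers)
        if not kept and item["marker"] in DISCARDS_CONTENT:
            continue  # e.g. a removed \f takes \fk, \ft etc with it
        children = filter_content(item.get("content", []),
                                  include_markers, exclude_markers, kept)
        if kept:
            node = dict(item)
            if "content" in item:
                node["content"] = children
            result.append(node)
        else:
            result.extend(children)  # hoist the children into the parent's place
    return result

# Hypothetical toy document: \c 1, then \p holding \v 1, text and a footnote
doc = [
    {"marker": "c", "number": "1"},
    {"marker": "p", "content": [
        {"marker": "v", "number": "1"},
        "test verse",
        {"marker": "f", "content": [{"marker": "fk", "content": ["keyword"]}]},
    ]},
]

bcv_and_text = filter_content(doc, include_markers=["c", "v", TEXT])
flattened = filter_content(doc, exclude_markers=["p"])
conflicting = filter_content(doc, include_markers=["v", "fk", TEXT], exclude_markers=["f"])
print(bcv_and_text)
print(conflicting)
```

In this model, `flattened` keeps the footnote because only \p was removed, while `conflicting` loses \fk even though it was listed in include_markers, which is the data loss the warning above describes.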


Source Distributions

No source distribution files are available for this release.

Built Distributions

  • usfm_grammar-3.0.0b6-cp311-cp311-win_amd64.whl (260.8 kB): CPython 3.11, Windows x86-64
  • usfm_grammar-3.0.0b6-cp311-cp311-win32.whl (263.6 kB): CPython 3.11, Windows x86
  • usfm_grammar-3.0.0b6-cp311-cp311-musllinux_1_1_x86_64.whl (260.4 kB): CPython 3.11, musllinux: musl 1.1+ x86-64
  • usfm_grammar-3.0.0b6-cp311-cp311-musllinux_1_1_i686.whl (269.5 kB): CPython 3.11, musllinux: musl 1.1+ i686
  • usfm_grammar-3.0.0b6-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (260.1 kB): CPython 3.11, manylinux: glibc 2.17+ x86-64, glibc 2.5+ x86-64
  • usfm_grammar-3.0.0b6-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (269.2 kB): CPython 3.11, manylinux: glibc 2.17+ i686, glibc 2.5+ i686
  • usfm_grammar-3.0.0b6-cp311-cp311-macosx_10_9_x86_64.whl (253.8 kB): CPython 3.11, macOS 10.9+ x86-64
  • usfm_grammar-3.0.0b6-cp310-cp310-win_amd64.whl (260.8 kB): CPython 3.10, Windows x86-64
  • usfm_grammar-3.0.0b6-cp310-cp310-win32.whl (263.6 kB): CPython 3.10, Windows x86
  • usfm_grammar-3.0.0b6-cp310-cp310-musllinux_1_1_x86_64.whl (260.4 kB): CPython 3.10, musllinux: musl 1.1+ x86-64
  • usfm_grammar-3.0.0b6-cp310-cp310-musllinux_1_1_i686.whl (269.5 kB): CPython 3.10, musllinux: musl 1.1+ i686
  • usfm_grammar-3.0.0b6-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (260.1 kB): CPython 3.10, manylinux: glibc 2.17+ x86-64, glibc 2.5+ x86-64
  • usfm_grammar-3.0.0b6-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (269.2 kB): CPython 3.10, manylinux: glibc 2.17+ i686, glibc 2.5+ i686
  • usfm_grammar-3.0.0b6-cp310-cp310-macosx_10_9_x86_64.whl (253.8 kB): CPython 3.10, macOS 10.9+ x86-64
