Python parser for USFM files, based on tree-sitter-usfm3

USFM-Grammar

A Python library that facilitates:

Parsing and validation of USFM files using tree-sitter-usfm3
Conversion of USFM files to other formats (USX, dict, list, etc.)
Extraction of specific contents from USFM files, such as scripture alone (clean verses) or notes (footnotes, cross-references)

Built on Python 3.10

Installation

pip install usfm-grammar

This requires a C compiler. On Windows, Microsoft Visual C++ 14.0 or above is required. It is recommended that you update pip, setuptools and wheel.

Usage

By importing the library in Python code

from usfm_grammar import USFMParser, Filter

# input_usfm_str = open("sample.usfm","r", encoding='utf8').read()
input_usfm_str = '''
\\id GEN
\\c 1
\\p
\\v 1 test verse
'''

my_parser = USFMParser(input_usfm_str)

errors = my_parser.errors
print(errors)

To convert to USX

from lxml import etree

usx_elem = my_parser.to_usx() # default filter=ALL
print(etree.tostring(usx_elem, encoding="unicode", pretty_print=True))

To convert to Dict/USJ

output = my_parser.to_usj() # default all markers

# filters out specified markers from output
# output = my_parser.to_usj(exclude_markers=['s1','h', 'toc1','toc2','mt'])

# retains only specified contents from output
# output = my_parser.to_usj(include_markers=['id', 'c', 'v']) 

# use predefined marker groups instead of listing them one by one
# output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

# for a flattened JSON removing nesting brought in by paragraphs, lists, quotes, tables and character level markups
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)

# To NOT concatenate text extracted from different markers
# output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS, combine_texts=False) 

print(output)

To understand more about how exclude_markers, include_markers, combine_texts and Filter work, refer to the section on filtering on USJ.

To save as json

import json
dict_output = my_parser.to_usj()
with open("file_path.json", "w", encoding='utf-8') as fp:
	json.dump(dict_output, fp)
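
Scripture text is often written in non-Latin scripts, which json.dump escapes as \uXXXX sequences by default. The following is a minimal sketch of writing human-readable JSON with the standard-library options ensure_ascii=False and indent. The small dict is a hypothetical stand-in for the to_usj() output, and the temporary output path is only for illustration.

```python
import json
import os
import tempfile

# Hypothetical stand-in for my_parser.to_usj(); the real output has more fields.
dict_output = {"type": "USJ", "content": ["आदि में परमेश्वर ने"]}

out_path = os.path.join(tempfile.mkdtemp(), "file_path.json")
with open(out_path, "w", encoding="utf-8") as fp:
    # ensure_ascii=False keeps non-ASCII text readable; indent pretty-prints.
    json.dump(dict_output, fp, ensure_ascii=False, indent=2)

# Read the file back to confirm that nothing was lost on the way.
with open(out_path, encoding="utf-8") as fp:
    raw_text = fp.read()
reloaded = json.loads(raw_text)
```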

To convert to a list or table-like format

list_output = my_parser.to_list() 
# list_output = my_parser.to_list(include_markers=Filter.BCV+Filter.TEXT)

table_output = "\n".join(["\t".join(row) for row in list_output])
print(table_output)
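
Joining cells with "\t".join(row) assumes that every cell is a string and that no cell contains a tab or a line break. As a hedged alternative, the standard csv module can write the same table and takes care of both cases. The rows below are hypothetical stand-ins for the to_list() output; the real columns may differ.

```python
import csv
import io

# Hypothetical rows standing in for my_parser.to_list(); real columns may differ.
list_output = [
    ["Book", "Chapter", "Verse", "Text"],
    ["GEN", 1, 1, "test verse"],
    ["GEN", 1, 2, "text with a\ttab"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
# Non-string cells are converted to strings; cells containing the delimiter are quoted.
writer.writerows(list_output)
table_output = buffer.getvalue()
print(table_output)
```

To write to a file instead, the same writer can be given a file object opened with newline="" and encoding="utf-8".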

To round trip with USJ

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

:warning: The generated USFM will differ from the original USFM in three ways: 1. spaces and line breaks may change, 2. default attributes will be given their names, 3. closing markers may be newly added.
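
Because of these differences, a plain string comparison of the two USFM texts will usually fail. The following is a minimal sketch of a comparison that ignores spaces and line breaks only. It does not account for named default attributes or added closing markers. The helper name and the regenerated string are hypothetical.

```python
import re


def normalize_usfm_whitespace(usfm: str) -> str:
    """Collapse every run of spaces and line breaks into a single space."""
    return re.sub(r"\s+", " ", usfm).strip()


original = "\\id GEN\n\\c 1\n\\p\n\\v 1 test verse\n"
# Hypothetical round-trip output that differs only in spacing and line breaks.
regenerated = "\\id GEN\n\\c 1\n\\p\n\\v 1  test verse"

same_ignoring_whitespace = (
    normalize_usfm_whitespace(original) == normalize_usfm_whitespace(regenerated)
)
print(same_ignoring_whitespace)
```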

To remove unwanted markers from USFM

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.usfm)

USJ to USX or Table

from usfm_grammar import USFMParser, Filter

my_parser = USFMParser(input_usfm_str)
usj_obj = my_parser.to_usj()

my_parser2 = USFMParser(from_usj=usj_obj)
print(my_parser2.to_usx())
# print(my_parser2.to_list())

USX to USFM, USJ or Table

from usfm_grammar import USFMParser, Filter
from lxml import etree

test_xml_file = "sample_usx.xml"
with open(test_xml_file, 'r', encoding='utf-8') as usx_file:
    usx_str = usx_file.read()
    usx_obj = etree.fromstring(usx_str)

    my_parser = USFMParser(from_usx=usx_obj)
    print(my_parser.usfm)
    # print(my_parser.to_usj())
    # print(my_parser.to_list())

From CLI

usage: usfm-grammar [-h] [--in_format {usfm,usj,usx}]
                    [--out_format {usj,table,syntax-tree,usx,markdown,usfm}]
                    [--include_markers {book_headers,titles,...}]
                    [--exclude_markers {book_headers,titles,...}]
                    [--csv_col_sep CSV_COL_SEP] [--csv_row_sep CSV_ROW_SEP]
                    [--ignore_errors] [--combine_text]
                    infile

Uses the tree-sitter-usfm grammar to parse and convert USFM to Syntax-tree,
JSON, CSV, USX etc.

positional arguments:
  infile                input usfm or usj file

options:
  -h, --help            show this help message and exit
  --in_format {usfm,usj}
                        input file format
  --out_format {usj,table,syntax-tree,usx,markdown,usfm}
                        output format
  --include_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of of contents to be included
  --exclude_markers {book_headers,titles,comments,paragraphs,characters,notes,study_bible,bcv,text,ide,usfm,h,toc,toca,imt,is,ip,ipi,im,imi,ipq,imq,ipr,iq,ib,ili,iot,io,iex,imte,ie,mt,mte,cl,cd,ms,mr,s,sr,r,d,sp,sd,sts,rem,lit,restore,p,m,po,pr,cls,pmo,pm,pmc,pmr,pi,mi,nb,pc,ph,q,qr,qc,qa,qm,qd,lh,li,lf,lim,litl,tr,tc,th,tcr,thr,table,b,add,bk,dc,ior,iqt,k,litl,nd,ord,pn,png,qac,qs,qt,rq,sig,sls,tl,wj,em,bd,bdit,it,no,sc,sup,rb,pro,w,wh,wa,wg,lik,liv,jmp,f,fe,ef,efe,x,ex,fr,ft,fk,fq,fqa,fl,fw,fp,fv,fdc,xo,xop,xt,xta,xk,xq,xot,xnt,xdc,esb,cat,id,c,v,text-in-excluded-parent}
                        the list of of contents to be included
  --csv_col_sep CSV_COL_SEP
                        column separator or delimiter. Only useful with
                        format=table.
  --csv_row_sep CSV_ROW_SEP
                        row separator or delimiter. Only useful with
                        format=table.
  --ignore_errors       to get some output from successfully parsed portions
  --combine_text        to be used along with exclude_markers or
                        include_markers, to concatinate the consecutive text
                        snippets, from different components, or not

Example

>>> python3 -m usfm_grammar sample.usfm --out_format usx

>>> usfm-grammar sample.usfm

>>> usfm-grammar sample.usfm --out_format usx

>>> usfm-grammar sample.usfm --include_markers bcv --include_markers text --include_markers s

>>> usfm-grammar sample-usj.json --out_format usfm
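
When the CLI is called from a script, the argument list can be built programmatically. The sketch below uses only the flags shown in the help text above. The helper build_command is hypothetical and not part of the package, and actually running the command assumes usfm-grammar is installed and on the PATH.

```python
import shlex


def build_command(infile, out_format=None, include_markers=()):
    """Build the argument list for the usfm-grammar CLI (hypothetical helper)."""
    cmd = ["usfm-grammar", infile]
    if out_format is not None:
        cmd += ["--out_format", out_format]
    for marker in include_markers:
        # The flag is repeated once per marker, as in the example above.
        cmd += ["--include_markers", marker]
    return cmd


cmd = build_command("sample.usfm", include_markers=["bcv", "text", "s"])
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True, capture_output=True, text=True)
```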

Filtering on USJ

Filtering on USJ, the JSON output, is a feature that allows data extraction, markup cleaning, etc. The exclude_markers and include_markers arguments of USFMParser.to_usj() make this possible. USFMParser.to_list() also accepts these inputs and performs similar operations. The CLI has equivalent options for these arguments, so the same filtering is available there.

include_markers

Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to None. When provided, only the markers listed will be included in the output. include_markers is applied before exclude_markers.

exclude_markers

Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to None. When provided, all markers except those listed will be included in the output.

combine_texts

Optional input parameter to to_usj() and to_list() in the Python library, and also in the CLI when --out_format is usj or table. Defaults to True. After filtering out markers like paragraphs and characters, we are left with the texts from within them, provided 'text-in-excluded-parent' is not also excluded. These text snippets may come as separate components in the contents list. When this option is True, consecutive text snippets are concatenated. The concatenation is done in a punctuation- and space-aware manner. Users who need more control over space handling, or who for any other reason prefer the text snippets as separate components in the output, can set this to False.
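
To make the idea of combining consecutive text snippets concrete, here is a simplified sketch. It is not the library's algorithm: the function names, the punctuation set and the spacing rule are all assumptions chosen for illustration.

```python
# Assumed set of characters that should attach to the preceding text without a space.
CLOSING_PUNCTUATION = ",.;:!?)]}"


def join_snippets(left: str, right: str) -> str:
    """Join two text snippets, adding a space only where one is needed."""
    if not left or not right:
        return left + right
    if left[-1].isspace() or right[0].isspace() or right[0] in CLOSING_PUNCTUATION:
        return left + right
    return left + " " + right


def merge_consecutive_texts(contents: list) -> list:
    """Merge neighbouring strings in a contents list; other items are kept as they are."""
    combined = []
    for item in contents:
        if isinstance(item, str) and combined and isinstance(combined[-1], str):
            combined[-1] = join_snippets(combined[-1], item)
        else:
            combined.append(item)
    return combined


contents = [
    {"marker": "v", "number": "1"},
    "In the beginning",
    "God created",
    ", and it was good.",
]
merged = merge_consecutive_texts(contents)
print(merged)
```

With combine_texts=False, the equivalent of the unmerged contents list would be returned, leaving the spacing decisions to the caller.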

usfm_grammar.Filter

This class provides a set of enums that can be passed to exclude_markers and include_markers, so that users need not list individual markers. The class has the following options:

  BOOK_HEADERS : identification and introduction markers
  TITLES : section headings and associated markers
  COMMENTS : comment markers like \rem
  PARAGRAPHS : paragraph markers like \p, poetry markers, list and table markers
  CHARACTERS : all character-level markups like \em, \w, \wj etc. and their nested versions with +
  NOTES : footnote, cross-reference and their content markers
  STUDY_BIBLE : \esb and \cat
  BCV : \id, \c and \v
  TEXT : 'text-in-excluded-parent'

To inspect which markers are in each of these options, print it out, for example print(Filter.TITLES). The options can be used individually or concatenated to get the desired filtering of markers and data:

output = my_parser.to_usj(include_markers=Filter.BCV)
output = my_parser.to_usj(include_markers=Filter.BCV+Filter.TEXT)
output = my_parser.to_usj(exclude_markers=Filter.PARAGRAPHS+Filter.CHARACTERS)
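
The effect of concatenating options, and of include_markers being applied before exclude_markers, can be modelled on a flat list of marker names. This is a simplified sketch, not the library's code: the helper effective_markers is hypothetical, and the values of BCV and TEXT are assumed from the option descriptions above.

```python
# Assumed values, taken from the option descriptions above; print the real
# Filter.BCV and Filter.TEXT to see what the library actually defines.
BCV = ["id", "c", "v"]
TEXT = ["text-in-excluded-parent"]


def effective_markers(all_markers, include_markers=None, exclude_markers=None):
    """Return the markers that survive filtering (include first, then exclude)."""
    kept = list(all_markers)
    if include_markers is not None:
        kept = [m for m in kept if m in include_markers]
    if exclude_markers is not None:
        kept = [m for m in kept if m not in exclude_markers]
    return kept


markers_in_file = ["id", "c", "p", "v", "text-in-excluded-parent", "s1"]
only_included = effective_markers(markers_in_file, include_markers=BCV + TEXT)
both_applied = effective_markers(
    markers_in_file, include_markers=BCV + TEXT, exclude_markers=["c"]
)
print(only_included)
print(both_applied)
```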

Inner contents of excluded markers

For markers like \p, \q, etc., excluding them only removes them from the hierarchy and retains their inner contents, such as \v and text. But for certain other markers like \f, \x, \esb, etc., excluding them also excludes their inner contents. The following is the set of all markers whose inner contents are discarded if they are mentioned in exclude_markers or not included in include_markers.
```
BOOK_HEADERS, TITLES, COMMENTS, NOTES, STUDY_BIBLE
```
:warning: Generally, it is recommended NOT to use exclude_markers and include_markers together, as this could lead to unexpected behaviour and data loss. For instance, if include_markers has \fk and exclude_markers has \f, the output will not contain \fk, as all inner contents of \f will be discarded.
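
The two behaviours described in this section can be sketched on a simplified, USJ-like nested structure. This is an illustration under stated assumptions, not the library's implementation: the function name, the node layout (dicts with "marker" and "content" keys) and the small DISCARD_WITH_CONTENTS set are hypothetical.

```python
# Markers whose inner contents are discarded when excluded (illustrative subset).
DISCARD_WITH_CONTENTS = {"f", "x", "esb"}


def exclude_markers_sketch(nodes, excluded):
    """Remove excluded markers from a nested list of nodes and text strings."""
    result = []
    for node in nodes:
        if isinstance(node, str):
            result.append(node)  # plain text is kept in this sketch
        elif node["marker"] in excluded:
            if node["marker"] not in DISCARD_WITH_CONTENTS:
                # e.g. \p or \q: drop the wrapper and hoist its children
                result.extend(exclude_markers_sketch(node.get("content", []), excluded))
            # otherwise, e.g. \f: the whole subtree is dropped
        else:
            kept = dict(node)
            if "content" in kept:
                kept["content"] = exclude_markers_sketch(kept["content"], excluded)
            result.append(kept)
    return result


content = [
    {"marker": "p", "content": [
        {"marker": "v", "number": "1"},
        "test verse",
        {"marker": "f", "content": [{"marker": "fk", "content": ["keyword"]}]},
    ]},
]
flattened = exclude_markers_sketch(content, {"p", "f"})
without_notes = exclude_markers_sketch(content, {"f"})
print(flattened)
print(without_notes)
```

In both calls the \fk node disappears together with its parent \f, which is the data loss the warning above refers to.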

Release history

Version	Status	Release date
3.0.0b5 (this version)	pre-release	Mar 14, 2024
3.0.0b4	pre-release	Feb 12, 2024
3.0.0b3	pre-release	Dec 27, 2023
3.0.0b2	pre-release	Sep 22, 2023
3.0.0b1	pre-release	Jul 18, 2023
3.0.0a7	pre-release	Jul 18, 2023
3.0.0a6	pre-release	Jul 14, 2023
3.0.0a5	pre-release	Oct 28, 2022
3.0.0a4	pre-release	Aug 31, 2022
3.0.0a3	pre-release	Aug 19, 2022
3.0.0a2	pre-release	Aug 19, 2022

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

usfm_grammar-3.0.0b5-cp311-cp311-win_amd64.whl (260.8 kB)

Uploaded Mar 14, 2024 CPython 3.11 Windows x86-64

usfm_grammar-3.0.0b5-cp311-cp311-win32.whl (263.6 kB)

Uploaded Mar 14, 2024 CPython 3.11 Windows x86

usfm_grammar-3.0.0b5-cp311-cp311-musllinux_1_1_x86_64.whl (260.4 kB)

Uploaded Mar 14, 2024 CPython 3.11 musllinux: musl 1.1+ x86-64

usfm_grammar-3.0.0b5-cp311-cp311-musllinux_1_1_i686.whl (269.5 kB)

Uploaded Mar 14, 2024 CPython 3.11 musllinux: musl 1.1+ i686

usfm_grammar-3.0.0b5-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (260.1 kB)

Uploaded Mar 14, 2024 CPython 3.11 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

usfm_grammar-3.0.0b5-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (269.2 kB)

Uploaded Mar 14, 2024 CPython 3.11 manylinux: glibc 2.17+ i686 manylinux: glibc 2.5+ i686

usfm_grammar-3.0.0b5-cp311-cp311-macosx_10_9_x86_64.whl (253.8 kB)

Uploaded Mar 14, 2024 CPython 3.11 macOS 10.9+ x86-64

usfm_grammar-3.0.0b5-cp310-cp310-win_amd64.whl (260.8 kB)

Uploaded Mar 14, 2024 CPython 3.10 Windows x86-64

usfm_grammar-3.0.0b5-cp310-cp310-win32.whl (263.6 kB)

Uploaded Mar 14, 2024 CPython 3.10 Windows x86

usfm_grammar-3.0.0b5-cp310-cp310-musllinux_1_1_x86_64.whl (260.4 kB)

Uploaded Mar 14, 2024 CPython 3.10 musllinux: musl 1.1+ x86-64

usfm_grammar-3.0.0b5-cp310-cp310-musllinux_1_1_i686.whl (269.5 kB)

Uploaded Mar 14, 2024 CPython 3.10 musllinux: musl 1.1+ i686

usfm_grammar-3.0.0b5-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (260.1 kB)

Uploaded Mar 14, 2024 CPython 3.10 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

usfm_grammar-3.0.0b5-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (269.2 kB)

Uploaded Mar 14, 2024 CPython 3.10 manylinux: glibc 2.17+ i686 manylinux: glibc 2.5+ i686

usfm_grammar-3.0.0b5-cp310-cp310-macosx_10_9_x86_64.whl (253.8 kB)

Uploaded Mar 14, 2024 CPython 3.10 macOS 10.9+ x86-64

Hashes for usfm_grammar-3.0.0b5-cp311-cp311-win_amd64.whl
Algorithm	Hash digest
SHA256	`1c445a878291a5c31f5a504b19f7253639645750d2d2010be0fc075c0c0e4c10`
MD5	`cd76042235ff8d27769de41c8bee288a`
BLAKE2b-256	`5cbace62978cf3f0b72d2191d9118d5de59814dcfe202cddc83edee913f0fe92`

Hashes for usfm_grammar-3.0.0b5-cp311-cp311-win32.whl
Algorithm	Hash digest
SHA256	`228c4839c44ad6aab84f25f9d17bd4875dfc4f9604e24c1ee613b757858e610e`
MD5	`69bd6c57fc89bc214b0f525622011417`
BLAKE2b-256	`fa54a958a11675d22015cd1f1d4f8fbbc029cf7ae98503b5eea8985975b711c9`

Hashes for usfm_grammar-3.0.0b5-cp311-cp311-musllinux_1_1_x86_64.whl
Algorithm	Hash digest
SHA256	`99b40511f1bc11ee379e29494adc0834d41e4d739f061598b3c594eb3a8eccf6`
MD5	`ad238f5b4578372fa4ee124a713458d5`
BLAKE2b-256	`d39683ede5f7a92f1f1c97b02d30bdade2fab175ab572ffd275f24d05e501ef5`

Hashes for usfm_grammar-3.0.0b5-cp311-cp311-musllinux_1_1_i686.whl
Algorithm	Hash digest
SHA256	`0112ef59e066b16a8e5bd62073feb21a34425a387360196ad13a625827e327cc`
MD5	`0d8211d1a1229032fc14ac0c3ef1bd9e`
BLAKE2b-256	`a02825b6979e17b9ecb632bdb27cdd3484350cf73faf458f8f24222969503208`

Hashes for usfm_grammar-3.0.0b5-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`0a7b163ae9af5c26266e44bd1f743ba6d0eb42c920245ae6a9dd7f85efed953f`
MD5	`837ce851d437b2b26e17aa606348edf8`
BLAKE2b-256	`214c5495c8668d1e9bdbc17c7dccb4b0a3adcb9ce9f06f944b668e7a3c5f919a`

Hashes for usfm_grammar-3.0.0b5-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm	Hash digest
SHA256	`a488c574667ab37ca6aec72daa033261600daf40e32f6e360277f75ea0aeffd7`
MD5	`0aea01391ecf67cf98035c84ef7673b5`
BLAKE2b-256	`f6f20aa72837c13d5a08649f7fc5fb96bf668201f3fc87c8758f2e107efe9888`

Hashes for usfm_grammar-3.0.0b5-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm	Hash digest
SHA256	`d34d0b76da000218c32c7992e4fa6c4034ff7f4fee0a350c635e693b31115d87`
MD5	`eb6480377bd5c6e88cb59202df1f2dac`
BLAKE2b-256	`3b38194ef855bac8e2c84333e52d45ea03c3ebb4f708f977df6cf9118bfbbe10`

Hashes for usfm_grammar-3.0.0b5-cp310-cp310-win_amd64.whl
Algorithm	Hash digest
SHA256	`8b28a7530a026d0fba28ca48c2caf6fc0d8eb45231fe4515c92d3d1e506baf2f`
MD5	`8c73ef36e9608bebd289a50b2e3f5f18`
BLAKE2b-256	`04bc297c95e59b98309e476962b06f1300f4028a0a8eb22079186460dceb7e0e`

Hashes for usfm_grammar-3.0.0b5-cp310-cp310-win32.whl
Algorithm	Hash digest
SHA256	`9bacf828aa7adca6fcd8c0f8e5ac0782efbab013d3a57d2d5c15bd42d731253a`
MD5	`1066d54239fd68e2dbb3dd13e06c2176`
BLAKE2b-256	`cfd48d0f5183f4fbdd22f412e0f103863fa76ed2b4f64be5168b0e3e4de921ae`

Hashes for usfm_grammar-3.0.0b5-cp310-cp310-musllinux_1_1_x86_64.whl
Algorithm	Hash digest
SHA256	`389ea63a2b4cc7280cf314d76db9571087e8c3900b51710e480be7dfea8fa976`
MD5	`0770586e015cdb7cb907146d67dd9947`
BLAKE2b-256	`aec05917bca4c33bbeec60bfd11a82d913810c58c3bffbdb45123a80688fc933`

Hashes for usfm_grammar-3.0.0b5-cp310-cp310-musllinux_1_1_i686.whl
Algorithm	Hash digest
SHA256	`99414d3f0922ab02502ed62b39443583970c932b284209ed8030e1f0809d7002`
MD5	`6502b64aea02372400dfda58dd09adae`
BLAKE2b-256	`7f945f07a281b5bdec90089ab2d413fbabf77a0962ddf68018c957b454bcf6f9`

Hashes for usfm_grammar-3.0.0b5-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`6b2a0902d7cc48a6b375c5997a41a5894ee704c17ffcd921e93a3a5b74ee742f`
MD5	`797d1f635d83ee393467f71cd7719190`
BLAKE2b-256	`8ef7dee51bf9aef86b9f0f0c9d2bc3e628a324e541666f0d1ba23733086e888c`

Hashes for usfm_grammar-3.0.0b5-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm	Hash digest
SHA256	`b584dcc7810bd65cd763f9a649fab30917fe63c0c7481f43563d3f5a509c487f`
MD5	`6e25917b5ab8db8064f96ea9b00c7a6d`
BLAKE2b-256	`9153bc66dea0e985e12dab3cfdb4927b8b4870d521a20fd4e8c45553f7b3772f`

Hashes for usfm_grammar-3.0.0b5-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm	Hash digest
SHA256	`cc88f3de048747872ce0e9d229640ed5ab9fcb374ab9e7e8abeec4fb2fd2403c`
MD5	`cb3aa5ec43fec1345a151987f49c0a10`
BLAKE2b-256	`137b24f4547c433ea18b1b6613e607bb85e0fdba7a5c36c8a0643db93f7f0538`
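
A downloaded wheel can be checked against the SHA256 value listed for it above. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the demonstration runs on a throwaway file rather than a real wheel.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demonstration on a throwaway file; for a real check, pass the downloaded
# wheel's path and the SHA256 value listed for that file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.whl")
with open(demo_path, "wb") as fp:
    fp.write(b"not a real wheel")
demo_digest = hashlib.sha256(b"not a real wheel").hexdigest()
print(matches_published_hash(demo_path, demo_digest))
```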