The official package for dealing with the OpenITI corpus

openiti

A Python library that bundles the code most often used in the OpenITI project.

Full documentation and description can be found here: https://openiti.readthedocs.io/

Installation

pip install OpenITI

Alternatively, you might need to use pip3 install OpenITI or python -m pip install OpenITI.

Change log:

v.0.1.6:

the library has been updated to deal with the new manuscript URIs (see https://github.com/OpenITI/MSS)
helper/yml.py:
- implement new manuscript URIs (see https://github.com/OpenITI/MSS).
- readYML: add fix_yml_errors_silently argument (default: False). The function will now try to fix any faulty yml file.
- ymlToDic: idem
- fix_broken_yml:
  - previously, this function returned None if the yml could not be fixed. It will now always return a dictionary (keys that could not be fixed will not be in the dictionary).
  - if execute is True, the fixed yml file will be saved to fp (if silent is False, the user's permission will be asked first)
helper/uri.py: implement new manuscript URIs (see https://github.com/OpenITI/MSS). The components of these URIs are called location, manuscript and transcription IDs.
helper/templates.py: add templates for location, manuscript and transcription yml files.
helper/rgx.py: Add regexes for location, manuscript and transcription URIs.
helper/funcs:
- get_all_yml_files_in_folder: add "location", "manuscript" and "transcription" yml types
- get_all_text_files_in_folder: allow for multiple language components in version and transcription IDs
- get_page_numbers: Adapt the default page_regex argument to include page number formats like FolioVxxPxxxA, PageBegVxxPxxx and PageEndVxxPxxx.
- get_all_characters_in_folder: allow for multiple language components in version and transcription IDs
new_books/add/add_books.py:
- initialize_new_text: adapt the function to new manuscript URIs
new_books/convert/generic_converter: convert_file function now returns the output path
new_books/convert/helper/html2md.py:
- convert_li: fix numbered list numbering bug
new_books/convert/html_converter_generic.py: replace html entities
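
The page number formats named in the get_page_numbers entry above (FolioVxxPxxxA, PageBegVxxPxxx, PageEndVxxPxxx) can be illustrated with a small regex. This is only a sketch: the pattern PAGE_TAG and the helper find_page_tags are assumptions written for this example, not the library's actual default page_regex.

```python
import re

# Hypothetical pattern, written for illustration only; the library's own
# default page_regex may differ.
PAGE_TAG = re.compile(
    r"(?P<kind>PageBeg|PageEnd|Page|Folio)"  # tag type (longer names first)
    r"V(?P<vol>\d{2})"                       # two-digit volume number
    r"P(?P<page>\d{3,})"                     # page or folio number
    r"(?P<side>[ABab])?"                     # optional folio side
)

def find_page_tags(text):
    """Return (kind, volume, page, side) tuples for every page tag in text."""
    return [
        (m.group("kind"), int(m.group("vol")), int(m.group("page")), m.group("side"))
        for m in PAGE_TAG.finditer(text)
    ]

sample = "PageBegV01P001 some text PageEndV01P001 more FolioV02P015A end PageV01P002"
tags = find_page_tags(sample)
for tag in tags:
    print(tag)
```

Listing PageBeg and PageEnd before Page in the alternation keeps the shorter name from matching first.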

v.0.1.5.11:

minimum Python version bumped to 3.6 to allow the use of f-strings
the library is now built using pyproject.toml instead of setup.py
helper.ara:
- normalize_ara_light: add normalization of Persian letters
- normalize_ara_heavy: add normalization of Persian letters
helper.funcs: new functions:
- get_page_numbers: utility function that generates two lists: one containing all page numbers in the document, one containing the character offsets of (the end of) each of these page numbers in the document. These two lists are used by the get_page_number function to find a page number.
- search_in_text: search a literal string or regex (use_regex=True) in an OpenITI text
- search_regex_in_text: same as search_in_text with argument use_regex=True
- search_in_folder: search a literal string or regex (use_regex=True) in all OpenITI texts in a given folder (and its subfolders).
- search_regex_in_folder: same as search_in_folder with argument use_regex=True
helper.funcs.get_page_number: BREAKING CHANGE: now takes three arguments:
- loc (int): a character position in the text
- page_numbers (list): a list of all page numbers in the text
- page_ends (list): a list of the character position of each page number in the text
The two lists can be generated by the new get_page_numbers function.
helper.funcs.get_sections:
- now optionally returns page numbers and character offsets for each section
- BREAKING CHANGE: instead of a tuple of lists, it now returns a list:
  - with default parameters: list of titles of sections
  - if any of the parameters is set to True, a list of dictionaries (possible keys: "title", "level", "parent_sections", "start_offset", "title_end", "end_offset", "start_page", "end_page")
helper.funcs.get_semantic_tag_elements: now optionally returns page numbers for each tag
helper.funcs.get_section_title: Now uses bisect.bisect_left internally
openiti/new_books/convert/generic_converter: all converters now create a folder "converted" inside the folder of the source texts if no destination folder was explicitly set.
openiti/new_books/convert/epub_converter_generic: now accommodates different styles of epub tables of content
openiti/new_books/convert/html2md: now use lxml feature extractor
openiti/new_books/convert/epub_converter_UrduELib: new converter for Urdu ELibrary
openiti/new_books/convert/tei_converter_PDL: new converter for Persian Digital Library files
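
The get_page_numbers and get_page_number entries above describe a two-step lookup: collect every page number with the offset where it ends, then find the page for any character position. The sketch below shows that idea with bisect. The names collect_page_numbers and lookup_page, and the simplified PAGE_TAG regex, are hypothetical and are not the library's code. It assumes each page tag sits at the end of the page it numbers.

```python
import re
from bisect import bisect_left

# Simplified page tag for this sketch (an assumption, not the library's regex).
PAGE_TAG = re.compile(r"PageV\d{2}P\d{3,}")

def collect_page_numbers(text):
    """Return (page_numbers, page_ends): each tag and the offset where it ends."""
    page_numbers, page_ends = [], []
    for m in PAGE_TAG.finditer(text):
        page_numbers.append(m.group(0))
        page_ends.append(m.end())
    return page_numbers, page_ends

def lookup_page(loc, page_numbers, page_ends):
    """Return the page tag for character offset loc, or None past the last tag."""
    # Assuming a tag closes its page, the page of loc is the first tag
    # whose end offset is at or after loc.
    i = bisect_left(page_ends, loc)
    return page_numbers[i] if i < len(page_numbers) else None

text = "first page PageV01P001 second page PageV01P002 trailing"
page_numbers, page_ends = collect_page_numbers(text)
print(lookup_page(text.index("second"), page_numbers, page_ends))
```

Because the two lists are built once per text, each later lookup is a binary search rather than a new scan of the document.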

v.0.1.5.10:

helper.ara:
- Arabic-Indic digits and Extended Arabic-Indic digits are removed from the ar_chars liststring and put into a new liststring: ar_nums. As a result, numbers written with these characters are no longer considered Arabic tokens (just like numbers written with Western Arabic numerals).
- A large number of Greek, Coptic, Syriac and Latin characters are added to the allowed_chars stringlist, which means they should not be removed from texts before putting them into the corpus.
helper.funcs.text_cleaner: this function, which removes all non-word-characters, numbers and Latin-script characters from the texts, now uses helper.ara.transcription_chars instead of [A-z] to define Latin-script letters. This means it will now also remove common transcription letters (ā, ḥ, ...) instead of only ASCII letters.
helper.rgx:
- add a list of Islamicate language codes
- add regex for author (author_uri), book (book_uri) and version URIs (version_uri)
- fix the page number related regexes to include PageBeg and PageEnd tags, and folio numbers that end with lower-case "a" or "b"
- adapt the section_tag regex to include this new flavour: ### |5| (which is the same as a section tag with five pipes)
- add an all_tags regex that can be used to remove all OpenITI mARkdown tags
- add an html_tags regex that can be used to find html tags
helper.uri:
- build_pth: take into account the different repo name formats for Arabic and other languages (Arabic: 0025AH, 0050AH, ...; Persian: PER0025AH, PER0050AH, ...; Urdu: URD0025AH, ...)
- change_uri: add a non_25Y_folder argument. Set this to True if you want to use the function for folders that do not have subfolders for each 25-year period.
- add_character_count: idem
- move_yml: idem
- make_folder: idem
- move_to_new_uri_pth: idem
- check_yml_files: return list of paths to yml files where the checks failed instead of None
new_books.add.add_books: implement non_25Y_folder argument in all functions (see above in helper.uri)
new_books.convert.epub_converter_hindawi: deal with possibility of unavailable metadata
new_books.convert.helper.html2md: improve named entity tagging
new_books.convert.helper.html2md_LAL: various improvements for formatting Library of Arabic Literature XML files
openiti.new_books.convert.tei_converter_LAL: idem
new_books.convert.tei_converter_Wuerzburg: small post-processing tweaks.
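
The text_cleaner entry above replaces [A-z] with an explicit character set. A short demonstration of why the [A-z] range is a poor definition of Latin-script letters follows. TRANSCRIPTION_CHARS is a tiny hypothetical stand-in; the real helper.ara.transcription_chars is not reproduced here.

```python
import re

ascii_range = re.compile(r"[A-z]")  # spans code points 0x41 to 0x7A

# The six ASCII characters between "Z" and "a" are matched by [A-z] too.
extra = [c for c in "[\\]^_`" if ascii_range.fullmatch(c)]

# Transcription letters with diacritics (here: a-macron, h-dot, s-dot)
# fall outside the range.
missed = [c for c in "\u0101\u1e25\u1e63" if not ascii_range.fullmatch(c)]

# Hypothetical stand-in for a transcription character set.
TRANSCRIPTION_CHARS = "A-Za-z\u0101\u1e25\u1e63"
latin = re.compile("[" + TRANSCRIPTION_CHARS + "]+")

# A transcribed word followed by the same word in Arabic script.
sample = "kit\u0101b \u0643\u062a\u0627\u0628"
ascii_cleaned = re.sub(r"[A-z]+", "", sample)  # leaves the a-macron behind
full_cleaned = latin.sub("", sample)           # removes the whole Latin word
print(repr(ascii_cleaned), repr(full_cleaned))
```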

v.0.1.5.9:

helper.yml: fix bug: pass reflow parameter in readYML function to ymlToDic
helper.funcs: Add functions:
- read_header: read header of an OpenITI file (local path / URL)
- read_text: read text of an OpenITI file (local path / URL)
- get_page_number: get the page number of a token based on its offset
- get_semantic_tag_elements: extract semantic tags (like @TOP, @PER) from an OpenITI text
- find_section_title: get the section title of any location inside a text
- get_sections: get a list of all sections in an OpenITI text
helper.ara:
- Add a whitelist of characters that are allowed in OpenITI texts, with support for Hebrew and Cyrillic characters.
- Add new characters (subscript alef, inverted damma, quranic sukun, small high madda) to the noise variable
new_books.convert.helper.html2md: Add underscore to allowed characters in named entity tags and fix named entity count
new_books.convert.tei_converter_LAL: add a new converter for TEI texts from the Library of Arabic Literature

v.0.1.5.8:

helper.uri: when a book URI changes, also change references to it in related books.
helper.yml: add functions to check completeness of yml files

v.0.1.5.7:

helper.uri: fix bug in the extension looping process of the check_token_count function

v.0.1.5.6:

helper.funcs: add natural_sort function to sort a list of strings that include numbers in natural order (e.g., ["1", "2", "10"] instead of ["1", "10", "2"])
helper.uri: give files without extension priority over files with ".inProgress" extension in deciding which text file to use to count characters and tokens for a specific version yml file.
new_books.convert.epub_converter_masaha.py: remove superfluous backslash in EDITOR tag
new_books.convert.helper.html2md.py: fix bug in token count
openiti.new_books.convert.helper.html2md_eShia.py: fix bug in footnote conversion
openiti.new_books.convert.html_converter_eShia.py: improve eShia conversion
Add converters for Ghbook, Ghaemiyeh and Rafed files.
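
Natural ordering, as described in the natural_sort entry above, can be sketched as follows. This is an illustration of the general technique, not the library's implementation; natural_key and natural_sorted are names chosen for this example.

```python
import re

def natural_key(s):
    """Split s into text and integer chunks so that digit runs compare numerically."""
    parts = re.split(r"(\d+)", s)
    # With one capturing group, odd indices hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

def natural_sorted(strings):
    """Return a new list sorted in natural order."""
    return sorted(strings, key=natural_key)

print(natural_sorted(["1", "10", "2"]))
print(natural_sorted(["vol10", "vol2", "vol1"]))
```

Text and number chunks always alternate in the key, so two keys are compared string against string and integer against integer.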

v.0.1.5.5:

helper.uri: add support for flat folders.

v.0.1.5.4: bug fix

helper.yml: fix remaining bugs with long lines.
helper.uri: fix bugs in check_yml_file function.

v.0.1.5.3: bug fix

helper.yml: make sure that yml keys always contain a hashtag.

v.0.1.5.2: bug fix

helper.uri: Remove test that blocked the script.

v.0.1.5.1: bug fix

helper.yml: fix bug related to long lines in the dicToYML function.

v.0.1.5:

helper.yml: add fix_broken_yml function to fix yml files that are unreadable due to indentation problems (or keys that don't end with a colon)
helper.uri: rewrite the check_yml_files function to fix a bug in the character count and add additional checks.
helper.funcs: allow more than one yml_type in the function get_all_yml_files_in_folder.
helper.ara:
- Add missing EXTENDED ARABIC-INDIC DIGITS characters 67890
- Add tokenize function
- Fix typos in normalize_per doctest
new_books.convert: add converter for Masāḥa Ḥurra epub files (epub_converter_masaha.py, with helper file html2md_masaha.py)
new_books.convert.epub_converter_generic.py: implement overwrite option for (dis)allowing overwriting existing converted files.

v.0.1.4:

helper.templates: replace the multiple book relations fields in the book yml file with a single field, #40#BOOK#RELATED##:.
helper.yml: make not rearranging lines ("reflowing") in yml files the default, and change the default line length to 80.
helper.funcs: add a get_all_yml_files_in_folder, analogous to the existing get_all_text_files_in_folder function

v.0.1.3:

new_books.convert: add converters for ALCorpus and Ptolemaeus texts
new_books.convert.helper.html2md: tweaks to import of options + small tweaks
helper.ara: Stop ar_cnt_file from raising exception if book misses splitter; instead, print warning
helper.funcs:
- fix bug in get_all_text_files_in_folder function: missing periods in regex.
- improve missing splitter message
- use ara.normalize_ara_light function instead of ara.normalize_ara_extra_light in text_cleaner function
helper.uri:
- make it possible to pass a specific version_fp to the check_token_count function; before, that function generated that path from the URI, but this created problems when files were not stored in the standard OpenITI folders.
- add find_latest parameter in check_token_count function; if False, the function will count the tokens in the specific version_fp provided; if True, the script will count tokens in the file with the most advanced extension (.mARkdown > .completed > .inProgress > [no extension])
helper.yml:
- make it possible to pass a specific yml_fp to the ymlToDic function, so that the script can print the path (if provided) for signalling empty yml files.
- Include possibility that yml key ends with more than one colon
- readYML: add exception message when yml file could not be read.
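
The extension ranking in the find_latest entry above (.mARkdown > .completed > .inProgress > no extension) can be sketched as a simple selection by rank. The helper most_advanced and the file names are hypothetical; this is not the code of check_token_count. The v.0.1.5.6 entry later gives files without extension priority over ".inProgress" files, which this sketch does not reflect.

```python
import os

# Ranking as stated in the v.0.1.3 entry; higher means more advanced.
EXT_RANK = {".mARkdown": 3, ".completed": 2, ".inProgress": 1}

def most_advanced(paths):
    """Return the path whose extension ranks highest, or None if paths is empty."""
    def rank(path):
        ext = os.path.splitext(os.path.basename(path))[1]
        # Anything else (including the last URI component of a file
        # without a status extension) counts as "no extension".
        return EXT_RANK.get(ext, 0)
    return max(paths, key=rank, default=None)

# Hypothetical version file names.
base = "0255Jahiz.Hayawan.Shamela0023775-ara1"
candidates = [base, base + ".inProgress", base + ".completed"]
print(most_advanced(candidates))
```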

v.0.1.2:

openiti.helper.funcs: Fixed bug in report_missing_numbers function.
openiti.new_books.convert: added ShamAY converter and small updates to other shamela converters.

v.0.1.1:

openiti.helper.funcs: Added get_all_text_files_in_folder() generator
openiti.helper.uri: Fix bug in new_yml function (URI used to have ".yml" in it)
openiti.new_books.convert.shamela_converter.py: Improved formatting of the text and notes and added support for shamela collections in which the .mdb files contain more than one book.
openiti.new_books.convert.tei_converter_Thielen, new_books/convert/helper/html2md_Thielen: added new converter for TEI files provided by Jan Thielen.
new_books/convert/tei_converter_generic.py, new_books/convert/helper/html2md.py: Add the possibility to pass options to the markdownify function

v.0.1.0:

openiti.helper.yml: add support for empty lines and bullet lists in multiline values
openiti.new_books.convert.shamela_converter: fix bugs in shamela converter

v.0.0.9.post1 (patch):

openiti.helper.ara: fix bug in regex compilation

v.0.0.9:

openiti.new_books.convert: check and update all converters
openiti.helper.ara: make counting characters in editorial sections optional (default: include Arabic characters in editorial sections)
openiti.helper.yml: add custom error messages for broken and empty yml files
openiti.git.git_util: add git utilities class, with commit method

v.0.0.8:

openiti.new_books.add.add_books: fix import bug
openiti.new_books.convert: add converter for Noorlib html files

v.0.0.7:

openiti.git.get_issues: change authentication from username/password to GitHub token
openiti.helper.ara: add function to normalize composite Arabic characters
openiti.helper.uri: move functions for adding texts to the corpus to a new module, openiti.new_books.add.add_books
openiti.helper.uri: fix bug in the character count function (did not work if execute==True)
openiti.new_books.convert: restructured folder and moved helper functions into a new subfolder called helper
openiti.new_books.convert.generic_converter:
- reordered the main convert_file function and added inline documentation
- made convert_files_in_folder function more flexible
openiti.new_books.convert: added generic converters for shamela libraries, html and tei xml files, and custom converters for eShia and GRAR libraries
- openiti.new_books.convert.shamela_converter
- openiti.new_books.convert.html_converter_generic
- openiti.new_books.convert.html_converter_eShia
- openiti.new_books.convert.tei_converter_generic
- openiti.new_books.convert.tei_converter_GRAR
openiti.new_books.convert.helper: added helper functions for the new converters:
- openiti.new_books.convert.helper.html2md_eShia
- openiti.new_books.convert.helper.html2md_GRAR
- openiti.new_books.convert.helper.tei2md
- openiti.new_books.convert.helper.bok

v.0.0.6:

openiti.helper.uri: use both Arabic character and token count in yml files
openiti.helper.uri: add support for paths to files that are not in 25-years repos (e.g., for release)
openiti.helper.uri: fix bugs
added Sphinx documentation

v.0.0.5:

openiti.helper.funcs: added Arabic token count function
openiti.helper.uri: use Arabic token count instead of Arabic character count for yml file revision. Also, revise token count for every version yml file instead of only for version yml files that do not contain a count.

v.0.0.4:

openiti.helper.uri: removed the restriction on the use of digits in book titles
openiti.helper.uri: added a check for empty yml files
openiti.helper.yml: added documentation and doctests
openiti.helper.yml: added check for empty yml files + changed splitting of yml files so that even unindented multi-line values can be correctly parsed.

Release history

- 0.1.6 (this version): Nov 17, 2025
- 0.1.5.10: Oct 11, 2023
- 0.1.5.9: Mar 8, 2023
- 0.1.5.8: Aug 2, 2022
- 0.1.5.7: Jun 30, 2022
- 0.1.5.6: Jun 28, 2022
- 0.1.5.5: Oct 13, 2021
- 0.1.5.4: Oct 13, 2021
- 0.1.5.3: Oct 12, 2021
- 0.1.5.2: Oct 12, 2021
- 0.1.5.1: Oct 12, 2021
- 0.1.5: Oct 10, 2021
- 0.1.4: Mar 9, 2021
- 0.1.3: Feb 11, 2021
- 0.1.2: Oct 14, 2020
- 0.1.1: Sep 30, 2020
- 0.1.0: Aug 26, 2020
- 0.0.9.post1: Aug 11, 2020
- 0.0.9: Aug 10, 2020
- 0.0.9a0 (pre-release): Aug 11, 2020
- 0.0.8: Jun 24, 2020
- 0.0.7: Apr 24, 2020
- 0.0.6: Mar 24, 2020
- 0.0.5: Feb 18, 2020
- 0.0.4: Feb 18, 2020
- 0.0.3: Feb 17, 2020
- 0.0.2: Feb 17, 2020
- 0.0.1: Feb 17, 2020

Download files

Source Distribution

openiti-0.1.6.tar.gz (36.1 MB)

Uploaded Nov 17, 2025 Source

Built Distribution

openiti-0.1.6-py3-none-any.whl (270.1 kB)

Uploaded Nov 17, 2025 Python 3

File details

Details for the file openiti-0.1.6.tar.gz.

File metadata

Download URL: openiti-0.1.6.tar.gz
Upload date: Nov 17, 2025
Size: 36.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for openiti-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`c3470b73f9b20e45da6b87a4410fbb148f59ef59439e3226c61fb08142aa5e45`
MD5	`87c0460421e7964ff9cb7969edc88118`
BLAKE2b-256	`de443a95133db48703895696bb44195e8a89fa1a7688ae14deac10a80cd18b79`

File details

Details for the file openiti-0.1.6-py3-none-any.whl.

File metadata

Download URL: openiti-0.1.6-py3-none-any.whl
Upload date: Nov 17, 2025
Size: 270.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for openiti-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6c1799a5c6f9f7f22a198b1d4c0c9579939795c25035eefa2e066ad288a18d43`
MD5	`cf8f127d71c7dd0971ecada3963d04e6`
BLAKE2b-256	`0bb7a355483fee6294da4cc6485eb2e30e4d002a96fb2e12aae32068fb2bcc9c`
