A Python package with methods to handle the complexities of Hebrew text.

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
Programming Language
Typing
- Typed

Project description

Hebrew("בְּרֵאשִׁ֖ית")

A Python package with methods to handle Hebrew text.

Installation

$ pip install hebrew

Example

Hebrew assists in working with Hebrew text by providing methods to handle the text according to user-perceived characteristics. Additionally, methods for common Hebrew text processing are provided.

>>> from hebrew import Hebrew
>>>
>>> v2 = Hebrew("וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃")
>>>
>>> v2.no_punctuation()
וְהָאָרֶץ הָיְתָה תֹהוּ וָבֹהוּ וְחֹשֶׁךְ עַל־פְּנֵי תְהוֹם וְרוּחַ אֱלֹהִים מְרַחֶפֶת עַל־פְּנֵי הַמָּיִם׃
>>>
>>> v2.text_only()
והארץ היתה תהו ובהו וחשך על־פני תהום ורוח אלהים מרחפת על־פני המים
>>>
>>> v2.length
35
>>> v2.words(split_maqaf=True)
[וְהָאָ֗רֶץ, הָיְתָ֥ה, תֹ֙הוּ֙, וָבֹ֔הוּ, וְחֹ֖שֶׁךְ, עַל, פְּנֵ֣י, תְה֑וֹם, וְר֣וּחַ, אֱלֹהִ֔ים, מְרַחֶ֖פֶת, עַל, פְּנֵ֥י, הַמָּֽיִם׃]

Grapheme Characters

Hebrew text comes in different forms depending on the context. It may appear with Niqqudot, "a system of diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet". ^1 It may also carry extensive punctuation that connects or separates words, as well as cantillation marks, which are "used as a guide for chanting the text, either from the printed text or, in the case of the public reading of the Torah" ^2.

Because of the above, from the perspective of a Hebrew reader, the following three words are the same:

בְּרֵאשִׁ֖ית
בְּרֵאשִׁית
בראשית

However, as Unicode strings they are entirely different, because of the additional characters.

>>> len("בְּרֵאשִׁ֖ית")  # 1
12
>>> len("בְּרֵאשִׁית")  # 2
11
>>> len("בראשית")  # 3
6
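
To see where the extra length comes from, the standard library's unicodedata module can name each code point. This is a small sketch, not part of the hebrew package; the word is spelled out with escape sequences, and the code point order used here is an assumption about how the first word above is encoded.

```python
import unicodedata

# The first word above, written as explicit code points (assumed order).
word = (
    "\u05d1\u05b0\u05bc"        # bet + sheva + dagesh
    "\u05e8\u05b5"              # resh + tsere
    "\u05d0"                    # alef
    "\u05e9\u05c1\u05b4\u0596"  # shin + shin dot + hiriq + tipeha (cantillation)
    "\u05d9"                    # yod
    "\u05ea"                    # tav
)

for ch in word:
    # Category "Lo" is a letter; "Mn" is a nonspacing mark attached to it.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

letters = [ch for ch in word if unicodedata.category(ch) == "Lo"]
marks = [ch for ch in word if unicodedata.category(ch) == "Mn"]
print(len(word), len(letters), len(marks))  # 12 6 6
```

Only the six letters are what a reader counts; the other six code points are vowel points and one cantillation mark.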

This affects the user in a number of other ways. For example, suppose I want to get the root of this Hebrew word using a slice. Expected: רֵאשִׁ֖ית

>>> he = "בְּרֵאשִׁ֖ית"
>>> he[-5:]
'ִׁ֖ית'

The solution is to handle the Unicode string as a list of grapheme^3 characters, where each letter and its accompanying characters are treated as a single unit.
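
The idea can be sketched with the standard library alone. The helper below, split_clusters, is a hypothetical name and a deliberate simplification: it attaches every combining mark to the character before it, which is enough for pointed Hebrew but is not full Unicode grapheme segmentation and is not how the hebrew package is implemented.

```python
import unicodedata


def split_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    A simplification of Unicode grapheme segmentation that only looks at
    combining marks; it is enough for pointed Hebrew, not for all scripts.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding letter
        else:
            clusters.append(ch)  # start a new user-perceived character
    return clusters


# Same word as above, as explicit code points (assumed order).
he = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05c1\u05b4\u0596\u05d9\u05ea"
clusters = split_clusters(he)
print(len(clusters))          # 6
print("".join(clusters[1:]))  # everything after the first letter and its marks
```

Slicing the list of clusters keeps each letter together with its marks, so the result starts at a letter rather than in the middle of one.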

Working with Grapheme Characters

Using the grapheme library for Python, we can work with grapheme characters as units. This allows us to get the right number of characters, slice the string correctly, and more.

>>> import grapheme
>>> grapheme.length("בְּרֵאשִׁ֖ית")
6
>>> grapheme.slice("בְּרֵאשִׁ֖ית", start=1, end=6)
'רֵאשִׁ֖ית'

This library includes two classes. GraphemeString supports all the functions made available by grapheme. Hebrew subclasses GraphemeString and adds methods for handling Hebrew text. This allows us to interact with the text like so:

>>> from hebrew import Hebrew
>>>
>>> v2 = Hebrew("וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃")
>>>
>>> v2.no_punctuation()
וְהָאָרֶץ הָיְתָה תֹהוּ וָבֹהוּ וְחֹשֶׁךְ עַל־פְּנֵי תְהוֹם וְרוּחַ אֱלֹהִים מְרַחֶפֶת עַל־פְּנֵי הַמָּיִם׃
>>>
>>> v2.text_only()
והארץ היתה תהו ובהו וחשך על־פני תהום ורוח אלהים מרחפת על־פני המים
>>>
>>> v2.length
35
>>> v2.words(split_maqaf=True)
[וְהָאָ֗רֶץ, הָיְתָ֥ה, תֹ֙הוּ֙, וָבֹ֔הוּ, וְחֹ֖שֶׁךְ, עַל, פְּנֵ֣י, תְה֑וֹם, וְר֣וּחַ, אֱלֹהִ֔ים, מְרַחֶ֖פֶת, עַל, פְּנֵ֥י, הַמָּֽיִם׃]

The text used in these examples and in testing was sourced from Sefaria.
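
For readers curious about what text_only() and words(split_maqaf=True) involve, here is a rough standard-library approximation. It is a sketch under stated assumptions, not the package's implementation: strip_marks and split_words are hypothetical helpers, and the phrase is the last three words of the verse written as explicit code points in an assumed order.

```python
import re
import unicodedata

MAQAF = "\u05be"      # Hebrew hyphen joining two words
SOF_PASUQ = "\u05c3"  # end-of-verse mark


def strip_marks(text: str) -> str:
    """Drop vowel points and cantillation marks (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def split_words(text: str, split_maqaf: bool = False) -> list[str]:
    """Split on whitespace, and optionally on the maqaf as well."""
    pattern = rf"[\s{MAQAF}]+" if split_maqaf else r"\s+"
    return [w for w in re.split(pattern, text) if w]


# The last three words of the verse, as explicit code points (assumed order).
phrase = (
    "\u05e2\u05b7\u05dc" + MAQAF
    + "\u05e4\u05b0\u05bc\u05e0\u05b5\u05a5\u05d9"
    + " "
    + "\u05d4\u05b7\u05de\u05bc\u05b8\u05bd\u05d9\u05b4\u05dd" + SOF_PASUQ
)

plain = strip_marks(phrase).replace(SOF_PASUQ, "")
print(plain)
print(split_words(plain))                    # two words, maqaf kept
print(split_words(plain, split_maqaf=True))  # three words
```

Unlike no_punctuation() as shown above, which keeps the vowel points, this sketch removes every nonspacing mark in one pass.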

Constants

Hebrew has constants for every letter, as well as lists of character categories:

>>> from hebrew import Hebrew
>>>
>>> Hebrew.FINAL_LETTERS
['ך', 'ם', 'ן', 'ף', 'ץ']
>>>
>>> Hebrew(Hebrew.ALEPH + Hebrew.KUMATZ)
אָ
>>> Hebrew.YIDDISH_LETTERS
['ײ', 'װ', 'ױ']
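
As an illustration of how such a list might be used, the sketch below checks whether an unpointed word ends in a final-form letter. The list is copied from the FINAL_LETTERS output above so the example runs without the package; ends_with_final_letter is a hypothetical helper, not something the package provides.

```python
# Final kaf, mem, nun, pe, tsadi, copied from the output shown above so the
# sketch runs without the package installed.
FINAL_LETTERS = ["\u05da", "\u05dd", "\u05df", "\u05e3", "\u05e5"]


def ends_with_final_letter(word: str) -> bool:
    """Return True if the last character of an unpointed word is a final form."""
    return bool(word) and word[-1] in FINAL_LETTERS


elohim = "\u05d0\u05dc\u05d4\u05d9\u05dd"          # ends in final mem
bereshit = "\u05d1\u05e8\u05d0\u05e9\u05d9\u05ea"  # ends in tav

print(ends_with_final_letter(elohim))    # True
print(ends_with_final_letter(bereshit))  # False
```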

Future Plans

My intention is to override some built-in Python functions for a more seamless but opinionated developer experience: for example, slicing with the Python [0:1] syntax, len(my_he_string), equality checks, and more. my_he_string.string is always available when access to the true Unicode characters is needed.
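
One way such overrides could look is sketched below. GraphemeText is a hypothetical class written for this illustration, not the package's planned API; it groups combining marks with the preceding letter (a simplification of grapheme segmentation), and plain string equality is an assumption about what "equality checks" might mean.

```python
import unicodedata


class GraphemeText:
    """Hypothetical wrapper, not the package's API: len(), slicing and ==
    operate on letter-plus-marks units instead of code points."""

    def __init__(self, string: str) -> None:
        self.string = string  # the raw Unicode string stays accessible
        self._clusters: list[str] = []
        for ch in string:
            if self._clusters and unicodedata.category(ch).startswith("M"):
                self._clusters[-1] += ch
            else:
                self._clusters.append(ch)

    def __len__(self) -> int:
        return len(self._clusters)

    def __getitem__(self, key):
        picked = self._clusters[key]
        # An int index yields one cluster (a str); a slice yields a list.
        text = picked if isinstance(picked, str) else "".join(picked)
        return GraphemeText(text)

    def __eq__(self, other: object) -> bool:
        if isinstance(other, GraphemeText):
            return self.string == other.string
        if isinstance(other, str):
            return self.string == other
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.string)

    def __repr__(self) -> str:
        return f"GraphemeText({self.string!r})"


word = GraphemeText(
    "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05c1\u05b4\u0596\u05d9\u05ea"
)
print(len(word), len(word.string))  # 6 12
print(word[1:].string)
```

Keeping the raw value on a .string attribute mirrors the escape hatch described above: the wrapper changes how the text is counted and sliced, not what is stored.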

Contributing

Contributions in the form of pull requests are very welcome! Many more methods for working with Hebrew text would be useful additions. More information and instructions for contributing can be found here.

Release history

- 0.8.1 (Feb 25, 2024)
- 0.6.0 (Apr 19, 2022)
- 0.5.8 (Apr 18, 2022)
- 0.5.7 (Apr 18, 2022)
- 0.5.6 (Apr 18, 2022)
- 0.5.5 (Nov 22, 2021)
- 0.5.4 (Nov 21, 2021)
- 0.5.3 (Nov 16, 2021)
- 0.5.2 (Nov 14, 2021)
- 0.5.0 (Nov 14, 2021)
- 0.4.0 (Nov 14, 2021)
- 0.3.0 (Nov 8, 2021), this version

Download files

Source Distribution

hebrew-0.3.0.tar.gz (9.7 kB)

Uploaded Nov 8, 2021 Source

Built Distribution

hebrew-0.3.0-py3-none-any.whl (9.1 kB)

Uploaded Nov 8, 2021 Python 3

File details

Details for the file hebrew-0.3.0.tar.gz.

File metadata

Download URL: hebrew-0.3.0.tar.gz
Upload date: Nov 8, 2021
Size: 9.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.11 CPython/3.10.0 Windows/10

File hashes

Hashes for hebrew-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`4adab8d7f37ef9df27fe334976fb663b07c3020ca029a498b55fc567144c1f0a`
MD5	`eee405163afe8ba335bd6db225d2c6c4`
BLAKE2b-256	`766af1adaef584f02640fd31b43b8bcfb9af7fd2a53145c495ad9a23d5ebbf1d`

File details

Details for the file hebrew-0.3.0-py3-none-any.whl.

File metadata

Download URL: hebrew-0.3.0-py3-none-any.whl
Upload date: Nov 8, 2021
Size: 9.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.11 CPython/3.10.0 Windows/10

File hashes

Hashes for hebrew-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3b96d678b49caa84fb403816d4339d08d3c8d3c591b6c9f00094fac4cc1be5ce`
MD5	`de91ca3b942d07135626b798b09a4229`
BLAKE2b-256	`75d87515ff30a495a0543a43087e354afc52be5031639e76d93f04f893dcf808`
