CJK Line Break Organizer

Development Status
- 3 - Alpha
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python
Topic
- Software Development :: Libraries :: Python Modules

Project description

Budou

English uses spacing and hyphenation as cues for beautiful, legible line breaks. Some CJK languages have neither, which makes line breaking notoriously harder: breaks fall at arbitrary positions, often in the middle of a word. This is a long-standing issue in web typography, and it degrades readability.

Budou automatically translates CJK sentences into organized HTML code in which lexical chunks are wrapped in non-breaking markup, so that line breaks can be controlled semantically. Budou uses the Google Cloud Natural Language API (NL API) to analyze the input sentence, then concatenates words into meaningful chunks using part-of-speech (POS) tagging and syntactic information. Each chunk is wrapped in a SPAN tag; setting the display property of those tags to inline-block in CSS keeps semantic units from being split at the end of a line.
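To make the shape of the output concrete, here is a minimal sketch of the wrapping step only. It assumes the chunks have already been computed; wrap_chunks is a hypothetical helper written for illustration and is not part of Budou's API.

```python
from html import escape


def wrap_chunks(chunks, attributes):
    """Wrap each chunk in a SPAN tag carrying the given attributes.

    Hypothetical helper for illustration; not Budou's implementation.
    """
    # Render the attribute mapping once, escaping values for safe HTML output.
    attrs = ''.join(
        ' {}="{}"'.format(name, escape(str(value), quote=True))
        for name, value in sorted(attributes.items())
    )
    # Every chunk becomes its own SPAN so CSS can keep it on one line.
    return ''.join(
        '<span{}>{}</span>'.format(attrs, escape(chunk)) for chunk in chunks
    )


print(wrap_chunks(['今日も', '元気です'], {'class': 'wordwrap'}))
```

With display: inline-block applied to the wordwrap class, a browser would break the line between the two SPAN tags rather than inside either chunk.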

Install

Install the library by running pip install budou. You also need a JSON credential file to authorize requests to the NL API.

How to use

Get the parser by completing authentication with a credential file for NL API, which can be downloaded from Google Cloud Platform by navigating through "API Manager" > "Credentials" > "Create credentials" > "Service account key" > "JSON".

import budou
# Login to Cloud Natural Language API with credentials
parser = budou.authenticate('/path/to/credentials.json')
result = parser.parse(u'今日も元気です', attributes={'class': 'wordwrap'}, language='ja')

print(result['html_code'])  # => "<span class="wordwrap">今日も</span><span class="wordwrap">元気です</span>"

print(result['chunks'][0]['word'])  # => "今日も"
print(result['chunks'][1]['word'])  # => "元気です"

Setting display: inline-block on each SPAN tag in CSS keeps the semantic units in the output HTML from being split at the end of a line.

.wordwrap {
  display: inline-block;
}

How it works

[Figure: Nexus example image]

Supported languages

Japanese (ja)
Simplified Chinese (zh-Hans)
Traditional Chinese (zh-Hant)

Support for other Asian languages with line break issues, such as Thai, will be added as Cloud Natural Language API adds support.

Korean support?

Korean has spaces between chunks, so you can organize line breaking simply by putting word-break: keep-all in your CSS. No need for Budou :)

Where to use

Budou is designed mainly for eye-catching text such as titles and headings, where badly split chunks stand out more because of the larger typography.

Caching

Budou caches results by default to avoid unnecessary requests to the NL API and to speed up processing. To force a cache refresh, pass use_cache=False.

In a standard environment, Budou creates a cache file in Python's shelve format.

In Google App Engine Python Standard Environment, Budou will use memcache to make the cache available across instances.
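The idea of a shelve-backed cache can be sketched as follows. This is an illustration under stated assumptions, not Budou's actual caching code: cache_key, cached_parse, and the parse_fn callback are hypothetical names, and the key scheme (a hash of language and text) is an assumption. The sketch requires Python 3.

```python
import hashlib
import shelve


def cache_key(text, language):
    """Build a stable string key from the language and input text (assumed scheme)."""
    raw = '{}:{}'.format(language, text).encode('utf-8')
    return hashlib.sha256(raw).hexdigest()


def cached_parse(cache_path, text, language, parse_fn, use_cache=True):
    """Return parse_fn(text, language), reusing a stored result when allowed.

    With use_cache=False the stored entry is ignored and overwritten,
    which refreshes the cache for this input.
    """
    key = cache_key(text, language)
    with shelve.open(cache_path) as cache:
        if use_cache and key in cache:
            return cache[key]
        result = parse_fn(text, language)
        cache[key] = result
        return result
```

In this sketch parse_fn stands in for whatever makes the remote request, so a repeated call with the same text and language reads from the shelve file instead of calling it again.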

Entity mode

The default parser uses only the results of Syntactic Analysis, but you can also use Entity Analysis by specifying use_entity=True. Entity Analysis improves parsing accuracy for some phrases, especially proper nouns, so it is recommended when your target sentences include the name of a person, place, organization, etc. Note that Entity Analysis incurs additional cost because it requires additional requests to the NL API. For details about API pricing, refer to Pricing | Google Cloud Natural Language API Documentation.

import budou
# Login to Google Cloud Natural Language API with credentials
parser = budou.authenticate('/path/to/credentials.json')

# Without Entity mode (default)
result = parser.parse(u'六本木ヒルズにいます。', use_cache=False, language='ja')
print(result['html_code'])  # => "<span class="ww">六本木</span><span class="ww">ヒルズに</span><span class="ww">います。</span>"

# With Entity mode
result = parser.parse(u'六本木ヒルズにいます。', use_cache=False, language='ja', use_entity=True)
print(result['html_code'])  # => "<span class="ww">六本木ヒルズに</span><span class="ww">います。</span>"

Maximum chunk length

Some long words (マルチスクリーン, インフルエンザ, etc.) may stand out in certain layouts because of their length. For example:

これが
マルチスクリーン
です。

Using max_length=6 together with display: inline-block styling on the output SPAN tags avoids this:

これがマルチス
クリーンです。

The output HTML leaves the long word outside any SPAN tag:

<span class="budou">これが</span>マルチスクリーン<span class="budou">です。</span>
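The max_length rule can be sketched in a few lines. This is an illustration only: wrap_with_max_length is a hypothetical helper, not Budou's implementation, and it assumes that chunk length is counted in characters.

```python
from html import escape


def wrap_with_max_length(chunks, max_length=None, css_class='budou'):
    """Wrap chunks in SPAN tags, leaving chunks longer than max_length bare.

    Hypothetical helper; length is assumed to be measured in characters.
    """
    parts = []
    for chunk in chunks:
        text = escape(chunk)
        if max_length is not None and len(chunk) > max_length:
            # Too long: emit the text unwrapped so it may break mid-word.
            parts.append(text)
        else:
            parts.append(
                '<span class="{}">{}</span>'.format(escape(css_class), text)
            )
    return ''.join(parts)


print(wrap_with_max_length(['これが', 'マルチスクリーン', 'です。'], max_length=6))
```

Because マルチスクリーン is eight characters long, it exceeds the limit of 6 and is emitted without a SPAN tag, while the two shorter chunks stay wrapped.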

Accessibility

Some screen readers read the wrapped chunks one by one when Budou is applied, which may degrade the experience for users who rely on audio support. You can attach any attribute to the output chunks to improve accessibility. For example, you can make screen readers read the undivided sentence by combining the aria-describedby and aria-label attributes in the output.

Input (your-script.py)

input_text = u'やりたいことのそばにいる'
element_id = 'description'
result = parser.parse(input_text, {'aria-describedby': element_id}, language='ja')

Template (your-template.tpl)

<p id="{{element_id}}" aria-label="{{input_text}}">{{result.html_code}}</p>

HTML Output (your-output.html)

<p id="description" aria-label="やりたいことのそばにいる">
  <span class="ww" aria-describedby="description">やりたい</span>
  <span class="ww" aria-describedby="description">ことの</span>
  <span class="ww" aria-describedby="description">そばに</span>
  <span class="ww" aria-describedby="description">いる</span>
</p>

Options

In addition to the input text, the parser.parse() method accepts the options below.

Option	Type	Default	Description
attributes	dictionary	`{'class': 'ww'}`	A key-value mapping for attributes of output SPAN tags.
use_cache	boolean	`True`	Whether to use caching.
language	str	`None`	Language of the text. If `None` is provided, NL API tries to detect from the input text.
use_entity	boolean	`False`	Whether to use Entity mode.
max_length	int	`None`	Maximum chunk length. If a chunk is longer than this it will not be wrapped in a SPAN tag.
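One way to keep these options in one place is to merge overrides into the documented defaults before calling the parser. The sketch below is an illustration only: DEFAULT_OPTIONS and build_options are hypothetical names, not part of Budou, and the defaults are copied from the table above.

```python
# Defaults as listed in the options table above.
DEFAULT_OPTIONS = {
    'attributes': {'class': 'ww'},
    'use_cache': True,
    'language': None,
    'use_entity': False,
    'max_length': None,
}


def build_options(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(
            'Unknown option(s): {}'.format(', '.join(sorted(unknown)))
        )
    options = dict(DEFAULT_OPTIONS)
    # Copy the nested dict so callers cannot mutate the shared default.
    options['attributes'] = dict(DEFAULT_OPTIONS['attributes'])
    options.update(overrides)
    return options


options = build_options(language='ja', max_length=6)
print(options)
```

Assuming parse() accepts each option as a keyword argument, as the examples in this document suggest, the resulting dictionary could then be passed as parser.parse(text, **options).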

Pricing

Budou is backed by the Google Cloud Natural Language API, so using it may incur API costs.

For supported languages such as Japanese, the default parser uses Syntax Analysis and incurs cost according to monthly usage. If you enable Entity mode by specifying use_entity=True, the parser uses both Syntax Analysis and Entity Analysis, which incurs additional cost.

The Google Cloud Natural Language API has a free quota that lets you start testing at no cost; refer to the Google Cloud Natural Language API Pricing Guide for detailed pricing information.

Author

Shuhei Iitsuka

Website: https://tushuhei.com
Twitter: https://twitter.com/tushuhei

Disclaimer

This library is authored by a Googler and copyrighted by Google, but is not an official Google product.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Release history

0.9.8

Nov 7, 2019

0.9.7

Nov 5, 2019

0.9.6

Oct 30, 2019

0.9.5

Oct 30, 2019

0.9.4

Oct 22, 2019

0.9.3

Sep 11, 2018

0.9.2

Sep 3, 2018

0.9.1

Sep 3, 2018

This version

0.8.10

Apr 18, 2018

0.8.8

Apr 3, 2018

0.8.7

Apr 2, 2018

0.8.6

Apr 2, 2018

0.8.5

Mar 22, 2018

0.8.3

Dec 20, 2017

0.8.1

Dec 20, 2017

0.8.0

Oct 31, 2017

0.7.5

Oct 19, 2017

0.7.1

Oct 16, 2017

0.7.0

Oct 16, 2017

0.6.6

Oct 4, 2017

0.6.0

Sep 19, 2017

0.5.0

Sep 19, 2017

0.4.0

Sep 13, 2017

0.3.2

Aug 15, 2017

0.3.1

Jun 6, 2017

0.3.0

Apr 17, 2017

0.2.4

Apr 10, 2017

0.2.3

Apr 5, 2017

0.2.2

Feb 14, 2017

0.2.1

Dec 15, 2016

0.2.0

Dec 8, 2016

0.1.2

Oct 24, 2016

Download files

Source Distribution

budou-0.8.10.tar.gz (14.6 kB)

Uploaded Apr 18, 2018 Source

Built Distribution

budou-0.8.10-py2.py3-none-any.whl (19.0 kB)

Uploaded Apr 18, 2018 Python 2, Python 3

File details

Details for the file budou-0.8.10.tar.gz.

File metadata

Download URL: budou-0.8.10.tar.gz
Upload date: Apr 18, 2018
Size: 14.6 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for budou-0.8.10.tar.gz
Algorithm	Hash digest
SHA256	`2f302e879663e72b2914080d443acc52f0cfd322b7b190d3f3153e2e6268d0db`
MD5	`41117410753b788d630065e7dcc6f8ec`
BLAKE2b-256	`030d89229ddb6779b8b633151873b2c9ca8382411866b20ac1835acea47fa6d7`

File details

Details for the file budou-0.8.10-py2.py3-none-any.whl.

File metadata

Download URL: budou-0.8.10-py2.py3-none-any.whl
Upload date: Apr 18, 2018
Size: 19.0 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for budou-0.8.10-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`122f859b8a65ce7263873c53d0d459c3a1528853207b893d51a82e13c88be64e`
MD5	`30ebd78844713efb80bde6889c61f3c4`
BLAKE2b-256	`6a0dfab8121f5ff7d1a173f96f10a8776e49e1bd33835196ec9ae8828a4374f2`
