A tool that highlights inconsistencies in word segmentation.

space-diff

Description

space-diff is a tool that highlights inconsistencies in word segmentation within spaced texts (such as training corpora) for any spaceless orthography.

This tool is pure Python and requires Python 3.7+.
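To make "inconsistency in word segmentation" concrete, here is a minimal sketch of the idea. It is not space-diff's actual algorithm, which this README does not describe. The function name `find_inconsistencies` and the two example sentences are assumptions made up for illustration. The sketch treats every multi-character token as a candidate word, looks for that character sequence in the unspaced text of each line, and records which tokens cover it there.

```python
from collections import defaultdict


def find_inconsistencies(lines):
    """Map each multi-character token to the different ways it is segmented.

    Hypothetical helper for illustration only; not part of space-diff.
    """
    token_lines = [line.split() for line in lines]
    # Every token longer than one character is a candidate word.
    vocab = {tok for toks in token_lines for tok in toks if len(tok) > 1}

    seen = defaultdict(set)
    for toks in token_lines:
        text = "".join(toks)  # the line as it would look without spaces
        # Character span of each token inside the unspaced text.
        spans, pos = [], 0
        for tok in toks:
            spans.append((pos, pos + len(tok), tok))
            pos += len(tok)
        for word in vocab:
            start = text.find(word)
            while start != -1:
                end = start + len(word)
                # Tokens that overlap this occurrence of the word.
                cover = [tok for s, e, tok in spans if s < end and e > start]
                seen[word].add(" ".join(cover))
                start = text.find(word, start + 1)

    # Keep only words that were segmented in more than one way.
    return {word: segs for word, segs in seen.items() if len(segs) > 1}


result = find_inconsistencies(["我 住 在 台北市", "台北 市 很 大"])
for word, variants in sorted(result.items()):
    print(word, sorted(variants))
```

In this toy input, 台北市 appears once as a single token and once split as 台北 市, so both 台北市 and 台北 are flagged for review.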

Installation

pip install space-diff

Usage/Tutorial

Two sample corpora of segmented traditional Chinese, adapted from Universal Dependencies' Chinese corpora, are included on this project's homepage and are used throughout this tutorial so that you can follow along. The instructions below assume that you have already installed space-diff and downloaded the sample corpora.

Command line usage

You can simply call the tool at the command line as follows:

$ space-diff [-h] [-d] corp [corp ...]

where -h/--help and -d/--digits are optional arguments, and corp is one or more corpus files of segmented text.
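The usage line above has the shape that Python's argparse module generates. As a rough sketch of how a command-line interface with these arguments could be declared, consider the following. This is an assumption for illustration, not space-diff's source code; `build_parser` and the help strings are invented here.

```python
import argparse


def build_parser():
    """Build a parser accepting the arguments shown in the usage line.

    Hypothetical reconstruction; the real tool may be written differently.
    """
    parser = argparse.ArgumentParser(
        prog="space-diff",
        description="Highlight inconsistencies in word segmentation.",
    )
    # Optional flag: present -> True, absent -> False.
    parser.add_argument("-d", "--digits", action="store_true",
                        help="ignore digits when searching")
    # One or more corpus files (nargs="+" requires at least one).
    parser.add_argument("corp", nargs="+",
                        help="file(s) of segmented text")
    return parser


args = build_parser().parse_args(["-d", "sample_corp_a.txt", "sample_corp_b.txt"])
print(args.digits, args.corp)
```

Because `corp` uses `nargs="+"`, omitting every corpus file is an error, which matches the "one or more" wording above.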

Using the sample data

By running:

$ space-diff sample_corp_a.txt sample_corp_b.txt

you will see that the program reports its progress as it processes the files and then prints a human-readable summary of its findings. Here's a sample:

Image of sample output

This output lets you manually review each instance of segmentation inconsistency and note which are errors and which are inherent variation. The idea is then to fix the actual errors in your corpora before training a segmenter (or some other stochastic tool) on that data.

Using your own data

For your own data, pass the files (with their paths, if necessary) to space-diff, separated by spaces, and optionally redirect the output to a file of your choice.

$ space-diff ~/path/to/thisfile.txt ~/path/to/another.txt ~/path/to/third.txt > ~/Desktop/seg_inconsistency.txt
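If you would rather drive the tool from a Python script than from the shell, the same invocation can be sketched with the standard subprocess module. The helper names `build_command` and `save_report` are assumptions introduced here. The sketch relies only on the command-line interface documented above and assumes that the space-diff executable is on your PATH.

```python
import subprocess
from pathlib import Path


def build_command(corpora, ignore_digits=False):
    """Assemble the space-diff command line as a list of arguments."""
    cmd = ["space-diff"]
    if ignore_digits:
        cmd.append("-d")
    # Expand "~" ourselves, since no shell is involved.
    cmd.extend(str(Path(p).expanduser()) for p in corpora)
    return cmd


def save_report(corpora, report_path, ignore_digits=False):
    """Run space-diff and write its standard output to report_path."""
    cmd = build_command(corpora, ignore_digits)
    with open(Path(report_path).expanduser(), "w", encoding="utf-8") as out:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, stdout=out, check=True)


print(build_command(["sample_corp_a.txt", "sample_corp_b.txt"]))
```

Calling `save_report(["~/path/to/thisfile.txt", "~/path/to/another.txt"], "~/Desktop/seg_inconsistency.txt")` would then mirror the shell redirection shown above. Whether progress messages also end up in the file depends on which stream the tool writes them to, which this README does not state.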

Excluding digits

By default, the tool considers strings like 12, 712, 1 20, and 1220 to be inconsistent segmentations of a 'multi-character' token 12. If you wish to declutter the output of numerical cases like this, pass space-diff the -d flag to ignore digits when searching.

$ space-diff -d sample_corp_a.txt sample_corp_b.txt

or

$ space-diff sample_corp_a.txt sample_corp_b.txt --digits
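The effect of excluding digits can be pictured as a filter over the findings. The sketch below is an illustration under assumptions, not the tool's implementation. The `findings` dictionary is hand-written, with the 12 entry taken from the example above and the 台北市 entry invented. `drop_digit_entries` is a hypothetical helper, and it drops only tokens made up entirely of digits, which may differ from how -d decides what counts as a digit case.

```python
def drop_digit_entries(findings):
    """Return the findings without tokens that consist only of digits.

    Hypothetical helper; str.isdigit() is used as a stand-in for whatever
    test space-diff applies internally.
    """
    return {word: variants
            for word, variants in findings.items()
            if not word.isdigit()}


# Hand-written example data, not real space-diff output.
findings = {
    "12": {"12", "712", "1 20", "1220"},
    "台北市": {"台北市", "台北 市"},
}

filtered = drop_digit_entries(findings)
print(sorted(filtered))
```

After filtering, only the 台北市 entry remains, which is the kind of decluttering the -d flag is meant to provide.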

License

GNU GPLv3 - see LICENSE file for details.

Contact

Blake Perry Smith middlename DOT lastname+'b' AT gmail
