A tool to separate truncated text.

Project description

Word Slicer

Split long texts that are missing spaces (or that contain too many).

Usage

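import wordslicer

# Train a model from a text file in the target language.
model = wordslicer.train('train_file')

with open('input_file', 'r') as f:
    text = f.read()

text = wordslicer.separate(model, text)  # or wordslicer.join(model, text)

# Write the result with the standard library.
with open('output_file', 'w') as f:
    f.write(text)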

Performance

For an input of:

  • 161029 words to train
  • 1000 lines to separate

The results:

  • Text with 36889 words
  • Time: real 0m1.368s

Example:

>>> wordslicer.separate(model, "Boromirhesitatedforasecond.'Yes,andno,'heansweredslowly.'Yes:Ifoundhimsomewayupthehill,andIspoketohim.IurgedhimtocometoMinasTirithandnottogoeast.Igrewangryandheleftme.Hevanished.Ihaveneverseensuchathinghappenbefore.thoughIhaveheardofitintales.HemusthaveputtheRingon.Icouldnotfindhimagain.Ithoughthewouldreturntoyou.'")

Boromir hesitated for a second. 'Yes, and no,' he answered slowly. 'Yes: I found him some way up the hill, and I spoke to him. I urged him to come to Minas Tirith and not to go east. I grew angry and he left me. He vanished. I have never seen such a thing happen before. though I have heard of it in tales. He must have put the Ring on. I could not find him again. I though the would return to you.'
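
The package's internals are not described here. As a rough illustration of the kind of frequency-based segmentation discussed in the Stack Overflow answer credited below, here is a minimal sketch that counts words in a spaced training text and then picks the lowest-cost split with dynamic programming. The names `train_counts` and `segment` are hypothetical and are not part of the wordslicer API; the sketch also assumes lowercase input without punctuation, which the real package evidently handles.

```python
import math
import re
from collections import Counter


def train_counts(training_text):
    """Count lowercase word frequencies in spaced training text."""
    return Counter(re.findall(r"[a-z']+", training_text.lower()))


def segment(counts, text):
    """Split unspaced text into its lowest-cost sequence of words."""
    total = sum(counts.values())
    max_len = max(len(word) for word in counts)

    def cost(word):
        # Negative log-probability; unknown strings are penalised by length.
        if word in counts:
            return -math.log(counts[word] / total)
        return 10.0 * len(word) + math.log(total)

    # best[i] = (cost of the best split of text[:i], start of its last word)
    best = [(0.0, 0)]
    for end in range(1, len(text) + 1):
        best.append(min(
            (best[start][0] + cost(text[start:end]), start)
            for start in range(max(0, end - max_len), end)
        ))

    # Walk the back-pointers from the end to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words))


counts = train_counts("he left me and he vanished and i grew angry")
print(segment(counts, "hevanishedandheleftme"))
```

With this tiny training text the call prints `he vanished and he left me`. A realistic model needs far more training text, and a word that never appears in it will be split poorly.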

How to Install

pip3 install wordslicer

Features

  • Train your model: because the model is built from text you supply, the package can be used with any language for which you have training text.

  • Evaluate your model: check whether your training text is good enough for your input text.
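
How the package performs this evaluation is not documented in this description. One simple way to judge whether a training text suits an input is to measure vocabulary coverage on a correctly spaced sample of that input. The sketch below does this; `tokenize` and `coverage` are hypothetical helper names, not wordslicer functions, and the metric itself is an assumption.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(training_text, sample_text):
    """Return the fraction of sample words that occur in the training text."""
    vocabulary = set(tokenize(training_text))
    sample_words = tokenize(sample_text)
    if not sample_words:
        return 0.0
    known = sum(1 for word in sample_words if word in vocabulary)
    return known / len(sample_words)


score = coverage("he left me and he vanished", "He vanished and returned")
print(f"{score:.2f}")
```

Here three of the four sample words appear in the training text, so the score is 0.75. A low score suggests the training text should be extended with material closer to the input.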

Credits

This project was inspired by an answer by Generic Human on Stack Overflow (http://stackoverflow.com/a/11642687/2449774). Thank you!

Download files

Files for wordslicer, version 0.1.0:

  • wordslicer-0.1.0-py3-none-any.whl (4.3 kB), wheel, Python version py3
  • wordslicer-0.1.0.tar.gz (3.3 kB), source