An NLP preprocessing package

# purewords

Purewords is a package for cleaning raw text in any language.

## Install

`pip install purewords`


## Usage

### Module usage:

```python
import purewords

# raw sentence (adjacent string literals inside parentheses are concatenated)
inputs = (
    "ha hi!!hello I'm at http:www.google.com.tw\n\n"
    "you know yahoo? my_computer is great. My phone number "
    "is 02-3366-5678. <br>的啦<br> my password: 123-abc$99&^%Y)'_'(Y "
)
```
#### Treat inputs as a sentence and clean it.

Word tokens in the result are separated by whitespace.
```python
# result: a single cleaned string
purewords.clean_sentence(inputs)
# 'ha hi hello i am at _url_ you know yahoo my computer is great my phone number is _phone_ 的 my password _num_ abc _num_ y y'
```

#### Treat inputs as a document and clean it.

The document is split into sentences at reliable delimiters such as '.' or '?'.
```python
# result: a list of cleaned strings, one per sentence
purewords.clean_document(inputs)
# ['ha hi', 'hello i am at _url_', 'you know yahoo', 'my computer is great', 'my phone number is _phone_', '的 my password _num_ abc _num_ y y']
```
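
To illustrate the idea of splitting at delimiters, here is a minimal sketch that is independent of purewords. The function `split_on_delimiters` is a hypothetical helper, not the library's implementation, and it only handles the two delimiters named above; the real splitter evidently reacts to other marks as well (the sample output also breaks at `!!` and at blank lines).

```python
import re


def split_on_delimiters(document, delimiters=".?"):
    """Split a document at any of the delimiter characters and drop empty pieces."""
    # Build a character class such as [\.\?] from the delimiter string.
    pattern = "[" + re.escape(delimiters) + "]"
    pieces = re.split(pattern, document)
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_on_delimiters("you know yahoo? my computer is great. bye")
print(sentences)
```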

### Customize your purewords

You can configure purewords with different settings.

```python
import purewords
from purewords.tokenizer import YoctolTokenizer
from purewords.filter_collection import document_filters
from purewords.filter_collection import token_filters

tokenizer = YoctolTokenizer()
pw = purewords.PureWords(
    tokenizer=tokenizer,  # select your tokenizer
    document_filters=document_filters,  # select your document filters
    token_filters=token_filters,  # select your token filters
    max_len=200,  # cut sentences whose length exceeds max_len
    min_len=1,  # ignore sentences shorter than min_len
)

inputs = 'This is a sentence.'

pw.clean_sentence(inputs)
pw.clean_document(inputs)
```

#### Tokenizer

##### Select your tokenizer in purewords

Select `WhitespaceTokenizer` if you prefer to tokenize sentences on
whitespace, or `JiebaTokenizer` for the default jieba settings.

Otherwise, purewords uses the Yoctol jieba tokenizer by default.

```python
import purewords
from purewords.tokenizer import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
pw = purewords.PureWords(
    tokenizer=tokenizer
)
```

##### Add new words in JiebaTokenizer

You can add new words to `JiebaTokenizer` to customize your tokenizer.

```python
import purewords
from purewords.tokenizer import JiebaTokenizer

tokenizer = JiebaTokenizer()
# new_word, new_word_list, freq and tag are placeholders for your own values.
tokenizer.add_word(new_word, freq, tag)  # same arguments as jieba.add_word
tokenizer.add_words(new_word_list, freq, tag)

pw = purewords.PureWords(
    tokenizer=tokenizer
)
```

#### Filter collection

You can customize the preprocessing steps in purewords.

* document_filters: preprocess the raw sentence before sentence splitting
* token_filters: preprocess tokens after tokenization of each sentence

##### Organize your filters

You can build a customized set of filters by adding them to an instance of the filter collection class.

A filter is a callable object that receives a raw sentence and returns the processed one.

Filters are applied in the order in which they were added.

```python
import purewords
from purewords.filter_collection import BaseFilterCollection

# filter_1 ... filter_n are placeholders for your own callables.
custom_filters = BaseFilterCollection()
custom_filters.add(filter_1)
custom_filters.add(filter_2)
...
custom_filters.add(filter_n)

pw = purewords.PureWords(
    tokenizer=tokenizer,
    document_filters=custom_filters,
)
```
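
As a minimal sketch of what such filters might look like, the two functions below are plain callables that take a string and return a string. The names `strip_html_tags`, `collapse_whitespace` and `apply_in_order` are hypothetical and not part of purewords; the loop only mimics the documented behaviour that filters run in the order they were added.

```python
import re


def strip_html_tags(sentence):
    """Hypothetical filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", sentence)


def collapse_whitespace(sentence):
    """Hypothetical filter: squeeze runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", sentence).strip()


def apply_in_order(filters, sentence):
    """Stand-in for a filter collection: call each filter in the order given."""
    for current_filter in filters:
        sentence = current_filter(sentence)
    return sentence


cleaned = apply_in_order([strip_html_tags, collapse_whitespace], "hello <br>  world<br>")
print(cleaned)
```

Because each function maps a sentence to a sentence, callables of this shape should be suitable arguments for `custom_filters.add(...)`, though that has not been verified against the library here.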

#### Stopwords

You can add stopwords to `purewords/config/stopwords.txt`.


### Command line usage:

Preprocess text files into a single cleaned document from the command line.

#### Clean a single text file
```
python -m purewords input_file_path
```

#### Clean text files in a directory

Alternatively, use the following command to clean all the txt files in a directory.
```
python -m purewords -d your_raw_text_dir
```

#### Ignore short sentences

To keep only longer sentences and ignore those shorter than 5 words, use the `-min` option.
```
python -m purewords -min 5 your_text_file
```

#### Cut long sentences

To cut sentences longer than 30 words into shorter ones, set the maximum sentence length with the `-max` option.
```
python -m purewords -max 30 your_text_file
```
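
The effect of the two length options can be sketched as follows. This is an illustration under stated assumptions, not the purewords implementation: `enforce_length` is a hypothetical helper, length is counted in whitespace-separated words, long sentences are cut into consecutive chunks of at most `max_len` words, and the minimum is checked on each resulting chunk.

```python
def enforce_length(sentences, min_len=1, max_len=200):
    """Cut long sentences into chunks and drop chunks that are too short."""
    result = []
    for sentence in sentences:
        words = sentence.split()
        # Walk through the sentence in steps of max_len words.
        for start in range(0, len(words), max_len):
            chunk = words[start:start + max_len]
            if len(chunk) >= min_len:
                result.append(" ".join(chunk))
    return result


kept = enforce_length(["a b c d e f g", "x y"], min_len=3, max_len=3)
print(kept)
```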

#### Use multiple threads to speed up

You can also use multiple threads to speed up the cleaning process.

The following example cleans all the text files with 4 threads.
```
python -m purewords -j 4 -d your_raw_text_dir
```
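
How purewords distributes work across threads is not described here. As a rough sketch of the general pattern, the example below maps a cleaning function over several texts with a thread pool from the standard library. Both `clean_text` (a trivial stand-in cleaner) and `clean_all` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_text(text):
    """Stand-in cleaner (an assumption): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def clean_all(texts, jobs=4):
    """Clean every text using up to `jobs` worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(clean_text, texts))


results = clean_all(["Hello   World", "  Foo BAR "], jobs=2)
print(results)
```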

