A library for cleaning text data
Project description
Autoclean
Autoclean is a set of tools to clean textual data with minimal human intervention:
autoclean.filtering
uses unsupervised learning to filter out of domain sequences
autoclean.segmentation
uses unsupervised learning to segment joined up sequences
Filtering
The filtering package contains both an api and a cli.
An example use through the cli to filter out out-of-domain/anomalous data points:
- Download an English book from Project Gutenberg
(venv) $ wget https://www.gutenberg.org/cache/epub/29440/pg29440.txt
--2022-02-15 00:14:04-- https://www.gutenberg.org/cache/epub/29440/pg29440.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 890028 (869K) [text/plain]
Saving to: ‘pg29440.txt’
pg29440.txt 100%[==================================================================================================================>] 869.17K 730KB/s in 1.2s
2022-02-15 00:14:06 (730 KB/s) - ‘pg29440.txt’ saved [890028/890028]
- Run the filtering giving a high cost-threshold, so the output is just all the lines but ordered in decreasing order of cost:
(venv)$ time python -m autoclean.filtering.cli filter pg29440.txt filtered.pg20440.txt 10000
[2022-02-15 00:26:38][INFO][autoclean._estimate:178] > estimating [6]-gram from [pg29440.txt] ...
[2022-02-15 00:26:40][INFO][autoclean._estimate:182] > binarising [6]-gram from [/tmp/pg29440.txt.2022-02-15T00:26:38.2265355794898408859.6.arpa] ...
[2022-02-15 00:26:40][INFO][autoclean._estimate:178] > estimating [1]-gram from [pg29440.txt] ...
[2022-02-15 00:26:40][INFO][autoclean._estimate:182] > binarising [1]-gram from [/tmp/pg29440.txt.2022-02-15T00:26:40.2265355794898408859.1.arpa] ...
[2022-02-15 00:26:40][INFO][autoclean.filtering.filter_out: 30] > Reading from [pg29440.txt] and writing to [filtered.pg20440.txt] ...
[2022-02-15 00:26:40][INFO][autoclean.filtering.filter_out: 43] > Done.
real 0m3.201s
user 0m1.882s
sys 0m2.803s
- Check the top of the output file for standard English sentences:
(venv) $ head -n25 filtered.pg20440.txt
in some
of the country?
of the whole:--
according to the poet,
Election--Responsibility--Influence of the Church in such a
in the original.
character of the schoolmaster, and the right and duty of the
there of the influence which the personal character of the
their Church is the Church of the nation, and that it is they
true, and in the just and the true only.
stream of not half the volume of that in which the money of the
savageism of the other.
class with which the schoolmaster and the class with which the
between the feelings of the people and the anticipations of some of
the light of the flames.'
in any age of the Presbyterian Church, in one of the parish
According to the poet,
circumstances of Scotland in his days and of Scotland in the present
the State is the money of the people, and that the people have a right
is not according to the nature of things that the case of the tenant
their part of the scheme.'
which all recognition of the religious element on the part of the
simply those of the subject and the ratepayer.
character of our poor Highlanders, and of the influence of the bothie
themselves, and not the ministers of the Establishment, who are on
- Check the bottom of the output file for non-standard English text:
(venv) $ tail -n25 filtered.pg20440.txt
residence.'
periodicals.
consistency.
predecessor.
differently?
FINE-BODYISM.
ELIGIBLE.'{8}
Glencalvie:--
photographer.
resuscitated.
PERIODICALISM.
double-quote}
relinquished.
circumstances.
Protestantism.
significantly,
impossibility.
responsibility.
redistribution.
gbnewby@pglaf.org
kingdom.'--SISMONDI.
self-recommendation.
http://www.gutenberg.org
http://gutenberg.org/license).
http://www.gutenberg.org/2/9/4/4/29440/
The cli can also be used to evaluate the filtering algorithm with an in-domain and an out-of-domain text:
- Download French book from Project Gutenberg
(venv) $ wget https://www.gutenberg.org/cache/epub/46916/pg46916.txt
--2022-02-15 00:41:00-- https://www.gutenberg.org/cache/epub/46916/pg46916.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 120716 (118K) [text/plain]
Saving to: ‘pg46916.txt’
pg46916.txt 100%[==================================================================================================================>] 117.89K 347KB/s in 0.3s
2022-02-15 00:41:01 (347 KB/s) - ‘pg46916.txt’ saved [120716/120716]
- Run evaluation on the Downloaded English and French books, using the former as in-domain and the latter as out-of-domain
(venv) $ time python -m autoclean.filtering.cli eval pg29440.txt pg46916.txt
[2022-02-15 00:47:29][INFO][autoclean.filtering.evaluate: 62] > Reading from [pg29440.txt] and to [pg46916.txt] ...
[2022-02-15 00:47:29][INFO][autoclean._estimate:178] > estimating [6]-gram from [/tmp/pg29440.txt.pg46916.txt.2022-02-15T00:47:29.1382531328938857964.txt] ...
[2022-02-15 00:47:31][INFO][autoclean._estimate:182] > binarising [6]-gram from [/tmp/pg29440.txt.pg46916.txt.2022-02-15T00:47:29.1382531328938857964.txt.2022-02-15T00:47:29.6537006284435656167.6.arpa] ...
[2022-02-15 00:47:31][INFO][autoclean._estimate:178] > estimating [1]-gram from [/tmp/pg29440.txt.pg46916.txt.2022-02-15T00:47:29.1382531328938857964.txt] ...
[2022-02-15 00:47:31][INFO][autoclean._estimate:182] > binarising [1]-gram from [/tmp/pg29440.txt.pg46916.txt.2022-02-15T00:47:29.1382531328938857964.txt.2022-02-15T00:47:31.6537006284435656167.1.arpa] ...
[2022-02-15 00:47:31][INFO][autoclean.filtering.evaluate: 74] > Calculating 10%-cutoff rankings metrics ...
@100% precision [ 94.6832%] recall [ 94.6832%] fallout [ 38.4824%]
@ 90% precision [ 96.7801%] recall [ 87.1050%] fallout [ 20.9756%]
@ 80% precision [ 97.4352%] recall [ 77.9467%] fallout [ 14.8509%]
@ 70% precision [ 97.6359%] recall [ 68.3466%] fallout [ 11.9783%]
@ 60% precision [ 97.8532%] recall [ 58.7090%] fallout [ 9.3225%]
@ 50% precision [ 98.1878%] recall [ 49.0939%] fallout [ 6.5583%]
@ 40% precision [ 98.4650%] recall [ 39.3889%] fallout [ 4.4444%]
@ 30% precision [ 98.4024%] recall [ 29.5192%] fallout [ 3.4688%]
@ 20% precision [ 98.5399%] recall [ 19.7095%] fallout [ 2.1138%]
@ 10% precision [ 98.4270%] recall [ 9.8397%] fallout [ 1.1382%]
[2022-02-15 00:47:32][INFO][autoclean.filtering.evaluate:104] > Done.
real 0m3.400s
user 0m1.967s
sys 0m2.622s
Segmentation
Segmentation is based on the recursive dynamic-programming algorithm presented by Peter Norvig in chapter 14 of "Beautiful Data", by Seagaran and Hammerbacher.
It features an api and cli. An example usage through the cli:
- Download file to use as corpus (complete works of shakespeare)
(venv) $ wget http://norvig.com/ngrams/shakespeare.txt
--2022-02-15 12:47:22-- http://norvig.com/ngrams/shakespeare.txt
Resolving norvig.com (norvig.com)... 158.106.138.13
Connecting to norvig.com (norvig.com)|158.106.138.13|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4538523 (4.3M) [text/plain]
Saving to: ‘shakespeare.txt’
shakespeare.txt 100%[==================================================================================================================>] 4.33M 3.29MB/s in 1.3s
2022-02-15 12:47:24 (3.29 MB/s) - ‘shakespeare.txt’ saved [4538523/4538523]
- Create the file to be segmented by deleting spaces between words:
(venv) $ sed 's/\s//g' shakespeare.txt > joined.shakespeare.txt
- Run the segmentation with a sixgram language model:
(venv) $ time python -m autoclean.segmentation.cli seg --lm=5 -c shakespeare.txt joined.shakespeare.txt seg.shakespeare.txt
[2022-02-15 12:48:35][INFO][autoclean.segmentation.segment: 29] > Reading in corpus from [shakespeare.txt] ...
[2022-02-15 12:48:35][INFO][autoclean.segmentation.segment: 42] > Estimating [6]-ngram language model from [shakespeare.txt] ...
[2022-02-15 12:48:35][INFO][autoclean._estimate:178] > estimating [6]-gram from [/tmp/corpus.2022-02-15T12:48:35.6.txt] ...
[2022-02-15 12:48:44][INFO][autoclean._estimate:182] > binarising [6]-gram from [/tmp/corpus.2022-02-15T12:48:35.6.txt.2022-02-15T12:48:35.3998131699779813468.6.arpa] ...
[2022-02-15 12:48:45][INFO][autoclean.segmentation.segment: 45] > Segmenting and saving to [seg.shakespeare.txt] ...
[2022-02-15 12:48:45][INFO][autoclean.segmentation.segment: 49] > Reading in file to segment from [joined.shakespeare.txt] ...
[2022-02-15 12:52:56][INFO][autoclean.segmentation.segment: 59] > Done.
real 4m32.020s
user 4m30.212s
sys 0m2.493s
- Evaluate the segmentation using the original text as a gold standard:
(venv) $ time python -m autoclean.segmentation.cli eval shakespeare.txt seg.shakespeare.txt
[2022-02-15 12:55:32][INFO][autoclean.segmentation.evaluate: 70] > Reading segmented text from [shakespeare.txt] ...
[2022-02-15 12:55:32][INFO][autoclean.segmentation.evaluate: 71] > Reading gold standard text from [seg.shakespeare.txt] ...
[2022-02-15 12:55:32][INFO][autoclean.segmentation.evaluate: 84] > Accuracy: [99.80%], [128,844] correct out of [129,107] sequences
[2022-02-15 12:55:32][INFO][autoclean.segmentation.evaluate: 87] > Done.
real 0m0.624s
user 0m0.732s
sys 0m0.723s
Installation
As Autoclean is on Pypi and has direct dependencies like Pytorch and KenLM, and Pypi doesn't support them, they need to be installed separately and previous to installing Autoclean:
$ git clone git@github.com:JoseLlarena/autoclean.git
$ python3 virtualenv -p=3.8 venv
$ source venv/bin/activate
(venv) $ pip3 install -f https://download.pytorch.org/whl/cu113/torch_stable.html -r ./autoclean/requirements.txt
Then install Autoclean with pip:
(venv) $ pip3 install autoclean
The code requires python 3.8+ on and Nvidia GPU. The KenLM/SRILM binaries have been compiled against Linux x86_64 and may not work on other architectures.
Testing
The unit tests are written with pytest. Run with:
$ pip3 install pytest
$ pytest
Changelog
Check the Changelog for fixes and enhancements of each version.
License
Copyright Jose Llarena 2022.
Distributed under the terms of the MIT license, autoclean is free and open source software.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file autoclean-1.0.2.tar.gz
.
File metadata
- Download URL: autoclean-1.0.2.tar.gz
- Upload date:
- Size: 4.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1132f55f3a2e44cc9c44c65b28ccd66f5dbcafc8439bc0a7cf48af3e817233d3 |
|
MD5 | a2f4ccc569cde306a516267af8aa1b6c |
|
BLAKE2b-256 | 1940fb04691458ffd685b1e9ddefa637ccd61ff5eda456de34b90c98b764c49f |
File details
Details for the file autoclean-1.0.2-py3-none-any.whl
.
File metadata
- Download URL: autoclean-1.0.2-py3-none-any.whl
- Upload date:
- Size: 4.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8ad4410ace480c2329ade77009b915549832b1145db2fe590da00ef0d591b74d |
|
MD5 | 367519a765e3234fa5172c6d9470a659 |
|
BLAKE2b-256 | 0e120c986800b23f0f47935515612bdb37059368acc3922b8b63159d956b72bd |