Fast Brown Clustering
Easy-to-use and fast Brown clustering in Python.
Quick Start
Requirements and Installation
The project requires Python 3.7+, because method signatures and type hints are beautiful. If you do not have Python 3.7, install it first (here is how for Ubuntu 16.04). Then, in your favorite virtual environment, simply run:

```
pip install brown-clustering
```
Example Usage
Let's train a Brown clustering over a small example corpus. All you need to do is pass tokenized sentences to a BigramCorpus and train a BrownClustering on it:
```python
from brown_clustering import BigramCorpus, BrownClustering

# use some tokenized and preprocessed data
sentences = [
    ["This", "is", "an", "example"],
    ["This", "is", "another", "example"]
]

# create a corpus
corpus = BigramCorpus(sentences, alpha=0.5, min_count=0)

# (optional) print corpus statistics
corpus.print_stats()

# create a clustering with m=4 clusters
clustering = BrownClustering(corpus, m=4)

# train the clustering
clusters = clustering.train()
```
Done! We have trained a Brown clustering.
```python
# use the clustered words
print(clusters)
# [['This'], ['example'], ['is'], ['an', 'another']]

# get the binary codes for the words
print(clustering.codes())
# {'an': '110', 'another': '111', 'This': '00', 'example': '01', 'is': '10'}
```
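The codes place distributionally similar words under shared prefixes, so a common trick is to truncate them at several depths and use the prefixes as features, e.g. for NER or POS tagging. A minimal sketch (`prefix_features` is a hypothetical helper, not part of this library's API):

```python
codes = clustering.codes()

def prefix_features(word, lengths=(2, 4, 6)):
    # truncate the Brown code at several depths, as is common
    # in feature templates for sequence tagging
    code = codes.get(word, "")
    return [code[:n] for n in lengths if code]

print(prefix_features("another"))  # ['11', '111', '111']
```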
Algorithm
This repository is based on yangyuan/brown-clustering, a full Python implementation based on two papers: the original Brown clustering paper (Brown et al., 1992) and a follow-up describing the optimized algorithm.
The computational complexity is O(n(m² + n) + T), where T is the total token count, n is the unique token count, and m is the computation window size.
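For intuition, a rough back-of-the-envelope with made-up corpus sizes (assumed values for illustration, not a measurement):

```python
n = 50_000     # unique tokens (assumed)
m = 1_000      # clusters / window size (assumed)
T = 5_000_000  # total tokens (assumed)

ops = n * (m**2 + n) + T  # O(n(m² + n) + T)
print(f"~{ops:.2e} elementary operations")  # ~5.25e+10
```

Here the n·m² term dominates, which is why the window size m drives the runtime.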
Improvements over the original
- Allow filtering the vocabulary by a minimum word count.
- Implement a DefaultValueDict so that Laplace smoothing does not artificially explode RAM with entries for all non-occurring bigrams; instead, alpha is stored as the default value (see the sketch after this list).
- Use tqdm for a nice progress bar.
- Use Numba to speed up performance by JIT-compiling the hot loops to machine code and using parallelism.
- Mask unused rows and columns instead of reallocating all matrices all the time.
- Publish on PyPI.
- Proper CI/CD and testing.
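The DefaultValueDict mentioned above could look roughly like this minimal sketch (the actual implementation in the package may differ):

```python
class DefaultValueDict(dict):
    """A dict that returns a fixed default for missing keys
    without storing them (unlike collections.defaultdict,
    which inserts the default on first access)."""

    def __init__(self, default, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default = default

    def __missing__(self, key):
        # called by dict.__getitem__ for absent keys;
        # nothing is inserted, so memory stays bounded
        return self.default


# Laplace smoothing: every unseen bigram reads as alpha,
# but only observed bigrams occupy memory
alpha = 0.5
counts = DefaultValueDict(alpha)
counts[("This", "is")] = 2 + alpha
print(counts[("never", "seen")])  # 0.5
print(len(counts))                # 1
```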
Benchmarking
I benchmarked using the small_fraud_corpus_in.json as input, with m=1000 clusters, on a Lenovo Legion 7i (Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, 6 cores, 12 logical processors). Running the original code took me more than 16 hours; with the current optimizations it takes 6:51 minutes.
Other Brown Clustering libraries
repository | main language | installation | benchmark speed
---|---|---|---
brown-clustering | Python | clone & run | ~ 16:00:00
generalized-brown | C++ & Python | clone & make & run | n.a.
brown-cluster | C++ | clone & make & run | n.a.
this repository | Python | pip install & import | 00:06:51
If you know of any missing libraries, please create an issue or a pull request.