
Fast Brown Clustering

Easy-to-use and fast Brown clustering in Python.


Quick Start

Requirements and Installation

The project requires Python 3.7+, because method signatures and type hints are beautiful. If you do not have Python 3.7, install it first (here is how for Ubuntu 16.04). Then, in your favorite virtual environment, simply run:

pip install brown-clustering

Example Usage

Let's cluster the words of a small tokenized corpus. All you need to do is build a BigramCorpus from your sentences, create a BrownClustering on top of it, and train it:

from brown_clustering import BigramCorpus, BrownClustering

# use some tokenized and preprocessed data
sentences = [
    ["This", "is", "an", "example"],
    ["This", "is", "another", "example"]
]


# create a corpus
corpus = BigramCorpus(sentences, alpha=0.5, min_count=0)

# (optional) print corpus statistics:
corpus.print_stats()

# create a clustering
clustering = BrownClustering(corpus, m=4)

# train the clustering
clusters = clustering.train()

Done! We have trained a Brown clustering.

# use the clustered words
print(clusters)
# [['This'], ['example'], ['is'], ['an', 'another']]

# get codes for the words
print(clustering.codes())
# {'an': '110', 'another': '111', 'This': '00', 'example': '01', 'is': '10'}
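
The codes are paths in the binary merge tree: words whose codes share a longer prefix sit closer together in the hierarchy. As a small follow-up sketch (plain Python over the dict returned above; no additional library API is assumed), you can cut the hierarchy at a fixed prefix length to get coarser clusters:

from collections import defaultdict

def cut_hierarchy(codes, k):
    # Group words by the first k bits of their Brown code;
    # shorter prefixes correspond to coarser clusters.
    groups = defaultdict(list)
    for word, code in codes.items():
        groups[code[:k]].append(word)
    return dict(groups)

print(cut_hierarchy(clustering.codes(), 1))
# with the codes shown above, this prints:
# {'1': ['an', 'another', 'is'], '0': ['This', 'example']}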

Algorithm

This repository is based on yangyuan/brown-clustering, a full Python implementation based on two papers.

The computational complexity is O(n(m² + n) + T), where T is the total token count, n is the unique token count, and m is the computation window size.
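
Purely as an illustration of how that bound scales (the corpus sizes below are made up, not measurements), you can plug hypothetical values into the formula:

def brown_ops(n, m, T):
    # Work implied by the O(n * (m**2 + n) + T) bound, ignoring constants.
    return n * (m ** 2 + n) + T

for n in (10_000, 100_000):
    print(f"n={n:,}: ~{brown_ops(n, m=1000, T=1_000_000):.2e} ops")
# n=10,000:  ~1.01e+10 ops
# n=100,000: ~1.10e+11 ops

Note that for a fixed m the quadratic n² term starts to dominate once the vocabulary grows past m² unique tokens.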

Improvements over the original

  • Allow filtering the vocabulary by a minimum word count.
  • Implement a DefaultValueDict so that Laplace smoothing does not artificially blow up RAM by storing all non-existing bigrams; instead, alpha is stored as the default value (a minimal sketch follows this list).
  • Use tqdm for a nice progress bar.
  • Use Numba to speed up the hot loops by compiling them to machine code and running them in parallel.
  • Mask unused rows and columns instead of reallocating all matrices all the time.
  • Published on PyPI.
  • Proper CI/CD and testing.
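
A minimal sketch of the DefaultValueDict idea, assuming only standard Python (the actual class in this library may differ in detail): unlike collections.defaultdict, a lookup of a missing key returns alpha without inserting it, so smoothing never materializes unseen bigrams:

class DefaultValueDict(dict):
    """dict returning a fixed default for missing keys without storing them."""

    def __init__(self, default, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default = default

    def __missing__(self, key):
        # Unlike collections.defaultdict, the key is NOT inserted,
        # so unseen bigrams never take up RAM.
        return self.default

counts = DefaultValueDict(0.5)   # alpha = 0.5 smoothing mass
counts[("this", "is")] = 3.5     # observed bigram: count + alpha
print(counts[("is", "an")])      # 0.5, and len(counts) is still 1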

Benchmarking

I benchmarked using the small_fraud_corpus_in.json file as input and m=1000 clusters, on a Lenovo Legion 7i with an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz (6 cores, 12 logical processors).

Running the original code took more than 16 hours; with the current optimizations it takes 6:51 minutes, roughly a 140x speedup.

Other Brown Clustering libraries

| repository | main language | installation | benchmark speed |
| --- | --- | --- | --- |
| brown-clustering | Python | clone & run | ~ 16:00:00 |
| generalized-brown | C++ & Python | clone & make & run | n.a. |
| brown-cluster | C++ | clone & make & run | n.a. |
| this project | Python | pip install & import | 00:06:51 |

If you know of any missing libraries, please create an issue or a pull request.
