
Linear-chain conditional random fields for natural language processing

Project description

Chaine


Chaine is a modern, fast and lightweight Python library implementing linear-chain conditional random fields (CRF). Use it for sequence labeling tasks like named entity recognition or part-of-speech tagging.

The main goals of this project are:

  • Usability: Designed with special focus on usability and a beautiful high-level API.
  • Efficiency: Performance-critical parts are written in C and thus blazingly fast. Loading a model from disk and retrieving feature weights for inference is optimized for both speed and memory.
  • Persistency: No pickle or joblib is used for serialization. A trained model will be compatible with all versions for eternity, because the underlying C library will not change. I promise.
  • Compatibility: There are wheels for Linux, macOS and Windows. No compiler needed.
  • Minimalism: No code bloat, no external dependencies.

Install the latest stable version from PyPI:

pip install chaine

Algorithms

You can train models using several optimization algorithms, including L-BFGS (lbfgs) and L2-regularized stochastic gradient descent (l2sgd).

Please refer to the paper by Lafferty et al. for a general introduction to conditional random fields. Other helpful sources are Chapter 8.5 of Jurafsky and Martin's book "Speech and Language Processing" as well as various blog posts and videos (e.g., by ritvikmath).
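Following the standard formulation from Lafferty et al., a linear-chain CRF models the conditional probability of a label sequence y given an observation sequence x as:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)
```

where the f_k are feature functions over adjacent labels and the observations, the λ_k are the weights learned during training, and Z(x) normalizes over all possible label sequences.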

Usage

Training and using a conditional random field for inference is as easy as:

>>> import chaine
>>> tokens = [[{"index": 0, "text": "John"}, {"index": 1, "text": "Lennon"}]]
>>> labels = [["B-PER", "I-PER"]]
>>> model = chaine.train(tokens, labels)
>>> model.predict(tokens)
[['B-PER', 'I-PER']]

You can control verbosity with the verbose argument: 0 sets the log level to ERROR, 1 to INFO (the default) and 2 to DEBUG.

Features

One token in a sequence is represented as a dictionary with feature names as keys and values of type string, integer, float or boolean:

{
    "text": "John",
    "num_characters": 4,
    "relative_index": 0.0,
    "is_number": False,
}

One sequence is represented as a list of feature dictionaries:

[
    {"text": "John", "num_characters": 4}, 
    {"text": "Lennon", "num_characters": 6}
]

One data set is represented as an iterable of sequences, i.e. of lists of feature dictionaries:

[
    [
        {"text": "John", "num_characters": 4}, 
        {"text": "Lennon", "num_characters": 6}
    ],
    [
        {"text": "Paul", "num_characters": 4}, 
        {"text": "McCartney", "num_characters": 9}
    ],
    ...
]

This is the expected input format for training. For inference, you can also process a single sequence rather than a batch of multiple sequences.
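As a sketch, a function producing feature dictionaries in this format might look like the following (extract_features is a hypothetical helper, not part of chaine):

```python
def extract_features(index: int, text: str) -> dict:
    """Map a raw token to a feature dictionary of str/int/float/bool values."""
    return {
        "text": text,
        "num_characters": len(text),
        "relative_index": float(index),
        "is_number": text.isdigit(),
    }

extract_features(0, "John")
# {'text': 'John', 'num_characters': 4, 'relative_index': 0.0, 'is_number': False}
```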

Generators

Depending on the size of your data set, it probably makes sense to use generators. Something like this would be totally fine for both training and inference:

([extract_features(token) for token in tokens] for tokens in dataset)

Assuming dataset is a generator as well, only one sequence is loaded into memory at a time.
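A minimal sketch of such a lazy pipeline, with hypothetical read_sequences and extract_features helpers standing in for your own data loading and feature extraction:

```python
def extract_features(token: str) -> dict:
    # minimal example features; a real pipeline would use richer ones
    return {"text": token, "num_characters": len(token)}

def read_sequences(lines):
    # yield one tokenized sequence at a time instead of materializing all of them
    for line in lines:
        yield line.split()

raw = ["John Lennon", "Paul McCartney"]
dataset = ([extract_features(token) for token in tokens] for tokens in read_sequences(raw))
first = next(dataset)  # only one sequence has been built in memory at this point
```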

Training

You can either use the high-level function to train a model (which also loads and returns it):

>>> import chaine
>>> chaine.train(tokens, labels)

or the lower-level Trainer class:

>>> from chaine import Trainer
>>> trainer = Trainer()

A Trainer object has a method train() to learn states and transitions from the given data set. You have to provide a filepath to serialize the model to:

>>> trainer.train(tokens, labels, model_filepath="model.chaine")

Hyperparameters

Before training a model, you might want to find out the ideal hyperparameters first. You can just set the respective argument to True:

>>> import chaine
>>> model = chaine.train(tokens, labels, optimize_hyperparameters=True)

This might be very memory- and time-consuming, because 5-fold cross validation is performed for each of 10 trials per algorithm.

Alternatively, use the HyperparameterOptimizer class to get more control over the optimization process:

>>> from chaine import HyperparameterOptimizer
>>> from chaine.optimization import L2SGDSearchSpace
>>> optimizer = HyperparameterOptimizer(trials=50, folds=3, spaces=[L2SGDSearchSpace()])
>>> optimizer.optimize_hyperparameters(tokens, labels, sample_size=1000)

This will run 50 trials with 3-fold cross validation for the Stochastic Gradient Descent algorithm and return a sorted list of hyperparameters with evaluation stats. The given data set is downsampled to 1000 instances.

Example of a hyperparameter optimization report
[
    {
        "hyperparameters": {
            "algorithm": "lbfgs",
            "min_freq": 0,
            "all_possible_states": true,
            "all_possible_transitions": true,
            "num_memories": 8,
            "c1": 0.9,
            "c2": 0.31,
            "epsilon": 0.00011,
            "period": 17,
            "delta": 0.00051,
            "linesearch": "Backtracking",
            "max_linesearch": 31
        },
        "stats": {
            "mean_precision": 0.4490952380952381,
            "stdev_precision": 0.16497993418839532,
            "mean_recall": 0.4554858934169279,
            "stdev_recall": 0.20082402876210334,
            "mean_f1": 0.45041435392087253,
            "stdev_f1": 0.17914435056760908,
            "mean_time": 0.3920876979827881,
            "stdev_time": 0.0390961164333519
        }
    },
    {
        "hyperparameters": {
            "algorithm": "lbfgs",
            "min_freq": 5,
            "all_possible_states": true,
            "all_possible_transitions": false,
            "num_memories": 9,
            "c1": 1.74,
            "c2": 0.09,
            "epsilon": 0.0008600000000000001,
            "period": 1,
            "delta": 0.00045000000000000004,
            "linesearch": "StrongBacktracking",
            "max_linesearch": 34
        },
        "stats": {
            "mean_precision": 0.4344436335328176,
            "stdev_precision": 0.15542689556199216,
            "mean_recall": 0.4385174258109041,
            "stdev_recall": 0.19873733310765845,
            "mean_f1": 0.43386496201052716,
            "stdev_f1": 0.17225578421967264,
            "mean_time": 0.12209572792053222,
            "stdev_time": 0.0236177196325414
        }
    },
    {
        "hyperparameters": {
            "algorithm": "lbfgs",
            "min_freq": 2,
            "all_possible_states": true,
            "all_possible_transitions": true,
            "num_memories": 1,
            "c1": 0.91,
            "c2": 0.4,
            "epsilon": 0.0008400000000000001,
            "period": 13,
            "delta": 0.00018,
            "linesearch": "MoreThuente",
            "max_linesearch": 43
        },
        "stats": {
            "mean_precision": 0.41963433149859447,
            "stdev_precision": 0.16363544501259455,
            "mean_recall": 0.4331173486012196,
            "stdev_recall": 0.21344965207006913,
            "mean_f1": 0.422038027332145,
            "stdev_f1": 0.18245844823319127,
            "mean_time": 0.2586916446685791,
            "stdev_time": 0.04341208573100539
        }
    },
    {
        "hyperparameters": {
            "algorithm": "l2sgd",
            "min_freq": 5,
            "all_possible_states": true,
            "all_possible_transitions": true,
            "c2": 1.68,
            "period": 2,
            "delta": 0.00047000000000000004,
            "calibration_eta": 0.0006900000000000001,
            "calibration_rate": 2.9000000000000004,
            "calibration_samples": 1400,
            "calibration_candidates": 25,
            "calibration_max_trials": 23
        },
        "stats": {
            "mean_precision": 0.2571428571428571,
            "stdev_precision": 0.43330716823151716,
            "mean_recall": 0.01,
            "stdev_recall": 0.022360679774997897,
            "mean_f1": 0.01702127659574468,
            "stdev_f1": 0.038060731531911314,
            "mean_time": 0.15442829132080077,
            "stdev_time": 0.051750737506044905
        }
    }
]
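Since the report is plain JSON-style data, picking a configuration programmatically is straightforward. A sketch over a truncated, hypothetical version of the report above:

```python
# truncated version of a hyperparameter optimization report
report = [
    {"hyperparameters": {"algorithm": "lbfgs", "c1": 0.9, "c2": 0.31}, "stats": {"mean_f1": 0.450}},
    {"hyperparameters": {"algorithm": "l2sgd", "c2": 1.68}, "stats": {"mean_f1": 0.017}},
]

# the returned list is already sorted, but an explicit selection is robust either way
best = max(report, key=lambda trial: trial["stats"]["mean_f1"])
best["hyperparameters"]  # {'algorithm': 'lbfgs', 'c1': 0.9, 'c2': 0.31}
```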

Inference

The high-level function chaine.train() returns a Model object. You can load an already trained model from disk by initializing a Model object with the model's filepath:

>>> from chaine import Model
>>> model = Model("model.chaine")

You can predict labels for a batch of sequences:

>>> tokens = [
...     [{"index": 0, "text": "John"}, {"index": 1, "text": "Lennon"}],
...     [{"index": 0, "text": "Paul"}, {"index": 1, "text": "McCartney"}],
...     [{"index": 0, "text": "George"}, {"index": 1, "text": "Harrison"}],
...     [{"index": 0, "text": "Ringo"}, {"index": 1, "text": "Starr"}]
... ]
>>> model.predict(tokens)
[['B-PER', 'I-PER'], ['B-PER', 'I-PER'], ['B-PER', 'I-PER'], ['B-PER', 'I-PER']]

or only for a single sequence:

>>> model.predict_single(tokens[0])
['B-PER', 'I-PER']

If you are interested in the model's probability distribution for a given sequence, use predict_proba_single():

>>> model.predict_proba_single(tokens[0])
[{'B-PER': 0.99, 'I-PER': 0.01}, {'B-PER': 0.01, 'I-PER': 0.99}]

Use the model.predict_proba() method for a batch of sequences.
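If you need hard labels from these probability distributions, taking the argmax per token reproduces the predicted sequence. A plain-Python sketch over the output format shown above (argmax_labels is a hypothetical helper):

```python
def argmax_labels(distributions):
    # pick the highest-probability label for each token
    return [max(dist, key=dist.get) for dist in distributions]

# distributions in the shape returned for one sequence
distributions = [{"B-PER": 0.99, "I-PER": 0.01}, {"B-PER": 0.01, "I-PER": 0.99}]
argmax_labels(distributions)  # ['B-PER', 'I-PER']
```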

Weights

After loading a trained model, you can inspect the learned transition and state weights:

>>> model = Model("model.chaine")
>>> model.transitions
[{'from': 'B-PER', 'to': 'I-PER', 'weight': 1.430506540616852e-06}]
>>> model.states
[{'feature': 'text:John', 'label': 'B-PER', 'weight': 9.536710877105517e-07}, ...]

You can also dump both transition and state weights as JSON:

>>> model.dump_states("states.json")
>>> model.dump_transitions("transitions.json")
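Because both dumps are plain JSON, you can inspect them with standard tooling. For example, listing the strongest transitions (the data below is hypothetical, in the shape shown above; with a real dump you would json.load() the file instead):

```python
import json

# hypothetical data in the shape produced by dump_transitions()
transitions = [
    {"from": "B-PER", "to": "I-PER", "weight": 1.43e-06},
    {"from": "I-PER", "to": "B-PER", "weight": 5.2e-07},
]
# with a real dump: transitions = json.load(open("transitions.json"))

strongest = sorted(transitions, key=lambda t: abs(t["weight"]), reverse=True)
print(strongest[0]["from"], "->", strongest[0]["to"])  # B-PER -> I-PER
```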

Credits

This project makes use of and is partially based on:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

chaine-3.11.0-cp311-cp311-win_amd64.whl (370.9 kB view details)

Uploaded: CPython 3.11, Windows x86-64

chaine-3.11.0-cp311-cp311-win32.whl (352.6 kB view details)

Uploaded: CPython 3.11, Windows x86

chaine-3.11.0-cp311-cp311-musllinux_1_1_x86_64.whl (1.7 MB view details)

Uploaded: CPython 3.11, musllinux: musl 1.1+ x86-64

chaine-3.11.0-cp311-cp311-musllinux_1_1_i686.whl (1.7 MB view details)

Uploaded: CPython 3.11, musllinux: musl 1.1+ i686

chaine-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB view details)

Uploaded: CPython 3.11, manylinux: glibc 2.17+ x86-64

chaine-3.11.0-cp311-cp311-manylinux_2_17_i686.manylinux_2_12_i686.manylinux2010_i686.manylinux2014_i686.whl (1.1 MB view details)

Uploaded: CPython 3.11, manylinux: glibc 2.12+ i686, manylinux: glibc 2.17+ i686

chaine-3.11.0-cp311-cp311-macosx_12_0_x86_64.whl (406.7 kB view details)

Uploaded: CPython 3.11, macOS 12.0+ x86-64

File details

Details for the file chaine-3.11.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: chaine-3.11.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 370.9 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Windows/10

File hashes

Hashes for chaine-3.11.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 8de49057a1c02d21c8bcc8d12fac5a49b440bae220f1ed2ae1568dadc899a6a1
MD5 f65e33d9b058c9f0b3dac2f276e3e0dd
BLAKE2b-256 39527dfa021178113a570d93644574290a9c512e4437204457e567094d8c38ed

See more details on using hashes here.

File details

Details for the file chaine-3.11.0-cp311-cp311-win32.whl.

File metadata

  • Download URL: chaine-3.11.0-cp311-cp311-win32.whl
  • Upload date:
  • Size: 352.6 kB
  • Tags: CPython 3.11, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Windows/10

File hashes

Hashes for chaine-3.11.0-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 b4ec11c3a06c56039e1c41c57cf4ac1225a8426628870a9b057c399d947e4715
MD5 ec644ebe2f23ee498ac8c8e437433202
BLAKE2b-256 bbca1dcee42e124db12fc36f5ce377001511e30ba98ac142552f7647c52bfffd


File details

Details for the file chaine-3.11.0-cp311-cp311-musllinux_1_1_x86_64.whl.

File metadata

  • Download URL: chaine-3.11.0-cp311-cp311-musllinux_1_1_x86_64.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.11, musllinux: musl 1.1+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Linux/5.15.0-1037-azure

File hashes

Hashes for chaine-3.11.0-cp311-cp311-musllinux_1_1_x86_64.whl
Algorithm Hash digest
SHA256 7e877aa9fe0971358a6e72537ff642b42203c917cf888b09abc0afdb20d507ae
MD5 93983f0c7868d6ce2c482c514e378989
BLAKE2b-256 e10cdb8c9d37bb15531d93e371c1e15fd0a1ca100e5da924628f67df2721ad15


File details

Details for the file chaine-3.11.0-cp311-cp311-musllinux_1_1_i686.whl.

File metadata

  • Download URL: chaine-3.11.0-cp311-cp311-musllinux_1_1_i686.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.11, musllinux: musl 1.1+ i686
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Linux/5.15.0-1037-azure

File hashes

Hashes for chaine-3.11.0-cp311-cp311-musllinux_1_1_i686.whl
Algorithm Hash digest
SHA256 2fdd2de99efe52b0e92e79a7ef6489d064d1a35fbbc1a5e087a9317c39f5465b
MD5 d0c0fbf86a6c19869c76003ce7d751c0
BLAKE2b-256 b06c5f54247484ab912b8fbbca10d42be1e6bb27d855951e6372d57b39977968


File details

Details for the file chaine-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for chaine-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c5e7953e9419ab43730a527b9c26460472ff22445681bccbc492948001df4434
MD5 b39f3b75fbe191a1487d7451ec2f8d16
BLAKE2b-256 10fa7d42195ebf49dd9e8f29cf4fabd9618f071f2f18faa7487898b70c687379


File details

Details for the file chaine-3.11.0-cp311-cp311-manylinux_2_17_i686.manylinux_2_12_i686.manylinux2010_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for chaine-3.11.0-cp311-cp311-manylinux_2_17_i686.manylinux_2_12_i686.manylinux2010_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 a800e66ae04bdc7e3f8c67bb1c14e47b9387208fbef44aec19c432dedb742923
MD5 72c405371f91a4a4dfe2ff0e061562cf
BLAKE2b-256 ee6782d1914820def462405140a3656e7fcb796daa976f871454f93a683f5ad9


File details

Details for the file chaine-3.11.0-cp311-cp311-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for chaine-3.11.0-cp311-cp311-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 6f4566166381798f092047283fe07afee47b7298520bf3c5856f2f95d986aaa9
MD5 ac54aaaabe3946d2a72b46be4772184e
BLAKE2b-256 763dced93e71dddaccbb3d1875254ba8d003a25bad9ad81a957fbdfcf248d230

