
Linear-chain conditional random fields for natural language processing

Project description

Chaine


Chaine is a modern, fast and lightweight Python library implementing linear-chain conditional random fields (CRF). Use it for sequence labeling tasks like named entity recognition or part-of-speech tagging.

The main goals of this project are:

  • Usability: Designed with special focus on usability and a beautiful high-level API.
  • Efficiency: Performance-critical parts are written in C and thus blazingly fast. Loading a model from disk and retrieving feature weights for inference is optimized for both speed and memory.
  • Persistency: No pickle or joblib is used for serialization. A trained model will be compatible with all versions for eternity, because the underlying C library will not change. I promise.
  • Compatibility: There are wheels for Linux, macOS and Windows. No compiler needed.
  • Minimalism: No code bloat, no external dependencies.

Install the latest stable version from PyPI:

pip install chaine

Algorithms

You can train models using several optimization algorithms, including L-BFGS (lbfgs) and L2-regularized stochastic gradient descent (l2sgd).

Please refer to the paper by Lafferty et al. for a general introduction to conditional random fields. Other helpful sources are Chapter 8.5 of Jurafsky and Martin's book "Speech and Language Processing" as well as various blog posts and videos (e.g., by ritvikmath).
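Following the standard formulation from Lafferty et al., a linear-chain CRF models the conditional probability of a label sequence y given an observation sequence x as:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)
```

where the f_k are feature functions over adjacent labels and the observations, the λ_k are the weights learned during training, and Z(x) normalizes over all possible label sequences.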

Usage

Training and using a conditional random field for inference is as easy as:

>>> import chaine
>>> tokens = [[{"index": 0, "text": "John"}, {"index": 1, "text": "Lennon"}]]
>>> labels = [["B-PER", "I-PER"]]
>>> model = chaine.train(tokens, labels)
>>> model.predict(tokens)
[['B-PER', 'I-PER']]

You can control verbosity with the verbose argument: 0 sets the log level to ERROR, 1 to INFO (the default) and 2 to DEBUG.

Features

One token in a sequence is represented as a dictionary with feature names as keys and values of type string, integer, float or boolean:

{
    "text": "John",
    "num_characters": 4,
    "relative_index": 0.0,
    "is_number": False,
}

One sequence is represented as a list of feature dictionaries:

[
    {"text": "John", "num_characters": 4}, 
    {"text": "Lennon", "num_characters": 6}
]

One data set is represented as an iterable of sequences, i.e. of lists of feature dictionaries:

[
    [
        {"text": "John", "num_characters": 4}, 
        {"text": "Lennon", "num_characters": 6}
    ],
    [
        {"text": "Paul", "num_characters": 4}, 
        {"text": "McCartney", "num_characters": 9}
    ],
    ...
]

This is the expected input format for training. For inference, you can also process a single sequence rather than a batch of multiple sequences.
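As a sketch, a function producing feature dictionaries in this format might look like the following (extract_features is a hypothetical helper, not part of chaine):

```python
def extract_features(index: int, text: str) -> dict:
    """Map a raw token to a feature dictionary of str/int/float/bool values."""
    return {
        "text": text,
        "num_characters": len(text),
        "relative_index": float(index),
        "is_number": text.isdigit(),
    }

extract_features(0, "John")
# {'text': 'John', 'num_characters': 4, 'relative_index': 0.0, 'is_number': False}
```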

Generators

Depending on the size of your data set, it probably makes sense to use generators. Something like this would be totally fine for both training and inference:

([extract_features(token) for token in tokens] for tokens in dataset)

Assuming dataset is a generator as well, only one sequence is loaded into memory at a time.
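A minimal sketch of such a lazy pipeline, with hypothetical read_sequences and extract_features helpers standing in for your own data loading and feature extraction:

```python
def extract_features(token: str) -> dict:
    # minimal example features; a real pipeline would use richer ones
    return {"text": token, "num_characters": len(token)}

def read_sequences(lines):
    # yield one tokenized sequence at a time instead of materializing all of them
    for line in lines:
        yield line.split()

raw = ["John Lennon", "Paul McCartney"]
dataset = ([extract_features(token) for token in tokens] for tokens in read_sequences(raw))
first = next(dataset)  # only one sequence has been built in memory at this point
```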

Training

You can either use the high-level function to train a model (which also loads and returns it):

>>> import chaine
>>> chaine.train(tokens, labels)

or the lower-level Trainer class:

>>> from chaine import Trainer
>>> trainer = Trainer()

A Trainer object has a method train() to learn states and transitions from the given data set. You have to provide a filepath to serialize the model to:

>>> trainer.train(tokens, labels, model_filepath="model.chaine")

Hyperparameters

Before training a model, you might want to find out the ideal hyperparameters first. You can just set the respective argument to True:

>>> import chaine
>>> model = chaine.train(tokens, labels, optimize_hyperparameters=True)

This might be very memory- and time-consuming, because 5-fold cross validation is performed for each of 10 trials per algorithm.

Alternatively, use the HyperparameterOptimizer class to get more control over the optimization process:

>>> from chaine import HyperparameterOptimizer
>>> from chaine.optimization import L2SGDSearchSpace
>>> optimizer = HyperparameterOptimizer(trials=50, folds=3, spaces=[L2SGDSearchSpace()])
>>> optimizer.optimize_hyperparameters(tokens, labels, sample_size=1000)

This will run 50 trials with 3-fold cross validation for the Stochastic Gradient Descent algorithm and return a sorted list of hyperparameters with evaluation stats. The given data set is downsampled to 1000 instances.

Example of a hyperparameter optimization report
[
    {
        "hyperparameters": {
            "algorithm": "lbfgs",
            "min_freq": 0,
            "all_possible_states": true,
            "all_possible_transitions": true,
            "num_memories": 8,
            "c1": 0.9,
            "c2": 0.31,
            "epsilon": 0.00011,
            "period": 17,
            "delta": 0.00051,
            "linesearch": "Backtracking",
            "max_linesearch": 31
        },
        "stats": {
            "mean_precision": 0.4490952380952381,
            "stdev_precision": 0.16497993418839532,
            "mean_recall": 0.4554858934169279,
            "stdev_recall": 0.20082402876210334,
            "mean_f1": 0.45041435392087253,
            "stdev_f1": 0.17914435056760908,
            "mean_time": 0.3920876979827881,
            "stdev_time": 0.0390961164333519
        }
    },
    {
        "hyperparameters": {
            "algorithm": "lbfgs",
            "min_freq": 5,
            "all_possible_states": true,
            "all_possible_transitions": false,
            "num_memories": 9,
            "c1": 1.74,
            "c2": 0.09,
            "epsilon": 0.0008600000000000001,
            "period": 1,
            "delta": 0.00045000000000000004,
            "linesearch": "StrongBacktracking",
            "max_linesearch": 34
        },
        "stats": {
            "mean_precision": 0.4344436335328176,
            "stdev_precision": 0.15542689556199216,
            "mean_recall": 0.4385174258109041,
            "stdev_recall": 0.19873733310765845,
            "mean_f1": 0.43386496201052716,
            "stdev_f1": 0.17225578421967264,
            "mean_time": 0.12209572792053222,
            "stdev_time": 0.0236177196325414
        }
    },
    {
        "hyperparameters": {
            "algorithm": "lbfgs",
            "min_freq": 2,
            "all_possible_states": true,
            "all_possible_transitions": true,
            "num_memories": 1,
            "c1": 0.91,
            "c2": 0.4,
            "epsilon": 0.0008400000000000001,
            "period": 13,
            "delta": 0.00018,
            "linesearch": "MoreThuente",
            "max_linesearch": 43
        },
        "stats": {
            "mean_precision": 0.41963433149859447,
            "stdev_precision": 0.16363544501259455,
            "mean_recall": 0.4331173486012196,
            "stdev_recall": 0.21344965207006913,
            "mean_f1": 0.422038027332145,
            "stdev_f1": 0.18245844823319127,
            "mean_time": 0.2586916446685791,
            "stdev_time": 0.04341208573100539
        }
    },
    {
        "hyperparameters": {
            "algorithm": "l2sgd",
            "min_freq": 5,
            "all_possible_states": true,
            "all_possible_transitions": true,
            "c2": 1.68,
            "period": 2,
            "delta": 0.00047000000000000004,
            "calibration_eta": 0.0006900000000000001,
            "calibration_rate": 2.9000000000000004,
            "calibration_samples": 1400,
            "calibration_candidates": 25,
            "calibration_max_trials": 23
        },
        "stats": {
            "mean_precision": 0.2571428571428571,
            "stdev_precision": 0.43330716823151716,
            "mean_recall": 0.01,
            "stdev_recall": 0.022360679774997897,
            "mean_f1": 0.01702127659574468,
            "stdev_f1": 0.038060731531911314,
            "mean_time": 0.15442829132080077,
            "stdev_time": 0.051750737506044905
        }
    }
]
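Since the report is plain JSON-style data, picking a configuration programmatically is straightforward. A sketch over a truncated, hypothetical version of the report above:

```python
# truncated version of a hyperparameter optimization report
report = [
    {"hyperparameters": {"algorithm": "lbfgs", "c1": 0.9, "c2": 0.31}, "stats": {"mean_f1": 0.450}},
    {"hyperparameters": {"algorithm": "l2sgd", "c2": 1.68}, "stats": {"mean_f1": 0.017}},
]

# the returned list is already sorted, but an explicit selection is robust either way
best = max(report, key=lambda trial: trial["stats"]["mean_f1"])
best["hyperparameters"]  # {'algorithm': 'lbfgs', 'c1': 0.9, 'c2': 0.31}
```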

Inference

The high-level function chaine.train() returns a Model object. You can load an already trained model from disk by initializing a Model object with the model's filepath:

>>> from chaine import Model
>>> model = Model("model.chaine")

You can predict labels for a batch of sequences:

>>> tokens = [
...     [{"index": 0, "text": "John"}, {"index": 1, "text": "Lennon"}],
...     [{"index": 0, "text": "Paul"}, {"index": 1, "text": "McCartney"}],
...     [{"index": 0, "text": "George"}, {"index": 1, "text": "Harrison"}],
...     [{"index": 0, "text": "Ringo"}, {"index": 1, "text": "Starr"}]
... ]
>>> model.predict(tokens)
[['B-PER', 'I-PER'], ['B-PER', 'I-PER'], ['B-PER', 'I-PER'], ['B-PER', 'I-PER']]

or only for a single sequence:

>>> model.predict_single(tokens[0])
['B-PER', 'I-PER']

If you are interested in the model's probability distribution for a given sequence, use predict_proba_single():

>>> model.predict_proba_single(tokens[0])
[{'B-PER': 0.99, 'I-PER': 0.01}, {'B-PER': 0.01, 'I-PER': 0.99}]

Use the model.predict_proba() method for a batch of sequences.
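If you need hard labels from these probability distributions, taking the argmax per token reproduces the predicted sequence. A plain-Python sketch over the output format shown above (argmax_labels is a hypothetical helper):

```python
def argmax_labels(distributions):
    # pick the highest-probability label for each token
    return [max(dist, key=dist.get) for dist in distributions]

# distributions in the shape returned for one sequence
distributions = [{"B-PER": 0.99, "I-PER": 0.01}, {"B-PER": 0.01, "I-PER": 0.99}]
argmax_labels(distributions)  # ['B-PER', 'I-PER']
```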

Weights

After loading a trained model, you can inspect the learned transition and state weights:

>>> model = Model("model.chaine")
>>> model.transitions
[{'from': 'B-PER', 'to': 'I-PER', 'weight': 1.430506540616852e-06}]
>>> model.states
[{'feature': 'text:John', 'label': 'B-PER', 'weight': 9.536710877105517e-07}, ...]

You can also dump both transition and state weights as JSON:

>>> model.dump_states("states.json")
>>> model.dump_transitions("transitions.json")
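Because both dumps are plain JSON, you can inspect them with standard tooling. For example, listing the strongest transitions (the data below is hypothetical, in the shape shown above; with a real dump you would json.load() the file instead):

```python
import json

# hypothetical data in the shape produced by dump_transitions()
transitions = [
    {"from": "B-PER", "to": "I-PER", "weight": 1.43e-06},
    {"from": "I-PER", "to": "B-PER", "weight": 5.2e-07},
]
# with a real dump: transitions = json.load(open("transitions.json"))

strongest = sorted(transitions, key=lambda t: abs(t["weight"]), reverse=True)
print(strongest[0]["from"], "->", strongest[0]["to"])  # B-PER -> I-PER
```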

Credits

This project makes use of and is partially based on:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

chaine-3.11.0-cp311-cp311-win_amd64.whl (370.9 kB view details)

Uploaded: CPython 3.11, Windows x86-64

chaine-3.11.0-cp311-cp311-win32.whl (352.6 kB view details)

Uploaded: CPython 3.11, Windows x86

chaine-3.11.0-cp311-cp311-musllinux_1_1_x86_64.whl (1.7 MB view details)

Uploaded: CPython 3.11, musllinux: musl 1.1+ x86-64

chaine-3.11.0-cp311-cp311-musllinux_1_1_i686.whl (1.7 MB view details)

Uploaded: CPython 3.11, musllinux: musl 1.1+ i686

chaine-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB view details)

Uploaded: CPython 3.11, manylinux: glibc 2.17+ x86-64

chaine-3.11.0-cp311-cp311-manylinux_2_17_i686.manylinux_2_12_i686.manylinux2010_i686.manylinux2014_i686.whl (1.1 MB view details)

Uploaded: CPython 3.11, manylinux: glibc 2.12+ i686, manylinux: glibc 2.17+ i686

chaine-3.11.0-cp311-cp311-macosx_12_0_x86_64.whl (406.7 kB view details)

Uploaded: CPython 3.11, macOS 12.0+ x86-64

File details

Details for the file chaine-3.11.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: chaine-3.11.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 370.9 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Windows/10

File hashes

Hashes for chaine-3.11.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 8de49057a1c02d21c8bcc8d12fac5a49b440bae220f1ed2ae1568dadc899a6a1
MD5 f65e33d9b058c9f0b3dac2f276e3e0dd
BLAKE2b-256 39527dfa021178113a570d93644574290a9c512e4437204457e567094d8c38ed

See more details on using hashes here.

File details

Details for the file chaine-3.11.0-cp311-cp311-win32.whl.

File metadata

  • Download URL: chaine-3.11.0-cp311-cp311-win32.whl
  • Upload date:
  • Size: 352.6 kB
  • Tags: CPython 3.11, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Windows/10

File hashes

Hashes for chaine-3.11.0-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 b4ec11c3a06c56039e1c41c57cf4ac1225a8426628870a9b057c399d947e4715
MD5 ec644ebe2f23ee498ac8c8e437433202
BLAKE2b-256 bbca1dcee42e124db12fc36f5ce377001511e30ba98ac142552f7647c52bfffd


File details

Details for the file chaine-3.11.0-cp311-cp311-musllinux_1_1_x86_64.whl.

File metadata

  • Download URL: chaine-3.11.0-cp311-cp311-musllinux_1_1_x86_64.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.11, musllinux: musl 1.1+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Linux/5.15.0-1037-azure

File hashes

Hashes for chaine-3.11.0-cp311-cp311-musllinux_1_1_x86_64.whl
Algorithm Hash digest
SHA256 7e877aa9fe0971358a6e72537ff642b42203c917cf888b09abc0afdb20d507ae
MD5 93983f0c7868d6ce2c482c514e378989
BLAKE2b-256 e10cdb8c9d37bb15531d93e371c1e15fd0a1ca100e5da924628f67df2721ad15


File details

Details for the file chaine-3.11.0-cp311-cp311-musllinux_1_1_i686.whl.

File metadata

  • Download URL: chaine-3.11.0-cp311-cp311-musllinux_1_1_i686.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.11, musllinux: musl 1.1+ i686
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Linux/5.15.0-1037-azure

File hashes

Hashes for chaine-3.11.0-cp311-cp311-musllinux_1_1_i686.whl
Algorithm Hash digest
SHA256 2fdd2de99efe52b0e92e79a7ef6489d064d1a35fbbc1a5e087a9317c39f5465b
MD5 d0c0fbf86a6c19869c76003ce7d751c0
BLAKE2b-256 b06c5f54247484ab912b8fbbca10d42be1e6bb27d855951e6372d57b39977968


File details

Details for the file chaine-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for chaine-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c5e7953e9419ab43730a527b9c26460472ff22445681bccbc492948001df4434
MD5 b39f3b75fbe191a1487d7451ec2f8d16
BLAKE2b-256 10fa7d42195ebf49dd9e8f29cf4fabd9618f071f2f18faa7487898b70c687379


File details

Details for the file chaine-3.11.0-cp311-cp311-manylinux_2_17_i686.manylinux_2_12_i686.manylinux2010_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for chaine-3.11.0-cp311-cp311-manylinux_2_17_i686.manylinux_2_12_i686.manylinux2010_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 a800e66ae04bdc7e3f8c67bb1c14e47b9387208fbef44aec19c432dedb742923
MD5 72c405371f91a4a4dfe2ff0e061562cf
BLAKE2b-256 ee6782d1914820def462405140a3656e7fcb796daa976f871454f93a683f5ad9


File details

Details for the file chaine-3.11.0-cp311-cp311-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for chaine-3.11.0-cp311-cp311-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 6f4566166381798f092047283fe07afee47b7298520bf3c5856f2f95d986aaa9
MD5 ac54aaaabe3946d2a72b46be4772184e
BLAKE2b-256 763dced93e71dddaccbb3d1875254ba8d003a25bad9ad81a957fbdfcf248d230

