PosLog: A CRF-based Part-of-Speech Tagger for Log Messages

PosLog

A CRF-based Part-of-Speech (POS) Tagger for Log Messages.

Usage

Use Default Model

There are three ways to use the default model:

Predict the PoS tags of a list of tokens returning a list of tags.

from poslog import PosLogTokenizer, PosLogCRF

msg="Tag this sentence."

tokenizer=PosLogTokenizer()
tokens=tokenizer.tokenize(msg)
# ['Tag', 'this', 'sentence', '.']

pos_log=PosLogCRF()
# predict(X:list[str])->list[str]
pos_log.predict(tokens)
# ['VERB', 'DET', 'NOUN', 'PUNCT']

Predict the PoS tags of a string returning a list of tags.

# predict_string(X:str)->list[str]
pos_log.predict_string(msg)
# ['VERB', 'DET', 'NOUN', 'PUNCT']

Predict the PoS tags of a string returning a list of tuples with token and tag.

# predict_string_as_tuple(X:str)->list[tuple[str,str]]
pos_log.predict_string_as_tuple(msg)
# [('Tag', 'VERB'), ('this', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]
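
The tuple form presumably pairs the token list with the tag list position by position. A minimal sketch of that pairing in plain Python follows. It uses the tokens and tags from the example above as hard-coded values and does not call poslog; whether predict_string_as_tuple is implemented this way is an assumption.

```python
# Hard-coded values copied from the example above; no model is loaded here.
tokens = ["Tag", "this", "sentence", "."]
tags = ["VERB", "DET", "NOUN", "PUNCT"]

# Pair each token with the tag at the same position.
tagged = list(zip(tokens, tags))
print(tagged)
```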

Train Your Own Model

Define model name in constructor:

pos_log=PosLogCRF(model_name="abs_path_to_my_model")

You can give abs_path_to_my_model as an absolute or a relative path.
Note: Models given as a relative path are stored in the package directory models/ and will be overwritten if you recreate the environment.
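
The path rule described above can be illustrated with a small sketch. The helper resolve_model_path and the example directories are hypothetical and not part of poslog; the sketch only mirrors the documented behaviour (absolute paths are used as given, relative ones land under models/ in the package directory) and uses POSIX-style paths for simplicity.

```python
from pathlib import PurePosixPath


def resolve_model_path(model_name: str, package_dir: str) -> PurePosixPath:
    """Hypothetical helper mirroring the documented absolute/relative rule."""
    path = PurePosixPath(model_name)
    if path.is_absolute():
        # Absolute paths are used exactly as given.
        return path
    # Relative names are placed under the package's models/ directory.
    return PurePosixPath(package_dir) / "models" / path


# Example directories are made up for illustration.
print(resolve_model_path("/data/models/my_model", "/env/site-packages/poslog"))
print(resolve_model_path("my_model", "/env/site-packages/poslog"))
```

Because the second location lives inside the installed package, an absolute path is the safer choice for models you want to keep.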

PosLog takes training data as tokens and tags separately:

train(X_train_tokens:list[list[str]], y_train_tags:list[list[str]])

Or as token and tag pairs:

train_from_tagged_sents(tagged_sents:list[list[tuple[str,str]]])

After training, the model is saved to the path you provided in the constructor.
Note: Training overwrites any existing model with the same name.
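
The two training data layouts accepted by train and train_from_tagged_sents carry the same information. The following sketch converts between them in plain Python. The helper names split_tagged_sents and join_tokens_and_tags are illustrative and not part of poslog, and the sample sentence is the one from the usage example.

```python
def split_tagged_sents(tagged_sents):
    """Turn [[(token, tag), ...], ...] into separate token and tag lists."""
    X, y = [], []
    for sent in tagged_sents:
        X.append([token for token, _ in sent])
        y.append([tag for _, tag in sent])
    return X, y


def join_tokens_and_tags(X, y):
    """Turn separate token and tag lists into [[(token, tag), ...], ...]."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of sentences")
    tagged_sents = []
    for tokens, tags in zip(X, y):
        if len(tokens) != len(tags):
            raise ValueError("each sentence needs exactly one tag per token")
        tagged_sents.append(list(zip(tokens, tags)))
    return tagged_sents


tagged_sents = [[("Tag", "VERB"), ("this", "DET"), ("sentence", "NOUN"), (".", "PUNCT")]]
X_train_tokens, y_train_tags = split_tagged_sents(tagged_sents)
print(X_train_tokens)
print(y_train_tags)
```

The length checks in join_tokens_and_tags catch misaligned data before it reaches training.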

Use Your Own Model

Just call the constructor with the model name:

pos_log=PosLogCRF(model_name="my_model")

Tokenization

Since PosLog was trained on a corpus that we tokenized in a specific way, this package includes the corresponding tokenizer, PosLogTokenizer.

Tokenization runs in three steps; the first two are preprocessing steps that adapt it to the characteristics of log messages:

1. We escape quotation marks with spaces to distinguish them from words with trailing apostrophes.
2. We extend NLTK's contraction list with 124 more cases where we split or replace contracted words.
3. We apply NLTK's word_tokenize, which makes a few more replacements and returns a token list.

The following example shows the three steps of tokenization. Line 0 is the input string and line 3 is the output token list.

0 (Input):  "Can't read 'block_x'."
1:          "Can't read 'block_x '."
2:          "Cannot read 'block_x '."
3 (Output): ["Can", "not", "read", "'", "block_x", "'", "."]
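
The three steps can be approximated with the standard library alone. The sketch below is a simplified stand-in, not PosLogTokenizer's implementation: the regular expressions, the one-entry contraction table, and the function names are assumptions chosen to reproduce the example above, and simple_word_tokenize only imitates the small part of NLTK's word_tokenize that this example needs.

```python
import re

# Tiny illustrative contraction table; the real list is much larger.
CONTRACTIONS = {"can't": "cannot"}


def escape_closing_quotes(text: str) -> str:
    """Step 1: put a space before the closing quote of a quoted span."""
    return re.sub(r"(?<!\w)'([^']*\w)'(?!\w)", r"'\1 '", text)


def expand_contractions(text: str) -> str:
    """Step 2: replace contracted words, keeping a leading capital letter."""
    for short, full in CONTRACTIONS.items():
        def repl(match, full=full):
            return full.capitalize() if match.group(0)[0].isupper() else full
        pattern = r"\b" + re.escape(short) + r"\b"
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text


def simple_word_tokenize(text: str) -> list[str]:
    """Step 3 stand-in: split 'cannot' and separate punctuation from words."""
    text = re.sub(r"\b(can)(not)\b", r"\1 \2", text, flags=re.IGNORECASE)
    return re.findall(r"\w+|[^\w\s]", text)


step0 = "Can't read 'block_x'."
step1 = escape_closing_quotes(step0)
step2 = expand_contractions(step1)
step3 = simple_word_tokenize(step2)
print(step1)
print(step2)
print(step3)
```

For real log messages, use PosLogTokenizer itself, since the model was trained on its output.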

Dependencies

PosLog relies on

the nltk corpora words, stopwords, and wordnet, and
sklearn-crfsuite for the CRF classifier.

Release history

0.7: May 19, 2025 (this version)
0.6: May 19, 2025
0.5: May 13, 2025
0.4: May 12, 2025
0.3: May 12, 2025
0.2: May 12, 2025