PosLog: A CRF-based Part-of-Speech Tagger for Log Messages

PosLog

A CRF-based Part-of-Speech (PoS) tagger for log messages. In contrast to other state-of-the-art (SoTA) PoS taggers, PosLog is trained on a corpus of log messages. It achieves an accuracy of 98.27% on the test set.

Table: Accuracy of PosLog compared with other PoS taggers.
Rows are ordered by increasing accuracy. Time is given in seconds per 1 million tokens.

Tagger      Time (s per 1M tokens)  Accuracy
NLTK        28                      77.28%
HanTa       506                     78.74%
TreeTagger  43                      79.58%
SpaCy       428                     80.89%
Stanza      5,376                   90.25%
poslog      45                      98.27%
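
To make the timing column easier to compare, the seconds-per-million-tokens figures can be converted into throughput. The following is a small sketch that only reuses the numbers from the table above; the function name `tokens_per_second` is introduced here for illustration and is not part of PosLog.

```python
# Seconds per 1 million tokens, copied from the table above.
SECONDS_PER_MILLION_TOKENS = {
    "NLTK": 28,
    "HanTa": 506,
    "TreeTagger": 43,
    "SpaCy": 428,
    "Stanza": 5376,
    "poslog": 45,
}


def tokens_per_second(seconds_per_million: float) -> float:
    """Convert 'seconds per 1 million tokens' into 'tokens per second'."""
    return 1_000_000 / seconds_per_million


for tagger, seconds in SECONDS_PER_MILLION_TOKENS.items():
    print(f"{tagger:<10} {tokens_per_second(seconds):>9,.0f} tokens/s")
```

Read this way, 45 seconds per million tokens corresponds to roughly 22,000 tokens per second.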

Usage

Use Default Model

There are three ways to use the default model:

  1. Predict the PoS tags of a list of tokens returning a list of tags.

    from poslog import PosLogTokenizer, PosLogCRF

    msg = "Tag this sentence."

    # Split the message into tokens with the bundled tokenizer.
    tokenizer = PosLogTokenizer()
    tokens = tokenizer.tokenize(msg)
    # ['Tag', 'this', 'sentence', '.']

    # Load the default model and tag the tokens.
    pos_log = PosLogCRF()
    # predict(X:list[str])->list[str]
    pos_log.predict(tokens)
    # ['VERB' 'DET' 'NOUN' 'PUNCT']
    
  2. Predict the PoS tags of a string returning a list of tags.

    # predict_string(X:str)->list[str]
    pos_log.predict_string(msg)
    # ['VERB' 'DET' 'NOUN' 'PUNCT']
    
  3. Predict the PoS tags of a string returning a list of tuples with token and tag.

    # predict_string_as_tuple(X:str)->list[tuple[str,str]]
    pos_log.predict_string_as_tuple(msg)
    # [('Tag', 'VERB'), ('this', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]
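
The list of (token, tag) tuples from the third variant is convenient for post-processing. As a minimal sketch, the helper below groups tokens by their tag. `tokens_by_tag` is a hypothetical function written for this example, not part of PosLog, and the input is the sample output shown above rather than a live call to the model.

```python
from collections import defaultdict


def tokens_by_tag(tagged):
    """Group tokens by PoS tag, keeping their original order."""
    groups = defaultdict(list)
    for token, tag in tagged:
        groups[tag].append(token)
    return dict(groups)


# Sample output of predict_string_as_tuple, copied from the example above.
tagged = [("Tag", "VERB"), ("this", "DET"), ("sentence", "NOUN"), (".", "PUNCT")]

groups = tokens_by_tag(tagged)
print(groups["NOUN"])  # ['sentence']
```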
    

Train Your Own Model

Define the model name in the constructor:

pos_log = PosLogCRF(model_name="abs_path_to_my_model")

You can give the model name as an absolute or a relative path.
Note: Models given as relative paths are stored in the package directory models/ and will be overwritten if you recreate the environment.
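
Because relative paths end up inside the package directory, it can be safer to pass an absolute path. The sketch below builds one with pathlib. The directory name `poslog_models` and the file name `my_model` are placeholders chosen for this example, and whether PosLog creates missing parent directories is not stated here, so the constructor call is left as a comment.

```python
from pathlib import Path

# Hypothetical location outside the package directory (placeholder names).
model_path = Path("poslog_models") / "my_model"

# resolve() turns the path into an absolute one, based on the current working directory.
abs_model_path = str(model_path.resolve())
print(abs_model_path)

# The absolute path could then be passed to the constructor:
# pos_log = PosLogCRF(model_name=abs_model_path)
```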

PosLog takes training data as tokens and tags separately:

train(X_train_tokens:list[list[str]], y_train_tags:list[list[str]])

Or as token and tag pairs:

train_from_tagged_sents(tagged_sents:list[list[tuple[str,str]]])

After training, the model is saved to the path you provided in the constructor.
Note: Training overwrites any existing model with the same name.
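
The two training formats carry the same information, so data in one format can be converted into the other. The following sketch shows both directions in plain Python. The helper names `split_tagged_sents` and `join_tokens_and_tags` are made up for this example and are not part of PosLog; the sample sentence is the one from the usage section.

```python
def split_tagged_sents(tagged_sents):
    """Turn list[list[tuple[token, tag]]] into separate token and tag lists."""
    X, y = [], []
    for sent in tagged_sents:
        X.append([token for token, _ in sent])
        y.append([tag for _, tag in sent])
    return X, y


def join_tokens_and_tags(X, y):
    """Turn separate token and tag lists back into tagged sentences."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of sentences")
    tagged_sents = []
    for tokens, tags in zip(X, y):
        if len(tokens) != len(tags):
            raise ValueError("each sentence needs exactly one tag per token")
        tagged_sents.append(list(zip(tokens, tags)))
    return tagged_sents


tagged_sents = [
    [("Tag", "VERB"), ("this", "DET"), ("sentence", "NOUN"), (".", "PUNCT")],
]
X_train_tokens, y_train_tags = split_tagged_sents(tagged_sents)
print(X_train_tokens)  # [['Tag', 'this', 'sentence', '.']]
print(y_train_tags)    # [['VERB', 'DET', 'NOUN', 'PUNCT']]
```

Assuming the signatures listed above, `X_train_tokens` and `y_train_tags` would be suitable arguments for train, and `tagged_sents` for train_from_tagged_sents.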

Use Your Own Model

Just call the constructor with the model name:

pos_log=PosLogCRF(model_name="my_model")

Tokenization

Since PosLog was trained on a corpus that we tokenized in a specific way, we include the tokenizer PosLogTokenizer in this package.

We use three preprocessing steps to adapt tokenization to the specific characteristics of log messages:

  1. We escape quotation marks with spaces to distinguish them from words with trailing apostrophes.

  2. We extend NLTK's contraction list with 124 more cases in which we split or replace contracted words.

  3. We apply NLTK's word_tokenize, which makes a few more replacements and returns a token list.

The following example shows the three steps of tokenization. Line 0 is the input string and line 3 is the output token list.

0 (Input):  "Can't read 'block_x'."
1:          "Can't read 'block_x '."
2:          "Cannot read 'block_x '."
3 (Output): ["Can", "not", "read", "'", "block_x", "'", "."]
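
The three steps can be approximated in a few lines of plain Python. This is only a rough sketch that reproduces the example above, not the implementation used by PosLogTokenizer: the quote rule in `escape_quotes` is an assumed rule, the `CONTRACTIONS` table is a tiny hypothetical stand-in for the extended contraction list, and `simple_tokenize` is a regex stand-in for NLTK's word_tokenize so that the example runs without NLTK data.

```python
import re

# Hypothetical two-entry table; the real contraction list is much longer.
CONTRACTIONS = {"can't": "cannot", "won't": "will not"}


def escape_quotes(text: str) -> str:
    """Step 1 (assumed rule): put a space before a quote that closes a word."""
    return re.sub(r"(?<=\w)'(?!\w)", " '", text)


def expand_contractions(text: str) -> str:
    """Step 2: replace known contractions, keeping a leading capital letter."""
    def repl(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        return expanded.capitalize() if word[0].isupper() else expanded

    pattern = r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)


def simple_tokenize(text: str) -> list:
    """Step 3 (stand-in for word_tokenize): split 'cannot', separate punctuation."""
    text = re.sub(r"\b(can)(not)\b", r"\1 \2", text, flags=re.IGNORECASE)
    return re.findall(r"\w+|[^\w\s]", text)


step0 = "Can't read 'block_x'."
step1 = escape_quotes(step0)        # "Can't read 'block_x '."
step2 = expand_contractions(step1)  # "Cannot read 'block_x '."
step3 = simple_tokenize(step2)
print(step3)  # ['Can', 'not', 'read', "'", 'block_x', "'", '.']
```

For real input, use PosLogTokenizer itself, since the model was trained on its output and this sketch covers only the cases needed for the example.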

Dependencies

PosLog relies on

  • the nltk corpora words, stopwords, and wordnet, and
  • sklearn, specifically sklearn-crfsuite for the CRF classifier.
