Citation style classifier

The citation style classifier automatically infers the citation style of a reference string. The classifier is a logistic regression model trained on 90,000 reference strings. The following citation styles are supported by default:

  • acm-sig-proceedings
  • american-chemical-society
  • american-chemical-society-with-titles
  • american-institute-of-physics
  • american-sociological-association
  • apa
  • bmc-bioinformatics
  • chicago-author-date
  • elsevier-without-titles
  • elsevier-with-titles
  • harvard3
  • ieee
  • iso690-author-date-en
  • modern-language-association
  • springer-basic-author-date
  • springer-lecture-notes-in-computer-science
  • vancouver
  • unknown

The package contains the training data, the classification model, and the code for feature extraction, selection, training and prediction.

Installation

    pip3 install styleclass

Classification

From the command line:

    styleclass_classify -r "reference string"
    styleclass_classify -i /file/with/reference/strings/one/per/line -o /output/file

In Python code:

    from styleclass.classify import classify
    from styleclass.train import get_default_model

    model = get_default_model()
    prediction = classify("reference string", *model)
    prediction = classify(["reference string #1", "reference string #2", "reference string #3"], *model)
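
For a file with one reference per line, the list form of the call can be wrapped in a small helper. This is only a sketch: `read_references` and `classify_references` are hypothetical names, not part of styleclass, and since the return value of `classify` is not described here, the helper assumes that a list input yields one prediction per string, in input order. The classifier is passed in as an argument so that the helper does not depend on any particular model.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def read_references(lines: Iterable[str]) -> List[str]:
    """Strip surrounding whitespace and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


def classify_references(
    references: Sequence[str],
    classify_fn: Callable[..., Sequence[str]],
    model: tuple,
) -> List[Tuple[str, str]]:
    """Pair each reference string with its predicted style.

    Assumption (not documented above): classify_fn(list_of_strings, *model)
    returns one prediction per input string, in input order.
    """
    if not references:
        return []
    predictions = classify_fn(list(references), *model)
    return list(zip(references, predictions))


# Possible usage with styleclass installed (untested sketch):
#
#     from styleclass.classify import classify
#     from styleclass.train import get_default_model
#
#     with open("references.txt", encoding="utf-8") as handle:
#         references = read_references(handle)
#     for reference, style in classify_references(
#         references, classify, get_default_model()
#     ):
#         print(style, reference, sep="\t")
```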

Data

The styleclass package contains two datasets: a training set and a test set. Each contains a sample of 5,000 DOIs formatted in the 17 citation styles listed above (all except unknown), which gives 85,000 reference strings per dataset. Both datasets were generated automatically using the Crossref REST API.

A new dataset can be generated using the script styleclass_generate_dataset.
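
To illustrate what "a DOI formatted in a citation style" means, the sketch below builds a DOI content negotiation request that asks for a formatted bibliography entry in a given style. It is not the code behind styleclass_generate_dataset, whose internals are not described here. The doi.org content negotiation endpoint, the helper names `build_citation_request` and `fetch_citation`, and the example DOI are assumptions chosen for illustration.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_citation_request(doi: str, style: str) -> Request:
    """Build a request for one DOI formatted in one citation style.

    Assumption: the DOI resolver's content negotiation is used, with the
    style name passed in the Accept header.
    """
    url = "https://doi.org/" + quote(doi)
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return Request(url, headers=headers)


def fetch_citation(doi: str, style: str, timeout: float = 10.0) -> str:
    """Send the request and return the formatted reference string."""
    request = build_citation_request(doi, style)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


if __name__ == "__main__":
    # Requires network access; 10.5555/12345678 is a commonly used test DOI.
    for style in ("apa", "ieee", "vancouver"):
        print(style, "->", fetch_citation("10.5555/12345678", style))
```

Repeating such a request for every DOI in a sample and every style of interest gives one labeled reference string per (DOI, style) pair.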

Models

The default model was trained on the training dataset. Before training, the dataset was cleaned and enriched with random noise. In addition, 5,000 strings labeled with the "unknown" style were generated and added to the dataset, bringing the total to 90,000 reference strings.
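
The kind of noise is not specified above. As an illustration only, the sketch below shows one plausible form: random character-level deletions, substitutions, and insertions. The function name `add_noise`, the edit operations, and the noise alphabet are all assumptions and may differ from what styleclass actually does.

```python
import random
import string

# Characters that may be substituted or inserted (an arbitrary choice).
NOISE_ALPHABET = string.ascii_letters + string.digits + " .,;:()"


def add_noise(text: str, rate: float, rng: random.Random) -> str:
    """Apply a random edit to each character with probability `rate`."""
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)  # keep the character unchanged
            continue
        operation = rng.choice(("delete", "substitute", "insert"))
        if operation == "delete":
            continue  # drop the character
        if operation == "substitute":
            out.append(rng.choice(NOISE_ALPHABET))
        else:
            # insert a random character after the original one
            out.append(ch)
            out.append(rng.choice(NOISE_ALPHABET))
    return "".join(out)


if __name__ == "__main__":
    reference = "Doe, J. (2019). A title. Journal of Examples, 1(2), 3-4."
    print(add_noise(reference, rate=0.05, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which helps when a training run has to be repeated.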

The script styleclass_train_model can be used to train a new model. This is especially useful when you need to operate on a different set of citation styles than the default. The script prepares the data for training in the same way as was done for the default model.

Evaluation

The styleclass_evaluate script can be used to evaluate an existing model on a test set in terms of accuracy.

The accuracy of the default model estimated on our test set is 95%.
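
Accuracy here is the fraction of reference strings whose predicted style equals the true style. A minimal sketch of that calculation follows; the function name `accuracy` and the example labels are illustrative and are not taken from styleclass_evaluate.

```python
from typing import Sequence


def accuracy(true_styles: Sequence[str], predicted_styles: Sequence[str]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(true_styles) != len(predicted_styles):
        raise ValueError("true and predicted labels must have the same length")
    if not true_styles:
        raise ValueError("cannot compute accuracy of an empty test set")
    correct = sum(t == p for t, p in zip(true_styles, predicted_styles))
    return correct / len(true_styles)


if __name__ == "__main__":
    true_labels = ["apa", "ieee", "apa", "vancouver"]
    predicted = ["apa", "ieee", "harvard3", "vancouver"]
    print(accuracy(true_labels, predicted))  # 3 of 4 correct
```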
