Keyword extraction from linguistic publications

Keyword extraction from langsci publications

langscikw is a Python package and command line tool for bigram keyterm extraction. It is optimized for long English-language publications in linguistics and can also be applied to TeX code.

Keyword extraction (KWE) is done in three steps. No preprocessing is needed.

  • Step 1: KWE from the input document using YAKE. This is the simplest step, as it doesn't need a corpus, and it should extract the most important keywords.
  • Step 2: KWE using TF-IDF trained on a raw TeX corpus. This step yields some more general keywords relevant to the linguistic discipline.
  • Step 3: KWE using TF-IDF trained on a detexed corpus. This step fills in some missing keywords that also appear in keywordslist.txt, a reference list of 10,000 previously accepted keywords.

The number of steps is controlled by which corpora are provided during training: providing or omitting a corpus determines whether the corresponding step runs. The result still needs some manual correction, and relevant unigrams or trigrams have to be added by hand.
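To make the TF-IDF steps more concrete, here is a minimal pure-Python sketch of bigram TF-IDF scoring. It is not the package's implementation (langscikw lists scikit-learn among its dependencies). The tokenizer, the smoothed IDF formula, the handling of bigrams unseen in the corpus, the function names, and the sample texts are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text, keep runs of letters, and return adjacent word pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def train_idf(corpus):
    """Return (idf, default_idf) with a smoothed inverse document frequency per bigram."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(bigrams(doc)))  # count each bigram once per document
    idf = {bg: math.log((1 + n_docs) / (1 + df)) + 1 for bg, df in doc_freq.items()}
    # Assumption: bigrams never seen in the corpus are treated as document frequency 0.
    default_idf = math.log(1 + n_docs) + 1
    return idf, default_idf


def extract_keywords(text, idf, default_idf, n=5):
    """Rank the bigrams of `text` by term frequency times IDF and return the top n."""
    counts = Counter(bigrams(text))
    scores = {bg: tf * idf.get(bg, default_idf) for bg, tf in counts.items()}
    # Sort by descending score; break ties alphabetically for a stable result.
    return sorted(scores, key=lambda bg: (-scores[bg], bg))[:n]


# Hypothetical toy corpus and input document
corpus = [
    "this pattern is common in the sample",
    "the construction is common across dialects",
    "vowel harmony is attested in turkic",
]
document = "vowel harmony is common and vowel harmony spreads leftward"

idf, default_idf = train_idf(corpus)
top = extract_keywords(document, idf, default_idf, n=3)
print(top)
```

A bigram that occurs in many corpus documents, such as "is common" here, receives a lower IDF than a rarer one, which is why a corpus helps separate subject-specific terms from ordinary phrasing.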

Installation

pip3 install langscikw

Developed in Python 3.7.3 32-bit. Requires Python 3.7 or later and the following packages: jellyfish, joblib, networkx, scikit-learn, segtok, regex.

Download the langsci corpus files from here or use your own.

Usage

Command line

The command line tool provides only a simple interface. To customize the model parameters or the number of steps, use the KWE class described below.

langscikw inputfile [n] [corpus1] [corpus2] [keywordslist] [--silent]

The keywords are printed to the console and can be redirected to a text file.

Arguments

  • inputfile: Path to a .txt or .tex file or directory from which to extract keywords.
  • n (optional): Number of keywords to extract. Defaults to 300.
  • corpus1 (optional): Path to corpus for step 2, usually raw TeX files or a joblib-compressed file. If not provided, looks for corpus_tex.gz in the current directory.
  • corpus2 (optional): Path to corpus for step 3, usually detexed files or a joblib-compressed file. If not provided, looks for corpus_detexed.gz in the current directory.
  • keywordslist (optional): Path to a list of gold keywords for step 3. A default list based on langsci publications is installed with the package.
  • --silent (optional): Only print the result to the console, no progress updates.
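When the tool is called from a script, the argument order above can be assembled programmatically. The following is a small sketch using only the standard library. The helper names build_command and run_to_file are hypothetical, and the rule that an optional positional argument cannot be given without the ones before it is an assumption read off the usage line, not something the package documents.

```python
import subprocess


def build_command(inputfile, n=None, corpus1=None, corpus2=None,
                  keywordslist=None, silent=False):
    """Assemble the argument list for the langscikw command line tool."""
    optional = [n, corpus1, corpus2, keywordslist]
    # Drop trailing arguments that were not supplied.
    while optional and optional[-1] is None:
        optional.pop()
    # Assumption: positional arguments cannot be skipped, only left off the end.
    if any(value is None for value in optional):
        raise ValueError("optional positional arguments must be given in order")
    cmd = ["langscikw", str(inputfile)] + [str(value) for value in optional]
    if silent:
        cmd.append("--silent")
    return cmd


def run_to_file(cmd, output_path):
    """Run the command and redirect its standard output to output_path."""
    with open(output_path, "w", encoding="utf-8") as out:
        subprocess.run(cmd, stdout=out, check=True)


# Build (but do not run) a command that extracts 100 keywords without progress output.
cmd = build_command("my_book.tex", n=100, silent=True)
print(cmd)
```

Calling run_to_file(cmd, "keywords.txt") would then save the printed keywords to a text file, provided langscikw is installed and the default corpus files are present in the current directory.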

KWE class

import langscikw

input_path = "my_book"              # File/directory to extract keywords from
keywords_path = "keywords.txt"      # File to save keywords to

kwe = langscikw.KWE()
kwe.train("corpus_tex.gz", "corpus_detexed.gz")   # Corpora for steps 2 and 3
kws = kwe.extract_keywords(input_path, n=300, dedup_lim=0.85)
for kw in kws:
    print(kw)                       # Keywords are alphabetically sorted strings

# Save the keywords, one per line
with open(keywords_path, "w", encoding="utf-8") as f:
    f.write("\n".join(kws) + "\n")

Keyword arguments

  • n (optional): Number of keywords to extract. Defaults to 300.
  • dedup_lim (optional): Deduplication limit. A keyword whose Jaro-Winkler similarity to another keyword exceeds dedup_lim is not added to the final list. Defaults to 0.85.
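The effect of dedup_lim can be sketched as follows. This is an illustrative pure-Python version, not the package's code: langscikw depends on jellyfish, which presumably supplies the similarity measure. The greedy keep-or-drop loop, the function names, and the sample keywords are assumptions.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro-Winkler similarity: Jaro boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def deduplicate(candidates, dedup_lim=0.85):
    """Keep a candidate only if it is not too similar to any keyword kept so far."""
    kept = []
    for cand in candidates:
        if all(jaro_winkler(cand, kw) <= dedup_lim for kw in kept):
            kept.append(cand)
    return kept


# Hypothetical candidates, best first: the plural variant is a near-duplicate.
result = deduplicate(["noun phrase", "noun phrases", "verb phrase"])
print(result)
```

Lowering dedup_lim removes more near-duplicates such as singular and plural variants, at the risk of dropping distinct terms that happen to share most of their letters.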

Stand-alone models

The YAKE and TF-IDF models may also be used on their own. Please consult the docstrings for more information.

import langscikw
yake = langscikw.yakemodel.YakeExtractor()      # -> extract_keywords()
tfidf = langscikw.tfidfmodel.TfidfExtractor()   # -> train() -> extract_keywords()
