Natural Language Processing Utility Functions

This repository contains several functions for analyzing text corpora. Text documents can be transformed into (sparse, dictionary-based) tf-idf features; based on these, the similarities between documents can be computed, the dataset can be classified with k-nearest-neighbors (knn), or the corpus can be visualized in two dimensions.
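To make the dictionary-based feature idea concrete, here is a minimal plain-Python sketch of the pipeline described above: documents as sparse {word: count} dicts, turned into tf-idf weights and compared with the cosine similarity. The function names (tfidf_features, cosine_sim) are illustrative only, not the library's actual API.

```python
import math

def tfidf_features(doc_counts):
    """Illustrative sketch: {doc_id: {word: count}} -> {doc_id: {word: tf-idf}}."""
    n_docs = len(doc_counts)
    # document frequency: in how many documents does each word occur?
    df = {}
    for counts in doc_counts.values():
        for word in counts:
            df[word] = df.get(word, 0) + 1
    features = {}
    for doc_id, counts in doc_counts.items():
        total = sum(counts.values())
        features[doc_id] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return features

def cosine_sim(feat1, feat2):
    """Cosine similarity between two sparse {word: weight} dicts."""
    common = set(feat1) & set(feat2)
    dot = sum(feat1[w] * feat2[w] for w in common)
    norm1 = math.sqrt(sum(v * v for v in feat1.values()))
    norm2 = math.sqrt(sum(v * v for v in feat2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Because the dicts only store nonzero entries, the similarity loop touches only the words two documents share, which is what makes this representation efficient for sparse text data.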

The individual library components are largely independent of one another (besides most of them using functions from dict_utils.py), so you might also find only parts of this library interesting. For example, embedding.py contains a concise Python implementation of t-SNE, which can be used to embed data points in 2D based on any kind of similarity matrix, not necessarily one created with the scripts from this library.
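To illustrate how a 2D embedding can be computed from an arbitrary similarity matrix, here is a short numpy sketch of classical scaling, the simpler of the two embedding methods mentioned below (this is an assumption-laden sketch, not the actual code in embedding.py): the symmetric similarity matrix is double-centered and projected onto its top two eigenvectors.

```python
import numpy as np

def classical_scaling_2d(S):
    """Sketch: embed points in 2D from a symmetric similarity matrix S."""
    n = S.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = H @ S @ H                          # double-centered similarities
    evals, evecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:2]      # pick the two largest
    scale = np.sqrt(np.maximum(evals[idx], 0.0))
    return evecs[:, idx] * scale           # (n, 2) coordinates
```

When S is an inner-product (kernel) matrix of some underlying points, this recovers their pairwise distances exactly up to rotation; for more general similarities it gives a reasonable low-dimensional layout.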

If any of this code was helpful for your research, please consider citing it:

@misc{franziska_horn_2018_1254413,
  author       = {Franziska Horn},
  title        = {cod3licious/nlputils},
  month        = may,
  year         = 2018,
  doi          = {10.5281/zenodo.1254413},
  url          = {https://doi.org/10.5281/zenodo.1254413}
}

The code is intended for research purposes. It was programmed for Python 2.7, but should also run on newer Python 3 versions - please open an issue if you find something isn’t working there!

installation

You can either download the code from here and include the nlputils folder in your $PYTHONPATH, or install the library components via pip:

$ pip install nlputils

nlputils library components

dependencies: numpy, scipy, unidecode, matplotlib

  • dict_utils.py: various helper functions to manipulate dictionaries, e.g. to invert them on various levels (for example transform a dict with {document: {word: count}} into {word: {document: count}}).

  • features.py: this contains code to preprocess texts and transform them into tf-idf features. It’s somewhat similar to the sklearn TfidfVectorizer, but based on (sparse) dictionaries instead of sparse vectors. These dictionary based document features are the main input used for other parts of this library. But there is also a features2mat function to transform the dictionaries into a sparse feature matrix, which can be used with sklearn classifiers, for example.

  • simcoefs.py: this has one main function, compute_sim, which takes the tf-idf feature dictionaries of two documents as input and computes their similarity. For the type of similarity to compute between the documents, you can choose from a large variety of similarity coefficients, kernel functions, and distance measures, implemented based on [RIE08].

  • simmat.py: this contains wrapper functions for simcoefs.py to speed up the computation of the similarity matrix for a whole corpus.

  • ml_utils.py: helper functions to perform cross-validation.

  • knn_classifier.py: based on a similarity matrix, perform k-nearest-neighbors classification.

  • embedding.py: based on a similarity matrix, project data points to 2D with classical scaling or t-SNE.

  • visualize.py: helper functions to create a plot of the dataset based on the 2D embedding. This can also create a json file, which can be used with d3.js to create an interactive visualization of the data.
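As a sketch of the knn idea behind knn_classifier.py, here is a short numpy example that classifies test points by majority vote among their k most similar training points, given a precomputed similarity matrix over all points. The function name and signature are hypothetical, chosen for illustration, and do not reflect the library's actual API.

```python
import numpy as np

def knn_predict(S, train_idx, train_labels, test_idx, k=3):
    """Sketch: predict a label for each test point by majority vote
    among its k most similar training points. S is a full similarity
    matrix whose rows/columns index all (train + test) points."""
    predictions = []
    for i in test_idx:
        sims = S[i, train_idx]                 # similarities to training points
        nearest = np.argsort(sims)[::-1][:k]   # k highest similarities
        votes = [train_labels[j] for j in nearest]
        predictions.append(max(set(votes), key=votes.count))
    return predictions
```

Because the classifier only needs the similarity matrix, it works with any of the similarity measures from simcoefs.py (or any other source) without changes.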

examples

additional dependencies: sklearn

The iPython notebook at examples/examples.ipynb contains several examples of how to use the library components described above.

If you have any questions, please don't hesitate to send me an email. And of course, if you find any bugs or want to contribute other improvements, pull requests are very welcome!

[RIE08]

Rieck, Konrad, and Pavel Laskov. "Linear-time computation of similarity measures for sequential data." Journal of Machine Learning Research 9 (2008): 23-48.
