
Text2Class

Build multi-class text classifiers using state-of-the-art pre-trained contextualized language models, e.g. BERT. Only a few hundred samples per class are necessary to get started.

Background

This project is based on our study: Transfer Learning Robustness in Multi-Class Categorization by Fine-Tuning Pre-Trained Contextualized Language Models.

Citation

To cite this work, use the following BibTeX entry.

@article{transfer2019multiclass,
  title={Transfer Learning Robustness in Multi-Class Categorization by Fine-Tuning Pre-Trained Contextualized Language Models},
  author={Liu, Xinyi and Wangperawong, Artit},
  journal={arXiv preprint arXiv:1909.03564},
  year={2019}
}

Installation

pip install text2class

Example usage

Create a dataframe with two columns, such as 'text' and 'label'. No text pre-processing is necessary.

import pandas as pd
from text2class.text_classifier import TextClassifier

df = pd.read_csv("data.csv")

train = df.sample(frac=0.9, random_state=200)
test = df.drop(train.index)

cls = TextClassifier(
	num_labels=3,
	data_column="text",
	label_column="label",
	max_seq_length=128
)

cls.fit(train)

predictions = cls.predict(test["text"])
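Once predictions are available, a held-out accuracy check closes the loop. A minimal sketch in plain Python, assuming `predict` returns one label per input row (the literal labels below are illustrative, not from the library):

```python
# Hypothetical predicted labels and ground-truth labels for illustration
predictions = ["sports", "politics", "sports", "tech"]
actual = ["sports", "politics", "tech", "tech"]

# Fraction of rows where the predicted label matches the true label
accuracy = sum(p == a for p, a in zip(predictions, actual)) / len(actual)
print(f"accuracy: {accuracy:.2f}")  # 3 of 4 correct
```

In practice you would compare `cls.predict(test["text"])` against `test["label"]` the same way.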

Advanced usage

Model type

The default model is an uncased Bidirectional Encoder Representations from Transformers (BERT) consisting of 12 transformer layers, 12 self-attention heads per layer, and a hidden size of 768. Below are all currently supported models, any of which you can specify with hub_module_handle. We expect more to be added in the future. For more information, see the BERT GitHub repository.

https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1
https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1
https://tfhub.dev/google/bert_cased_L-24_H-1024_A-16/1
https://tfhub.dev/google/bert_chinese_L-12_H-768_A-12/1
https://tfhub.dev/google/bert_multi_cased_L-12_H-768_A-12/1

cls = TextClassifier(
	num_labels=3,
	data_column="text",
	label_column="label",
	max_seq_length=128,
	hub_module_handle="https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
)
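The handle URLs encode the architecture: L is the number of transformer layers, H the hidden size, and A the number of self-attention heads. A small helper (hypothetical, not part of Text2Class) can decode a handle before you commit to the memory cost of a larger model:

```python
import re

def parse_bert_handle(url):
    """Extract layer count, hidden size, and head count from a TF Hub BERT handle."""
    m = re.search(r"_L-(\d+)_H-(\d+)_A-(\d+)", url)
    layers, hidden, heads = map(int, m.groups())
    return {"layers": layers, "hidden_size": hidden, "attention_heads": heads}

print(parse_bert_handle("https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1"))
# {'layers': 24, 'hidden_size': 1024, 'attention_heads': 16}
```

The L-24 variants are substantially larger and slower to fine-tune than the L-12 default, so start with L-12 unless you need the extra capacity.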

Contributing

Text2Class is an open-source project founded and maintained to better serve the machine learning and data science community. Please feel free to submit pull requests to contribute to the project. By participating, you are expected to adhere to Text2Class's code of conduct.

Questions?

For questions or help using Text2Class, please submit a GitHub issue.
