A Simple Package to Train BERT-Like Models for Text Classification
The Augmented Social Scientist
This package makes it simple to train BERT-like models for text classification.
It accompanies our article "The Augmented Social Scientist: Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy", published in Sociological Methods & Research by Salomé Do, Étienne Ollion and Rubing Shen.
To install the package
- Use pip
pip install AugmentedSocialScientist
- Or from source
git clone https://github.com/rubingshen/AugmentedSocialScientist.git
pip install ./AugmentedSocialScientist
Import BERT model
from AugmentedSocialScientist import bert
The module bert contains 3 main functions:
- bert.encode() to preprocess the data;
- bert.run_training() to train, validate and save a model;
- bert.predict_with_model() to make predictions with a saved model.
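The order in which these three functions are typically chained can be sketched as follows. This is a minimal sketch, not the package's documented API: the helper run_workflow and its parameters are hypothetical, and the argument shapes are assumptions (that encode() accepts texts with optional labels, that run_training() takes an encoded training set and an encoded validation set, and that predict_with_model() takes encoded data plus options locating the saved model). Any further options are passed through untouched, so check the tutorial for the actual parameter names.

```python
def run_workflow(model, train_texts, train_labels, val_texts, val_labels,
                 new_texts, train_kwargs=None, predict_kwargs=None):
    """Chain encode -> run_training -> predict_with_model on one model module.

    `model` is any object exposing the three functions, for example the
    module obtained with `from AugmentedSocialScientist import bert`.
    The call shapes below are assumptions, not confirmed signatures.
    """
    # 1. Preprocess the labelled data (assumed: texts first, labels second).
    train_data = model.encode(train_texts, train_labels)
    val_data = model.encode(val_texts, val_labels)

    # 2. Train, validate and save the model. Options such as the learning
    #    rate or the name of the saved model go in train_kwargs.
    scores = model.run_training(train_data, val_data, **(train_kwargs or {}))

    # 3. Preprocess unlabelled texts and predict with the saved model.
    #    Options locating the saved model go in predict_kwargs.
    new_data = model.encode(new_texts)
    predictions = model.predict_with_model(new_data, **(predict_kwargs or {}))
    return scores, predictions
```

Because the module is passed in as an argument, the same helper would work unchanged with any of the other language models the package provides.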
Tutorial
Check here for a Google Colab tutorial.
Languages supported
BERT is a pre-trained language model for the English language. The package also contains models for other languages:
- camembert for French;
- arabic_bert for Arabic;
- chinese_bert for Chinese;
- german_bert for German;
- hindi_bert for Hindi;
- italian_bert for Italian;
- portuguese_bert for Portuguese;
- russian_bert for Russian;
- spanish_bert for Spanish;
- swedish_bert for Swedish;
- xlmroberta, which is a multilingual model supporting 100 languages.
To use them, simply import the corresponding model and replace bert with the name of the imported model.
For example, to use the French language model camembert:
- Import the model camembert:
from AugmentedSocialScientist import camembert
- Then use the functions camembert.encode(), camembert.run_training(), camembert.predict_with_model()
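When the language is only known at run time, the import can be made dynamic. The following is a minimal sketch: the dictionary keys, the "multilingual" alias and the helper names module_name_for and load_model_module are hypothetical, while the module names are taken from the list of supported languages.

```python
import importlib

# Language -> module name, copied from the list of supported languages.
# The lowercase keys and the "multilingual" alias are illustrative choices.
LANGUAGE_TO_MODULE = {
    "english": "bert",
    "french": "camembert",
    "arabic": "arabic_bert",
    "chinese": "chinese_bert",
    "german": "german_bert",
    "hindi": "hindi_bert",
    "italian": "italian_bert",
    "portuguese": "portuguese_bert",
    "russian": "russian_bert",
    "spanish": "spanish_bert",
    "swedish": "swedish_bert",
    "multilingual": "xlmroberta",
}


def module_name_for(language):
    """Return the module name listed for a language (case-insensitive)."""
    key = language.strip().lower()
    if key not in LANGUAGE_TO_MODULE:
        raise ValueError(
            f"No model listed for {language!r}; consider 'multilingual'."
        )
    return LANGUAGE_TO_MODULE[key]


def load_model_module(language):
    """Mimic `from AugmentedSocialScientist import <name>` for a language."""
    name = module_name_for(language)
    package = importlib.import_module("AugmentedSocialScientist")
    try:
        # Works if the package exposes the model as an attribute.
        return getattr(package, name)
    except AttributeError:
        # Otherwise fall back to importing it as a submodule.
        return importlib.import_module(f"AugmentedSocialScientist.{name}")
```

With the package installed, load_model_module("french") should return the same object as the camembert import shown above, on which encode(), run_training() and predict_with_model() can then be called.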
Source Distribution
Hashes for AugmentedSocialScientist-1.1.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 8b9b6f3f0a43de4642911038852c26ed8705a1fa1d701de70b2887e5cfa90c38
MD5 | 758061cd8ab8d08c647406dbd4fafd76
BLAKE2b-256 | ad33331a98e77aaa14583cf0b6c17c3b31a3bbc0d54288ba7211c444a780f0e6