Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
This paper was accepted to Findings of ACL 2023.
Getting Started
This codebase is available on pypi.org via:
pip install npc-gzip
Usage
See the examples directory for example usage.
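For context, the package implements the compressor-based classification method from the paper: normalized compression distance (NCD) between texts, followed by k-nearest-neighbor voting. A minimal, standard-library-only sketch of that idea (an illustration of the method, not the package's actual API):

```python
import gzip
from collections import Counter

def c(s):
    """Length of the gzip-compressed UTF-8 encoding of s."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x, y):
    """Normalized compression distance between two strings."""
    cx, cy = c(x), c(y)
    cxy = c(x + " " + y)  # compress the concatenation of both texts
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(text, train, k=3):
    """Predict a label by majority vote among the k nearest training texts.

    train is a list of (text, label) pairs.
    """
    neighbors = sorted(train, key=lambda pair: ncd(text, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because gzip compresses the concatenation of two similar texts almost as well as one of them alone, `ncd(x, x)` is small while unrelated pairs score close to 1; the classifier simply ranks training examples by this distance.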
Testing
This package uses poetry to maintain its dependencies and pytest to run its tests. To run the tests:
poetry shell
poetry install
pytest
Original Codebase
Requirements
See requirements.txt.
Install requirements in a clean environment:
conda create -n npc python=3.7
conda activate npc
pip install -r requirements.txt
Run
python main_text.py
By default, this will use only 100 test and 100 training samples per class as a quick demo. These can be changed with --num_test and --num_train.
--compressor <gzip, lzma, bz2>
--dataset <AG_NEWS, SogouNews, DBpedia, YahooAnswers, 20News, Ohsumed_single, R8, R52, kinnews, kirnews, swahili, filipino> [Note that for small datasets like kinnews, the default 100-shot setting is too large; set --num_test and --num_train accordingly.]
--num_train <INT>
--num_test <INT>
--data_dir <DIR> [This needs to be specified for R8, R52 and Ohsumed.]
--all_test [This will use the whole test dataset.]
--all_train
--record [This will record the distance matrix and save it for future use. It's helpful when you want to run on the whole dataset.]
--test_idx_start <INT>
--test_idx_end <INT> [These two arguments let you run on a specific range of the test set. They are also helpful for computing the distance matrix on the whole dataset.]
--para [This will use multiprocessing to accelerate computation.]
--output_dir <DIR> [The output directory for saving tested indices or the distance matrix.]
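For example, the flags above can be combined as follows (dataset and sample counts are chosen arbitrarily for illustration):

```shell
# Quick demo on AG_NEWS with gzip, 100 train/test samples per class
python main_text.py --dataset AG_NEWS --compressor gzip --num_train 100 --num_test 100

# Full run with multiprocessing, recording the distance matrix for later scoring
python main_text.py --dataset AG_NEWS --all_train --all_test --para --record --output_dir output
```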
Calculate Accuracy (Optional)
If you want to calculate accuracy from a recorded distance file <DISTANCE DIR>, use
python main_text.py --record --score --distance_fn <DISTANCE DIR>
Otherwise, accuracy will be calculated automatically by the command in the previous section.
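The idea behind scoring from a recorded distance matrix can be sketched as follows (a hypothetical NumPy illustration; the repo's actual file format and scoring code may differ):

```python
import numpy as np
from collections import Counter

def accuracy_from_distances(dist, train_labels, test_labels, k=2):
    """Compute k-NN accuracy from a precomputed distance matrix.

    dist[i, j] is the distance from test sample i to training sample j.
    """
    correct = 0
    for i, row in enumerate(dist):
        nearest = np.argsort(row)[:k]  # indices of the k closest training samples
        votes = Counter(train_labels[j] for j in nearest)
        correct += votes.most_common(1)[0][0] == test_labels[i]
    return correct / len(test_labels)
```

Recording the matrix once and scoring separately avoids recompressing every test/train pair when experimenting with different values of k.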
Use Custom Dataset
You can use your own custom dataset by passing custom to --dataset, passing the directory that contains train.txt and test.txt to --data_dir, and passing the number of classes to --class_num.
Both train.txt and test.txt are expected to have the format {label}\t{text} per line.
You can change the delimiter to match your dataset by editing delimiter in load_custom_dataset() in data.py.
File details
Details for the file npc_gzip-0.1.1.tar.gz.
File metadata
- Download URL: npc_gzip-0.1.1.tar.gz
- Upload date:
- Size: 11.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | d81142055516d0ce4c0eedfec87a53cdfe0cf7b3cd4f59f9c033a219355e2e1e
MD5 | be80ce8d2f2567bce4bcca1980f30261
BLAKE2b-256 | c0bc936d3d8828a6c5e739863b03490a6048b484a67b797c0e55c6dcf48764db
File details
Details for the file npc_gzip-0.1.1-py3-none-any.whl.
File metadata
- Download URL: npc_gzip-0.1.1-py3-none-any.whl
- Upload date:
- Size: 13.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | e8051e6cc24ad2f19028f3a1f9427790e742fb5dd32fb1c669de79928c532264
MD5 | f78db36698298b11ba604f248dfe980e
BLAKE2b-256 | 9fa8bb0b75702a650d4757f6db171f8bd82264c0fd43dbb010c09921ef7d6c06