Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

Project description

This paper was accepted to Findings of ACL 2023.

Getting Started

This codebase is available on pypi.org via:

pip install npc-gzip

Usage

See the examples directory for example usage.
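For a rough illustration of the idea behind the package (not the npc-gzip API itself; see the examples directory for that), the paper's method combines the normalized compression distance (NCD) with a nearest-neighbor vote. A minimal sketch using only the standard library, with illustrative toy data:

```python
import gzip


def clen(s: str) -> int:
    """Length of the gzip-compressed UTF-8 bytes of s."""
    return len(gzip.compress(s.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(test_text: str, train: list[tuple[str, str]]) -> str:
    """Predict a label for test_text by 1-nearest-neighbor under NCD."""
    return min(train, key=lambda pair: ncd(test_text, pair[1]))[0]


# Toy training set: (label, text) pairs.
train = [
    ("sports", "the team won the championship game last night"),
    ("tech", "the new processor doubles performance per watt"),
]

print(classify("a stunning victory in the final game", train))
```

The intuition: compressing two similar texts together costs little more than compressing one alone, so a low NCD suggests the texts share content, and the test example inherits the label of its compression-nearest neighbor.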

Testing

This package utilizes poetry to maintain its dependencies and pytest to execute tests. To get started running the tests:

poetry shell
poetry install
pytest

Original Codebase

Require

See requirements.txt.

Install requirements in a clean environment:

conda create -n npc python=3.7
conda activate npc
pip install -r requirements.txt

Run

python main_text.py

By default, this uses only 100 test and 100 training samples per class as a quick demo. These numbers can be changed via --num_test and --num_train.

--compressor <gzip, lzma, bz2>
--dataset <AG_NEWS, SogouNews, DBpedia, YahooAnswers, 20News, Ohsumed_single, R8, R52, kinnews, kirnews, swahili, filipino> [Note that for small datasets like kinnews, the default 100-shot setting is too large; set --num_test and --num_train accordingly.]
--num_train <INT>
--num_test <INT>
--data_dir <DIR> [This needs to be specified for R8, R52 and Ohsumed.]
--all_test [This will use the whole test dataset.]
--all_train
--record [This records the distance matrix and saves it for future use. It's helpful when you want to run on the whole dataset.]
--test_idx_start <INT>
--test_idx_end <INT> [These two arguments restrict evaluation to a range of the test set. They are also helpful for calculating the distance matrix on the whole dataset in chunks.]
--para [This uses multiprocessing to accelerate computation.]
--output_dir <DIR> [The output directory for saving tested indices or the distance matrix.]
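Putting the flags together, a quick demo run on AG_NEWS with gzip and multiprocessing might look like this (the flag values here are illustrative):

```shell
python main_text.py --compressor gzip --dataset AG_NEWS --num_train 100 --num_test 100 --para
```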

Calculate Accuracy (Optional)

To calculate accuracy from a recorded distance file <DISTANCE DIR>, use

python main_text.py --record --score --distance_fn <DISTANCE DIR>

Otherwise, accuracy is calculated automatically by the command in the previous section.

Use Custom Dataset

You can use your own dataset by passing custom to --dataset, the directory containing train.txt and test.txt to --data_dir, and the number of classes to --class_num.

Both train.txt and test.txt are expected to have the format {label}\t{text} per line.

You can change the delimiter according to your dataset by editing delimiter in load_custom_dataset() in data.py.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

npc_gzip-0.1.1.tar.gz (11.3 kB)

Uploaded Source

Built Distribution

npc_gzip-0.1.1-py3-none-any.whl (13.4 kB)

Uploaded Python 3

File details

Details for the file npc_gzip-0.1.1.tar.gz.

File metadata

  • Download URL: npc_gzip-0.1.1.tar.gz
  • Upload date:
  • Size: 11.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for npc_gzip-0.1.1.tar.gz:

  • SHA256: d81142055516d0ce4c0eedfec87a53cdfe0cf7b3cd4f59f9c033a219355e2e1e
  • MD5: be80ce8d2f2567bce4bcca1980f30261
  • BLAKE2b-256: c0bc936d3d8828a6c5e739863b03490a6048b484a67b797c0e55c6dcf48764db


File details

Details for the file npc_gzip-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: npc_gzip-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 13.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for npc_gzip-0.1.1-py3-none-any.whl:

  • SHA256: e8051e6cc24ad2f19028f3a1f9427790e742fb5dd32fb1c669de79928c532264
  • MD5: f78db36698298b11ba604f248dfe980e
  • BLAKE2b-256: 9fa8bb0b75702a650d4757f6db171f8bd82264c0fd43dbb010c09921ef7d6c06

