SoFair Filter
Simple command line tool for identifying candidate documents for software mention extraction.
Installation
pip install sofairfilter
The default configuration uses FlashAttention (https://github.com/Dao-AILab/flash-attention), which is not installed as a dependency and must be installed separately after sofairfilter. You can install it with:
pip install flash-attn --no-build-isolation
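If flash-attn cannot be built on your machine, one option is to check whether it is importable and otherwise choose a different `attn_implementation` value in a custom configuration. The following is a minimal sketch, assuming the package is importable as `flash_attn` and that `sdpa` is an acceptable fallback for your setup. The helper name `pick_attn_implementation` is hypothetical and is not part of sofairfilter.

```python
import importlib.util


def pick_attn_implementation(module_name: str = "flash_attn") -> str:
    """Return "flash_attention_2" if the module can be found, else "sdpa".

    Hypothetical helper for illustration; not part of sofairfilter.
    """
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    # Prints the value you might put under attn_implementation.
    print(pick_attn_implementation())
```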
Usage
To process a folder containing text documents and filter them based on the presence of software mentions, you can use the following command:
sofairfilter folder_with_txt_documents
It will print paths to the documents that contain software mentions.
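To use the result from Python, the command can be wrapped in a subprocess call. This is a rough sketch rather than an official API: it assumes that `sofairfilter` is on your `PATH` and that each non-empty line printed to standard output is one document path. The function name `find_candidates` and its `command` parameter are hypothetical.

```python
import subprocess
from typing import List, Optional, Sequence


def find_candidates(folder: str, command: Optional[Sequence[str]] = None) -> List[str]:
    """Run the filter on a folder and return the printed paths.

    Hypothetical wrapper; `command` lets you swap in another executable.
    """
    cmd = list(command) if command is not None else ["sofairfilter"]
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(cmd + [folder], capture_output=True, text=True, check=True)
    # Assumption: each non-empty stdout line is one document path.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Calling `find_candidates("folder_with_txt_documents")` would then return the printed paths as a list of strings.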
Custom Configuration
You can run it with a custom configuration file using the --config option:
sofairfilter folder_with_txt_documents --config path/to/config.yaml
The default configuration is:
model_factory: # Model configuration.
model_path: SoFairOA/sofair-modernBERT-base-filter # Name or path to the model.
attn_implementation: flash_attention_2 # The attention implementation to use in the model (if relevant). Can be any of "eager" (manual implementation of the attention), "sdpa" (using F.scaled_dot_product_attention), or "flash_attention_2" (using Dao-AILab/flash-attention). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual "eager" implementation.
cache_dir: # Path to Hugging Face cache directory.
quantization: # Configuration for bits and bytes quantization.
load_in_8bit: false # This flag is used to enable 8-bit quantization with LLM.int8().
load_in_4bit: false # This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from `bitsandbytes`.
llm_int8_threshold: 6.0 # This corresponds to the outlier threshold for outlier detection as described in `LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale` paper: https://arxiv.org/abs/2208.07339 Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
llm_int8_skip_modules: # An explicit list of the modules that we do not want to convert in 8-bit. This is useful for models such as Jukebox that has several heads in different places and not necessarily at the last position. For example for `CausalLM` models, the last `lm_head` is kept in its original `dtype`.
llm_int8_enable_fp32_cpu_offload: false # This flag is used for advanced use cases and users that are aware of this feature. If you want to split your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as `google/flan-t5-xxl`. Note that the int8 operations will not be run on CPU.
llm_int8_has_fp16_weight: false # This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass.
bnb_4bit_compute_dtype: # This sets the computational type which might be different than the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups.
bnb_4bit_quant_type: fp4 # This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types which are specified by `fp4` or `nf4`.
bnb_4bit_use_double_quant: false # This flag is used for nested quantization where the quantization constants from the first quantization are quantized again.
bnb_4bit_quant_storage: # This sets the storage type to pack the quantized 4-bit params.
torch_dtype: bfloat16 # Override the default torch.dtype and load the model under a specific dtype
trust_remote_code: false # Whether to trust remote code.
config: # Configuration for the model.
device: cuda # Device map for the model. If not specified, the model will be loaded on the CPU. Defaults to auto.
labels: # Classification labels; the position of each label specifies its label id. Leave empty for automatic detection of labels from the dataset or to use the labels from the model configuration.
tokenizer: # Hugging Face tokenizer for the model. Leave empty if you wish to initialize it from the model.
threshold: # The threshold for the model's confidence probability. Documents with a probability below this threshold will be filtered out. By default, no threshold is applied and a class with the highest probability is selected.
batch_size: 32 # Batch size for processing documents.
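The effect of the `threshold` option described above can be illustrated with a small sketch. It is not the tool's actual code: it assumes a binary classifier, that the positive ("contains software mentions") class has label id 1, and that the threshold is compared against the probability of that class. The function name `keep_document` is hypothetical.

```python
from typing import Optional, Sequence


def keep_document(probs: Sequence[float], threshold: Optional[float] = None,
                  positive_id: int = 1) -> bool:
    """Decide whether a document is kept as a candidate (illustrative only)."""
    if threshold is None:
        # No threshold: the class with the highest probability is selected.
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        return predicted == positive_id
    # With a threshold: keep only if the positive class reaches it.
    return probs[positive_id] >= threshold


if __name__ == "__main__":
    print(keep_document([0.4, 0.6]))        # True: positive class has the highest probability
    print(keep_document([0.4, 0.6], 0.7))   # False: 0.6 is below the threshold
    print(keep_document([0.7, 0.3], 0.2))   # True: 0.3 reaches the lower threshold
```

Under these assumptions, a threshold above 0.5 trades recall for precision, and a threshold below 0.5 does the opposite.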
See help for more options:
sofairfilter --help
Evaluation
We evaluated this model on the test set of the SoFairOA/sofair_softcite_somesci (sofair_softcite_somesci_documents) dataset:
| metric | value |
|---|---|
| precision | 0.8625730994152047 |
| recall | 0.9104938271604939 |
| f1 | 0.8858858858858859 |
| accuracy | 0.9268527430221367 |
Scripts used for evaluation are available in the experiments/sofair_softcite_somesci folder.
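As a quick consistency check on the reported numbers, F1 can be recomputed as the harmonic mean of precision and recall. This sketch assumes the evaluation uses the standard binary F1 definition; it does not reproduce the evaluation scripts.

```python
# Values copied from the table in this section.
precision = 0.8625730994152047
recall = 0.9104938271604939
reported_f1 = 0.8858858858858859

# Standard F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"f1 = {f1:.6f}")  # f1 = 0.885886
print(abs(f1 - reported_f1) < 1e-9)  # True
```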