Tool for identifying candidate documents for software mention extraction.

SoFair Filter

A simple command-line tool for identifying candidate documents for software mention extraction.

Installation

pip install sofairfilter

The default configuration uses FlashAttention (https://github.com/Dao-AILab/flash-attention), which must be installed separately after sofairfilter. You can install it with:

pip install flash-attn --no-build-isolation
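
If flash-attn cannot be built on a machine, a different value for attn_implementation can be set in a custom configuration. The following is a minimal sketch of how one might check which value to use. The helper name pick_attn_implementation is hypothetical, and falling back to "sdpa" is an assumption made for illustration, not something sofairfilter does by itself.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Suggest a value for the attn_implementation config key.

    Hypothetical helper: returns "flash_attention_2" when the flash_attn
    package (installed by `pip install flash-attn`) is importable, and
    "sdpa" otherwise. The fallback choice is an assumption.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation())
```

The printed value can then be written to the attn_implementation key of a custom configuration file.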

Usage

To filter a folder of text documents by the presence of software mentions, run:

sofairfilter folder_with_txt_documents

It will print paths to the documents that contain software mentions.
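
To use the result from a script, the command can be wrapped in a subprocess call. This is a minimal sketch, assuming the tool prints one document path per line on standard output and nothing else; the function names parse_paths and run_filter are hypothetical and not part of sofairfilter.

```python
import subprocess
from pathlib import Path


def parse_paths(stdout: str) -> list[Path]:
    """Turn printed output into a list of paths.

    Assumption: one document path per non-empty line.
    """
    return [Path(line.strip()) for line in stdout.splitlines() if line.strip()]


def run_filter(folder: str) -> list[Path]:
    """Run `sofairfilter <folder>` and return the paths it prints."""
    result = subprocess.run(
        ["sofairfilter", folder],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return parse_paths(result.stdout)


# Example call (requires sofairfilter to be installed):
# candidates = run_filter("folder_with_txt_documents")
```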

Custom Configuration

You can run it with a custom configuration file using the --config option:

sofairfilter folder_with_txt_documents --config path/to/config.yaml

The default configuration is:

model_factory:  # Model configuration.
  model_path: SoFairOA/sofair-modernBERT-base-filter  # Name or path to the model.
  attn_implementation: flash_attention_2 # The attention implementation to use in the model (if relevant). Can be any of "eager" (manual implementation of the attention), "sdpa" (using F.scaled_dot_product_attention), or "flash_attention_2" (using Dao-AILab/flash-attention). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual "eager" implementation.
  cache_dir: # Path to Hugging Face cache directory.
  quantization: # Configuration for bits and bytes quantization.
    load_in_8bit: false  # This flag is used to enable 8-bit quantization with LLM.int8().
    load_in_4bit: false # This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from `bitsandbytes`.
    llm_int8_threshold: 6.0 # This corresponds to the outlier threshold for outlier detection as described in `LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale` paper: https://arxiv.org/abs/2208.07339 Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
    llm_int8_skip_modules: # An explicit list of the modules that we do not want to convert to 8-bit. This is useful for models such as Jukebox that have several heads in different places and not necessarily at the last position. For example, for `CausalLM` models, the last `lm_head` is kept in its original `dtype`.
    llm_int8_enable_fp32_cpu_offload: false # This flag is used for advanced use cases and users that are aware of this feature. If you want to split your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as `google/flan-t5-xxl`. Note that the int8 operations will not be run on CPU.
    llm_int8_has_fp16_weight: false # This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass.
    bnb_4bit_compute_dtype: # This sets the computational type which might be different than the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups.
    bnb_4bit_quant_type: fp4 # This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types which are specified by `fp4` or `nf4`.
    bnb_4bit_use_double_quant: false # This flag is used for nested quantization where the quantization constants from the first quantization are quantized again.
    bnb_4bit_quant_storage: # This sets the storage type to pack the quantized 4-bit params.
  torch_dtype: bfloat16 # Override the default torch.dtype and load the model under a specific dtype
  trust_remote_code: false # Whether to trust remote code.
  config: # Configuration for the model.
  device: cuda # Device map for the model. If not specified, the model will be loaded on the CPU. Defaults to auto.
  labels: # Classification labels, the position is specifying label id. Leave empty for automatic detection of labels from dataset or using labels from model configuration.
tokenizer: # Hugging Face tokenizer for the model. Leave empty if you wish to initialize it from the model.
threshold: # The threshold for the model's confidence probability. Documents with a probability below this threshold will be filtered out. By default, no threshold is applied and the class with the highest probability is selected.
batch_size: 32 # Batch size for processing documents.
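
The effect of the threshold option can be sketched as follows. This is an illustration of the described behavior, not the tool's actual code: the function name keep_document, the position of the positive class (index 1), and the use of >= rather than > at the boundary are all assumptions.

```python
from typing import Optional, Sequence


def keep_document(
    probs: Sequence[float],
    threshold: Optional[float] = None,
    positive_id: int = 1,  # assumed index of the "contains software mention" class
) -> bool:
    """Decide whether a document passes the filter."""
    if threshold is None:
        # No threshold: select the class with the highest probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        return predicted == positive_id
    # Threshold set: keep the document only if the positive-class
    # probability reaches the threshold (boundary handling is assumed).
    return probs[positive_id] >= threshold


print(keep_document([0.4, 0.6]))                 # True
print(keep_document([0.4, 0.6], threshold=0.7))  # False
print(keep_document([0.7, 0.3], threshold=0.2))  # True
```

A higher threshold trades recall for precision, and a threshold below 0.5 does the opposite.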

See help for more options:

sofairfilter --help

Evaluation

We evaluated the model on the test set of the SoFairOA/sofair_softcite_somesci (sofair_softcite_somesci_documents) dataset:

precision	0.8625730994152047
recall	0.9104938271604939
f1	0.8858858858858859
accuracy	0.9268527430221367
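
As a quick consistency check, the reported F1 is the harmonic mean of the reported precision and recall. The snippet below recomputes it from the two values above; the function name f1_score is chosen here for illustration and is not taken from the evaluation scripts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.8625730994152047
recall = 0.9104938271604939
print(round(f1_score(precision, recall), 4))  # 0.8859
```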

Scripts used for evaluation are available in the experiments/sofair_softcite_somesci folder.

Release history

1.0.0 (this version), Jul 25, 2025

Download files

Source Distribution

sofairfilter-1.0.0.tar.gz (9.3 kB view details)

Uploaded Jul 25, 2025 Source

Built Distribution

sofairfilter-1.0.0-py3-none-any.whl (13.0 kB view details)

Uploaded Jul 25, 2025 Python 3

File details

Details for the file sofairfilter-1.0.0.tar.gz.

File metadata

Download URL: sofairfilter-1.0.0.tar.gz
Upload date: Jul 25, 2025
Size: 9.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for sofairfilter-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`30deca2d6f660fbce1d1f95b4a777a55412f5c4d20973fd11fbb704ecb6c5fcc`
MD5	`dc042b29ce8a054f1130f7b521cfbb6c`
BLAKE2b-256	`85e4434f04cbef9f4f58742e0ae56b03eeb042a893da4954632de32b6ece2f16`

File details

Details for the file sofairfilter-1.0.0-py3-none-any.whl.

File metadata

Download URL: sofairfilter-1.0.0-py3-none-any.whl
Upload date: Jul 25, 2025
Size: 13.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for sofairfilter-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f43a10c1df4b5a138e7aa6b65a35b503503b3fa07cbb31a17db2f6d805d7036f`
MD5	`b560bffb6af3f8617ac036369da46576`
BLAKE2b-256	`36a097e8189e55456ac535b945a80b57428be9ceefea707c1220a1a178fa72c9`
