Tool for identifying candidate documents for software mention extraction.

SoFair Filter

Simple command line tool for identifying candidate documents for software mention extraction.

Installation

pip install sofairfilter

The default configuration uses FlashAttention (https://github.com/Dao-AILab/flash-attention), which must be installed separately after installing sofairfilter. You can install it with:

pip install flash-attn --no-build-isolation
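If you are unsure whether FlashAttention is available in your environment, a small check can help you decide which value to put in `attn_implementation`. The following is a minimal sketch, not part of sofairfilter: `pick_attn_implementation` is a hypothetical helper, and falling back to "sdpa" is an assumption based on the options listed in the default configuration.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Suggest a value for attn_implementation based on what is installed."""
    # The flash-attn distribution is imported as the module `flash_attn`.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: fall back to PyTorch's scaled dot product attention.
    return "sdpa"


if __name__ == "__main__":
    print(pick_attn_implementation())
```

If the result is "sdpa", you would set that value in a custom configuration file instead of the default `flash_attention_2`.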

Usage

To process a folder containing text documents and filter them based on the presence of software mentions, you can use the following command:

sofairfilter folder_with_txt_documents

It will print paths to the documents that contain software mentions.
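To consume the output from another script, one option is to call the command through `subprocess`. This is a minimal sketch that assumes the tool prints one path per line to standard output and nothing else (the description above only says that it prints paths). `parse_paths` and `run_filter` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def parse_paths(stdout: str) -> list[Path]:
    """Turn the command's output into paths, skipping blank lines."""
    return [Path(line.strip()) for line in stdout.splitlines() if line.strip()]


def run_filter(folder: str) -> list[Path]:
    """Run sofairfilter on a folder and return the printed document paths."""
    result = subprocess.run(
        ["sofairfilter", folder],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_paths(result.stdout)
```

If the tool also writes log messages to standard output, the parsing step would need to be adapted.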

Custom Configuration

You can run it with a custom configuration file using the --config option:

sofairfilter folder_with_txt_documents --config path/to/config.yaml

The default configuration is:

model_factory:  # Model configuration.
  model_path: SoFairOA/sofair-modernBERT-base-filter  # Name or path to the model.
  attn_implementation: flash_attention_2 # The attention implementation to use in the model (if relevant). Can be any of "eager" (manual implementation of the attention), "sdpa" (using F.scaled_dot_product_attention), or "flash_attention_2" (using Dao-AILab/flash-attention). By default, if available, SDPA will be used for torch>=2.1.1. The default is otherwise the manual "eager" implementation.
  cache_dir: # Path to Hugging Face cache directory.
  quantization: # Configuration for bits and bytes quantization.
    load_in_8bit: false  # This flag is used to enable 8-bit quantization with LLM.int8().
    load_in_4bit: false # This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from `bitsandbytes`.
    llm_int8_threshold: 6.0 # This corresponds to the outlier threshold for outlier detection as described in `LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale` paper: https://arxiv.org/abs/2208.07339 Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
    llm_int8_skip_modules: # An explicit list of the modules that we do not want to convert in 8-bit. This is useful for models such as Jukebox that have several heads in different places and not necessarily at the last position. For example for `CausalLM` models, the last `lm_head` is kept in its original `dtype`.
    llm_int8_enable_fp32_cpu_offload: false # This flag is used for advanced use cases and users that are aware of this feature. If you want to split your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as `google/flan-t5-xxl`. Note that the int8 operations will not be run on CPU.
    llm_int8_has_fp16_weight: false # This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass.
    bnb_4bit_compute_dtype: # This sets the computational type which might be different than the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups.
    bnb_4bit_quant_type: fp4 # This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types which are specified by `fp4` or `nf4`.
    bnb_4bit_use_double_quant: false # This flag is used for nested quantization where the quantization constants from the first quantization are quantized again.
    bnb_4bit_quant_storage: # This sets the storage type to pack the quantized 4-bit params.
  torch_dtype: bfloat16 # Override the default torch.dtype and load the model under a specific dtype
  trust_remote_code: false # Whether to trust remote code.
  config: # Configuration for the model.
  device: cuda # Device map for the model. If not specified, the model will be loaded on the CPU. Defaults to auto.
  labels: # Classification labels, the position is specifying label id. Leave empty for automatic detection of labels from dataset or using labels from model configuration.
tokenizer: # Hugging Face tokenizer for the model. Leave empty if you wish to initialize it from the model.
threshold: # The threshold for the model's confidence probability. Documents with a probability below this threshold will be filtered out. By default, no threshold is applied and the class with the highest probability is selected.
batch_size: 32 # Batch size for processing documents.
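The `threshold` option describes two decision modes. The sketch below illustrates them for a single document; it is not the package's code. It assumes the model outputs one probability per class and that label id 1 means "contains software mentions". `keep_document` and `POSITIVE_CLASS` are hypothetical names.

```python
from typing import Optional, Sequence

POSITIVE_CLASS = 1  # assumption: label id of "contains software mentions"


def keep_document(probs: Sequence[float], threshold: Optional[float] = None) -> bool:
    """Decide whether a document passes the filter."""
    if threshold is None:
        # No threshold: select the class with the highest probability.
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        return predicted == POSITIVE_CLASS
    # With a threshold: filter out documents whose probability is below it.
    return probs[POSITIVE_CLASS] >= threshold
```

Under these assumptions, for a two-class model a threshold above 0.5 keeps fewer documents than the default, and a threshold below 0.5 keeps more.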

See help for more options:

sofairfilter --help

Evaluation

We evaluated this model on the test set of the SoFairOA/sofair_softcite_somesci (sofair_softcite_somesci_documents) dataset:

precision 0.8625730994152047
recall 0.9104938271604939
f1 0.8858858858858859
accuracy 0.9268527430221367

Scripts used for evaluation are available in the experiments/sofair_softcite_somesci folder.
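As a quick consistency check on the reported numbers, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the two reported values; it is an illustration only and does not reproduce the evaluation scripts.

```python
# Values copied from the reported evaluation results.
precision = 0.8625730994152047
recall = 0.9104938271604939

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # prints 0.8859
```

The result agrees with the reported f1 value.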
