High-performance memory-mapped n-gram engine for large text corpora

Fastgram

High-performance memory-mapped n-gram engine compatible with InfiniGram-style shard directories (tokenized.*, table.*, offset.*).
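As a rough illustration of the shard layout, the sketch below groups the files in an index directory by shard suffix. It assumes each shard is a trio named tokenized.<suffix>, table.<suffix>, and offset.<suffix>; the helper name list_shards is hypothetical and not part of Fastgram.

```python
from collections import defaultdict
from pathlib import Path

# File-name prefixes named in the description above.
SHARD_PREFIXES = ("tokenized", "table", "offset")


def list_shards(index_dir):
    """Group shard files by suffix, keeping only complete trios.

    Hypothetical helper: assumes files are named <prefix>.<suffix>.
    """
    shards = defaultdict(dict)
    for path in Path(index_dir).iterdir():
        prefix, _, suffix = path.name.partition(".")
        if prefix in SHARD_PREFIXES and suffix:
            shards[suffix][prefix] = path
    # Drop shards that are missing one of the three files.
    return {
        suffix: files
        for suffix, files in shards.items()
        if len(files) == len(SHARD_PREFIXES)
    }
```

A directory holding tokenized.0, table.0, and offset.0 yields a single entry keyed by "0"; an incomplete shard is skipped rather than reported.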

Build:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

cmake --build build -j

Test:

ctest --test-dir build --output-on-failure

Tools:

  • tg_rpc: stdin/stdout RPC for benchmarking + integration
  • tg_query: quick CLI query helper
  • tg_build_unigram_ranges: build unigram_ranges.bin for faster unigram range lookup
  • tools/run_bench.py: benchmark runner (uses bench/bench_config.json)
  • tools/gen_bench_queries.py: build deterministic query suite for coverage
  • tools/run_bench_suite.py: suite runner (uses bench/bench_suite_config.json)
  • tg_slice_index: build deterministic slices for build benchmarks
  • tg_build_index: build table/full index (benchmark target)
  • tools/run_build_bench.py: index build benchmark runner (uses bench/build_bench_config.json)
  • tools/verify_built_index.py: correctness check for build outputs

Benchmarking:

Query benchmarks measure the performance of the find and ntd operations:

  • python tools/run_bench.py - runs find/ntd benchmarks using bench/bench_config.json
  • python tools/run_bench_suite.py - runs comprehensive suite using bench/bench_suite_config.json
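For readers unfamiliar with the two operations, here is a toy sketch of what they compute, assuming the InfiniGram-style meaning: find locates the suffix-array range of a token sequence, and ntd (next-token distribution) counts the tokens that follow it. This is a naive in-memory illustration, not Fastgram's memory-mapped implementation, and the function names are chosen for this example only.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def build_suffix_array(tokens):
    """Return suffix start positions sorted by the suffix they begin (naive)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find(tokens, suffix_array, query):
    """Return the half-open suffix-array range [lo, hi) of suffixes starting with query."""
    query = list(query)
    n = len(query)
    # Truncating sorted suffixes to n tokens keeps them in sorted order,
    # so a binary search over the truncated keys is valid.
    keys = [tokens[i:i + n] for i in suffix_array]
    return bisect_left(keys, query), bisect_right(keys, query)


def ntd(tokens, suffix_array, query):
    """Count the tokens that immediately follow each occurrence of query."""
    lo, hi = find(tokens, suffix_array, query)
    n = len(query)
    counts = Counter()
    for start in suffix_array[lo:hi]:
        if start + n < len(tokens):
            counts[tokens[start + n]] += 1
    return counts
```

The occurrence count of a query is hi - lo from find. A real engine searches the on-disk table directly instead of materializing keys.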

Build benchmarks measure index construction performance:

  1. Create test slices from an existing index:
# Small slice: 2000 docs, token_width=2 (u16)
./build/tg_slice_index <source_index_dir> bench/build_inputs/small 2000 2

# Medium slice: 20000 docs, token_width=2 (u16)
./build/tg_slice_index <source_index_dir> bench/build_inputs/medium 20000 2
  2. Build reference indices:
# token_width=2, version=4, mode=table_only, ram_cap=8 GiB (8589934592 bytes)
./build/tg_build_index bench/build_inputs/small bench/build_refs/small 2 4 table_only 8589934592
./build/tg_build_index bench/build_inputs/medium bench/build_refs/medium 2 4 table_only 8589934592
  3. Run build benchmarks:
python tools/run_build_bench.py
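The positional arguments to tg_build_index are easy to misorder, so a small wrapper can help. The sketch below assembles the argument list in the order shown in the commands above (input, output, token_width, version, mode, ram_cap in bytes). The function name build_index_cmd and its keyword names are hypothetical, and the RAM cap is given in GiB purely for convenience.

```python
def build_index_cmd(
    input_dir,
    output_dir,
    token_width=2,
    version=4,
    mode="table_only",
    ram_cap_gib=8,
    binary="./build/tg_build_index",
):
    """Return the argv list for one tg_build_index invocation.

    Hypothetical helper: the argument order mirrors the example commands.
    """
    ram_cap_bytes = ram_cap_gib * 1024 ** 3  # 8 GiB -> 8589934592 bytes
    return [
        binary,
        input_dir,
        output_dir,
        str(token_width),
        str(version),
        mode,
        str(ram_cap_bytes),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) once the binary has been built.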

Notes:

  • Build scripts auto-detect build directory or use GRAM_BUILD_DIR environment variable
  • ram_cap_bytes in configs is 8589934592 (8 GiB) to limit memory use during benchmarking
  • Generate query suites with python tools/gen_bench_queries.py --index-dir <path> --eos <eos_id> --vocab <vocab_size>
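The build-directory lookup mentioned in the notes could work roughly as sketched below. This is an assumption about the behavior, not the scripts' actual logic: it supposes that GRAM_BUILD_DIR takes precedence when set and that auto-detection falls back to a build directory under the repository root. The name resolve_build_dir and the candidate list are hypothetical.

```python
import os
from pathlib import Path


def resolve_build_dir(repo_root=".", env=None, candidates=("build",)):
    """Pick the build directory: GRAM_BUILD_DIR if set, else the first existing candidate.

    Hypothetical sketch; the precedence and candidate names are assumptions.
    """
    env = os.environ if env is None else env
    override = env.get("GRAM_BUILD_DIR")
    if override:
        return Path(override)
    for name in candidates:
        candidate = Path(repo_root) / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(
        "no build directory found; set GRAM_BUILD_DIR or run the CMake build"
    )
```

Passing env explicitly keeps the function easy to test without modifying the process environment.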

Python:

python -m pip install -e .

python -c "from fastgram import GramEngine; print(GramEngine)"

Download indices:

Requires AWS CLI (aws).

gram # interactive

gram list

gram download v4_pileval_gpt2 --to index/v4_pileval_gpt2

Run:

gram run --index index/v4_pileval_gpt2 --prompt "natural language processing"

Interactive run:

gram -> 1 (run) # start gram, then choose option 1 (run)

Settings in run mode:

/settings

/set topk 50

/set temperature 0.8

/gen 20 hello world
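To give a feel for what topk and temperature usually mean in generation, the sketch below applies both to a table of next-token counts. It shows the common interpretation (keep the k most frequent candidates, then reweight by count^(1/temperature)); how Fastgram applies these settings internally is not stated here, so treat this as an assumption. All function names are hypothetical, and counts are assumed to be positive.

```python
import random


def topk_temperature_probs(counts, topk, temperature):
    """Turn next-token counts into sampling probabilities.

    Keeps the topk most frequent tokens, then raises each count to
    1/temperature before normalizing (lower temperature = sharper).
    """
    if topk < 1 or temperature <= 0:
        raise ValueError("topk must be >= 1 and temperature must be > 0")
    # Sort by count descending, breaking ties by token for determinism.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:topk]
    weights = {token: count ** (1.0 / temperature) for token, count in top}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def sample_token(probs, rng=None):
    """Draw one token according to probs."""
    rng = rng or random.Random()
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
```

With counts {"a": 8, "b": 2, "c": 1}, topk=2 drops "c"; at temperature 1.0 the remaining probabilities are proportional to the raw counts, and lowering the temperature shifts more mass onto "a".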

Notes:

  • Uses the tokenizer specified for the index in the catalog.
  • Some tokenizers require HF_TOKEN (for gated models like Llama-2).
