High-performance memory-mapped n-gram engine for large text corpora
Fastgram
High-performance memory-mapped n-gram engine compatible with InfiniGram-style shard directories (tokenized.*, table.*, offset.*).
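As an illustration of the shard layout, the following sketch groups the files in an index directory by shard suffix and reports shards that are missing one of the three file kinds. It assumes each file is named `<kind>.<suffix>` with `kind` being `tokenized`, `table`, or `offset`, as the patterns above suggest; `group_shards` and `incomplete_shards` are hypothetical helpers and not part of Fastgram.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: shard files are named "<kind>.<suffix>", following the
# tokenized.*, table.*, offset.* patterns mentioned in the description.
SHARD_KINDS = ("tokenized", "table", "offset")


def group_shards(index_dir):
    """Map each shard suffix to the set of file kinds found for it."""
    shards = defaultdict(set)
    for path in Path(index_dir).iterdir():
        kind, dot, suffix = path.name.partition(".")
        if dot and kind in SHARD_KINDS:
            shards[suffix].add(kind)
    return dict(shards)


def incomplete_shards(index_dir):
    """Return the sorted suffixes that lack at least one expected file kind."""
    expected = set(SHARD_KINDS)
    return sorted(
        suffix
        for suffix, kinds in group_shards(index_dir).items()
        if kinds != expected
    )
```

A check like this can be run before pointing the engine at a directory, for example after an interrupted download.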
Build:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
Test:
ctest --test-dir build --output-on-failure
Tools:
- tg_rpc: stdin/stdout RPC for benchmarking + integration
- tg_query: quick CLI query helper
- tg_build_unigram_ranges: build unigram_ranges.bin for faster unigram range lookup
- tools/run_bench.py: benchmark runner (uses bench/bench_config.json)
- tools/gen_bench_queries.py: build deterministic query suite for coverage
- tools/run_bench_suite.py: suite runner (uses bench/bench_suite_config.json)
- tg_slice_index: build deterministic slices for build benchmarks
- tg_build_index: build table/full index (benchmark target)
- tools/run_build_bench.py: index build benchmark runner (uses bench/build_bench_config.json)
- tools/verify_built_index.py: correctness check for build outputs
Benchmarking:
Query benchmarks measure find and ntd operation performance:
- python tools/run_bench.py - runs find/ntd benchmarks using bench/bench_config.json
- python tools/run_bench_suite.py - runs comprehensive suite using bench/bench_suite_config.json
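For readers unfamiliar with the two operations, here is a naive in-memory sketch of what a find (locate all occurrences of a token sequence via a suffix array) and an ntd (next-token distribution after that sequence) query compute. This is an assumption about the semantics based on InfiniGram-style engines; it is not Fastgram's implementation, which works on memory-mapped shards, and the function names here are illustrative only.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def build_suffix_array(tokens):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find(tokens, sa, query):
    """Return the half-open suffix-array range [lo, hi) matching `query`."""
    n = len(query)
    # Truncated suffixes stay sorted, so binary search applies.
    # They are materialized here for clarity, not efficiency.
    prefixes = [tuple(tokens[i:i + n]) for i in sa]
    q = tuple(query)
    return bisect_left(prefixes, q), bisect_right(prefixes, q)


def ntd(tokens, sa, query):
    """Count the tokens that immediately follow each occurrence of `query`."""
    lo, hi = find(tokens, sa, query)
    counts = Counter()
    for start in sa[lo:hi]:
        nxt = start + len(query)
        if nxt < len(tokens):
            counts[tokens[nxt]] += 1
    return counts
```

With `tokens = [1, 2, 3, 1, 2, 4, 1, 2, 3]`, the query `[1, 2]` occurs three times, followed by token 3 twice and token 4 once.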
Build benchmarks measure index construction performance:
- Create test slices from an existing index:
# Small slice: 2000 docs, token_width=2 (u16)
./build/tg_slice_index <source_index_dir> bench/build_inputs/small 2000 2
# Medium slice: 20000 docs, token_width=2 (u16)
./build/tg_slice_index <source_index_dir> bench/build_inputs/medium 20000 2
- Build reference indices:
# token_width=2, version=4, mode=table_only, ram_cap=8GB
./build/tg_build_index bench/build_inputs/small bench/build_refs/small 2 4 table_only 8589934592
./build/tg_build_index bench/build_inputs/medium bench/build_refs/medium 2 4 table_only 8589934592
- Run build benchmarks:
python tools/run_build_bench.py
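The slice and build steps above can also be scripted. The sketch below only composes the argument lists in the positional order shown in the commands (input, output, then the numeric and mode arguments); the helper names and default values are illustrative and mirror the examples, and nothing here is part of the Fastgram tooling itself.

```python
import shlex


def slice_index_argv(source_index_dir, output_dir, num_docs, token_width=2,
                     binary="./build/tg_slice_index"):
    """Argument list for tg_slice_index, in the order used in the examples."""
    return [binary, str(source_index_dir), str(output_dir),
            str(num_docs), str(token_width)]


def build_index_argv(input_dir, output_dir, token_width=2, version=4,
                     mode="table_only", ram_cap_bytes=8 * 1024**3,
                     binary="./build/tg_build_index"):
    """Argument list for tg_build_index, in the order used in the examples."""
    return [binary, str(input_dir), str(output_dir), str(token_width),
            str(version), mode, str(ram_cap_bytes)]


if __name__ == "__main__":
    # Print the commands for the two slice sizes used above.
    for name, docs in (("small", 2000), ("medium", 20000)):
        print(shlex.join(slice_index_argv(
            "<source_index_dir>", f"bench/build_inputs/{name}", docs)))
        print(shlex.join(build_index_argv(
            f"bench/build_inputs/{name}", f"bench/build_refs/{name}")))
```

The lists can be passed directly to `subprocess.run(argv, check=True)` once the placeholder source directory is replaced with a real path.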
Notes:
- Build scripts auto-detect build directory or use GRAM_BUILD_DIR environment variable
- ram_cap_bytes in configs is 8589934592 (8GB) to limit memory during benchmarking
- Generate query suites with
python tools/gen_bench_queries.py --index-dir <path> --eos <eos_id> --vocab <vocab_size>
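Two of the notes above can be made concrete. The sketch below shows where the 8589934592 figure comes from (8 × 1024³ bytes) and one plausible way a script could resolve the build directory, preferring GRAM_BUILD_DIR when it is set. The actual auto-detection rules are not stated here, so the `candidates` fallback list is purely an assumption, and `resolve_build_dir` is a hypothetical helper.

```python
import os
from pathlib import Path

# 8 * 1024**3 bytes, the ram_cap_bytes value used in the benchmark configs.
RAM_CAP_BYTES = 8 * 1024**3


def resolve_build_dir(env=None, candidates=("build",)):
    """Return GRAM_BUILD_DIR if set, else the first existing candidate dir.

    The candidate list is a guess for illustration; the real scripts may
    probe different locations.
    """
    env = os.environ if env is None else env
    override = env.get("GRAM_BUILD_DIR")
    if override:
        return Path(override)
    for candidate in candidates:
        if Path(candidate).is_dir():
            return Path(candidate)
    return None
```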
Python:
python -m pip install -e .
python -c "from fastgram import GramEngine; print(GramEngine)"
Download indices:
Requires AWS CLI (aws).
gram # interactive
gram list
gram download v4_pileval_gpt2 --to index/v4_pileval_gpt2
Run:
gram run --index index/v4_pileval_gpt2 --prompt "natural language processing"
Interactive run:
gram -> 1 (run)
Settings in run mode:
/settings
/set topk 50
/set temperature 0.8
/gen 20 hello world
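To give a sense of what the topk and temperature settings usually control, here is a minimal sketch of top-k sampling with temperature over next-token counts. It follows the common scheme (keep the k most frequent candidates, then sample with probability proportional to count^(1/temperature)); whether gram applies the settings in exactly this way is an assumption, and `sample_next_token` is a hypothetical function.

```python
import math
import random


def sample_next_token(counts, topk=50, temperature=0.8, rng=random):
    """Sample a token id from a {token_id: positive count} mapping."""
    if not counts:
        raise ValueError("no candidate tokens")
    # Keep the topk most frequent tokens (ties broken by token id).
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:topk]
    if temperature <= 0:
        return top[0][0]  # treat non-positive temperature as greedy decoding
    # Weight proportional to count ** (1 / temperature): temperatures
    # below 1 sharpen the distribution, above 1 flatten it.
    weights = [math.exp(math.log(count) / temperature) for _, count in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Passing `rng=random.Random(seed)` makes the draws reproducible.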
Notes:
- Uses the tokenizer specified for the index in the catalog.
- Some tokenizers require HF_TOKEN (for gated models like Llama-2).
File details
Details for the file fast_gram-0.1.0.tar.gz.
File metadata
- Download URL: fast_gram-0.1.0.tar.gz
- Upload date:
- Size: 31.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a6ad06c0f142867d4b50dc85aea25f410e92f2c412b9391d469c9d9e18b7231 |
| MD5 | 0aa7ed1d06e966a03824cbc843b9abb9 |
| BLAKE2b-256 | b26e0f57696fa6013b777362edbb1da195957a1228227d17b1838b9e7e048fd8 |
File details
Details for the file fast_gram-0.1.0-cp313-cp313-macosx_10_15_universal2.whl.
File metadata
- Download URL: fast_gram-0.1.0-cp313-cp313-macosx_10_15_universal2.whl
- Upload date:
- Size: 755.5 kB
- Tags: CPython 3.13, macOS 10.15+ universal2 (ARM64, x86-64)
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c03af617deabaae394b4b9f73ed073a41b302376a81f76db3cc7c4038abd0c9d |
| MD5 | 4f038127f5e45040ee3a4cbbc7e47d2f |
| BLAKE2b-256 | 5f17591a753d22d30738513041288a043ee9d09bd999ed85693caf3a710ce81b |
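A downloaded file can be checked against the digests listed above with Python's standard `hashlib`. The sketch below computes all three digests in one pass; BLAKE2b-256 is taken to mean BLAKE2b with a 32-byte digest. `file_digests` and `matches` are hypothetical helper names.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Compute SHA256, MD5, and BLAKE2b-256 hex digests of a file."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches(path, algorithm, expected_hex):
    """Return True if the file's digest equals the expected hex string."""
    return file_digests(path)[algorithm] == expected_hex.strip().lower()
```

For example, `matches("fast_gram-0.1.0.tar.gz", "SHA256", "<digest from the table>")` should return True for an intact download.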