# SemTools for Python

Semantic search and document parsing tools for the command line.
A collection of high-performance command-line tools for document processing and semantic search, built in Python. It leverages modern libraries like `asyncio` for concurrency, LanceDB for efficient vector storage, and `model2vec` for state-of-the-art local embeddings.
- `parse`: Parse documents (PDF, DOCX, etc.) into clean markdown using the LlamaParse API, with intelligent caching to avoid re-processing.
- `search`: Perform fast, local semantic search on text files. It uses multilingual embeddings to find relevant lines of text based on meaning, not just keywords.
- `workspace`: Manage persistent workspaces to accelerate searches over large and evolving collections of documents. Embeddings are stored and indexed, and only changed files are re-processed.
## Key Features

- Fast Local Semantic Search: Uses `model2vec` embeddings (`minishlab/potion-multilingual-128M`) for high-quality, multilingual semantic search that runs entirely on your machine.
- Powerful Document Parsing: Integrates with LlamaParse for robust parsing of complex documents like PDFs into structured markdown.
- Efficient Caching: The `parse` tool caches results, only re-processing files when their content changes.
- Persistent Workspaces: The `search` tool can use workspaces powered by LanceDB to store and index embeddings, making subsequent searches on large file sets nearly instantaneous.
- Unix-Friendly: Designed to be a good citizen in a Unix-style shell, easily chainable with tools like `xargs`, `grep`, and `find`.
- Async Powered: Built with Python's `asyncio` to handle concurrent operations efficiently, especially for parsing multiple documents.
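The core idea behind embedding-based search can be sketched with toy vectors. This is illustrative only, not SemTools' internals: in the real tool, each line's vector comes from the `model2vec` embedding model, and here the vectors and corpus are made up.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; lower means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" standing in for model output.
# A real model maps each line of text to a fixed-size dense vector.
corpus = {
    "The cheetah is the fastest land animal.": [0.9, 0.1, 0.2],
    "Quarterly revenue grew by 12 percent.":   [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # pretend embedding of "fast animal"

# Rank lines by distance to the query, as a top-k search would
ranked = sorted(corpus, key=lambda line: cosine_distance(query_vec, corpus[line]))
print(ranked[0])  # the cheetah line wins on meaning, not shared keywords
```

Note that "fast animal" shares no keywords with the winning line; proximity in embedding space is what ranks it first.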
## Installation

Prerequisites:

- Python 3.13 or newer.
- For the `parse` tool: a LlamaIndex Cloud API key. Get one for free at Llama Cloud.
Install from PyPI:

```bash
pip install semtools-py
```

This will make the `parse`, `search`, and `workspace` commands available in your shell.
## Quick Start

### Basic Usage

```bash
# Parse some files into a cache directory (~/.semtools/cache/parse)
parse my_dir/*.pdf

# Search some text-based files
search "some keywords" *.txt --top-k 5 --n-lines 7

# Combine parsing and search
# The parse command outputs the paths to the cached markdown files
parse my_docs/*.pdf | xargs search "API endpoints"
```
### Using Workspaces

Workspaces accelerate search by creating a persistent, indexed database of your file embeddings.
```bash
# 1. Create and select a workspace
# Workspaces are stored in ~/.semtools/workspaces/
workspace use my-project-workspace
> Workspace 'my-project-workspace' configured.
> To activate it, run:
>   export SEMTOOLS_WORKSPACE=my-project-workspace
>
> Or add this to your shell profile (.bashrc, .zshrc, etc.)

# 2. Activate the workspace in your shell
export SEMTOOLS_WORKSPACE=my-project-workspace

# 3. Prime the workspace by running an initial search.
# This will embed all specified files and build a vector index.
# This may take some time on the first run.
search "initial query" ./large_codebase/**/*.py --top-k 10

# 4. Subsequent searches are now extremely fast.
# Only new or modified files will be re-embedded.
search "a different query" ./large_codebase/**/*.py --top-k 10

# If you delete files, prune the workspace to remove stale entries
workspace prune

# Check the status of your active workspace
workspace status
> Active workspace: my-project-workspace
> Root: /home/user/.semtools/workspaces/my-project-workspace
> Documents: 1503
> Index: Yes (IVF_PQ)

# Delete a workspace permanently
workspace delete my-project-workspace
```
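The incremental re-embedding behavior in step 4 (only new or modified files are re-processed) can be sketched with content hashing. This is an illustrative sketch of the caching idea, not the actual workspace implementation, which may track changes differently.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content hash used to decide whether a file must be re-embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Simulated workspace store: path -> hash recorded at last indexing run
store = {"notes.md": fingerprint("old contents")}

def needs_reembedding(path: str, current_text: str) -> bool:
    """A file is re-embedded if it is new or its content hash changed."""
    return store.get(path) != fingerprint(current_text)

print(needs_reembedding("notes.md", "old contents"))  # False: unchanged, skip
print(needs_reembedding("notes.md", "new contents"))  # True: re-embed
print(needs_reembedding("added.md", "anything"))      # True: never indexed
```

Hashing file contents (rather than comparing timestamps) means a `touch`ed but unchanged file costs nothing to re-check.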
### Running from Source (Standalone)
If you prefer to run the tools directly from a cloned repository without installing the package globally, you can use an editable install:
```bash
# Clone the repository
git clone https://github.com/your-repo/semtools-py.git
cd semtools-py

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install in editable mode with development dependencies
pip install -e ".[dev]"
```
After this setup, the parse, search, and workspace commands are available directly in your shell (within the activated environment). They point to your local source code, allowing for development and standalone use.
```bash
# Now you can run the commands from your local code
parse ./my_docs/*.pdf --verbose
search "local development" ./**/*.py
```
### Programmatic Usage (as a Library)
You can also use `semtools` directly in your Python code. The core logic is exposed through classes like `Searcher`.
Here is an example of how to perform a search programmatically:
```python
import asyncio
from pathlib import Path

from semtools.search import Searcher

async def main():
    # Create a dummy file to search in
    p = Path("my_document.txt")
    p.write_text("The quick brown fox jumps over the lazy dog.\nAnother line about something else.")

    # Instantiate the searcher
    searcher = Searcher()

    # Perform the search (note that it's an async operation)
    query = "fast animal"
    files = [str(p)]
    results = await searcher.search(query=query, files=files, top_k=1)

    # Process the results
    if results:
        print(f"Found {len(results)} result(s):")
        for result in results:
            print(f"  - Path: {result.path}")
            print(f"    Line: {result.line_number + 1}")  # +1 for 1-based indexing
            print(f"    Distance: {result.distance:.4f}")
    else:
        print("No results found.")

    # Clean up the dummy file
    p.unlink()

if __name__ == "__main__":
    asyncio.run(main())
```
## CLI Help

```text
$ parse --help
Usage: parse [OPTIONS] FILES...

  A CLI tool for parsing documents using various backends

Arguments:
  FILES...  [required]

Options:
  -c, --parse-config TEXT  Path to the config file. Defaults to
                           ~/.semtools/parse_config.json
  -b, --backend TEXT       The backend type to use for parsing. Defaults to
                           `llama-parse`  [default: llama-parse]
  -v, --verbose            Verbose output while parsing
  --help                   Show this message and exit.
```

```text
$ search --help
Usage: search [OPTIONS] QUERY [FILES]...

  A CLI tool for fast semantic keyword search

Arguments:
  QUERY       [required]
  [FILES]...

Options:
  -n, --n-lines INTEGER     How many lines before/after to return as context
                            [default: 3]
  --top-k INTEGER           The top-k files or texts to return (ignored if
                            max_distance is set)  [default: 3]
  -m, --max-distance FLOAT  Return all results with distance below this
                            threshold (0.0+)
  -i, --ignore-case         Perform case-insensitive search (default is false)
  --help                    Show this message and exit.
```

```text
$ workspace --help
Usage: workspace [OPTIONS] COMMAND [ARGS]...

  Manage semtools workspaces

Options:
  --help  Show this message and exit.

Commands:
  delete  Permanently delete a workspace
  prune   Remove stale or missing files from store
  status  Show active workspace and basic stats
  use     Use or create a workspace (prints export command to run)
```
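The interaction between `--top-k` and `--max-distance` described in the `search` help above can be sketched as follows. This is an illustrative model of the documented semantics, not the tool's actual code; the function and variable names are made up.

```python
def select_results(scored, top_k=3, max_distance=None):
    """scored: list of (distance, item) pairs, lower distance = better match.

    Mirrors the documented CLI behavior: when a max_distance threshold is
    given, top_k is ignored and every result under the threshold is
    returned; otherwise only the k best-ranked results come back.
    """
    ranked = sorted(scored)
    if max_distance is not None:
        return [item for dist, item in ranked if dist < max_distance]
    return [item for dist, item in ranked[:top_k]]

scored = [(0.12, "a"), (0.40, "b"), (0.55, "c"), (0.90, "d")]
print(select_results(scored, top_k=2))           # ['a', 'b']
print(select_results(scored, max_distance=0.6))  # ['a', 'b', 'c']
```

A distance threshold is useful when you care about match quality rather than result count: a broad query may legitimately have ten good hits or none.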
## Configuration

The `parse` tool requires a LlamaParse API key. It can be configured in two ways:

1. Environment Variable (Recommended):

   ```bash
   export LLAMA_CLOUD_API_KEY="your_api_key_here"
   ```

2. Configuration File: Create a file at `~/.semtools/parse_config.json`. The tool will load this file if it exists. See `src/semtools/parse/config.py` for all options.
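As an illustration only, a minimal `~/.semtools/parse_config.json` might look like the following. The actual key names and schema are defined in `src/semtools/parse/config.py` and may differ; `api_key` here is an assumption.

```json
{
  "api_key": "your_api_key_here"
}
```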
## Qualitative Benchmark
SemTools includes a qualitative benchmark to evaluate the retrieval performance of the search command against a curated dataset of arXiv research papers.
The benchmark uses a powerful LLM (Google's Gemini) as an "Oracle" to generate complex questions and ground truth answers from a set of source documents. It then executes search for each question and asks the Oracle to synthesize a new answer using only the search results. A final Markdown report is generated comparing the ground truth answer, the search-augmented answer, and retrieval metrics (Precision/Recall).
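The Precision/Recall figures in the report are standard set-based retrieval metrics. A minimal sketch of how such numbers are computed (not the benchmark's actual code; document names are made up):

```python
def precision_recall(retrieved, relevant):
    """Set-based retrieval metrics.

    retrieved: documents the search command returned
    relevant:  ground-truth documents the Oracle marked as needed
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 1 of 3 retrieved docs is relevant; 1 of 2 relevant docs was found
p, r = precision_recall(retrieved={"paper_a", "paper_b", "paper_c"},
                        relevant={"paper_a", "paper_d"})
print(p, r)  # 0.3333... 0.5
```

High recall with low precision means the search found the needed passages but buried them in noise; the Oracle's synthesized answer quality reflects both.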
### Running the Benchmark

1. Get the Source Code: The benchmark scripts are part of the development repository and are not included in the PyPI package. Clone the repository to get the necessary files:

   ```bash
   git clone https://github.com/your-repo/semtools-py.git
   cd semtools-py
   ```

2. Install Dependencies: Ensure you have the development dependencies installed:

   ```bash
   pip install -e ".[dev]"
   ```

3. Set API Key: Set your Gemini API key:

   ```bash
   export GEMINI_API_KEY="your_gemini_api_key"
   ```

4. Download Data: Download the benchmark dataset:

   ```bash
   python benchmarks/arxiv/download_arxiv_files.py
   ```

5. Run: Run the benchmark:

   ```bash
   python benchmarks/arxiv/benchmark.py --mode workspace
   ```

A report file (`benchmark_qualitative_report_workspace.md`) will be created in the `benchmarks/arxiv` directory.
## License
This project is licensed under the MIT License.
## Acknowledgments
- LlamaParse for the powerful document parsing API.
- model2vec for the fast, high-quality local embedding generation.
- LanceDB for the efficient and scalable vector database engine.
- minishlab/potion-multilingual-128M for the excellent open-source embedding model.
File details for `semtools_py-0.1.0.tar.gz` (source distribution):

- Size: 153.6 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3

| Algorithm | Hash digest |
|---|---|
| SHA256 | `5621bda7dd15efec1e39e956afca1e57fa31bb7599dd286ce319b8739590c8c8` |
| MD5 | `f0509426be312bb1a3e8bcd0cb649c09` |
| BLAKE2b-256 | `b53f89719d1945932f3ef57ff796428531817723c3d202f05e9729cf276d7247` |
File details for `semtools_py-0.1.0-py3-none-any.whl` (built distribution, Python 3):

- Size: 24.0 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3cf1798b1e155b28805d68fc094f2bd9d35d86dcaad0766f064c407422862e58` |
| MD5 | `070f0efeaec526b93b9b467e84917180` |
| BLAKE2b-256 | `7f51ec63b3b4a8d4efb4a292964e5f5cb5122b6ecc12b8269e930bdc41f7db5e` |