Semantic search and document parsing tools for the command line

SemTools for Python

A collection of high-performance command-line tools for document processing and semantic search, built in Python. It leverages modern libraries like asyncio for concurrency, lancedb for efficient vector storage, and model2vec for state-of-the-art local embeddings.

  • parse: Parse documents (PDF, DOCX, etc.) into clean markdown using the LlamaParse API, with intelligent caching to avoid re-processing.
  • search: Perform fast, local semantic search on text files. It uses multilingual embeddings to find relevant lines of text based on meaning, not just keywords.
  • workspace: Manage persistent workspaces to accelerate searches over large and evolving collections of documents. Embeddings are stored and indexed, and only changed files are re-processed.
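
In miniature, the search flow is: embed each line, embed the query, and rank lines by cosine distance. Here is a toy sketch with hand-made vectors standing in for the real model2vec embeddings:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; smaller means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hand-made 3-d vectors stand in for real multilingual embeddings.
lines = {
    "The quick brown fox jumps over the lazy dog.": [0.9, 0.1, 0.2],
    "Invoices are due by the end of the month.": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # imagine this is the embedding of "fast animal"

ranked = sorted(lines, key=lambda line: cosine_distance(lines[line], query_vec))
print(ranked[0])  # the line closest in meaning, not in keyword overlap
```

The closest line wins even though it shares no keywords with the query; that is what "based on meaning, not just keywords" buys you.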

Key Features

  • Fast Local Semantic Search: Uses model2vec embeddings (minishlab/potion-multilingual-128M) for high-quality, multilingual semantic search that runs entirely on your machine.
  • Powerful Document Parsing: Integrates with LlamaParse for robust parsing of complex documents like PDFs into structured markdown.
  • Efficient Caching: The parse tool caches results, only re-processing files when their content changes.
  • Persistent Workspaces: The search tool can use workspaces powered by LanceDB to store and index embeddings, making subsequent searches on large file sets nearly instantaneous.
  • Unix-Friendly: Designed to be a good citizen in a Unix-style shell, easily chainable with tools like xargs, grep, and find.
  • Async Powered: Built with Python's asyncio to handle concurrent operations efficiently, especially for parsing multiple documents.
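
The caching behavior can be pictured as a content-hash lookup: hash the file's bytes and reuse the stored result whenever the hash is unchanged. This is a simplified sketch, not the actual on-disk layout under ~/.semtools/cache/parse:

```python
import hashlib
from pathlib import Path

cache: dict[str, str] = {}  # content hash -> parsed markdown

def parse_with_cache(path: Path) -> str:
    """Re-parse only when the file's bytes have changed."""
    key = hashlib.sha256(path.read_bytes()).hexdigest()
    if key not in cache:
        # Stand-in for a real LlamaParse call.
        cache[key] = f"# Parsed {path.name}\n"
    return cache[key]
```

Calling parse_with_cache twice on an unchanged file hits the cache; editing the file yields a new hash and triggers a fresh parse.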

Installation

Prerequisites:

  • Python 3.13 or newer.
  • For the parse tool: A LlamaIndex Cloud API key. Get one for free at Llama Cloud.

Install from PyPI:

pip install semtools-py

This will make the parse, search, and workspace commands available in your shell.

Quick Start

Basic Usage

# Parse some files into a cache directory (~/.semtools/cache/parse)
parse my_dir/*.pdf

# Search some text-based files
search "some keywords" *.txt --top-k 5 --n-lines 7

# Combine parsing and search
# The parse command outputs the paths to the cached markdown files
parse my_docs/*.pdf | xargs search "API endpoints"

Using Workspaces

Workspaces accelerate search by creating a persistent, indexed database of your file embeddings.

# 1. Create and select a workspace
# Workspaces are stored in ~/.semtools/workspaces/
workspace use my-project-workspace
> Workspace 'my-project-workspace' configured.
> To activate it, run:
>   export SEMTOOLS_WORKSPACE=my-project-workspace
>
> Or add this to your shell profile (.bashrc, .zshrc, etc.)

# 2. Activate the workspace in your shell
export SEMTOOLS_WORKSPACE=my-project-workspace

# 3. Prime the workspace by running an initial search.
# This will embed all specified files and build a vector index.
# This may take some time on the first run.
search "initial query" ./large_codebase/**/*.py --top-k 10

# 4. Subsequent searches are now extremely fast.
# Only new or modified files will be re-embedded.
search "a different query" ./large_codebase/**/*.py --top-k 10

# If you delete files, prune the workspace to remove stale entries
workspace prune

# Check the status of your active workspace
workspace status
> Active workspace: my-project-workspace
> Root: /home/user/.semtools/workspaces/my-project-workspace
> Documents: 1503
> Index: Yes (IVF_PQ)

# Delete a workspace permanently
workspace delete my-project-workspace
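
The incremental behavior above (re-embed only new or changed files, prune deleted ones) amounts to diffing the workspace's stored file hashes against the files on disk. A simplified model of that decision, not the actual LanceDB schema:

```python
import hashlib
from pathlib import Path

def plan_update(stored: dict[str, str], files: list[Path]) -> tuple[list[Path], list[str]]:
    """Return (files to (re-)embed, stale store keys to prune).

    `stored` maps file path -> content hash recorded in the workspace.
    """
    on_disk = {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in files}
    to_embed = [Path(k) for k, h in on_disk.items() if stored.get(k) != h]
    to_prune = [k for k in stored if k not in on_disk]
    return to_embed, to_prune
```

Unchanged files fall through both lists, which is why repeat searches over a large tree are nearly instantaneous.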

Running from Source (Standalone)

If you prefer to run the tools directly from a cloned repository without installing the package globally, you can use an editable install:

# Clone the repository
git clone https://github.com/your-repo/semtools-py.git
cd semtools-py

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install in editable mode with development dependencies
pip install -e ".[dev]"

After this setup, the parse, search, and workspace commands are available directly in your shell (within the activated environment). They point to your local source code, allowing for development and standalone use.

# Now you can run the commands from your local code
parse ./my_docs/*.pdf --verbose
search "local development" ./**/*.py

Programmatic Usage (as a Library)

You can also use `semtools` directly in your Python code. The core logic is exposed through classes like `Searcher`.

Here is an example of how to perform a search programmatically:

import asyncio
from pathlib import Path
from semtools.search import Searcher

async def main():
    # Create a dummy file to search in
    p = Path("my_document.txt")
    p.write_text("The quick brown fox jumps over the lazy dog.\nAnother line about something else.")

    # Instantiate the searcher
    searcher = Searcher()

    # Perform the search (note that it's an async operation)
    query = "fast animal"
    files = [str(p)]
    results = await searcher.search(query=query, files=files, top_k=1)

    # Process the results
    if results:
        print(f"Found {len(results)} result(s):")
        for result in results:
            print(f"  - Path: {result.path}")
            print(f"    Line: {result.line_number + 1}")  # +1 for 1-based indexing
            print(f"    Distance: {result.distance:.4f}")
    else:
        print("No results found.")
    
    # Clean up the dummy file
    p.unlink()

if __name__ == "__main__":
    asyncio.run(main())

CLI Help

$ parse --help
Usage: parse [OPTIONS] FILES...

  A CLI tool for parsing documents using various backends

Arguments:
  FILES...  [required]

Options:
  -c, --parse-config TEXT  Path to the config file. Defaults to
                           ~/.semtools/parse_config.json
  -b, --backend TEXT       The backend type to use for parsing. Defaults to
                           `llama-parse`  [default: llama-parse]
  -v, --verbose            Verbose output while parsing
  --help                   Show this message and exit.
$ search --help
Usage: search [OPTIONS] QUERY [FILES]...

  A CLI tool for fast semantic keyword search

Arguments:
  QUERY  [required]
  [FILES]...

Options:
  -n, --n-lines INTEGER   How many lines before/after to return as context
                          [default: 3]
  --top-k INTEGER         The top-k files or texts to return (ignored if
                          max_distance is set)  [default: 3]
  -m, --max-distance FLOAT
                          Return all results with distance below this
                          threshold (0.0+)
  -i, --ignore-case       Perform case-insensitive search (default is false)
  --help                  Show this message and exit.
$ workspace --help
Usage: workspace [OPTIONS] COMMAND [ARGS]...

  Manage semtools workspaces

Options:
  --help  Show this message and exit.

Commands:
  delete  Permanently delete a workspace
  prune   Remove stale or missing files from store
  status  Show active workspace and basic stats
  use     Use or create a workspace (prints export command to run)

Configuration

The parse tool requires a LlamaParse API key. It can be configured in two ways:

  1. Environment Variable (Recommended):

    export LLAMA_CLOUD_API_KEY="your_api_key_here"
    
  2. Configuration File: Create a file at ~/.semtools/parse_config.json. The tool will load this file if it exists. See src/semtools/parse/config.py for all options.

Qualitative Benchmark

SemTools includes a qualitative benchmark to evaluate the retrieval performance of the search command against a curated dataset of arXiv research papers.

The benchmark uses a powerful LLM (Google's Gemini) as an "Oracle" to generate complex questions and ground truth answers from a set of source documents. It then executes search for each question and asks the Oracle to synthesize a new answer using only the search results. A final Markdown report is generated comparing the ground truth answer, the search-augmented answer, and retrieval metrics (Precision/Recall).
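
The retrieval metrics in the report are the standard set-based definitions: precision is the fraction of retrieved chunks that are relevant, and recall is the fraction of relevant chunks that were retrieved. For example:

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall over document/chunk identifiers."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved chunks are relevant; 2 of the 3 relevant chunks were found.
print(precision_recall({"p1", "p2", "p3", "p4"}, {"p1", "p2", "p5"}))
```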

Running the Benchmark

  1. Get the Source Code: The benchmark scripts are part of the development repository and not included in the PyPI package. Clone the repository to get the necessary files:
    git clone https://github.com/your-repo/semtools-py.git
    cd semtools-py
    
  2. Install Dependencies: Ensure you have the development dependencies installed:
    pip install -e ".[dev]"
    
  3. Set API Key: Set your Gemini API key:
    export GEMINI_API_KEY="your_gemini_api_key"
    
  4. Download Data: Download the benchmark dataset:
    python benchmarks/arxiv/download_arxiv_files.py
    
  5. Run: Run the benchmark:
    python benchmarks/arxiv/benchmark.py --mode workspace
    
    A report file (benchmark_qualitative_report_workspace.md) will be created in the benchmarks/arxiv directory.

License

This project is licensed under the MIT License.
