Full-text search with zero thinking.

Roughsearch

A full-text search engine that tries to require as little thinking as possible.

Roughsearch is a lightweight full-text search engine built on DuckDB and BM25. It targets Japanese and English only. You can use it from the CLI or call it directly from Python.

So when you want reasonably good search over a large pile of text files, what do you do? Start by writing a Dockerfile? Spin up an Elasticsearch container? Install a morphological analysis plugin? Fire a huge number of API requests at it? Wait forever for indexing? No. Your life will end first.

Roughsearch gives up on flexibility completely. Fine-grained settings, grand scoring systems, and everything else are gone. This software has exactly one purpose: "search roughly." Run $ pip install roughsearch, and the environment is ready. Point a command at the target directory, and the indexing is done. That is all.

Architecture

Roughsearch indexes loaded documents with the following pipeline:

  1. It reads the text and runs morphological analysis with Sudachi.
  2. It extracts major terms from the result, mainly nouns, verbs, and adjectives.
  3. It indexes the normalized form of each extracted term and a romaji version of that term.

At search time, a match on an original term scores higher than a match on its romaji form, so results are ranked by a weighted best-match score. All of this data is stored in a single .duckdb file, which makes the index highly portable.
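The weighting idea above can be sketched in plain Python. This is a toy illustration, not Roughsearch's implementation: the documents are tokenized by hand instead of by Sudachi, the two weights are invented for the example, and `k1` and `b` are simply common BM25 defaults.

```python
import math
from collections import Counter

# Hypothetical pre-analyzed documents: (original terms, romaji terms).
# A real pipeline would produce these with a morphological analyzer.
DOCS = {
    "doc-001": (["風", "すきとおる"], ["kaze", "sukitooru"]),
    "doc-002": (["夏", "冷たい", "青い", "空"], ["natsu", "tsumetai", "aoi", "sora"]),
}

ORIGINAL_WEIGHT = 1.0  # assumed value for illustration only
ROMAJI_WEIGHT = 0.3    # assumed value for illustration only


def bm25(query_terms, field_index, k1=1.2, b=0.75):
    """Score each document in field_index ({doc_id: [terms]}) with BM25."""
    n_docs = len(field_index)
    avg_len = sum(len(terms) for terms in field_index.values()) / n_docs
    scores = {}
    for doc_id, terms in field_index.items():
        tf = Counter(terms)
        score = 0.0
        for q in query_terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in field_index.values() if q in other)
            if df == 0 or tf[q] == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(terms) / avg_len
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * length_norm)
        scores[doc_id] = score
    return scores


def search(query_terms):
    """Combine per-field BM25 scores, weighting original terms above romaji."""
    original = bm25(query_terms, {d: o for d, (o, _) in DOCS.items()})
    romaji = bm25(query_terms, {d: r for d, (_, r) in DOCS.items()})
    combined = {
        d: ORIGINAL_WEIGHT * original[d] + ROMAJI_WEIGHT * romaji[d] for d in DOCS
    }
    # Highest score first; documents with no match are dropped.
    return sorted(((s, d) for d, s in combined.items() if s > 0), reverse=True)
```

With these assumed weights, `search(["風"])` and `search(["kaze"])` both return doc-001, but the romaji query ends up with the lower score.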

Installation

$ pip install roughsearch

This software requires Python 3.11 or later.

Usage

Embedded Use

When you want to add a simple full-text search engine to your own system.

import roughsearch

with roughsearch.Client("docs.duckdb", language="ja") as rs:
    rs.add("doc-001", title="いろはにほへと", body="あのイーハトーヴォのすきとおった風")
    rs.add("doc-002", title="ちりぬるを", body="夏でも底に冷たさをもつ青いそら")
    rs.reindex()

    results = rs.search("風")
    for hit in results.hits:
        print(hit.score, hit.title, hit.snippet)

CLI Server

When you just want to get it running.
The server exposes a REST API that any frontend can use for search.

$ roughsearch init docs.duckdb --language ja
$ roughsearch add docs.duckdb ./docs
$ roughsearch serve docs.duckdb --port 8080

add, serve, search, and dump normally use the default language saved by init. If needed, you can temporarily override it with --language.
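The fallback rule can be pictured as a small function. This is a sketch of the described behavior only; `resolve_language` and `SUPPORTED_LANGUAGES` are hypothetical names, not part of the Roughsearch API.

```python
SUPPORTED_LANGUAGES = ("ja", "en")


def resolve_language(stored_default, override=None):
    """Return the override if one was passed, otherwise the default saved by init."""
    language = stored_default if override is None else override
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return language
```

For example, `resolve_language("ja")` yields "ja", while `resolve_language("ja", "en")` yields "en" for that one invocation and leaves the stored default alone.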

HTTP Client

When you want to connect to a running Roughsearch server and query it.

import roughsearch

rs = roughsearch.HttpClient("http://localhost:8080")
results = rs.search("空")

CLI Reference

Commands

Command                     Description
init <db_path>              Initialize and create a new database
add <db_path> <path>        Add documents from a directory and rebuild the index
serve <db_path>             Start the REST API server
search <db_path> <query>    Search from the command line
reindex <db_path>           Rebuild the FTS index, for example after adding documents
reanalyze <db_path>         Reanalyze stored documents with the current analyzer and rebuild the index, for example after a software update
dump <db_path>              Print stored documents as JSON to stdout
stats <db_path>             Show the document count
inspect [text]              Analyzer debugging command that prints tokenization results as JSON

Options

init

Option        Default      Description
--language    ja           Database analyzer language (en or ja)

add

Option        Default      Description
--glob        None         Glob pattern for target files such as *.md
--language    None         Temporarily override the language for added documents

serve

Option        Default      Description
--language    None         Temporarily override the default language used by the server
--host        127.0.0.1    Bind address
--port        8080         Port number

search

Option        Default      Description
--language    None         Language filter for the search
--limit       20           Maximum number of results

dump

Option        Default      Description
--language    None         Filter by language
--limit       20           Maximum number of output rows

inspect

Option        Default      Description
--language    ja           Analyzer language
--title       ""           Text to analyze on the title side
--file        None         Read the body from a file. If set, it takes precedence over the positional text argument
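The precedence between --file and the positional text can be illustrated with argparse. This is a sketch that mirrors the documented options, not the actual Roughsearch CLI code; `build_parser` and `resolve_body` are hypothetical names.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser that mirrors the documented inspect options."""
    parser = argparse.ArgumentParser(prog="inspect-sketch")
    parser.add_argument("text", nargs="?", default=None)
    parser.add_argument("--language", default="ja", choices=["ja", "en"])
    parser.add_argument("--title", default="")
    parser.add_argument("--file", default=None)
    return parser


def resolve_body(args):
    """Return the body to analyze: --file wins over the positional text."""
    if args.file is not None:
        return Path(args.file).read_text(encoding="utf-8")
    return args.text or ""
```

Under this sketch, passing both a positional text and --file analyzes the file contents and ignores the positional text.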

Examples

Index and Search a Local Document Directory

$ pip install roughsearch

$ roughsearch init notes.duckdb --language ja
$ roughsearch add notes.duckdb ./notes --glob "*.md"
$ roughsearch search notes.duckdb "ニンジャ"
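Before indexing a large directory, it can help to preview which files a glob pattern would pick up. The helper below is a standalone sketch using pathlib. `preview_matches` is a hypothetical name, and whether Roughsearch itself descends into subdirectories is not stated here, so the `recursive` flag covers both cases.

```python
from pathlib import Path


def preview_matches(root, pattern="*.md", recursive=True):
    """List files under root that match pattern, as sorted relative paths."""
    base = Path(root)
    matches = base.rglob(pattern) if recursive else base.glob(pattern)
    return sorted(p.relative_to(base).as_posix() for p in matches if p.is_file())
```

Calling `preview_matches("./notes")` returns the relative paths of every Markdown file under ./notes, which is a quick sanity check before running the add command.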

Embedded Python Use with Metadata and Filters

import roughsearch
from roughsearch.search.query import SearchFilters, SearchQuery

with roughsearch.Client("notes.duckdb", language="ja") as rs:
    # Store one document together with its metadata and source URI.
    rs.add(
        "note-001",
        title="いろはにほへと",
        body="あのイーハトーヴォのすきとおった風",
        metadata={"tags": ["note", "japanese"], "source": "handbook"},
        source_uri="handbook/note-001.md",
    )
    # Added documents become searchable only after reindexing.
    rs.reindex()

    # Search with a tag filter and highlighted snippets.
    results = rs.search(
        SearchQuery(
            query="風",
            filters=SearchFilters(tags=["note"]),
            highlight=True,
            limit=10,
        )
    )

Start the API Server and Search with curl

$ roughsearch serve docs.duckdb --port 8080 &

$ curl -s -X POST http://localhost:8080/documents \
  -H "Content-Type: application/json" \
  -d '{"id":"1","title":"いろはにほへと","body":"あのイーハトーヴォのすきとおった風"}'

$ curl -s -X POST http://localhost:8080/reindex

$ curl -s -X POST http://localhost:8080/search \
  -H "Content-Type: application/json" \
  -d '{"query":"風","limit":5}' | python -m json.tool
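The search call above can also be made from Python with only the standard library. This is a sketch that reuses the path and JSON fields from the curl example. `build_search_request` and `search` are hypothetical helper names, and the default limit of 20 is borrowed from the CLI option rather than confirmed for the REST API.

```python
import json
import urllib.request


def build_search_request(base_url, query, limit=20):
    """Build the POST /search request without sending it."""
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url, query, limit=20, timeout=10.0):
    """Send the request to a running server and decode the JSON response."""
    request = build_search_request(base_url, query, limit)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the first half testable without a running server.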

Bulk Add

import roughsearch

docs = [
    {"id": "1", "title": "いろはにほへと", "body": "あのイーハトーヴォのすきとおった風"},
    {"id": "2", "title": "ちりぬるを",  "body": "夏でも底に冷たさをもつ青いそら"},
]

with roughsearch.Client("bulk.duckdb") as rs:
    rs.add_documents(docs)
    rs.reindex()
    print(rs.search("風").total)

Output Format

{
  "query": "風",
  "total": 1,
  "hits": [
    {
      "id": "doc-001",
      "score": 8.512,
      "title": "いろはにほへと",
      "snippet": "あのイーハトーヴォのすきとおった<mark>風</mark>",
      "body": "あのイーハトーヴォのすきとおった風",
      "language": "ja",
      "source_uri": null,
      "heading_path": null,
      "parent_id": null,
      "chunk_id": null,
      "metadata": {}
    }
  ]
}
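A response in this shape can be loaded into typed objects on the client side. The dataclasses below are a sketch derived from the sample above, not classes shipped by Roughsearch. The types of the fields shown as null are guesses, and `strip_marks` is a hypothetical helper.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Hit:
    id: str
    score: float
    title: str
    snippet: str
    body: str
    language: str
    source_uri: Optional[str] = None
    heading_path: Any = None  # null in the sample; actual type unknown
    parent_id: Optional[str] = None
    chunk_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


@dataclass
class SearchResponse:
    query: str
    total: int
    hits: list[Hit]


def parse_response(raw):
    """Parse a JSON search response into a SearchResponse."""
    data = json.loads(raw)
    return SearchResponse(
        query=data["query"],
        total=data["total"],
        hits=[Hit(**hit) for hit in data["hits"]],
    )


def strip_marks(snippet):
    """Remove <mark> highlight tags from a snippet."""
    return re.sub(r"</?mark>", "", snippet)


SAMPLE = """
{"query": "風", "total": 1, "hits": [{"id": "doc-001", "score": 8.512,
 "title": "いろはにほへと",
 "snippet": "あのイーハトーヴォのすきとおった<mark>風</mark>",
 "body": "あのイーハトーヴォのすきとおった風", "language": "ja",
 "source_uri": null, "heading_path": null, "parent_id": null,
 "chunk_id": null, "metadata": {}}]}
"""
```

`Hit(**hit)` raises a TypeError if the server returns a field this sketch does not know about, which is a deliberately strict choice for catching format changes early.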

REST API Endpoints

Method    Path               Description
GET       /health            Health check
GET       /stats             Document counts by language
POST      /documents         Add one document
POST      /documents/bulk    Add multiple documents
GET       /documents/{id}    Fetch a document by ID
DELETE    /documents/{id}    Soft-delete a document
POST      /search            Full-text search
POST      /reindex           Rebuild the FTS index
POST      /optimize          Run a DB checkpoint and compaction

Notes

  • Reindexing is required after writes. Documents added with add() are stored immediately, but they will not appear in search results until you call reindex(). This keeps bulk imports fast.
  • Assume a single writer. DuckDB does not allow multiple processes to write to the same database file at the same time. Run one server process and perform only one write operation at a time.
  • The server listens on 127.0.0.1 by default. If you need external access, place it behind a reverse proxy such as nginx.
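One way to respect the first two notes inside a single Python process is to funnel writes through a lock and reindex once per batch. The wrapper below is a sketch under that assumption. `SerializedWriter` and `FakeClient` are hypothetical names, and the only client methods assumed are `add()` and `reindex()` as used in the examples above. It does nothing to coordinate separate processes.

```python
import threading


class SerializedWriter:
    """Serialize writes to a client and reindex only when something changed."""

    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()
        self._dirty = False

    def add(self, doc_id, **fields):
        # Only one thread at a time may write to the underlying client.
        with self._lock:
            self._client.add(doc_id, **fields)
            self._dirty = True

    def flush(self):
        """Reindex if there are unindexed writes. Return True if a reindex ran."""
        with self._lock:
            if not self._dirty:
                return False
            self._client.reindex()
            self._dirty = False
            return True


class FakeClient:
    """Stand-in for a real client so the sketch runs without a database."""

    def __init__(self):
        self.docs = {}
        self.reindex_calls = 0

    def add(self, doc_id, **fields):
        self.docs[doc_id] = fields

    def reindex(self):
        self.reindex_calls += 1
```

In real use, the wrapper would receive a `roughsearch.Client` instead of `FakeClient`, and `flush()` would run once after a batch of `add()` calls so that reindexing happens once rather than per document.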

License

MIT. See LICENSE for details.

Powered by Sudachi, which is licensed under the Apache License 2.0.
