Full-text search with zero thinking.

Roughsearch

A full-text search engine that tries to require as little thinking as possible.

Roughsearch is a lightweight full-text search engine built on DuckDB and BM25. It targets Japanese and English only. You can use it from the CLI or call it directly from Python.

Suppose you want decent search over a large pile of text files. What do you do? Start by writing a Dockerfile? Spin up an Elasticsearch container? Install a morphological analysis plugin? Fire a flood of API requests at it? Wait forever for indexing? No. Your life will end first.

Roughsearch gives up on flexibility entirely. Fine-grained settings, elaborate scoring schemes, and everything of that kind are gone. This software has exactly one purpose: "search roughly." Run pip install roughsearch and the environment is ready. Point a command at the target directory and the indexing is done. That is all.

Architecture

Roughsearch indexes loaded documents with the following pipeline:

  1. It reads the text and runs morphological analysis with Sudachi.
  2. It extracts major terms from the result, mainly nouns, verbs, and adjectives.
  3. It indexes the normalized form of each extracted term and a romaji version of that term.

At search time, matches on the original terms score higher and matches on the romaji forms score lower, so results are ranked by a weighted BM25 score. All of this data is stored in a single .duckdb file, so you can move the index by copying one file.
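The weighting idea can be sketched in plain Python. This is a minimal illustration, not Roughsearch's actual implementation: the weights, the BM25 parameters, and the hand-written token lists (standing in for analyzer output) are all assumptions made for the example.

```python
import math

# Hypothetical weights; the values Roughsearch really uses are not stated here.
TERM_WEIGHT = 1.0
ROMAJI_WEIGHT = 0.3
K1, B = 1.2, 0.75  # widely used BM25 defaults, assumed for this sketch


def bm25_scores(query_tokens, docs):
    """Score each tokenized document in `docs` against `query_tokens` with BM25."""
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n
    scores = [0.0] * n
    for token in query_tokens:
        df = sum(1 for doc in docs if token in doc)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for i, doc in enumerate(docs):
            tf = doc.count(token)  # term frequency in this document
            length_norm = K1 * (1 - B + B * len(doc) / avg_len)
            scores[i] += idf * tf * (K1 + 1) / (tf + length_norm)
    return scores


# Hand-written stand-ins for analyzer output: normalized terms and their romaji.
TERMS = [["イーハトーヴォ", "透き通る", "風"], ["夏", "底", "冷たい", "青い", "空"]]
ROMAJI = [["ihatovo", "sukitoru", "kaze"], ["natsu", "soko", "tsumetai", "aoi", "sora"]]


def weighted_scores(query_terms, query_romaji):
    """Combine both fields so that original-term matches dominate romaji matches."""
    original = bm25_scores(query_terms, TERMS)
    romaji = bm25_scores(query_romaji, ROMAJI)
    return [TERM_WEIGHT * o + ROMAJI_WEIGHT * r for o, r in zip(original, romaji)]


print(weighted_scores(["風"], ["kaze"]))
```

With these assumed weights, a document matched only through its romaji form still appears in the results, but it ranks below one matched on the original term.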

Installation

$ pip install roughsearch

This software requires Python 3.11 or later.

Usage

Embedded Use

When you want to add a simple full-text search engine to your own system.

import roughsearch

with roughsearch.Client("docs.duckdb", language="ja") as rs:
    rs.add("doc-001", title="いろはにほへと", body="あのイーハトーヴォのすきとおった風")
    rs.add("doc-002", title="ちりぬるを", body="夏でも底に冷たさをもつ青いそら")
    rs.reindex()

    results = rs.search("風")
    for hit in results.hits:
        print(hit.score, hit.title, hit.snippet)

CLI Server

When you just want to get it running.
The server exposes a REST API that any frontend can use for search.

$ roughsearch init docs.duckdb --language ja
$ roughsearch add docs.duckdb ./docs
$ roughsearch serve docs.duckdb --port 8080

add, serve, search, and dump normally use the default language saved by init. If needed, you can temporarily override it with --language.

HTTP Client

When you want to connect to a running Roughsearch server and query it.

import roughsearch

rs = roughsearch.HttpClient("http://localhost:8080")
results = rs.search("空")

CLI Reference

Commands

| Command | Description |
| --- | --- |
| init <db_path> | Initialize and create a new database |
| add <db_path> <path> | Add documents from a directory and rebuild the index |
| serve <db_path> | Start the REST API server |
| search <db_path> <query> | Search from the command line |
| reindex <db_path> | Rebuild the FTS index, for example after adding documents |
| reanalyze <db_path> | Reanalyze stored documents with the current analyzer and rebuild the index, for example after a software update |
| dump <db_path> | Print stored documents as JSON to stdout |
| stats <db_path> | Show the document count |
| inspect [text] | Analyzer debugging command that prints tokenization results as JSON |

Options

init

| Option | Default | Description |
| --- | --- | --- |
| --language | ja | Database analyzer language (en or ja) |

add

| Option | Default | Description |
| --- | --- | --- |
| --glob | None | Glob pattern for target files, such as *.md |
| --language | None | Temporarily override the language for added documents |

serve

| Option | Default | Description |
| --- | --- | --- |
| --language | None | Temporarily override the default language used by the server |
| --host | 127.0.0.1 | Bind address |
| --port | 8080 | Port number |

search

| Option | Default | Description |
| --- | --- | --- |
| --language | None | Language filter for the search |
| --limit | 20 | Maximum number of results |

dump

| Option | Default | Description |
| --- | --- | --- |
| --language | None | Filter by language |
| --limit | 20 | Maximum number of output rows |

inspect

| Option | Default | Description |
| --- | --- | --- |
| --language | ja | Analyzer language |
| --title | "" | Text to analyze on the title side |
| --file | None | Read the body from a file. If set, it takes precedence over the positional text argument |

Examples

Index and Search a Local Document Directory

$ pip install roughsearch

$ roughsearch init notes.duckdb --language ja
$ roughsearch add notes.duckdb ./notes --glob "*.md"
$ roughsearch search notes.duckdb "ニンジャ"

Embedded Python Use with Metadata and Filters

import roughsearch
from roughsearch.search.query import SearchQuery, SearchFilters

with roughsearch.Client("notes.duckdb", language="ja") as rs:
    # Store a document together with metadata and its source location.
    rs.add(
        "note-001",
        title="いろはにほへと",
        body="あのイーハトーヴォのすきとおった風",
        metadata={"tags": ["note", "japanese"], "source": "handbook"},
        source_uri="handbook/note-001.md",
    )
    rs.reindex()

    # Search with a tag filter and highlighting enabled.
    results = rs.search(
        SearchQuery(
            query="風",
            filters=SearchFilters(tags=["note"]),
            highlight=True,
            limit=10,
        )
    )

Start the API Server and Search with curl

$ roughsearch serve docs.duckdb --port 8080 &

$ curl -s -X POST http://localhost:8080/documents \
  -H "Content-Type: application/json" \
  -d '{"id":"1","title":"いろはにほへと","body":"あのイーハトーヴォのすきとおった風"}'

$ curl -s -X POST http://localhost:8080/reindex

$ curl -s -X POST http://localhost:8080/search \
  -H "Content-Type: application/json" \
  -d '{"query":"風","limit":5}' | python -m json.tool
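If curl is not available, the same search request can be sent with the Python standard library alone. The sketch below mirrors the curl call above; the helper names build_search_request and search are made up for this example and are not part of Roughsearch.

```python
import json
import urllib.request


def build_search_request(base_url, query, limit=5):
    """Build the same POST /search request as the curl example."""
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url, query, limit=5):
    """Send the request and decode the JSON response body."""
    request = build_search_request(base_url, query, limit)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Usage, assuming a server started with `roughsearch serve docs.duckdb --port 8080`:
#   result = search("http://localhost:8080", "風")
#   print(json.dumps(result, ensure_ascii=False, indent=2))
```

Passing ensure_ascii=False when printing keeps Japanese text readable instead of escaping it as \uXXXX sequences.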

Bulk Add

import roughsearch

docs = [
    {"id": "1", "title": "いろはにほへと", "body": "あのイーハトーヴォのすきとおった風"},
    {"id": "2", "title": "ちりぬるを",  "body": "夏でも底に冷たさをもつ青いそら"},
]

with roughsearch.Client("bulk.duckdb") as rs:
    rs.add_documents(docs)
    rs.reindex()
    print(rs.search("風").total)
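In practice the list of documents usually comes from files on disk. The following is a rough sketch of building such a list with pathlib. The helper load_docs is hypothetical, and using the relative path as the ID and the file stem as the title are choices made for this example, not Roughsearch conventions.

```python
from pathlib import Path


def load_docs(root, pattern="*.md"):
    """Collect matching files under `root` as dicts with id, title, and body."""
    root = Path(root)
    docs = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        docs.append(
            {
                "id": path.relative_to(root).as_posix(),  # assumed ID scheme
                "title": path.stem,                       # assumed title scheme
                "body": path.read_text(encoding="utf-8"),
            }
        )
    return docs


# Usage with the client shown above:
#   with roughsearch.Client("bulk.duckdb") as rs:
#       rs.add_documents(load_docs("./notes"))
#       rs.reindex()
```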

Output Format

{
  "query": "風",
  "total": 1,
  "hits": [
    {
      "id": "doc-001",
      "score": 8.512,
      "title": "いろはにほへと",
      "snippet": "あのイーハトーヴォのすきとおった<mark>風</mark>",
      "body": "あのイーハトーヴォのすきとおった風",
      "language": "ja",
      "source_uri": null,
      "heading_path": null,
      "parent_id": null,
      "chunk_id": null,
      "metadata": {}
    }
  ]
}
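A response in this shape is straightforward to consume from Python. The sketch below parses a trimmed copy of the sample above and strips the <mark> tags for plain-text display. The Hit dataclass and parse_hits are illustrative names for this example only; they are not the types returned by roughsearch.Client.

```python
import json
import re
from dataclasses import dataclass

# Trimmed copy of the sample response shown above.
SAMPLE = """
{
  "query": "風",
  "total": 1,
  "hits": [
    {
      "id": "doc-001",
      "score": 8.512,
      "title": "いろはにほへと",
      "snippet": "あのイーハトーヴォのすきとおった<mark>風</mark>"
    }
  ]
}
"""


@dataclass
class Hit:
    id: str
    score: float
    title: str
    snippet: str

    @property
    def plain_snippet(self) -> str:
        """Snippet with the <mark> highlight tags removed."""
        return re.sub(r"</?mark>", "", self.snippet)


def parse_hits(response_text: str) -> list[Hit]:
    """Turn the JSON response text into a list of Hit objects."""
    data = json.loads(response_text)
    return [Hit(h["id"], h["score"], h["title"], h["snippet"]) for h in data["hits"]]


for hit in parse_hits(SAMPLE):
    print(hit.score, hit.title, hit.plain_snippet)
```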

REST API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /stats | Document counts by language |
| POST | /documents | Add one document |
| POST | /documents/bulk | Add multiple documents |
| GET | /documents/{id} | Fetch a document by ID |
| DELETE | /documents/{id} | Soft-delete a document |
| POST | /search | Full-text search |
| POST | /reindex | Rebuild the FTS index |
| POST | /optimize | Run a DB checkpoint and compaction |

Notes

  • Reindexing is required after writes. Documents added with add() are stored immediately, but they will not appear in search results until you call reindex(). This keeps bulk imports fast. The CLI add command rebuilds the index for you.
  • Assume a single writer. DuckDB does not allow more than one process to write to the same database file at a time. Run one server process and only one write operation at a time.
  • The server listens on 127.0.0.1 by default. If you need external access, place it behind a reverse proxy such as nginx.
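One way to honor the single-writer note inside a multithreaded application is to route every write through a lock and to reindex once per batch. The sketch below assumes a client object with add() and reindex() methods like the ones shown earlier. SerializedWriter and RecordingClient are hypothetical helpers written for this example, and the stand-in client lets the sketch run without Roughsearch installed.

```python
import threading


class SerializedWriter:
    """Hypothetical wrapper that allows only one write batch at a time."""

    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()

    def add_batch(self, docs):
        """Add every document in the batch, then rebuild the index once."""
        with self._lock:
            for doc in docs:
                self._client.add(doc["id"], title=doc["title"], body=doc["body"])
            self._client.reindex()


class RecordingClient:
    """Stand-in for a real client; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def add(self, doc_id, title, body):
        self.calls.append(("add", doc_id))

    def reindex(self):
        self.calls.append(("reindex", None))


def run_demo(workers=4):
    """Write from several threads and return the recorded call order."""
    client = RecordingClient()
    writer = SerializedWriter(client)
    batches = [
        [{"id": f"{w}-{i}", "title": "t", "body": "b"} for i in range(2)]
        for w in range(workers)
    ]
    threads = [threading.Thread(target=writer.add_batch, args=(b,)) for b in batches]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return client.calls


print(run_demo())
```

A lock like this only coordinates threads within one process. Writers in separate processes still need to go through a single server.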

License

MIT. See LICENSE for details.

Powered by Sudachi, which is licensed under the Apache License 2.0.
