Apache Lucene based in-memory local search engine for Python
nlp4j-local-search
Use Apache Lucene from Python without running Elasticsearch, OpenSearch, Solr, or Docker.
nlp4j-local-search is a lightweight in-memory full-text search library for Python.
It allows you to use Apache Lucene-based search functionality directly from Python, without setting up a search server.
This library is designed for:
- NLP experiments
- RAG prototyping
- Local full-text search
- Jupyter Notebook and Google Colab experiments
- Small search applications
- Test code that needs temporary search indexes
Internally, it uses Java and Apache Lucene, but Python users do not need to write Java code.
Why this library?
Elasticsearch, OpenSearch, and Apache Solr are powerful search engines, and they are all built on Apache Lucene.
However, for small experiments, local prototypes, or notebook-based workflows, setting up a full search server can be too heavy.
With nlp4j-local-search, you can create a Lucene-based search index directly inside your Python process.
from nlp4j_local_search import SearchEngine

with SearchEngine("en") as engine:
    engine.add("1", "Developers are searching documents with a local search engine.")
    engine.add("2", "A developer searched many documents yesterday.")
    engine.add("3", "This tool searches local JSON documents.")
    engine.commit()
    for r in engine.search("search"):
        print(r.id, r.body, r.score)
No server.
No Docker.
No external search engine process.
Features
- Python-first API
- Apache Lucene-based full-text search
- In-memory local search
- No Elasticsearch required
- No OpenSearch required
- No Solr required
- No Docker required
- Japanese full-text search
- English full-text search
- JSON document input
- Useful for NLP and RAG experiments
Installation
Note: PyPI release is under preparation.
For now, please install directly from GitHub.
pip install git+https://github.com/oyahiroki/nlp4j-local-search.git
For development:
git clone https://github.com/oyahiroki/nlp4j-local-search.git
cd nlp4j-local-search
pip install -e .
Requirements
- Python 3.8 or later
- Java runtime environment
- jpype1
Quick Start
from nlp4j_local_search import SearchEngine

engine = SearchEngine("ja")
engine.add("1", "東京都は日本の都道府県のひとつです")
engine.add("2", "京都は日本の都市です")
engine.add("3", "京都市には任天堂の本社があります")
engine.commit()
results = engine.search("京都")
for r in results:
    print(r.id, r.body, r.score)
engine.close()
Japanese Analyzer Example: Avoiding Noisy Substring Matches
Japanese text search is different from simple substring matching.
For example, if you search for 京都 using simple substring matching, a sentence containing 東京都 may also match because 東京都 contains the characters 京都.
However, with Japanese full-text analysis, 東京都 and 京都 can be treated as different terms.
from nlp4j_local_search import SearchEngine

with SearchEngine("ja") as engine:
    engine.add("1", "東京都は日本の都道府県のひとつです")
    engine.add("2", "京都は日本の都市です")
    engine.add("3", "京都市には任天堂の本社があります")
    engine.commit()
    for r in engine.search("京都", limit=10):
        print(r.id, r.body, r.score)
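The difference between substring matching and term matching can be illustrated without the library. The following is a minimal sketch in plain Python. The token lists are hand-written assumptions about how a Japanese analyzer might segment each sentence, not actual Lucene output, and `substring_match` and `token_match` are hypothetical helper names.

```python
DOCS = {
    "1": "東京都は日本の都道府県のひとつです",
    "2": "京都は日本の都市です",
    "3": "京都市には任天堂の本社があります",
}

# Assumed segmentation, written by hand for illustration only.
TOKENS = {
    "1": ["東京", "都", "は", "日本", "の", "都道府県", "の", "ひとつ", "です"],
    "2": ["京都", "は", "日本", "の", "都市", "です"],
    "3": ["京都", "市", "に", "は", "任天堂", "の", "本社", "が", "あり", "ます"],
}


def substring_match(query):
    """Return IDs of documents whose raw text contains the query characters."""
    return sorted(doc_id for doc_id, body in DOCS.items() if query in body)


def token_match(query):
    """Return IDs of documents that contain the query as a whole term."""
    return sorted(doc_id for doc_id, tokens in TOKENS.items() if query in tokens)


print(substring_match("京都"))  # ['1', '2', '3']: 東京都 matches by accident
print(token_match("京都"))      # ['2', '3']
```

Under this assumed segmentation, document 1 drops out because 京都 never appears there as a term of its own.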
Recommended Usage
Using `SearchEngine` as a context manager is recommended.
```python
from nlp4j_local_search import SearchEngine

with SearchEngine("ja") as engine:
    engine.add("1", "東京都は日本の都道府県のひとつです")
    engine.add("2", "京都は日本の都市です。")
    engine.add("3", "京都市には任天堂の本社があります")
    engine.add_json({"id": "4", "body": "京都府は広いです"})
    engine.commit()
    for r in engine.search("京都", limit=10):
        print(r.id, r.body, r.score)
```
Example output:
2 京都は日本の都市です。 0.18059490621089935
4 京都府は広いです 0.18059490621089935
3 京都市には任天堂の本社があります 0.16212496161460876
Adding Documents
You can add a document by specifying an ID and body text.
engine.add("1", "Kyoto is a historical city in Japan.")
Adding JSON Documents
You can also add a document as a Python dictionary.
engine.add_json({
    "id": "1",
    "body": "Kyoto is a historical city in Japan."
})
Or as a JSON string.
engine.add_json("""
{
    "id": "2",
    "body": "Osaka is a large city in western Japan."
}
""")
This is useful for NLP workflows where JSON and JSONL are commonly used as intermediate data formats.
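For JSONL input, one possible approach is to parse each line and pass the resulting dictionary to `add_json`. This is a minimal sketch: `add_jsonl` is a hypothetical helper, not part of the library, and it relies only on `add_json` accepting a dictionary as described above.

```python
import json


def add_jsonl(engine, lines):
    """Add one document per non-empty JSONL line and return the number added."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)  # expected shape: {"id": ..., "body": ...}
        engine.add_json(doc)
        count += 1
    return count


# Usage sketch (docs.jsonl is a hypothetical file name):
# with SearchEngine("en") as engine, open("docs.jsonl", encoding="utf-8") as f:
#     add_jsonl(engine, f)
#     engine.commit()
```

Passing an open file works because iterating over a file yields one line at a time.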
Searching
results = engine.search("Kyoto")
You can specify the maximum number of search results.
results = engine.search("Kyoto", limit=10)
Each result has the following attributes:
r.id
r.body
r.score
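Since each result exposes `id`, `body`, and `score`, it is easy to convert results into plain dictionaries for logging or further processing. The sketch below uses stand-in objects with made-up scores so that it runs without a JVM. `results_to_rows` and the `min_score` parameter are hypothetical and not part of the library.

```python
from types import SimpleNamespace


def results_to_rows(results, min_score=0.0):
    """Convert result objects to plain dicts, dropping hits below min_score."""
    return [
        {"id": r.id, "body": r.body, "score": r.score}
        for r in results
        if r.score >= min_score
    ]


# Stand-in results with invented scores; real code would pass
# engine.search("Kyoto") instead.
fake_results = [
    SimpleNamespace(id="1", body="Kyoto is a historical city in Japan.", score=0.42),
    SimpleNamespace(id="2", body="Osaka is a large city in western Japan.", score=0.05),
]

rows = results_to_rows(fake_results, min_score=0.1)
print(rows)
```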
Language Settings
Japanese:
engine = SearchEngine("ja")
English:
engine = SearchEngine("en")
English Analyzer Example
When using SearchEngine("en"), English text is analyzed with an English analyzer.
This means that search can handle common English word variations such as:
- search
- searches
- searched
- searching
It can also handle cases such as:
- document / documents
- Lucene / Lucene's
- uppercase / lowercase differences
This is useful when you want more than simple substring matching.
from nlp4j_local_search import SearchEngine

with SearchEngine("en") as engine:
    engine.add("1", "Developers are searching documents with a local search engine.")
    engine.add("2", "A developer searched many documents yesterday.")
    engine.add("3", "This tool searches local JSON documents.")
    engine.add("4", "Lucene's EnglishAnalyzer is useful for English full-text search.")
    engine.add("5", "The quick brown fox jumps over the lazy dog.")
    engine.commit()

    print("Query: search")
    for r in engine.search("search", limit=10):
        print(r.id, r.body, r.score)

    print("Query: document")
    for r in engine.search("document", limit=10):
        print(r.id, r.body, r.score)

    print("Query: lucene")
    for r in engine.search("lucene", limit=10):
        print(r.id, r.body, r.score)
Unlike simple substring matching, English full-text search can match related word forms such as search, searched, and searching.
This makes it useful for local search, NLP experiments, and search baseline evaluation.
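To build intuition for why different word forms can match, here is a toy normalizer in plain Python. It is only an illustration of the general idea (lowercasing, dropping a possessive, stripping a common suffix). It is not the algorithm Lucene uses, and `normalize` is a hypothetical function name.

```python
def normalize(word):
    """Toy normalizer: lowercase, drop possessive 's, strip one common suffix."""
    w = word.lower()
    if w.endswith("'s"):
        w = w[:-2]
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for word in ["search", "searches", "searched", "searching", "Documents", "Lucene's"]:
    print(word, "->", normalize(word))
```

When the query and the indexed text are normalized the same way, `search` and `searching` end up as the same term, which is why they can match each other. A real analyzer handles many more cases than this sketch.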
Japanese Search Example
For Japanese text, use SearchEngine("ja").
from nlp4j_local_search import SearchEngine

with SearchEngine("ja") as engine:
    engine.add("1", "東京都は日本の都道府県のひとつです")
    engine.add("2", "京都は日本の都市です")
    engine.add("3", "京都市には任天堂の本社があります")
    engine.add("4", "大阪は関西の大都市です")
    engine.commit()
    for r in engine.search("京都", limit=10):
        print(r.id, r.body, r.score)
This is useful when you want to try Japanese full-text search locally without setting up a search server.
Google Colab
nlp4j-local-search can also be used in Google Colab.
!pip install git+https://github.com/oyahiroki/nlp4j-local-search.git
Then:
from nlp4j_local_search import SearchEngine

with SearchEngine("ja") as engine:
    engine.add("1", "東京都は日本の都道府県のひとつです")
    engine.add("2", "京都は日本の都市です")
    engine.add("3", "京都市には任天堂の本社があります")
    engine.add_json({"id": "4", "body": "京都府は広いです"})
    engine.commit()
    results = engine.search("京都", limit=10)
    for r in results:
        print(f"ID: {r.id}, Score: {r.score:.4f}")
        print(f"Body: {r.body}")
        print("-" * 50)
Notes:
- The index is stored in memory.
- If the Colab session is reset, the index will be lost.
- JVM startup may take a few seconds on the first run.
Design Concept
Local Search
This library is not a search server.
You do not need to run:
- Elasticsearch
- OpenSearch
- Solr
- Docker
The search engine runs inside your Python process.
In-Memory Index
By default, the search index is created in memory.
This makes the library useful for:
- Temporary experiments
- Unit tests
- Jupyter Notebook
- Google Colab
- Proof-of-concept development
- Local NLP workflows
The index is not persisted to disk.
Python-First API
Although the internal implementation uses Java and Apache Lucene, the public API is designed for Python users.
engine = SearchEngine("en")
That is enough to start using Lucene-based search from Python.
Use Cases
NLP Experiments
You can quickly create a searchable index from text data, Wikipedia-derived datasets, dictionary data, or intermediate NLP results.
RAG Prototyping
Before building a full RAG system, you can test local keyword search behavior with small or medium-sized datasets.
Search Baseline for Embedding Experiments
When evaluating embedding models, it is often useful to compare vector search results with traditional keyword-based full-text search.
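One simple way to make such a comparison is to measure how many of the top keyword hits also appear among the top vector hits. The sketch below is an illustrative choice of metric, not something the library provides. `overlap_at_k` is a hypothetical helper, and both ID lists are placeholders rather than real search output.

```python
def overlap_at_k(keyword_ids, vector_ids, k):
    """Fraction of the top-k keyword hits that also appear in the top-k vector hits."""
    top_keyword = keyword_ids[:k]
    top_vector = set(vector_ids[:k])
    if not top_keyword:
        return 0.0
    shared = sum(1 for doc_id in top_keyword if doc_id in top_vector)
    return shared / len(top_keyword)


# Placeholder rankings. In practice keyword_ids could come from
# [r.id for r in engine.search(query, limit=3)] and vector_ids from
# whatever embedding-based retriever is being evaluated.
keyword_ids = ["2", "4", "3"]
vector_ids = ["4", "2", "1"]

print(overlap_at_k(keyword_ids, vector_ids, k=3))
```

A low overlap does not mean either ranking is wrong. It only flags queries where the two approaches disagree and are worth inspecting by hand.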
Test Code
Because the index is in memory, you can create and discard search indexes during automated tests.
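A small context manager can make this pattern reusable across tests. This is a sketch under stated assumptions: `temporary_index` and `engine_factory` are hypothetical names, and the helper relies only on the `add`, `commit`, and context-manager behavior shown elsewhere in this document.

```python
from contextlib import contextmanager


@contextmanager
def temporary_index(engine_factory, docs):
    """Yield a committed engine holding docs; the engine is closed on exit."""
    with engine_factory() as engine:
        for doc_id, body in docs:
            engine.add(doc_id, body)
        engine.commit()
        yield engine


# Usage sketch inside a test (requires the library and a JVM):
# from nlp4j_local_search import SearchEngine
# docs = [("1", "This tool searches local JSON documents.")]
# with temporary_index(lambda: SearchEngine("en"), docs) as engine:
#     ids = [r.id for r in engine.search("search")]
# assert "1" in ids
```

Taking a factory instead of an engine instance keeps each test isolated, since every call builds a fresh index that is discarded when the block ends.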
Current Status
This project is currently in an early development stage.
Current focus:
- Simple local full-text search from Python
- Japanese search
- English search
- JSON document input
- In-memory indexing
APIs may change in future versions.
Roadmap
Planned or considered features:
- PyPI release
- Improved Google Colab support
- Vector search
- Aggregation
- JSON Query DSL
- OpenSearch-compatible API
Project Information
Package name:
nlp4j-local-search
Python module name:
nlp4j_local_search
Current version:
0.1.0
License
Apache License 2.0
Author
Hiroki Oya
GitHub:
https://github.com/oyahiroki