Text Search Engine

This project implements a simple text search engine using Python. It processes text files to create a searchable index of sentences, allowing users to perform semantic searches against this indexed data. See the project [final report](docs/Information Retrieval Systems.pdf) for more details.

Installation

Python virtual environment

To set up the project environment, follow these steps:

Clone the project repository or download the project files to your local machine.
Navigate to the project directory.
Create a Python virtual environment in the project directory:
```
pip install virtualenv
python -m virtualenv .venv
```
Activate the virtual environment (macOS/Linux):
```
source .venv/bin/activate
```

Install dependencies

Now that you have a virtual environment, you're ready to install the Python packages and download the language models (spaCy and BERT).

Install the required packages using the requirements.txt file:
```
pip install -r requirements.txt
```
Download the small spaCy language model (for sentence segmentation):
```
python -m spacy download en_core_web_sm
```

Download the small BERT embedding model:
```
python -c 'from sentence_transformers import SentenceTransformer; sbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")'
```

Quick start

You can search an example corpus of nutrition and health documents by running the search_engine.py script.

Search your personal docs

Replace the text files in data/corpus with your own.
Start the command-line search engine with:
```
python search_engine.py --refresh
```

The --refresh flag ensures that a fresh index is created from your documents. Without it, the script may ignore the data/corpus directory and reuse an existing index and corpus from the data/cache directory.
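
As a rough illustration of the behavior described above, here is a minimal sketch of how a refresh flag and a cached index could interact. It is not the project's actual code: the function names and the cache file name `index.pkl` are assumptions made for this example.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Command-line parser with a boolean --refresh switch."""
    parser = argparse.ArgumentParser(description="Sketch of a --refresh style CLI")
    parser.add_argument(
        "--refresh",
        action="store_true",
        help="rebuild the index from the corpus directory",
    )
    return parser


def should_rebuild(refresh: bool, cache_file: Path) -> bool:
    """Rebuild when asked to, or when no cached index exists yet."""
    return refresh or not cache_file.exists()


# Hypothetical cache location; the real file name inside data/cache may differ.
args = build_parser().parse_args(["--refresh"])
print(should_rebuild(args.refresh, Path("data/cache/index.pkl")))
```

Under this logic, a run without --refresh only rebuilds when the cache is missing, which is why edits to data/corpus can go unnoticed until the flag is passed.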

The search_engine.py script first segments the text files into sentences. It then builds an inverted index to provide context for any retrieved information. It also creates embedding vectors and locality-sensitive hashes for experimenting with vector databases and RAG (retrieval-augmented generation). Finally, it lets you enter search requests and returns the top matching sentences along with their filenames and line numbers.
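
To make the indexing steps more concrete, the following is a simplified sketch of a sentence-level inverted index that records each sentence's filename and line number. It is an illustration under stated assumptions, not the code in search_engine.py: the regex splitter stands in for spaCy, the ranking is plain keyword overlap and not embedding similarity, and the sample corpus and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical in-memory corpus: filename -> list of lines.
corpus = {
    "vitamins.txt": ["Vitamin C supports the immune system. Citrus fruit is a good source."],
    "sleep.txt": ["Sleep helps the immune system recover."],
}


def split_sentences(line):
    """Naive regex splitter, a stand-in for spaCy sentence segmentation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", line) if s.strip()]


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    sentences = []            # sentence id -> (filename, line number, text)
    index = defaultdict(set)  # token -> set of sentence ids
    for filename, lines in corpus.items():
        for line_number, line in enumerate(lines, start=1):
            for sentence in split_sentences(line):
                sentence_id = len(sentences)
                sentences.append((filename, line_number, sentence))
                for token in tokenize(sentence):
                    index[token].add(sentence_id)
    return sentences, index


def search(query, sentences, index, top_k=3):
    """Rank sentences by how many distinct query tokens they contain."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for sentence_id in index.get(token, ()):
            scores[sentence_id] += 1
    ranked = sorted(scores, key=lambda sid: (-scores[sid], sid))
    return [sentences[sid] for sid in ranked[:top_k]]


sentences, index = build_index(corpus)
for filename, line_number, text in search("immune system", sentences, index):
    print(f"{filename}:{line_number}: {text}")
```

A semantic search would replace the token-overlap score with a similarity between the query embedding and each sentence embedding, while the (filename, line number) bookkeeping could stay the same.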

Contributing

Contributions to this project are welcome.

License

This project is licensed under the MIT License.
