A tiny search engine for personal use.

winzig

winzig is a tiny search engine designed for personal use that enables users to download and search for posts from their favourite feeds.

This project was heavily inspired by the microsearch project and this article about it.

Motivation

For quite some time, I've been contemplating the idea of creating my own personal search engine. I wanted a tool that could facilitate searching through my personal notes, books, articles, podcast transcripts, and anything else I wished to include. However, I was unsure of how or where to begin until I discovered the microsearch project, which reignited the momentum for the idea in my mind.

This project started as a clone of the microsearch project so I could better understand how it worked. Later, I decided to implement some changes, such as keeping all the data in a SQLite database and building a sort-of inverted index after crawling.

Features

  • Fetch only what you need: winzig optimizes data retrieval by excluding previously fetched content, making sure that only new content is downloaded each time.
  • Async, Async, Async: Both crawling and subsequent data processing operate asynchronously, resulting in lightning-fast performance.
  • Efficient data management with SQLite: All the data is stored in a SQLite database in your home directory.
  • Easy-to-use CLI: The CLI provides simple commands for crawling and searching, along with helpful feedback.
  • Enhanced search speed: The post-crawling processing ensures near-instantaneous search results (see the sketch after this list).
  • TUI (barebones): winzig provides a basic TUI that facilitates an interactive search experience.
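
The post-crawl processing mentioned above is what keeps searches fast: content is turned into an index ahead of time, so a query only needs a lookup. winzig's actual implementation isn't shown here; the following is a minimal, hypothetical sketch of the general idea behind an inverted index, with made-up posts and IDs:

from collections import defaultdict

# Hypothetical posts; in winzig the content lives in the SQLite database.
posts = {
    1: "async databases with sqlalchemy and sqlite",
    2: "building a tiny search engine in python",
}

# Build the inverted index: each term maps to the set of post IDs containing it.
index = defaultdict(set)
for post_id, text in posts.items():
    for term in text.lower().split():
        index[term].add(post_id)

# A query is now just a lookup plus a set intersection.
def search(query: str) -> set[int]:
    matches = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*matches) if matches else set()

print(search("sqlite sqlalchemy"))  # -> {1}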

Installation

You'll need Python >= 3.12 to run winzig.

pip

pip install winzig

pipx

pipx install winzig

Cloning this repository

Clone this repo with git clone:

git clone https://github.com/dnlzrgz/winzig winzig

Or use gh if you prefer:

gh repo clone dnlzrgz/winzig

Then, create a virtual environment inside the winzig directory:

python -m venv venv

Activate the virtual environment:

source venv/bin/activate

And run:

pip install .

Instead of using pip, you can also use Poetry:

poetry install

And now you should be able to run:

winzig --help

Usage

The first time you initiate a crawl, you'll need a file containing a list of feeds to fetch. These feeds will be stored in the SQLite database, so there is no need to provide this file again unless you're adding new feeds. This repository contains a feeds file that you can use. If you want to fetch posts directly instead, you can do so by providing a file with the URLs.
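
The feeds file is a plain-text list of feed URLs, one per line (the same format the jq command further down produces). The URLs below are only illustrative:

https://example.com/blog/feed.xml
https://another-blog.example.org/rss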

Currently, there is no way to manage the feeds or posts added to the database, so if you want to remove some of them, you will need to do it manually. However, it may be simpler to just delete the database and crawl again.
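
If you want to start completely fresh, you can delete the database file from your home directory and crawl again. The exact filename isn't listed here, so locate it first; the commands below assume a Unix-like shell and that the filename contains "winzig":

ls -a ~ | grep -i winzig
rm ~/<database-file>  # replace <database-file> with the file you found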

Crawl

winzig crawl

Feeds

winzig crawl feeds --file="feeds"

Posts

winzig crawl posts --file="posts"
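
Like the feeds file, the posts file is expected to be a plain-text list of URLs, one per line, pointing directly at the posts you want to fetch. The URLs below are only illustrative:

https://example.com/posts/async-sqlite.html
https://another-blog.example.org/posts/tiny-search-engine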

Searching

The following command searches for content matching the provided query and, after a few seconds, returns a list of relevant links.

winzig search --query="async databases with sqlalchemy"

By default, the number of results is 5, but you can change this with the -n flag.

winzig search --query="async databases with sqlalchemy" -n 10

TUI

If you prefer, you can use the TUI to interact with the search engine. The TUI is in its early stages, but it offers basic functionality and a faster search experience than the search command, since the content is indexed once rather than on every search.

winzig tui

More feeds, please

If you're looking to expand your feed collection significantly, you can get a curated list of feeds from the blogs.hn repository with just a couple of commands.

  1. Download the JSON file containing the relevant information from the blogs.hn repository.
curl -sL https://raw.githubusercontent.com/surprisetalk/blogs.hn/main/blogs.json -o hn.json
  2. Extract the feeds using jq. Make sure you have it installed on your system (if not, a Python alternative is sketched after this list).
jq -r '.[] | select(.feed != null) | .feed' hn.json >> urls
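
If you don't have jq, a short Python snippet can do the same extraction. This is only a convenience sketch that mirrors the jq filter above: it reads hn.json and appends every non-null feed URL to the urls file.

import json

# Read the blogs.hn dump downloaded with curl.
with open("hn.json") as f:
    blogs = json.load(f)

# Keep only entries that declare a feed and append them to the urls file.
with open("urls", "a") as out:
    for blog in blogs:
        feed = blog.get("feed")
        if feed:
            out.write(feed + "\n")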

Incorporating feeds from the resulting file will significantly increase the number of requests made. In my experience, fetching posts from each feed, extracting content, and performing other operations can take approximately 20 to 30 minutes, depending on your Internet connection speed. Search will still be pretty fast afterwards.

Roadmap

  • Add a TUI using textual.
  • Build inverted index after crawling.
  • Make the CLI nicer.
  • Improve logging.
  • Improve error handling.
  • Add support for crawling individual posts.
  • Improve TUI.
  • Add tests.
  • Add support for documents like markdown or plain text files.
  • Add support for PDFs and other formats.
  • Add commands to manage the SQLite database.
  • Add support for advanced queries.

Contributing

If you are interested in contributing, please open an issue first. I will try to answer as soon as possible.
