A selfhosted service for indexing and searching personal web history.

Development Status
- 3 - Alpha
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Internet :: WWW/HTTP :: Indexing/Search
Typing
- Typed

Project description

Memoria

A selfhosted service for indexing and searching personal web history.

Memoria ingests URLs from browsing history, then scrapes and indexes the web content to create a personalized search engine.

Sections
🚀 § Running Memoria
⚙️ § Configuration
🧩 § Plugins

Other Documentation
📑 Plugin Development

Running Memoria

To run Memoria you will need an Elasticsearch instance. The "Running With Containers" example will start one for you, or you can deploy one manually and configure Memoria to connect to it. Once Memoria is running via one of the methods below you can access the web interface at http://localhost/.

Running With Python

```
python3 -m pip install .
python3 -m memoria.web --port 80

# Or without installing:
PYTHONPATH=./src python3 -m memoria.web --port 80
```

Notes:

- Your distribution may require that you create a virtual environment to install Python packages.
- Memoria is currently designed to run under Python 3.12. Your mileage may vary attempting to run under Python 3.11.

Running With Containers

Self-contained Compose (including an Elasticsearch instance):

```
# With Docker Compose or Podman Compose:
podman-compose --profile elasticsearch up

# Cleanup:
podman-compose down --volumes
```

Single Docker container (for use with an existing Elasticsearch instance):

```
# Build or pull
podman build -t ghcr.io/sidneys1/memoria .
podman pull ghcr.io/sidneys1/memoria

# With plain Docker or Podman (publishes container port 80 on host port 80)
podman run --name memoria -e MEMORIA_ELASTIC_HOST=http://hostname:9200/ -p 80:80 ghcr.io/sidneys1/memoria

# Cleanup:
podman container rm memoria
podman image rm ghcr.io/sidneys1/memoria
```

Note that Podman commands may require sudo unless you configure your Podman environment to run rootless.

Advanced Container Deployment

You can deploy Memoria as a container. The provided Containerfile builds a lightweight image based on python:3.12-alpine, which runs Memoria under Uvicorn on the exposed port 80.

```
podman build -t sidneys1/memoria .
```

You can also deploy Memoria with Docker Compose or Podman Compose (as shown here).

The file compose.yaml shows the most basic Compose strategy, building and launching a Memoria container. You can use Memoria with an existing Elasticsearch instance like so[^1]:

```
export ELASTIC_HOST=http://hostname:9200/
podman-compose up --build
```

[^1]: See §Configuration for more environment variables.

A Compose profile named elasticsearch is also provided that will additionally launch an Elasticsearch container.

```
# To start self-contained. See notes below regarding default credentials.
podman-compose --profile elasticsearch up --build
```

> [!NOTE]
> Currently the only way to import browser history is by uploading a browser history database on the Settings page. More import strategies are coming soon™.

Configuration

Allow and Deny Lists

Memoria utilizes allow and deny lists to filter incoming history items so that unwanted websites aren't indexed. These lists are currently just text files containing one rule per line.

Shell-like quotation marks and backslashes are supported. Given the entries that pertain to its domain name, Memoria downloads a history item if its URL is:

- Matched by any pertaining strong allowlist entry; or
- Matched by any pertaining weak allowlist entry, and not matched by any pertaining strong denylist entry.

Additionally, if a subdomain is not matched by any entries then its parent domains will be used sequentially. For example, if gist.github.com doesn't match any entries, then entries for github.com will be checked.

A weak list entry is composed of just a domain name:

example.com

While a strong list entry is composed of a domain name and zero or more rules that can further restrict the entry:

example.com /login r^/$

There are currently two types of rules:

- Path rules start with / and match if the path part of the URL begins with this value.
- Regular expression rules start with r and match if any part of the URL matches.

So, to break it down, putting example.com in the allowlist and this entry in the denylist:

`example.com /login r^/$`

Here example.com is the domain, /login is a path rule, and r^/$ is a regex rule.

Would result in these URLs being allowed:

https://example.com/foo
https://example.com/foo/bar/baz#link?search=bat

And these URLs being denied:

https://www.example.com/login (path rule)
https://www.example.com/login/flow2?step=0 (path rule)
https://example.com/ (regex rule)
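
The matching rules described above can be sketched in a few lines of Python. This is a minimal illustration, not Memoria's actual implementation: the names `Entry`, `parse_list`, and `is_allowed` are hypothetical, and several details are assumptions (lines starting with `#` are skipped, regex rules are searched against the path plus query string, the parent-domain fallback triggers only when no entry names the current domain, and URLs with no pertaining entries are not downloaded).

```python
import re
import shlex
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Entry:
    """One allowlist or denylist line (hypothetical structure)."""
    domain: str
    rules: tuple  # empty tuple means a weak entry

    @property
    def strong(self) -> bool:
        return bool(self.rules)

    def matches(self, path: str, path_and_query: str) -> bool:
        """True if any rule matches; a weak entry matches every URL on its domain."""
        if not self.rules:
            return True
        for rule in self.rules:
            if rule.startswith("/") and path.startswith(rule):
                return True  # path rule
            if rule.startswith("r") and re.search(rule[1:], path_and_query):
                return True  # regex rule (assumed to search path + query)
        return False


def parse_list(text: str) -> list:
    """Parse one rule per line using shell-like quoting."""
    entries = []
    for line in text.splitlines():
        # comments=True skips '#' comments; comment support is an assumption.
        tokens = shlex.split(line, comments=True)
        if tokens:
            entries.append(Entry(tokens[0].lower(), tuple(tokens[1:])))
    return entries


def is_allowed(url: str, allowlist: list, denylist: list) -> bool:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    path_and_query = path + (f"?{parts.query}" if parts.query else "")

    labels = host.split(".")
    for i in range(len(labels)):
        domain = ".".join(labels[i:])
        allow = [e for e in allowlist if e.domain == domain]
        deny = [e for e in denylist if e.domain == domain]
        if not allow and not deny:
            continue  # no entries for this name; fall back to the parent domain
        if any(e.strong and e.matches(path, path_and_query) for e in allow):
            return True
        weak_allow = any(not e.strong for e in allow)
        strong_deny = any(e.strong and e.matches(path, path_and_query) for e in deny)
        return weak_allow and not strong_deny
    return False  # assumption: URLs with no pertaining entries are skipped


allow = parse_list("example.com")
deny = parse_list("example.com /login r^/$")
print(is_allowed("https://example.com/foo", allow, deny))        # True
print(is_allowed("https://www.example.com/login", allow, deny))  # False
print(is_allowed("https://example.com/", allow, deny))           # False
```

Under these assumptions the sketch reproduces the allowed and denied URLs listed above, including the fallback from www.example.com to the example.com entries.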

Examples

Allow all URLs under GitHub.com, except login, search, my (Sidneys1) own projects and pages, and searches within projects or organizations:
```
# allowlist.txt
github.com

# denylist.txt
github.com /login /search /Sidneys1/ 'r/(?:search|repositories|issues)\?q='
```
Allow any page under a domain except the landing page (example.com/):
```
# allowlist.txt
example.com

# denylist.txt
example.com r^/$
```

Deny any page at stackoverflow.com except questions:
```
# allowlist.txt
stackoverflow.com /questions/ /q/

# denylist.txt
stackoverflow.com
```

Options

Memoria has several deployment configuration options that control overall behavior. These can be set via environment variables or container secrets. The following configuration options are provided:

| | Name | Description | Default |
|---|---|---|---|
| Importing | `downloader` | The downloader plugin (see §Plugins) to use | `AiohttpDownloader` |
| | `extractor` | The extractor plugin (see §Plugins) to use | `HtmlExtractor` |
| | `filter_stack` | A list of filter plugins (see §Plugins) to use | `["HtmlContentFinder"]` |
| | `import_threads` | The maximum number of processes to use to download history items | $\frac{cpus}{2}$[^2] |
| Allow/Deny Lists | `allowlist` | Path to a file defining allowlist entries (see §Allow and Deny Lists) | `./data/allowlist.txt` |
| | `denylist` | Path to a file defining denylist entries (see §Allow and Deny Lists) | `./data/denylist.txt` |
| Databases | `database_uri` | Connection URI to the Memoria database | `sqlite+aiosqlite:///./data/memoria.db` |
| | `elastic_host` | Elasticsearch connection URI | `http://elasticsearch:9200` |
| | `elastic_user` | Elasticsearch Authentication | `elastic` |
| | `elastic_password` | Elasticsearch Authentication | None |

[^2]: Or 1 if CPU count cannot be determined.
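
As a rough illustration of the `import_threads` default, the calculation might look like the following. The function name is hypothetical, and rounding down with a floor of 1 is an assumption; the source only states "cpus / 2, or 1 if the CPU count cannot be determined".

```python
import os
from typing import Optional


def default_import_threads(cpu_count: Optional[int]) -> int:
    """Half the CPU count, or 1 when the count is unknown."""
    if not cpu_count:
        return 1
    # Integer division and the minimum of 1 are assumptions.
    return max(1, cpu_count // 2)


# os.cpu_count() returns None when the count cannot be determined.
print(default_import_threads(os.cpu_count()))
```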

Any of these settings can be configured with uppercase environment variables prefixed with MEMORIA_ (e.g., MEMORIA_ELASTIC_PASSWORD). Additionally, settings can be read from files in /run/secrets[^3], which take precedence over any environment variables. For example, to set elastic_password with a Docker or Podman secret, you can run:

```
printf 'my-password-here' | podman secret create memoria_elastic_password -
podman run --name memoria --secret memoria_elastic_password -p 80:80 sidneys1/memoria
```

[^3]: The secrets directory can be overridden with the SECRETS_DIR environment variable.
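
The lookup order described above (secret file first, then environment variable, then default) can be sketched as follows. This is only an illustration under stated assumptions: `resolve_setting` is a hypothetical helper, not part of Memoria, and the secret file name (`memoria_` plus the lowercase setting name) and the stripping of surrounding whitespace are inferred from the `podman secret create` example rather than confirmed by the source.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    default: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    secrets_dir: Optional[str] = None,
) -> Optional[str]:
    """Resolve one setting: secret file, then environment variable, then default."""
    env = os.environ if environ is None else environ
    key = f"MEMORIA_{name.upper()}"

    # The secrets directory defaults to /run/secrets and can be overridden
    # with the SECRETS_DIR environment variable.
    directory = Path(secrets_dir or env.get("SECRETS_DIR", "/run/secrets"))

    # Assumed file name, e.g. /run/secrets/memoria_elastic_password.
    secret_file = directory / key.lower()
    if secret_file.is_file():
        # Stripping whitespace is an assumption.
        return secret_file.read_text(encoding="utf-8").strip()

    return env.get(key, default)


# Falls back to the documented default when nothing else is set.
print(resolve_setting("elastic_user", default="elastic", environ={}))
```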

Plugins

Memoria utilizes a plugin architecture that allows for different methods of downloading URLs, transforming the downloaded content, and extracting indexable plain text from the content. Selecting which plugins Memoria will use is described in §Configuration.

There are currently three types of Memoria Plugins, used during web content retrieval and processing:

Downloaders
Downloaders are responsible for accessing a URL and retrieving its content from the internet. They can provide this content in many different formats to the next plugin in the stack. The most basic Downloaders (like the built-in default, AiohttpDownloader) only support downloading raw HTML to provide to the remaining plugins.
Filters
Filters transform input from the previous plugin in the stack (either the Downloader or another Filter). They can change the content format or modify it in place.

By default Memoria uses the built-in HtmlContentFinder plugin to remove extraneous HTML elements and prune the input to a single <main>, <article>, or <... id="content"> element (if one exists).
Extractors
Extractors are the last plugin to run, and are responsible for converting the input from the previous plugin (either the Downloader or the last Filter) into plain text that will be stored in Elasticsearch for indexing and searching.

By default Memoria uses the built-in HtmlExtractor plugin to convert the input HTML into plain text. It also searches the original downloaded HTML (before any potential modification by Filter plugins) for <meta ...> values that could be used to enrich the Elasticsearch document, such as "author" or "description".
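
To make the order of the three plugin types concrete, here is a toy pipeline using plain functions. It does not use Memoria's real plugin interfaces (those are covered in the Plugin Development guide); every name below is a hypothetical stand-in, the downloader returns canned HTML instead of fetching anything, and the regex-based filter is a deliberate simplification of what HtmlContentFinder is described as doing.

```python
import re
from html.parser import HTMLParser
from typing import Callable, Iterable

# Hypothetical signatures, for illustration only.
Downloader = Callable[[str], str]  # URL -> raw HTML
Filter = Callable[[str], str]      # HTML -> HTML
Extractor = Callable[[str], str]   # HTML -> plain text


def fake_downloader(url: str) -> str:
    """Stand-in for a real downloader; returns fixed HTML."""
    return (
        "<html><body><nav>Menu</nav>"
        "<main><h1>Title</h1><p>Hello world</p></main>"
        "</body></html>"
    )


def main_only_filter(html: str) -> str:
    """Prune the input to its <main> element, if one exists."""
    match = re.search(r"<main\b[^>]*>.*?</main>", html, flags=re.S | re.I)
    return match.group(0) if match else html


class _TextCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def text_extractor(html: str) -> str:
    """Convert HTML into plain text by collecting its text nodes."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


def run_pipeline(
    url: str,
    downloader: Downloader,
    filters: Iterable[Filter],
    extractor: Extractor,
) -> str:
    content = downloader(url)      # 1. Downloader retrieves the content
    for apply_filter in filters:   # 2. Filters transform it in order
        content = apply_filter(content)
    return extractor(content)      # 3. Extractor produces indexable text


print(run_pipeline("https://example.com/", fake_downloader, [main_only_filter], text_extractor))
# Title Hello world
```

With an empty filter list the navigation text is kept as well, which shows why a content-pruning filter runs before extraction.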

> [!TIP]
> See the 📑 Plugin Development guide for information on developing your own Memoria plugins.

Release history

- 0.2: Jul 2, 2024
- 0.1: Jun 11, 2024
- 0.1b0 (pre-release): Jun 4, 2024
- 0.1a0 (pre-release, this version): Jun 4, 2024

Download files

Source Distribution

memoria_search-0.1a0.tar.gz (2.1 MB)

Uploaded Jun 4, 2024 Source

Built Distribution

memoria_search-0.1a0-py3-none-any.whl (2.1 MB)

Uploaded Jun 4, 2024 Python 3

Hashes for memoria_search-0.1a0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | `99164d3ac077f5dd3eeb28a3bc42df7fd41fea7e33b4a814eab54a303c300d4c` |
| MD5 | `e2f26ce3eccc4e747cf781b1f54fe019` |
| BLAKE2b-256 | `ebd8fe675a9cc6271514d5d7d1885697b2bb446003ae754eac4823c44c8812b6` |

Hashes for memoria_search-0.1a0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f232ac6bdc77677ba0a2e4040f2527ccad3afa2b8b6b476af3407d834e9f283e` |
| MD5 | `566bf677ce2cd35aa6fb29e643836daf` |
| BLAKE2b-256 | `d9100f52484b792ab0c23433f7141f8dadf6d96ed1c83223b58326f4cbaee57a` |
