A configurable local chatbot library with lightweight memory indexing.

llama-simple-chat-bot

A small Python library for configurable local chatbots.

llama-simple-chat-bot runs an open-source language model on the local machine and adds a lightweight persistent memory index. It is designed for Windows and Linux on amd64 machines, and it does not require a GPU. The default runtime path uses GGUF models through llama-cpp-python with n_gpu_layers set to 0.

[!NOTE] The library does not call remote OpenAI APIs. Model inference is local, while memory indexing is handled with lightweight files on disk.

Features

  • JSON bot profiles: name, description, personality, birthday, skills, species, model settings, and memory directory.
  • Local LLM backends: llama_cpp_python, llama_cpp_cli, and a deterministic echo backend for tests.
  • Persistent memory: every exchange is logged, split into segments, summarized, indexed, and searched during future conversations.
  • Associative recall: related past segments can be injected into the prompt as context before the model answers.
  • CLI and Python API.
  • No required third-party dependencies for the core package. Local inference is available through the optional local extra.

Recommended Local Models

These presets are small enough to run locally as GGUF files. The larger ones are slower on CPU but give more reliable chat:

  • qwen2.5-0.5b-instruct-q4_k_m: about 491 MB, multilingual, good default for Chinese and English.
  • qwen2.5-1.5b-instruct-q4_k_m: about 1120 MB, much better than 0.5B at keeping a stable identity, making use of recalled memory, and ordinary chat quality.
  • qwen2.5-3b-instruct-q4_k_m: about 2100 MB, a stronger choice for roleplay, Chinese chat, and basic reasoning if you can accept slower CPU inference.
  • smollm2-360m-instruct-q4_k_m: about 271 MB, very small and fast for quick experiments.

The project can also use any local GGUF file supported by llama.cpp.

[!TIP] If you care about actual chat quality, start with qwen2.5-1.5b-instruct-q4_k_m. Use qwen2.5-3b-instruct-q4_k_m when role consistency and answer quality matter more than speed. Keep qwen2.5-0.5b-instruct-q4_k_m for lightweight testing, and smollm2-360m-instruct-q4_k_m only for very small experiments.

Setup

Create a virtual environment before installing optional local inference dependencies.

[!IMPORTANT] Install optional dependencies inside a virtual environment. The project does not require modifying your base Python environment.

The recommended CPU-only path installs the core package first. The first real local-model run then automatically installs the prebuilt CPU llama-cpp-python wheel if it is missing. On Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -U pip
python -m pip install -e .

On Linux:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

[!TIP] The automatic installer runs python -m pip install -r requirements-local-cpu.txt. That requirements file uses https://abetlen.github.io/llama-cpp-python/whl/cpu plus --only-binary llama-cpp-python. If no compatible wheel exists for your Python version and platform, pip fails instead of starting a slow local build.
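The check-then-install behavior described in the tip can be approximated as follows. This is a minimal sketch, not the library's actual installer: the helper names llama_cpp_missing and cpu_wheel_install_command are hypothetical. It relies only on llama_cpp being the import name of llama-cpp-python and on the requirements file name quoted above, and it builds the pip command without running it.

```python
import importlib.util
import sys


def llama_cpp_missing() -> bool:
    """Return True when the llama_cpp module cannot be found in this environment."""
    return importlib.util.find_spec("llama_cpp") is None


def cpu_wheel_install_command(requirements="requirements-local-cpu.txt"):
    """Build (but do not run) the pip command quoted in the tip above."""
    # sys.executable keeps the install inside the active virtual environment.
    return [sys.executable, "-m", "pip", "install", "-r", requirements]


if llama_cpp_missing():
    print("Would run:", " ".join(cpu_wheel_install_command()))
else:
    print("llama-cpp-python is already installed.")
```

Using sys.executable rather than a bare pip call is a design choice of this sketch: it targets the interpreter that is currently running, which is the one inside the virtual environment when it is active.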

If the CPU wheel is unavailable on your machine, either install llama.cpp separately and set "backend": "llama_cpp_cli" in the config, or use python -m pip install -e '.[local]' when you intentionally want to build llama-cpp-python from source.

[!NOTE] The local extra is kept for packaging compatibility, but the documented quick path uses requirements-local-cpu.txt because pip dependency metadata cannot store a custom wheel index URL.

[!WARNING] CPU-only inference is usable with small GGUF models, but it is still slower than GPU inference. Keep model.n_gpu_layers at 0 when the machine has no compatible GPU.

Quick Start

Write an example config:

llama-simple-chat-bot init-config examples/my_bot.json

List model presets:

llama-simple-chat-bot models

Download a small GGUF model:

llama-simple-chat-bot download-model qwen2.5-0.5b-instruct-q4_k_m --models-dir models

For a better local chat model:

llama-simple-chat-bot download-model qwen2.5-1.5b-instruct-q4_k_m --models-dir models

Or a stronger 3B preset:

llama-simple-chat-bot download-model qwen2.5-3b-instruct-q4_k_m --models-dir models

Start chatting:

llama-simple-chat-bot chat --config examples/my_bot.json

Shortest Start

[!TIP] Use this section when you only want the shortest path from a fresh checkout to a running bot.

For a real local model run on Linux:

python -m venv .venv
source .venv/bin/activate
python -m pip install -e .
llama-simple-chat-bot init-config bot.json
llama-simple-chat-bot download-model qwen2.5-0.5b-instruct-q4_k_m --models-dir models
llama-simple-chat-bot chat --config bot.json

On Windows PowerShell, use:

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -e .
llama-simple-chat-bot init-config bot.json
llama-simple-chat-bot download-model qwen2.5-0.5b-instruct-q4_k_m --models-dir models
llama-simple-chat-bot chat --config bot.json

If you want a more usable default on CPU, replace the download step with:

llama-simple-chat-bot download-model qwen2.5-1.5b-instruct-q4_k_m --models-dir models

For a dependency-free smoke test that does not load a model:

python -m pip install -e .
llama-simple-chat-bot ask --config examples/echo_config.json "hello"

[!NOTE] The echo backend is only a deterministic smoke-test backend. It verifies the CLI, config loading, and memory plumbing without loading a language model.

Example Profiles

The examples/ directory includes ready-to-edit bot profiles, such as examples/bot_config.json, examples/echo_config.json, and examples/nekomimi_config.json.

Run any profile with:

llama-simple-chat-bot chat --config examples/nekomimi_config.json

To debug memory retrieval while chatting, add --verbose:

llama-simple-chat-bot chat --config examples/nekomimi_config.json --verbose

Verbose mode prints diagnostics before the model starts generating: the memory index path and detected encoding, the query terms, whether the turn used recent-overview or keyword retrieval, scored candidate segments, selected prompt hits, and the recalled memory block injected into the model context. The same diagnostics are available for one-shot asks:

llama-simple-chat-bot ask --config examples/my_bot.json --verbose "What do you remember about Python packaging?"

Send one message:

llama-simple-chat-bot ask --config examples/my_bot.json "What do you remember about me?"

Search memory without loading the model:

llama-simple-chat-bot memory-search --config examples/my_bot.json "Python packaging"

JSON Config

See examples/bot_config.json.

Important fields:

  • name, description, personality, birthday, skills, and species are injected at runtime as authoritative system rules, so the bot knows its configured identity.
  • memory_dir controls where index.json and segment .jsonl files are stored.
  • model.backend selects llama_cpp_python, llama_cpp_cli, or echo.
  • model.model_path points to a local GGUF file.
  • Preset downloads are available through llama-simple-chat-bot download-model for qwen2.5-0.5b-instruct-q4_k_m, qwen2.5-1.5b-instruct-q4_k_m, qwen2.5-3b-instruct-q4_k_m, and smollm2-360m-instruct-q4_k_m.
  • model.n_gpu_layers defaults to 0, which keeps inference on CPU.
  • system_rules is the place to enforce speech style and role constraints, such as asking a catgirl profile to end each reply naturally with a signature catchphrase.
  • memory.segment_exchange_limit controls when a new memory segment starts.
  • memory.summary_mode can be extractive or llm. extractive is faster; llm asks the local model to rewrite the segment summary.
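The fields above could be assembled into a config file along these lines. This is a sketch under stated assumptions: the nesting of the model and memory objects is inferred from the dotted field names, system_rules is assumed to be a list of strings, and the value 8 for segment_exchange_limit is illustrative rather than a documented default. Compare it with examples/bot_config.json before relying on it.

```python
import json

# Field names follow the list above; every value is illustrative.
config = {
    "name": "Mira",
    "description": "A practical local assistant with persistent memory.",
    "personality": "warm, curious, and concise",
    "birthday": "2026-06-02",
    "skills": ["Python", "summarization"],
    "species": "local digital companion",
    "memory_dir": "./memory/mira",
    # Assumption: system_rules is a list of plain-text rules.
    "system_rules": ["Answer in short paragraphs."],
    "model": {
        "backend": "llama_cpp_python",
        "model_path": "./models/qwen2.5-0.5b-instruct-q4_k_m.gguf",
        "n_gpu_layers": 0,  # 0 keeps inference on CPU
    },
    "memory": {
        "segment_exchange_limit": 8,  # assumption: not a documented default
        "summary_mode": "extractive",
    },
}

# ensure_ascii=False keeps non-ASCII text (for example Chinese) readable in the file.
text = json.dumps(config, indent=2, ensure_ascii=False)
print(text)
```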

Relative paths inside a config file are resolved relative to that config file. JSON config files can be encoded as UTF-8, UTF-8 with BOM, GB2312, or GBK. Memory index.json and segment .jsonl files are read with the same encoding fallbacks and are written back as UTF-8.

[!NOTE] GB2312 and GBK support is intended for Chinese JSON config files produced by older Windows editors or tooling.
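An encoding fallback of this kind can be sketched with the standard library alone. The function names and the order in which encodings are tried are assumptions made for illustration, not the library's implementation. The utf-8-sig codec accepts UTF-8 both with and without a BOM, so it covers the first two cases in one step.

```python
import json
import tempfile
from pathlib import Path

# Assumption: encodings are tried in this order until one decodes cleanly.
FALLBACK_ENCODINGS = ("utf-8-sig", "gb2312", "gbk")


def load_json_with_fallback(path):
    """Parse a JSON file and report which encoding decoded it."""
    raw = Path(path).read_bytes()
    for encoding in FALLBACK_ENCODINGS:
        try:
            return json.loads(raw.decode(encoding)), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"{path}: could not decode with any of {FALLBACK_ENCODINGS}")


def save_json_utf8(path, data):
    """Write JSON back as UTF-8, whatever encoding it was read with."""
    Path(path).write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


# Demo: a GBK-encoded config is decoded, then rewritten as UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "bot.json"
    cfg_path.write_bytes('{"name": "你好"}'.encode("gbk"))
    data, used = load_json_with_fallback(cfg_path)
    save_json_utf8(cfg_path, data)
    _, used_after = load_json_with_fallback(cfg_path)

print(data["name"], used, used_after)
```

Byte-level fallback is a heuristic: some GBK byte sequences also happen to be valid UTF-8, so a legacy file can occasionally decode without error into the wrong characters. Saving configs as UTF-8 avoids the ambiguity.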

Python API

from llama_simple_chat_bot import BotConfig, ChatBot

config = BotConfig.from_file("examples/bot_config.json")
bot = ChatBot(config)

reply = bot.ask("Remember that I prefer SQLite for small apps.")
print(reply)

for hit in bot.search_memory("SQLite"):
    print(hit.summary)

You can also build the config in code:

from llama_simple_chat_bot import BotConfig, ChatBot, MemoryConfig, ModelConfig

config = BotConfig(
    name="Mira",
    description="A practical local assistant with persistent memory.",
    personality="warm, curious, and concise",
    birthday="2026-06-02",
    skills=["Python", "summarization"],
    species="local digital companion",
    memory_dir="./memory/mira",
    model=ModelConfig(
        backend="llama_cpp_python",
        model_path="./models/qwen2.5-0.5b-instruct-q4_k_m.gguf",
        chat_format="chatml",
        n_gpu_layers=0,
    ),
    memory=MemoryConfig(summary_mode="extractive"),
)

bot = ChatBot(config)
print(bot.ask("Hello."))

Memory Layout

The configured memory directory contains:

  • index.json: all conversation log entries plus segment metadata, summaries, keywords, and log file references.
  • segments/*.jsonl: append-only per-segment logs.
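Append-only per-segment logging with a rollover limit can be sketched as below. The class name SegmentLog, the record fields, and the segment_0001.jsonl file naming are hypothetical. Only the segments/*.jsonl layout and the idea of an exchange limit per segment come from this README.

```python
import json
import tempfile
from pathlib import Path


class SegmentLog:
    """Append exchanges to segments/*.jsonl, starting a new file at a fixed limit."""

    def __init__(self, memory_dir, exchange_limit=3):
        self.segments_dir = Path(memory_dir) / "segments"
        self.segments_dir.mkdir(parents=True, exist_ok=True)
        self.exchange_limit = exchange_limit
        self.segment_no = 1
        self.count = 0  # exchanges written to the current segment

    def append(self, user, bot):
        # Roll over to a new segment once the current one is full.
        if self.count >= self.exchange_limit:
            self.segment_no += 1
            self.count = 0
        path = self.segments_dir / f"segment_{self.segment_no:04d}.jsonl"
        # Mode "a" only ever adds lines, so earlier records are never rewritten.
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"user": user, "bot": bot}, ensure_ascii=False) + "\n")
        self.count += 1
        return path


# Demo: five exchanges with a limit of two per segment.
with tempfile.TemporaryDirectory() as tmp:
    log = SegmentLog(tmp, exchange_limit=2)
    for i in range(5):
        log.append(f"question {i}", f"answer {i}")
    files = sorted(p.name for p in (Path(tmp) / "segments").glob("*.jsonl"))

print(files)
```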

At response time, the bot searches the index for direct matches and related associative matches, formats the best hits, and injects them into the local model's system context.

For broad questions like "what did we talk about before?", memory recall uses recent segments as an overview. For questions with a concrete topic, such as "did we talk about SQLite?", it uses keyword retrieval so old but relevant segments can beat newer unrelated chats.
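The two recall modes can be illustrated with a toy index. This is a sketch of the idea only: the segment records, the overlap-count scoring, and the rule "no keyword match means a broad question" are assumptions, and the library's real scoring and routing may differ.

```python
import re

# Hypothetical segment records, oldest first.
SEGMENTS = [
    {"id": 1, "summary": "User prefers SQLite for small apps.", "keywords": {"sqlite", "apps"}},
    {"id": 2, "summary": "Talked about weekend hiking plans.", "keywords": {"hiking", "weekend"}},
    {"id": 3, "summary": "Discussed favorite tea.", "keywords": {"tea"}},
]


def recall(query, segments, limit=2):
    """Return (mode, hits): keyword hits if any term matches, else recent segments."""
    terms = set(re.findall(r"\w+", query.lower()))
    # Score each segment by how many query terms appear in its keywords.
    scored = [(len(terms & seg["keywords"]), seg) for seg in segments]
    hits = [(score, seg) for score, seg in scored if score > 0]
    if not hits:
        # No concrete topic matched: give an overview, newest segment first.
        return "recent-overview", segments[-limit:][::-1]
    # Higher score wins; ties go to the newer segment.
    hits.sort(key=lambda pair: (pair[0], pair[1]["id"]), reverse=True)
    return "keyword", [seg for _, seg in hits[:limit]]


for question in ("did we talk about SQLite?", "what did we talk about before?"):
    mode, found = recall(question, SEGMENTS)
    print(question, "->", mode, [seg["id"] for seg in found])
```

Because keyword hits are ranked by match score before recency, the oldest segment is still returned for the SQLite question even though two newer segments exist.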

[!WARNING] Memory files contain conversation content. Do not commit real user memory directories, downloaded model files, or private chat logs.

Tests

The test suite uses only the built-in unittest module and the echo backend:

python -m unittest

Acknowledgements

This project builds on the local inference ecosystem around llama.cpp, the Python bindings provided by llama-cpp-python, open GGUF model releases from the Qwen and SmolLM communities, and the Python standard library.

Download files

Source Distribution

llama_simple_chat_bot-0.1.0.tar.gz (37.4 kB)

Uploaded Source

Built Distribution

llama_simple_chat_bot-0.1.0-py3-none-any.whl (52.2 kB)

Uploaded Python 3

File details

Details for the file llama_simple_chat_bot-0.1.0.tar.gz.

File metadata

  • Download URL: llama_simple_chat_bot-0.1.0.tar.gz
  • Upload date:
  • Size: 37.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.4.1 CPython/3.11.15 Windows/10

File hashes

Hashes for llama_simple_chat_bot-0.1.0.tar.gz
Algorithm Hash digest
SHA256 24981d0f5f7196a42514e06002b4650f905b43572e669cad107574f5008d03c4
MD5 fafe3a777654b21f4c049e5e1668047d
BLAKE2b-256 753d7c3be59dd1082a73e0dc589f094d475d739227f4bfa2eaa0024c28968e5d

File details

Details for the file llama_simple_chat_bot-0.1.0-py3-none-any.whl.

File hashes

Hashes for llama_simple_chat_bot-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0292be5bcd568c38c112e6b1f6631d812a23e0759778e8e3631eadff7d0d08d6
MD5 79584753249caab4bedb17939d5ff57e
BLAKE2b-256 b3dd098678c4b91d64b6e83f5c20b86d08d165532f0066b734f43bb84a40b693
