Deckard 🕵️‍♂️

Extract structured data from unstructured text — no AI, just regular expressions. 🔍

Deckard is a library of regular-expression patterns for extracting structured data (addresses, phone numbers, email addresses, etc.) and a small set of helper utilities that make using those patterns easier.

Status: very early-stage project. Right now the repository contains mostly patterns for Poland. I am looking for contributors from around the world 🌍 — address formats, phone-number formats and other data representations differ by country, so the goal is to gather country-specific patterns for many regions.

Key features ✨

🗂️ A collection of ready-to-use regex patterns organized by country (for example deckard/patterns/pl.py).
📦 Universal patterns (e.g. email) live in deckard/patterns/standard.py.
🛠️ A small helper function deckard.search that combines multiple patterns and returns named-group matches (deckard/main.py).

Installation ⚙️

From PyPI:

pip install deckard

Editable / local development install:

pip install -e .

For contributors — install dependencies with Poetry 🧑‍💻

This project uses Poetry to manage its runtime and development dependencies.

Install Poetry (see https://python-poetry.org for instructions).
From the project root run:

poetry install

This will create a virtual environment and install runtime and development dependencies (including pytest).

To run tests using Poetry:

poetry run pytest

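Or start a shell in the created virtualenv and run tests directly (on Poetry 2.x, the shell command is no longer built in and requires the poetry-plugin-shell plugin):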

poetry shell
pytest

Quick usage 🧭

Example using the current public API:

from deckard import search
from deckard.patterns import standard, pl

text = (
    "Hello, my email is spaceshaman@tuta.io and my phone number is "
    "+48 792 321 321 and my address is ul. Tesotowa 12/6A, 66-700 Bielsko-Biała."
)

result = search([standard.EMAIL, pl.MOBILE_PHONE, pl.ADDRESS], text)

# result.groupdict() will return a dict of named groups, for example:
# {
#   'email': 'spaceshaman@tuta.io',
#   'mobile_phone': '792 321 321',
#   'street': 'ul. Tesotowa',
#   'building': '12',
#   'apartment': '6A',
#   'zip_code': '66-700',
#   'city': 'Bielsko-Biała'
# }

The search helper composes the provided patterns into a single regex (using lookaheads) and returns the first match as a regex.Match object (or None if nothing matched).
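
As a rough illustration of lookahead composition, the sketch below wraps each pattern in a positive lookahead using only the standard-library re module. It is not Deckard's actual implementation: combine, search_all, and the two simplified patterns are hypothetical stand-ins and do not come from the deckard package.

```python
import re

# Hypothetical, simplified patterns for illustration only;
# they are NOT the patterns shipped in deckard.patterns.
EMAIL = r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
ZIP_CODE = r"(?P<zip_code>\d{2}-\d{3})"


def combine(patterns):
    """Wrap every pattern in a lookahead so each must occur somewhere in the text."""
    lookaheads = "".join(f"(?=.*?{p})" for p in patterns)
    # DOTALL lets '.' cross newlines, so values may sit on different lines.
    return re.compile(lookaheads, re.DOTALL)


def search_all(patterns, text):
    """Return a match whose named groups hold the captured values, or None."""
    return combine(patterns).search(text)


# The order of patterns does not have to follow the order of values in the text.
match = search_all([ZIP_CODE, EMAIL], "Write to jan@example.com, 66-700 Testowo.")
if match is not None:  # None means at least one pattern was not found
    print(match.groupdict())
```

Because every lookahead is anchored at the same position, the match itself is empty and the useful data lives entirely in the named groups. Checking for None before calling groupdict() avoids an AttributeError when any one of the patterns is absent from the text.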

Repository layout

deckard/ — library code
- deckard/main.py — helper search function
- deckard/patterns/standard.py — universal patterns (e.g. EMAIL)
- deckard/patterns/pl.py — Poland-specific patterns (address, postal code, phone, etc.)
tests/ — unit tests

Examples of existing tests:

tests/test_standard_patterns.py — test for standard.EMAIL
tests/test_search_with_multiple_patterns.py — integration tests combining standard.EMAIL with patterns from pl.py
tests/pl/test_search_address_pl.py — tests for Polish address patterns

Every new pattern must come with tests. Pull requests without tests will not be accepted.

Contributing — how to add new patterns

Create a new file under deckard/patterns/ named by the country code, e.g. us.py, de.py, fr.py.
Define constants (UPPERCASE) for each pattern, for example MOBILE_PHONE, ADDRESS, ZIP_CODE.
Add tests under tests/. Use the existing Polish tests (e.g. tests/test_search_with_multiple_patterns.py) as a template. Provide normal and edge-case examples.
In the PR description, explain the local rules (phone number format, postal code format, common street abbreviations, etc.).
PRs without tests will not be accepted.
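
The steps above might look roughly like this for a new country module. This is only a sketch: the file name us.py, the ZIP_CODE pattern, and the test names are hypothetical and are not part of the package. The tests use re directly so the example runs without Deckard installed.

```python
import re

# Hypothetical contents of deckard/patterns/us.py (not shipped with the package).
# US ZIP code: five digits, optionally followed by a hyphen and four more (ZIP+4).
ZIP_CODE = r"(?P<zip_code>\b\d{5}(?:-\d{4})?\b)"


# Hypothetical contents of a matching test module; pytest collects test_* functions.
def test_zip_code_plain():
    match = re.search(ZIP_CODE, "Springfield, IL 62704")
    assert match is not None
    assert match.group("zip_code") == "62704"


def test_zip_code_plus_four():
    match = re.search(ZIP_CODE, "Springfield, IL 62704-1234")
    assert match is not None
    assert match.group("zip_code") == "62704-1234"


def test_zip_code_rejects_longer_digit_runs():
    # Edge case: a seven-digit number must not yield a ZIP code.
    assert re.search(ZIP_CODE, "Order 1234567") is None
```

The word boundaries keep the pattern from picking five digits out of a longer number, which is the kind of edge case worth covering in every new pattern's tests.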

Tips 💡:

🧾 Use clear, consistent named groups in regexes, written as (?P<name>...), so groupdict() returns a predictable structure.
📝 Document complex patterns with comments and example inputs if necessary.
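
One possible way to follow both tips is a scoped verbose group, which allows comments inside the pattern itself. The sketch below assumes the Polish postal code format seen in the usage example (two digits, a hyphen, three digits); the constant is illustrative and may differ from what pl.py actually defines.

```python
import re

# Illustrative pattern; the real constant in deckard/patterns/pl.py may differ.
# (?x: ... ) turns on verbose mode for this group only, so whitespace and
# '#' comments inside it are ignored by the regex engine.
ZIP_CODE = r"""(?x:
    (?P<zip_code>
        \d{2}    # two leading digits
        -        # literal hyphen
        \d{3}    # three trailing digits
    )
)"""

# Example input: "ul. Testowa 12/6A, 66-700 Bielsko-Biała"
match = re.search(ZIP_CODE, "ul. Testowa 12/6A, 66-700 Bielsko-Biała")
if match is not None:
    print(match.groupdict())
```

The scoped form (?x: ... ) is used here instead of a leading (?x) because recent Python versions reject global inline flags that do not appear at the very start of the expression, which would happen as soon as the pattern is embedded in a larger one.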

Discussion and roadmap 🚧

The project is not yet final — everything is open for discussion. Areas for contributors and discussion include:

📋 Defining a minimal set of patterns every country should provide (email, phone, address, postal code, national ID where applicable).
🔠 Standardizing group names (street, building, apartment, zip_code, city, country, mobile_phone, etc.).
⚖️ Tools for validation and normalization of extracted values.
🤖 Automating tests with sample documents in various languages.

If you want to help, open an issue or a PR — a short description of the local data format and one or two patterns with tests is a great place to start.

License 📄

This project is licensed under the MIT License. See the LICENSE file for the full text.

Thanks for your interest — please join the effort. Together we can build an international library of patterns to extract structured data from arbitrary text using robust regular expressions. 🚀

This version: 0.1.0 (released Aug 17, 2025)
