Deckard 🕵️‍♂️

Extract structured data from unstructured text — no AI, just regular expressions. 🔍

Deckard is a library of regular-expression patterns for extracting structured data (addresses, phone numbers, email addresses, etc.) and a small set of helper utilities that make using those patterns easier.

Status: this is a very early-stage project, and the repository currently contains mostly patterns for Poland. I am looking for contributors from around the world 🌍: address formats, phone-number formats, and other data representations differ by country, so the goal is to gather country-specific patterns for many regions.

Key features ✨

Installation ⚙️

From PyPI:

pip install deckard

Editable / local development install:

pip install -e .

For contributors — install dependencies with Poetry 🧑‍💻

This project uses Poetry to manage runtime and development dependencies.

  1. Install Poetry (see https://python-poetry.org for instructions).
  2. From the project root run:
poetry install

This will create a virtual environment and install runtime and development dependencies (including pytest).

To run tests using Poetry:

poetry run pytest

Or start a shell in the created virtualenv and run tests directly:

poetry shell
pytest

Quick usage 🧭

Example using the current public API:

from deckard import search
from deckard.patterns import standard, pl

text = (
    "Hello, my email is spaceshaman@tuta.io and my phone number is "
    "+48 792 321 321 and my address is ul. Tesotowa 12/6A, 66-700 Bielsko-Biała."
)

result = search([standard.EMAIL, pl.MOBILE_PHONE, pl.ADDRESS], text)

# result.groupdict() will return a dict of named groups, for example:
# {
#   'email': 'spaceshaman@tuta.io',
#   'mobile_phone': '792 321 321',
#   'street': 'ul. Tesotowa',
#   'building': '12',
#   'apartment': '6A',
#   'zip_code': '66-700',
#   'city': 'Bielsko-Biała'
# }

The search helper composes the provided patterns into a single regex (using lookaheads) and returns the first match as a regex.Match object (or None if nothing matched).
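To illustrate the lookahead idea, here is a minimal sketch built only on the standard-library `re` module. It is not Deckard's actual implementation: the `search_all` helper and the two pattern strings are hypothetical stand-ins, and the real patterns in `deckard.patterns` may look different.

```python
import re

# Hypothetical stand-ins for library patterns; the real definitions in
# deckard.patterns may differ.
EMAIL = r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
ZIP_CODE = r"(?P<zip_code>\d{2}-\d{3})"


def search_all(patterns, text):
    """Return a match only if every pattern occurs somewhere in the text."""
    # Wrapping each pattern in a lookahead means the order in which the
    # values appear in the text does not matter.
    combined = "".join(f"(?=.*?{p})" for p in patterns)
    return re.search(combined, text, flags=re.DOTALL)


match = search_all([EMAIL, ZIP_CODE], "Zip 66-700, mail: user@example.com")
print(match.groupdict() if match else None)
# {'email': 'user@example.com', 'zip_code': '66-700'}
```

Because each lookahead scans from the same starting position, named groups captured inside them are still available through `groupdict()`, and the whole search fails (returns `None`) if any one pattern is missing.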

Repository layout

Examples of existing tests:

Every new pattern must come with tests. Pull requests without tests will not be accepted.

Contributing — how to add new patterns

  1. Create a new file under deckard/patterns/ named by the country code, e.g. us.py, de.py, fr.py.
  2. Define constants (UPPERCASE) for each pattern, for example MOBILE_PHONE, ADDRESS, ZIP_CODE.
  3. Add tests under tests/. Use the existing Polish tests (e.g. tests/test_search_with_multiple_patterns.py) as a template. Provide normal and edge-case examples.
  4. In the PR description explain local rules (phone number format, postal code format, common street abbreviations, etc.).
  5. PRs without tests will not be accepted.
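As a rough sketch of steps 1 to 3, a contribution for a new country could look like the following. Everything here is illustrative: the file names, the `ZIP_CODE` pattern, and the assumption that patterns are plain regex strings are mine, so check the existing Polish patterns and tests for the actual conventions.

```python
import re

# Hypothetical contents of deckard/patterns/us.py (illustrative only).
# US ZIP code: five digits with an optional four-digit "+4" extension.
ZIP_CODE = r"(?P<zip_code>\b\d{5}(?:-\d{4})?\b)"


# Hypothetical contents of tests/test_us_zip_code.py, written in pytest style.
def test_zip_code_basic():
    match = re.search(ZIP_CODE, "Springfield, IL 62704")
    assert match is not None
    assert match.group("zip_code") == "62704"


def test_zip_code_plus_four():
    # Edge case: the extended ZIP+4 form.
    match = re.search(ZIP_CODE, "Washington, DC 20500-0003")
    assert match is not None
    assert match.group("zip_code") == "20500-0003"


def test_zip_code_rejects_short_number():
    # Edge case: a four-digit number must not be treated as a ZIP code.
    assert re.search(ZIP_CODE, "Room 1234") is None
```

The sketch keeps the pattern and its tests in one block for brevity; in a real PR they would live in separate files, and the tests would be run with `poetry run pytest`.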

Tips 💡:

  • 🧾 Use clear, consistent named groups in regexes (?P<name>) so groupdict() returns a predictable structure.
  • 📝 Document complex patterns with comments and example inputs if necessary.
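One way to follow both tips is to write the pattern with `re.VERBOSE`, which allows inline comments next to each named group. The `POSTAL_LINE` pattern below is a hypothetical example, not one shipped with the library, and whether Deckard patterns are compiled objects or plain strings is an assumption here.

```python
import re

# Hypothetical pattern; re.VERBOSE lets every part carry its own comment.
POSTAL_LINE = re.compile(
    r"""
    (?P<zip_code>\d{2}-\d{3})             # postal code, e.g. 66-700
    \s+                                   # whitespace between code and city
    (?P<city>[^\W\d_]+(?:-[^\W\d_]+)*)    # city name, hyphenated parts allowed
    """,
    re.VERBOSE,
)

match = POSTAL_LINE.search("ul. Tesotowa 12/6A, 66-700 Bielsko-Biała.")
print(match.groupdict() if match else None)
# {'zip_code': '66-700', 'city': 'Bielsko-Biała'}
```

The group names `zip_code` and `city` match the ones used in the quick-usage example, so the resulting dictionary keeps the same shape across patterns.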

Discussion and roadmap 🚧

The project is not yet final — everything is open for discussion. Areas for contributors and discussion include:

  • 📋 Defining a minimal set of patterns every country should provide (email, phone, address, postal code, national ID where applicable).
  • 🔠 Standardizing group names (street, building, apartment, zip_code, city, country, mobile_phone, etc.).
  • ⚖️ Tools for validation and normalization of extracted values.
  • 🤖 Automating tests with sample documents in various languages.
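To make the validation and normalization item more concrete, here is one possible shape for such a helper. It is only a discussion starter: `normalize_pl_mobile` is a hypothetical function that does not exist in the library, and it assumes a nine-digit Polish mobile number with an optional `48` country code.

```python
import re
from typing import Optional


def normalize_pl_mobile(raw: str) -> Optional[str]:
    """Normalize a Polish mobile number to +48XXXXXXXXX, or return None."""
    # Keep digits only, dropping spaces, dashes, and the leading plus sign.
    digits = re.sub(r"\D", "", raw)
    # Drop the country code if it is present.
    if len(digits) == 11 and digits.startswith("48"):
        digits = digits[2:]
    # Anything that is not exactly nine digits is treated as invalid.
    if len(digits) != 9:
        return None
    return f"+48{digits}"


print(normalize_pl_mobile("+48 792 321 321"))  # +48792321321
print(normalize_pl_mobile("792-321-321"))      # +48792321321
print(normalize_pl_mobile("12345"))            # None
```

A helper like this would run on the values returned by `groupdict()`, keeping extraction (the regex) separate from normalization (the cleanup).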

If you want to help, open an issue or a PR — a short description of the local data format and one or two patterns with tests is a great place to start.

License 📄

This project is licensed under the MIT License. See the LICENSE file for the full text.


Thanks for your interest — please join the effort. Together we can build an international library of patterns to extract structured data from arbitrary text using robust regular expressions. 🚀
