Extract files (PDF, DOCX, PPTX, XLSX) into Obsidian-flavored Markdown

Maintainers

gucky92

Project description

obsidian-import

Extract files (PDF, DOCX, PPTX, XLSX, CSV, JSON, YAML, images) into Obsidian-flavored Markdown.

The mirror of obsidian-export: where obsidian-export converts Obsidian notes to PDF/DOCX, obsidian-import converts external documents into Obsidian-ready markdown with YAML frontmatter.

Installation

pip install obsidian-import

With optional backends:

pip install "obsidian-import[markitdown]"    # fallback for HTML, etc.
pip install "obsidian-import[docling]"       # high-quality ML-based extraction

Quick Start

Single file

obsidian-import convert report.pdf --output vault/imports/report.md

Batch extraction

obsidian-import batch --config config.yaml

Check backend availability

obsidian-import doctor

Python API

from pathlib import Path
from obsidian_import import extract_file, extract_text, discover_files, config_for_backend
from obsidian_import.config import load_config
from obsidian_import.output import format_output

config = load_config(Path("config.yaml"))

# Single file (full document with frontmatter)
doc = extract_file(Path("report.pdf"), config)
markdown = format_output(doc, config.output)

# Quick text extraction (no config file needed)
quick_config = config_for_backend(
    "markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
    extract_images=False,
)
text = extract_text(Path("report.pdf"), quick_config)

# Batch discovery (uses the YAML config loaded above)
for file in discover_files(config):
    print(f"{file.extension}  {file.size_bytes:,} bytes  {file.path}")

`config_for_backend()` — Quick Configuration

For consumers that just need text extraction without managing the full config surface:

from pathlib import Path
from obsidian_import import extract_text, config_for_backend

config = config_for_backend(
    backend="markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
    extract_images=False,
)
text = extract_text(Path("document.docx"), config)

This sets the backend for every format to the specified backend name. All parameters are required; there are no hidden defaults.

`config_from_overrides()` — Partial Overrides

Use this when you need full control over individual config keys without writing a YAML file. The overrides dict is deep-merged onto the bundled defaults, exactly as load_config does for user YAML:

from pathlib import Path
from obsidian_import import extract_text, config_from_overrides

if __name__ == "__main__":
    config = config_from_overrides(
        {
            "extraction": {"max_file_size_mb": 25, "isolation": "process"},
            "backends": {"pdf": "docling"},
        }
    )
    text = extract_text(Path("document.pdf"), config)
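To make the deep-merge behaviour concrete, here is a minimal sketch of how nested overrides might combine with defaults. The `deep_merge` helper and the `defaults` dict are hypothetical illustrations, not the library's code, and the default values are illustrative rather than the real bundled defaults.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where nested keys in `overrides` replace those in `base`."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: recurse so sibling keys survive.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale.
            merged[key] = deepcopy(value)
    return merged


# Illustrative defaults only; the real bundled defaults may differ.
defaults = {
    "extraction": {"timeout_seconds": 120, "max_file_size_mb": 100, "isolation": "thread"},
    "backends": {"pdf": "native", "docx": "native"},
}
overrides = {
    "extraction": {"max_file_size_mb": 25, "isolation": "process"},
    "backends": {"pdf": "docling"},
}
merged = deep_merge(defaults, overrides)
print(merged["extraction"])
```

Under this reading, keys that the overrides do not mention (such as timeout_seconds) keep their default values, which is what makes partial overrides practical.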

With isolation: "process", extraction calls in a script must run under an if __name__ == "__main__": guard. Multiprocessing spawn re-imports the calling module in the child process, and an unguarded top-level call crashes the child before it can extract anything. (Installed CLI entry points and pytest are unaffected.)

Process mode also spawns a fresh interpreter per file, so backend import time is paid again on every call and counts against timeout_seconds. That cost is negligible for native backends but can reach several seconds to tens of seconds for docling/torch. For docling batch runs, prefer thread mode or raise timeout_seconds.
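The trade-off between the two isolation modes follows from how Python handles timeouts in general. The sketch below uses only the standard library and a hypothetical `run_with_thread_timeout` helper; it is not obsidian-import's implementation. It shows that a thread-based timeout stops waiting but cannot stop the work itself, which is why only a separate process can be killed on timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_thread_timeout(func, timeout_seconds: float):
    """Run func in a worker thread; return (finished, result)."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return True, future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The caller stops waiting, but the worker thread keeps running:
        # Python offers no safe way to kill a thread.
        return False, None
    finally:
        # Do not block here waiting for a slow worker to finish.
        executor.shutdown(wait=False)


def slow_extraction() -> str:
    time.sleep(0.3)  # stand-in for a slow backend
    return "text"


finished, result = run_with_thread_timeout(slow_extraction, timeout_seconds=0.05)
print(finished, result)
```

In a long-running daemon, abandoned worker threads like this one can accumulate, which fits the recommendation of process mode for that use case.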

Configuration

Create a config.yaml:

input:
  directories:
    - path: /path/to/documents
      extensions: [".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".json", ".yaml", ".png", ".jpg"]
      exclude: ["*.tmp", "~$*"]

output:
  directory: ./extracted
  frontmatter: true
  metadata_fields:
    - title
    - source
    - original_path
    - file_type
    - extracted_at
    - page_count

backends:
  pdf: native        # pdfplumber + pypdf
  docx: native       # defusedxml
  pptx: native       # python-pptx
  xlsx: native       # openpyxl
  csv: native        # stdlib csv -> GFM table
  json: native       # stdlib json -> fenced code block
  yaml: native       # PyYAML -> fenced code block
  image: native      # Obsidian ![[wikilink]] embed
  html: markitdown   # .html / .htm via markitdown (no native backend)
  default: native    # fallback for unknown extensions

extraction:
  timeout_seconds: 120
  max_file_size_mb: 100        # enforced both in discovery and at the
                               # extract_file/extract_text entry points
  xlsx_max_rows_per_sheet: 500
  isolation: thread            # thread = lower latency (one-shot CLI use);
                               # process = killed on timeout + memory isolation
                               # (recommended for long-running daemons; see
                               # the __main__-guard and per-file import-cost
                               # notes above)

# Pass-through: copy files as-is without extraction
passthrough:
  extensions: [".md", ".markdown", ".canvas"]
  paths: ["raw/**"]
  patterns: []

Backend Selection

Backend	Extensions	Dependencies	Quality
`native`	.pdf, .docx, .pptx, .xlsx, .csv, .json, .yaml/.yml, images	Core (included)	Good for text-heavy documents
`markitdown`	Any	`[markitdown]` extra	Good fallback for HTML, etc.
`docling`	Any	`[docling]` extra	Best for complex layouts, tables
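As a rough illustration of how the backends mapping could drive dispatch, the sketch below resolves a file path to a backend name and falls back to the default key for unknown extensions. The `select_backend` helper and the `EXTENSION_TO_KEY` table are assumptions made for this example; the library's actual lookup may differ.

```python
from pathlib import PurePosixPath

# Assumed mapping from file extensions to the keys used under `backends:`.
# Extensions not listed here are assumed to map to their own name (".pdf" -> "pdf").
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".bmp", ".tiff")
EXTENSION_TO_KEY = {
    ".yml": "yaml",
    ".htm": "html",
    **{ext: "image" for ext in IMAGE_EXTENSIONS},
}


def select_backend(path: str, backends: dict[str, str]) -> str:
    """Pick the backend configured for a file, or the default for unknown types."""
    ext = PurePosixPath(path).suffix.lower()
    key = EXTENSION_TO_KEY.get(ext, ext.lstrip("."))
    return backends.get(key, backends["default"])


backends = {"pdf": "docling", "html": "markitdown", "image": "native", "default": "native"}
for name in ["report.pdf", "page.htm", "photo.JPG", "notes.rtf"]:
    print(name, "->", select_backend(name, backends))
```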

Format-Specific Behavior

Format	Native Backend Output
PDF	Page-by-page markdown with tables and metadata
DOCX	Headings, paragraphs, and tables from XML
PPTX	Slide-by-slide with titles, body text, and notes
XLSX	Sheet-by-sheet GFM markdown tables
CSV	GFM markdown table
JSON	Pretty-printed fenced code block
YAML/YML	Fenced code block
Images (PNG, JPG, GIF, SVG, WEBP, BMP, TIFF)	Obsidian wikilink embed `![[image.png]]`
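To show what two of the rows above mean in practice, here is a minimal standard-library sketch of a CSV-to-GFM-table conversion and a JSON-to-fenced-block conversion. The function names are hypothetical, and details such as pipe escaping and treating the first CSV row as the header are assumptions, not documented behavior of the native backend.

```python
import csv
import io
import json


def csv_to_gfm_table(csv_text: str) -> str:
    """Render CSV text as a GFM table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row: list[str]) -> str:
        # Pad short rows and escape pipes so cells cannot break the table.
        cells = [cell.replace("|", "\\|") for cell in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), separator, *(render(r) for r in body)])


def json_to_fenced_block(json_text: str) -> str:
    """Pretty-print JSON and wrap it in a fenced code block."""
    pretty = json.dumps(json.loads(json_text), indent=2, ensure_ascii=False)
    fence = "`" * 3
    return f"{fence}json\n{pretty}\n{fence}"


print(csv_to_gfm_table("name,qty\nwidget,3\n"))
```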

Pass-Through Mode

Files matching pass-through rules are copied to the output directory as-is, without extraction or conversion. This is useful for:

- .md files that are already Obsidian-ready
- .csv, .json, .yaml files used by Obsidian plugins (e.g., Dataview)
- Any file type where transformation is unwanted

Pass-through rules are evaluated before backend dispatch. A file matches if it hits any rule (OR logic):

passthrough:
  # Extension list (cheapest check, runs first)
  extensions: [".md", ".markdown", ".canvas"]

  # fnmatch patterns (matched against full source path string;
  # '*' matches '/', so '**/' is not needed for directory traversal)
  paths: ["notes/raw/**", "**/*.template.*"]

  # Regex patterns (matched against full source path string)
  patterns: [".*\\.generated\\..*"]

Decision tree:

File discovered
  |
  +- matches passthrough?
       |
       +- YES -> COPY as-is (no .md wrapper)
       |
       +- NO  -> backend dispatch -> extract -> write .md
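The rule order and OR logic described above can be sketched with the standard library. The `is_passthrough` helper is hypothetical, and using `re.search` for the regex rules is an assumption; the README only says that patterns are matched against the full source path string.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_passthrough(path: str, extensions: list[str], paths: list[str], patterns: list[str]) -> bool:
    """Return True if `path` hits any pass-through rule (OR logic)."""
    # 1. Extension check: cheapest, so it runs first.
    if PurePosixPath(path).suffix in extensions:
        return True
    # 2. fnmatch globs against the full path; '*' also matches '/'.
    if any(fnmatchcase(path, glob) for glob in paths):
        return True
    # 3. Regexes against the full path (re.search is an assumption here).
    return any(re.search(pattern, path) for pattern in patterns)


rules = {
    "extensions": [".md", ".markdown", ".canvas"],
    "paths": ["notes/raw/**", "**/*.template.*"],
    "patterns": [r".*\.generated\..*"],
}

for candidate in ["notes/todo.md", "notes/raw/scan/a.pdf", "docs/report.pdf"]:
    action = "copy as-is" if is_passthrough(candidate, **rules) else "extract"
    print(f"{candidate}: {action}")
```

Because fnmatch's `*` matches path separators, `notes/raw/**` already covers nested directories such as `notes/raw/scan/`.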

Media Extraction

PDF, DOCX, and PPTX files can contain embedded images. Enable media extraction to save these as separate files alongside the markdown output:

media:
  extract_images: true     # enable/disable embedded image extraction
  image_format: png        # output format: png, jpg, webp
  image_max_dimension: 0   # max width/height in px (0 = no resize)

Extracted images are saved in per-document media folders (<doc-stem>/) and referenced via Obsidian wikilinks (![[doc-stem/image_001.png]]).

To disable media extraction (e.g., for text-only pipelines), set extract_images: false in your config YAML or pass extract_images=False to config_for_backend().
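The naming and resize settings above can be pictured with a small sketch. Both helpers are hypothetical: the three-digit counter is inferred from the image_001.png example, and aspect-ratio-preserving scaling for image_max_dimension is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath


def media_wikilink(doc_path: str, index: int, image_format: str = "png") -> str:
    """Build the wikilink for the index-th image extracted from a document."""
    stem = PurePosixPath(doc_path).stem             # "report.pdf" -> "report"
    filename = f"image_{index:03d}.{image_format}"  # zero-padded, as in image_001.png
    return f"![[{stem}/{filename}]]"


def fit_within(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side fits max_dimension; 0 means no resize."""
    longest = max(width, height)
    if max_dimension <= 0 or longest <= max_dimension:
        return width, height
    scale = max_dimension / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(media_wikilink("documents/report.pdf", 1))
print(fit_within(4000, 3000, 2000))
```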

Image Handling

Images are handled differently from text documents. The native image backend generates an Obsidian wikilink embed:

---
title: diagram
source: obsidian-import
file_type: png
---

![[diagram.png]]

The image file is automatically copied alongside the .md output so Obsidian can render it inline. Supported formats: PNG, JPG, JPEG, GIF, SVG, WEBP, BMP, TIFF.

CLI Reference

Command	Description
`obsidian-import convert <path>`	Extract a single file
`obsidian-import discover --config <yaml>`	List matching files
`obsidian-import batch --config <yaml>`	Extract all discovered files (with pass-through)
`obsidian-import doctor`	Check backend availability

Output Format

Extracted files are written as Obsidian-flavored markdown with YAML frontmatter:

---
title: Annual Report
source: obsidian-import
original_path: /documents/report.pdf
file_type: pdf
extracted_at: 2026-03-09T10:30:00Z
page_count: 12
---

# Annual Report

## Page 1

Content extracted from the first page...
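For readers who want to post-process or reproduce this layout, the sketch below renders a frontmatter block from a metadata dict, using the field order from metadata_fields. The `render_frontmatter` helper is hypothetical and deliberately simplified: it does no YAML quoting or escaping, and skipping fields that a document lacks is an assumption.

```python
from datetime import datetime, timezone


def render_frontmatter(metadata: dict[str, object], fields: list[str]) -> str:
    """Emit the selected metadata fields, in order, as a YAML frontmatter block."""
    lines = ["---"]
    for field in fields:
        if field in metadata:  # skip fields the document does not have
            lines.append(f"{field}: {metadata[field]}")
    lines.append("---")
    return "\n".join(lines)


metadata = {
    "title": "Annual Report",
    "source": "obsidian-import",
    "original_path": "/documents/report.pdf",
    "file_type": "pdf",
    "extracted_at": datetime(2026, 3, 9, 10, 30, tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "page_count": 12,
}
fields = ["title", "source", "original_path", "file_type", "extracted_at", "page_count"]
print(render_frontmatter(metadata, fields))
```

Values containing colons, quotes, or leading special characters would need proper YAML serialization, so a real implementation would likely use a YAML library instead of string formatting.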

Related Packages

obsidian-export -- Convert Obsidian notes to PDF/DOCX
agentic-brain -- Agentic knowledge management (consumes both packages)

License

MIT

Project details

Release history

Version	Release date
1.2.0 (this version)	Jun 11, 2026
1.1.2	May 20, 2026
1.1.1	May 11, 2026
1.1.0	Apr 28, 2026
1.0.4	Apr 13, 2026
1.0.3	Mar 30, 2026
1.0.2	Mar 20, 2026
1.0.1	Mar 17, 2026
1.0.0	Mar 12, 2026
0.1.0	Mar 9, 2026

Download files

Source Distribution

obsidian_import-1.2.0.tar.gz (29.2 kB)

Uploaded Jun 11, 2026 Source

Built Distribution

obsidian_import-1.2.0-py3-none-any.whl (41.5 kB)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file obsidian_import-1.2.0.tar.gz.

File metadata

Download URL: obsidian_import-1.2.0.tar.gz
Upload date: Jun 11, 2026
Size: 29.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for obsidian_import-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`2e658a8b5f4d12c951d7c92c9dd2e3a9617c748d83f81479416e8ddd76f6a4ba`
MD5	`645725541aeadca2c55ce094b4e6b98f`
BLAKE2b-256	`98c2a66e48c620d9ec67b923218c7e264c5406bf1604f8d3d6e4189e0a76db9d`
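A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the `sha256_of` helper and the local file location are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table above.
EXPECTED_SHA256 = "2e658a8b5f4d12c951d7c92c9dd2e3a9617c748d83f81479416e8ddd76f6a4ba"
sdist = Path("obsidian_import-1.2.0.tar.gz")  # assumed local download location

if sdist.exists():
    print("hash OK" if sha256_of(sdist) == EXPECTED_SHA256 else "hash MISMATCH")
```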

Provenance

The following attestation bundles were made for obsidian_import-1.2.0.tar.gz:

Publisher: publish.yml on neuralsignal/obsidian-import

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: obsidian_import-1.2.0.tar.gz
- Subject digest: 2e658a8b5f4d12c951d7c92c9dd2e3a9617c748d83f81479416e8ddd76f6a4ba
- Sigstore transparency entry: 1791240033
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: neuralsignal/obsidian-import@f25183e003f748282fab6205f1beabad26c11112
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/neuralsignal
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f25183e003f748282fab6205f1beabad26c11112
- Trigger Event: push

File details

Details for the file obsidian_import-1.2.0-py3-none-any.whl.

File metadata

Download URL: obsidian_import-1.2.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 41.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for obsidian_import-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f0fb24ceca0a3decb66109cf39a7b3a36fcdc7a7be72a2ba344927d5cd4c7b89`
MD5	`457a9133b9c80e7bcd534824dccf56c8`
BLAKE2b-256	`ecd6619c62f42a4868a964a935defc32276a32c26d3063a62ed7747ca9546b81`

Provenance

The following attestation bundles were made for obsidian_import-1.2.0-py3-none-any.whl:

Publisher: publish.yml on neuralsignal/obsidian-import

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: obsidian_import-1.2.0-py3-none-any.whl
- Subject digest: f0fb24ceca0a3decb66109cf39a7b3a36fcdc7a7be72a2ba344927d5cd4c7b89
- Sigstore transparency entry: 1791240416
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: neuralsignal/obsidian-import@f25183e003f748282fab6205f1beabad26c11112
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/neuralsignal
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f25183e003f748282fab6205f1beabad26c11112
- Trigger Event: push
