obsidian-import

Extract files (PDF, DOCX, PPTX, XLSX, CSV, JSON, YAML, images) into Obsidian-flavored Markdown.

The mirror of obsidian-export: where obsidian-export converts Obsidian notes to PDF/DOCX, obsidian-import converts external documents into Obsidian-ready markdown with YAML frontmatter.

Installation

pip install obsidian-import

With optional backends:

pip install "obsidian-import[markitdown]"    # fallback for HTML, etc.
pip install "obsidian-import[docling]"       # high-quality ML-based extraction

Quick Start

Single file

obsidian-import convert report.pdf --output vault/imports/report.md

Batch extraction

obsidian-import batch --config config.yaml

Check backend availability

obsidian-import doctor
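
As a rough illustration of what an availability check involves, the sketch below probes for optional modules with importlib.util.find_spec, which locates a module without importing it. This is not the implementation of obsidian-import doctor; the import names in OPTIONAL_BACKENDS are assumptions based on the extras listed under Installation.

```python
from importlib.util import find_spec

# Assumed import names for the optional extras; the real package may differ.
OPTIONAL_BACKENDS = {"markitdown": "markitdown", "docling": "docling"}


def backend_available(module_name: str) -> bool:
    """Return True if the top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def report(backends: dict[str, str]) -> dict[str, bool]:
    """Map each backend label to whether its module is installed."""
    return {label: backend_available(module) for label, module in backends.items()}


if __name__ == "__main__":
    for label, ok in report(OPTIONAL_BACKENDS).items():
        print(f"{label:12} {'available' if ok else 'missing'}")
```

Probing with find_spec keeps the check cheap, since heavy packages are never actually imported.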

Python API

from pathlib import Path
from obsidian_import import extract_file, extract_text, discover_files, config_for_backend
from obsidian_import.config import load_config
from obsidian_import.output import format_output

config = load_config(Path("config.yaml"))

# Single file (full document with frontmatter)
doc = extract_file(Path("report.pdf"), config)
markdown = format_output(doc, config.output)

# Quick text extraction (no config file needed).
# Kept in a separate variable so the loaded config stays available below.
quick_config = config_for_backend(
    backend="markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
    extract_images=False,
)
text = extract_text(Path("report.pdf"), quick_config)

# Batch discovery (uses the input directories from config.yaml)
for file in discover_files(config):
    print(f"{file.extension}  {file.size_bytes:,} bytes  {file.path}")

config_for_backend() — Quick Configuration

For consumers that just need text extraction without managing the full config surface:

from pathlib import Path
from obsidian_import import extract_text, config_for_backend

config = config_for_backend(
    backend="markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
    extract_images=False,
)
text = extract_text(Path("document.docx"), config)

This sets every backend entry to the specified backend name. All parameters are required; there are no hidden defaults.

config_from_overrides() — Partial Overrides

Use this when a library consumer needs full control over individual config keys without writing a YAML file. The overrides dict deep-merges onto the bundled defaults, exactly like load_config does for user YAML:

from pathlib import Path
from obsidian_import import extract_text, config_from_overrides

if __name__ == "__main__":
    config = config_from_overrides(
        {
            "extraction": {"max_file_size_mb": 25, "isolation": "process"},
            "backends": {"pdf": "docling"},
        }
    )
    text = extract_text(Path("document.pdf"), config)
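
To picture what a deep merge does to nested keys, here is a minimal standalone sketch. The deep_merge helper is hypothetical and is not the library's own code, and the defaults dict only borrows values from the sample config.yaml in this README rather than the real bundled defaults. Replacing lists and scalars wholesale is an assumption.

```python
from copy import deepcopy
from typing import Any


def deep_merge(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where overrides win, merging nested dicts key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: scalars and lists are replaced, not combined.
            merged[key] = deepcopy(value)
    return merged


# Illustrative defaults only (values borrowed from the sample config.yaml).
defaults = {
    "extraction": {"timeout_seconds": 120, "max_file_size_mb": 100, "isolation": "thread"},
    "backends": {"pdf": "native", "docx": "native"},
}
overrides = {
    "extraction": {"max_file_size_mb": 25, "isolation": "process"},
    "backends": {"pdf": "docling"},
}
merged = deep_merge(defaults, overrides)
print(merged["extraction"])
print(merged["backends"])
```

Keys that the overrides do not mention, such as timeout_seconds and the docx backend here, keep their default values.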

With isolation: "process", extraction calls in a script must run under an if __name__ == "__main__": guard. The multiprocessing spawn start method re-imports the calling module in the child process, so an unguarded top-level call crashes the child before it can extract anything. (Installed CLI entry points and pytest are unaffected.) Process mode also spawns a fresh interpreter per file, so backend imports are re-paid on every call and count against timeout_seconds. That cost is negligible for native backends but runs from several seconds to tens of seconds for docling/torch. Prefer thread mode for docling batch runs, or raise timeout_seconds.
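
The cost of a fresh interpreter per call can be observed with a small experiment. The sketch below uses subprocess rather than multiprocessing, so it is a stand-in for process isolation in general and not the library's mechanism; run_in_fresh_interpreter is a hypothetical helper. It shows that the timeout clock covers interpreter startup and imports as well as the work itself, and that the child is killed when the timeout expires.

```python
import subprocess
import sys
import time
from typing import Optional


def run_in_fresh_interpreter(code: str, timeout_seconds: float) -> Optional[str]:
    """Run `code` in a new Python process; return its stdout, or None on timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    start = time.perf_counter()
    print(run_in_fresh_interpreter("import json; print('ok')", timeout_seconds=30))
    # Elapsed time includes interpreter startup and imports, not just the work.
    print(f"startup + import + work: {time.perf_counter() - start:.2f}s")
    # A slow stand-in "extraction" is cut off once the timeout is reached.
    print(run_in_fresh_interpreter("import time; time.sleep(5)", timeout_seconds=0.5))
```

Swapping the imported module for a heavy one makes the startup share of the budget visible, which is the reason given above for raising timeout_seconds with docling.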

Configuration

Create a config.yaml:

input:
  directories:
    - path: /path/to/documents
      extensions: [".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".json", ".yaml", ".png", ".jpg"]
      exclude: ["*.tmp", "~$*"]

output:
  directory: ./extracted
  frontmatter: true
  metadata_fields:
    - title
    - source
    - original_path
    - file_type
    - extracted_at
    - page_count

backends:
  pdf: native        # pdfplumber + pypdf
  docx: native       # defusedxml
  pptx: native       # python-pptx
  xlsx: native       # openpyxl
  csv: native        # stdlib csv -> GFM table
  json: native       # stdlib json -> fenced code block
  yaml: native       # PyYAML -> fenced code block
  image: native      # Obsidian ![[wikilink]] embed
  html: markitdown   # .html / .htm via markitdown (no native backend)
  default: native    # fallback for unknown extensions

extraction:
  timeout_seconds: 120
  max_file_size_mb: 100        # enforced both in discovery and at the
                               # extract_file/extract_text entry points
  xlsx_max_rows_per_sheet: 500
  isolation: thread            # thread = lower latency (one-shot CLI use);
                               # process = killed on timeout + memory isolation
                               # (recommended for long-running daemons; see
                               # the __main__-guard and per-file import-cost
                               # notes above)

# Pass-through: copy files as-is without extraction
passthrough:
  extensions: [".md", ".markdown", ".canvas"]
  paths: ["raw/**"]
  patterns: []
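
The isolation comment says only process mode kills work on timeout. The general Python behaviour behind that trade-off can be sketched with concurrent.futures: a caller can stop waiting on a thread, but the thread itself keeps running. This is a standalone illustration, not the library's thread-mode implementation; slow_task and run_with_thread_timeout are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_task(seconds: float, log: list[str]) -> str:
    """Stand-in for an extraction call; records when it actually finishes."""
    time.sleep(seconds)
    log.append("finished")
    return "done"


def run_with_thread_timeout(seconds: float, timeout: float, log: list[str]) -> str:
    """Wait up to `timeout` for the task; the worker thread is never killed."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(slow_task, seconds, log)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return "timed out"
    finally:
        executor.shutdown(wait=False)  # do not block on the still-running worker


log: list[str] = []
print(run_with_thread_timeout(seconds=0.3, timeout=0.05, log=log))  # caller gives up
time.sleep(0.5)
print(log)  # the abandoned worker still ran to completion
```

In a long-running daemon, abandoned workers like this can pile up, which fits the recommendation of process isolation for that case.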

Backend Selection

| Backend | Extensions | Dependencies | Quality |
| --- | --- | --- | --- |
| native | .pdf, .docx, .pptx, .xlsx, .csv, .json, .yaml/.yml, images | Core (included) | Good for text-heavy documents |
| markitdown | Any | [markitdown] extra | Good fallback for HTML, etc. |
| docling | Any | [docling] extra | Best for complex layouts, tables |

Format-Specific Behavior

| Format | Native Backend Output |
| --- | --- |
| PDF | Page-by-page markdown with tables and metadata |
| DOCX | Headings, paragraphs, and tables from XML |
| PPTX | Slide-by-slide with titles, body text, and notes |
| XLSX | Sheet-by-sheet GFM markdown tables |
| CSV | GFM markdown table |
| JSON | Pretty-printed fenced code block |
| YAML/YML | Fenced code block |
| Images (PNG, JPG, GIF, SVG, WEBP, BMP, TIFF) | Obsidian wikilink embed ![[image.png]] |
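
For a sense of what the CSV row above amounts to, the following sketch turns CSV text into a GFM table using only the standard library. It is an independent illustration, not the native backend's code; treating the first row as the header and escaping pipe characters are assumptions.

```python
import csv
import io


def csv_to_gfm(text: str) -> str:
    """Render CSV text as a GFM table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row: list[str]) -> str:
        # Escape pipes and flatten newlines so a cell cannot break the table,
        # then pad short rows to the full column count.
        cells = [cell.replace("|", "\\|").replace("\n", " ") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), separator, *(render(r) for r in body)])


print(csv_to_gfm('name,qty\nwidget,3\n"a|b",7\n'))
```

A row limit comparable to xlsx_max_rows_per_sheet could be applied by slicing body before rendering.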

Pass-Through Mode

Files matching pass-through rules are copied to the output directory as-is, without extraction or conversion. This is useful for:

  • .md files that are already Obsidian-ready
  • .csv, .json, .yaml files used by Obsidian plugins (e.g., Dataview)
  • Any file type where transformation is unwanted

Pass-through rules are evaluated before backend dispatch. A file matches if it hits any rule (OR logic):

passthrough:
  # Extension list (cheapest check, runs first)
  extensions: [".md", ".markdown", ".canvas"]

  # fnmatch patterns (matched against full source path string;
  # '*' matches '/', so '**/' is not needed for directory traversal)
  paths: ["notes/raw/**", "**/*.template.*"]

  # Regex patterns (matched against full source path string)
  patterns: [".*\\.generated\\..*"]

Decision tree:

File discovered
  |
  +- matches passthrough?
       |
       +- YES -> COPY as-is (no .md wrapper)
       |
       +- NO  -> backend dispatch -> extract -> write .md
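
The OR logic and the fnmatch behaviour described above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the library's matcher: is_passthrough and the RULES layout are hypothetical, extensions are compared case-insensitively, and regexes are searched anywhere in the path.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Rule values copied from the example above; the dict layout is an assumption.
RULES = {
    "extensions": [".md", ".markdown", ".canvas"],
    "paths": ["notes/raw/**", "**/*.template.*"],
    "patterns": [r".*\.generated\..*"],
}


def is_passthrough(path: str, rules: dict[str, list[str]]) -> bool:
    """Return True if the path hits any rule (OR logic), cheapest check first."""
    if PurePosixPath(path).suffix.lower() in rules["extensions"]:
        return True
    # fnmatch translates '*' to a match-anything pattern, so it also crosses '/'.
    if any(fnmatchcase(path, pattern) for pattern in rules["paths"]):
        return True
    # Assumption: regexes are searched anywhere in the path string.
    return any(re.search(pattern, path) for pattern in rules["patterns"])


for candidate in [
    "README.md",
    "notes/raw/2024/scan.pdf",
    "shared/letter.template.docx",
    "build/api.generated.json",
    "docs/report.pdf",
]:
    action = "copy" if is_passthrough(candidate, RULES) else "extract"
    print(f"{candidate:32} {action}")
```

Only docs/report.pdf falls through every rule here, so it is the one file that would go on to backend dispatch.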

Media Extraction

PDF, DOCX, and PPTX files can contain embedded images. Enable media extraction to save these as separate files alongside the markdown output:

media:
  extract_images: true     # enable/disable embedded image extraction
  image_format: png        # output format: png, jpg, webp
  image_max_dimension: 0   # max width/height in px (0 = no resize)

Extracted images are saved in per-document media folders (<doc-stem>/) and referenced via Obsidian wikilinks (![[doc-stem/image_001.png]]).
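
The naming scheme and the resize setting can be sketched as two small helpers. Both functions are hypothetical. The three-digit zero padding is inferred from the image_001.png example, and preserving aspect ratio when image_max_dimension is set is an assumption about how the resize behaves.

```python
from pathlib import PurePosixPath


def media_wikilink(doc_path: str, index: int, image_format: str) -> str:
    """Build the embed for the index-th image of a document (1-based)."""
    stem = PurePosixPath(doc_path).stem
    return f"![[{stem}/image_{index:03d}.{image_format}]]"


def fit_within(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Scale so the longer side is at most max_dimension; 0 means no resize."""
    longest = max(width, height)
    if max_dimension <= 0 or longest <= max_dimension:
        return width, height
    scale = max_dimension / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(media_wikilink("reports/annual.pdf", 1, "png"))  # ![[annual/image_001.png]]
print(fit_within(4000, 3000, 1024))                    # (1024, 768)
print(fit_within(4000, 3000, 0))                       # (4000, 3000)
```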

To disable media extraction (e.g., for text-only pipelines), set extract_images: false in your config YAML or pass extract_images=False to config_for_backend().

Image Handling

Images are handled differently from text documents. The native image backend generates an Obsidian wikilink embed:

---
title: diagram
source: obsidian-import
file_type: png
---

![[diagram.png]]

The image file is automatically copied alongside the .md output so Obsidian can render it inline. Supported formats: PNG, JPG, JPEG, GIF, SVG, WEBP, BMP, TIFF.

CLI Reference

| Command | Description |
| --- | --- |
| obsidian-import convert <path> | Extract a single file |
| obsidian-import discover --config <yaml> | List matching files |
| obsidian-import batch --config <yaml> | Extract all discovered files (with pass-through) |
| obsidian-import doctor | Check backend availability |

Output Format

Extracted files are written as Obsidian-flavored markdown with YAML frontmatter:

---
title: Annual Report
source: obsidian-import
original_path: /documents/report.pdf
file_type: pdf
extracted_at: 2026-03-09T10:30:00Z
page_count: 12
---

# Annual Report

## Page 1

Content extracted from the first page...
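
How metadata_fields might shape the frontmatter can be sketched as follows. render_note is a hypothetical helper, not format_output. It assumes every value is a plain scalar that needs no YAML quoting, and that fields missing from the metadata are simply skipped.

```python
def render_note(metadata: dict[str, object], body: str, fields: list[str]) -> str:
    """Emit frontmatter for the selected fields (in order), then the body."""
    lines = ["---"]
    for field in fields:
        if field in metadata:  # skip fields the extractor did not produce
            # Assumption: values are plain scalars that need no YAML quoting.
            lines.append(f"{field}: {metadata[field]}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.rstrip("\n") + "\n"


metadata = {
    "title": "Annual Report",
    "source": "obsidian-import",
    "original_path": "/documents/report.pdf",
    "file_type": "pdf",
    "extracted_at": "2026-03-09T10:30:00Z",
    "page_count": 12,
}
fields = ["title", "source", "original_path", "file_type", "extracted_at", "page_count"]
print(render_note(metadata, "# Annual Report\n\n## Page 1\n", fields))
```

Real titles can contain colons or other YAML-significant characters, so a production version would serialize values with a YAML library instead of string formatting.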

License

MIT
