obsidian-import

Extract files (PDF, DOCX, PPTX, XLSX, CSV, JSON, YAML, images) into Obsidian-flavored Markdown.

The mirror of obsidian-export: where obsidian-export converts Obsidian notes to PDF/DOCX, obsidian-import converts external documents into Obsidian-ready markdown with YAML frontmatter.

Installation

pip install obsidian-import

With optional backends:

pip install "obsidian-import[markitdown]"    # fallback for HTML, etc.
pip install "obsidian-import[docling]"       # high-quality ML-based extraction

Quick Start

Single file

obsidian-import convert report.pdf --output vault/imports/report.md

Batch extraction

obsidian-import batch --config config.yaml

Check backend availability

obsidian-import doctor
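
As a rough illustration of what an availability check involves, the sketch below probes whether optional Python modules can be imported, without actually importing them. It is not the implementation of `obsidian-import doctor`: the helper name `probe_modules` is hypothetical, and the module names in the demo are assumptions about what each optional backend imports.

```python
from importlib.util import find_spec


def probe_modules(module_names: list[str]) -> dict[str, bool]:
    """Map each top-level module name to whether it can be imported."""
    # find_spec returns None for a missing top-level module and does not
    # execute the module's code, so the probe stays cheap.
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # Assumed import names for the optional backends; adjust as needed.
    for name, ok in probe_modules(["markitdown", "docling"]).items():
        print(f"{name}: {'available' if ok else 'missing'}")
```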

Python API

from pathlib import Path
from obsidian_import import extract_file, extract_text, discover_files, config_for_backend
from obsidian_import.config import load_config
from obsidian_import.output import format_output

config = load_config(Path("config.yaml"))

# Single file (full document with frontmatter)
doc = extract_file(Path("report.pdf"), config)
markdown = format_output(doc, config.output)

# Quick text extraction (no config file needed)
quick_config = config_for_backend(
    backend="markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
)
text = extract_text(Path("report.pdf"), quick_config)

# Batch discovery (uses the directories listed in config.yaml)
for file in discover_files(config):
    print(f"{file.extension}  {file.size_bytes:,} bytes  {file.path}")

config_for_backend() — Quick Configuration

For consumers that just need text extraction without managing the full config surface:

from pathlib import Path

from obsidian_import import extract_text, config_for_backend

config = config_for_backend(
    backend="markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
)
text = extract_text(Path("document.docx"), config)

This sets every entry under backends to the specified backend name and disables media extraction. All four parameters are required; there are no hidden defaults.

Configuration

Create a config.yaml:

input:
  directories:
    - path: /path/to/documents
      extensions: [".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".json", ".yaml", ".png", ".jpg"]
      exclude: ["*.tmp", "~$*"]

output:
  directory: ./extracted
  frontmatter: true
  metadata_fields:
    - title
    - source
    - original_path
    - file_type
    - extracted_at
    - page_count

backends:
  pdf: native        # pdfplumber + pypdf
  docx: native       # defusedxml
  pptx: native       # python-pptx
  xlsx: native       # openpyxl
  csv: native        # stdlib csv -> GFM table
  json: native       # stdlib json -> fenced code block
  yaml: native       # PyYAML -> fenced code block
  image: native      # Obsidian ![[wikilink]] embed
  default: native    # fallback for unknown extensions

extraction:
  timeout_seconds: 120
  max_file_size_mb: 100
  xlsx_max_rows_per_sheet: 500

# Pass-through: copy files as-is without extraction
passthrough:
  extensions: [".md", ".markdown", ".canvas"]
  paths: ["raw/**"]
  patterns: []
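
To make the `extensions` and `exclude` keys concrete, here is a minimal sketch of how a list of candidate paths could be filtered. It is not the package's `discover_files` implementation: the function name `filter_candidates` is hypothetical, and it assumes that `exclude` entries are fnmatch-style globs tested against the file name only and that extensions compare case-insensitively.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def filter_candidates(paths: list[str], extensions: list[str], exclude: list[str]) -> list[str]:
    """Keep paths with an allowed extension whose file name hits no exclude glob."""
    allowed = {ext.lower() for ext in extensions}
    kept = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Assumption: extension comparison ignores case (".PDF" == ".pdf").
        if path.suffix.lower() not in allowed:
            continue
        # Assumption: exclude globs apply to the file name, not the full path.
        if any(fnmatchcase(path.name, pattern) for pattern in exclude):
            continue
        kept.append(raw)
    return kept
```

With the patterns from the example config, `~$*` drops Office lock files such as `~$draft.docx` even though `.docx` is an allowed extension.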

Backend Selection

| Backend | Extensions | Dependencies | Quality |
|---|---|---|---|
| native | .pdf, .docx, .pptx, .xlsx, .csv, .json, .yaml/.yml, images | Core (included) | Good for text-heavy documents |
| markitdown | Any | [markitdown] extra | Good fallback for HTML, etc. |
| docling | Any | [docling] extra | Best for complex layouts, tables |

Format-Specific Behavior

| Format | Native Backend Output |
|---|---|
| PDF | Page-by-page markdown with tables and metadata |
| DOCX | Headings, paragraphs, and tables from XML |
| PPTX | Slide-by-slide with titles, body text, and notes |
| XLSX | Sheet-by-sheet GFM markdown tables |
| CSV | GFM markdown table |
| JSON | Pretty-printed fenced code block |
| YAML/YML | Fenced code block |
| Images (PNG, JPG/JPEG, GIF, SVG, WEBP, BMP, TIFF) | Obsidian wikilink embed ![[image.png]] |
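
To illustrate the CSV row above, the following sketch converts CSV text into a GFM table using only the standard library. It is an approximation of the idea, not the native backend's code; the function name `csv_to_gfm_table` is hypothetical, and it assumes the first CSV row is the header.

```python
import csv
import io


def csv_to_gfm_table(csv_text: str) -> str:
    """Render CSV text as a GFM table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row: list[str]) -> str:
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [cell.replace("|", "\\|") for cell in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)
```

Cells that contain line breaks would still need extra handling, since a GFM table row must stay on one line.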

Pass-Through Mode

Files matching pass-through rules are copied to the output directory as-is, without extraction or conversion. This is useful for:

  • .md files that are already Obsidian-ready
  • .csv, .json, .yaml files used by Obsidian plugins (e.g., Dataview)
  • Any file type where transformation is unwanted

Pass-through rules are evaluated before backend dispatch. A file matches if it hits any rule (OR logic):

passthrough:
  # Extension list (cheapest check, runs first)
  extensions: [".md", ".markdown", ".canvas"]

  # fnmatch patterns (matched against full source path string;
  # '*' matches '/', so '**/' is not needed for directory traversal)
  paths: ["notes/raw/**", "**/*.template.*"]

  # Regex patterns (matched against full source path string)
  patterns: [".*\\.generated\\..*"]

Decision tree:

File discovered
  |
  +- matches passthrough?
       |
       +- YES -> COPY as-is (no .md wrapper)
       |
       +- NO  -> backend dispatch -> extract -> write .md
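
The OR logic described in this section can be sketched with the standard library as follows. This is an illustrative approximation rather than the package's own code: the function name `is_passthrough` is hypothetical, and the use of `re.search` (rather than a full match) and case-insensitive extension comparison are assumptions.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_passthrough(source_path: str, extensions: list[str], paths: list[str], patterns: list[str]) -> bool:
    """Return True if the path hits any pass-through rule (OR logic)."""
    # 1. Extension list: the cheapest check, so it runs first.
    if PurePosixPath(source_path).suffix.lower() in {ext.lower() for ext in extensions}:
        return True
    # 2. fnmatch globs against the whole path string; '*' also matches '/'.
    if any(fnmatchcase(source_path, glob) for glob in paths):
        return True
    # 3. Regexes against the whole path string (re.search is an assumption).
    return any(re.search(pattern, source_path) for pattern in patterns)
```

Because fnmatch lets `*` cross directory separators, `notes/raw/**` matches files at any depth below `notes/raw/`.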

Media Extraction

PDF, DOCX, and PPTX files can contain embedded images. Enable media extraction to save these as separate files alongside the markdown output:

media:
  extract_images: true     # enable/disable embedded image extraction
  image_format: png        # output format: png, jpg, webp
  image_max_dimension: 0   # max width/height in px (0 = no resize)

Extracted images are saved in per-document media folders (<doc-stem>/) and referenced via Obsidian wikilinks (![[doc-stem/image_001.png]]).

To disable media extraction (e.g., for text-only pipelines), set extract_images: false, or use config_for_backend(), which disables it.
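
As a sketch of the two conventions in this section, the helpers below build a wikilink for an extracted image and compute a resized dimension pair. Both function names are hypothetical. The `image_NNN` numbering is inferred from the single example above, and preserving the aspect ratio when `image_max_dimension` applies is an assumption, not documented behavior.

```python
from pathlib import PurePosixPath


def media_wikilink(document: str, index: int, image_format: str) -> str:
    """Build an embed link into the per-document media folder (<doc-stem>/)."""
    stem = PurePosixPath(document).stem
    # Assumption: images are numbered image_001, image_002, ...
    return f"![[{stem}/image_{index:03d}.{image_format}]]"


def fit_within(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_dimension."""
    # 0 means "no resize"; images that already fit are left untouched.
    if max_dimension <= 0 or max(width, height) <= max_dimension:
        return width, height
    # Assumption: the aspect ratio is preserved when shrinking.
    scale = max_dimension / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```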

Image Handling

Images are handled differently from text documents. The native image backend generates an Obsidian wikilink embed:

---
title: diagram
source: obsidian-import
file_type: png
---

![[diagram.png]]

The image file is automatically copied alongside the .md output so Obsidian can render it inline. Supported formats: PNG, JPG, JPEG, GIF, SVG, WEBP, BMP, TIFF.

CLI Reference

| Command | Description |
|---|---|
| obsidian-import convert <path> | Extract a single file |
| obsidian-import discover --config <yaml> | List matching files |
| obsidian-import batch --config <yaml> | Extract all discovered files (with pass-through) |
| obsidian-import doctor | Check backend availability |

Output Format

Extracted files are written as Obsidian-flavored markdown with YAML frontmatter:

---
title: Annual Report
source: obsidian-import
original_path: /documents/report.pdf
file_type: pdf
extracted_at: 2026-03-09T10:30:00Z
page_count: 12
---

# Annual Report

## Page 1

Content extracted from the first page...
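
The relationship between `metadata_fields` in the config and the frontmatter block can be sketched as below. This is a simplified illustration, not the package's `format_output`: the function name `render_note` is hypothetical, and values are written verbatim, whereas a real YAML serializer would quote values that contain characters such as `:`.

```python
def render_note(metadata: dict[str, object], metadata_fields: list[str], body: str) -> str:
    """Prefix a markdown body with frontmatter limited to the selected fields."""
    lines = ["---"]
    for field in metadata_fields:
        # Fields are emitted in the configured order; unavailable ones are skipped.
        if field in metadata:
            lines.append(f"{field}: {metadata[field]}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body
```

Only fields named in `metadata_fields` reach the output, so extra metadata collected during extraction is dropped.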

License

MIT
