Extract files (PDF, DOCX, PPTX, XLSX) into Obsidian-flavored Markdown

obsidian-import

Extract files (PDF, DOCX, PPTX, XLSX, CSV, JSON, YAML, images) into Obsidian-flavored Markdown.

The mirror of obsidian-export: where obsidian-export converts Obsidian notes to PDF/DOCX, obsidian-import converts external documents into Obsidian-ready markdown with YAML frontmatter.

Installation

pip install obsidian-import

With optional backends:

pip install obsidian-import[markitdown]    # fallback for HTML, etc.
pip install obsidian-import[docling]       # high-quality ML-based extraction

Quick Start

Single file

obsidian-import convert report.pdf --output vault/imports/report.md

Batch extraction

obsidian-import batch --config config.yaml

Check backend availability

obsidian-import doctor
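
For a rough idea of what an availability check involves, the sketch below probes for the optional packages with importlib. It is not the implementation of the doctor command; the import names (markitdown, docling) and the helpers backend_available and report are assumptions made for illustration.

```python
from importlib.util import find_spec

# Assumed import names for the optional extras; not taken from obsidian-import itself.
OPTIONAL_BACKENDS = {"markitdown": "markitdown", "docling": "docling"}


def backend_available(module_name: str) -> bool:
    """Return True if the top-level module can be located, without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def report() -> dict[str, bool]:
    """Map each optional backend name to whether its module was found."""
    return {name: backend_available(mod) for name, mod in OPTIONAL_BACKENDS.items()}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name:12} {'available' if ok else 'missing'}")
```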

Python API

from pathlib import Path
from obsidian_import import extract_file, extract_text, discover_files, config_for_backend
from obsidian_import.config import load_config
from obsidian_import.output import format_output

config = load_config(Path("config.yaml"))

# Single file (full document with frontmatter)
doc = extract_file(Path("report.pdf"), config)
markdown = format_output(doc, config.output)

# Quick text extraction (no config file needed).
# A separate name keeps the file-based config available for discovery below.
quick_config = config_for_backend(
    "markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
    extract_images=False,
)
text = extract_text(Path("report.pdf"), quick_config)

# Batch discovery
for file in discover_files(config):
    print(f"{file.extension}  {file.size_bytes:,} bytes  {file.path}")
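
The calls above can be combined into a small conversion loop. The following is a minimal sketch, not the library's batch implementation: convert_all is a hypothetical helper, it takes the extract and render steps as plain callables so it can be exercised without the package installed, and it assumes each discovered item exposes a path attribute as in the loop above.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_all(
    files: Iterable,
    extract: Callable[[Path], object],
    render: Callable[[object], str],
    out_dir: Path,
) -> list[tuple[Path, Exception]]:
    """Convert every file to <stem>.md in out_dir; return the failures."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failures: list[tuple[Path, Exception]] = []
    for item in files:
        src = Path(item.path)
        try:
            markdown = render(extract(src))
        except Exception as exc:  # keep going; report the failure at the end
            failures.append((src, exc))
            continue
        # Note: two sources with the same stem would overwrite each other here.
        (out_dir / f"{src.stem}.md").write_text(markdown, encoding="utf-8")
    return failures
```

With the library installed and a file-based config loaded, the call would look like convert_all(discover_files(config), lambda p: extract_file(p, config), lambda d: format_output(d, config.output), Path("vault/imports")).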

config_for_backend() — Quick Configuration

For consumers that just need text extraction without managing the full config surface:

from pathlib import Path

from obsidian_import import extract_text, config_for_backend

config = config_for_backend(
    backend="markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
    extract_images=False,
)
text = extract_text(Path("document.docx"), config)

This sets the backend for every file type to the given backend name. All parameters are required; there are no hidden defaults.
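
Because every argument must be supplied, callers that convert files in several places may want to keep the values in one spot. A minimal sketch of that idea follows; QuickSettings and TEXT_ONLY are hypothetical names, not part of obsidian-import, and the field names simply mirror the keyword arguments shown above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QuickSettings:
    """Hypothetical holder for the five required arguments."""

    backend: str
    timeout_seconds: int
    max_file_size_mb: int
    xlsx_max_rows_per_sheet: int
    extract_images: bool


# One shared, immutable set of values for a text-only pipeline.
TEXT_ONLY = QuickSettings(
    backend="markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
    extract_images=False,
)

kwargs = asdict(TEXT_ONLY)
# With the library installed: config = config_for_backend(**kwargs)
```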

Configuration

Create a config.yaml:

input:
  directories:
    - path: /path/to/documents
      extensions: [".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".json", ".yaml", ".png", ".jpg"]
      exclude: ["*.tmp", "~$*"]

output:
  directory: ./extracted
  frontmatter: true
  metadata_fields:
    - title
    - source
    - original_path
    - file_type
    - extracted_at
    - page_count

backends:
  pdf: native        # pdfplumber + pypdf
  docx: native       # defusedxml
  pptx: native       # python-pptx
  xlsx: native       # openpyxl
  csv: native        # stdlib csv -> GFM table
  json: native       # stdlib json -> fenced code block
  yaml: native       # PyYAML -> fenced code block
  image: native      # Obsidian ![[wikilink]] embed
  html: markitdown   # .html / .htm via markitdown (no native backend)
  default: native    # fallback for unknown extensions

extraction:
  timeout_seconds: 120
  max_file_size_mb: 100
  xlsx_max_rows_per_sheet: 500

# Pass-through: copy files as-is without extraction
passthrough:
  extensions: [".md", ".markdown", ".canvas"]
  paths: ["raw/**"]
  patterns: []
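
To make the input filters concrete, here is a sketch of how an extensions list and exclude patterns could be applied to a candidate path. It rests on two assumptions that the configuration above does not confirm: exclude patterns are fnmatch-style globs tested against the file name, and extensions are compared case-insensitively. The helper is_candidate is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_candidate(path: str, extensions: list[str], exclude: list[str]) -> bool:
    """Return True if the path has a wanted extension and no exclude pattern hits."""
    p = PurePosixPath(path)
    wanted = {ext.lower() for ext in extensions}
    if p.suffix.lower() not in wanted:
        return False
    # Assumption: exclude globs are matched against the bare file name.
    return not any(fnmatchcase(p.name, pattern) for pattern in exclude)


if __name__ == "__main__":
    extensions = [".pdf", ".docx"]
    exclude = ["*.tmp", "~$*"]
    for candidate in ("docs/report.pdf", "docs/~$draft.docx", "docs/notes.txt"):
        print(candidate, is_candidate(candidate, extensions, exclude))
```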

Backend Selection

| Backend | Extensions | Dependencies | Quality |
| --- | --- | --- | --- |
| native | .pdf, .docx, .pptx, .xlsx, .csv, .json, .yaml/.yml, images | Core (included) | Good for text-heavy documents |
| markitdown | Any | [markitdown] extra | Good fallback for HTML, etc. |
| docling | Any | [docling] extra | Best for complex layouts, tables |
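
The backends mapping in the configuration can be read as a lookup from file extension to backend name, with default as the fallback. The sketch below illustrates that reading only; resolve_backend and the ALIASES table (grouping .yml under yaml, .htm under html, and the image formats under image) are assumptions, not the library's dispatch code.

```python
from pathlib import PurePosixPath

# Mirrors the backends section of the sample config.yaml.
BACKENDS = {
    "pdf": "native", "docx": "native", "pptx": "native", "xlsx": "native",
    "csv": "native", "json": "native", "yaml": "native", "image": "native",
    "html": "markitdown", "default": "native",
}

# Assumed aliases: several extensions share one config key.
IMAGE_EXTENSIONS = ("png", "jpg", "jpeg", "gif", "svg", "webp", "bmp", "tiff")
ALIASES = {"yml": "yaml", "htm": "html", **{ext: "image" for ext in IMAGE_EXTENSIONS}}


def resolve_backend(path: str, backends: dict[str, str]) -> str:
    """Pick the backend name for a path, falling back to the 'default' entry."""
    ext = PurePosixPath(path).suffix.lstrip(".").lower()
    key = ALIASES.get(ext, ext)
    return backends.get(key, backends["default"])
```

Switching one format to another backend is then a single-key change, for example setting pdf to docling while leaving everything else on native.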

Format-Specific Behavior

| Format | Native Backend Output |
| --- | --- |
| PDF | Page-by-page markdown with tables and metadata |
| DOCX | Headings, paragraphs, and tables from XML |
| PPTX | Slide-by-slide with titles, body text, and notes |
| XLSX | Sheet-by-sheet GFM markdown tables |
| CSV | GFM markdown table |
| JSON | Pretty-printed fenced code block |
| YAML/YML | Fenced code block |
| Images (PNG, JPG, GIF, SVG, WEBP, BMP, TIFF) | Obsidian wikilink embed ![[image.png]] |
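
As an illustration of the CSV row, the sketch below turns CSV text into a GFM table using only the standard library. It shows the general technique rather than the native backend's code; csv_to_gfm is a hypothetical name, and treating the first row as the header is an assumption.

```python
import csv
import io


def csv_to_gfm(text: str) -> str:
    """Render CSV text as a GFM table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: list[str]) -> str:
        # Escape pipes and flatten newlines so each cell stays on one table line.
        cells = [cell.replace("|", "\\|").replace("\n", " ") for cell in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *(fmt(row) for row in body)])


if __name__ == "__main__":
    print(csv_to_gfm("name,qty\nbolt,4\nnut,12\n"))
```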

Pass-Through Mode

Files matching pass-through rules are copied to the output directory as-is, without extraction or conversion. This is useful for:

  • .md files that are already Obsidian-ready
  • .csv, .json, .yaml files used by Obsidian plugins (e.g., Dataview)
  • Any file type where transformation is unwanted

Pass-through rules are evaluated before backend dispatch. A file matches if it hits any rule (OR logic):

passthrough:
  # Extension list (cheapest check, runs first)
  extensions: [".md", ".markdown", ".canvas"]

  # fnmatch patterns (matched against full source path string;
  # '*' matches '/', so '**/' is not needed for directory traversal)
  paths: ["notes/raw/**", "**/*.template.*"]

  # Regex patterns (matched against full source path string)
  patterns: [".*\\.generated\\..*"]

Decision tree:

File discovered
  |
  +- matches passthrough?
       |
       +- YES -> COPY as-is (no .md wrapper)
       |
       +- NO  -> backend dispatch -> extract -> write .md
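
The OR logic across the three rule types can be sketched as follows. This is an illustration under stated assumptions, not the library's matcher: is_passthrough is a hypothetical helper, paths are treated as POSIX-style strings, and regexes are applied with re.search.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_passthrough(
    path: str, extensions: list[str], paths: list[str], patterns: list[str]
) -> bool:
    """Return True if any rule matches (extension, then glob, then regex)."""
    # 1. Extension list: cheapest check, so it runs first.
    if PurePosixPath(path).suffix.lower() in extensions:
        return True
    # 2. fnmatch globs against the full path; '*' also matches '/'.
    if any(fnmatchcase(path, glob) for glob in paths):
        return True
    # 3. Regex patterns against the full path (re.search is an assumption).
    return any(re.search(regex, path) for regex in patterns)


RULES = {
    "extensions": [".md", ".markdown", ".canvas"],
    "paths": ["notes/raw/**", "**/*.template.*"],
    "patterns": [r".*\.generated\..*"],
}

if __name__ == "__main__":
    for candidate in ("README.md", "notes/raw/sub/data.csv", "docs/report.pdf"):
        print(candidate, is_passthrough(candidate, **RULES))
```

Note that a glob such as notes/raw/** is anchored at the start of the path string, so it matches notes/raw/sub/data.csv but not vault/notes/raw/data.csv.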

Media Extraction

PDF, DOCX, and PPTX files can contain embedded images. Enable media extraction to save these as separate files alongside the markdown output:

media:
  extract_images: true     # enable/disable embedded image extraction
  image_format: png        # output format: png, jpg, webp
  image_max_dimension: 0   # max width/height in px (0 = no resize)

Extracted images are saved in per-document media folders (<doc-stem>/) and referenced via Obsidian wikilinks (![[doc-stem/image_001.png]]).

To disable media extraction (e.g., for text-only pipelines), set extract_images: false in your config YAML or pass extract_images=False to config_for_backend().
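
The naming scheme and the image_max_dimension setting can be illustrated with two small helpers. Both are hypothetical sketches: numbering from 1 with three digits is inferred from the image_001.png example, and preserving the aspect ratio without upscaling smaller images is an assumption about how the size limit behaves.

```python
from pathlib import PurePosixPath


def media_wikilink(doc_path: str, index: int, image_format: str = "png") -> str:
    """Build the wikilink for the index-th image extracted from a document."""
    stem = PurePosixPath(doc_path).stem
    return f"![[{stem}/image_{index:03d}.{image_format}]]"


def fit_within(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_dimension.

    A max_dimension of 0 means no resize; smaller images are left unchanged.
    """
    longest = max(width, height)
    if max_dimension <= 0 or longest <= max_dimension:
        return width, height
    scale = max_dimension / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(media_wikilink("inbox/report.pdf", 1))
    print(fit_within(4000, 2000, 1000))
```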

Image Handling

Images are handled differently from text documents. The native image backend generates an Obsidian wikilink embed:

---
title: diagram
source: obsidian-import
file_type: png
---

![[diagram.png]]

The image file is automatically copied alongside the .md output so Obsidian can render it inline. Supported formats: PNG, JPG, JPEG, GIF, SVG, WEBP, BMP, TIFF.
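
The copy-plus-note behavior described above can be approximated in a few lines. The sketch below is not the native image backend; write_image_note is a hypothetical helper that writes only the three frontmatter fields shown in the example and assumes the output directory differs from the image's own directory.

```python
import shutil
from pathlib import Path


def write_image_note(image: Path, out_dir: Path) -> Path:
    """Copy the image into out_dir and write a note that embeds it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # The image must sit next to the note for the embed to resolve by name.
    shutil.copy2(image, out_dir / image.name)
    note = out_dir / f"{image.stem}.md"
    note.write_text(
        "---\n"
        f"title: {image.stem}\n"
        "source: obsidian-import\n"
        f"file_type: {image.suffix.lstrip('.').lower()}\n"
        "---\n"
        "\n"
        f"![[{image.name}]]\n",
        encoding="utf-8",
    )
    return note
```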

CLI Reference

| Command | Description |
| --- | --- |
| obsidian-import convert <path> | Extract a single file |
| obsidian-import discover --config <yaml> | List matching files |
| obsidian-import batch --config <yaml> | Extract all discovered files (with pass-through) |
| obsidian-import doctor | Check backend availability |

Output Format

Extracted files are written as Obsidian-flavored markdown with YAML frontmatter:

---
title: Annual Report
source: obsidian-import
original_path: /documents/report.pdf
file_type: pdf
extracted_at: 2026-03-09T10:30:00Z
page_count: 12
---

# Annual Report

## Page 1

Content extracted from the first page...
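
Downstream tools often need the metadata and the body separately. The following sketch splits the frontmatter from the content using only the standard library; split_frontmatter is a hypothetical helper, and it assumes flat key: value lines as in the example above, returning every value as a string. A real YAML parser would be the safer choice for nested or quoted values.

```python
def split_frontmatter(markdown: str) -> tuple[dict[str, str], str]:
    """Return (metadata, body); metadata is empty if no frontmatter is found."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, markdown
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


if __name__ == "__main__":
    sample = "---\ntitle: Annual Report\npage_count: 12\n---\n\n# Annual Report\n"
    metadata, content = split_frontmatter(sample)
    print(metadata["title"], "|", content)
```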

License

MIT
