obsidian-import
Extract files (PDF, DOCX, PPTX, XLSX, CSV, JSON, YAML, images) into Obsidian-flavored Markdown.
The mirror of obsidian-export: where obsidian-export converts Obsidian notes to PDF/DOCX, obsidian-import converts external documents into Obsidian-ready markdown with YAML frontmatter.
Installation
pip install obsidian-import
With optional backends:
pip install "obsidian-import[markitdown]" # fallback for HTML, etc.
pip install "obsidian-import[docling]" # high-quality ML-based extraction
Quick Start
Single file
obsidian-import convert report.pdf --output vault/imports/report.md
Batch extraction
obsidian-import batch --config config.yaml
Check backend availability
obsidian-import doctor
Python API
from pathlib import Path
from obsidian_import import extract_file, extract_text, discover_files, config_for_backend
from obsidian_import.config import load_config
from obsidian_import.output import format_output
config = load_config(Path("config.yaml"))
# Single file (full document with frontmatter)
doc = extract_file(Path("report.pdf"), config)
markdown = format_output(doc, config.output)
# Quick text extraction (no config file needed)
quick_config = config_for_backend("markitdown", timeout_seconds=60, max_file_size_mb=50, xlsx_max_rows_per_sheet=500)
text = extract_text(Path("report.pdf"), quick_config)
# Batch discovery
for file in discover_files(config):
    print(f"{file.extension} {file.size_bytes:,} bytes {file.path}")
config_for_backend() — Quick Configuration
For consumers that just need text extraction without managing the full config surface:
from pathlib import Path
from obsidian_import import extract_text, config_for_backend
config = config_for_backend(
    backend="markitdown",
    timeout_seconds=60,
    max_file_size_mb=50,
    xlsx_max_rows_per_sheet=500,
)
text = extract_text(Path("document.docx"), config)
This sets all backends to the specified backend name and disables media extraction. All parameters are required — no hidden defaults.
Configuration
Create a config.yaml:
input:
  directories:
    - path: /path/to/documents
      extensions: [".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".json", ".yaml", ".png", ".jpg"]
      exclude: ["*.tmp", "~$*"]
output:
  directory: ./extracted
  frontmatter: true
  metadata_fields:
    - title
    - source
    - original_path
    - file_type
    - extracted_at
    - page_count
backends:
  pdf: native # pdfplumber + pypdf
  docx: native # defusedxml
  pptx: native # python-pptx
  xlsx: native # openpyxl
  csv: native # stdlib csv -> GFM table
  json: native # stdlib json -> fenced code block
  yaml: native # PyYAML -> fenced code block
  image: native # Obsidian ![[wikilink]] embed
  default: native # fallback for unknown extensions
extraction:
  timeout_seconds: 120
  max_file_size_mb: 100
  xlsx_max_rows_per_sheet: 500
# Pass-through: copy files as-is without extraction
passthrough:
  extensions: [".md", ".markdown", ".canvas"]
  paths: ["raw/**"]
  patterns: []
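To make the input settings concrete, here is a minimal sketch of how a directory could be filtered by extension and exclude globs. It is not the package's implementation: the `discover` function is hypothetical, and it assumes that extensions are compared case-insensitively and that exclude globs are tested against the bare file name.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def discover(directory: Path, extensions: list[str], exclude: list[str]) -> list[Path]:
    """Return files under `directory` with an allowed extension and no excluded name."""
    allowed = {ext.lower() for ext in extensions}
    found = []
    for path in sorted(directory.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in allowed:
            continue
        # Assumption: exclude globs are matched against the file name only.
        if any(fnmatchcase(path.name, pattern) for pattern in exclude):
            continue
        found.append(path)
    return found
```

With the exclude list from the sample config, Office lock files such as ~$report.docx and any *.tmp files would be skipped even when their extension is otherwise allowed.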
Backend Selection
| Backend | Extensions | Dependencies | Quality |
|---|---|---|---|
| native | .pdf, .docx, .pptx, .xlsx, .csv, .json, .yaml/.yml, images | Core (included) | Good for text-heavy documents |
| markitdown | Any | [markitdown] extra | Good fallback for HTML, etc. |
| docling | Any | [docling] extra | Best for complex layouts, tables |
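The per-format keys in the backends block, together with the default fallback, suggest a simple lookup by file extension. The sketch below illustrates that idea only; `backend_key` and `select_backend` are hypothetical names, and the grouping of image extensions under one image key and of .yml under yaml is an assumption based on the tables in this document.

```python
from pathlib import Path

# Assumption: all image formats listed in this README share the "image" key.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".bmp", ".tiff"}


def backend_key(path: Path) -> str:
    """Map a file to the key used in the `backends` config block."""
    ext = path.suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return "image"
    if ext == ".yml":
        return "yaml"
    return ext.lstrip(".")


def select_backend(path: Path, backends: dict[str, str]) -> str:
    """Return the configured backend, falling back to `default` for unknown types."""
    return backends.get(backend_key(path), backends["default"])
```

For example, with `{"pdf": "native", "default": "markitdown"}`, a PDF goes to the native backend while an HTML file falls through to markitdown.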
Format-Specific Behavior
| Format | Native Backend Output |
|---|---|
| PDF | Page-by-page markdown with tables and metadata |
| DOCX | Headings, paragraphs, and tables from XML |
| PPTX | Slide-by-slide with titles, body text, and notes |
| XLSX | Sheet-by-sheet GFM markdown tables |
| CSV | GFM markdown table |
| JSON | Pretty-printed fenced code block |
| YAML/YML | Fenced code block |
| Images (PNG, JPG, GIF, SVG, WEBP, BMP, TIFF) | Obsidian wikilink embed ![[image.png]] |
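As an illustration of the CSV row in the table above, the following sketch turns CSV text into a GFM table using only the standard library. It is an approximation rather than the native backend's code: `csv_to_gfm` is a hypothetical name, the first row is assumed to be the header, and pipe characters are escaped so they do not break the table.

```python
import csv
import io


def csv_to_gfm(text: str) -> str:
    """Render CSV text as a GitHub Flavored Markdown table."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row: list[str]) -> str:
        # Escape pipes and flatten newlines so each record stays on one table row.
        cells = [cell.replace("|", "\\|").replace("\n", " ") for cell in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "|" + "---|" * width]
    lines += [render(row) for row in body]
    return "\n".join(lines)
```

Short rows are padded with empty cells so that every line has the same number of columns, which GFM requires for the table to render.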
Pass-Through Mode
Files matching pass-through rules are copied to the output directory as-is, without extraction or conversion. This is useful for:
- .md files that are already Obsidian-ready
- .csv, .json, .yaml files used by Obsidian plugins (e.g., Dataview)
- Any file type where transformation is unwanted
Pass-through rules are evaluated before backend dispatch. A file matches if it hits any rule (OR logic):
passthrough:
  # Extension list (cheapest check, runs first)
  extensions: [".md", ".markdown", ".canvas"]
  # fnmatch patterns (matched against full source path string;
  # '*' matches '/', so '**/' is not needed for directory traversal)
  paths: ["notes/raw/**", "**/*.template.*"]
  # Regex patterns (matched against full source path string)
  patterns: [".*\\.generated\\..*"]
Decision tree:
File discovered
  |
  +- matches passthrough? YES -> COPY as-is (no .md wrapper)
  |
  +- matches passthrough? NO -> backend dispatch -> extract -> write .md
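The OR logic across the three rule types can be sketched as follows. This is an illustrative approximation, not the package's code: `is_passthrough` is a hypothetical name, extension comparison is assumed to be case-insensitive, and the regex rules are assumed to use search semantics (the sample pattern behaves the same under a full match).

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_passthrough(path: str, extensions: list[str], paths: list[str], patterns: list[str]) -> bool:
    """Return True if `path` hits any pass-through rule (OR logic)."""
    # 1. Extension list: the cheapest check, so it runs first.
    if PurePosixPath(path).suffix.lower() in extensions:
        return True
    # 2. fnmatch globs against the full path string ('*' also matches '/').
    if any(fnmatchcase(path, glob) for glob in paths):
        return True
    # 3. Regular expressions against the full path string.
    return any(re.search(rx, path) for rx in patterns)
```

Because fnmatch does not treat '/' specially, a glob such as `a/*.txt` also matches `a/b/c.txt`, which is why the config comment notes that `**/` is not needed for directory traversal.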
Media Extraction
PDF, DOCX, and PPTX files can contain embedded images. Enable media extraction to save these as separate files alongside the markdown output:
media:
  extract_images: true # enable/disable embedded image extraction
  image_format: png # output format: png, jpg, webp
  image_max_dimension: 0 # max width/height in px (0 = no resize)
Extracted images are saved in per-document media folders (<doc-stem>/) and referenced via Obsidian wikilinks (![[doc-stem/image_001.png]]).
To disable media extraction (e.g., for text-only pipelines), set extract_images: false, or use config_for_backend(), which disables it by default.
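The naming scheme and the image_max_dimension setting can be illustrated with two small helpers. Both are hypothetical sketches: the `image_001` numbering follows the wikilink example above, and the resize rule assumes that images are only ever scaled down, with the aspect ratio preserved.

```python
def media_links(doc_stem: str, count: int, image_format: str) -> list[str]:
    """Build wikilink embeds for `count` images stored in the <doc-stem>/ folder."""
    return [f"![[{doc_stem}/image_{i:03d}.{image_format}]]" for i in range(1, count + 1)]


def fit_within(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most `max_dimension`."""
    longest = max(width, height)
    # 0 means "no resize"; images already within the limit are left untouched.
    if max_dimension <= 0 or longest <= max_dimension:
        return width, height
    scale = max_dimension / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For instance, a 4000x2000 image with a limit of 1000 would come out at 1000x500, while a limit of 0 leaves it unchanged.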
Image Handling
Images are handled differently from text documents. The native image backend generates an Obsidian wikilink embed:
---
title: diagram
source: obsidian-import
file_type: png
---
![[diagram.png]]
The image file is automatically copied alongside the .md output so Obsidian can render it inline. Supported formats: PNG, JPG, JPEG, GIF, SVG, WEBP, BMP, TIFF.
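A rough sketch of this behavior, using only the standard library, is shown below. `write_image_note` is a hypothetical helper rather than part of the package, and the frontmatter is limited to the three fields from the example above.

```python
import shutil
from pathlib import Path


def write_image_note(image: Path, output_dir: Path) -> Path:
    """Copy `image` into `output_dir` and write a note that embeds it."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / image.name
    # Copy the image next to the note so the embed resolves inside the vault.
    if target.resolve() != image.resolve():
        shutil.copy2(image, target)
    note = output_dir / f"{image.stem}.md"
    note.write_text(
        "---\n"
        f"title: {image.stem}\n"
        "source: obsidian-import\n"
        f"file_type: {image.suffix.lstrip('.').lower()}\n"
        "---\n"
        f"![[{image.name}]]\n",
        encoding="utf-8",
    )
    return note
```

Calling it on diagram.png would produce diagram.md plus a copy of diagram.png in the same output folder.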
CLI Reference
| Command | Description |
|---|---|
| obsidian-import convert <path> | Extract a single file |
| obsidian-import discover --config <yaml> | List matching files |
| obsidian-import batch --config <yaml> | Extract all discovered files (with pass-through) |
| obsidian-import doctor | Check backend availability |
Output Format
Extracted files are written as Obsidian-flavored markdown with YAML frontmatter:
---
title: Annual Report
source: obsidian-import
original_path: /documents/report.pdf
file_type: pdf
extracted_at: 2026-03-09T10:30:00Z
page_count: 12
---
# Annual Report
## Page 1
Content extracted from the first page...
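To show how a metadata_fields list could drive the frontmatter, here is a minimal sketch. It is not the package's `format_output`: `render_frontmatter` is a hypothetical helper, values are assumed to be JSON-serializable, and strings are always double-quoted for safety, so the result is equivalent YAML but not byte-identical to the unquoted sample above.

```python
import json


def render_frontmatter(metadata: dict, fields: list[str]) -> str:
    """Render the selected metadata fields, in order, as a YAML frontmatter block."""
    lines = ["---"]
    for name in fields:
        value = metadata.get(name)
        if value is None:
            continue  # e.g. page_count is absent for formats without pages
        # JSON scalars are valid YAML, so json.dumps gives safe quoting for free.
        lines.append(f"{name}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Fields that are not listed are dropped, and listed fields with no value are skipped instead of being written as empty keys.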
Related Packages
- obsidian-export -- Convert Obsidian notes to PDF/DOCX
- agentic-brain -- Agentic knowledge management (consumes both packages)
License
MIT