Tools for markdown parsing and generation

dn

Markdown parsing and generation

To install: pip install dn

Optional Dependencies

This package supports converting various file formats to Markdown, with each format requiring specific dependencies:

Format      Required Package(s)
----------- -------------------
PDF         pypdf
Word        mammoth
Excel       pandas, openpyxl, tabulate
PowerPoint  python-pptx
HTML        html2text
Notebooks   nbconvert, nbformat

Installation Options

You can install these dependencies later, if and when the package complains that one of them is missing.

You can also install these when installing dn, like so:

    # Install with minimal dependencies
    pip install dn

    # Install with support for specific formats
    # (the quotes keep shells such as zsh from treating the brackets as a glob)
    pip install "dn[pdf]"             # PDF conversion support
    pip install "dn[word]"            # Word document support
    pip install "dn[excel]"           # Excel support
    pip install "dn[powerpoint]"      # PowerPoint support
    pip install "dn[html]"            # HTML conversion
    pip install "dn[notebook]"        # Jupyter Notebook support

    # Install multiple format support
    pip install "dn[pdf,word,excel]"  # Multiple formats

    # Install all optional dependencies
    pip install "dn[all]"
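
To see ahead of time which optional dependencies are missing, something like the following sketch may help. It is not part of dn: the `missing_modules` helper and the `FORMAT_MODULES` table are hypothetical names, built from the table of required packages above and the extras names listed in the install commands. It uses only the standard library's `importlib.util.find_spec`.

```python
from importlib.util import find_spec

# Import names for each format's packages, taken from the table above.
# Note that the python-pptx distribution is imported as `pptx`.
FORMAT_MODULES = {
    "pdf": ["pypdf"],
    "word": ["mammoth"],
    "excel": ["pandas", "openpyxl", "tabulate"],
    "powerpoint": ["pptx"],
    "html": ["html2text"],
    "notebook": ["nbconvert", "nbformat"],
}


def missing_modules(format_modules=FORMAT_MODULES):
    """Map each format to the modules that cannot be found, if any."""
    report = {}
    for fmt, modules in format_modules.items():
        missing = [name for name in modules if find_spec(name) is None]
        if missing:
            report[fmt] = missing
    return report


# Print one install hint per format that has missing modules.
for fmt, modules in missing_modules().items():
    print(f"{fmt}: missing {', '.join(modules)}; try: pip install \"dn[{fmt}]\"")
```

If every optional dependency is already installed, the loop prints nothing.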

Examples

To and from Jupyter notebooks

from dn import markdown_to_notebook

sample_markdown = """# Sample Notebook

This is a markdown cell with some explanation.

```python
# This is a code cell
print("Hello, World!")
x = 42
print(f"The answer is {x}")
```

## Another Section

More markdown content here.

```python
# Another code cell
def greet(name):
    return f"Hello, {name}!"

print(greet("Jupyter"))
```

Final markdown cell."""

# Test basic functionality

notebook = markdown_to_notebook(sample_markdown)
print(f"Created notebook with {len(notebook['cells'])} cells")

# Test with file output

output_path = markdown_to_notebook(
    sample_markdown,
    egress="./sample_notebook.ipynb"
)
print(f"Saved notebook to: {output_path}")

Output:

Created notebook with 5 cells
Saved notebook to: /Users/thorwhalen/Dropbox/py/proj/t/dn/misc/sample_notebook.ipynb

from dn import notebook_to_markdown

md_string = notebook_to_markdown(notebook)
print(md_string)

Output:

# Sample Notebook

This is a markdown cell with some explanation.



```python
# This is a code cell
print("Hello, World!")
x = 42
print(f"The answer 
...
nt(greet("Jupyter"))

```

Final markdown cell.
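
The five cells reported above match the way the sample alternates between prose and fenced code. The sketch below illustrates that splitting idea with the standard library only. It is an assumption about the general technique, not dn's actual implementation, and `split_markdown_into_cells` is a hypothetical name. The fence marker is built programmatically so that the example does not need nested literal fences.

```python
import re

FENCE = "`" * 3  # three backticks, built here to avoid nesting literal fences

# Matches one fenced python block; group 1 is the code inside the fence.
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)\n?" + FENCE, re.DOTALL)


def split_markdown_into_cells(markdown):
    """Return a list of (cell_type, source) pairs, in document order."""
    cells = []
    position = 0
    for match in CODE_BLOCK.finditer(markdown):
        # Any prose before this code block becomes a markdown cell.
        prose = markdown[position:match.start()].strip()
        if prose:
            cells.append(("markdown", prose))
        cells.append(("code", match.group(1).strip()))
        position = match.end()
    # Prose after the last code block becomes the final markdown cell.
    tail = markdown[position:].strip()
    if tail:
        cells.append(("markdown", tail))
    return cells


sample = "\n".join([
    "# Title",
    "",
    "Some prose.",
    "",
    FENCE + "python",
    "x = 42",
    FENCE,
    "",
    "Closing prose.",
])

cells = split_markdown_into_cells(sample)
print([cell_type for cell_type, _ in cells])
# → ['markdown', 'code', 'markdown']
```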

... and other formats

from dn import pdf_to_markdown  # requires pypdf
from dn import docx_to_markdown  # requires mammoth
from dn import excel_to_markdown  # requires pandas, openpyxl, tabulate
from dn import pptx_to_markdown  # requires python-pptx
from dn import html_to_markdown  # requires html2text

Markdown stores

User story: I have a directory with multiple files in different formats.

I want to batch convert all supported files to markdown and store them in memory.

from dn import Files, bytes_store_to_markdown_store

from dn.tests.utils_for_testing_dn import test_data_dir

# Setup source files from test directory
src_files = Files(test_data_dir)

# Setup target store as an in-memory dictionary
target_store = {}

# Convert all files in directory to markdown
result = bytes_store_to_markdown_store(src_files, target_store, verbose=False)

# Check that the result is the target_store
assert result is target_store

# Verify that the supported file types were converted correctly
supported_files = [
    "test.docx",
    "test.pptx",
    "test.pdf",
    "test.html",
    "test.xlsx",
    "test.txt",
    "test.md",
    "test.ipynb",
]

print(f"\nSupported files (given what packages are installed here): {supported_files}\n")

for filename in supported_files:
    assert f"{filename}.md" in target_store, f"{filename} not found in target_store"
    assert len(target_store[f"{filename}.md"]) > 0, f"{filename} conversion failed"

Output:

invalid pdf header: b'PK\x03\x04\n'
EOF marker not found
EOF marker not found
invalid pdf header: b'PK\x03\x04\x14'
EOF marker not found
invalid pdf h
...
df header: b'PK\x03\x04\x14'
EOF marker not found



Supported files (given what packages are installed here): ['test.docx', 'test.pptx', 'test.pdf', 'test.html', 'test.xlsx', 'test.txt', 'test.md', 'test.ipynb']
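
The store-to-store pattern used above can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions, not dn's implementation: `convert_store`, `decode_text`, and the `CONVERTERS` table are hypothetical names, and the only behavior taken from the example is that converted items are stored under the source key plus ".md" and that the target store is returned.

```python
from pathlib import PurePosixPath


def decode_text(data):
    """Toy converter: treat the bytes as UTF-8 text."""
    return data.decode("utf-8")


# Hypothetical extension-to-converter table; dn's real dispatch may differ.
CONVERTERS = {".txt": decode_text, ".md": decode_text}


def convert_store(source, target, converters=CONVERTERS):
    """Convert each supported item of `source` and write it to `target`.

    Keys in `target` are the source key plus ".md", mirroring the
    "test.txt" -> "test.txt.md" naming seen in the example above.
    Items with unsupported extensions are skipped. Returns `target`.
    """
    for key, data in source.items():
        converter = converters.get(PurePosixPath(key).suffix.lower())
        if converter is None:
            continue
        target[f"{key}.md"] = converter(data)
    return target


source = {"notes.txt": b"plain text", "readme.md": b"# Title", "image.png": b"\x89PNG"}
target = {}
result = convert_store(source, target)
print(sorted(result))
# → ['notes.txt.md', 'readme.md.md']
```

Because both stores are only accessed through the mapping interface, the same function would work with any dict-like source or target, such as a file-backed store.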

Convert this notebook to markdown for README.md

from dn import notebook_to_markdown

notebook_to_markdown('~/Dropbox/py/proj/t/dn/misc/dn_readme.ipynb', target_file='../README.md')
