Convert Google Docs to markdown

wikinator

Convert a Google Drive download into a markdown-based wiki.

Note: This is a work in progress, and not all features will be supported or working properly.

tl;dr

uvx wikinator some/dir another_dir
uvx wikinator some/dir -graphql https://wiki.example.com/graphql -token 'graphql-auth-token'

Given a directory, convert supported file types into markdown-based files while maintaining names and directory structure. This can then be uploaded into various wiki systems.
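
As a rough illustration of "maintaining names and directory structure", the sketch below maps each source file to a destination path with the same relative location. The function names and the `CONVERTIBLE` set are hypothetical, chosen for this example; they are not taken from the wikinator code.

```python
from pathlib import Path

# Hypothetical set of suffixes that would be rewritten as markdown files.
CONVERTIBLE = {".docx", ".csv"}


def target_path(src_root: Path, dest_root: Path, src_file: Path) -> Path:
    """Map a source file to its destination, keeping the relative path."""
    rel = src_file.relative_to(src_root)
    if src_file.suffix.lower() in CONVERTIBLE:
        # Same name, same folder, but with a markdown extension.
        rel = rel.with_suffix(".md")
    return dest_root / rel


def plan(src_root: Path, dest_root: Path) -> list[tuple[Path, Path]]:
    """List (source, destination) pairs for every file under src_root."""
    return [
        (p, target_path(src_root, dest_root, p))
        for p in sorted(src_root.rglob("*"))
        if p.is_file()
    ]
```

With `some/dir` as the source and `another_dir` as the destination, `some/dir/team/Notes.docx` would map to `another_dir/team/Notes.md`, while files outside the assumed convertible set keep their original name.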

Supported File Types

  • DOCX files (default for GDocs) are converted to markdown
  • images are extracted, uploaded and embedded in the markdown
  • text and code file types are wrapped in markdown code blocks
  • CSV and XLSX are converted to markdown tables
  • for any document that is converted to markdown, a copy of the original is uploaded and attached
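
Two of the conversions listed above are simple enough to sketch with the standard library alone. This is a minimal illustration, not the project's implementation: `csv_to_markdown_table` and `wrap_in_code_block` are hypothetical names, and the first CSV row is assumed to be the header.

```python
import csv
import io


def csv_to_markdown_table(csv_text: str) -> str:
    """Render CSV text as a pipe table, treating the first row as the header."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(cells: list[str]) -> str:
        # Pad or trim each row to the header width and escape literal pipes.
        cells = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(r) for r in body)
    return "\n".join(lines)


def wrap_in_code_block(text: str, language: str = "") -> str:
    """Wrap plain text or source code in a fenced markdown code block."""
    fence = "`" * 3
    body = text.rstrip("\n")
    return f"{fence}{language}\n{body}\n{fence}"
```

Pipe tables are an extension supported by many markdown renderers rather than part of core markdown, so the output assumes the target wiki renders them.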

Supported Wiki Import

  • wiki.js (and other GraphQL-based wikis)
  • Obsidian

The development log will be kept here until the 1.0 release.

Usage

uvx wikinator some/dir another_dir
uvx wikinator some/dir -graphql https://wiki.example.com/graphql -token 'graphql-auth-token'

Build & Test

  1. Clone
    git clone https://github.com/philion/wikinator.git
    cd wikinator
    
  2. Run, with uv
    uv run wikinator [options]
    
  3. Test, with pytest
    uv run pytest
    

Development Log

2025-07-05

Starting work on image preservation.

Looking first at https://github.com/haesleinhuepf/docx2markdown for images.

Created a Docx2MarkdownConverter which almost works: images are put in the wrong path in the MD (s/images/ instead of just images). There's probably an easy fix, but let's try a pandoc version.

Creating PandocConverter to try and compare output.

pandoc {indoc} -f docx -t markdown --wrap=none --markdown-headings=atx --extract-media=images -o {outdoc}
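
For comparison runs, the command above could be driven from Python roughly as follows. This is only a sketch: `pandoc_command` and `convert` are hypothetical helpers, not the actual PandocConverter, and `pandoc` is assumed to be installed and on the PATH.

```python
import subprocess
from pathlib import Path


def pandoc_command(indoc: Path, outdoc: Path, media_dir: str = "images") -> list[str]:
    """Build the pandoc argument list used for DOCX to markdown conversion."""
    return [
        "pandoc", str(indoc),
        "-f", "docx",
        "-t", "markdown",
        "--wrap=none",                    # do not hard-wrap paragraphs
        "--markdown-headings=atx",        # '#'-style headings
        f"--extract-media={media_dir}",   # write embedded images to this folder
        "-o", str(outdoc),
    ]


def convert(indoc: Path, outdoc: Path) -> None:
    """Run pandoc and raise CalledProcessError if it exits non-zero."""
    subprocess.run(pandoc_command(indoc, outdoc), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, which are common in Drive exports.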

Neither produces desired results.

Trying a literal hack of docx2markdown, to see how quickly I can fix the little problems I saw.

Got it working quickly, removed a little bug, got the images.

Now looking over the DOCX XML format to see how much I can scrape out.

https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.fontsizecomplexscript?view=openxml-3.0.1

Added detection for strikethru and Courier New (as "code font").
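
A minimal sketch of what that detection might look like when reading the WordprocessingML directly, using only the standard library. The helper names are hypothetical, and this is not the code in the project; it only checks the `w:strike` toggle and the `w:ascii` font attribute on a single run.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _is_on(elem) -> bool:
    """A toggle property is on when present, unless w:val disables it."""
    if elem is None:
        return False
    return elem.get(f"{{{W}}}val", "true") not in ("false", "0", "off")


def run_to_markdown(run: ET.Element) -> str:
    """Convert one w:r run to markdown, marking code font and strikethrough."""
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    rpr = run.find("w:rPr", NS)
    if rpr is None or not text:
        return text
    fonts = rpr.find("w:rFonts", NS)
    if fonts is not None and fonts.get(f"{{{W}}}ascii") == "Courier New":
        text = f"`{text}`"       # treat Courier New as inline code
    if _is_on(rpr.find("w:strike", NS)):
        text = f"~~{text}~~"     # GFM-style strikethrough
    return text
```

The `~~text~~` form is a GitHub Flavored Markdown extension, so whether it renders depends on the target wiki.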

This is good enough for v0.2!

Noticed when working on strikethru that nested lists didn't seem to be working. Try that next.

Oops, minor bug. Fixing with v0.3.



### 2025-07-04
Let's make a project! Today's goals:
- [x] clean up code and README
- [x] add CLI options, using typer (not all implemented)
- [x] initial commit to github
- [ ] add image handling
- [x] upload to pypi and confirm uvx commands

Cruft removed. README updated. (author waves, breaking 4th wall)

Moving on to the main() cleanup and adding support for https://github.com/fastapi/typer

Added simple CLI options for src and dest. Got end-to-end tree processing.

Added Makefile to help with release management. Got PyPI set up: https://pypi.org/project/wikinator/

`uvx wikinator` is working.

Let's go for git and call it a day!

### 2025-07-03
Next steps are testing different document converters and accessing google drive via API.

#### Markdown conversion libraries
- pandoc, see https://docs.asciidoctor.org/asciidoctor/latest/migrate/ms-word/
- markitdown, https://github.com/microsoft/markitdown
- docx2markdown, https://github.com/haesleinhuepf/docx2markdown
- docx2md, https://github.com/mattn/docx2md

Reference:
- https://www.docstomarkdown.pro/convert-word-or-docs-to-markdown-using-pandoc/

#### Google Drive API
Starting with https://developers.google.com/workspace/drive/api/quickstart/python

> Note: Follow those Google directions for setting up everything. It's complicated compared to simply generating a service token. Your intrepid author made different tokens in different accounts and couldn't access anything! And get permissions right! Document specific needs in install docs.

Further aside: There are two versions of the tool: file-based and google-takeout. The Google-related stuff will always be a bear to set up.

Made sufficient progress to feel like there is a separate CLI tool here. Set aside for now, and focus on:
1. Build file-based output
2. Generate and link images
3. Clean up for initial 0.1 version

### 2025-07-02
Initial time-boxed work started to examine what would be required to migrate our existing Google Docs-based info repo into a wiki, with wiki.js being targeted.

Initial proof-of-concept goals:
- Convert a docx page to md or asciidoc
- Upload test pages to wiki.js

I was able to get this working in sample code in a few hours.
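
The upload half of that proof of concept might look something like the sketch below. It is not the sample code referred to above: the mutation shape is an assumption based on the wiki.js 2.x GraphQL schema and should be checked against the target server, and `build_create_request` is a hypothetical helper that only builds the request without sending it.

```python
import json
import urllib.request

# Assumed shape of the wiki.js 2.x page-creation mutation; verify before use.
CREATE_PAGE = """
mutation ($content: String!, $description: String!, $editor: String!,
          $isPublished: Boolean!, $isPrivate: Boolean!, $locale: String!,
          $path: String!, $tags: [String]!, $title: String!) {
  pages {
    create(content: $content, description: $description, editor: $editor,
           isPublished: $isPublished, isPrivate: $isPrivate, locale: $locale,
           path: $path, tags: $tags, title: $title) {
      responseResult { succeeded errorCode slug message }
      page { id path title }
    }
  }
}
"""


def build_create_request(endpoint: str, token: str, path: str,
                         title: str, content: str) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST that creates one markdown page."""
    payload = {
        "query": CREATE_PAGE,
        "variables": {
            "content": content,
            "description": "",
            "editor": "markdown",
            "isPublished": True,
            "isPrivate": False,
            "locale": "en",
            "path": path,
            "tags": [],
            "title": title,
        },
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # API token from the wiki admin
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(req)`; the response should be inspected for GraphQL-level errors in addition to the HTTP status, since a failed mutation can still return 200.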
