Extract XML from the OS X dictionaries.

Before You Start

Apple-peeler was written using Python 3.9, but it should be trivial to support earlier versions (Python 3.5+).

Installation

pip install apple-peeler

Dependencies

BeautifulSoup 4, lxml, and click

Usage

Apple likes to move the dictionaries' location around from one macOS version to the next. If the dictionaries are no longer at the path below, you can tell apple-peeler where to look by exporting DICT_BASE in your environment or by using the --base option.

export DICT_BASE="/System/Library/AssetsV2/com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"
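
The lookup order described above might be sketched as follows. This is a minimal illustration, not apple-peeler's actual code: the function name `resolve_base` is hypothetical, and the rule that an explicit --base value wins over DICT_BASE is an assumption based on how click normally treats an option with an environment-variable fallback.

```python
import os

# Default path as given in this README; it may differ between macOS versions.
DEFAULT_BASE = "/System/Library/AssetsV2/com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"


def resolve_base(cli_base=None, environ=None):
    """Pick the dictionary root: explicit option, then DICT_BASE, then the default."""
    # Accepting a mapping makes the function easy to test without touching os.environ.
    environ = os.environ if environ is None else environ
    if cli_base:
        return cli_base
    return environ.get("DICT_BASE") or DEFAULT_BASE


print(resolve_base("/tmp/dicts", {"DICT_BASE": "/elsewhere"}))
print(resolve_base(None, {"DICT_BASE": "/elsewhere"}))
print(resolve_base(None, {}))
```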

After that, usage is straightforward.

Usage: apple-peeler [OPTIONS]

Extract XML from Apple Dictionary files.

Options:
--base DIRECTORY                The root directory of the OS X dictionaries.
                                (Default: /System/Library/AssetsV2/com_apple
                                _MobileAsset_DictionaryServices_dictionaryOS
                                X/) [Env var DICT_BASE]
--out DIRECTORY                 The path to place extracted XML files.
-d, --dictionary [
    all|Arabic - English|Danish|Duden Dictionary Data Set I|Dutch|
    Dutch - English|French|French - English|French - German|German - English|
    Hebrew|Hindi|Hindi - English|Indonesian - English|Italian|
    Italian - English|Korean|Korean - English|New Oxford American Dictionary|
    Norwegian|Oxford American Writer's Thesaurus|
    Oxford Dictionary of English|Oxford Thesaurus of English|
    Polish - English|Portuguese|Portuguese - English|Russian|
    Russian - English|Sanseido Super Daijirin|
    Sanseido The WISDOM English-Japanese Japanese-English Dictionary|
    Simplified Chinese - English|Simplified Chinese - Japanese|Spanish|
    Spanish - English|Swedish|Thai|Thai - English|
    The Standard Dictionary of Contemporary Chinese|Traditional Chinese|
    Traditional Chinese - English|Turkish|Vietnamese - English]
                                The dictionary to extract or 'all'.
                                (Default: all) [Accepts multiple]
--format-xml / --no-format-xml  Format the XML files using BeautifulSoup.
                                (Default: False)
--debug                         Output debug information to STDERR.
                                (Default: False)
--help                          Show this message and exit.
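
If you want to drive the tool from a script, a small helper like the one below can assemble the arguments. It is only a sketch: `build_command` is a hypothetical name, it uses just the options listed in the help text above, and it assumes that "[Accepts multiple]" means the -d flag is repeated once per dictionary (the usual click behavior).

```python
import shlex


def build_command(out_dir, dictionaries=("all",), base=None, format_xml=False):
    """Assemble an apple-peeler argument list from the documented options."""
    cmd = ["apple-peeler"]
    if base is not None:
        cmd += ["--base", base]
    cmd += ["--out", out_dir]
    for name in dictionaries:
        # Assumption: -d is repeated once per dictionary name.
        cmd += ["-d", name]
    if format_xml:
        cmd.append("--format-xml")
    return cmd


cmd = build_command("./xml", ["French", "French - English"], format_xml=True)
# shlex.join quotes names that contain spaces so the line is safe to paste into a shell.
print(shlex.join(cmd))
```

On a Mac with apple-peeler installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.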

Introduction

I need a ton of dictionary data for prototyping my language-learning tool, Parsnip, and licensing 40 dictionaries seems too expensive for a bootstrapper working on an MVP (I look forward to the day this is no longer true).

Parsnip uses natural language processing and dictionaries to decouple the word <-> sentence tug-of-war that has existed for as long as flashcards have been used for language learning. That is, should I make a word (concept) flashcard or a sentence (example) flashcard?

I care about which words I know for tracking purposes, but I want those words in context when I'm practicing. So the learning system breaks sentences down into lemmas (the dictionary forms of words) and keeps a database of the example sentences in which those words appear. This resolves the conceptual tug-of-war for flashcards.
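
To make the lemma-to-sentence idea concrete, here is a toy sketch. It is not Parsnip's code: the `LEMMAS` table, `lemmatize`, and `build_index` are all hypothetical, and a real system would use a proper NLP lemmatizer instead of a hand-written lookup.

```python
from collections import defaultdict

# Hypothetical toy lemma table; a real system would use an NLP lemmatizer.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "dogs": "dog"}


def lemmatize(token):
    """Map a token to its dictionary form, falling back to the token itself."""
    word = token.lower().strip(".,!?")
    return LEMMAS.get(word, word)


def build_index(sentences):
    """Map each lemma to the example sentences it appears in."""
    index = defaultdict(list)
    for sentence in sentences:
        # Use a set so a lemma repeated within one sentence is recorded once.
        for lemma in {lemmatize(t) for t in sentence.split()}:
            index[lemma].append(sentence)
    return index


index = build_index(["The dogs ran home.", "She runs every day."])
print(index["run"])
```

Progress can then be tracked per lemma ("run"), while practice always shows one of the stored sentences for that lemma.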

But by removing reference data from the flashcards themselves, I need to integrate reference material directly into Parsnip's UI. JMdict is a great open-source project for this, but it only covers a single language. So I've been keeping my eyes open for people working on extracting the data from Apple's bundled dictionaries.

This has been a community effort that's spanned several years. My contribution is to collect the results, clear up some details about the file format, and package it into a general command-line tool.

References

This is inspired by Reverse-Engineering Apple Dictionary and by the Hacker News discussion of it, "Reverse-Engineering Apple Dictionary (2020)". Special thanks to tim-- and enragedcacti, who introduced me to binwalk, and to dunham, who mentioned that the seemingly random bytes looked like ints of payload sizes.
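
The "ints of payload sizes" observation describes a length-prefixed layout. The sketch below shows how such a stream can be walked in general. Everything specific in it is an assumption for illustration and is not a description of Apple's file format: the 4-byte little-endian size field, the zlib compression, and the names `pack_chunks` and `iter_chunks` are all made up for this toy example.

```python
import struct
import zlib


def pack_chunks(payloads):
    """Build a toy stream: a 4-byte little-endian size, then that many bytes."""
    out = bytearray()
    for payload in payloads:
        blob = zlib.compress(payload)
        out += struct.pack("<I", len(blob)) + blob
    return bytes(out)


def iter_chunks(data):
    """Yield decompressed payloads from a size-prefixed stream."""
    offset = 0
    while offset + 4 <= len(data):
        # Read the size field, then slice out exactly that many payload bytes.
        (size,) = struct.unpack_from("<I", data, offset)
        offset += 4
        chunk = data[offset:offset + size]
        if len(chunk) < size:
            raise ValueError("truncated chunk")
        yield zlib.decompress(chunk)
        offset += size


stream = pack_chunks([b"<entry>apple</entry>", b"<entry>peel</entry>"])
print(list(iter_chunks(stream)))
```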

Additionally, I've found these posts informative:
