Parse org-mode files into structured data for the org-dex indexing system
#+title: org-dex-parse
#+author: gdvek
Extract structured data from org-mode files. Point it at an =.org= file,
get back Python objects — titles, timestamps, links, tags, clock entries,
properties, and more — ready to query, store, or pipe into whatever
you're building.
Built for [[https://github.com/gdvek/org-dex][org-dex]], usable standalone. Uses [[https://github.com/karlicoss/orgparse][orgparse]] as the parsing
backend.
* Try it
** From the command line
#+begin_src sh
pip install org-dex-parse # Python >= 3.11
#+end_src
Use one of your own org files, or create a test file:
#+begin_example
* TODO Write report :work:
DEADLINE: <2026-04-01>
:PROPERTIES:
:ID: abc-001
:END:
** Notes
Some references: [[id:other][see also]].
* DONE Review draft
CLOSED: [2026-03-15 Sun 10:00]
:PROPERTIES:
:ID: abc-002
:END:
#+end_example
#+begin_src sh
python -m org_dex_parse example.org
#+end_src
Output:
#+begin_example
example.org: 2 items
Write report
id=abc-001 level=1 line=1
todo=TODO
local_tags={'work'}
Review draft
id=abc-002 level=1 line=9
todo=DONE
#+end_example
Add =-v= to include body text, =--json= for machine-readable output.
A ready-made config file is included for common setups:
#+begin_src sh
python -m org_dex_parse --config examples/config.json example.org
#+end_src
It covers TODO keywords, drawer filtering, and item selection rules
for a typical org-mode setup. Copy it and adjust to your needs — the
fields are documented in [[*Configuration][Configuration]].
** From Python
#+begin_src python
from org_dex_parse import parse_file, Config
result = parse_file("notes.org", Config())
for item in result.items:
print(f"{item.todo or ''} {item.title}")
print(f" id={item.item_id} tags={item.local_tags}")
if item.deadline:
print(f" deadline={item.deadline.date}")
if item.links:
print(f" links={len(item.links)}")
#+end_src
Each =Item= in =result.items= is a heading with =:ID:= that passed
the configured predicate — see [[*Key concepts][Key concepts]].
* Key concepts
The parser distinguishes two kinds of headings:
- *Items* — headings with =:ID:= that pass the predicate. Each produces
a 24-field structured object.
- *Scaffolding* — everything else. Organizational headings whose
content (body, links, timestamps, clock) rolls up into the nearest
ancestor item. Nothing is lost — scaffolding content is collected,
not discarded.
You control what counts as an item through a *predicate*. The default
accepts every heading with =:ID:=. You can narrow it — for example,
require a =:Type:= property, or exclude headings with =ROAM_EXCLUDE=.
#+begin_example
org file
|
orgparse
(syntax tree)
|
org-dex-parse
(semantic layer)
|
Item stream
(24-field frozen
dataclasses)
|
+--------------+--------------+
v v v
org-dex custom indexers data pipelines
(DB + UI) (knowledge graphs)(analytics)
#+end_example
orgparse handles the org-mode grammar. org-dex-parse handles item
discrimination, field extraction, and content filtering.
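The roll-up rule can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the library's implementation: headings are given as hypothetical =(level, title, has_id)= tuples in document order, and the default predicate (every heading with =:ID:= is an item) is assumed.

```python
def assign_owners(headings):
    """Map each heading title to the title of the item that owns its content.

    `headings` is a list of (level, title, has_id) tuples in document
    order. The tuple shape is a hypothetical stand-in for real nodes.
    """
    owners = {}
    stack = []  # (level, title, is_item) for the current ancestor chain
    for level, title, has_id in headings:
        # Drop headings that are not ancestors of the current one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if has_id:
            owners[title] = title  # an item owns its own content
        else:
            # Scaffolding: nearest ancestor that is an item, if any.
            owners[title] = next(
                (t for _, t, is_item in reversed(stack) if is_item), None
            )
        stack.append((level, title, has_id))
    return owners


headings = [
    (1, "Project", False),
    (2, "Write report", True),
    (3, "Notes", False),
    (2, "Review draft", True),
    (2, "Background reading", False),
]
owners = assign_owners(headings)
```

Here ="Notes"= maps to ="Write report"=, while ="Project"= and ="Background reading"= map to =None= because they have no ancestor item. What the parser does with such content is outside the scope of this sketch.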
* Installation
#+begin_src sh
pip install org-dex-parse
#+end_src
Requires Python >= 3.11. Single dependency: =orgparse>=0.4,<0.5=.
* Examples
The examples below show the parser on increasingly complex org files.
Each starts with the org source, then shows the Python code and what
each field contains.
** Example 1: default predicate — items and scaffolding
With =Config()= (default), every heading with =:ID:= is an item.
Headings without =:ID:= are scaffolding — their content rolls up into
the nearest ancestor item.
#+begin_example
* Project
** TODO Write report :work:
DEADLINE: <2026-04-01>
:PROPERTIES:
:ID: a1b2c3
:END:
*** Notes
Some text with [[id:ref][a link]].
Meeting on <2026-03-20 Fri>.
** DONE Review draft
CLOSED: [2026-03-15 Sun 10:00]
:PROPERTIES:
:ID: d4e5f6
:END:
** Background reading
No :ID: here — just an organizational heading.
#+end_example
#+begin_src python
config = Config(
todos=("TODO",),
dones=("DONE",),
)
result = parse_file("project.org", config)
# result.items → 2 items
#+end_src
| Heading | =:ID:=? | Item? | Why |
|--------------+-------+-------+--------------------------------|
| Project | no | no | No =:ID:= → scaffolding |
| Write report | yes | yes | Has =:ID:= |
| Notes | no | no | No =:ID:= → scaffolding of above |
| Review draft | yes | yes | Has =:ID:= |
| Bg reading | no | no | No =:ID:= → scaffolding |
"Notes" is scaffolding under "Write report". Its body text, the link
=[[id:ref][a link]]=, and the timestamp =<2026-03-20>= all become part
of the "Write report" item:
#+begin_src python
item = result.items[0] # Write report
item.title # "Write report"
item.todo # "TODO"
item.local_tags # frozenset({"work"})
item.deadline.date # datetime.date(2026, 4, 1)
item.active_ts[0].date # datetime.date(2026, 3, 20) ← from "Notes"
item.links[0].target # "id:ref" ← from "Notes"
item.body # "Notes\nSome text with a link.\nMeeting on ..."
#+end_src
** Example 2: =:Type:= predicate — narrower item definition
With =Config(item_predicate=["property", "Type"])=, a heading must have
*both* =:ID:= and a =:Type:= property to be an item:
#+begin_example
* Inbox
:PROPERTIES:
:ID: aaa-111
:Type: area
:END:
** TODO Buy groceries
SCHEDULED: <2026-03-17 Tue>
:PROPERTIES:
:ID: bbb-222
:Type: task
:END:
** Grocery list
:PROPERTIES:
:ID: ccc-333
:END:
- Milk
- Bread
#+end_example
#+begin_src python
config = Config(
item_predicate=["property", "Type"],
todos=("TODO",),
dones=("DONE",),
)
result = parse_file("inbox.org", config)
# result.items → 2 items (Inbox, Buy groceries)
# "Grocery list" has :ID: but no :Type: → scaffolding
#+end_src
| Heading | =:ID:=? | =:Type:=? | Item? | Why |
|---------------+-------+---------+-------+--------------------------------------|
| Inbox | yes | =area= | yes | Has =:ID:= + =:Type:= |
| Buy groceries | yes | =task= | yes | Has =:ID:= + =:Type:= |
| Grocery list | yes | — | no | Has =:ID:= but no =:Type:= → scaffolding |
"Grocery list" is scaffolding — but it's at level 2, a sibling of "Buy
groceries", not its child. Both are children of "Inbox". So "Grocery
list" content rolls up to *Inbox*, not "Buy groceries":
#+begin_src python
inbox = result.items[0] # Inbox
inbox.body # "Grocery list\n- Milk\n- Bread"
item = result.items[1] # Buy groceries
item.scheduled.date # datetime.date(2026, 3, 17)
item.properties # (("Type", "task"),)
item.parent_item_id # "aaa-111" ← Inbox is the parent item
item.body # None — no scaffolding under this item
#+end_src
** Example 3: org-roam style — exclude =ROAM_EXCLUDE= nodes
org-roam users typically want every =:ID:= heading *except* those marked
with =ROAM_EXCLUDE=. The =not= operator handles this:
#+begin_example
* Main topic
:PROPERTIES:
:ID: roam-001
:END:
This is a permanent note.
See also [[https://example.com/reference][Reference paper]].
** Supporting argument
:PROPERTIES:
:ID: roam-002
:END:
Evidence from [[id:roam-005][another note]].
** COMMENT Draft section
:PROPERTIES:
:ID: roam-003
:ROAM_EXCLUDE: t
:END:
Work in progress — not ready for the graph.
#+end_example
#+begin_src python
config = Config(
item_predicate=["not", ["property", "ROAM_EXCLUDE"]],
)
result = parse_file("roam-note.org", config)
# result.items → 2 items (Main topic, Supporting argument)
# "Draft section" is excluded by the predicate
#+end_src
| Heading | =:ID:=? | =ROAM_EXCLUDE=? | Item? | Why |
|------------+-------+---------------+-------+------------------------|
| Main topic | yes | no | yes | =:ID:= + not excluded |
| Supporting | yes | no | yes | =:ID:= + not excluded |
| Draft | yes | =t= | no | =ROAM_EXCLUDE= → scaffold |
#+begin_src python
item = result.items[0] # Main topic
item.links[0].target # "https://example.com/reference"
item.links[0].description # "Reference paper"
item.body
# "This is a permanent note.\n"
# "See also Reference paper.\n"
# "COMMENT Draft section\n" ← scaffolding heading
# "Work in progress — not ready ..." ← scaffolding body
#+end_src
** Example 4: LOGBOOK data — clock entries and state changes
Clock entries and state changes are extracted from the =:LOGBOOK:=
drawer. They are collected from the item and its scaffolding children.
#+begin_example
* TODO Deep work session :focus:
SCHEDULED: <2026-03-17 Tue 09:00>
:PROPERTIES:
:ID: clock-001
:END:
:LOGBOOK:
CLOCK: [2026-03-16 Mon 14:00]--[2026-03-16 Mon 15:30] => 1:30
CLOCK: [2026-03-16 Mon 10:00]--[2026-03-16 Mon 11:45] => 1:45
- State "TODO" from "PLANNING" [2026-03-15 Sun 09:00]
- State "PLANNING" from [2026-03-14 Sat 18:00]
:END:
Focus on the analysis section.
#+end_example
#+begin_src python
config = Config(
todos=("PLANNING", "TODO"),
dones=("DONE",),
)
result = parse_file("work.org", config)
item = result.items[0]
# Clock entries (collected from :LOGBOOK:)
len(item.clock) # 2
item.clock[0].start # datetime(2026, 3, 16, 10, 0)
item.clock[0].end # datetime(2026, 3, 16, 11, 45)
item.clock[0].duration_minutes # 105
item.clock[1].start # datetime(2026, 3, 16, 14, 0)
item.clock[1].duration_minutes # 90
# State changes (chronological order)
len(item.state_changes) # 2
item.state_changes[0].to_state # "PLANNING"
item.state_changes[0].from_state # None ← first assignment
item.state_changes[1].to_state # "TODO"
item.state_changes[1].from_state # "PLANNING"
# Body excludes LOGBOOK content
item.body # "Focus on the analysis section."
#+end_src
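To make the =duration_minutes= values above concrete, here is a minimal sketch of reading one closed =CLOCK:= line with the standard library. The regex and the helper name =clock_minutes= are assumptions for illustration; the parser itself relies on orgparse, and running clocks (no end timestamp) are not handled here.

```python
import re
from datetime import datetime

# Day names are matched but ignored, so parsing does not depend on locale.
CLOCK_RE = re.compile(
    r"CLOCK:\s*\[(\d{4}-\d{2}-\d{2}) \w+ (\d{2}:\d{2})\]"
    r"--\[(\d{4}-\d{2}-\d{2}) \w+ (\d{2}:\d{2})\]"
)


def clock_minutes(line):
    """Return (start, end, minutes) for a closed CLOCK line, else None."""
    m = CLOCK_RE.search(line)
    if m is None:
        return None
    start = datetime.strptime(f"{m.group(1)} {m.group(2)}", "%Y-%m-%d %H:%M")
    end = datetime.strptime(f"{m.group(3)} {m.group(4)}", "%Y-%m-%d %H:%M")
    minutes = int((end - start).total_seconds() // 60)
    return start, end, minutes


line = "CLOCK: [2026-03-16 Mon 10:00]--[2026-03-16 Mon 11:45] =>  1:45"
start, end, minutes = clock_minutes(line)
```

The duration is computed from the two timestamps rather than read from the ==> 1:45= suffix, so a stale suffix would not affect the result.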
** Example 5: timestamps — dedicated vs generic
The parser distinguishes *dedicated* timestamps (=SCHEDULED=, =DEADLINE=,
=CLOSED=, =created=, =archived=) from *generic* timestamps found in the
body text. Each has its own field — no double-counting.
#+begin_example
* DONE Submit paper
SCHEDULED: <2026-03-01 Sun> DEADLINE: <2026-03-10 Tue> CLOSED: [2026-03-09 Mon 23:55]
:PROPERTIES:
:ID: ts-001
:CREATED: [2026-01-10 Sat]
:ARCHIVE_TIME: 2026-03-15 Sun 12:00
:END:
Submitted before the deadline.
Conference is <2026-06-15 Mon>--<2026-06-18 Thu>.
Received confirmation on [2026-03-10 Tue].
#+end_example
#+begin_src python
config = Config(dones=("DONE",))
result = parse_file("paper.org", config)
item = result.items[0]
# Dedicated timestamps — from planning line and properties
item.scheduled.date # datetime.date(2026, 3, 1)
item.scheduled.active # True (angle brackets)
item.deadline.date # datetime.date(2026, 3, 10)
item.closed.date # datetime.datetime(2026, 3, 9, 23, 55)
item.closed.active # False (square brackets)
item.created.date # datetime.date(2026, 1, 10)
item.archived.date # datetime.datetime(2026, 3, 15, 12, 0)
# Generic timestamps — from body text only (no overlap with above)
len(item.active_ts) # 0 ← the range endpoints are NOT here
len(item.inactive_ts) # 1 ← [2026-03-10 Tue]
len(item.range_ts) # 1 ← the conference range
item.range_ts[0].start.date # datetime.date(2026, 6, 15)
item.range_ts[0].end.date # datetime.date(2026, 6, 18)
item.range_ts[0].active # True
#+end_src
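The "no double-counting" behaviour for ranges can be approximated as follows. This is a rough sketch, assuming simple regexes and active ranges only; the names =classify=, =RANGE_RE=, =ACTIVE_RE=, and =INACTIVE_RE= are hypothetical and not part of the package.

```python
import re

# A date, optionally followed by a day name, time, or repeater.
TS = r"\d{4}-\d{2}-\d{2}(?: [^\]>\n]*)?"
RANGE_RE = re.compile(rf"<({TS})>--<({TS})>")
ACTIVE_RE = re.compile(rf"<({TS})>")
INACTIVE_RE = re.compile(rf"\[({TS})\]")


def classify(body):
    """Split body timestamps into range, active, and inactive groups."""
    ranges = RANGE_RE.findall(body)
    # Remove ranges first so their endpoints are not counted again.
    rest = RANGE_RE.sub(" ", body)
    return {
        "range": ranges,
        "active": ACTIVE_RE.findall(rest),
        "inactive": INACTIVE_RE.findall(rest),
    }


body = (
    "Submitted before the deadline.\n"
    "Conference is <2026-06-15 Mon>--<2026-06-18 Thu>.\n"
    "Received confirmation on [2026-03-10 Tue].\n"
)
found = classify(body)
```

Matching ranges before single timestamps is what keeps the two conference dates out of the active list.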
*Scaffolding planning lines become generic timestamps.* The rule above
(dedicated fields, no double-counting) applies only to the item's own
planning line. When a scaffolding heading has =SCHEDULED=, =DEADLINE=,
or =CLOSED=, those timestamps have no dedicated destination — they are
promoted to generic timestamps (=active_ts= / =inactive_ts=) so they are
not lost.
#+begin_example
* TODO Project plan
DEADLINE: <2026-04-01>
:PROPERTIES:
:ID: plan-001
:END:
** Phase 1
SCHEDULED: <2026-03-15 Sun>
Define requirements.
** Phase 2
DEADLINE: <2026-03-25 Wed>
Build prototype.
#+end_example
#+begin_src python
config = Config(todos=("TODO",), dones=("DONE",))
result = parse_file("plan.org", config)
item = result.items[0] # Project plan
# Item's own planning → dedicated field
item.deadline.date # datetime.date(2026, 4, 1)
# Scaffolding planning → promoted to generic timestamps
# Phase 1's SCHEDULED and Phase 2's DEADLINE have no dedicated
# field on the parent item, so they become active_ts.
len(item.active_ts) # 2
item.active_ts[0].date # datetime.date(2026, 3, 15) ← Phase 1 SCHEDULED
item.active_ts[1].date # datetime.date(2026, 3, 25) ← Phase 2 DEADLINE
#+end_src
** Example 6: tags, properties, and inheritance
Tags on a heading are =local_tags=. Tags from ancestors are
=inherited_tags= (minus any tags in =tags_exclude_from_inheritance=).
Properties come from the direct =:PROPERTIES:= drawer only — never from
children.
#+begin_example
#+FILETAGS: :project:
* Research :science:
:PROPERTIES:
:ID: tag-001
:Type: area
:Effort: 3:00
:END:
** Literature review :reading:
:PROPERTIES:
:ID: tag-002
:Type: task
:END:
#+end_example
#+begin_src python
config = Config(
item_predicate=["property", "Type"],
tags_exclude_from_inheritance=frozenset({"noexport"}),
)
result = parse_file("research.org", config)
parent = result.items[0] # Research
parent.local_tags # frozenset({"science"})
parent.inherited_tags # frozenset({"project"}) ← from FILETAGS
parent.properties # (("Type", "area"), ("Effort", "180"))
child = result.items[1] # Literature review
child.local_tags # frozenset({"reading"})
child.inherited_tags # frozenset({"project", "science"})
child.parent_item_id # "tag-001"
child.properties # (("Type", "task"),)
# Effort is NOT here — properties are per-heading, not inherited
#+end_src
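The inheritance rule in this example amounts to a set union with an exclusion list. A minimal sketch, assuming ancestor tags are available as plain sets (the function name =inherited_tags= is hypothetical):

```python
def inherited_tags(ancestor_tag_sets, excluded=frozenset()):
    """Union of ancestor tags, minus tags that must not propagate."""
    tags = set()
    for ancestor_tags in ancestor_tag_sets:
        tags |= set(ancestor_tags) - set(excluded)
    return frozenset(tags)


filetags = {"project"}   # from #+FILETAGS
research = {"science"}   # local tags of the parent heading

# Tags seen by "Literature review".
child_inherited = inherited_tags([filetags, research], excluded={"noexport"})

# An excluded tag on an ancestor does not reach the child.
filtered = inherited_tags([{"noexport", "science"}], excluded={"noexport"})
```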
** Example 7: links — org-mode and bare URLs
Links are extracted from the complete =raw_text= of the item (including
scaffolding children and content inside excluded drawers). Two kinds
are captured:
- *Org-mode links* — any =[[target]]= or =[[target][description]]=,
regardless of schema (=id:=, =https://=, =file:=, =./image.png=,
fuzzy, etc.). The target is stored raw — the consumer extracts
the schema if needed.
- *Bare URLs* — =http://= and =https://= URLs outside of =[[...]]=.
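The two link kinds listed above could be matched roughly like this. It is an illustrative sketch with assumed regexes, not the parser's actual patterns; =extract_links= is a hypothetical helper that returns =(target, description)= pairs in document order.

```python
import re

ORG_LINK_RE = re.compile(r"\[\[([^\]]+)\](?:\[([^\]]+)\])?\]")
BARE_URL_RE = re.compile(r"https?://[^\s\]\[<>]+")


def extract_links(text):
    """Return (target, description) pairs in document order."""
    found = []
    for m in ORG_LINK_RE.finditer(text):
        found.append((m.start(), m.group(1), m.group(2)))
    # Blank out bracket links so their targets are not matched again
    # as bare URLs; equal-length padding keeps positions stable.
    blanked = ORG_LINK_RE.sub(lambda m: " " * len(m.group(0)), text)
    for m in BARE_URL_RE.finditer(blanked):
        found.append((m.start(), m.group(0), None))
    found.sort(key=lambda entry: entry[0])
    return [(target, desc) for _, target, desc in found]


text = (
    "Key paper: [[https://arxiv.org/abs/2301.00001][Attention is all you need]].\n"
    "Related note: [[id:abc-123][Transformer architecture]].\n"
    "Blog post: https://example.com/transformers\n"
)
links = extract_links(text)
```

One known gap in this sketch: a bare URL directly followed by sentence punctuation would keep the trailing character.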
#+begin_example
* Reference collection
:PROPERTIES:
:ID: link-001
:END:
Key paper: [[https://arxiv.org/abs/2301.00001][Attention is all you need]].
Related note: [[id:abc-123][Transformer architecture]].
Blog post: https://example.com/transformers
:SEE_ALSO:
[[id:def-456][History of neural networks]]
:END:
#+end_example
#+begin_src python
config = Config(
exclude_drawers=frozenset({"see_also"}),
)
result = parse_file("refs.org", config)
item = result.items[0]
len(item.links) # 4
item.links[0].target # "https://arxiv.org/abs/2301.00001"
item.links[0].description # "Attention is all you need"
item.links[1].target # "id:abc-123"
item.links[1].description # "Transformer architecture"
item.links[2].target # "https://example.com/transformers"
item.links[2].description # None ← bare URL, no description
item.links[3].target # "id:def-456"
item.links[3].description # "History of neural networks"
# ↑ extracted from :SEE_ALSO: — links survive drawer exclusion
# Body EXCLUDES :SEE_ALSO: content
item.body
# "Key paper: Attention is all you need.\n"
# "Related note: Transformer architecture.\n"
# "Blog post: https://example.com/transformers"
#+end_src
** Example 8: body and raw_text — what's included, what's filtered
=body= is the filtered text meant for display. =raw_text= is the
complete unfiltered org-mode source. Both include scaffolding children.
#+begin_example
* TODO Prepare presentation :work:
DEADLINE: <2026-04-01>
:PROPERTIES:
:ID: body-001
:Type: task
:END:
:LOGBOOK:
- State "TODO" from "PLANNING" [2026-03-15 Sun 09:00]
:END:
First draft of the slides.
See [[id:ref-001][design document]].
** Outline
- Introduction (5 min)
- Main argument (15 min)
- Q&A (10 min)
#+end_example
#+begin_src python
config = Config(
item_predicate=["property", "Type"],
todos=("PLANNING", "TODO"),
dones=("DONE",),
)
result = parse_file("pres.org", config)
item = result.items[0]
# body: filtered, human-readable
# - PROPERTIES drawer: excluded (orgparse strips it from body)
# - LOGBOOK drawer: excluded (always, hardcoded)
# - "Outline" heading: INCLUDED (scaffolding heading text)
# - Link syntax resolved to description text
item.body
# "First draft of the slides.\n"
# "See design document.\n"
# "Outline\n"
# "- Introduction (5 min)\n"
# "- Main argument (15 min)\n"
# "- Q&A (10 min)"
# raw_text: complete unfiltered org source
# Includes PROPERTIES, LOGBOOK, link syntax, everything.
# Does NOT include content from other items.
"LOGBOOK" in item.raw_text # True
":ID:" in item.raw_text # True
"[[id:ref-001]" in item.raw_text # True ← raw link syntax preserved
#+end_src
* Configuration
=Config= controls what the parser considers an item and how it extracts
data. All fields have sensible defaults — the minimal config is
=Config()= (any heading with =:ID:= is an item).
#+begin_src python
from org_dex_parse import Config
config = Config(
# Which headings with :ID: are items (default: all of them)
item_predicate=["property", "Type"],
# TODO keywords for your org-mode setup
todos=("TODO", "NEXT", "DOING"),
dones=("DONE", "CANCELED"),
# Tags that don't propagate to children
# (matches org-tags-exclude-from-inheritance)
tags_exclude_from_inheritance=frozenset({"noexport", "pin"}),
# Drawers excluded from body text (not from links)
exclude_drawers=frozenset({"logbook", "see_also"}),
# Blocks excluded from body text
exclude_blocks=frozenset({"comment"}),
# Properties omitted from Item.properties
exclude_properties=frozenset({"archive_file"}),
# Property name for creation date (default "CREATED")
created_property="CREATED",
# Extra characters allowed in tag names (default: none)
# Standard org-mode: [a-zA-Z0-9_@]
extra_tag_chars="%#",
)
#+end_src
** Item predicate
The predicate determines which =:ID:= headings become items. Three
forms are accepted:
| Form | Example | Use case |
|----------+---------------------------------------------+---------------------------------|
| =None= | =Config()= | All headings with =:ID:= |
| =list= | =Config(item_predicate=["property", "Type"])= | JSON-serializable (recommended) |
| =callable= | =Config(item_predicate=lambda h: ...)= | Python-only |
The =list= form uses s-expressions (JSON arrays) with these operators:
| Operator | Example | Meaning |
|----------+----------------------------------------------------------------------+--------------------------------|
| =property= | =["property", "Type"]= | Has property =Type= |
| =not= | =["not", ["property", "ARCHIVE_TIME"]]= | Negation |
| =and= | =["and", ["property", "Type"], ["not", ["property", "ARCHIVE_TIME"]]]= | All must match (short-circuit) |
| =or= | =["or", expr1, expr2]= | Any must match (short-circuit) |
The =list= form is the recommended interface — it is serializable (JSON-RPC,
config files, CLI) and covers the common cases. The =callable= form exists
for backward compatibility and advanced use.
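To show how the four operators compose, here is a small evaluator for the =list= form. It is a sketch under stated assumptions: properties are a plain =dict=, names are compared exactly (the parser's case handling is not documented here), and =evaluate= is a hypothetical name rather than a function exported by the package.

```python
import json


def evaluate(expr, properties):
    """Evaluate a list-form predicate against a dict of heading properties."""
    op, *args = expr
    if op == "property":
        return args[0] in properties
    if op == "not":
        return not evaluate(args[0], properties)
    if op == "and":
        # all() stops at the first false operand (short-circuit).
        return all(evaluate(arg, properties) for arg in args)
    if op == "or":
        # any() stops at the first true operand (short-circuit).
        return any(evaluate(arg, properties) for arg in args)
    raise ValueError(f"unknown operator: {op!r}")


# The same expression can arrive as JSON, e.g. from a config file.
expr = json.loads('["and", ["property", "Type"], ["not", ["property", "ARCHIVE_TIME"]]]')

active_task = evaluate(expr, {"ID": "bbb-222", "Type": "task"})
archived_task = evaluate(expr, {"ID": "x", "Type": "task", "ARCHIVE_TIME": "2026-03-15"})
untyped = evaluate(expr, {"ID": "ccc-333"})
```

Because the expression is plain JSON, it round-trips through config files and command-line arguments without any Python-specific serialization.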
** "TODO" and "DONE" keywords
The parser needs to know your TODO keywords to parse headings correctly.
If you use custom keywords, pass them in =Config=:
#+begin_src python
config = Config(
todos=("TODO", "NEXT", "WAITING"),
dones=("DONE", "CANCELED"),
)
#+end_src
Without this, a heading like =** NEXT Write report= gets =item.todo= set
to =None=, and ="NEXT"= stays in =item.title=.
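The effect of an undeclared keyword is easy to reproduce with a toy heading splitter. This is a simplified sketch (no priority or tag handling), and =split_heading= is a hypothetical helper, not the parser's code:

```python
def split_heading(text, keywords):
    """Split 'KEYWORD Title' into (keyword, title) if the keyword is known."""
    first, _, rest = text.partition(" ")
    if first in keywords:
        return first, rest
    return None, text


declared = split_heading("NEXT Write report", ("TODO", "NEXT", "WAITING"))
undeclared = split_heading("NEXT Write report", ("TODO",))
```

With the keyword declared, the result is =("NEXT", "Write report")=; without it, the keyword stays inside the title.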
** Drawer and block exclusion
=exclude_drawers= and =exclude_blocks= control what is excluded from
=Item.body=. They do *not* affect link extraction — links are extracted
from the complete raw text, so links inside excluded drawers are still
captured.
The =:LOGBOOK:= drawer is always excluded from body and from generic
timestamp extraction. Its contents are parsed separately into the
dedicated fields =Item.clock= and =Item.state_changes=.
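The split between body filtering and link extraction can be illustrated with a regex-based drawer filter. This is a sketch only: =strip_drawers= and =DRAWER_RE= are hypothetical names, drawer names are assumed to match case-insensitively, and well-formed drawers (each closed by =:END:=) are assumed.

```python
import re

DRAWER_RE = re.compile(
    r"^[ \t]*:([A-Za-z0-9_-]+):[ \t]*\n(.*?)^[ \t]*:END:[ \t]*\n?",
    re.MULTILINE | re.DOTALL,
)


def strip_drawers(text, exclude):
    """Remove the named drawers (and LOGBOOK) from text."""
    names = {name.lower() for name in exclude} | {"logbook"}

    def repl(match):
        # Keep drawers that are not on the exclusion list.
        return "" if match.group(1).lower() in names else match.group(0)

    return DRAWER_RE.sub(repl, text)


raw_text = (
    "Blog post: https://example.com/transformers\n"
    ":SEE_ALSO:\n"
    "[[id:def-456][History of neural networks]]\n"
    ":END:\n"
    "Closing remark.\n"
)
body = strip_drawers(raw_text, {"see_also"})
```

The link target =id:def-456= is gone from =body= but still present in =raw_text=, which is why a link extractor working on the raw text still sees it.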
* Item fields
Each =Item= is a frozen (immutable) dataclass with 24 fields:
| Field | Type | Description |
|----------------+-------------------------------+--------------------------------------------------|
| =title= | =str= | Heading text (without TODO/priority/tags) |
| =item_id= | =str= | Value of =:ID:= property |
| =level= | =int= | Heading level (1, 2, 3...) |
| =linenumber= | =int= | Source file line number |
| =file_path= | =str= | Path to the org file |
| =todo= | =str \vert None= | TODO keyword (=None= if absent) |
| =priority= | =str \vert None= | Priority letter (=None= if absent) |
| =local_tags= | =frozenset[str]= | Tags on this heading |
| =inherited_tags= | =frozenset[str]= | Tags from ancestor headings |
| =parent_item_id= | =str \vert None= | =:ID:= of nearest item ancestor |
| =scheduled= | =Timestamp \vert None= | =SCHEDULED= planning timestamp |
| =deadline= | =Timestamp \vert None= | =DEADLINE= planning timestamp |
| =closed= | =Timestamp \vert None= | =CLOSED= planning timestamp |
| =created= | =Timestamp \vert None= | Creation date (from configured property) |
| =archived= | =Timestamp \vert None= | Archive date (from =ARCHIVE_TIME= property) |
| =active_ts= | =tuple[Timestamp, ...]= | Generic active timestamps from body |
| =inactive_ts= | =tuple[Timestamp, ...]= | Generic inactive timestamps from body |
| =range_ts= | =tuple[Range, ...]= | Date ranges from body |
| =clock= | =tuple[ClockEntry, ...]= | CLOCK entries from =:LOGBOOK:= |
| =state_changes= | =tuple[StateChange, ...]= | State transitions from =:LOGBOOK:= |
| =body= | =str \vert None= | Body text (filtered, =None= if empty) |
| =raw_text= | =str= | Complete unfiltered source text |
| =links= | =tuple[Link, ...]= | All links (org-mode + bare URLs) |
| =properties= | =tuple[tuple[str, str], ...]= | Properties (excluding =ID=, =ARCHIVE_TIME=, created) |
** Supporting types
#+begin_src python
Timestamp(date, active, repeater)
# date: datetime.date | datetime.datetime
# active: bool # <...> = True, [...] = False
# repeater: str | None # e.g. "+1w"
Link(target, description)
# target: str # raw, e.g. "id:abc", "https://...", "Heading"
# description: str | None
Range(start, end, active)
# start: Timestamp
# end: Timestamp
# active: bool
ClockEntry(start, end, duration_minutes)
# start: datetime.datetime
# end: datetime.datetime | None # None for running clocks
# duration_minutes: int | None # None for running clocks
StateChange(to_state, from_state, timestamp)
# to_state: str # e.g. "DONE"
# from_state: str | None # e.g. "TODO", None for first
# timestamp: datetime.datetime
#+end_src
* CLI reference
All =Config= fields are available as CLI flags. Run
=python -m org_dex_parse --help= for the full list.
#+begin_src sh
# Default: any heading with :ID: is an item
python -m org_dex_parse file.org
# With a predicate
python -m org_dex_parse --predicate '["property", "Type"]' file.org
# With TODO keywords
python -m org_dex_parse --todos TODO,NEXT,DOING --dones DONE,CANCELED file.org
# From a config file (all fields optional)
python -m org_dex_parse --config myconfig.json file.org
# JSON output
python -m org_dex_parse --json file.org
# Verbosity: -v adds body, -vv adds raw_text
python -m org_dex_parse -v file.org
python -m org_dex_parse --json -vv file.org
#+end_src
An example config file is included in =examples/config.json= — it
documents all available fields and can be used directly:
#+begin_src sh
python -m org_dex_parse --config examples/config.json file.org
#+end_src
*Precedence:* CLI flags override config file values, which override
defaults.
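The precedence order can be pictured as a layered lookup. A minimal sketch using =collections.ChainMap=; the keys and values below are made up for illustration and do not reflect the actual config schema or default values:

```python
from collections import ChainMap

# Hypothetical values for each layer.
defaults = {"todos": ("TODO",), "dones": ("DONE",), "predicate": None}
config_file = {"todos": ("TODO", "NEXT"), "predicate": ["property", "Type"]}
cli_flags = {"todos": ("TODO", "NEXT", "DOING")}

# Earlier maps win: CLI flags, then the config file, then defaults.
effective = dict(ChainMap(cli_flags, config_file, defaults))
```

Here =todos= comes from the CLI layer, =predicate= from the config file, and =dones= falls through to the defaults.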
* Performance
Extraction profile on a real-world org archive (4,380 items, Linux, Python 3.11):
| Field | Count |
|----------------+-------|
| title | 4380 |
| item_id | 4380 |
| level | 4380 |
| linenumber | 4380 |
| file_path | 4380 |
| todo | 4380 |
| priority | 1442 |
| local_tags | 4380 |
| inherited_tags | 4358 |
| parent_item_id | 0 |
| scheduled | 40 |
| deadline | 4 |
| closed | 4369 |
| created | 0 |
| archived | 4380 |
| active_ts | 2453 |
| inactive_ts | 255 |
| range_ts | 1874 |
| clock | 251 |
| state_changes | 872 |
| body | 3124 |
| raw_text | 4380 |
| links | 10214 |
| properties | 4755 |
| | |
| File size | 5.0 MB |
| Lines | 135,511 |
| Extraction time | 2.5 s |
Breakdown: orgparse loads the syntax tree in ~1.5 s, org-dex-parse
walks the tree and extracts all fields in ~1.0 s. The extraction
phase uses O(n) pre-computed caches for parent lookup and tag
inheritance.
* Assumptions and requirements
The parser makes the following assumptions about the org files it
processes:
- *=:ID:= is required.* A heading without an =:ID:= property is never an
item — it is scaffolding. This is a structural invariant, not a
configurable option.
- *TODO keywords must be declared.* org-mode determines TODO keywords at
file level (=#+TODO:=) or in Emacs configuration. The parser doesn't
read Emacs config — pass your keywords in =Config.todos= / =Config.dones=.
Without them, keywords are not recognized and become part of the
heading title.
- *=org-log-into-drawer= must be =t=.* Emacs ships with this variable set
to =nil=, so it has to be enabled explicitly. The parser filters the
=:LOGBOOK:= drawer by name. Custom drawer names and inline logging are
not supported (see [[*Limitations][Limitations]]).
* Limitations
Known limitations of v0.1.
** LOGBOOK drawer name is hardcoded
The parser assumes =org-log-into-drawer= is =t=, which means state-change
logging goes into a drawer named =:LOGBOOK:=. Note that =t= is not the
Emacs default (the variable ships as =nil=). If your setup uses a custom
drawer name (=org-log-into-drawer= set to a string) or inline logging
(=org-log-into-drawer= set to =nil=), logging timestamps will leak into
=inactive_ts= as false positives.
** Tag character monkey-patch is not thread-safe
When =Config.extra_tag_chars= is non-empty, the parser temporarily
modifies a global regex in orgparse to allow the extra characters. This
is not thread-safe — do not call =parse_file= concurrently from multiple
threads with different =extra_tag_chars= values. Single-threaded use
(including sequential calls with different configs) is safe.
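If concurrent callers cannot be avoided, one workaround is to serialize the calls behind a lock. The sketch below is not part of the package: =parse_serialized= is a hypothetical wrapper, and a fake parse function stands in for =parse_file= so the example runs on its own.

```python
import threading

_PARSE_LOCK = threading.Lock()


def parse_serialized(parse, path, config):
    """Call parse(path, config) while holding a process-wide lock."""
    with _PARSE_LOCK:
        return parse(path, config)


# Demo with a stand-in for parse_file that records overlapping calls.
state = {"active": 0, "max_active": 0, "calls": 0}


def fake_parse(path, config):
    state["active"] += 1
    state["max_active"] = max(state["max_active"], state["active"])
    state["calls"] += 1
    state["active"] -= 1
    return path


threads = [
    threading.Thread(target=parse_serialized, args=(fake_parse, f"f{i}.org", None))
    for i in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

In real use the first argument would be =parse_file=. The lock trades parallelism for safety, which only matters when the threads use different =extra_tag_chars= values.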
** COMMENT keyword not handled
org-mode treats headings starting with =COMMENT= as excluded from
export. The parser does not recognize =COMMENT= as a special keyword —
it becomes part of =Item.title= (or part of the scaffolding heading
text in =body=). If a =COMMENT= heading has =:ID:= and passes the
predicate, it produces an item like any other heading.
** Encrypted headings (org-crypt) not handled
org-mode supports encrypting subtrees via =org-crypt=. The encrypted
body (a PGP/GPG blob) is opaque text — the parser processes it as
regular body content, extracting meaningless timestamps, links, and
text from the ciphertext.
** orgparse private API dependency
The parser depends on 4 private attributes of orgparse (=_repeater=,
=_duration=, =_body_lines=, =RE_HEADING_TAGS=). All access is isolated in
an adapter module (=_orgparse_compat.py=) — the rest of the codebase never
touches orgparse internals directly. The attributes are protected by guard
tests and a version pin (=orgparse>=0.4,<0.5=), but may break if orgparse
changes its internals within the pinned range.
* Development
#+begin_src sh
git clone https://github.com/gdvek/org-dex-parse.git
cd org-dex-parse
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v
#+end_src
* License
GPL-3.0-or-later