org-dex-parse

Parse org-mode files into structured data for the org-dex indexing system

#+title: org-dex-parse
#+author: gdvek

Extract structured data from org-mode files.  Point it at an =.org= file,
get back Python objects — titles, timestamps, links, tags, clock entries,
properties, and more — ready to query, store, or pipe into whatever
you're building.

Built for [[https://github.com/gdvek/org-dex][org-dex]], usable standalone.  Uses [[https://github.com/karlicoss/orgparse][orgparse]] as the parsing
backend.

* Try it
** From the command line

  #+begin_src sh
  pip install org-dex-parse   # Python >= 3.11
  #+end_src

  Use one of your own org files, or create a test file:

  #+begin_example
  * TODO Write report                                       :work:
    DEADLINE: <2026-04-01>
    :PROPERTIES:
    :ID:  abc-001
    :END:
  ** Notes
     Some references: [[id:other][see also]].
  * DONE Review draft
    CLOSED: [2026-03-15 Sun 10:00]
    :PROPERTIES:
    :ID:  abc-002
    :END:
  #+end_example

  #+begin_src sh
  python -m org_dex_parse example.org
  #+end_src

  Output:

  #+begin_example
  example.org: 2 items
    Write report
      id=abc-001  level=1  line=1
      todo=TODO
      local_tags={'work'}
    Review draft
      id=abc-002  level=1  line=9
      todo=DONE
  #+end_example

  Add =-v= to include body text, =--json= for machine-readable output.

  A ready-made config file is included for common setups:

  #+begin_src sh
  python -m org_dex_parse --config examples/config.json example.org
  #+end_src

  It covers TODO keywords, drawer filtering, and item selection rules
  for a typical org-mode setup.  Copy it and adjust to your needs — the
  fields are documented in [[*Configuration][Configuration]].

** From Python

  #+begin_src python
  from org_dex_parse import parse_file, Config

  result = parse_file("notes.org", Config())

  for item in result.items:
      print(f"{item.todo or ''} {item.title}")
      print(f"  id={item.item_id}  tags={item.local_tags}")
      if item.deadline:
          print(f"  deadline={item.deadline.date}")
      if item.links:
          print(f"  links={len(item.links)}")
  #+end_src

  Each =Item= in =result.items= is a heading with =:ID:= that passed
  the configured predicate — see [[*Key concepts][Key concepts]].

* Key concepts
  The parser distinguishes two kinds of headings:

  - *Items* — headings with =:ID:= that pass the predicate.  Each produces
    a 24-field structured object.
  - *Scaffolding* — everything else.  Organizational headings whose
    content (body, links, timestamps, clock) rolls up into the nearest
    ancestor item.  Nothing is lost — scaffolding content is collected,
    not discarded.
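
The roll-up rule can be pictured as a walk over the outline with a stack of ancestors.  The following is a minimal sketch, not the library's implementation: =owner_of= and the =(level, title, is_item)= tuples are hypothetical names used only for illustration, and titles are assumed to be unique.

```python
def owner_of(headings):
    """Map each heading title to the title of the item that owns its content.

    headings: list of (level, title, is_item) tuples in document order.
    An item owns itself; scaffolding is owned by its nearest ancestor
    item, or by None when no ancestor is an item.
    """
    owners = {}
    stack = []  # ancestor chain of the heading being visited
    for level, title, is_item in headings:
        # Drop headings that are not ancestors of the current one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if is_item:
            owners[title] = title
        else:
            owners[title] = next(
                (t for _, t, item in reversed(stack) if item), None
            )
        stack.append((level, title, is_item))
    return owners


outline = [
    (1, "Project", False),
    (2, "Write report", True),
    (3, "Notes", False),
    (2, "Review draft", True),
    (2, "Background reading", False),
]
owners = owner_of(outline)
print(owners["Notes"])               # Write report
print(owners["Background reading"])  # None
```

Scaffolding with no item ancestor (here "Project" and "Background reading") has no owner in this sketch.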

  You control what counts as an item through a *predicate*.  The default
  accepts every heading with =:ID:=.  You can narrow it — for example,
  require a =:Type:= property, or exclude headings with =ROAM_EXCLUDE=.

  #+begin_example
                      org file
                         |
                      orgparse
                    (syntax tree)
                         |
                   org-dex-parse
                  (semantic layer)
                         |
                    Item stream
                  (24-field frozen
                   dataclasses)
                         |
          +--------------+--------------+
          v              v              v
     org-dex       custom indexers   data pipelines
     (DB + UI)     (knowledge graphs)(analytics)
  #+end_example

  orgparse handles the org-mode grammar.  org-dex-parse handles item
  discrimination, field extraction, and content filtering.

* Installation

  #+begin_src sh
  pip install org-dex-parse
  #+end_src

  Requires Python >= 3.11.  Single dependency: =orgparse>=0.4,<0.5=.

* Examples

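  The examples below run the parser on increasingly complex org files.
  Each starts with the org source, then shows the Python code and what
  each field contains.  The Python snippets assume that
  =from org_dex_parse import parse_file, Config= has already been run.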

** Example 1: default predicate — items and scaffolding
   With =Config()= (default), every heading with =:ID:= is an item.
   Headings without =:ID:= are scaffolding — their content rolls up into
   the nearest ancestor item.

   #+begin_example
   * Project
   ** TODO Write report                                       :work:
      DEADLINE: <2026-04-01>
      :PROPERTIES:
      :ID:       a1b2c3
      :END:
   *** Notes
       Some text with [[id:ref][a link]].
       Meeting on <2026-03-20 Fri>.
   ** DONE Review draft
      CLOSED: [2026-03-15 Sun 10:00]
      :PROPERTIES:
      :ID:       d4e5f6
      :END:
   ** Background reading
      No :ID: here — just an organizational heading.
   #+end_example

   #+begin_src python
   config = Config(
       todos=("TODO",),
       dones=("DONE",),
   )
   result = parse_file("project.org", config)
   # result.items → 2 items
   #+end_src

   | Heading      | =:ID:=? | Item? | Why                            |
   |--------------+-------+-------+--------------------------------|
   | Project      | no    | no    | No =:ID:= → scaffolding          |
   | Write report | yes   | yes   | Has =:ID:=                       |
   | Notes        | no    | no    | No =:ID:= → scaffolding of above |
   | Review draft | yes   | yes   | Has =:ID:=                       |
   | Bg reading   | no    | no    | No =:ID:= → scaffolding          |

   "Notes" is scaffolding under "Write report".  Its body text, the link
   =[[id:ref][a link]]=, and the timestamp =<2026-03-20>= all become part
   of the "Write report" item:

   #+begin_src python
   item = result.items[0]  # Write report
   item.title           # "Write report"
   item.todo            # "TODO"
   item.local_tags      # frozenset({"work"})
   item.deadline.date   # datetime.date(2026, 4, 1)
   item.active_ts[0].date  # datetime.date(2026, 3, 20)  ← from "Notes"
   item.links[0].target    # "id:ref"                    ← from "Notes"
   item.body            # "Notes\nSome text with a link.\nMeeting on ..."
   #+end_src

** Example 2: =:Type:= predicate — narrower item definition
   With =Config(item_predicate=["property", "Type"])=, a heading must have
   *both* =:ID:= and a =:Type:= property to be an item:

   #+begin_example
   * Inbox
     :PROPERTIES:
     :ID:       aaa-111
     :Type:     area
     :END:
   ** TODO Buy groceries
      SCHEDULED: <2026-03-17 Tue>
      :PROPERTIES:
      :ID:       bbb-222
      :Type:     task
      :END:
   ** Grocery list
      :PROPERTIES:
      :ID:       ccc-333
      :END:
      - Milk
      - Bread
   #+end_example

   #+begin_src python
   config = Config(
       item_predicate=["property", "Type"],
       todos=("TODO",),
       dones=("DONE",),
   )
   result = parse_file("inbox.org", config)
   # result.items → 2 items (Inbox, Buy groceries)
   # "Grocery list" has :ID: but no :Type: → scaffolding
   #+end_src

   | Heading       | =:ID:=? | =:Type:=? | Item? | Why                                  |
   |---------------+-------+---------+-------+--------------------------------------|
   | Inbox         | yes   | =area=    | yes   | Has =:ID:= + =:Type:=                    |
   | Buy groceries | yes   | =task=    | yes   | Has =:ID:= + =:Type:=                    |
   | Grocery list  | yes   | —       | no    | Has =:ID:= but no =:Type:= → scaffolding |

   "Grocery list" is scaffolding — but it's at level 2, a sibling of "Buy
   groceries", not its child.  Both are children of "Inbox".  So "Grocery
   list" content rolls up to *Inbox*, not "Buy groceries":

   #+begin_src python
   inbox = result.items[0]  # Inbox
   inbox.body              # "Grocery list\n- Milk\n- Bread"

   item = result.items[1]  # Buy groceries
   item.scheduled.date     # datetime.date(2026, 3, 17)
   item.properties         # (("Type", "task"),)
   item.parent_item_id     # "aaa-111"  ← Inbox is the parent item
   item.body               # None — no scaffolding under this item
   #+end_src

** Example 3: org-roam style — exclude =ROAM_EXCLUDE= nodes
   org-roam users typically want every =:ID:= heading *except* those marked
   with =ROAM_EXCLUDE=.  The =not= operator handles this:

   #+begin_example
   * Main topic
     :PROPERTIES:
     :ID:       roam-001
     :END:
     This is a permanent note.
     See also [[https://example.com/reference][Reference paper]].
   ** Supporting argument
      :PROPERTIES:
      :ID:       roam-002
      :END:
      Evidence from [[id:roam-005][another note]].
   ** COMMENT Draft section
      :PROPERTIES:
      :ID:       roam-003
      :ROAM_EXCLUDE: t
      :END:
      Work in progress — not ready for the graph.
   #+end_example

   #+begin_src python
   config = Config(
       item_predicate=["not", ["property", "ROAM_EXCLUDE"]],
   )
   result = parse_file("roam-note.org", config)
   # result.items → 2 items (Main topic, Supporting argument)
   # "Draft section" is excluded by the predicate
   #+end_src

   | Heading    | =:ID:=? | =ROAM_EXCLUDE=? | Item? | Why                    |
   |------------+-------+---------------+-------+------------------------|
   | Main topic | yes   | no            | yes   | =:ID:= + not excluded     |
   | Supporting | yes   | no            | yes   | =:ID:= + not excluded     |
   | Draft      | yes   | =t=             | no    | =ROAM_EXCLUDE= → scaffold |

   #+begin_src python
   item = result.items[0]  # Main topic
   item.links[0].target       # "https://example.com/reference"
   item.links[0].description  # "Reference paper"
   item.body
   # "This is a permanent note.\n"
   # "See also Reference paper.\n"
   # "COMMENT Draft section\n"           ← scaffolding heading
   # "Work in progress — not ready ..."  ← scaffolding body
   #+end_src

** Example 4: LOGBOOK data — clock entries and state changes
   Clock entries and state changes are extracted from the =:LOGBOOK:=
   drawer.  They are collected from the item and its scaffolding children.

   #+begin_example
   * TODO Deep work session                                   :focus:
     SCHEDULED: <2026-03-17 Tue 09:00>
     :PROPERTIES:
     :ID:       clock-001
     :END:
     :LOGBOOK:
     CLOCK: [2026-03-16 Mon 14:00]--[2026-03-16 Mon 15:30] =>  1:30
     CLOCK: [2026-03-16 Mon 10:00]--[2026-03-16 Mon 11:45] =>  1:45
     - State "TODO"       from "PLANNING"  [2026-03-15 Sun 09:00]
     - State "PLANNING"   from              [2026-03-14 Sat 18:00]
     :END:
     Focus on the analysis section.
   #+end_example

   #+begin_src python
   config = Config(
       todos=("PLANNING", "TODO"),
       dones=("DONE",),
   )
   result = parse_file("work.org", config)
   item = result.items[0]

   # Clock entries (collected from :LOGBOOK:)
   len(item.clock)                  # 2
   item.clock[0].start              # datetime(2026, 3, 16, 10, 0)
   item.clock[0].end                # datetime(2026, 3, 16, 11, 45)
   item.clock[0].duration_minutes   # 105
   item.clock[1].start              # datetime(2026, 3, 16, 14, 0)
   item.clock[1].duration_minutes   # 90

   # State changes (chronological order)
   len(item.state_changes)               # 2
   item.state_changes[0].to_state        # "PLANNING"
   item.state_changes[0].from_state      # None  ← first assignment
   item.state_changes[1].to_state        # "TODO"
   item.state_changes[1].from_state      # "PLANNING"

   # Body excludes LOGBOOK content
   item.body   # "Focus on the analysis section."
   #+end_src
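
For readers who want to see where =duration_minutes= comes from, here is a minimal sketch that derives it from the two bracketed timestamps of a =CLOCK:= line.  It is an illustration under assumptions, not the parser's code: =CLOCK_RE= and =parse_clock= are hypothetical names, and the regex only handles the plain =YYYY-MM-DD Day HH:MM= form shown above.

```python
import re
from datetime import datetime

CLOCK_RE = re.compile(
    r"CLOCK: \[(\d{4}-\d{2}-\d{2}) \w+ (\d{2}:\d{2})\]"
    r"(?:--\[(\d{4}-\d{2}-\d{2}) \w+ (\d{2}:\d{2})\])?"
)


def parse_clock(line):
    """Return (start, end, minutes); end and minutes are None for a running clock."""
    m = CLOCK_RE.search(line)
    if m is None:
        return None
    start = datetime.strptime(f"{m[1]} {m[2]}", "%Y-%m-%d %H:%M")
    if m[3] is None:
        return start, None, None
    end = datetime.strptime(f"{m[3]} {m[4]}", "%Y-%m-%d %H:%M")
    minutes = int((end - start).total_seconds() // 60)
    return start, end, minutes


lines = [
    "CLOCK: [2026-03-16 Mon 14:00]--[2026-03-16 Mon 15:30] =>  1:30",
    "CLOCK: [2026-03-16 Mon 10:00]--[2026-03-16 Mon 11:45] =>  1:45",
]
# Sort by start time so the earliest entry comes first.
entries = sorted((parse_clock(line) for line in lines), key=lambda e: e[0])
print([e[2] for e in entries])  # [105, 90]
```

The duration is recomputed from the timestamps rather than read from the ==> 1:30= suffix; whether the library does the same is not stated here.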

** Example 5: timestamps — dedicated vs generic
   The parser distinguishes *dedicated* timestamps (=SCHEDULED=, =DEADLINE=,
   =CLOSED=, =created=, =archived=) from *generic* timestamps found in the
   body text.  Each has its own field — no double-counting.

   #+begin_example
   * DONE Submit paper
     SCHEDULED: <2026-03-01 Sun> DEADLINE: <2026-03-10 Tue> CLOSED: [2026-03-09 Mon 23:55]
     :PROPERTIES:
     :ID:       ts-001
     :CREATED:  [2026-01-10 Sat]
     :ARCHIVE_TIME: 2026-03-15 Sun 12:00
     :END:
     Submitted before the deadline.
     Conference is <2026-06-15 Mon>--<2026-06-18 Thu>.
     Received confirmation on [2026-03-10 Tue].
   #+end_example

   #+begin_src python
   config = Config(dones=("DONE",))
   result = parse_file("paper.org", config)
   item = result.items[0]

   # Dedicated timestamps — from planning line and properties
   item.scheduled.date       # datetime.date(2026, 3, 1)
   item.scheduled.active     # True   (angle brackets)
   item.deadline.date        # datetime.date(2026, 3, 10)
   item.closed.date          # datetime.datetime(2026, 3, 9, 23, 55)
   item.closed.active        # False  (square brackets)
   item.created.date         # datetime.date(2026, 1, 10)
   item.archived.date        # datetime.datetime(2026, 3, 15, 12, 0)

   # Generic timestamps — from body text only (no overlap with above)
   len(item.active_ts)       # 0  ← the range endpoints are NOT here
   len(item.inactive_ts)     # 1  ← [2026-03-10 Tue]
   len(item.range_ts)        # 1  ← the conference range
   item.range_ts[0].start.date  # datetime.date(2026, 6, 15)
   item.range_ts[0].end.date    # datetime.date(2026, 6, 18)
   item.range_ts[0].active      # True
   #+end_src
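
The "no double-counting" rule for ranges can be sketched with two regular expressions: match ranges first, blank them out, then look for single timestamps in what remains.  This is a rough illustration only; =classify= and both patterns are hypothetical and far less strict than a real org-mode timestamp grammar.

```python
import re

# A single <...> or [...] timestamp starting with an ISO date.
TS = r"[<\[]\d{4}-\d{2}-\d{2}[^>\]\n]*[>\]]"
RANGE_RE = re.compile(rf"({TS})--({TS})")
SINGLE_RE = re.compile(TS)


def classify(text):
    """Split body timestamps into (active, inactive, ranges)."""
    ranges = RANGE_RE.findall(text)
    # Blank out ranges so their endpoints are not counted a second time.
    rest = RANGE_RE.sub(" ", text)
    singles = SINGLE_RE.findall(rest)
    active = [t for t in singles if t.startswith("<")]
    inactive = [t for t in singles if t.startswith("[")]
    return active, inactive, ranges


body = (
    "Submitted before the deadline.\n"
    "Conference is <2026-06-15 Mon>--<2026-06-18 Thu>.\n"
    "Received confirmation on [2026-03-10 Tue]."
)
active, inactive, ranges = classify(body)
print(len(active), len(inactive), len(ranges))  # 0 1 1
```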

   *Scaffolding planning lines become generic timestamps.*  The rule above
   (dedicated fields, no double-counting) applies only to the item's own
   planning line.  When a scaffolding heading has =SCHEDULED=, =DEADLINE=,
   or =CLOSED=, those timestamps have no dedicated destination — they are
   promoted to generic timestamps (=active_ts= / =inactive_ts=) so they are
   not lost.

   #+begin_example
   * TODO Project plan
     DEADLINE: <2026-04-01>
     :PROPERTIES:
     :ID:       plan-001
     :END:
   ** Phase 1
      SCHEDULED: <2026-03-15 Sun>
      Define requirements.
   ** Phase 2
      DEADLINE: <2026-03-25 Wed>
      Build prototype.
   #+end_example

   #+begin_src python
   config = Config(todos=("TODO",), dones=("DONE",))
   result = parse_file("plan.org", config)
   item = result.items[0]  # Project plan

   # Item's own planning → dedicated field
   item.deadline.date        # datetime.date(2026, 4, 1)

   # Scaffolding planning → promoted to generic timestamps
   # Phase 1's SCHEDULED and Phase 2's DEADLINE have no dedicated
   # field on the parent item, so they become active_ts.
   len(item.active_ts)       # 2
   item.active_ts[0].date    # datetime.date(2026, 3, 15)  ← Phase 1 SCHEDULED
   item.active_ts[1].date    # datetime.date(2026, 3, 25)  ← Phase 2 DEADLINE
   #+end_src

** Example 6: tags, properties, and inheritance
   Tags on a heading are =local_tags=.  Tags from ancestors and from
   =#+FILETAGS:= are =inherited_tags= (minus any tags in
   =tags_exclude_from_inheritance=).  Properties come only from the
   heading's own =:PROPERTIES:= drawer: they are neither inherited from
   ancestors nor collected from children.

   #+begin_example
   #+FILETAGS: :project:

   * Research                                                :science:
     :PROPERTIES:
     :ID:       tag-001
     :Type:     area
     :Effort:   3:00
     :END:
   ** Literature review                                       :reading:
      :PROPERTIES:
      :ID:       tag-002
      :Type:     task
      :END:
   #+end_example

   #+begin_src python
   config = Config(
       item_predicate=["property", "Type"],
       tags_exclude_from_inheritance=frozenset({"noexport"}),
   )
   result = parse_file("research.org", config)

   parent = result.items[0]  # Research
   parent.local_tags       # frozenset({"science"})
   parent.inherited_tags   # frozenset({"project"})  ← from FILETAGS
   parent.properties       # (("Type", "area"), ("Effort", "180"))

   child = result.items[1]  # Literature review
   child.local_tags        # frozenset({"reading"})
   child.inherited_tags    # frozenset({"project", "science"})
   child.parent_item_id    # "tag-001"
   child.properties        # (("Type", "task"),)
   # Effort is NOT here — properties are per-heading, not inherited
   #+end_src
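
The tag arithmetic in this example can be written down in a few lines.  The sketch below is an assumption-laden illustration, not the library's code: =inherited_tags= is a hypothetical helper, and it applies the exclusion set to file tags and ancestor tags alike, which may differ from what the parser does.

```python
def inherited_tags(ancestor_tags, file_tags=(), exclude=()):
    """Union of file tags and ancestor tags, minus non-inheritable ones.

    ancestor_tags: iterable of tag sets, one per ancestor heading.
    """
    tags = set(file_tags)
    for local in ancestor_tags:
        tags.update(local)
    return frozenset(tags - set(exclude))


# "Research" has no ancestor headings, only the file-level tag.
research = inherited_tags([], file_tags={"project"})
# "Literature review" sits under "Research", which is tagged :science:.
review = inherited_tags([{"science"}], file_tags={"project"}, exclude={"noexport"})
# A tag listed in the exclusion set does not propagate.
pinned = inherited_tags([{"science", "pin"}], exclude={"pin"})

print(sorted(research))  # ['project']
print(sorted(review))    # ['project', 'science']
print(sorted(pinned))    # ['science']
```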

** Example 7: links — org-mode and bare URLs
   Links are extracted from the complete =raw_text= of the item (including
   scaffolding children and content inside excluded drawers).  Two kinds
   are captured:

   - *Org-mode links* — any =[[target]]= or =[[target][description]]=,
     regardless of schema (=id:=, =https://=, =file:=, =./image.png=,
     fuzzy, etc.).  The target is stored raw — the consumer extracts
     the schema if needed.
   - *Bare URLs* — =http://= and =https://= URLs outside of =[[...]]=.

   #+begin_example
   * Reference collection
     :PROPERTIES:
     :ID:       link-001
     :END:
     Key paper: [[https://arxiv.org/abs/2301.00001][Attention is all you need]].
     Related note: [[id:abc-123][Transformer architecture]].
     Blog post: https://example.com/transformers
     :SEE_ALSO:
     [[id:def-456][History of neural networks]]
     :END:
   #+end_example

   #+begin_src python
   config = Config(
       exclude_drawers=frozenset({"see_also"}),
   )
   result = parse_file("refs.org", config)
   item = result.items[0]

   len(item.links)  # 4

   item.links[0].target       # "https://arxiv.org/abs/2301.00001"
   item.links[0].description  # "Attention is all you need"

   item.links[1].target       # "id:abc-123"
   item.links[1].description  # "Transformer architecture"

   item.links[2].target       # "https://example.com/transformers"
   item.links[2].description  # None  ← bare URL, no description

   item.links[3].target       # "id:def-456"
   item.links[3].description  # "History of neural networks"
   # ↑ extracted from :SEE_ALSO: — links survive drawer exclusion

   # Body EXCLUDES :SEE_ALSO: content
   item.body
   # "Key paper: Attention is all you need.\n"
   # "Related note: Transformer architecture.\n"
   # "Blog post: https://example.com/transformers"
   #+end_src
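
To make the two link kinds concrete, here is a minimal regex-based sketch that reproduces the four links above from the raw text.  =extract_links= and both patterns are hypothetical simplifications, not the parser's implementation; for instance, a bare URL followed directly by a period would keep the period.

```python
import re

# [[target]] or [[target][description]]
BRACKET_RE = re.compile(r"\[\[([^\]\[]+)\](?:\[([^\]]+)\])?\]")
BARE_URL_RE = re.compile(r"https?://[^\s\]\[<>]+")


def extract_links(raw_text):
    """Return (target, description) pairs in document order."""
    found = []  # (position, target, description)
    for m in BRACKET_RE.finditer(raw_text):
        found.append((m.start(), m.group(1), m.group(2)))
    # Mask bracket links so URLs inside them are not reported twice.
    masked = BRACKET_RE.sub(lambda m: " " * len(m.group(0)), raw_text)
    for m in BARE_URL_RE.finditer(masked):
        found.append((m.start(), m.group(0), None))
    return [(t, d) for _, t, d in sorted(found, key=lambda f: f[0])]


raw_text = """Key paper: [[https://arxiv.org/abs/2301.00001][Attention is all you need]].
Related note: [[id:abc-123][Transformer architecture]].
Blog post: https://example.com/transformers
:SEE_ALSO:
[[id:def-456][History of neural networks]]
:END:
"""
links = extract_links(raw_text)
print(len(links))  # 4
print(links[2])    # ('https://example.com/transformers', None)
```

Because the scan runs over the raw text, the link inside =:SEE_ALSO:= is found regardless of any drawer filtering applied to the body.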

** Example 8: body and raw_text — what's included, what's filtered

   =body= is the filtered text meant for display.  =raw_text= is the
   complete unfiltered org-mode source.  Both include scaffolding children.

   #+begin_example
   * TODO Prepare presentation                                :work:
     DEADLINE: <2026-04-01>
     :PROPERTIES:
     :ID:       body-001
     :Type:     task
     :END:
     :LOGBOOK:
     - State "TODO" from "PLANNING" [2026-03-15 Sun 09:00]
     :END:
     First draft of the slides.
     See [[id:ref-001][design document]].
   ** Outline
      - Introduction (5 min)
      - Main argument (15 min)
      - Q&A (10 min)
   #+end_example

   #+begin_src python
   config = Config(
       item_predicate=["property", "Type"],
       todos=("PLANNING", "TODO"),
       dones=("DONE",),
   )
   result = parse_file("pres.org", config)
   item = result.items[0]

   # body: filtered, human-readable
   # - PROPERTIES drawer: excluded (orgparse strips it from body)
   # - LOGBOOK drawer: excluded (always, hardcoded)
   # - "Outline" heading: INCLUDED (scaffolding heading text)
   # - Link syntax resolved to description text
   item.body
   # "First draft of the slides.\n"
   # "See design document.\n"
   # "Outline\n"
   # "- Introduction (5 min)\n"
   # "- Main argument (15 min)\n"
   # "- Q&A (10 min)"

   # raw_text: complete unfiltered org source
   # Includes PROPERTIES, LOGBOOK, link syntax, everything.
   # Does NOT include content from other items.
   "LOGBOOK" in item.raw_text       # True
   ":ID:" in item.raw_text          # True
   "[[id:ref-001]" in item.raw_text # True  ← raw link syntax preserved
   #+end_src

* Configuration
  =Config= controls what the parser considers an item and how it extracts
  data.  All fields have sensible defaults — the minimal config is
  =Config()= (any heading with =:ID:= is an item).

  #+begin_src python
  from org_dex_parse import Config

  config = Config(
      # Which headings with :ID: are items (default: all of them)
      item_predicate=["property", "Type"],

      # TODO keywords for your org-mode setup
      todos=("TODO", "NEXT", "DOING"),
      dones=("DONE", "CANCELED"),

      # Tags that don't propagate to children
      # (matches org-tags-exclude-from-inheritance)
      tags_exclude_from_inheritance=frozenset({"noexport", "pin"}),

      # Drawers excluded from body text (not from links)
      exclude_drawers=frozenset({"logbook", "see_also"}),

      # Blocks excluded from body text
      exclude_blocks=frozenset({"comment"}),

      # Properties omitted from Item.properties
      exclude_properties=frozenset({"archive_file"}),

      # Property name for creation date (default "CREATED")
      created_property="CREATED",

      # Extra characters allowed in tag names (default: none)
      # Standard org-mode: [a-zA-Z0-9_@]
      extra_tag_chars="%#",
  )
  #+end_src

** Item predicate
   The predicate determines which =:ID:= headings become items.  Three
   forms are accepted:

   | Form     | Example                                     | Use case                        |
   |----------+---------------------------------------------+---------------------------------|
   | =None=     | =Config()=                                    | All headings with =:ID:=          |
   | =list=     | =Config(item_predicate=["property", "Type"])= | JSON-serializable (recommended) |
   | =callable= | =Config(item_predicate=lambda h: ...)=        | Python-only                     |

   The =list= form uses s-expressions (JSON arrays) with these operators:

   | Operator | Example                                                              | Meaning                        |
   |----------+----------------------------------------------------------------------+--------------------------------|
   | =property= | =["property", "Type"]=                                                 | Has property =Type=              |
   | =not=      | =["not", ["property", "ARCHIVE_TIME"]]=                                | Negation                       |
   | =and=      | =["and", ["property", "Type"], ["not", ["property", "ARCHIVE_TIME"]]]= | All must match (short-circuit) |
   | =or=       | =["or", expr1, expr2]=                                                 | Any must match (short-circuit) |

   The =list= form is the recommended interface — it is serializable (JSON-RPC,
   config files, CLI) and covers the common cases.  The =callable= form exists
   for backward compatibility and advanced use.
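
The operator semantics in the table can be captured by a small recursive evaluator.  The sketch below is only an illustration of those semantics: =matches= is a hypothetical function, property names are compared case-sensitively (an assumption), and the =:ID:= requirement that applies before any predicate is not modelled.

```python
def matches(expr, properties):
    """Evaluate a list-form predicate against a dict of heading properties."""
    op, *args = expr
    if op == "property":
        return args[0] in properties
    if op == "not":
        return not matches(args[0], properties)
    if op == "and":
        # all() stops at the first False operand (short-circuit).
        return all(matches(a, properties) for a in args)
    if op == "or":
        # any() stops at the first True operand (short-circuit).
        return any(matches(a, properties) for a in args)
    raise ValueError(f"unknown operator: {op!r}")


predicate = ["and", ["property", "Type"], ["not", ["property", "ARCHIVE_TIME"]]]

live_task = {"ID": "a1", "Type": "task"}
archived_task = {"ID": "a2", "Type": "task", "ARCHIVE_TIME": "2026-03-15 Sun 12:00"}
untyped = {"ID": "a3"}

print(matches(predicate, live_task))      # True
print(matches(predicate, archived_task))  # False
print(matches(predicate, untyped))        # False
```

Since the expression is plain nested lists and strings, it round-trips through =json.dumps= and =json.loads= unchanged, which is what makes the list form suitable for config files and CLI flags.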

** "TODO" and "DONE" keywords
   The parser needs to know your TODO keywords to parse headings correctly.
   If you use custom keywords, pass them in =Config=:

   #+begin_src python
   config = Config(
       todos=("TODO", "NEXT", "WAITING"),
       dones=("DONE", "CANCELED"),
   )
   #+end_src

   Without this, a heading like =** NEXT Write report= gets =item.todo=
   set to =None=, and ="NEXT"= stays in =item.title=.
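
Why an undeclared keyword ends up in the title is easy to see in a toy version of the split.  This is a hedged sketch with a hypothetical helper, =split_heading=, that ignores priorities and tags; it is not how orgparse or org-dex-parse actually tokenize headings.

```python
def split_heading(text, keywords):
    """Return (todo, title) for heading text with the leading stars removed."""
    first, _, rest = text.partition(" ")
    if first in keywords:
        return first, rest.strip()
    # The first word is not a declared keyword, so it belongs to the title.
    return None, text


declared = split_heading("NEXT Write report", {"TODO", "NEXT", "DONE"})
undeclared = split_heading("NEXT Write report", {"TODO", "DONE"})

print(declared)    # ('NEXT', 'Write report')
print(undeclared)  # (None, 'NEXT Write report')
```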

** Drawer and block exclusion
   =exclude_drawers= and =exclude_blocks= control what is excluded from
   =Item.body=.  They do *not* affect link extraction — links are extracted
   from the complete raw text, so links inside excluded drawers are still
   captured.

   The =:LOGBOOK:= drawer is always excluded from body and from generic
   timestamp extraction.  Its contents are parsed by dedicated handlers
   (=Item.clock=, =Item.state_changes=).
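
As an illustration of name-based drawer filtering, the sketch below drops the lines of excluded drawers from a list of body lines.  =strip_drawers= is a hypothetical helper written for this README, not a function of the library, and it assumes drawers are not nested.

```python
def strip_drawers(lines, excluded):
    """Drop the lines of every drawer whose lowercase name is in `excluded`."""
    kept, skipping = [], False
    for line in lines:
        stripped = line.strip()
        if skipping:
            # Inside an excluded drawer: discard until its :END: line.
            if stripped.upper() == ":END:":
                skipping = False
            continue
        is_drawer_start = (
            len(stripped) > 2
            and stripped.startswith(":")
            and stripped.endswith(":")
            and " " not in stripped
        )
        if is_drawer_start and stripped[1:-1].lower() in excluded:
            skipping = True
            continue
        kept.append(line)
    return kept


lines = [
    "Blog post: https://example.com/transformers",
    ":SEE_ALSO:",
    "[[id:def-456][History of neural networks]]",
    ":END:",
    "After the drawer.",
]
body_lines = strip_drawers(lines, {"logbook", "see_also"})
print(body_lines)
# ['Blog post: https://example.com/transformers', 'After the drawer.']
```

Link extraction would run on the unfiltered =lines=, which is why the =id:def-456= link is still captured even though it never reaches the body.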

* Item fields

  Each =Item= is a frozen (immutable) dataclass with 24 fields:

  | Field          | Type                          | Description                                      |
  |----------------+-------------------------------+--------------------------------------------------|
  | =title=          | =str=                           | Heading text (without TODO/priority/tags)        |
  | =item_id=        | =str=                           | Value of =:ID:= property                           |
  | =level=          | =int=                           | Heading level (1, 2, 3...)                       |
  | =linenumber=     | =int=                           | Source file line number                          |
  | =file_path=      | =str=                           | Path to the org file                             |
  | =todo=           | =str \vert None=                | TODO keyword (=None= if absent)                    |
  | =priority=       | =str \vert None=                | Priority letter (=None= if absent)                 |
  | =local_tags=     | =frozenset[str]=                | Tags on this heading                             |
  | =inherited_tags= | =frozenset[str]=                | Tags from ancestor headings                      |
  | =parent_item_id= | =str \vert None=                | =:ID:= of nearest item ancestor                    |
  | =scheduled=      | =Timestamp \vert None=          | =SCHEDULED= planning timestamp                     |
  | =deadline=       | =Timestamp \vert None=          | =DEADLINE= planning timestamp                      |
  | =closed=         | =Timestamp \vert None=          | =CLOSED= planning timestamp                        |
  | =created=        | =Timestamp \vert None=          | Creation date (from configured property)         |
  | =archived=       | =Timestamp \vert None=          | Archive date (from =ARCHIVE_TIME= property)        |
  | =active_ts=      | =tuple[Timestamp, ...]=         | Generic active timestamps from body              |
  | =inactive_ts=    | =tuple[Timestamp, ...]=         | Generic inactive timestamps from body            |
  | =range_ts=       | =tuple[Range, ...]=             | Date ranges from body                            |
  | =clock=          | =tuple[ClockEntry, ...]=        | CLOCK entries from =:LOGBOOK:=                     |
  | =state_changes=  | =tuple[StateChange, ...]=       | State transitions from =:LOGBOOK:=                 |
  | =body=           | =str \vert None=                | Body text (filtered, =None= if empty)              |
  | =raw_text=       | =str=                           | Complete unfiltered source text                  |
  | =links=          | =tuple[Link, ...]=              | All links (org-mode + bare URLs)                 |
  | =properties=     | =tuple[tuple[str, str], ...]= | Properties (excluding =ID=, =ARCHIVE_TIME=, created) |

** Supporting types

   #+begin_src python
   Timestamp(date, active, repeater)
   #   date: datetime.date | datetime.datetime
   #   active: bool            # <...> = True, [...] = False
   #   repeater: str | None    # e.g. "+1w"

   Link(target, description)
   #   target: str             # raw, e.g. "id:abc", "https://...", "Heading"
   #   description: str | None

   Range(start, end, active)
   #   start: Timestamp
   #   end: Timestamp
   #   active: bool

   ClockEntry(start, end, duration_minutes)
   #   start: datetime.datetime
   #   end: datetime.datetime | None      # None for running clocks
   #   duration_minutes: int | None       # None for running clocks

   StateChange(to_state, from_state, timestamp)
   #   to_state: str                      # e.g. "DONE"
   #   from_state: str | None             # e.g. "TODO", None for first
   #   timestamp: datetime.datetime
   #+end_src

* CLI reference
  All =Config= fields are available as CLI flags.  Run
  =python -m org_dex_parse --help= for the full list.

  #+begin_src sh
  # Default: any heading with :ID: is an item
  python -m org_dex_parse file.org

  # With a predicate
  python -m org_dex_parse --predicate '["property", "Type"]' file.org

  # With TODO keywords
  python -m org_dex_parse --todos TODO,NEXT,DOING --dones DONE,CANCELED file.org

  # From a config file (all fields optional)
  python -m org_dex_parse --config myconfig.json file.org

  # JSON output
  python -m org_dex_parse --json file.org

  # Verbosity: -v adds body, -vv adds raw_text
  python -m org_dex_parse -v file.org
  python -m org_dex_parse --json -vv file.org
  #+end_src

  An example config file is included in =examples/config.json= — it
  documents all available fields and can be used directly:

  #+begin_src sh
  python -m org_dex_parse --config examples/config.json file.org
  #+end_src

  *Precedence:* CLI flags override config file values, which override
  defaults.
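
The precedence chain behaves like a layered lookup.  A minimal sketch with =collections.ChainMap= follows; the three dictionaries and their values are made up for illustration and do not reflect the library's real defaults or its merging code.

```python
from collections import ChainMap

# Hypothetical values for each layer.
defaults = {"todos": ("TODO",), "dones": ("DONE",), "exclude_drawers": ()}
file_values = {"todos": ("TODO", "NEXT"), "exclude_drawers": ("see_also",)}
cli_flags = {"todos": ("TODO", "NEXT", "DOING")}

# ChainMap looks keys up left to right, so earlier mappings win.
effective = dict(ChainMap(cli_flags, file_values, defaults))

print(effective["todos"])            # ('TODO', 'NEXT', 'DOING')  from the CLI
print(effective["exclude_drawers"])  # ('see_also',)              from the config file
print(effective["dones"])            # ('DONE',)                  from the defaults
```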

* Performance
  Extraction profile on a real-world org archive (4,380 items, Linux, Python 3.11):

  | Field          | Count |
  |----------------+-------|
  | title          |  4380 |
  | item_id        |  4380 |
  | level          |  4380 |
  | linenumber     |  4380 |
  | file_path      |  4380 |
  | todo           |  4380 |
  | priority       |  1442 |
  | local_tags     |  4380 |
  | inherited_tags |  4358 |
  | parent_item_id |     0 |
  | scheduled      |    40 |
  | deadline       |     4 |
  | closed         |  4369 |
  | created        |     0 |
  | archived       |  4380 |
  | active_ts      |  2453 |
  | inactive_ts    |   255 |
  | range_ts       |  1874 |
  | clock          |   251 |
  | state_changes  |   872 |
  | body           |  3124 |
  | raw_text       |  4380 |
  | links          | 10214 |
  | properties     |  4755 |


  | Metric          | Value   |
  |-----------------+---------|
  | File size       | 5.0 MB  |
  | Lines           | 135,511 |
  | Extraction time | 2.5 s   |

  Breakdown: orgparse loads the syntax tree in ~1.5 s, org-dex-parse
  walks the tree and extracts all fields in ~1.0 s.  The extraction
  phase uses O(n) pre-computed caches for parent lookup and tag
  inheritance.

* Assumptions and requirements
  The parser makes the following assumptions about the org files it
  processes:

  - *=:ID:= is required.* A heading without an =:ID:= property is never an
    item — it is scaffolding.  This is a structural invariant, not a
    configurable option.

  - *TODO keywords must be declared.* org-mode determines TODO keywords at
    file level (=#+TODO:=) or in Emacs configuration.  The parser doesn't
    read Emacs config — pass your keywords in =Config.todos= / =Config.dones=.
    Without them, keywords are not recognized and become part of the
    heading title.

  - *=org-log-into-drawer= must be =t=.*  The stock org-mode default is
    =nil=, so this has to be set explicitly.  The parser filters the
    =:LOGBOOK:= drawer by name.  Custom drawer names and inline
    logging are not supported (see [[*Limitations][Limitations]]).

* Limitations
  Known limitations of v0.1:

** LOGBOOK drawer name is hardcoded
   The parser assumes =org-log-into-drawer= is =t=, which means logging
   goes into a drawer named =:LOGBOOK:=.  If your setup uses a custom
   drawer name (=org-log-into-drawer= set to a string) or inline logging
   (=org-log-into-drawer= set to =nil=, the stock org-mode default),
   logging timestamps will leak into =inactive_ts= as false positives.

** Tag character monkey-patch is not thread-safe
   When =Config.extra_tag_chars= is non-empty, the parser temporarily
   modifies a global regex in orgparse to allow the extra characters.  This
   is not thread-safe — do not call =parse_file= concurrently from multiple
   threads with different =extra_tag_chars= values.  Single-threaded use
   (including sequential calls with different configs) is safe.
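
If a multi-threaded program does need different =extra_tag_chars= values, one possible workaround is to serialize the calls behind a lock.  The sketch below is an assumption, not a feature of the library: =parse_serialized= is a hypothetical wrapper, and =fake_parse= is a stand-in used here so the example runs without any org files.

```python
import threading
import time

_parse_lock = threading.Lock()


def parse_serialized(parse, path, config):
    """Call parse(path, config) while holding a process-wide lock."""
    with _parse_lock:
        return parse(path, config)


# Stand-in for parse_file that records how many calls overlap.
stats = {"running": 0, "max_running": 0}


def fake_parse(path, config):
    stats["running"] += 1
    stats["max_running"] = max(stats["max_running"], stats["running"])
    time.sleep(0.01)  # pretend to do some parsing work
    stats["running"] -= 1
    return path


threads = [
    threading.Thread(target=parse_serialized, args=(fake_parse, f"file{i}.org", None))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stats["max_running"])  # 1
```

In real use the call would look like =parse_serialized(parse_file, "notes.org", config)=.  The lock removes parallelism for parsing, so it trades throughput for safety.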

** COMMENT keyword not handled
   org-mode treats headings starting with =COMMENT= as excluded from
   export.  The parser does not recognize =COMMENT= as a special keyword —
   it becomes part of =Item.title= (or part of the scaffolding heading
   text in =body=).  If a =COMMENT= heading has =:ID:= and passes the
   predicate, it produces an item like any other heading.

** Encrypted headings (org-crypt) not handled
   org-mode supports encrypting subtrees via =org-crypt=.  The encrypted
   body (a PGP/GPG blob) is opaque text — the parser processes it as
   regular body content, extracting meaningless timestamps, links, and
   text from the ciphertext.
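
Until encrypted subtrees are handled, a consumer can screen items on its side.  The following is a minimal sketch under the assumption that the encrypted body is an ASCII-armored PGP message; =looks_encrypted= is a hypothetical helper and the sample ciphertext is a placeholder.

```python
PGP_HEADER = "-----BEGIN PGP MESSAGE-----"


def looks_encrypted(text):
    """True when the text contains an ASCII-armored PGP message header."""
    return text is not None and PGP_HEADER in text


bodies = [
    "Plain notes.",
    "-----BEGIN PGP MESSAGE-----\n(placeholder ciphertext)\n-----END PGP MESSAGE-----",
    None,
]
flags = [looks_encrypted(b) for b in bodies]
print(flags)  # [False, True, False]
```

Applied to parser output, this could be used as =[i for i in result.items if not looks_encrypted(i.raw_text)]= to skip items whose extracted timestamps and links are likely noise.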

** orgparse private API dependency
   The parser depends on 4 private attributes of orgparse (=_repeater=,
   =_duration=, =_body_lines=, =RE_HEADING_TAGS=).  All access is isolated in
   an adapter module (=_orgparse_compat.py=) — the rest of the codebase never
   touches orgparse internals directly.  The attributes are protected by guard
   tests and a version pin (=orgparse>=0.4,<0.5=), but may break if orgparse
   changes its internals within the pinned range.

* Development
  #+begin_src sh
  git clone https://github.com/gdvek/org-dex-parse.git
  cd org-dex-parse
  python -m venv .venv
  source .venv/bin/activate
  pip install -e ".[dev]"
  pytest tests/ -v
  #+end_src

* License
  GPL-3.0-or-later

Release history

- 0.1.3 (this version): Mar 22, 2026
- 0.1.2: Mar 17, 2026
- 0.1.1: Mar 16, 2026
- 0.1.0: Mar 16, 2026

Download files

| File                                 | Type                | Size    | Uploaded     | Tags     |
|--------------------------------------+---------------------+---------+--------------+----------|
| org_dex_parse-0.1.3.tar.gz           | Source Distribution | 88.5 kB | Mar 22, 2026 | Source   |
| org_dex_parse-0.1.3-py3-none-any.whl | Built Distribution  | 48.0 kB | Mar 22, 2026 | Python 3 |

Both files: uploaded using Trusted Publishing? No.  Uploaded via: Hatch/1.16.5 cpython/3.12.3 HTTPX/0.28.1.

Hashes for org_dex_parse-0.1.3.tar.gz

| Algorithm   | Hash digest                                                      |
|-------------+------------------------------------------------------------------|
| SHA256      | 1cd6f1b60316f91ac1ceb27bef5dfb687e55df3edb624087f2bea6debb76de74 |
| MD5         | 89a727aa7d812b2337f65976978ab045                                 |
| BLAKE2b-256 | 6473053f2ae0b6dd82128fe6d2f1e66c3923c4aa17cdfd659ffcb1e391664e5f |

Hashes for org_dex_parse-0.1.3-py3-none-any.whl

| Algorithm   | Hash digest                                                      |
|-------------+------------------------------------------------------------------|
| SHA256      | 9258aa040943cc1748f650e77a49f49ce3c1fd28fbb56d13797b54ac1eae33be |
| MD5         | 3091bd49da81082a6ef091ddd959f233                                 |
| BLAKE2b-256 | 7fe7f614d910180e45cb104fe78f288e41d394772f58a329dc7f646abb3a51e0 |