Lightweight XBRL 2.1 / iXBRL 1.1 parser and structured data extraction library
xbrl-core — Lightweight XBRL 2.1 / iXBRL 1.1 Parser for Python
xbrl-core is a pure-Python parser and structured data extraction library for XBRL 2.1 instance documents and iXBRL (Inline XBRL) documents. It supports fact extraction, context/unit structuring, all five linkbase types (presentation, calculation, definition, label, reference), XSD schema parsing, calculation validation, text block extraction, pandas DataFrame conversion, and Rich/HTML rendering. The only required dependency is lxml.
Installation
pip install xbrl-core
Optional dependencies:
# pandas + pyarrow (DataFrame conversion, Parquet export)
pip install 'xbrl-core[analysis]'
# Rich terminal display
pip install 'xbrl-core[display]'
# Excel export (pandas + openpyxl)
pip install 'xbrl-core[excel]'
# Everything
pip install 'xbrl-core[all]'
Quick Start
from xbrl_core import parse_xbrl_facts, structure_contexts, build_line_items
# 1. Parse an XBRL instance document
with open("instance.xbrl", "rb") as f:
    parsed = parse_xbrl_facts(f.read(), source_path="instance.xbrl")
print(f"Facts: {parsed.fact_count}")
# 2. Structure contexts and build typed LineItems
ctx_map = structure_contexts(parsed.contexts)
items = build_line_items(parsed.facts, ctx_map)
for item in items[:5]:
    print(item.local_name, item.value, item.period)
Parsing
XBRL Instance
parse_xbrl_facts() takes raw bytes and returns a ParsedXBRL containing facts, contexts, units, schema refs, footnote links, and ignored elements.
from xbrl_core import parse_xbrl_facts
parsed = parse_xbrl_facts(xbrl_bytes, source_path="example.xbrl")
# Extracted data
parsed.facts # tuple[RawFact, ...]
parsed.contexts # tuple[RawContext, ...]
parsed.units # tuple[RawUnit, ...]
parsed.schema_refs # tuple[RawSchemaRef, ...]
parsed.footnote_links # tuple[RawFootnoteLink, ...]
parsed.ignored_elements # tuple[IgnoredElement, ...]
parsed.fact_count # int
iXBRL (Inline XBRL)
parse_ixbrl_facts() parses iXBRL (XHTML-embedded XBRL) documents. The output is the same ParsedXBRL type, so downstream pipelines work identically.
from xbrl_core import parse_ixbrl_facts
parsed = parse_ixbrl_facts(ixbrl_bytes, source_path="report.htm")
for fact in parsed.facts[:5]:
    print(fact.local_name, fact.value_raw)
iXBRL format attributes (ixt:numdotdecimal, ixt:numcommadecimal, etc.) and scale/sign attributes are automatically applied. Custom formats can be registered:
from xbrl_core import FormatRegistry, parse_ixbrl_facts
registry = FormatRegistry()
registry.register("dateyearmonthdaycjk", my_cjk_date_func)
parsed = parse_ixbrl_facts(ixbrl_bytes, format_registry=registry)
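The snippet above leaves my_cjk_date_func undefined. A minimal sketch of what such a function might look like follows. It assumes, without confirmation from this README, that a registered format function receives the displayed text as a str and returns the normalized value as a str; check the FormatRegistry documentation for the actual callable signature.

```python
import datetime
import re

# Matches dates such as "2024年3月31日" (year, month, day with CJK markers).
_CJK_DATE = re.compile(r"^\s*(\d{4})年\s*(\d{1,2})月\s*(\d{1,2})日\s*$")


def my_cjk_date_func(text: str) -> str:
    """Convert a "YYYY年M月D日" string to ISO 8601 ("YYYY-MM-DD").

    Hypothetical helper: the str -> str signature is an assumption.
    """
    match = _CJK_DATE.match(text)
    if match is None:
        raise ValueError(f"not a CJK year-month-day date: {text!r}")
    year, month, day = (int(group) for group in match.groups())
    # datetime.date validates the calendar date (e.g. it rejects month 13).
    return datetime.date(year, month, day).isoformat()


print(my_cjk_date_func("2024年3月31日"))  # 2024-03-31
```

Raising ValueError on unrecognized input is a design choice of this sketch; whether the registry expects an exception or some other failure signal is not stated here.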
IXDS (Inline XBRL Document Set)
Multiple iXBRL files from a single filing can be merged:
from xbrl_core import parse_ixbrl_facts, merge_ixbrl_results
results = [parse_ixbrl_facts(f) for f in ixbrl_files]
merged = merge_ixbrl_results(results)
Strict / Lenient Mode
Both parsers accept a strict parameter. When strict=True (default), spec violations raise XbrlParseError. When strict=False, violations emit warnings and are recorded in ignored_elements.
parsed = parse_xbrl_facts(xbrl_bytes, strict=False)
for elem in parsed.ignored_elements:
    print(elem.reason, elem.source_line)
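The strict/lenient split is a common parser pattern. The standalone sketch below illustrates the idea with the standard library only. SketchParseError, SketchWarning, IgnoredRecord, and report_violation are hypothetical names, not part of xbrl-core, and the library's internal implementation may differ.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


class SketchParseError(Exception):
    """Stand-in for a parse error raised in strict mode."""


class SketchWarning(UserWarning):
    """Stand-in for a warning emitted in lenient mode."""


@dataclass(frozen=True)
class IgnoredRecord:
    """Stand-in for one entry of an ignored-elements list."""
    reason: str
    source_line: Optional[int]


def report_violation(reason, source_line, *, strict, ignored):
    """Raise in strict mode; otherwise warn and record the violation."""
    if strict:
        raise SketchParseError(f"line {source_line}: {reason}")
    warnings.warn(f"line {source_line}: {reason}", SketchWarning, stacklevel=2)
    ignored.append(IgnoredRecord(reason, source_line))


ignored = []
with warnings.catch_warnings():
    warnings.simplefilter("ignore", SketchWarning)  # keep the demo output quiet
    report_violation("missing contextRef", 12, strict=False, ignored=ignored)
print(ignored[0].reason, ignored[0].source_line)
```

In lenient mode the caller keeps going and can inspect the recorded violations afterwards, which mirrors how ignored_elements is read in the example above.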
Context Structuring
structure_contexts() converts raw context XML fragments into typed StructuredContext objects with period, entity, and dimension information.
from xbrl_core import structure_contexts, ContextCollection
ctx_map = structure_contexts(parsed.contexts)
# Direct dict access
ctx = ctx_map["CurrentYearInstant"]
print(ctx.period) # InstantPeriod(instant=datetime.date(2024, 3, 31))
print(ctx.entity_id) # "E00001"
print(ctx.dimensions) # tuple[DimensionMember, ...]
# ContextCollection for filtering
coll = ContextCollection(ctx_map)
coll.filter_instant() # instant contexts only
coll.filter_duration() # duration contexts only
coll.filter_no_dimensions() # no dimension members
coll.filter_by_dimension(axis="{ns}ProductAxis", member="{ns}SegmentA")
coll.latest_instant_period # most recent InstantPeriod
coll.unique_duration_periods # unique DurationPeriods, sorted
Unit Structuring
structure_units() converts raw unit XML fragments into typed StructuredUnit objects.
from xbrl_core import structure_units
unit_map = structure_units(parsed.units)
unit = unit_map["JPY"]
print(unit.is_monetary) # True
print(unit.currency_code) # "JPY"
unit = unit_map["pure"]
print(unit.is_pure) # True
unit = unit_map["JPYPerShare"]
print(unit.is_per_share) # True
Building LineItems
build_line_items() merges RawFact + StructuredContext + optional LabelResolver into fully typed LineItem objects.
from xbrl_core import build_line_items
items = build_line_items(parsed.facts, ctx_map, langs=("en", "ja"))
for item in items:
    print(item.local_name) # "NetSales"
    print(item.value) # Decimal('1234567890')
    print(item.period) # InstantPeriod / DurationPeriod
    print(item.entity_id) # "E00001"
    print(item.dimensions) # tuple[DimensionMember, ...]
    print(item.label("en")) # "Net sales"
    print(item.label("ja")) # "売上高"
Linkbase Parsing
Presentation Linkbase
from xbrl_core import parse_presentation_linkbase, merge_presentation_trees
trees = parse_presentation_linkbase(pre_xml_bytes)
for role_uri, tree in trees.items():
    # Flatten the tree (depth-first)
    for node in tree.flatten(skip_abstract=True, skip_dimension=True):
        print(" " * node.depth + node.concept)
    # Get only the line-items subtree
    for node in tree.line_items_roots():
        print(node.concept, node.order)
# Merge multiple presentation linkbases
merged = merge_presentation_trees(trees_a, trees_b)
Calculation Linkbase
from xbrl_core import parse_calculation_linkbase
calc_lb = parse_calculation_linkbase(cal_xml_bytes)
for role_uri in calc_lb.role_uris:
    tree = calc_lb.get_tree(role_uri)
    for arc in tree.arcs:
        sign = "+" if arc.weight == 1 else "-"
        print(f" {arc.parent} {sign}-> {arc.child}")
# Query relationships
calc_lb.children_of("GrossProfit") # child arcs
calc_lb.parent_of("NetSales") # parent arcs
calc_lb.ancestors_of("NetSales", role_uri=role) # root-ward chain
Definition Linkbase
from xbrl_core import parse_definition_linkbase
def_trees = parse_definition_linkbase(def_xml_bytes)
for role_uri, tree in def_trees.items():
    for hc in tree.hypercubes:
        print(f"Table: {hc.table_concept}")
        for axis in hc.axes:
            print(f" Axis: {axis.axis_concept}")
            if axis.domain:
                print(f" Domain: {axis.domain.concept}")
Label Linkbase
from xbrl_core import parse_label_linkbase
labels = parse_label_linkbase(lab_xml_bytes)
for lab in labels:
    print(f"{lab.concept_name} [{lab.lang}] = {lab.text}")
Reference Linkbase
from xbrl_core import parse_reference_linkbase
refs = parse_reference_linkbase(ref_xml_bytes)
for ref in refs:
    print(f"{ref.concept_name}: {ref.role}")
    for part in ref.parts:
        print(f" {part.local_name} = {part.value}")
Footnotes
from xbrl_core import parse_footnote_links
footnote_map = parse_footnote_links(parsed.footnote_links)
notes = footnote_map.get("IdFact1234")
for n in notes:
    print(n.text, n.lang)
print(footnote_map.fact_ids) # Fact IDs with footnotes
print(len(footnote_map)) # number of Facts with footnotes
Schema Parsing
from xbrl_core import parse_xsd_elements
elements = parse_xsd_elements(xsd_bytes)
elem = elements["NetSales"]
print(elem.period_type) # "duration"
print(elem.balance) # "credit"
print(elem.abstract) # False
print(elem.type_name) # "xbrli:monetaryItemType"
print(elem.substitution_group) # "xbrli:item"
Calculation Validation
Validates summation-item relationships per XBRL 2.1 section 5.2.5.2, with decimals-based rounding tolerance.
from xbrl_core import validate_calculations, parse_calculation_linkbase
calc_lb = parse_calculation_linkbase(cal_xml_bytes)
result = validate_calculations(items, calc_lb)
print(result) # "Calculation validation: PASS (checked=42, passed=42, errors=0, skipped=3)"
print(result.is_valid) # True
for issue in result.issues:
    print(issue.parent_concept, issue.expected, issue.actual, issue.severity)
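The decimals-based tolerance can be illustrated without the library. The sketch below is one simplified reading of the summation-item check: each contributing value is rounded to its own reported decimals, the weighted sum is rounded to the parent's decimals, and the result is compared with the rounded parent value. The function names and the ROUND_HALF_EVEN rounding mode are assumptions made for illustration; this is not validate_calculations() itself.

```python
from decimal import ROUND_HALF_EVEN, Decimal


def round_to_decimals(value: Decimal, decimals: int) -> Decimal:
    """Round to `decimals` places; negative values round to tens, hundreds, ..."""
    quantum = Decimal(1).scaleb(-decimals)  # decimals=-3 -> Decimal('1E+3')
    return value.quantize(quantum, rounding=ROUND_HALF_EVEN)


def is_consistent(parent, children):
    """parent: (value, decimals); children: iterable of (value, decimals, weight)."""
    parent_value, parent_decimals = parent
    total = sum(
        (round_to_decimals(value, decimals) * weight
         for value, decimals, weight in children),
        Decimal(0),
    )
    return round_to_decimals(total, parent_decimals) == round_to_decimals(
        parent_value, parent_decimals
    )


# Hypothetical figures reported in thousands (decimals = -3).
gross_profit = (Decimal("634100"), -3)
contributors = [
    (Decimal("1234400"), -3, Decimal(1)),   # NetSales, weight +1
    (Decimal("600300"), -3, Decimal(-1)),   # CostOfSales, weight -1
]
print(is_consistent(gross_profit, contributors))  # True
```

The raw figures differ by 0 only after rounding (1,234,400 - 600,300 = 634,100 exactly here, and both sides round to 634,000), which is the point of comparing at the reported precision rather than demanding exact equality of unrounded values.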
Text Block Extraction
Extracts textBlockItemType facts (e.g. MD&A, risk factors, notes) from filings.
from xbrl_core import extract_text_blocks, clean_html
blocks = extract_text_blocks(parsed.facts, ctx_map)
for block in blocks:
    print(block.concept) # "BusinessRisksTextBlock"
    print(block.period) # DurationPeriod(...)
    plain = clean_html(block.html)
    print(plain[:200])
clean_html() converts HTML fragments to plain text, preserving table structure with tabs and newlines — useful as preprocessing for LLM / RAG pipelines.
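To make the "tabs and newlines" behaviour concrete, here is a rough standard-library sketch of the same idea: table cells end with a tab, rows and paragraphs end with a newline. html_to_text and _TextExtractor are hypothetical names, and the real clean_html() may handle many more cases (nested tables, whitespace normalization, and so on).

```python
from html.parser import HTMLParser

_CELL_TAGS = {"td", "th"}
_BLOCK_TAGS = {"tr", "p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text, emitting separators when cells and blocks close."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in _CELL_TAGS:
            self.parts.append("\t")  # cell boundary
        elif tag in _BLOCK_TAGS:
            self.parts.append("\n")  # row / paragraph boundary

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Convert an HTML fragment to text with tab-separated table cells."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    # Drop the trailing tab left by the last cell of each row, then blank lines.
    lines = (line.rstrip(" \t") for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line.strip())


sample = (
    "<p>Risk &amp; outlook</p>"
    "<table><tr><th>Item</th><th>FY2024</th></tr>"
    "<tr><td>Sales</td><td>100</td></tr></table>"
)
print(html_to_text(sample))
```

Keeping one table row per line with tab-separated cells lets downstream text splitters treat a row as a unit instead of scattering its cells.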
DataFrame Conversion
Requires pip install 'xbrl-core[analysis]'.
from xbrl_core import line_items_to_dataframe, to_csv, to_parquet
df = line_items_to_dataframe(items, label_lang="en")
print(df[["local_name", "label", "value", "period_end"]].head())
# Export
to_csv(df, "output.csv")
to_parquet(df, "output.parquet")
Excel export requires pip install 'xbrl-core[excel]':
from xbrl_core import to_excel
to_excel(df, "output.xlsx", sheet_name="BalanceSheet")
Display
Rich (Terminal)
Requires pip install 'xbrl-core[display]'.
from rich.console import Console
from xbrl_core import render_statement
table = render_statement(items, title="Balance Sheet", label_lang="en")
Console().print(table)
Hierarchical Display
Use DisplayHint with presentation tree data for indented financial statements:
from xbrl_core import (
    build_display_rows,
    render_hierarchical_statement,
    DisplayHint,
)
hints = [
    DisplayHint(concept="AssetsAbstract", depth=0, is_abstract=True, label="Assets"),
    DisplayHint(concept="CashAndDeposits", depth=1),
    DisplayHint(concept="TotalAssets", depth=0, is_total=True),
]
# Rich Table
table = render_hierarchical_statement(items, hints=hints, title="BS")
# Or get raw DisplayRow objects
rows = build_display_rows(items, hints=hints)
HTML (Jupyter)
from xbrl_core import to_html
html = to_html(items, hints=hints, title="Balance Sheet")
Label Resolution
LabelResolver is a Protocol — implement it to inject taxonomy labels into build_line_items().
from xbrl_core import LabelResolver, LabelInfo, LabelSource
class MyResolver:
    def resolve(self, concept_qname, lang, role):
        # Look up label from your taxonomy data
        return LabelInfo(text="Net sales", role=role, lang=lang, source=LabelSource.STANDARD)

    def resolve_batch(self, concept_qnames, lang, role):
        return {qn: self.resolve(qn, lang, role) for qn in concept_qnames}
items = build_line_items(parsed.facts, ctx_map, resolver=MyResolver(), langs=("en",))
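A resolver that always returns the same text is only a placeholder. The sketch below shows a dictionary-backed variant of the same two-method shape. To stay runnable on its own, it defines stand-in LabelInfo and LabelSource types that mirror only the fields used in the example above; the enum value, and returning None for a missing label, are assumptions rather than documented behaviour.

```python
from dataclasses import dataclass
from enum import Enum


class LabelSource(Enum):  # stand-in; only STANDARD appears in the README
    STANDARD = "standard"  # the value itself is an assumption


@dataclass(frozen=True)
class LabelInfo:  # stand-in with the four fields used in the README example
    text: str
    role: str
    lang: str
    source: LabelSource


class DictResolver:
    """Resolves labels from a {(concept_qname, lang): text} table."""

    def __init__(self, table):
        self._table = dict(table)

    def resolve(self, concept_qname, lang, role):
        text = self._table.get((concept_qname, lang))
        if text is None:
            return None  # assumption: "no label" is signalled with None
        return LabelInfo(text=text, role=role, lang=lang, source=LabelSource.STANDARD)

    def resolve_batch(self, concept_qnames, lang, role):
        return {qn: self.resolve(qn, lang, role) for qn in concept_qnames}


STANDARD_ROLE = "http://www.xbrl.org/2003/role/label"
resolver = DictResolver({("NetSales", "en"): "Net sales", ("NetSales", "ja"): "売上高"})
print(resolver.resolve("NetSales", "ja", STANDARD_ROLE).text)  # 売上高
```

With the real library, the class would import LabelInfo and LabelSource from xbrl_core instead of defining stand-ins.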
Error Handling
All errors inherit from XbrlError and carry a structured error code and context.
from xbrl_core import XbrlError, XbrlParseError, XbrlValidationError
try:
    parsed = parse_xbrl_facts(bad_bytes)
except XbrlParseError as e:
    print(e.code) # "XBRL_PARSE_001"
    print(e.context) # {"source_path": "..."}
| Error code prefix | Exception class | Description |
|---|---|---|
| XBRL_PARSE_xxx | XbrlParseError | XML/XBRL parse errors |
| XBRL_CTX_xxx | XbrlParseError | Context structuring errors |
| XBRL_UNIT_xxx | XbrlParseError | Unit structuring errors |
| XBRL_LINK_xxx | XbrlParseError | Linkbase parse errors |
| XBRL_IXBRL_xxx | XbrlParseError | iXBRL parse errors |
| XBRL_VAL_xxx | XbrlValidationError | Validation errors |
XbrlWarning (a UserWarning subclass) is emitted for non-fatal issues.
Customizing Error / Warning Classes
All linkbase parsers and structure_units() accept optional error_class and warning_class parameters. This allows downstream libraries to substitute their own exception and warning types — useful when wrapping xbrl-core in a domain-specific package (e.g. an EDINET library).
from xbrl_core import XbrlParseError, XbrlWarning, parse_calculation_linkbase
class EdinetParseError(XbrlParseError):
    """EDINET-specific parse error."""

class EdinetWarning(UserWarning):
    """EDINET-specific warning."""

lb = parse_calculation_linkbase(
    xml_bytes,
    error_class=EdinetParseError,
    warning_class=EdinetWarning,
)
Supported by: parse_calculation_linkbase, parse_definition_linkbase, parse_presentation_linkbase, parse_label_linkbase, parse_reference_linkbase (error_class only), parse_footnote_links, structure_units.
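The injection pattern itself is easy to see in isolation. The toy parser below accepts error_class and warning_class keyword arguments in the same spirit, and a downstream wrapper binds its own types once with functools.partial. Every name here (BaseParseError, BaseWarning, parse_weight, DomainParseError, DomainWarning) is hypothetical and independent of xbrl-core.

```python
import warnings
from functools import partial


class BaseParseError(Exception):
    """Stand-in for a library-level parse error."""


class BaseWarning(UserWarning):
    """Stand-in for a library-level warning."""


def parse_weight(raw, *, error_class=BaseParseError, warning_class=BaseWarning):
    """Toy parser: the injected classes decide what gets raised or warned."""
    try:
        weight = float(raw)
    except ValueError as exc:
        raise error_class(f"invalid weight: {raw!r}") from exc
    if weight == 0:
        warnings.warn("weight is zero", warning_class, stacklevel=2)
    return weight


class DomainParseError(BaseParseError):
    """Downstream error type; still caught by `except BaseParseError`."""


class DomainWarning(UserWarning):
    """Downstream warning type."""


# Bind the domain-specific classes once and reuse the wrapped parser.
domain_parse_weight = partial(
    parse_weight, error_class=DomainParseError, warning_class=DomainWarning
)

try:
    domain_parse_weight("abc")
except DomainParseError as exc:
    print(type(exc).__name__, exc)
```

Subclassing the base error keeps existing `except` clauses for the base type working, while callers of the wrapping package can catch the narrower domain type.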
Customizing Concept Extraction
Linkbase parsers extract concept local names from xlink:href fragments. The default logic handles standard XBRL taxonomy patterns ({prefix}_{YYYY-MM-DD}.xsd#prefix_ConceptName), but jurisdiction-specific taxonomies may use different naming conventions.
All linkbase parsers accept a concept_extractor parameter (Callable[[str], str | None]) to override this logic:
import re
from xbrl_core import ConceptExtractor, parse_label_linkbase
def edinet_concept_extractor(href: str) -> str | None:
    """EDINET Strategy 2: extract local name by backward _[A-Z] scan."""
    if "#" not in href:
        return None
    fragment = href.rsplit("#", 1)[1]
    m = re.search(r"_([A-Z][A-Za-z0-9]*)$", fragment)
    return m.group(1) if m else fragment
labels = parse_label_linkbase(xml_bytes, concept_extractor=edinet_concept_extractor)
Supported by: parse_calculation_linkbase, parse_definition_linkbase, parse_presentation_linkbase, parse_label_linkbase, parse_reference_linkbase.
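For comparison, here is a sketch of an extractor for the standard pattern quoted above ({prefix}_{YYYY-MM-DD}.xsd#prefix_ConceptName). It is one plausible reading of that pattern, not the library's actual default implementation; standard_concept_extractor and the sample href are made up for illustration.

```python
import re
from typing import Optional

# Captures the prefix from a schema file name like "acme_2024-03-31.xsd".
_XSD_NAME = re.compile(r"(?P<prefix>[^/]+)_\d{4}-\d{2}-\d{2}\.xsd$")


def standard_concept_extractor(href: str) -> Optional[str]:
    """Strip the "{prefix}_" part of the fragment when the file name matches."""
    path, sep, fragment = href.partition("#")
    if not sep or not fragment:
        return None  # no fragment, so no concept to extract
    match = _XSD_NAME.search(path)
    if match and fragment.startswith(match.group("prefix") + "_"):
        return fragment[len(match.group("prefix")) + 1:]
    return fragment  # unknown naming convention: keep the fragment as-is


href = "http://example.com/taxonomy/acme_2024-03-31.xsd#acme_NetSales"
print(standard_concept_extractor(href))  # NetSales
```

Deriving the prefix from the file name, rather than scanning the fragment for an uppercase letter, keeps concept names that themselves contain underscores intact.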
Requirements
Python 3.12+. The only required dependency is lxml >= 5.0.