Parser for XML generated by Axiell EMu

Emu XML Parser

  • Purpose: Parse XML files produced by Axiell EMu into Python-native records (lists of dicts) using schema information embedded in the XML processing instruction. The parser preserves nested tables/tuples and fills missing fields with sensible defaults.

Quick Install

  • Using pip, from the root of a local checkout (editable install for development):
# create and activate a venv (recommended)
python3 -m venv .venv
source .venv/bin/activate

# install the package in editable mode for development
pip install -e .

# install test runner
pip install pytest
  • Other managers: conda or poetry also work. Create and activate an environment, then run the same editable install (pip install -e .) inside it.

Basic Usage (Python)

  • Import and parse an EMu XML file. The public function is parse, exposed at the package root.
from emu_xml_parser import parse

rows = parse("/path/to/emu_export.xml")
  • If you want date fields parsed into Python date objects:
rows = parse("/path/to/emu_export.xml", parse_dates=True)
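
As a rough illustration of what parse_dates=True implies, the sketch below turns ISO-formatted date strings into datetime.date objects. The helper name to_date and its fallback behavior are assumptions made for this example, not the package's actual converter.

```python
from datetime import date
from typing import Optional, Union


def to_date(value: Optional[str]) -> Union[date, str, None]:
    """Hypothetical helper: convert 'YYYY-MM-DD' to a date, leave anything else untouched."""
    if not value:
        # Empty strings and None are passed through unchanged.
        return value
    try:
        return date.fromisoformat(value)
    except ValueError:
        # Free-text or partial dates are returned as the original string.
        return value


record = {"date_emu_record_modified": "2023-05-18", "department": "Ornithology"}
record["date_emu_record_modified"] = to_date(record["date_emu_record_modified"])
print(record["date_emu_record_modified"].year)  # 2023
```

Returning the original string on failure is one possible design choice; a stricter converter could raise instead.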

Single-Column Tables

EMu tables defined with only one field are automatically flattened to lists of strings instead of lists of dicts. This makes the data easier to work with.

Example Schema:

<?schema
table ecatalogue
	table common_name
		text short ComName
	end
	table element
		text long IPAnatomy
	end
end
?>

XML Data:

<tuple>
	<table name="common_name">
		<tuple>
			<atom name="ComName">Indian Bush Lark</atom>
		</tuple>
		<tuple>
			<atom name="ComName">Rufous-tailed Lark</atom>
		</tuple>
	</table>
	<table name="element">
		<tuple>
			<atom name="IPAnatomy">shell(s)</atom>
		</tuple>
	</table>
</tuple>

Python Output:

{
	"common_name": ["Indian Bush Lark", "Rufous-tailed Lark"],  # List of strings
	"element": ["shell(s)"]  # Not [{"IPAnatomy": "shell(s)"}]
}

Contrast with Multi-Column Tables:

Multi-field tables remain as lists of dicts:

<?schema
table ecatalogue
	table SitSiteRef_tab
		text long locality
		integer locality_irn
	end
end
?>

Python Output:

{
	"SitSiteRef_tab": [
		{"locality": "San Pedro", "locality_irn": 368989},
		{"locality": "Los Angeles", "locality_irn": 363879}
	]
}

Why This Matters:

  • Simplicity: Access values with row["common_name"][0] instead of row["common_name"][0]["ComName"]
  • Common pattern: Many EMu exports have single-field reference tables (taxonomy names, elements, etc.)
  • Backwards compatible: Multi-field tables work as expected
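
The flattening rule described in this section can be sketched with the standard library alone. The function name table_to_python and the way field names are passed in are assumptions for illustration; the package derives field names from the embedded schema, and its internals may differ.

```python
import xml.etree.ElementTree as ET

XML = """
<table name="common_name">
    <tuple><atom name="ComName">Indian Bush Lark</atom></tuple>
    <tuple><atom name="ComName">Rufous-tailed Lark</atom></tuple>
</table>
"""


def table_to_python(table, field_names):
    """Hypothetical helper: convert a <table> element to a list of rows.

    A table with one schema field is flattened to a list of strings;
    otherwise each row is a dict keyed by field name.
    """
    rows = []
    for tup in table.findall("tuple"):
        atoms = {a.get("name"): (a.text or "") for a in tup.findall("atom")}
        rows.append({name: atoms.get(name, "") for name in field_names})
    if len(field_names) == 1:
        only = field_names[0]
        return [row[only] for row in rows]
    return rows


table = ET.fromstring(XML.strip())
print(table_to_python(table, ["ComName"]))
# ['Indian Bush Lark', 'Rufous-tailed Lark']
```

This sketch keeps every value as a string; type conversion is left out to keep the flattening step visible.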

Minimal XML Example

Input (EMu XML contains a <?schema ... ?> processing instruction):

<?xml version="1.0"?>
<?schema
table ecatalogue
	date date_emu_record_modified
	date date_emu_record_inserted
	integer irn
	text short emu_guid
	text short department
	text short catalogue_number
	table SitSiteRef_tab
		text long locality
		integer locality_irn
	end
	tuple SpeTaxonRef
		text short taxon_irn
	end
	table common_name
		text short ComName
	end
end
?>
<root>
	<tuple>
		<atom name="date_emu_record_modified">2023-05-18</atom>
		<atom name="date_emu_record_inserted">2012-10-30</atom>
		<atom name="irn">368521</atom>
		<atom name="emu_guid">8767ccff-...</atom>
		<atom name="department">Ornithology</atom>
		<atom name="catalogue_number">89334</atom>
		<tuple name="SpeTaxonRef">
			<atom name="taxon_irn">24960</atom>
		</tuple>
		<table name="common_name">
			<tuple>
				<atom name="ComName">Indian Bush Lark</atom>
			</tuple>
		</table>
	</tuple>
</root>
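
One simple way to pull the schema text out of such a file is a regular expression over the raw document, as sketched below. This is an assumption-laden illustration (the name extract_schema_text is hypothetical); the package's extractor module may use a different approach.

```python
import re
from typing import Optional

# Matches the body of a <?schema ... ?> processing instruction, across lines.
SCHEMA_PI = re.compile(r"<\?schema\s(.*?)\?>", re.DOTALL)


def extract_schema_text(xml_text: str) -> Optional[str]:
    """Hypothetical helper: return the schema text, or None if no schema PI is present."""
    match = SCHEMA_PI.search(xml_text)
    return match.group(1).strip() if match else None


sample = """<?xml version="1.0"?>
<?schema
table ecatalogue
    integer irn
end
?>
<root></root>
"""

print(extract_schema_text(sample).splitlines()[0])  # table ecatalogue
```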

Expected Python output (approx):

[
	{
		"date_emu_record_modified": "2023-05-18",
		"date_emu_record_inserted": "2012-10-30",
		"irn": 368521,
		"emu_guid": "8767ccff-...",
		"department": "Ornithology",
		"catalogue_number": "89334",
		"SpeTaxonRef": [{"taxon_irn": "24960"}],
		"SitSiteRef_tab": [
			{
				"locality": None,
				"locality_irn": None
			}
		],
		"common_name": ["Indian Bush Lark"]
	}
]
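
The nested shape of this output comes from walking <tuple>, <table>, and <atom> elements recursively. The sketch below shows that idea with xml.etree.ElementTree; tuple_to_dict is a hypothetical name, and the sketch skips schema-driven typing, single-field flattening, and default filling, so all values stay strings.

```python
import xml.etree.ElementTree as ET


def tuple_to_dict(tuple_elem):
    """Hypothetical helper: convert one <tuple> element into a dict, recursing as needed."""
    record = {}
    for child in tuple_elem:
        name = child.get("name")
        if child.tag == "atom":
            record[name] = child.text or ""
        elif child.tag == "tuple":
            # The example output above shows a named tuple as a one-element list.
            record[name] = [tuple_to_dict(child)]
        elif child.tag == "table":
            record[name] = [tuple_to_dict(row) for row in child.findall("tuple")]
    return record


XML = """
<root>
    <tuple>
        <atom name="irn">368521</atom>
        <tuple name="SpeTaxonRef">
            <atom name="taxon_irn">24960</atom>
        </tuple>
        <table name="common_name">
            <tuple>
                <atom name="ComName">Indian Bush Lark</atom>
            </tuple>
        </table>
    </tuple>
</root>
"""

root = ET.fromstring(XML.strip())
records = [tuple_to_dict(t) for t in root.findall("tuple")]
print(records[0]["common_name"])  # [{'ComName': 'Indian Bush Lark'}]
```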

Notes:

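  • Atom values are typed from the schema: in the examples above, text fields come back as strings and integer fields as ints. Date fields stay strings unless parse_dates=True, which converts them to Python date objects.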
  • Multi-field tables (tables/tuples with multiple field definitions) are represented as lists of dicts. Single-field tables become lists of strings.
  • Missing fields are filled with empty strings or empty lists per the schema.
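
The default-filling rule in the last note can be sketched as a walk over a schema structure. The nested-dict schema layout and the name fill_missing are assumptions for this example, and the exact defaults the package applies may differ (the example output above shows None placeholders for a missing table).

```python
def fill_missing(record, schema):
    """Hypothetical helper: return a copy of record with every schema field present.

    schema maps a field name to a type string (scalar field) or to a nested
    dict (table of rows with those fields).
    """
    filled = {}
    for name, spec in schema.items():
        if isinstance(spec, dict):
            # Missing tables default to an empty list; present rows are filled recursively.
            filled[name] = [fill_missing(row, spec) for row in record.get(name, [])]
        else:
            # Missing scalar fields default to an empty string.
            filled[name] = record.get(name, "")
    return filled


schema = {
    "irn": "integer",
    "department": "text short",
    "SitSiteRef_tab": {"locality": "text long", "locality_irn": "integer"},
}

print(fill_missing({"irn": "368521"}, schema))
# {'irn': '368521', 'department': '', 'SitSiteRef_tab': []}
```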

Testing

  • Run the test suite (after installing dev/test deps):
pytest -q

If you used a virtual environment, ensure it's activated before running pytest.

Working with Real / Large Fixtures

  • Keep small, anonymized fixtures under tests/fixtures and reference them in tests.
  • For large or private datasets, do not commit originals; point tests to a folder via TEST_EMU_XML_DIR and skip if unset.
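
A minimal sketch of the skip-if-unset pattern follows. Only the TEST_EMU_XML_DIR variable comes from this README; the helper name large_fixture_dir and the test body are placeholders to adapt.

```python
import os
from pathlib import Path
from typing import Optional


def large_fixture_dir(env_var: str = "TEST_EMU_XML_DIR") -> Optional[Path]:
    """Hypothetical helper: return the private fixture folder, or None if unavailable."""
    value = os.environ.get(env_var)
    if not value:
        return None
    path = Path(value)
    return path if path.is_dir() else None


def test_large_exports_are_readable():
    # pytest is imported here so the helper above stays usable without it.
    import pytest

    directory = large_fixture_dir()
    if directory is None:
        pytest.skip("TEST_EMU_XML_DIR is not set or is not a directory")
    for xml_file in sorted(directory.glob("*.xml")):
        # Placeholder check; replace with assertions on the parsed records.
        assert xml_file.stat().st_size > 0
```

Running the suite with the variable set, for example TEST_EMU_XML_DIR=/path/to/exports pytest -q, enables the test; without it the test is reported as skipped.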

Extending / Customizing

  • Conversion helpers live in emu_xml_parser.converter (e.g. date parsing/serialization) and validation/enforcement lives in emu_xml_parser.validator.
  • If you need different conversion rules, you can adapt convert_value or wrap the parser in a small class that injects custom converters.
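
The wrapper idea could look roughly like the sketch below. The class name ConvertingParser and its constructor arguments are hypothetical; the sketch takes the parse function as a parameter and post-processes top-level fields only, rather than hooking into convert_value, whose signature is not documented here.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class ConvertingParser:
    """Hypothetical wrapper: run a parse function, then apply per-field converters."""

    def __init__(
        self,
        parse_func: Callable[..., List[Record]],
        converters: Dict[str, Callable[[Any], Any]],
    ) -> None:
        self._parse = parse_func
        self._converters = converters

    def parse(self, path: str, **kwargs: Any) -> List[Record]:
        records = self._parse(path, **kwargs)
        return [self._convert(record) for record in records]

    def _convert(self, record: Record) -> Record:
        # Only top-level fields are converted; nested tables are left untouched.
        return {
            key: self._converters[key](value) if key in self._converters else value
            for key, value in record.items()
        }


def fake_parse(path: str, **kwargs: Any) -> List[Record]:
    # Stand-in for the real parser so the example runs on its own.
    return [{"catalogue_number": "89334", "department": "Ornithology"}]


parser = ConvertingParser(fake_parse, {"catalogue_number": int})
print(parser.parse("ignored.xml"))
# [{'catalogue_number': 89334, 'department': 'Ornithology'}]
```

With the package installed, fake_parse would be replaced by the parse function imported from emu_xml_parser.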

Files of Interest

  • src/emu_xml_parser/core.py: entry point parse() for the package
  • src/emu_xml_parser/extractor.py: reads the <?schema ... ?> processing instruction
  • src/emu_xml_parser/schema.py: schema text → structured schema
  • src/emu_xml_parser/tuple_parser.py: recursive XML → dict conversion
  • src/emu_xml_parser/converter.py: value conversion utilities
  • src/emu_xml_parser/validator.py: schema enforcement and normalization
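
To make the "schema text → structured schema" step concrete, here is a small stack-based sketch. The output layout (dicts with "kind" and "fields" keys) and the name parse_schema are assumptions for illustration, not the structure schema.py actually produces.

```python
def parse_schema(schema_text):
    """Hypothetical helper: turn EMu schema text into nested dicts.

    'table NAME' and 'tuple NAME' open a nested block, 'end' closes it, and
    any other line is read as '<type words> <field name>'.
    """
    root = {}
    stack = [root]
    for raw_line in schema_text.splitlines():
        parts = raw_line.split()
        if not parts:
            continue  # skip blank lines
        keyword = parts[0]
        if keyword == "end":
            if len(stack) == 1:
                raise ValueError("unbalanced 'end' in schema")
            stack.pop()
        elif keyword in ("table", "tuple"):
            block = {"kind": keyword, "fields": {}}
            stack[-1][parts[1]] = block
            stack.append(block["fields"])
        else:
            # e.g. "text short ComName": the name is the last token.
            stack[-1][parts[-1]] = " ".join(parts[:-1])
    if len(stack) != 1:
        raise ValueError("schema is missing an 'end'")
    return root


schema_text = """
table ecatalogue
    integer irn
    table common_name
        text short ComName
    end
end
"""

schema = parse_schema(schema_text)
print(schema["ecatalogue"]["fields"]["common_name"]["fields"])
# {'ComName': 'text short'}
```

Because indentation is ignored and nesting is driven only by table, tuple, and end, a misplaced end changes which block a field belongs to.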

License & Contributing

  • Add your preferred license and contribution guidelines to the repository root.
