
Parser for XML generated by Axiell EMu


Emu XML Parser

  • Purpose: Parse XML files produced by Axiell EMu into Python-native records (lists of dicts) using schema information embedded in the XML processing instruction. The parser preserves nested tables/tuples and fills missing fields with sensible defaults.

Quick Install

  • Using pip from a clone of the repository (editable install, suited to development):
# create and activate a venv (recommended)
python3 -m venv .venv
source .venv/bin/activate

# install the package in editable mode for development
pip install -e .

# install test runner
pip install pytest
  • Other managers: conda or poetry also work. Create and activate an environment, then run pip install -e . from the repository root.

Basic Usage (Python)

  • Import and parse an EMu XML file. The public function, parse, is exposed at the package root.
from emu_xml_parser import parse

rows = parse("/path/to/emu_export.xml")
  • If you want date fields parsed into Python date objects:
rows = parse("/path/to/emu_export.xml", parse_dates=True)

Single-Column Tables

EMu tables defined with only one field are automatically flattened to lists of strings instead of lists of dicts. This makes the data easier to work with.

Example Schema:

<?schema
table ecatalogue
	table common_name
		text short ComName
	end
	table element
		text long IPAnatomy
	end
end
?>

XML Data:

<tuple>
	<table name="common_name">
		<tuple>
			<atom name="ComName">Indian Bush Lark</atom>
		</tuple>
		<tuple>
			<atom name="ComName">Rufous-tailed Lark</atom>
		</tuple>
	</table>
	<table name="element">
		<tuple>
			<atom name="IPAnatomy">shell(s)</atom>
		</tuple>
	</table>
</tuple>

Python Output:

{
	"common_name": ["Indian Bush Lark", "Rufous-tailed Lark"],  # List of strings
	"element": ["shell(s)"]  # Not [{"IPAnatomy": "shell(s)"}]
}

Contrast with Multi-Column Tables:

Multi-field tables remain as lists of dicts:

<?schema
table ecatalogue
	table SitSiteRef_tab
		text long locality
		integer locality_irn
	end
end
?>

Python Output:

{
	"SitSiteRef_tab": [
		{"locality": "San Pedro", "locality_irn": 368989},
		{"locality": "Los Angeles", "locality_irn": 363879}
	]
}

Why This Matters:

  • Simplicity: Access values with row["common_name"][0] instead of row["common_name"][0]["ComName"]
  • Common pattern: Many EMu exports have single-field reference tables (taxonomy names, elements, etc.)
  • Backwards compatible: Multi-field tables work as expected
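
The flattening rule described above can be sketched in a few lines. This is an illustration of the idea, not the package's own code; flatten_single_field_table and its two arguments are hypothetical names chosen for this example.

```python
def flatten_single_field_table(rows, field_names):
    """Collapse a table to a list of values when the schema declares one field.

    rows        -- list of dicts, one per <tuple> inside the table
    field_names -- field names declared for that table in the schema
    """
    if len(field_names) != 1:
        # Multi-field tables keep their list-of-dicts shape.
        return rows
    (only_field,) = field_names
    return [row.get(only_field) for row in rows]


common_names = flatten_single_field_table(
    [{"ComName": "Indian Bush Lark"}, {"ComName": "Rufous-tailed Lark"}],
    ["ComName"],
)
sites = flatten_single_field_table(
    [{"locality": "San Pedro", "locality_irn": 368989}],
    ["locality", "locality_irn"],
)
print(common_names)
print(sites)
```

The decision depends on the number of fields declared in the schema, not on how many fields happen to be present in a given record, so a table keeps the same shape across all records.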

Minimal XML Example Input (EMu XML contains a <?schema ... ?> processing instruction):

<?xml version="1.0"?>
<?schema
table ecatalogue
	date date_emu_record_modified
	date date_emu_record_inserted
	integer irn
	text short emu_guid
	text short department
	text short catalogue_number
	table SitSiteRef_tab
		text long locality
		integer locality_irn
	end
	tuple SpeTaxonRef
		text short taxon_irn
	end
	table common_name
		text short ComName
	end
end
?>
<root>
	<tuple>
		<atom name="date_emu_record_modified">2023-05-18</atom>
		<atom name="date_emu_record_inserted">2012-10-30</atom>
		<atom name="irn">368521</atom>
		<atom name="emu_guid">8767ccff-...</atom>
		<atom name="department">Ornithology</atom>
		<atom name="catalogue_number">89334</atom>
		<tuple name="SpeTaxonRef">
			<atom name="taxon_irn">24960</atom>
		</tuple>
		<table name="common_name">
			<tuple>
				<atom name="ComName">Indian Bush Lark</atom>
			</tuple>
		</table>
	</tuple>
</root>

Expected Python output (approx):

[
	{
		"date_emu_record_modified": "2023-05-18",
		"date_emu_record_inserted": "2012-10-30",
		"irn": 368521,
		"emu_guid": "8767ccff-...",
		"department": "Ornithology",
		"catalogue_number": "89334",
		"SpeTaxonRef": [{"taxon_irn": "24960"}],
		"SitSiteRef_tab": [
			{
				"locality": None,
				"locality_irn": None
			}
		],
		"common_name": ["Indian Bush Lark"]
	}
]
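
As a rough illustration of the recursive step behind this mapping, the sketch below walks atom, tuple, and table elements with the standard library. It is not the package's tuple parser: element_to_value is a hypothetical name, and the sketch deliberately skips everything the schema contributes (type conversion, defaults for missing fields, single-field flattening), so its result is a raw, all-strings structure rather than the output shown above.

```python
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Convert <atom>, <tuple>, and <table> elements to str, dict, and list."""
    if elem.tag == "atom":
        return elem.text or ""
    if elem.tag == "tuple":
        # Children of a tuple are keyed by their name attribute.
        return {child.get("name"): element_to_value(child) for child in elem}
    if elem.tag == "table":
        # Children of a table are unnamed tuples, kept in document order.
        return [element_to_value(child) for child in elem]
    raise ValueError(f"unexpected element <{elem.tag}>")


xml_text = """<root>
  <tuple>
    <atom name="irn">368521</atom>
    <tuple name="SpeTaxonRef">
      <atom name="taxon_irn">24960</atom>
    </tuple>
    <table name="common_name">
      <tuple>
        <atom name="ComName">Indian Bush Lark</atom>
      </tuple>
    </table>
  </tuple>
</root>"""

raw_records = [
    element_to_value(record) for record in ET.fromstring(xml_text).findall("tuple")
]
print(raw_records)
```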

Notes:

  • Atom fields become strings unless the schema declares another type; integer fields, for example, appear as int values in the output above. When parse_dates=True, date fields are converted to Python date objects.
  • Multi-field tables (tables/tuples with multiple field definitions) are represented as lists of dicts. Single-field tables become lists of strings.
  • Missing fields are filled with empty strings or empty lists per the schema.
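
The type handling in these notes can be pictured with a small, self-contained sketch. It is an assumption about the general approach, not the package's convert_value: convert_atom is a hypothetical function, it only handles complete ISO dates (YYYY-MM-DD), and it passes empty values through untouched instead of choosing a default.

```python
from datetime import date


def convert_atom(text, field_type, parse_dates=False):
    """Convert one atom's text according to its schema type (illustrative only)."""
    if not text:
        # Leave empty or missing values alone; defaults are a separate concern.
        return text
    if field_type == "integer":
        return int(text)
    if field_type == "date" and parse_dates:
        # Assumes a full ISO date; partial dates are not handled in this sketch.
        return date.fromisoformat(text)
    # "text short", "text long", and unparsed dates stay as strings.
    return text


print(convert_atom("368521", "integer"))
print(convert_atom("2023-05-18", "date"))
print(convert_atom("2023-05-18", "date", parse_dates=True))
```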

Testing

  • Run the test suite (after installing dev/test deps):
pytest -q

If you used a virtual environment, ensure it's activated before running pytest.

Working with Real / Large Fixtures

  • Keep small, anonymized fixtures under tests/fixtures and reference them in tests.
  • For large or private datasets, do not commit originals; point tests to a folder via TEST_EMU_XML_DIR and skip if unset.
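
One way to wire up the TEST_EMU_XML_DIR convention is a small helper that resolves the directory and reports when it is unavailable. This is a sketch under assumptions: large_fixture_files is a hypothetical helper, and the *.xml pattern is a guess at how the fixtures are named.

```python
import os
from pathlib import Path


def large_fixture_files(environ=None):
    """Return the XML files under TEST_EMU_XML_DIR, or None if it is unset or missing."""
    environ = os.environ if environ is None else environ
    root = environ.get("TEST_EMU_XML_DIR")
    if not root:
        return None
    directory = Path(root)
    if not directory.is_dir():
        return None
    # Sorted so that test runs visit the files in a stable order.
    return sorted(directory.glob("*.xml"))


files = large_fixture_files()
print("skipping large-fixture tests" if files is None else f"{len(files)} file(s) found")
```

In a pytest test, a None result is the point at which to call pytest.skip("TEST_EMU_XML_DIR is not set") so that the suite still passes on machines without the private data.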

Extending / Customizing

  • Conversion helpers live in emu_xml_parser.converter (e.g. date parsing/serialization) and validation/enforcement lives in emu_xml_parser.validator.
  • If you need different conversion rules, you can adapt convert_value or wrap the parser in a small class that injects custom converters.
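
The wrapper idea mentioned above might look like the following sketch. Everything here is hypothetical: ConvertingParser is not part of the package, and fake_parse stands in for emu_xml_parser.parse so the example runs without the package installed. The wrapper post-processes parsed records rather than hooking into the package's internals.

```python
class ConvertingParser:
    """Wrap a parse callable and post-process selected fields (hypothetical helper)."""

    def __init__(self, parse_func, converters):
        self._parse = parse_func
        self._converters = converters  # maps field name -> callable

    def _apply(self, value, field=None):
        if isinstance(value, dict):
            return {key: self._apply(item, key) for key, item in value.items()}
        if isinstance(value, list):
            # List items inherit the field name of the list they belong to.
            return [self._apply(item, field) for item in value]
        converter = self._converters.get(field)
        if converter is None or value is None:
            return value
        return converter(value)

    def parse(self, path, **kwargs):
        return [self._apply(record) for record in self._parse(path, **kwargs)]


def fake_parse(path, **kwargs):
    # Stand-in for emu_xml_parser.parse; returns one record in the documented shape.
    return [{
        "department": "Ornithology",
        "common_name": ["Indian Bush Lark"],
        "SitSiteRef_tab": [{"locality": "San Pedro", "locality_irn": 368989}],
    }]


parser = ConvertingParser(fake_parse, {"common_name": str.upper, "locality_irn": str})
records = parser.parse("ignored.xml")
print(records)
```

Keying converters by field name keeps the wrapper independent of the schema, at the cost of applying the same converter to every field with that name, at any nesting depth.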

Files of Interest

  • src/emu_xml_parser/core.py: entry point parse() for the package
  • src/emu_xml_parser/extractor.py: reads the <?schema ... ?> processing instruction
  • src/emu_xml_parser/schema.py: schema text → structured schema
  • src/emu_xml_parser/tuple_parser.py: recursive XML → dict conversion
  • src/emu_xml_parser/converter.py: value conversion utilities
  • src/emu_xml_parser/validator.py: schema enforcement and normalization
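
To make the first two stages of this pipeline concrete, here is a minimal sketch of pulling the schema text out of a file and turning it into a nested structure. It does not reproduce extractor.py or schema.py: the regular expression, the function names, and the {"kind", "name", "fields"} layout are all assumptions made for illustration.

```python
import re

SCHEMA_PI = re.compile(r"<\?schema\s+(.*?)\?>", re.DOTALL)


def extract_schema_text(xml_text):
    """Return the body of the <?schema ... ?> processing instruction, or None."""
    match = SCHEMA_PI.search(xml_text)
    return match.group(1).strip() if match else None


def parse_schema(schema_text):
    """Turn schema text into nested dicts with "kind", "name", and "fields" keys."""
    root = {"kind": "root", "name": None, "fields": []}
    stack = [root]
    for raw_line in schema_text.splitlines():
        tokens = raw_line.split()
        if not tokens:
            continue
        if tokens[0] == "end":
            if len(stack) == 1:
                raise ValueError("unbalanced 'end' in schema")
            stack.pop()
        elif tokens[0] in ("table", "tuple"):
            node = {"kind": tokens[0], "name": tokens[1], "fields": []}
            stack[-1]["fields"].append(node)
            stack.append(node)
        else:
            # e.g. "text short ComName" or "integer irn": the last token is the name.
            stack[-1]["fields"].append(
                {"kind": " ".join(tokens[:-1]), "name": tokens[-1]}
            )
    if len(stack) != 1:
        raise ValueError("schema has an unclosed table or tuple")
    return root["fields"]


sample = """<?xml version="1.0"?>
<?schema
table ecatalogue
    integer irn
    table common_name
        text short ComName
    end
end
?>
<root/>"""

schema = parse_schema(extract_schema_text(sample))
print(schema)
```

The explicit stack makes mismatched end lines fail loudly, which is useful because a misplaced end silently changes which table a field belongs to.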

License & Contributing

  • Add your preferred license and contribution guidelines to the repository root.

