Convert EndNote XML to CSV/JSON/XLSX with streaming parse and TXT report.
EndNote Utils
Convert EndNote XML files into clean CSV/JSON/XLSX with automatic TXT reports.
Supports both Python API and command-line interface (CLI).
Features
- ✅ Parse one XML file (--xml) or an entire folder of *.xml files (--folder)
- ✅ Streams <record> elements using iterparse (low memory usage)
- ✅ Extracts fields: database, ref_type, title, journal, authors, year, volume, number, abstract, doi, urls, keywords, publisher, isbn, language, extracted_date
- ✅ Adds a database column from the XML filename stem (IEEE.xml → IEEE)
- ✅ Normalizes DOI (10.xxxx → https://doi.org/...)
- ✅ Supports multiple output formats: CSV, JSON, XLSX
- ✅ Generates a TXT report by default (<out>_report.txt; disable with --no-report) with:
  - per-file counts (exported/skipped)
  - totals, files processed
  - run timestamp & duration
  - duplicate table per database (Origin / Retractions / Duplicates / Remaining)
  - optional duplicate key list (top-N)
  - optional summary stats (year, ref_type, journal, top authors)
- ✅ Auto-creates output folders if missing
- ✅ Deduplication:
  - --dedupe doi (unique by DOI)
  - --dedupe title-year (unique by normalized title + year)
  - --dedupe-keep first|last (keep first or last occurrence within each file)
- ✅ Summary stats (--stats) with optional JSON export (--stats-json)
- ✅ CLI options for CSV formatting, filters, verbosity
- ✅ Importable Python API for scripting & integration
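The streaming approach mentioned in the feature list can be illustrated with a minimal sketch built on the standard library's `xml.etree.ElementTree.iterparse`. This is not the package's actual implementation: the helper names `iter_records` and `_text` are hypothetical, and the element paths (`titles/title`, `dates/year`) are assumptions about a typical EndNote XML export.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def _text(elem: ET.Element, path: str) -> str:
    """Return the flattened text under `path`, or '' if the node is missing."""
    node = elem.find(path)
    # itertext() also collects text wrapped in nested tags such as <style>.
    return "".join(node.itertext()).strip() if node is not None else ""


def iter_records(source) -> Iterator[Dict[str, str]]:
    """Yield one small dict per <record> without building the full tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":
            continue
        # Assumed element paths; adjust them to the actual export layout.
        row = {"title": _text(elem, "titles/title"), "year": _text(elem, "dates/year")}
        elem.clear()  # drop the record's children so memory use stays low
        yield row


SAMPLE = b"""<xml><records>
<record><titles><title><style>Paper A</style></title></titles><dates><year>2024</year></dates></record>
<record><titles><title>Paper B</title></titles><dates><year>2023</year></dates></record>
</records></xml>"""

rows = list(iter_records(io.BytesIO(SAMPLE)))
print(rows)
```

Because each record is cleared as soon as its fields are copied out, peak memory depends on the largest single record rather than on the size of the whole file.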
Installation
From PyPI
pip install endnote-utils
Requires Python 3.8+.
Usage
Command Line
Single file
endnote-utils --xml data/IEEE.xml --out output/ieee.csv
Folder with multiple files
endnote-utils --folder data/xmls --out output/all_records.csv
Custom report path
endnote-utils \
  --xml data/Scopus.xml \
  --out output/scopus.csv \
  --report reports/scopus_run.txt \
  --stats \
  --verbose
If --report is not provided, it defaults to <out>_report.txt.
Use --no-report to disable report generation.
CLI Options
| Option | Description | Default |
|---|---|---|
| --xml | Path to a single EndNote XML file | – |
| --folder | Path to a folder containing multiple *.xml files | – |
| --csv | (Legacy) Output CSV path | – |
| --out | Generic output path (.csv, .json, .xlsx) | – |
| --format | Explicit format (csv, json, xlsx) | inferred |
| --report | Output TXT report path | <out>_report.txt |
| --no-report | Disable TXT report completely | – |
| --delimiter | CSV delimiter | , |
| --quoting | CSV quoting: minimal, all, nonnumeric, none | minimal |
| --no-header | Suppress CSV header row | – |
| --encoding | Output text encoding | utf-8 |
| --ref-type | Only include records with this ref_type name | – |
| --year | Only include records with this year | – |
| --max-records | Stop after N records per file (for testing) | – |
| --dedupe | Deduplicate mode: none, doi, title-year | none |
| --dedupe-keep | Deduplication strategy: first, last | first |
| --stats | Include summary stats in TXT report | – |
| --stats-json | Path to JSON file to save stats & duplicate info | – |
| --verbose | Verbose logging with debug details | – |
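The four --quoting names match the quoting constants in Python's standard `csv` module, so their effect can be previewed without running the tool. The sketch below assumes that correspondence; the `QUOTING` lookup and the `render` helper are hypothetical names used only for illustration.

```python
import csv
import io

# Hypothetical lookup from the --quoting names to csv module constants.
QUOTING = {
    "minimal": csv.QUOTE_MINIMAL,
    "all": csv.QUOTE_ALL,
    "nonnumeric": csv.QUOTE_NONNUMERIC,
    "none": csv.QUOTE_NONE,
}


def render(rows, delimiter=",", quoting="minimal"):
    """Write rows to an in-memory CSV and return the resulting text."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quoting=QUOTING[quoting],
        escapechar="\\",  # needed by QUOTE_NONE when a field contains the delimiter
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


rows = [["title", "year"], ["Deep learning, revisited", 2024]]
minimal = render(rows)                   # quotes only the field containing a comma
everything = render(rows, quoting="all") # quotes every field
semicolon = render(rows, delimiter=";")  # the comma no longer needs quoting
print(minimal, everything, semicolon, sep="\n")
```

Titles and abstracts often contain commas, so switching the delimiter (for example to `;` or a tab) is one way to reduce quoting in the output.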
Example Report (snippet)
========================================
EndNote Export Report
========================================
Run started : 2025-09-11 14:30:22
Files : 4
Duration : 0.47 seconds
Per-file results
----------------------------------------
GGScholar.xml : 13 exported, 0 skipped
IEEE.xml : 2147 exported, 0 skipped
PubMed.xml : 504 exported, 0 skipped
Scopus.xml : 847 exported, 0 skipped
TOTAL exported: 3511
Duplicates table (by database)
----------------------------------------
Database       Origin  Retractions  Duplicates  Remaining
------------------------------------------------------------
GGScholar         179            0          27        152
IEEE             1900            0         589       1311
PubMed            320            0         225         95
Scopus           1999            1         511       1489
TOTAL            4410            1        1352       3047
Duplicate keys (top)
----------------------------------------
Mode : doi
Keep : first
Removed: 1352
Details (top):
10.1109/SPMB55497.2022.10014965 : 3 duplicate(s)
10.1109/TSSA63730.2024.10864368 : 2 duplicate(s)
Summary stats
----------------------------------------
By year:
2022 : 569
2023 : 684
2024 : 1148
2025 : 1108
By ref_type (top):
Journal Article: 2037
Conference Proceedings: 1470
Book Section: 4
By journal (top 20):
IEEE Access: 175
IEEE Journal of Biomedical and Health Informatics: 67
...
Top authors (top 10):
Y. Wang: 50
X. Wang: 35
...
Python API
from pathlib import Path
from endnote_utils import export, export_folder
# Single file
total, out_file, report_file = export(
    Path("data/IEEE.xml"),
    Path("output/ieee.csv"),
    dedupe="doi",
    stats=True,
)

# Folder
total, out_file, report_file = export_folder(
    Path("data/xmls"),
    Path("output/all.csv"),
    ref_type="Conference Proceedings",
    year="2024",
    dedupe="title-year",
    dedupe_keep="last",
    stats=True,
    stats_json=Path("output/stats.json"),
)
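Since the exported CSV carries a database column, a short follow-up script can tally records per source. The sketch below assumes the header row is present (the default) and that the column is named `database`, as in the field list; `count_by_database` is a hypothetical helper, and an in-memory sample stands in for a real export.

```python
import csv
import io
from collections import Counter


def count_by_database(handle) -> Counter:
    """Count rows per value of the 'database' column in an exported CSV."""
    reader = csv.DictReader(handle)
    return Counter(row["database"] for row in reader)


# In-memory stand-in for an exported file such as output/all.csv.
sample = io.StringIO(
    "database,title,year\n"
    "IEEE,Paper A,2024\n"
    "IEEE,Paper B,2023\n"
    "Scopus,Paper C,2024\n"
)
counts = count_by_database(sample)
print(counts.most_common())
```

With a real export, the same function can be given `open("output/all.csv", newline="", encoding="utf-8")` in place of the sample.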
Development Notes
- Pure Python, uses only standard library (argparse, csv, xml.etree.ElementTree, logging, pathlib, json).
- Optional dependency: openpyxl (for Excel .xlsx export).
- Streaming XML parsing avoids high memory usage.
- Deduplication strategies configurable (doi / title-year).
- Report includes per-database table and optional JSON snapshot.
- Follows PEP 621 packaging (pyproject.toml).
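To make the two deduplication strategies concrete, here is a minimal sketch of how DOI normalization and key-based deduplication might fit together. It is an illustration under stated assumptions, not the package's code: `normalize_doi`, `dedupe_key`, and `dedupe` are hypothetical names, the title normalization rule (lowercase, punctuation collapsed to spaces) is a guess, and records without a usable key are simply kept.

```python
import re
from typing import Dict, Iterable, List, Optional


def normalize_doi(raw: str) -> str:
    """Return an https://doi.org/ URL for a bare or prefixed DOI, else ''."""
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", raw.strip(), flags=re.IGNORECASE)
    return f"https://doi.org/{doi}" if doi.startswith("10.") else ""


def dedupe_key(rec: Dict[str, str], mode: str) -> Optional[str]:
    """Build the comparison key for a record, or None if it has no usable key."""
    if mode == "doi":
        return normalize_doi(rec.get("doi", "")).lower() or None
    if mode == "title-year":
        # Assumed normalization: lowercase, non-alphanumerics collapsed to spaces.
        title = re.sub(r"[^a-z0-9]+", " ", rec.get("title", "").lower()).strip()
        return f"{title}|{rec.get('year', '')}" if title else None
    return None


def dedupe(records: Iterable[Dict[str, str]], mode: str = "doi", keep: str = "first") -> List[Dict[str, str]]:
    """Keep one record per key; keyless records are never merged."""
    kept: Dict[object, Dict[str, str]] = {}
    for index, rec in enumerate(records):
        key = dedupe_key(rec, mode)
        if key is None:
            kept[("unkeyed", index)] = rec
        elif key not in kept or keep == "last":
            # Overwriting keeps the slot of the first occurrence in output order.
            kept[key] = rec
    return list(kept.values())


records = [
    {"title": "Paper A", "year": "2024", "doi": "10.1000/abc"},
    {"title": "Paper A (preprint)", "year": "2024", "doi": "https://doi.org/10.1000/ABC"},
    {"title": "No DOI", "year": "2023", "doi": ""},
]
first = dedupe(records, mode="doi", keep="first")
last = dedupe(records, mode="doi", keep="last")
print([r["title"] for r in first], [r["title"] for r in last])
```

The sketch works on a single list of records; the tool itself describes --dedupe-keep as applying within each file.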
License
MIT License © 2025 Minh Quach