Convert EndNote XML to CSV/JSON/XLSX with streaming parse and TXT report.
EndNote Utils
Convert EndNote XML and RIS files into clean CSV / JSON / XLSX with automatic TXT reports.
Supports both Python API and command-line interface (CLI).
Features
- ✅ Parse one file (`--xml` or `--ris`) or an entire folder of mixed `*.xml` / `*.ris` files
- ✅ Streams records with `iterparse` (XML) and line-based parsing (RIS) → low memory usage
- ✅ Extracts fields: `database`, `ref_type`, `title`, `journal`, `authors`, `year`, `volume`, `number`, `abstract`, `doi`, `urls`, `keywords`, `publisher`, `isbn`, `language`, `extracted_date`
- ✅ Adds a `database` column from the filename stem (`IEEE.xml` → `IEEE`, `PubMed.ris` → `PubMed`)
- ✅ Normalizes DOI (`10.xxxx` → `https://doi.org/...`)
- ✅ Generates a TXT report by default (`<out>_report.txt`; disable with `--no-report`) with:
  - per-file counts (exported/skipped records)
  - totals, run timestamp & duration
  - duplicates table by database (Origin / Retractions / Duplicates / Remaining)
  - optional summary stats (by year, ref_type, journal, authors)
- ✅ Deduplication by `doi` or `title+year` (`--dedupe`)
- ✅ Export formats: CSV, JSON, XLSX
- ✅ Auto-creates output folders if missing
- ✅ Importable Python API for scripting & integration
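The streaming parse named in the list above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual parser: the function names `iter_records` and `_text` are hypothetical, and the element paths (`record`, `titles/title`, `dates/year`) assume the usual EndNote XML layout.

```python
import io
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Iterator


def _text(record: ET.Element, path: str) -> str:
    """Return the concatenated text under `path`, or "" if it is absent."""
    node = record.find(path)
    return "".join(node.itertext()).strip() if node is not None else ""


def iter_records(source, database: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <record> element without building the whole tree.

    `source` may be a filename or a binary file object.
    Assumption: records use the common EndNote element names shown below.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":
            continue
        yield {
            "database": database,
            "title": _text(elem, "titles/title"),
            "year": _text(elem, "dates/year"),
        }
        # Release the record's children once the caller has consumed it.
        elem.clear()


# Tiny in-memory sample standing in for a real export file.
SAMPLE = b"""<xml><records>
<record><titles><title><style>Deep Learning</style></title></titles>
<dates><year>2024</year></dates></record>
<record><titles><title>Graph Mining</title></titles></record>
</records></xml>"""

# The database column comes from the filename stem (IEEE.xml -> IEEE).
database = Path("data/IEEE.xml").stem
for row in iter_records(io.BytesIO(SAMPLE), database):
    print(row)
```

Because the generator yields each record as soon as its closing tag is seen, a caller can write rows to CSV one at a time instead of holding the full export in memory.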
Installation
From PyPI
pip install endnote-utils
Optional (for Excel export):
pip install "openpyxl>=3.1.0"
Requires Python 3.8+.
Usage
Command Line
Single XML file
endnote-utils --xml data/IEEE.xml --out output/ieee.csv
Single RIS file
endnote-utils --ris data/PubMed.ris --out output/pubmed.json
Folder with mixed files
endnote-utils --folder data/refs --out output/all.xlsx
→ Each produces both the chosen output (csv/json/xlsx) and a TXT report (<out>_report.txt).
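How the output format and the default report path might be derived from `--out` can be sketched as follows. The helper names `infer_format` and `default_report_path` are hypothetical, and the sketch assumes that `<out>` in `<out>_report.txt` means the output path without its extension; the package may resolve it differently.

```python
from pathlib import Path

# Extensions the README lists as supported export formats.
SUPPORTED = {".csv": "csv", ".json": "json", ".xlsx": "xlsx"}


def infer_format(out: Path) -> str:
    """Map the output file extension to an export format name."""
    suffix = out.suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"cannot infer format from {out.name!r}")
    return SUPPORTED[suffix]


def default_report_path(out: Path) -> Path:
    """Place the report next to the output file.

    Assumption: "<out>" is the output path with its extension removed.
    """
    return out.with_name(out.stem + "_report.txt")


out = Path("output/ieee.csv")
print(infer_format(out))         # csv
print(default_report_path(out))  # report file in the same folder as the output
```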
CLI Options
| Option | Description | Default |
|---|---|---|
| `--xml` | Path to a single EndNote XML file | – |
| `--ris` | Path to a single RIS file | – |
| `--folder` | Path to a folder containing `*.xml` / `*.ris` files | – |
| `--out` | Output file path; format inferred from extension | – |
| `--format` | Explicit format (`csv`, `json`, `xlsx`) | inferred |
| `--report` | Output TXT report path | `<out>_report.txt` |
| `--no-report` | Disable TXT report | – |
| `--delimiter` | CSV delimiter | `,` |
| `--quoting` | CSV quoting: `minimal`, `all`, `nonnumeric`, `none` | `minimal` |
| `--no-header` | Suppress CSV header row | – |
| `--encoding` | Output encoding | `utf-8` |
| `--ref-type` | Filter: only include records with this ref_type | – |
| `--year` | Filter: only include records with this year | – |
| `--max-records` | Stop after N records per file (for testing) | – |
| `--dedupe` | Deduplicate (`none`, `doi`, `title-year`) | `none` |
| `--dedupe-keep` | For duplicates, keep `first` or `last` | `first` |
| `--stats` | Add summary stats (year, ref_type, journal, authors) | – |
| `--stats-json` | Save stats + duplicates as JSON file | – |
| `--verbose` | Verbose logging with debug details | – |
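To make the interaction between `--dedupe` and `--dedupe-keep` concrete, here is one way such deduplication could work. It is a sketch under stated assumptions, not the package's implementation: the helper names are hypothetical, DOIs are lowercased before comparison, titles are compared case-insensitively with collapsed whitespace, and records that lack a key are always kept.

```python
from typing import Dict, Hashable, Iterable, List, Optional

_DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                 "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(raw: str) -> str:
    """Turn a bare or prefixed DOI into https://doi.org/<doi> (lowercased)."""
    doi = raw.strip()
    for prefix in _DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return f"https://doi.org/{doi.lower()}" if doi else ""


def dedupe_key(rec: Dict[str, str], mode: str) -> Optional[Hashable]:
    """Return the comparison key for a record, or None if it has none."""
    if mode == "doi":
        return normalize_doi(rec.get("doi", "")) or None
    if mode == "title-year":
        title = " ".join(rec.get("title", "").lower().split())
        return (title, rec.get("year", "")) if title else None
    raise ValueError(f"unknown dedupe mode: {mode!r}")


def dedupe(records: Iterable[Dict[str, str]], mode: str = "none",
           keep: str = "first") -> List[Dict[str, str]]:
    """Drop duplicates while preserving the original record order."""
    records = list(records)
    if mode == "none":
        return records
    if keep not in ("first", "last"):
        raise ValueError(f"keep must be 'first' or 'last', got {keep!r}")
    keys = [dedupe_key(rec, mode) for rec in records]
    chosen: Dict[Hashable, int] = {}  # key -> index of the record to keep
    for i, key in enumerate(keys):
        if key is not None and (keep == "last" or key not in chosen):
            chosen[key] = i
    return [rec for i, (rec, key) in enumerate(zip(records, keys))
            if key is None or chosen[key] == i]


recs = [
    {"title": "A", "year": "2024", "doi": "10.1109/abc.123"},
    {"title": "A (copy)", "year": "2024", "doi": "https://doi.org/10.1109/ABC.123"},
    {"title": "B", "year": "2023", "doi": ""},
]
print([r["title"] for r in dedupe(recs, "doi", "first")])  # ['A', 'B']
print([r["title"] for r in dedupe(recs, "doi", "last")])   # ['A (copy)', 'B']
```

Normalizing the DOI before building the key is what lets a bare `10.xxxx` value and its `https://doi.org/...` form collapse into the same record.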
Example Report (snippet)
========================================
EndNote Export Report
========================================
Run started : 2025-09-12 12:42:20
Files : 4
Duration : 0.47 seconds
Per-file results
----------------------------------------
GGScholar.xml : 13 exported, 0 skipped
IEEE.xml : 2147 exported, 0 skipped
PubMed.ris : 504 exported, 0 skipped
Scopus.ris : 847 exported, 0 skipped
TOTAL exported : 3511
Duplicates table (by database)
----------------------------------------
Database      Origin  Retractions  Duplicates  Remaining
---------------------------------------------------------
IEEE            2200            0          53       2147
PubMed           520            2          14        504
Scopus           880            0          33        847
TOTAL           3600            2         100       3498
Duplicate keys (top)
----------------------------------------
Mode : doi
Keep : first
Removed: 100
Details (top):
https://doi.org/10.1109/abc.123 : 5 duplicate(s)
...
Summary stats
----------------------------------------
By year:
2022 : 569
2023 : 684
2024 : 1148
2025 : 1108
By ref_type (top):
Journal Article : 2037
Conference Proceedings : 1470
By journal (top 10):
IEEE Access : 175
...
Top authors (top 10):
Y. Wang : 50
X. Wang : 35
...
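The summary-stats section of a report like the one above could be computed with `collections.Counter`. This is an illustrative sketch only: the function name `summarize` is hypothetical, and it assumes each record is a dict whose `authors` field is a single semicolon-separated string, which may not match the package's internal representation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def summarize(records: Iterable[Dict[str, str]],
              top: int = 10) -> Dict[str, List[Tuple[str, int]]]:
    """Count records by year, ref_type, journal, and author."""
    by_year: Counter = Counter()
    by_type: Counter = Counter()
    by_journal: Counter = Counter()
    by_author: Counter = Counter()
    for rec in records:
        # Empty fields are ignored rather than counted as "".
        if rec.get("year"):
            by_year[rec["year"]] += 1
        if rec.get("ref_type"):
            by_type[rec["ref_type"]] += 1
        if rec.get("journal"):
            by_journal[rec["journal"]] += 1
        # Assumption: authors are joined with ";" in one string.
        for author in rec.get("authors", "").split(";"):
            if author.strip():
                by_author[author.strip()] += 1
    return {
        "year": sorted(by_year.items()),          # chronological order
        "ref_type": by_type.most_common(top),     # most frequent first
        "journal": by_journal.most_common(top),
        "authors": by_author.most_common(top),
    }


sample = [
    {"year": "2024", "ref_type": "Journal Article",
     "journal": "IEEE Access", "authors": "Y. Wang; X. Wang"},
    {"year": "2024", "ref_type": "Conference Proceedings",
     "journal": "", "authors": "Y. Wang"},
    {"year": "2023", "ref_type": "Journal Article",
     "journal": "IEEE Access", "authors": ""},
]
stats = summarize(sample)
print(stats["year"])     # [('2023', 1), ('2024', 2)]
print(stats["authors"])  # [('Y. Wang', 2), ('X. Wang', 1)]
```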
Python API
from pathlib import Path
from endnote_utils import export, export_folder

# Single XML
total, out_file, report_file = export(
    Path("data/IEEE.xml"), Path("output/ieee.csv"),
    dedupe="doi", stats=True
)

# Folder (mixed XML + RIS)
total, out_file, report_file = export_folder(
    Path("data/refs"), Path("output/all.csv"),
    dedupe="title-year", stats=True, stats_json=Path("output/stats.json")
)

print(f"Exported {total} → {out_file}")
print(f"Report at {report_file}")
Development Notes
- Pure Python, only stdlib (`argparse`, `csv`, `xml.etree.ElementTree`, `logging`, `pathlib`, `json`, `re`).
- Optional: `openpyxl` for Excel output.
- Streaming parsers for XML and RIS avoid high memory usage.
- Robust error handling: skips malformed records but logs them in verbose mode.
- Follows PEP 621 packaging (`pyproject.toml`).
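The notes on line-based RIS parsing and on skipping malformed records can be illustrated with a small generator. This is a sketch, not the package's parser: the name `iter_ris` is hypothetical, and it assumes that a record starts at `TY`, ends at `ER`, and is treated as malformed (skipped and logged at debug level) when a new `TY` or the end of input arrives before its `ER`.

```python
import logging
import re
from typing import Dict, Iterable, Iterator, List

log = logging.getLogger("ris_sketch")

# A RIS line: two-character tag, two spaces, a hyphen, then the value.
TAG_RE = re.compile(r"^([A-Z][A-Z0-9])  -\s?(.*)$")


def iter_ris(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one {tag: [values]} dict per complete TY...ER record."""
    record = None
    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        match = TAG_RE.match(line)
        if match is None:
            log.debug("line %d: not a tagged RIS line, ignored", lineno)
            continue
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":
            if record is not None:
                # Assumption: a record with no ER counts as malformed.
                log.debug("line %d: previous record had no ER, skipped", lineno)
            record = {"TY": [value]}
        elif record is None:
            log.debug("line %d: tag %s outside a record, ignored", lineno, tag)
        elif tag == "ER":
            yield record
            record = None
        else:
            # Repeated tags such as AU accumulate in order.
            record.setdefault(tag, []).append(value)
    if record is not None:
        log.debug("end of input: last record had no ER, skipped")


SAMPLE = """TY  - JOUR
AU  - Wang, Y.
AU  - Wang, X.
TI  - First
ER  -
TY  - JOUR
TI  - Unterminated
TY  - CONF
TI  - Second
ER  -
"""

for rec in iter_ris(SAMPLE.splitlines()):
    print(rec["TY"][0], rec["TI"][0], rec.get("AU", []))
```

Passing an open file object instead of `SAMPLE.splitlines()` keeps only the current record in memory, which is the point of parsing line by line.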
License
MIT License © 2025 Minh Quach