Convert EndNote XML to CSV/JSON/XLSX with streaming parse and TXT report.

EndNote Utils

Convert EndNote XML files into clean CSV/JSON/XLSX with automatic TXT reports.
Supports both Python API and command-line interface (CLI).


Features

  • ✅ Parse one XML file (--xml) or an entire folder of *.xml (--folder)
  • ✅ Streams <record> elements using iterparse (low memory usage)
  • ✅ Extracts fields:
    database, ref_type, title, journal, authors, year, volume, number, abstract, doi, urls, keywords, publisher, isbn, language, extracted_date
  • ✅ Adds a database column from the XML filename stem (IEEE.xml → IEEE)
  • ✅ Normalizes DOI (10.xxxx → https://doi.org/...)
  • ✅ Supports multiple output formats: CSV, JSON, XLSX
  • ✅ Generates a TXT report by default (<out>_report.txt; disable with --no-report) with:
    • per-file counts (exported/skipped)
    • totals, files processed
    • run timestamp & duration
    • duplicate table per database (Origin / Retractions / Duplicates / Remaining)
    • optional duplicate key list (top-N)
    • optional summary stats (year, ref_type, journal, top authors)
  • ✅ Auto-creates output folders if missing
  • ✅ Deduplication:
    • --dedupe doi (unique by DOI)
    • --dedupe title-year (unique by normalized title + year)
    • --dedupe-keep first|last (keep first or last occurrence within each file)
  • ✅ Summary stats (--stats) with optional JSON export (--stats-json)
  • ✅ CLI options for CSV formatting, filters, verbosity
  • ✅ Importable Python API for scripting & integration
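
The streaming approach mentioned in the feature list can be illustrated with a minimal sketch. This is not the package's own code: the helper names `text_of` and `iter_records` are hypothetical, and the element paths (`titles/title`, `dates/year`) are an assumption about a typical EndNote XML export.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def text_of(record, path):
    """Return the concatenated text under `path`, or "" if the element is absent."""
    node = record.find(path)
    if node is None:
        return ""
    # Values are often wrapped in <style> children, so gather all nested text.
    return "".join(node.itertext()).strip()


def iter_records(xml_path):
    """Yield one dict per <record> element (hypothetical helper)."""
    database = Path(xml_path).stem  # IEEE.xml -> "IEEE"
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag != "record":
            continue
        yield {
            "database": database,
            "title": text_of(elem, "titles/title"),  # assumed path
            "year": text_of(elem, "dates/year"),     # assumed path
        }
        elem.clear()  # drop the finished record's children to limit memory use
```

Because `iter_records` is a generator, rows can be written out one at a time instead of building the whole record list in memory first.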

Installation

From PyPI

pip install endnote-utils

Requires Python 3.8+.


Usage

Command Line

Single file

endnote-utils --xml data/IEEE.xml --out output/ieee.csv

Folder with multiple files

endnote-utils --folder data/xmls --out output/all_records.csv

Custom report path

endnote-utils \
  --xml data/Scopus.xml \
  --out output/scopus.csv \
  --report reports/scopus_run.txt \
  --stats \
  --verbose

If --report is not provided, it defaults to <out>_report.txt. Use --no-report to disable report generation.


CLI Options

Option          Description                                       Default
--------------------------------------------------------------------------------
--xml           Path to a single EndNote XML file
--folder        Path to a folder containing multiple *.xml files
--csv           (Legacy) Output CSV path
--out           Generic output path (.csv, .json, .xlsx)
--format        Explicit format (csv, json, xlsx)                 inferred
--report        Output TXT report path                            <out>_report.txt
--no-report     Disable TXT report completely
--delimiter     CSV delimiter                                     ,
--quoting       CSV quoting: minimal, all, nonnumeric, none       minimal
--no-header     Suppress CSV header row
--encoding      Output text encoding                              utf-8
--ref-type      Only include records with this ref_type name
--year          Only include records with this year
--max-records   Stop after N records per file (for testing)
--dedupe        Deduplicate mode: none, doi, title-year           none
--dedupe-keep   Deduplication strategy: first, last               first
--stats         Include summary stats in TXT report
--stats-json    Path to JSON file to save stats & duplicate info
--verbose       Verbose logging with debug details
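
The four --quoting names match the quoting constants of Python's standard csv module. The sketch below shows how such options could be passed to `csv.DictWriter`. The `QUOTING` table and the `write_rows` helper are hypothetical names, and the package may wire its options differently.

```python
import csv
import io

# Hypothetical mapping from the CLI choice names to csv module constants.
QUOTING = {
    "minimal": csv.QUOTE_MINIMAL,
    "all": csv.QUOTE_ALL,
    "nonnumeric": csv.QUOTE_NONNUMERIC,
    "none": csv.QUOTE_NONE,
}


def write_rows(rows, fieldnames, delimiter=",", quoting="minimal", header=True):
    """Render dict rows as CSV text using the chosen delimiter and quoting."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        delimiter=delimiter,
        quoting=QUOTING[quoting],
        lineterminator="\n",
    )
    if header:  # header=False corresponds to a --no-header style switch
        writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Note that `csv.QUOTE_NONE` raises an error when a field contains the delimiter and no escape character is set, so the `none` choice is only safe for data without embedded delimiters.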

Example Report (snippet)

========================================
EndNote Export Report
========================================
Run started : 2025-09-11 14:30:22
Files       : 4
Duration    : 0.47 seconds

Per-file results
----------------------------------------
GGScholar.xml    : 13 exported, 0 skipped
IEEE.xml         : 2147 exported, 0 skipped
PubMed.xml       : 504 exported, 0 skipped
Scopus.xml       : 847 exported, 0 skipped
TOTAL exported: 3511

Duplicates table (by database)
----------------------------------------
Database        Origin   Retractions  Duplicates  Remaining
------------------------------------------------------------
GGScholar           179            0         27        152
IEEE               1900            0        589       1311
PubMed              320            0        225         95
Scopus             1999            1        511       1489
TOTAL              4410            1       1352       3047

Duplicate keys (top)
----------------------------------------
Mode   : doi
Keep   : first
Removed: 1352
Details (top):
  10.1109/SPMB55497.2022.10014965 : 3 duplicate(s)
  10.1109/TSSA63730.2024.10864368 : 2 duplicate(s)

Summary stats
----------------------------------------
By year:
   2022 : 569
   2023 : 684
   2024 : 1148
   2025 : 1108

By ref_type (top):
  Journal Article: 2037
  Conference Proceedings: 1470
  Book Section: 4

By journal (top 20):
  IEEE Access: 175
  IEEE Journal of Biomedical and Health Informatics: 67
  ...

Top authors (top 10):
  Y. Wang: 50
  X. Wang: 35
  ...

Python API

from pathlib import Path
from endnote_utils import export, export_folder

# Single file
total, out_file, report_file = export(
    Path("data/IEEE.xml"),
    Path("output/ieee.csv"),
    dedupe="doi", stats=True
)

# Folder
total, out_file, report_file = export_folder(
    Path("data/xmls"),
    Path("output/all.csv"),
    ref_type="Conference Proceedings",
    year="2024",
    dedupe="title-year",
    dedupe_keep="last",
    stats=True,
    stats_json=Path("output/stats.json"),
)

Development Notes

  • Pure Python; the core uses only the standard library (argparse, csv, xml.etree.ElementTree, logging, pathlib, json).
  • Optional dependency: openpyxl (for Excel .xlsx export).
  • Streaming XML parsing avoids high memory usage.
  • Deduplication strategies configurable (doi / title-year).
  • Report includes per-database table and optional JSON snapshot.
  • Follows PEP 621 packaging (pyproject.toml).
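
The two deduplication modes and the first/last keep policy can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `doi_key`, `title_year_key`, and `dedupe` are hypothetical names, the comparison key uses a bare lowercase DOI, and the title normalization is a simple ASCII-only rule.

```python
import re


def doi_key(doi):
    """Reduce a DOI to a bare lowercase form for comparison (assumed rule)."""
    doi = (doi or "").strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def title_year_key(title, year):
    """Lowercase the title, collapse non-alphanumeric runs, append the year."""
    # ASCII-only simplification: non-ASCII letters are treated as separators.
    words = re.sub(r"[^a-z0-9]+", " ", (title or "").lower()).strip()
    return f"{words}|{(year or '').strip()}" if words else ""


def dedupe(records, mode="doi", keep="first"):
    """Return (kept_records, removed_count); records without a key are kept."""
    # For keep="last", walk the list backwards so the last occurrence wins.
    ordered = list(records) if keep == "first" else list(records)[::-1]
    seen, kept = set(), []
    for rec in ordered:
        if mode == "doi":
            key = doi_key(rec.get("doi"))
        else:
            key = title_year_key(rec.get("title"), rec.get("year"))
        if key and key in seen:
            continue  # duplicate of a record that was already kept
        if key:
            seen.add(key)
        kept.append(rec)
    if keep == "last":
        kept.reverse()  # restore the original input order
    return kept, len(ordered) - len(kept)
```

Keeping records that have no usable key is a design choice in this sketch: a missing DOI says nothing about whether two records are the same work, so they are not collapsed together.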

License

MIT License © 2025 Minh Quach
