A data mining tool for mining data from a batch of XML files

XML/TRXML Selector


This package provides two scripts: mine-xml and mine-trxml.

mine-xml selects tags from xml/mxml files and saves the selected values to a file.

mine-trxml selects fields from trxml/mtrxml files and saves the selected values to a file.


Python 3.6+


pip install xml-miner


Use xml selector script

The xml selector supports:
  • one or more tagnames:
    • the selector can be a single tagname, e.g. name
    • or comma-separated tagnames, e.g. langskill,compskill,softskill
  • multiple sources:
    • e.g. select from an xml dir, xml files, an mxml file, or directly from an annotation server

#select from xml directory
mine-xml --source tests/xmls/ --selector name --output_file name.tsv
mine-xml --source tests/xmls/ --selector langskill,compskill,softskill --output_file skill.tsv --with_field_name

#select from xml file or mxml file
mine-xml --source tests/sample.mxml --selector experience --output_file experience.tsv

#select directly from annotation server
mine-xml --source localhost:50249 --selector name --output_file name.tsv --query "set Data2018"
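
To make the selection step concrete, here is a minimal Python sketch of what picking tags out of one XML document and writing them as tab-separated rows could look like. It is not the package's implementation: select_tags and to_tsv are hypothetical helper names, the sample document is made up, and only the standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET


def select_tags(filename, xml_text, tagnames, with_field_name=False):
    """Return one row per matching element: filename, value[, tagname]."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter():  # walks the tree in document order
        if element.tag in tagnames:
            value = (element.text or "").strip()
            row = [filename, value]
            if with_field_name:
                row.append(element.tag)
            rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated text, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample document, used only for illustration.
sample = (
    "<cv><name>Chao Li</name>"
    "<compskill>java</compskill><langskill>dutch</langskill></cv>"
)

# A comma-separated selector, as passed to --selector.
selector = "compskill,langskill"
rows = select_tags("sample.xml", sample, selector.split(","), with_field_name=True)
print(to_tsv(rows), end="")
```

With with_field_name left at its default, each row holds only the filename and the value, which corresponds to the default output of mine-xml.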

Use trxml selector script

The trxml selector supports:
  • one or more selectors:
    • the selector can be a single field
    • or comma-separated fields (e.g. a list that includes address.0.address)
  • single or multiple items:
    • select a field from one item, e.g. experienceitem.3.experience
    • or select the field value of all items, e.g. experienceitem.experience (or experienceitem.*.experience)
  • multiple sources:
    • e.g. select from a trxml dir, trxml files, or an mtrxml file

# one selector, single item
mine-trxml --source tests/trxmls/ --selector --output_file name.tsv

# one selector, multiple item
mine-trxml --source tests/sample.mxml --selector experienceitem.experience --output_file experience.tsv

# more selectors, single item
mine-trxml --source tests/trxmls/ --selector,address.0.address, --output_file personal.tsv

# more selectors, multiple item
mine-trxml --source tests/sample.mxml  --itemgroup experienceitem --fields experience,experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml  --selector experienceitem.*.experience,experienceitem.*.experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml  --selector experienceitem.experience,experienceitem.experiencedate --output_file experience.tsv
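
The selector forms used above (itemgroup.index.field, itemgroup.*.field and itemgroup.field) can be illustrated with a small parser. This is only a sketch under the assumption that a missing index and * both mean "all items", as the examples suggest; Selector, parse_selector, parse_selectors and from_itemgroup are hypothetical names, not the package's API.

```python
from typing import NamedTuple, Optional


class Selector(NamedTuple):
    itemgroup: str
    index: Optional[int]  # None means "all items in the itemgroup"
    field: str


def parse_selector(text):
    """Parse 'group.field', 'group.*.field' or 'group.N.field'."""
    parts = text.strip().split(".")
    if len(parts) == 2:
        itemgroup, field = parts
        return Selector(itemgroup, None, field)
    if len(parts) == 3:
        itemgroup, index, field = parts
        if index == "*":
            return Selector(itemgroup, None, field)
        return Selector(itemgroup, int(index), field)
    raise ValueError(f"unsupported selector: {text!r}")


def parse_selectors(text):
    """Parse a comma-separated selector string, skipping empty parts."""
    return [parse_selector(part) for part in text.split(",") if part.strip()]


def from_itemgroup(itemgroup, fields):
    """Build all-item selectors from an itemgroup and comma-separated fields."""
    return [Selector(itemgroup, None, field) for field in fields.split(",")]


print(parse_selector("experienceitem.3.experience"))
print(parse_selectors("experienceitem.*.experience,experienceitem.experiencedate"))
print(from_itemgroup("experienceitem", "experience,experiencedate"))
```

Under these assumptions the last two calls produce the same selectors, mirroring how the --itemgroup/--fields form and the two --selector forms appear side by side in the commands above.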


To install the package and its dependencies, run the following from the project root directory:

python setup.py install

To work on the code and develop the package, run the following from the project root directory:

python setup.py develop

To run the unit tests, execute the following from the project root directory:

python setup.py test

Selector and output details:

  • mine-xml:

    • input: documents, selector(s), output

    • default (parameter with_field_name not set): filename, field_value

      e.g. select all names with selector name

      filename    value
      xxxx        Chao Li

    • parameter with_field_name set: filename, field_value, field_name

      e.g. select skills with selector compskill,langskill,otherskill

      filename    value    field
      xxxx        java     compskill
      xxxx        dutch    langskill

  • mine-trxml:

    • input:
      • documents, selector(s), output
      • or documents, itemgroup, fields, output

    • single selector:

      • single item: filename, field

        filename    field
        xxxx        Chao Li

      • multi items (skill.*.skill): filename, item_index, field

        filename    item_index    field
        xxxx        0             java
        xxxx        1             dutch

    • multiple selectors:

      • single item: filename, field1, field2 …

        each selector points to a field of a specific item with a numeric index, e.g. name.0.lastname,name.0.firstname,…

        filename    name.0.lastname    name.0.firstname    …
        xxxx        Li                 Chao                China
        xxxx        Lee                Richard             USA

      • multi items: filename, item_index, field1, field2 …

        each selector points to a field from all items in an itemgroup, e.g. skill.skill,skill.type,skill.date

        filename    skill    skill    type         date
        xxxx        0        java     compskill    2001-2005
        xxxx        1        dutch    langskill    2002-
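
The multi-item layout (filename, item_index, then one column per field) can be sketched in a few lines of Python. This is an illustration only: representing each item as a dict and the helper name multi_item_rows are assumptions, and the values are taken from the skill example above.

```python
def multi_item_rows(filename, items, fields):
    """Return one row per item: filename, item_index, then one value per field."""
    rows = []
    for index, item in enumerate(items):
        # A field missing from an item is written as an empty string.
        values = [item.get(field, "") for field in fields]
        rows.append([filename, str(index)] + values)
    return rows


# Items of one itemgroup, modelled as dicts (an assumption of this sketch).
skills = [
    {"skill": "java", "type": "compskill", "date": "2001-2005"},
    {"skill": "dutch", "type": "langskill", "date": "2002-"},
]
fields = ["skill", "type", "date"]

header = ["filename", "item_index"] + fields
for row in [header] + multi_item_rows("xxxx", skills, fields):
    print("\t".join(row))
```

Each item yields exactly one row, so values that belong to the same item stay together on a line.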

0.0.5 (2019-10-14)

  • bug fix: an ElementTree XPath find yields None when the value is an empty string; the value is now restored to an empty string
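
The behaviour behind this fix is standard ElementTree: an element that exists but is empty has its text attribute set to None. The changelog does not show the actual patch, so the following is only a sketch of one way to normalise such values; field_value is a hypothetical helper, not part of the package.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<cv><name>Chao Li</name><phone></phone></cv>")

# An empty element is found, but its text is None rather than "".
print(root.find("phone").text is None)


def field_value(root, path):
    """Return the element text, "" for an empty element, None if the tag is absent."""
    element = root.find(path)
    if element is None:
        return None
    return element.text if element.text is not None else ""


print(repr(field_value(root, "name")))
print(repr(field_value(root, "phone")))
print(repr(field_value(root, "email")))
```

Keeping None for absent tags and "" for empty ones preserves the distinction between "field missing" and "field present but empty".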

0.0.4 (2019-09-11)

  • bug fix: reading always use utf8, and not continue reading if failed on encoding of one document

0.0.3 (2019-08-11)

  • expand module to generate matched phrases per doc

0.0.2 (2019-08-09)

  • added support for CI

0.0.1 (2019-08-09)

  • make two scripts: mine-xml and mine-trxml

0.0.0 (2019-08-06)

  • Add the first version of the mine_xml and mine_trxml
