A data mining tool for extracting data from batches of XML files.

Project description

XML/TRXML Selector

Description

This package provides two scripts: mine-xml and mine-trxml.

mine-xml selects tags from xml/mxml files and saves the selected values to a file.

mine-trxml selects fields from trxml/mtrxml files and saves the selected values to a file.

Status

https://travis-ci.org/tilaboy/xml-miner.svg?branch=master

Requirements

Python 3.6+

Installation

pip install xml-miner

Usage

Use xml selector script

The xml selector supports:
  • one or more tag names:
    • the selector can be a single tag name, e.g. name
    • or comma-separated tag names, e.g. langskill,compskill,softskill
  • multiple sources:
    • e.g. select from an xml directory, xml files, an mxml file, or directly from an annotation server
examples:
# select from an xml directory
mine-xml --source tests/xmls/ --selector name --output_file name.tsv
mine-xml --source tests/xmls/ --selector langskill,compskill,softskill --output_file skill.tsv --with_field_name

# select from an xml file or mxml file
mine-xml --source tests/sample.mxml --selector experience --output_file experience.tsv

# select directly from an annotation server
mine-xml --source localhost:50249 --selector name --output_file name.tsv --query "set Data2018"
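
The kind of selection mine-xml performs can be pictured with the standard library alone. The sketch below is not the package's implementation: select_tags is a hypothetical helper, and it assumes each document is a well-formed XML string whose tag names match the selector.

```python
import xml.etree.ElementTree as ET


def select_tags(filename, xml_text, selector, with_field_name=False):
    """Return one row per matching tag: filename, value[, tag name].

    Hypothetical helper for illustration only; not part of xml-miner.
    """
    tagnames = [name.strip() for name in selector.split(",") if name.strip()]
    root = ET.fromstring(xml_text)
    rows = []
    for tagname in tagnames:
        # iter() walks the whole tree, including the root element itself
        for element in root.iter(tagname):
            value = "".join(element.itertext()).strip()
            row = [filename, value]
            if with_field_name:
                row.append(tagname)
            rows.append(row)
    return rows


sample = (
    "<doc><name>Chao Li</name>"
    "<compskill>java</compskill><langskill>dutch</langskill></doc>"
)
for row in select_tags("xxxx", sample, "compskill,langskill", with_field_name=True):
    print("\t".join(row))
```

Rows are grouped by tag name in selector order here; whether the real script orders its output the same way is not stated in this description.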

Use trxml selector script

The trxml selector supports:
  • one or more selectors:
    • the selector can be a single field, e.g. name.0.name
    • or comma-separated fields, e.g. name.0.name,address.0.address
  • single or multiple items:
    • select a field from one item, e.g. experienceitem.3.experience
    • or select the field value of all items, e.g. experienceitem.experience (or experienceitem.*.experience)
  • multiple sources:
    • e.g. select from a trxml directory, trxml files, or an mtrxml file
examples:
# one selector, single item
mine-trxml --source tests/trxmls/ --selector name.0.name --output_file name.tsv

# one selector, multiple items
mine-trxml --source tests/sample.mxml --selector experienceitem.experience --output_file experience.tsv

# several selectors, single item
mine-trxml --source tests/trxmls/ --selector name.0.name,address.0.address,phone.0.phone --output_file personal.tsv

# several selectors, multiple items
mine-trxml --source tests/sample.mxml --itemgroup experienceitem --fields experience,experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.*.experience,experienceitem.*.experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.experience,experienceitem.experiencedate --output_file experience.tsv
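
The three selector spellings used above (itemgroup.index.field, itemgroup.*.field and the itemgroup.field shorthand) can be told apart with a small parser. This is a minimal sketch of the syntax only: Selector, parse_selector and parse_selectors are hypothetical names, not part of the package, and how mine-trxml parses its arguments internally is not described here.

```python
import re
from typing import NamedTuple, Optional


class Selector(NamedTuple):
    itemgroup: str
    index: Optional[int]  # None means "all items"
    field: str


def parse_selector(text):
    """Split one trxml selector into itemgroup, item index and field.

    Hypothetical helper for illustration only; not part of xml-miner.
    """
    parts = text.strip().split(".")
    if len(parts) == 2:
        # itemgroup.field is treated as shorthand for itemgroup.*.field
        itemgroup, field = parts
        return Selector(itemgroup, None, field)
    if len(parts) == 3:
        itemgroup, index, field = parts
        if index == "*":
            return Selector(itemgroup, None, field)
        if re.fullmatch(r"[0-9]+", index):
            return Selector(itemgroup, int(index), field)
    raise ValueError(f"unsupported selector: {text!r}")


def parse_selectors(text):
    """Parse a comma-separated selector list."""
    return [parse_selector(part) for part in text.split(",")]


for selector in parse_selectors("name.0.name,experienceitem.*.experience"):
    print(selector)
```

Representing "all items" as None keeps the single-item and multi-item cases in one type, so a caller can branch on the index alone.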

Development

To install the package and its dependencies, run the following from the project root directory:

python setup.py install

To work on the code and develop the package, run the following from the project root directory:

python setup.py develop

To run unit tests, execute the following from the project root directory:

python setup.py test

Selector and output details

  • mine-xml:

    input: documents, selector(s), output

    output:

    • default (parameter with_field_name not set): filename, field_value

      e.g. select all names with selector name

      filename    value
      xxxx        Chao Li

    • parameter with_field_name set: filename, field_value, field_name

      e.g. select skills with selector compskill,langskill,otherskill

      filename    value    field
      xxxx        java     compskill
      xxxx        dutch    langskill
  • mine-trxml:

    input, one of:

    • documents, selector(s), output
    • documents, itemgroup, fields, output

    output with a single selector:

    • single item (name.0.name): filename, field

      filename    name.0.name
      xxxx        Chao Li

    • multiple items (skill.*.skill): filename, item_index, field

      filename    item_index    field
      xxxx        0             java
      xxxx        1             dutch

    output with multiple selectors:

    • single item: filename, field1, field2 …

      each selector points to a field of a specific item with a numeric index, e.g. name.0.lastname,name.0.firstname,address.0.country

      filename    name.0.lastname    name.0.firstname    address.0.country
      xxxx        Li                 Chao                China
      xxxx        Lee                Richard             USA

    • multiple items: filename, item_index, field1, field2 …

      each selector points to a field from all items in an itemgroup, e.g. skill.skill,skill.type,skill.date

      filename    skill    skill    type         date
      xxxx        0        java     compskill    2001-2005
      xxxx        1        dutch    langskill    2002-
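
The multi-item layout above (one row per item, with the item index in the second column) can be produced with the standard csv module. This is a sketch under stated assumptions: multi_item_rows is a hypothetical helper, the list-of-dicts input is an assumed in-memory shape rather than the trxml format itself, and the header names are chosen for illustration.

```python
import csv
import io


def multi_item_rows(filename, items, fields):
    """Yield one row per item: filename, item_index, then the chosen fields.

    Hypothetical helper for illustration only; not part of xml-miner.
    """
    for item_index, item in enumerate(items):
        # a missing field becomes an empty cell so the columns stay aligned
        yield [filename, str(item_index)] + [item.get(field, "") for field in fields]


skills = [
    {"skill": "java", "type": "compskill", "date": "2001-2005"},
    {"skill": "dutch", "type": "langskill", "date": "2002-"},
]
fields = ["skill", "type", "date"]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["filename", "item_index"] + fields)
writer.writerows(multi_item_rows("xxxx", skills, fields))
print(buffer.getvalue(), end="")
```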

0.0.5 (2019-10-14)

  • bug fix: an ElementTree XPath find returns None when the value is an empty string; the value is now restored to an empty string
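
The ElementTree behaviour behind this fix can be reproduced in a few lines. The sketch below uses only the standard library and made-up tag names; it shows the general None-versus-empty-string behaviour, not the package's actual code.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><name></name><phone>123</phone></doc>")

# .text of an element without text content is None, not ""
empty_text = root.find("name").text

# normalising None back to "" keeps the output value a plain string
restored = empty_text if empty_text is not None else ""

# findtext() already returns "" for an element that exists but is empty
via_findtext = root.findtext("name")

# a tag that is absent altogether makes find() itself return None
missing = root.find("address")

print(repr(empty_text), repr(restored), repr(via_findtext), repr(missing))
```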

0.0.4 (2019-09-11)

  • bug fix: always read files as UTF-8, and do not continue reading if one document fails on encoding

0.0.3 (2019-08-11)

  • expand miner.py module to generate matched phrases per doc

0.0.2 (2019-08-09)

  • added support for CI

0.0.1 (2019-08-09)

  • provide two scripts: mine-xml and mine-trxml

0.0.0 (2019-08-06)

  • add the first version of mine_xml and mine_trxml

