
A data mining tool for extracting values from a batch of XML files


XML/TRXML Selector

Description

This package provides two scripts: mine-xml and mine-trxml.

mine-xml selects tags from xml/mxml files and saves the selected values to a file.

mine-trxml selects fields from trxml/mtrxml files and saves the selected values to a file.

Status

Build status: https://travis-ci.org/tilaboy/xml-miner.svg?branch=master

Requirements

Python 3.6+

Installation

pip install xml-miner

Usage

Use xml selector script

The xml selector supports:

  • one or more tag names:

    • the selector can be a single tag name, e.g. name

    • or comma-separated tag names, e.g. langskill,compskill,softskill

  • multiple sources:

    • e.g. select from an xml directory, xml files, an mxml file, or directly from an annotation server

examples:
# select from xml directory
mine-xml --source tests/xmls/ --selector name --output_file name.tsv
mine-xml --source tests/xmls/ --selector langskill,compskill,softskill --output_file skill.tsv --with_field_name

# select from xml file or mxml file
mine-xml --source tests/sample.mxml --selector experience --output_file experience.tsv

# select directly from annotation server
mine-xml --source localhost:50249 --selector name --output_file name.tsv --query "set Data2018"
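
To make the idea of "selecting tags" concrete, here is a minimal sketch using only the standard library. It is not the package's implementation: the function name select_tags and the sample document are assumptions made for illustration, and it handles a single XML string rather than directories, mxml files, or an annotation server.

```python
import xml.etree.ElementTree as ET


def select_tags(xml_text, selector):
    """Return (tag, value) pairs for every element whose tag is in the selector.

    Hypothetical helper for illustration; not part of xml-miner.
    """
    wanted = [tag.strip() for tag in selector.split(",") if tag.strip()]
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():  # walks the tree in document order
        if elem.tag in wanted:
            # ElementTree reports the text of an empty element as None;
            # normalise it to an empty string.
            rows.append((elem.tag, elem.text or ""))
    return rows


# Made-up sample document, reusing tag names from the examples above.
sample = (
    "<doc><name>Chao Li</name><compskill>java</compskill>"
    "<langskill>dutch</langskill><softskill></softskill></doc>"
)
print(select_tags(sample, "name"))
print(select_tags(sample, "langskill,compskill,softskill"))
```

Rows come back in document order, not in the order the tag names appear in the selector.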

Use trxml selector script

The trxml selector supports:

  • one or more selectors:

    • the selector can be a single field: name.0.name

    • or comma-separated fields: name.0.name,address.0.address

  • single or multiple items:

    • select a field from one item, e.g. experienceitem.3.experience

    • or select the field value of all items, e.g. experienceitem.experience (or experienceitem.*.experience)

  • multiple sources:

    • e.g. select from a trxml directory, trxml files, or an mtrxml file

examples:
# one selector, single item
mine-trxml --source tests/trxmls/ --selector name.0.name --output_file name.tsv

# one selector, multiple item
mine-trxml --source tests/sample.mxml --selector experienceitem.experience --output_file experience.tsv

# more selectors, single item
mine-trxml --source tests/trxmls/ --selector name.0.name,address.0.address,phone.0.phone --output_file personal.tsv

# more selectors, multiple item
mine-trxml --source tests/sample.mxml --itemgroup experienceitem --fields experience,experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.*.experience,experienceitem.*.experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.experience,experienceitem.experiencedate --output_file experience.tsv
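
The selector forms above can be read as (itemgroup, index, field) triples. The following sketch shows one way to parse them. It is only an illustration of the syntax described here, not the package's own parser: parse_selector and selectors_from_itemgroup are hypothetical names, and treating --itemgroup/--fields as shorthand for the `*` form is an assumption based on the three commands grouped together in the last example.

```python
def parse_selector(selector):
    """Split one trxml selector into (itemgroup, index, field).

    index is an int for a single item, or None when the selector
    addresses every item (index omitted, or given as "*").
    Hypothetical helper for illustration; not part of xml-miner.
    """
    parts = selector.split(".")
    if len(parts) == 2:
        itemgroup, field = parts
        return itemgroup, None, field
    if len(parts) == 3:
        itemgroup, index, field = parts
        if index == "*":
            return itemgroup, None, field
        return itemgroup, int(index), field
    raise ValueError(f"unsupported selector: {selector!r}")


def selectors_from_itemgroup(itemgroup, fields):
    """Build the comma-separated all-items selector for an itemgroup and its fields."""
    return ",".join(f"{itemgroup}.*.{field}" for field in fields.split(","))


print(parse_selector("name.0.name"))
print(parse_selector("experienceitem.experience"))
print(selectors_from_itemgroup("experienceitem", "experience,experiencedate"))
```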

Development

To install the package and its dependencies, run the following from the project root directory:

python setup.py install

To work on the code and develop the package, run the following from the project root directory:

python setup.py develop

To run unit tests, execute the following from the project root directory:

python setup.py test

Selector and output details:

  • mine-xml:

    input: documents, selector(s), output

    output:

    • default (parameter with_field_name not set): filename, field_value

      e.g. select all names with selector name

      filename    value
      xxxx        Chao Li

    • parameter with_field_name set: filename, field_value, field_name

      e.g. select skills with selector compskill,langskill,otherskill

      filename    value    field
      xxxx        java     compskill
      xxxx        dutch    langskill
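
The two mine-xml output layouts can be sketched as follows. This assumes the .tsv output is tab-separated with a header row, which the column listings above suggest but do not state outright; write_rows is a hypothetical helper, not a function of the package.

```python
import csv
import io


def write_rows(handle, rows, with_field_name=False):
    """Write (filename, value, field) rows as tab-separated text.

    Assumption: a header row followed by one tab-separated line per value.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    header = ["filename", "value"]
    if with_field_name:
        header.append("field")
    writer.writerow(header)
    for filename, value, field in rows:
        row = [filename, value]
        if with_field_name:
            row.append(field)
        writer.writerow(row)


# Rows taken from the skills example above.
rows = [("xxxx", "java", "compskill"), ("xxxx", "dutch", "langskill")]
buffer = io.StringIO()
write_rows(buffer, rows, with_field_name=True)
print(buffer.getvalue())
```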

  • mine-trxml:

    input:

    • documents, selector(s), output

    • documents, itemgroup, fields, output

    single selector:

    • single item (name.0.name): filename, field

      filename    name.0.name
      xxxx        Chao Li

    • multi items (skill.*.skill): filename, item_index, field

      filename    item_index    field
      xxxx        0             java
      xxxx        1             dutch

    multiple selectors:

    • single item: filename, field1, field2 …

      each selector points to a field of a specific item with a numeric index, e.g. name.0.lastname,name.0.firstname,address.0.country

      filename    name.0.lastname    name.0.firstname    address.0.country
      xxxx        Li                 Chao                China
      xxxx        Lee                Richard             USA

    • multi items: filename, item_index, field1, field2 …

      each selector points to a field from all items in an itemgroup, e.g. skill.skill,skill.type,skill.date

      filename    skill    skill    type         date
      xxxx        0        java     compskill    2001-2005
      xxxx        1        dutch    langskill    2002-
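
As a rough illustration of the multi-item layout, the sketch below builds one row per item from an in-memory stand-in for a parsed document. The dictionary shape (itemgroup name mapped to a list of field dictionaries) and the function multi_item_rows are assumptions made for this example; the real trxml structure and the package's internals may differ.

```python
def multi_item_rows(filename, itemgroups, itemgroup, fields):
    """Return one row per item: filename, item index, then each requested field.

    Hypothetical helper; a missing field is emitted as an empty string.
    """
    rows = []
    for index, item in enumerate(itemgroups.get(itemgroup, [])):
        rows.append([filename, index] + [item.get(field, "") for field in fields])
    return rows


# Stand-in for a parsed document, using the values from the table above.
doc = {
    "skill": [
        {"skill": "java", "type": "compskill", "date": "2001-2005"},
        {"skill": "dutch", "type": "langskill", "date": "2002-"},
    ]
}
for row in multi_item_rows("xxxx", doc, "skill", ["skill", "type", "date"]):
    print(row)
```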

0.0.5 (2019-10-14)

  • bug fix: ElementTree's xpath find returns None when the value is an empty string; such values are now restored to an empty string

0.0.4 (2019-09-11)

  • bug fix: always read files as utf8, and do not continue reading if one document fails on encoding

0.0.3 (2019-08-11)

  • expand miner.py module to generate matched phrases per doc

0.0.2 (2019-08-09)

  • added support for CI

0.0.1 (2019-08-09)

  • make two scripts: mine-xml and mine-trxml

0.0.0 (2019-08-06)

  • Add the first version of the mine_xml and mine_trxml
