Data mining tool for extracting data from a batch of XML files
XML/TRXML Selector
Description
This package provides two scripts: mine-xml and mine-trxml.
mine-xml selects tags from xml/mxml files and saves the selected values to a file.
mine-trxml selects fields from trxml/mtrxml files and saves the selected values to a file.
Requirements
Python 3.6+
Installation
pip install xml-selector
Usage
Use xml selector script
The xml selector supports:
one or more tagnames:
selector can be a single tagname, e.g. name
or comma-separated tagnames, e.g. langskill,compskill,softskill
multiple sources:
e.g. select from xml dir, xml files, mxml file, or directly from annotation server
examples:
# select from xml directory
mine-xml --source tests/xmls/ --selector name --output_file name.tsv
mine-xml --source tests/xmls/ --selector langskill,compskill,softskill --output_file skill.tsv --with_field_name

# select from xml file or mxml file
mine-xml --source tests/sample.mxml --selector experience --output_file experience.tsv

# select directly from annotation server
mine-xml --source localhost:50249 --selector name --output_file name.tsv --query "set Data2018"
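To make the selection step concrete, here is a minimal Python sketch of picking tags out of XML documents and writing filename/value rows. It is not the package's implementation: the function names `select_tags` and `mine`, the in-memory `docs` mapping, and the tab-separated layout are assumptions made for illustration (the `.tsv` extension in the examples suggests tab separation, but the README does not state it).

```python
import csv
import io
import xml.etree.ElementTree as ET


def select_tags(xml_text, tagnames):
    """Yield (value, tagname) for every element whose tag is in tagnames."""
    wanted = set(tagnames)
    root = ET.fromstring(xml_text)
    for element in root.iter():  # walks the tree in document order
        if element.tag in wanted:
            # itertext() also collects text nested inside child elements
            yield "".join(element.itertext()).strip(), element.tag


def mine(documents, selector, with_field_name=False):
    """Return tab-separated rows for a {filename: xml_text} mapping (hypothetical helper)."""
    tagnames = selector.split(",")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for filename, xml_text in documents.items():
        for value, tagname in select_tags(xml_text, tagnames):
            row = [filename, value]
            if with_field_name:
                row.append(tagname)  # third column mirrors --with_field_name
            writer.writerow(row)
    return buffer.getvalue()


# Made-up sample document, used only to exercise the sketch
docs = {
    "cv1.xml": "<cv><name>Chao Li</name>"
               "<langskill>dutch</langskill><compskill>java</compskill></cv>"
}
print(mine(docs, "name"))
print(mine(docs, "langskill,compskill", with_field_name=True))
```

Splitting the selector on commas and matching tags by name keeps the sketch close to the command-line options shown above; the real scripts may handle sources, encodings, and output differently.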
Use trxml selector script
The trxml selector supports:
one or more selectors:
selector can be one field: name.0.name
or comma separated fields: name.0.name,address.0.address
single or multi item:
can select a field from one item, e.g. experienceitem.3.experience
or select the field value of all items, e.g. experienceitem.experience (or experienceitem.*.experience)
multiple sources:
e.g. select from trxml dir, trxml files, or mtrxml file
examples:
# one selector, single item
mine-trxml --source tests/trxmls/ --selector name.0.name --output_file name.tsv

# one selector, multiple items
mine-trxml --source tests/sample.mxml --selector experienceitem.experience --output_file experience.tsv

# multiple selectors, single item
mine-trxml --source tests/trxmls/ --selector name.0.name,address.0.address,phone.0.phone --output_file personal.tsv

# multiple selectors, multiple items
mine-trxml --source tests/sample.mxml --itemgroup experienceitem --fields experience,experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.*.experience,experienceitem.*.experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.experience,experienceitem.experiencedate --output_file experience.tsv
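The selector syntax used above (`itemgroup.index.field`, `itemgroup.field`, `itemgroup.*.field`) can be illustrated with a small parser. This is only a sketch of how such strings might be interpreted; `Selector`, `parse_selector`, and `parse_selectors` are hypothetical names and are not taken from the package.

```python
from typing import NamedTuple, Optional


class Selector(NamedTuple):
    itemgroup: str
    index: Optional[int]  # None means "all items in the itemgroup"
    field: str


def parse_selector(text):
    """Parse one selector such as 'name.0.name' or 'experienceitem.experience'."""
    parts = text.split(".")
    if len(parts) == 2:
        # itemgroup.field is shorthand for itemgroup.*.field
        itemgroup, field = parts
        return Selector(itemgroup, None, field)
    if len(parts) == 3:
        itemgroup, index, field = parts
        if index == "*":
            return Selector(itemgroup, None, field)
        return Selector(itemgroup, int(index), field)
    raise ValueError(f"unsupported selector: {text!r}")


def parse_selectors(text):
    """Parse a comma-separated list of selectors."""
    return [parse_selector(part) for part in text.split(",")]


print(parse_selector("name.0.name"))
print(parse_selectors("experienceitem.experience,experienceitem.*.experiencedate"))
```

Treating the two-part form and the `*` form identically reflects the README's statement that `experienceitem.experience` and `experienceitem.*.experience` select the same values.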
Development
To install the package and its dependencies, run the following from the project root directory:
python setup.py install
To work on the code and develop the package, run the following from the project root directory:
python setup.py develop
To run unit tests, execute the following from the project root directory:
python setup.py test
selector and output details:
mine-xml:
input: documents, selector(s), output
output:
default (parameter with_field_name not set): filename, field_value
e.g. select all names with selector name

filename | value
---|---
xxxx | Chao Li
parameter with_field_name set: filename, field_value, field_name
e.g. select skills with selector compskill,langskill,otherskill

filename | value | field
---|---|---
xxxx | java | compskill
xxxx | dutch | langskill
mine-trxml:
input (one of):
documents, selector(s), output
documents, itemgroup, fields, output
single selector:
single item (name.0.name): filename, field

filename | name.0.name
---|---
xxxx | Chao Li
multi items (skill.*.skill): filename, item_index, field

filename | item_index | field
---|---|---
xxxx | 0 | java
xxxx | 1 | dutch
multiple selectors:
single item: filename, field1, field2 …
each selector points to a field of a specific item with a numeric index, e.g. name.0.lastname,name.0.firstname,address.0.country

filename | name.0.lastname | name.0.firstname | address.0.country
---|---|---|---
xxxx | Li | Chao | China
xxxx | Lee | Richard | USA
multi items: filename, item_index, field1, field2 …
each selector points to a field from all items in an itemgroup, e.g. skill.skill,skill.type,skill.date

filename | skill | skill | type | date
---|---|---|---|---
xxxx | 0 | java | compskill | 2001-2005
xxxx | 1 | dutch | langskill | 2002-
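As a rough illustration of consuming the multi-item output described above, the following Python sketch groups rows by filename. It rests on several assumptions that the README does not confirm: that the output file is tab-separated, that it starts with a header row, and that the index column is named `item_index`. The `SAMPLE` text and the `group_items` helper are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample shaped like the multi-item output; the tab separator,
# the header row, and the column names are assumptions.
SAMPLE = (
    "filename\titem_index\tskill\ttype\tdate\n"
    "xxxx\t0\tjava\tcompskill\t2001-2005\n"
    "xxxx\t1\tdutch\tlangskill\t2002-\n"
)


def group_items(tsv_text):
    """Group output rows by filename, keeping one dict per item."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        item = dict(row)
        filename = item.pop("filename")
        item["item_index"] = int(item["item_index"])
        grouped[filename].append(item)
    return dict(grouped)


items = group_items(SAMPLE)
for filename, rows in items.items():
    for row in rows:
        print(filename, row["item_index"], row["skill"], row["type"], row["date"])
```

If the real files carry no header row, passing explicit `fieldnames` to `csv.DictReader` would be the corresponding adjustment.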
0.0.4 (2019-09-11)
bug fix: always read files as UTF-8, and do not continue reading if one document fails on encoding
0.0.3 (2019-08-11)
expand miner.py module to generate matched phrases per doc
0.0.2 (2019-08-09)
added support for CI
0.0.1 (2019-08-09)
make two scripts: mine-xml and mine-trxml
0.0.0 (2019-08-06)
Add the first version of the mine_xml and mine_trxml
Hashes for xml_miner-0.0.4-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6b615e02e5b1178e9a0d1c69d7308c19edc211eb492d426ddc6d80a79d912d59
MD5 | 3caedc887077398e30226a84233d0873
BLAKE2b-256 | 9c43df4bd152238cd080774637f236bc0d98668ba02eb0945da069c556e04845