A data mining tool for extracting data from batches of XML files.
XML/TRXML Selector
Description
This package provides two scripts: mine-xml and mine-trxml.
mine-xml selects tags from xml/mxml files and saves the selected values to a file.
mine-trxml selects fields from trxml/mtrxml files and saves the selected values to a file.
Requirements
Python 3.6+
Installation
pip install xml-miner
Usage
Use the xml selector script
The xml selector supports:
- one or more tag names:
  - the selector can be a single tag name, e.g. name
  - or comma-separated tag names, e.g. langskill,compskill,softskill
- multiple sources:
  - e.g. select from an xml directory, xml files, an mxml file, or directly from an annotation server

Examples:
    # select from xml directory
    mine-xml --source tests/xmls/ --selector name --output_file name.tsv
    mine-xml --source tests/xmls/ --selector langskill,compskill,softskill --output_file skill.tsv --with_field_name
    # select from xml file or mxml file
    mine-xml --source tests/sample.mxml --selector experience --output_file experience.tsv
    # select directly from annotation server
    mine-xml --source localhost:50249 --selector name --output_file name.tsv --query "set Data2018"
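To make the idea of tag selection concrete, here is a minimal sketch of what selecting one or more tag names from a single XML document could look like with the standard library. The function name `select_tags` and the sample document are hypothetical and are not part of this package's API; the real script may work differently.

```python
import xml.etree.ElementTree as ET


def select_tags(xml_text, selector):
    """Return (tag, text) pairs for every element whose tag is in the selector.

    `selector` is a comma-separated string of tag names, mirroring the
    --selector option. This is an illustrative helper, not the package's code.
    """
    tagnames = {name.strip() for name in selector.split(",") if name.strip()}
    root = ET.fromstring(xml_text)
    matches = []
    # root.iter() walks the whole tree in document order.
    for elem in root.iter():
        if elem.tag in tagnames:
            # elem.text is None for empty elements, so fall back to "".
            matches.append((elem.tag, (elem.text or "").strip()))
    return matches


# Hypothetical sample document, used only for this sketch.
SAMPLE = (
    "<doc><name>Chao Li</name>"
    "<skills><compskill>java</compskill><langskill>dutch</langskill></skills>"
    "</doc>"
)

print(select_tags(SAMPLE, "name"))
print(select_tags(SAMPLE, "langskill,compskill"))
```

Matches come back in document order, regardless of the order of the tag names in the selector.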
Use the trxml selector script
The trxml selector supports:
- one or more selectors:
  - the selector can be a single field: name.0.name
  - or comma-separated fields: name.0.name,address.0.address
- single or multiple items:
  - select a field from one item, e.g. experienceitem.3.experience
  - or select a field value from all items, e.g. experienceitem.experience (or experienceitem.*.experience)
- multiple sources:
  - e.g. select from a trxml directory, trxml files, or an mtrxml file

Examples:
    # one selector, single item
    mine-trxml --source tests/trxmls/ --selector name.0.name --output_file name.tsv
    # one selector, multiple items
    mine-trxml --source tests/sample.mxml --selector experienceitem.experience --output_file experience.tsv
    # more selectors, single item
    mine-trxml --source tests/trxmls/ --selector name.0.name,address.0.address,phone.0.phone --output_file personal.tsv
    # more selectors, multiple items
    mine-trxml --source tests/sample.mxml --itemgroup experienceitem --fields experience,experiencedate --output_file experience.tsv
    mine-trxml --source tests/sample.mxml --selector experienceitem.*.experience,experienceitem.*.experiencedate --output_file experience.tsv
    mine-trxml --source tests/sample.mxml --selector experienceitem.experience,experienceitem.experiencedate --output_file experience.tsv
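The selector syntax described above (itemgroup.index.field, itemgroup.field, or itemgroup.*.field) can be illustrated with a small parser. This is only a sketch of the syntax as documented here; the `Selector` type and `parse_selector` function are hypothetical names, not the package's actual implementation.

```python
from typing import List, NamedTuple, Optional


class Selector(NamedTuple):
    itemgroup: str
    index: Optional[int]  # None means "all items in the item group"
    field: str


def parse_selector(text: str) -> Selector:
    """Parse one selector such as 'name.0.name' or 'experienceitem.*.experience'."""
    parts = text.strip().split(".")
    if len(parts) == 2:
        # itemgroup.field is treated as shorthand for itemgroup.*.field
        itemgroup, field = parts
        return Selector(itemgroup, None, field)
    if len(parts) == 3:
        itemgroup, index, field = parts
        if index == "*":
            return Selector(itemgroup, None, field)
        if index.isdigit():
            return Selector(itemgroup, int(index), field)
    raise ValueError(f"unsupported selector: {text!r}")


def parse_selectors(text: str) -> List[Selector]:
    """Parse a comma-separated list of selectors, as passed to --selector."""
    return [parse_selector(part) for part in text.split(",") if part.strip()]


print(parse_selector("name.0.name"))
print(parse_selectors("experienceitem.experience,experienceitem.*.experiencedate"))
```

Representing "all items" as `index=None` makes the two spellings experienceitem.experience and experienceitem.*.experience compare equal after parsing.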
Development
To install package and its dependencies, run the following from project root directory:
python setup.py install
To work on the code and develop the package, run the following from the project root directory:
python setup.py develop
To run unit tests, execute the following from the project root directory:
python setup.py test
Selector and output details:
mine-xml:
input: documents, selector(s), output
output:
default (parameter with_field_name not set): filename, field_value
e.g. select all names with selector name

filename | value |
---|---|
xxxx | Chao Li |

parameter with_field_name set: filename, field_value, field_name
e.g. select skills with selector compskill,langskill,otherskill

filename | value | field |
---|---|---|
xxxx | java | compskill |
xxxx | dutch | langskill |
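The two mine-xml output layouts can be sketched as follows. This assumes the output is tab-separated (suggested by the .tsv file names in the examples) and that no header row is written; both are assumptions, and `write_rows` is a hypothetical helper rather than the package's own code.

```python
import csv
import io


def write_rows(matches, out, with_field_name=False):
    """Write (filename, field_name, value) matches as tab-separated rows.

    Default layout:          filename, value
    With with_field_name:    filename, value, field_name
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for filename, field_name, value in matches:
        row = [filename, value]
        if with_field_name:
            row.append(field_name)
        writer.writerow(row)


# Illustrative data matching the tables above.
matches = [("xxxx", "compskill", "java"), ("xxxx", "langskill", "dutch")]

default_out = io.StringIO()
write_rows(matches, default_out)
print(default_out.getvalue())

named_out = io.StringIO()
write_rows(matches, named_out, with_field_name=True)
print(named_out.getvalue())
```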
mine-trxml:
input:
documents, selector(s), output, or
documents, itemgroup, fields, output
single selector:
single item (name.0.name): filename, field

filename | name.0.name |
---|---|
xxxx | Chao Li |

multiple items (skill.*.skill): filename, item_index, field

filename | item_index | field |
---|---|---|
xxxx | 0 | java |
xxxx | 1 | dutch |

multiple selectors:
single item: filename, field1, field2 …
each selector points to a field of a specific item with a numeric index, e.g. name.0.lastname,name.0.firstname,address.0.country

filename | name.0.lastname | name.0.firstname | address.0.country |
---|---|---|---|
xxxx | Li | Chao | China |
xxxx | Lee | Richard | USA |

multiple items: filename, item_index, field1, field2 …
each selector points to a field from all items in an itemgroup, e.g. skill.skill,skill.type,skill.date

filename | item_index | skill | type | date |
---|---|---|---|---|
xxxx | 0 | java | compskill | 2001-2005 |
xxxx | 1 | dutch | langskill | 2002- |
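As an illustration of the multi-item layout (filename, item_index, field1, field2 …), the sketch below builds one row per item from data that has already been extracted. The list-of-dicts representation of an item group and the function `rows_for_itemgroup` are assumptions made for this example only; they say nothing about how the package parses trxml internally.

```python
def rows_for_itemgroup(filename, items, fields):
    """Build one output row per item: [filename, item_index, field1, field2, ...].

    `items` is a list of dicts, one per item in the item group.
    Fields missing from an item are emitted as empty strings.
    """
    rows = []
    for item_index, item in enumerate(items):
        rows.append([filename, item_index] + [item.get(name, "") for name in fields])
    return rows


# Illustrative items matching the skill table above.
skill_items = [
    {"skill": "java", "type": "compskill", "date": "2001-2005"},
    {"skill": "dutch", "type": "langskill", "date": "2002-"},
]

for row in rows_for_itemgroup("xxxx", skill_items, ["skill", "type", "date"]):
    print(row)
```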
0.0.5 (2019-10-14)
bug fix: an ElementTree XPath find returns None when the value is an empty string; such values are now restored to an empty string
0.0.4 (2019-09-11)
bug fix: reading always uses UTF-8, and does not continue reading if the encoding of one document fails
0.0.3 (2019-08-11)
expand miner.py module to generate matched phrases per doc
0.0.2 (2019-08-09)
added support for CI
0.0.1 (2019-08-09)
make two scripts: mine-xml and mine-trxml
0.0.0 (2019-08-06)
Add the first version of the mine_xml and mine_trxml
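The behaviour behind the 0.0.5 fix can be reproduced with the standard library: for an element with no content, `Element.text` is None rather than an empty string. The snippet below is a minimal sketch of that behaviour and of one way to normalise it; it is not the package's actual patch.

```python
import xml.etree.ElementTree as ET

# An element that is present but empty.
root = ET.fromstring("<doc><name></name></doc>")
elem = root.find("name")

raw = elem.text  # None, because the element has no text content
value = raw if raw is not None else ""  # normalise to an empty string

print(repr(raw), repr(value))
```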