A data mining tool for extracting text from files
Project description
aTXT
A data mining tool for extracting text from files.
Meta
Author: Jonathan S. Prieto C.
Email: prieto.jona@gmail.com
Notes: Have feedback? Please send me an email.
Free software: BSD license
Requirements
This software exists thanks to other open source projects. The following list names some of them:
PySide (GUI lib)
Tesseract OCR
Xpdf
lxml (doc files)
scandir (fast folder traversal)
docx
Installation
pip install atxt
Check the dependencies to avoid surprises:
aTXT --check
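To illustrate what a dependency check of this kind might look for, here is a minimal sketch that tests whether a few external programs are on the PATH. It is not aTXT's own implementation: the `EXTERNAL_TOOLS` mapping and the `find_missing_tools` helper are hypothetical names, and the executable names (`tesseract`, `pdftotext` from Xpdf, `antiword`) are assumptions about a typical installation.

```python
import shutil

# Hypothetical mapping of feature -> executable name. The list that
# `aTXT --check` really inspects may differ.
EXTERNAL_TOOLS = {
    "OCR (Tesseract)": "tesseract",
    "PDF text (Xpdf)": "pdftotext",
    "legacy .doc (antiword)": "antiword",
}


def find_missing_tools(tools=EXTERNAL_TOOLS, which=shutil.which):
    """Return the sorted feature names whose executable is not on PATH."""
    return sorted(name for name, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external tools: " + ", ".join(missing))
    else:
        print("All external tools were found.")
```

The `which` parameter is only there so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.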
Show the command-line help:
aTXT -h
Usage
In every case, you can run aTXT by its package name or, more simply, as 2txt:
2txt -h
You can use the graphical interface (if you have installed PySide):
aTXT -i
You should see something like this:
Note: aTXT will always generate a FILE for each file path.
Examples:
$ 2txt prueba.html
$ 2txt prueba.html -o
$ 2txt --file ~/Documents/prueba.html
$ 2txt --file ~/Documents/prueba.html --to ~/htmls
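If you want to drive these commands from a script, the following sketch wraps the `--file` and `--to` options shown above with `subprocess`. The helper names `build_2txt_command` and `convert` are hypothetical, and the sketch assumes that `2txt` is on the PATH and returns a conventional exit code.

```python
import os
import subprocess


def build_2txt_command(source_file, output_dir=None):
    """Build the argument list for a call such as `2txt --file F --to DIR`."""
    # Expand "~" here because no shell is involved to do it for us.
    cmd = ["2txt", "--file", os.path.expanduser(source_file)]
    if output_dir is not None:
        cmd += ["--to", os.path.expanduser(output_dir)]
    return cmd


def convert(source_file, output_dir=None):
    """Run 2txt on one file and return its exit code (assumes 2txt is installed)."""
    return subprocess.call(build_2txt_command(source_file, output_dir))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.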
Search for all convertible files under ~, up to a depth of 2:
$ 2txt ~ -d 2
$ 2txt --path ~ -d 2 --format 'txt,html'
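The idea behind a depth-limited search can be sketched with the standard library's `os.scandir` (available since Python 3.5; the scandir package listed in the requirements is its backport). This is an illustration only, not aTXT's own walker: `walk_limited` is a hypothetical name, and the sketch assumes that files directly inside the starting folder are at depth 0.

```python
import os


def walk_limited(root, max_depth, extensions):
    """Yield files under `root`, at most `max_depth` folders down,
    whose extension is listed in `extensions` (e.g. ["txt", "html"])."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    stack = [(root, 0)]  # (folder, depth of that folder)
    while stack:
        folder, depth = stack.pop()
        try:
            entries = list(os.scandir(folder))
        except OSError:
            continue  # unreadable folder: skip it
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Descend only while the depth limit allows it.
                if depth < max_depth:
                    stack.append((entry.path, depth + 1))
            elif entry.is_file():
                ext = os.path.splitext(entry.name)[1].lstrip(".").lower()
                if ext in wanted:
                    yield entry.path
```

For example, `list(walk_limited(os.path.expanduser("~"), 2, ["txt", "html"]))` collects the candidate files without descending more than two folders below the home directory.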
Problems, Bugs?
Please feel free to report any issue or installation problem on the Issues page: http://github.com/d555/python-atxt/issues
Changelog
1.0.5.3 (2015-07-03)
“fix bugs suggested by landscape.io”
1.0.5.2 (2015-07-02)
This version is mostly about Windows support:
support for .doc on Windows with antiword
rewrite the directory-walking method, based on a modified version of scandir
fix a bug on Windows when performing a depth-limited search
fix a bug in workers/run_path.py that caused the method to be called twice
fix bugs with encoding and with the arguments passed to subprocess
fix a bug with logging and colors in the Windows terminal
1.0.5.1 (2015-06-30)
fix some bugs in the GUI
1.0.5 (2015-03-01)
First release on PyPI.