PySin

PySin is a toolbox for text retrieval in datasets of unstructured documents. It contains both a multi-type text extractor and a search engine. To test them, you can use the medical document generator that is also provided.

OS Dependencies

You will need geckodriver to run the generator. Download it and copy it to a directory on your PATH (e.g. /usr/local/bin).

Debian, Ubuntu, and friends

sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev

Fedora, Red Hat, and friends

sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config

macOS

brew install pkg-config poppler

Conda users may also need libgcc:

conda install -c anaconda libgcc

Windows

Currently tested only when using conda:

  • Install the Microsoft Visual C++ Build Tools
  • Install poppler through conda:
conda install -c conda-forge poppler

Install

pip install pysin

Search engine

Arguments

The function search takes 5 arguments.

Positional arguments:

  • query : your query
  • input_path : the path to the directory to search in
  • output_path : the path to the directory to put the results in

Keyword arguments:

  • scale : either row or doc, depending on whether the query must be satisfied by a single row or by a whole document. The row scale is more precise, whereas the doc scale is faster. Defaults to row.
  • update_cache : True to update the cached files (for example if some files have been added to the folder since the last search), else False. Defaults to True. If you're working with a huge amount of data that doesn't change, you should set update_cache to False.

To search for the word 'word' in the files of the folder 'path/to/data/' and write the results to the folder 'path/to/results/', run the following command:

from pysin import search
search('word', 'path/to/data/', 'path/to/results/')

Queries

To search for any one of several words, write them side by side in the query:

search('word1 word2 word3', 'path/to/data/', 'path/to/results/')

To search for the files that contain 'mandatory' and also 'foo' or 'bar' (not necessarily both), run the following command:

search('+mandatory foo bar', 'path/to/data/', 'path/to/results/', scale='doc')

The same query works at the row scale. At the doc scale, the previous command might return a document that contains 'mandatory' in the first row and 'foo' in the last one, whereas at the row scale only the occurrences where 'mandatory' AND 'foo' (and/or 'bar') are in the same row are returned.
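The difference between the two scales can be illustrated with a small self-contained sketch. It does not use PySin's own code: the helper `matches` and the sample document are hypothetical, and matching is simplified to case-insensitive substring checks.

```python
def matches(text, mandatory, optional):
    """Return True if text holds every mandatory term and at least one optional term."""
    text = text.lower()
    has_mandatory = all(term in text for term in mandatory)
    has_optional = not optional or any(term in text for term in optional)
    return has_mandatory and has_optional


# Hypothetical three-row document, checked against the query '+mandatory foo bar'.
document = [
    "this row holds the mandatory term",
    "an unrelated row",
    "foo appears only here",
]
mandatory, optional = ["mandatory"], ["foo", "bar"]

# doc scale: the query is checked against the document as a whole.
doc_hit = matches("\n".join(document), mandatory, optional)

# row scale: the query must be satisfied inside a single row.
row_hits = [number for number, row in enumerate(document, start=1)
            if matches(row, mandatory, optional)]

print(doc_hit)   # True
print(row_hits)  # []
```

Here the document matches at the doc scale, but no single row contains both 'mandatory' and one of the optional words, so the row scale finds nothing.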

To search for the rows that contain 'mandatory' but not 'forbidden', run the following command:

search('mandatory -forbidden', 'path/to/data/', 'path/to/results/')

To search for an expression made of several words, use quotes:

search('"complex expression"', 'path/to/data/', 'path/to/results/')

You can combine everything into a single query:

search('+mandatory choice1 choice2 "choice3" -"not this one" +"another mandatory"', 'path/to/data/', 'path/to/results/')
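To make the syntax concrete, here is a rough sketch of how such a query could be split into mandatory, optional and forbidden terms. It is an illustration only, not PySin's actual parser: the function name `parse_query` and the layout of the returned dictionary are assumptions.

```python
import re

# An optional '+' or '-' prefix, followed by a quoted expression or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into mandatory (+), forbidden (-) and optional terms."""
    terms = {"mandatory": [], "optional": [], "forbidden": []}
    for sign, quoted, bare in TOKEN.findall(query):
        term = quoted or bare
        if sign == "+":
            terms["mandatory"].append(term)
        elif sign == "-":
            terms["forbidden"].append(term)
        else:
            terms["optional"].append(term)
    return terms


parsed = parse_query(
    '+mandatory choice1 choice2 "choice3" -"not this one" +"another mandatory"'
)
print(parsed["mandatory"])  # ['mandatory', 'another mandatory']
print(parsed["optional"])   # ['choice1', 'choice2', 'choice3']
print(parsed["forbidden"])  # ['not this one']
```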

Results

When a search is run, a folder is created at output_path containing two files:

  • results.csv : in row scale, each row corresponds to one occurrence and contains the path to the file, the row number of the occurrence and its context. In doc scale, it contains only the paths to the matching files.
  • folders.json : gives the number of occurrences in each folder, using a tree structure
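The exact layout of folders.json is not specified here. As an illustration of counting occurrences per folder in a tree structure, the sketch below builds a nested dictionary from a list of file paths. The key names `count` and `children` and the function `count_by_folder` are hypothetical and may differ from what PySin writes.

```python
import json
from pathlib import PurePosixPath


def count_by_folder(paths):
    """Build a nested dict counting how many occurrences fall under each folder."""
    root = {"count": 0, "children": {}}
    for path in paths:
        root["count"] += 1
        node = root
        # Walk the folders of the path (the file name itself is skipped).
        for folder in PurePosixPath(path).parent.parts:
            node = node["children"].setdefault(folder, {"count": 0, "children": {}})
            node["count"] += 1
    return root


# One entry per occurrence, as in the row scale of results.csv.
occurrences = ["data/a/report.txt", "data/a/report.txt", "data/b/notes.md"]
tree = count_by_folder(occurrences)
print(json.dumps(tree, indent=2))
```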

Extractor

The extractor preprocesses the files to make them searchable, converting each handled file into a cached txt file. The handled types are csv, doc, docx, html, md, pdf, rtf, txt, xml.

To extract all the files within the folder 'path/to/data', just run:

extract('path/to/data')

To erase all the cached files, just run:

reset_cache('path/to/data')
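To preview which files in a folder belong to the handled types before extracting, something like the following sketch may help. It only inspects file extensions and is independent of PySin; the helper `find_handled_files` is hypothetical, and the extractor itself may select files differently.

```python
import tempfile
from pathlib import Path

# File types listed above as handled by the extractor.
HANDLED_TYPES = {"csv", "doc", "docx", "html", "md", "pdf", "rtf", "txt", "xml"}


def find_handled_files(folder):
    """Return the files under folder whose extension is a handled type, sorted by path."""
    return sorted(
        path for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower().lstrip(".") in HANDLED_TYPES
    )


# Small demonstration on a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "letter.PDF").touch()   # extension check is case-insensitive
    (Path(tmp) / "notes.txt").touch()
    (Path(tmp) / "photo.png").touch()            # not a handled type
    found = [path.name for path in find_handled_files(tmp)]

print(found)
```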

Medical prescriptions generator

The generator is based on the data of the faker module. It can generate both medical prescriptions and medical reports. To generate 19 fake medical documents in the folder 'path/to/data', just run the following command:

generate(19, 'path/to/data')

Command-line mode

The search engine and the extractor can also be used as command-line programs. For the search engine, just run the following command:

$ python src/search.py +mandatory choice1 choice2 "choice3" -"not this one" +"another mandatory" --input_path path/to/data/ --output_path path/to/results/

To search at the doc scale, just add the argument --d.
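If you launch the script from Python rather than from a shell, passing the command as a list avoids shell quoting issues: the shell strips the quotes, so each query term reaches the script as a single argument. The sketch below only builds that list, using the flags shown above. The helper `build_search_command` is hypothetical, and the script path assumes you run it from the project root.

```python
import sys


def build_search_command(terms, input_path, output_path, doc_scale=False):
    """Build the argument list for src/search.py; each query term is one argument."""
    command = [sys.executable, "src/search.py", *terms,
               "--input_path", input_path, "--output_path", output_path]
    if doc_scale:
        command.append("--d")  # flag described above for the doc scale
    return command


command = build_search_command(
    ["+mandatory", "choice1", "-not this one"],
    "path/to/data/", "path/to/results/", doc_scale=True,
)
print(command)

# To actually run it:
#   import subprocess
#   subprocess.run(command, check=True)
```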

The extractor can be used like this:

$ python src/extractor.py path/to/data/

To clear the cached files, just add the argument --reset:

$ python src/extractor.py --reset path/to/data/

Trick

If you run many searches on the same folder, say /absolute/path/to/data/, always write the results to the same folder, say /absolute/path/to/results/, and always use the same scale, say row, you can create a shortcut by running the following commands:

$ echo alias search=\'python /absolute/path/to/search.py --input_path /absolute/path/to/data/ --output_path /absolute/path/to/results\' >> ~/.bashrc
$ source ~/.bashrc

Then, you can run a search from any location by typing:

$ search +mandatory choice1 choice2 "choice3" -"not this one" +"another mandatory"

WARNING: before doing this, make sure that no command or alias named search already exists, for example by running search and checking that the shell reports an error such as the following (the exact message depends on your system):

ModuleNotFoundError: No module named 'apt_pkg'
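A more direct check for an existing executable can be sketched with the standard library. Note the limitation: `shutil.which` only looks for executables on the PATH and does not see shell aliases or functions, which bash reports with `type search`. The helper name `command_exists` is hypothetical.

```python
import shutil


def command_exists(name):
    """Return True if an executable called name is found on the PATH."""
    return shutil.which(name) is not None


# False means no executable named 'search' would be shadowed by the alias.
print(command_exists("search"))
```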

Example

You can test this module using the example.py script.

TODO

  • multithreaded search
  • improve medication notation
  • new document types
  • adapt .doc extraction to the Windows environment

Publish

First, you need to have twine installed:

pip install --user --upgrade twine

Make sure you have bumped the version number in setup.py, then run the following:

python setup.py sdist bdist_wheel
python -m twine upload dist/*
