A set of tools to retrieve information from filepaths.

Path2Insight

Path2Insight (p2i) is a modular and scalable Python module that offers a unified and comprehensive set of processing tools for analyzing file paths. P2i supports static file system analysis without requiring access to the original physical storage: a scan of the storage's content, exported as a text file, suffices to explore the saved resources. There is also no need to access the content of the files, because p2i imports file paths as strings.
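
To illustrate the idea of working from a text export rather than the storage itself, here is a minimal sketch that uses only the standard library pathlib module, not the p2i API. The listing content and the read_paths helper are hypothetical; it assumes an export with one POSIX path per line.

```python
import io
from pathlib import PurePosixPath

# Hypothetical export: one absolute path per line (e.g. the output of `find`).
# An in-memory buffer stands in for the text file here.
listing = io.StringIO(
    "/data/projects/a/readme.txt\n"
    "/data/projects/a/raw/sample_01.fastq.gz\n"
    "\n"
    "/data/projects/b/notes.md\n"
)

def read_paths(handle):
    """Parse non-empty lines into pure path objects (no file system access)."""
    return [PurePosixPath(line.strip()) for line in handle if line.strip()]

paths = read_paths(listing)
print(len(paths))       # 3
print(paths[1].name)    # sample_01.fastq.gz
print(paths[1].suffix)  # .gz
```

Pure paths never touch the disk, so the same code works on a machine that has no access to the scanned storage.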

Once loaded, the file paths are stored in memory as a Python object, enabling preprocessing, text processing and descriptive analysis of folders and files.

Preprocessing: Sample, sort and select files based on multiple criteria (e.g. parent folder, depth).
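
The kind of preprocessing described above can be sketched with the standard library alone. This is not the p2i implementation; the example paths, the depth helper and its definition of depth (components below the root, file name included) are assumptions made for illustration.

```python
import random
from pathlib import PurePosixPath

paths = [PurePosixPath(p) for p in (
    "/vol/release/fasta/homo_sapiens/dna/chr1.fa.gz",
    "/vol/release/fasta/mus_musculus/dna/chr1.fa.gz",
    "/vol/release/gtf/homo_sapiens/genes.gtf.gz",
    "/vol/release/README",
)]

def depth(path):
    """Number of components below the root, counting the file name itself."""
    return len(path.parts) - 1

# Sort: shallowest paths first.
by_depth = sorted(paths, key=depth)

# Select on a parent folder anywhere above the file.
fasta_root = PurePosixPath("/vol/release/fasta")
under_fasta = [p for p in paths if fasta_root in p.parents]

# Select on depth.
deep = [p for p in paths if depth(p) >= 5]

# Sample: a seeded generator keeps the draw reproducible.
sample = random.Random(0).sample(paths, 2)
```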

Text processing: Chunk file paths into tokens (from the full path, the stem or the name), n-grams or complete paths with the help of several extensible tokenizers. In addition, taggers offer the option to aggregate files based on their structure and content, which prepares paths for further analysis such as entity recognition or classification tasks.
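
As a rough sketch of what tokenizing a path and building n-grams can look like, the snippet below uses a simple regular expression splitter. The tokenize and ngrams helpers are hypothetical and are not the tokenizers shipped with p2i, whose splitting rules may differ.

```python
import re
from pathlib import PurePosixPath

def tokenize(text):
    """Split on runs of non-alphanumeric characters and lowercase the tokens."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

path = PurePosixPath(
    "/Volumes/release-90/variation/VEP/microcebus_murinus_vep_90_Mmur_2.0.tar.gz"
)

name_tokens = tokenize(path.name)       # tokens from the file name only
path_tokens = tokenize(str(path))       # tokens from the full path
bigrams = ngrams(name_tokens, 2)

print(name_tokens[:3])  # ['microcebus', 'murinus', 'vep']
print(bigrams[0])       # ('microcebus', 'murinus')
```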

Descriptive analysis: P2i implements counters for tokens, stems and extensions. It also supports statistical features such as χ² (chi-squared) tests on the distribution of extensions, stems and names. In addition, folder-depth analysis functions give a picture of the complexity of the folder structure.
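
To make the counters and the χ² idea concrete, here is a small standard library sketch. It is an illustration under stated assumptions, not the p2i implementation: the paths are made up, chi2_statistic is a hypothetical helper, and only the Pearson statistic is computed (no p-value).

```python
from collections import Counter
from pathlib import PurePosixPath

paths = [PurePosixPath(p) for p in (
    "/a/x/f1.gz", "/a/x/f2.gz", "/a/x/f3.txt",
    "/a/y/g1.gz", "/a/y/g2.txt", "/a/y/g3.txt",
)]

# Counters for extensions and folder depth.
extension_counts = Counter(p.suffix for p in paths)
depth_counts = Counter(len(p.parts) - 1 for p in paths)

def chi2_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Contingency table: rows are parent folders, columns are extensions.
folders = sorted({p.parent for p in paths})
extensions = sorted(extension_counts)
table = [
    [sum(1 for p in paths if p.parent == f and p.suffix == e) for e in extensions]
    for f in folders
]

print(table)  # [[2, 1], [1, 2]]
```

A large statistic relative to the degrees of freedom would suggest that extensions are distributed differently across the folders.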

The table below shows how Path2Insight differs from and complements the functionality offered by the lower-level Python modules pathlib and os.path.

Functionality | P2i | Pathlib | os.path
Preprocessing | Pathlib + sampling, sorting, selection | match, joinpath | normcase, normpath
Descriptive statistics | Counters: stem, extension, name. Taggers. Tokenizers | os.stat | os.stat
Text processing | Pathlib + tokens, n-grams, taggers, lower, upper, … | Stem, name, parent, extension, drive, … | Split
Access or modify information on the system | No, but can be linked to additional metadata (datetimes, users) by joining on the full path | Yes: chmod, current folder, … | Yes: user, size, datetimes, descriptors, …
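
The last row of the table mentions joining additional metadata on the full path. A minimal sketch of such a join with plain dictionaries could look as follows; the metadata values and field names are hypothetical.

```python
from pathlib import PurePosixPath

paths = [PurePosixPath("/data/a.txt"), PurePosixPath("/data/b/c.csv")]

# Hypothetical metadata export, keyed by the full path as a string.
metadata = {
    "/data/a.txt": {"owner": "alice", "modified": "2017-09-01"},
}

# Left join: every path is kept; metadata fields are added when available.
joined = [
    {"path": str(p), "extension": p.suffix, **metadata.get(str(p), {})}
    for p in paths
]

print(joined[0]["owner"])  # alice
```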

P2i is a dependency-free (only pathlib2 is required for Python 2.7 users), fast and scalable path processing toolkit. It works alongside the major Python data analysis modules, such as pandas, scikit-learn, nltk and matplotlib, to extend the analytical possibilities of Path2Insight.

Example

Import the module and load a demo dataset with static file paths (or use path2insight.walk to collect paths from your file system).

>>> import path2insight
>>> from path2insight.datasets import load_ensembl

>>> filepaths = load_ensembl()
>>> path2insight.depth_counts(filepaths)
Counter({3: 1, 4: 11, 5: 39424, 6: 5543, 7: 2733, 8: 3388})
>>> path2insight.token_counts(filepaths).most_common(10)
[('txt', 31977),
 ('gene', 13798),
 ('ensembl', 12727),
 ('dm', 12500),
 ('homolog', 7380),
 ('fa', 5890),
 ('chromosome', 5011),
 ('feature', 4878),
 ('dna', 4608),
 ('90', 3404)]
>>> path2insight.extension_counts(filepaths).most_common(10)
[('.gz', 44427),
 ('', 3094),
 ('.bb', 847),
 ('.nsq', 349),
 ('.nin', 349),
 ('.nhr', 349),
 ('.tsv', 336),
 ('.psq', 250),
 ('.pin', 250),
 ('.phr', 250)]
>>> path2insight.select_re(filepaths, level5='micro.*')
[PosixFilePath('/Volumes/release-90/variation/VEP/microtus_ochrogaster_vep_90_MicOch1.0.tar.gz'),
 PosixFilePath('/Volumes/release-90/variation/VEP/microtus_ochrogaster_refseq_vep_90_MicOch1.0.tar.gz'),
 PosixFilePath('/Volumes/release-90/variation/VEP/microtus_ochrogaster_merged_vep_90_MicOch1.0.tar.gz'),
 PosixFilePath('/Volumes/release-90/variation/VEP/microcebus_murinus_vep_90_Mmur_2.0.tar.gz'),
 PosixFilePath('/Volumes/release-90/rdf/microtus_ochrogaster/microtus_ochrogaster_xrefs.ttl.gz.graph'),
 ...]
>>> path2insight.distance_on_token(filepaths[0:10])
array([[ 0.        ,  2.        ,  1.41421356,  3.        ,  3.        ],
       [ 2.        ,  0.        ,  2.44948974,  3.31662479,  3.31662479],
       [ 1.41421356,  2.44948974,  0.        ,  3.        ,  3.        ],
       [ 3.        ,  3.31662479,  3.        ,  0.        ,  1.41421356],
       [ 3.        ,  3.31662479,  3.        ,  1.41421356,  0.        ]])
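
The values in the matrix above (1.41421356 is √2, 3.31662479 is √11) are consistent with a Euclidean distance between token-count vectors, although the metric used by distance_on_token is not stated here. Under that assumption, a pure Python sketch of the computation might be as follows; token_vector and euclidean are hypothetical helpers and the paths are made up.

```python
import math
import re
from collections import Counter

def token_vector(path):
    """Bag-of-tokens representation of a path string."""
    return Counter(t.lower() for t in re.split(r"[^0-9A-Za-z]+", path) if t)

def euclidean(a, b):
    """Euclidean distance between two sparse count vectors."""
    # Counter returns 0 for missing keys, so absent tokens count as zero.
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in set(a) | set(b)))

paths = ["/data/gene_a.txt", "/data/gene_b.txt", "/data/dna/chr1.fa"]
vectors = [token_vector(p) for p in paths]

# Symmetric matrix of pairwise distances with zeros on the diagonal.
matrix = [[euclidean(u, v) for v in vectors] for u in vectors]
```

The first two paths differ in a single token on each side, which gives a distance of √2.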

Installation and dependencies

Path2Insight is available on PyPI and can be installed with pip:

pip install path2insight

To upgrade Path2Insight, use:

pip install --upgrade path2insight

Path2Insight is available for Python 2.7 and Python 3.4+. It depends heavily on the pathlib module, which is part of the standard library in Python 3.4 and higher; on Python 2, the backport pathlib2 is used instead. It is therefore advised to use Path2Insight with Python 3.4 or higher.
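
Code that has to run on both Python versions commonly handles the pathlib/pathlib2 split with an import fallback. The snippet below is a general idiom, shown as a sketch; it is not taken from the Path2Insight source.

```python
try:
    from pathlib import PurePosixPath  # standard library on Python 3.4+
except ImportError:
    from pathlib2 import PurePosixPath  # backport, only needed on Python 2.7

p = PurePosixPath("/Volumes/release-90/rdf/README")
print(p.parent.name)  # rdf
```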

Some of the submodules of Path2Insight depend on other Python packages (numpy, pandas, sklearn, scipy, jellyfish). For a full installation, install the packages listed in the requirements-full.txt file:

pip install -r requirements-full.txt
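
To see which of the optional packages are already importable in the current environment, a short check with the standard library can help. This is a convenience sketch and not part of Path2Insight; the import names are taken from the list above.

```python
import importlib.util

# Import names of the optional packages listed above.
OPTIONAL = ("numpy", "pandas", "sklearn", "scipy", "jellyfish")

# find_spec returns None when a top-level module cannot be found.
missing = [name for name in OPTIONAL if importlib.util.find_spec(name) is None]

if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All optional packages are available.")
```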

Cite

Citation information will follow.

Authors

  • Armel Lefebvre
  • Jonathan de Bruin

Download files

Files for path2insight, version 1.0b2

  • path2insight-1.0b2-py2.py3-none-any.whl (1.6 MB), wheel, Python version py2.py3
  • path2insight-1.0b2.tar.gz (1.6 MB), source