
Open Source e-Discovery and Information Retrieval Engine

FreeDiscovery is built on top of existing machine learning libraries (scikit-learn) and provides a REST API for information retrieval applications. It aims to benefit existing e-Discovery and information retrieval platforms with a focus on text categorization, semantic search, document clustering, duplicate detection and e-mail threading.

In addition, FreeDiscovery can be used as a Python package and exposes several estimators with a scikit-learn compatible API.
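A scikit-learn compatible API means the estimators follow the familiar fit / transform / predict convention. The sketch below illustrates that convention with plain scikit-learn classes (which FreeDiscovery builds on), not FreeDiscovery's own estimators:

```python
# Illustration of the scikit-learn estimator convention
# (fit / fit_transform), shown with a plain scikit-learn class.
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
X = vect.fit_transform(["hello world", "hello again"])
print(X.shape)  # (2, 3) — 2 documents, 3 distinct terms
```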

Installation

FreeDiscovery requires Python 3.5+ and can be installed with conda: conda install -c conda-forge freediscovery

Alternatively, to install with pip,

  1. Install scipy and numpy

  2. Run pip install freediscovery[all]

Running the server

  • freediscovery run

  • To check that the server started successfully, run curl -X GET http://localhost:5001/

Quick start

  1. Install FreeDiscovery and start the server (see above)

  2. Download the 20_newsgroups dataset: freediscovery download 20_newsgroups

1. Data ingestion

  1. Create a new vectorized dataset with curl -X POST http://localhost:5001/api/v0/feature-extraction and save the returned hexadecimal id for later use with export FD_DATASET_ID=<returned-id>.

  2. Ingest the dataset,

    curl -X POST -H 'Content-Type: application/json' -d '{
       "data_dir": "./20_newsgroups/"
    }'  http://localhost:5001/api/v0/feature-extraction/${FD_DATASET_ID}
  3. Get the mapping between the file_path of individual files and their document_id:

    curl -X POST http://localhost:5001/api/v0/feature-extraction/${FD_DATASET_ID}/id-mapping > ./fd_id_mapping.txt

    and save the results.
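The same ingestion call can be issued from Python with the standard library. The endpoint path and Content-Type header come from the curl command above; the dataset id and directory are placeholders:

```python
import json
import urllib.request

API = "http://localhost:5001/api/v0"

def ingest_request(dataset_id, data_dir):
    """Build the ingestion POST request for a vectorized dataset."""
    payload = json.dumps({"data_dir": data_dir}).encode()
    return urllib.request.Request(
        f"{API}/feature-extraction/{dataset_id}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = ingest_request("cf6f2d28", "./20_newsgroups/")
# urllib.request.urlopen(req) would send it to a running server
```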

2. Latent Semantic Indexing (LSI)

The creation of an LSI index is necessary for clustering, nearest-neighbor classification, semantic search and near-duplicate detection,

curl -X POST -H 'Content-Type: application/json' -d "{
   \"parent_id\": \"${FD_DATASET_ID}\"
}"  http://localhost:5001/api/v0/lsi/

Save the returned id for later use with export FD_LSI_ID=<returned-id>.
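Conceptually, an LSI index is TF-IDF vectorization followed by a truncated SVD that projects documents into a low-dimensional semantic space. A minimal sketch with plain scikit-learn (toy documents invented for illustration; not FreeDiscovery's internal code):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "the rocket launch was delayed",
    "the senate debated the new bill",
    "astronauts boarded the space shuttle",
    "voters discussed election policy",
]

# TF-IDF followed by truncated SVD is the classic LSI construction.
lsi = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
embeddings = lsi.fit_transform(docs)
print(embeddings.shape)  # (4, 2): 4 documents in a 2-dimensional semantic space
```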

3. Categorization

Create a categorization model,

curl -X POST -H 'Content-Type: application/json' -d "{
   \"parent_id\": \"${FD_DATASET_ID}\",
   \"method\": \"LogisticRegression\",
   \"data\": [{\"document_id\": 14000, \"category\": \"sci.space\"},
              {\"document_id\": 14003, \"category\": \"sci.space\"},
              {\"document_id\": 18780, \"category\": \"talk.politics.misc\"},
              {\"document_id\": 18784, \"category\": \"talk.politics.misc\"}
              ],
   \"training_scores\": true
 }"  http://localhost:5001/api/v0/categorization/

Save the returned id for later use with export FD_CAT_ID=<returned-id>.

Predictions for the other documents in the dataset can then be retrieved with,

curl -X GET -H 'Content-Type: application/json' -d "{
   \"max_results\": 10, \"max_result_categories\": 2, \"sort_by\": \"sci.space\"
 }"  http://localhost:5001/api/v0/categorization/${FD_CAT_ID}/predict

The correspondence of these results with ground truth categories can be checked in fd_id_mapping.txt.
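Checking that correspondence amounts to comparing the predicted category of each document against its ground-truth label; a toy sketch (label values invented for illustration):

```python
# Toy comparison of predicted vs. ground-truth categories
# (values are invented for illustration).
predicted = ["sci.space", "sci.space", "talk.politics.misc", "sci.space"]
truth     = ["sci.space", "talk.politics.misc", "talk.politics.misc", "sci.space"]

accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
print(accuracy)  # 0.75
```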

4. Hierarchical clustering

Create a Birch hierarchical clustering model,

curl -X POST -H 'Content-Type: application/json' -d "{
   \"parent_id\": \"${FD_LSI_ID}\",
   \"min_similarity\": 0.7, \"max_tree_depth\": 2
 }"  http://localhost:5001/api/v0/clustering/birch/

Save the returned id for later use with export FD_BIRCH_ID=<returned-id>.

Finally retrieve the computed hierarchical clusters,

curl -X GET http://localhost:5001/api/v0/clustering/birch/${FD_BIRCH_ID}
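The Birch algorithm used here builds a hierarchical clustering-feature tree over the (LSI) document vectors. A minimal sketch with scikit-learn's Birch on invented 2-D points, standing in for the LSI embeddings:

```python
import numpy as np
from sklearn.cluster import Birch

# 2-D toy points in two well-separated groups (invented data,
# standing in for LSI document vectors).
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])

model = Birch(n_clusters=2, threshold=0.5)
labels = model.fit_predict(X)
print(labels)  # two cluster labels, one per document
```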

See http://freediscovery.io/doc/stable/examples/ for more complete examples.

We would very much appreciate feedback on the existing functionality. Feel free to open new issues on GitHub or send comments to the mailing list at https://groups.google.com/forum/#!forum/freediscovery-ml.

Documentation

For more information, see the documentation and API reference.

Licence

FreeDiscovery is released under the 3-clause BSD licence.

https://freediscovery.github.io/static/grossmanlabs-old-logo-small.gif
