Open source software for E-Discovery and Information Retrieval
Open Source e-Discovery and Information Retrieval Engine
FreeDiscovery is built on top of existing machine learning libraries (scikit-learn) and provides a REST API for information retrieval applications. It aims to benefit existing e-Discovery and information retrieval platforms with a focus on text categorization, semantic search, document clustering, duplicate detection and e-mail threading.
In addition, FreeDiscovery can be used as a Python package and exposes several estimators with a scikit-learn compatible API.
Installation
FreeDiscovery requires Python 3.5+ and can be installed with conda: conda install -c conda-forge freediscovery
Alternatively, to install with pip,
Install scipy and numpy
Run pip install freediscovery[all]
Running the server
freediscovery run
To check that the server started successfully, run curl -X GET http://localhost:5001/
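The same check can be scripted. The following is a minimal Python sketch that assumes only the URL shown above; the helper name is_server_up is hypothetical and not part of FreeDiscovery.

```python
import urllib.error
import urllib.request


def is_server_up(base_url="http://localhost:5001/", timeout=2.0):
    """Return True if anything answers HTTP at base_url, False otherwise."""
    # The server is local, so bypass any HTTP proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is running.
        return True
    except OSError:
        # Connection refused, timeout or DNS failure (URLError is an OSError).
        return False
```

Calling is_server_up() before the quick start steps gives an early, explicit failure instead of a connection error in the middle of a later request.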
Quick start
Install FreeDiscovery and start the server (see above)
Download the 20_newsgroups dataset: freediscovery download 20_newsgroups
1. Data ingestion
Create a new vectorized dataset with curl -X POST http://localhost:5001/api/v0/feature-extraction and save the returned hexadecimal id for later use with export FD_DATASET_ID=<returned-id>.
Ingest the dataset,
curl -X POST -H 'Content-Type: application/json' -d '{ "data_dir": "./20_newsgroups/" }' http://localhost:5001/api/v0/feature-extraction/${FD_DATASET_ID}
Get the mapping between the file_path of individual files and their document_id, and save the results:
curl -X POST http://localhost:5001/api/v0/feature-extraction/${FD_DATASET_ID}/id-mapping > ./fd_id_mapping.txt
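For scripted ingestion, the request from the curl command above can be built with the Python standard library. This is a sketch rather than an official client: the function name ingest_request is hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5001/api/v0"  # server address used in this guide


def ingest_request(dataset_id, data_dir):
    """Build (without sending) the POST request that ingests data_dir."""
    body = json.dumps({"data_dir": data_dir}).encode("utf-8")
    return urllib.request.Request(
        "{}/feature-extraction/{}".format(BASE_URL, dataset_id),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it once the server is running:
#     req = ingest_request(dataset_id, "./20_newsgroups/")
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode("utf-8"))
```

Building the body with json.dumps avoids the manual quote escaping needed in the shell version.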
2. Latent Semantic Indexing (LSI)
An LSI index is required for clustering, nearest neighbor classification, semantic search and near-duplicate detection. Create one with,
curl -X POST -H 'Content-Type: application/json' -d "{
\"parent_id\": \"${FD_DATASET_ID}\"
}" http://localhost:5001/api/v0/lsi/
Save the returned id for later use with export FD_LSI_ID=<returned-id>.
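Each step refers to the output of an earlier one through parent_id, so the chaining can be automated instead of copying ids by hand. The sketch below assumes that a creation response is a JSON object with an "id" key, which this guide does not state explicitly; child_payload is a hypothetical helper name.

```python
import json


def child_payload(parent_response_text, **params):
    """Build a request body whose parent_id comes from an earlier response.

    Assumption: the earlier response is a JSON object with an "id" key.
    """
    parent_id = json.loads(parent_response_text)["id"]
    body = {"parent_id": parent_id}
    body.update(params)
    return json.dumps(body, sort_keys=True)


# Body of the LSI request, derived from a made-up dataset-creation response.
lsi_body = child_payload('{"id": "0a1b2c3d"}')
```

Extra keyword arguments are merged into the body, so the same helper can serve any later step that takes a parent_id plus its own parameters.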
3. Semantic search
Search in the semantic space can be performed with,
curl -X POST -H 'Content-Type: application/json' -d "{
\"parent_id\": \"${FD_LSI_ID}\",
\"query\": \"Jupyter moon\", \"max_results\": 10
}" http://localhost:5001/api/v0/search/
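Free-text queries may contain quotes or other characters that are awkward to escape inside a shell string. As a sketch, the search body can be serialized in Python instead; search_body is a hypothetical name, and the fields are the ones used in the curl command above.

```python
import json


def search_body(lsi_id, query, max_results=10):
    """Serialize a semantic search request; json.dumps escapes the query."""
    return json.dumps(
        {"parent_id": lsi_id, "query": query, "max_results": max_results}
    )


# A query containing double quotes survives the round trip unchanged.
body = search_body("<lsi-id>", 'the "red planet"')
```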
4. Categorization
Create a categorization model,
curl -X POST -H 'Content-Type: application/json' -d "{
\"parent_id\": \"${FD_DATASET_ID}\",
\"method\": \"LogisticRegression\",
\"data\": [{\"document_id\": 14000, \"category\": \"sci.space\"},
{\"document_id\": 14003, \"category\": \"sci.space\"},
{\"document_id\": 18780, \"category\": \"talk.politics.misc\"},
{\"document_id\": 18784, \"category\": \"talk.politics.misc\"}
],
\"training_scores\": true
}" http://localhost:5001/api/v0/categorization/
Save the returned id for later use with export FD_CAT_ID=<returned-id>.
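When the labelled documents come from a spreadsheet or review tool, the training body is easier to generate than to type. The sketch below builds the same JSON as the curl command above from (document_id, category) pairs; training_payload is a hypothetical helper, not a FreeDiscovery function.

```python
import json


def training_payload(dataset_id, labelled, method="LogisticRegression"):
    """Build the categorization request body from (document_id, category) pairs."""
    data = [
        {"document_id": doc_id, "category": category}
        for doc_id, category in labelled
    ]
    return {
        "parent_id": dataset_id,
        "method": method,
        "data": data,
        "training_scores": True,
    }


labelled = [
    (14000, "sci.space"),
    (14003, "sci.space"),
    (18780, "talk.politics.misc"),
    (18784, "talk.politics.misc"),
]
# "<dataset-id>" stands for the id saved in FD_DATASET_ID.
body = json.dumps(training_payload("<dataset-id>", labelled), indent=2)
```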
Predictions for the other documents in the dataset can then be retrieved with,
curl -X GET -H 'Content-Type: application/json' -d "{
\"max_results\": 10, \"max_result_categories\": 2, \"sort_by\": \"sci.space\"
}" http://localhost:5001/api/v0/categorization/${FD_CAT_ID}/predict
The correspondence of these results with ground truth categories can be checked in fd_id_mapping.txt.
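This comparison can be scripted. The sketch below rests on two assumptions that this guide does not confirm: that fd_id_mapping.txt holds a JSON object with a "data" list of records carrying document_id and file_path, and that each file sits in a directory named after its newsgroup. Adjust the keys if the saved file looks different.

```python
import json
import os


def ground_truth(mapping_text):
    """Map document_id -> newsgroup name inferred from each file_path.

    Assumptions: the mapping is a JSON object with a "data" list of
    {"document_id": ..., "file_path": ...} records, and the parent
    directory of every file is named after its category.
    """
    records = json.loads(mapping_text)["data"]
    return {
        rec["document_id"]: os.path.basename(os.path.dirname(rec["file_path"]))
        for rec in records
    }


# Usage, once the mapping has been saved:
#     with open("./fd_id_mapping.txt") as fh:
#         truth = ground_truth(fh.read())
```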
5. Hierarchical clustering
Create a Birch hierarchical clustering model,
curl -X POST -H 'Content-Type: application/json' -d "{
\"parent_id\": \"${FD_LSI_ID}\",
\"min_similarity\": 0.7, \"max_tree_depth\": 2
}" http://localhost:5001/api/v0/clustering/birch/
Save the returned id for later use with export FD_BIRCH_ID=<returned-id>.
Finally retrieve the computed hierarchical clusters,
curl -X GET http://localhost:5001/api/v0/clustering/birch/${FD_BIRCH_ID}
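The two clustering calls above, creation and retrieval, can be expressed as a pair of request builders. This is a minimal sketch using only the endpoint and parameters shown in this section; the function names are hypothetical and nothing is sent until the requests are passed to urllib.request.urlopen.

```python
import json
import urllib.request

BIRCH_URL = "http://localhost:5001/api/v0/clustering/birch/"


def birch_create_request(lsi_id, min_similarity=0.7, max_tree_depth=2):
    """POST request that creates a Birch clustering model on an LSI index."""
    body = {
        "parent_id": lsi_id,
        "min_similarity": min_similarity,
        "max_tree_depth": max_tree_depth,
    }
    return urllib.request.Request(
        BIRCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def birch_result_request(birch_id):
    """GET request that retrieves the clusters of an existing model."""
    return urllib.request.Request(BIRCH_URL + birch_id, method="GET")
```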
See http://freediscovery.io/doc/stable/examples/ for more complete examples.
We would very much appreciate feedback on the existing functionality. Feel free to open new issues on GitHub or send any comments to the mailing list https://groups.google.com/forum/#!forum/freediscovery-ml.
Documentation
For more information, see the documentation and API Reference:
development version (master branch): http://freediscovery.io/doc/dev/
stable version 1.2 (1.2.X branch): http://freediscovery.io/doc/stable/
Licence
FreeDiscovery is released under the 3-clause BSD licence.
Hashes for freediscovery-1.2.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 54a5c59bf8a184101cc2c95d631953d9d38001f526d96cf9d44cffee43232575
MD5 | 416ed2d0172022eb9b09ac0558ccf575
BLAKE2b-256 | 870b8a38c60a0dc928b56e3c451d412dce93498f6d00d0ed2f70ce764851c533