A package that recommends similar data collections based on unstructured and explicitly provided meta information of files.

DA4RDM_RecSyS_ContentBased

Description

DA4RDM_RecSyS_ContentBased is a Python package that recommends similar data collections based on unstructured and explicitly provided meta information of files.

Installation

The package is written in Python and relies on packages such as tensorflow, keras and nlpaug, among others. The complete list of dependencies is given in the requirements.txt file. The package can be installed with the pip command below:

pip install DA4RDM-RecSys-ContentBased

Importing the Modules

The package has two main modules: preprocessor and distance_similarity_calculator. The preprocessor module provides methods for data cleaning, outlier detection, PCA and text preprocessing, and outputs a processed dataframe that can be used for similarity evaluation. The distance_similarity_calculator module provides methods that compute distances using KMeans with a choice of distance measures and output recommendations based on the distance values. Once the modules are imported, their methods can be invoked directly. The modules can be imported with the commands below:

from DA4RDM_RecSys_ContentBased import preprocessor
from DA4RDM_RecSys_ContentBased import distance_similarity_calculator

Main Methods

  1. loadAndPreprocess_function
    Preprocessing is performed with the method loadAndPreprocess_function in the module preprocessor. This method invokes the other necessary methods and outputs a processed dataframe. The function signature and docstring are shown below:
def loadAndPreprocess_function(filepath: str, features=[], seperator='|', n_componentsAfterPCA=1, encoder='mobilebert_multi_cased', minmaxScaleFlage=True, removeColumnsWithOneValueFlag=False, debug=False):
    """Loads and preprocesses a csv-file

    :param filepath: filepath to the csv file with '|' as the seperator
    :param features: array of features to consider
    :param seperator: the seperator used when loading the csv-file
    :param n_componentsAfterPCA: sets the number of component for PCA
    :param encoder: set the language model: 'mobilebert_multi_cased' or 'bert_multi_cased'
    :param minmaxScaleFlage: minmax scaling resource vector
    :param removeColumnsWithOneValueFlag: remove columns with only one value between all resources and files
    :param debug: debug mode
    :return: a preprocessed pandas.Dataframe
    """
  2. result_function
    The final recommendation is obtained with the method result_function in the module distance_similarity_calculator. This function accepts the preprocessed dataframe along with other parameters (see the signature and docstring below for the full list) and outputs a recommendation based on the distance values:
def result_function(df, key:str, distanceMethod='euclidean', sortAscending=True, nearestNeighbourFlag=True, outputFormatJson=False, DEBUG_MODE=False):
    """Calculating a distance between the key-resource and the resources in the dataframe

    :param df: preprocessed dataframe
    :param key: compare resources to this key
    :param distanceMethod: 'euclidean' or 'cosine' distance
    :param sortAscending: True = sort output ascending; False = descending
    :param nearestNeighbourFlag: sets the flag for the nearest neighbour
    :param DEBUG_MODE: debug mode
    :param outputFormatJson: Trigger Json format
    :return: relative distance between key and furthest resource
    """

Usage and Examples

Below is an example call of loadAndPreprocess_function with an explicit feature selection and debug mode set to False. The returned dataframe df is the preprocessed dataframe.

df = preprocessor.loadAndPreprocess_function(
    filepath="tomography.csv",
    features=[
        'http://purl.org/coscine/terms/sfb1394#acquiredIons',
        'http://purl.org/coscine/terms/sfb1394#annularMillingParameters',
        'http://purl.org/coscine/terms/sfb1394#baseTemperature',
        'http://purl.org/coscine/terms/sfb1394#laserPulseEnergy',
        'http://purl.org/coscine/terms/sfb1394#lowVoltageCleaning',
        'http://purl.org/coscine/terms/sfb1394#pulseFrequency',
        'http://purl.org/coscine/terms/sfb1394#runTime',
        'http://purl.org/coscine/terms/sfb1394#specimenApexRadius',
    ],
    debug=False,
)
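
To make the options minmaxScaleFlage and removeColumnsWithOneValueFlag more concrete, the following is a minimal sketch of what min-max scaling and the removal of single-valued columns generally mean. It is not the package's implementation: the helper names drop_constant_columns and minmax_scale and the sample values are hypothetical, and the sketch works on plain lists rather than a pandas dataframe.

```python
def drop_constant_columns(rows):
    """Remove columns whose value is identical in every row."""
    if not rows:
        return []
    n_cols = len(rows[0])
    # Keep a column only if it contains more than one distinct value.
    keep = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows]


def minmax_scale(rows):
    """Rescale each column linearly to the range [0, 1]."""
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread; map it to 0.0 (sketch convention).
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical numeric metadata: three resources, three features.
rows = [
    [10.0, 5.0, 200.0],
    [20.0, 5.0, 400.0],
    [30.0, 5.0, 300.0],
]
reduced = drop_constant_columns(rows)  # the middle column is constant
scaled = minmax_scale(reduced)
print(scaled)
```

Scaling every feature to the same range keeps features with large numeric values from dominating a later distance computation, and a column with a single value carries no information for distinguishing resources.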

Below is an example execution of the result_function with output format set to json:

jsonOutPut = distance_similarity_calculator.result_function(df, '1EC47F72-DF63-4D95-94E7-EB70C6BA09DB', distanceMethod='euclidean', outputFormatJson=True, DEBUG_MODE=False)
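
The idea of a relative distance between a key resource and all other resources can be illustrated with a small, self-contained sketch. It assumes that "relative" means each distance is divided by the distance to the furthest resource, which matches the 0.0 to 1.0 range of the sample output but is not confirmed by the source. The function relative_distances and the toy vectors are hypothetical, and the KMeans step mentioned above is left out.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen for this sketch only
    return 1.0 - dot / (norm_a * norm_b)


def relative_distances(vectors, key, method="euclidean", ascending=True):
    """Return {resource_id: distance / max_distance}, sorted by distance."""
    measures = {"euclidean": euclidean, "cosine": cosine_distance}
    if method not in measures:
        raise ValueError(f"unknown distance method: {method}")
    measure = measures[method]
    raw = {rid: measure(vectors[key], vec) for rid, vec in vectors.items()}
    furthest = max(raw.values())
    relative = {
        rid: (dist / furthest if furthest > 0 else 0.0)
        for rid, dist in raw.items()
    }
    ordered = sorted(relative.items(), key=lambda item: item[1],
                     reverse=not ascending)
    return dict(ordered)


# Hypothetical resource vectors keyed by resource id.
vectors = {"A": [0.0, 0.0], "B": [3.0, 4.0], "C": [6.0, 8.0]}
ranking = relative_distances(vectors, "A")
print(ranking)
```

With ascending order, the key resource comes first with distance 0.0 and the furthest resource comes last with distance 1.0.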

Output

The calls above compute the relative distance between the reference resource id and its neighbours and output a recommendation ordered by distance. If the parameter outputFormatJson is set to True, the results are produced in JSON format.

If JSON is the selected format, the function outputs the distance values as JSON, as shown below:

{"distance":{"1EC47F72-DF63-4D95-94E7-EB70C6BA09DB":0.0,"302231B4-C161-4392-8895-8111FB7ED1F2":0.1323549579,"322EA9BA-AF4E-4C3A-BE02-0FC76C6673FE":0.3456503446,"6FC1403F-5957-4C45-8048-87D19C7C5832":0.3462583399,"4EFD8371-FD03-477F-BF39-861381FF080C":0.3463898247,"9C30C57E-7308-4DE9-BC38-49796C58929E":0.3472023012,"F8BE75F7-356E-4EB1-83AF-E6C174971D78":0.3489339426,"FAF13DF1-1747-4237-90F3-9451F4F8FEF7":0.3643016356,"24CE68AD-38BA-46DC-ACDB-9D1B93063490":0.4380531763,"632AD746-6A29-471F-861E-00663EA4B5CF":0.4494196308,"1FAA54D3-122B-41FD-ACE3-2B698FC1326F":0.9921902678,"9AA7E05B-A018-4B53-8A63-993C912DA553":0.995833426,"E6822DB5-116C-4875-8D2E-E84B4A2A9794":0.996137678,"65B41144-C3B9-4E96-9FA2-49B2071AF086":0.9977728607,"F9477D28-6D4E-4799-8D34-14383899E157":1.0}}
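
A result in this shape can be turned into a short recommendation list with the standard json module. The sketch below assumes the output is available as a JSON string; the helper top_recommendations is hypothetical and not part of the package, and the sample is an excerpt of the output above.

```python
import json


def top_recommendations(json_text, key, n=3):
    """Return the n closest resources to key as (resource_id, distance) pairs."""
    distances = json.loads(json_text)["distance"]
    # Exclude the key itself, which always has distance 0.0.
    candidates = [(rid, dist) for rid, dist in distances.items() if rid != key]
    candidates.sort(key=lambda item: item[1])
    return candidates[:n]


# Excerpt of the sample output shown above.
sample = json.dumps({
    "distance": {
        "1EC47F72-DF63-4D95-94E7-EB70C6BA09DB": 0.0,
        "302231B4-C161-4392-8895-8111FB7ED1F2": 0.1323549579,
        "322EA9BA-AF4E-4C3A-BE02-0FC76C6673FE": 0.3456503446,
        "F9477D28-6D4E-4799-8D34-14383899E157": 1.0,
    }
})

top = top_recommendations(sample, "1EC47F72-DF63-4D95-94E7-EB70C6BA09DB", n=2)
for resource_id, distance in top:
    print(resource_id, distance)
```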
