Component of Papers recommender system in cross-lingual and multidisciplinary scope
What is it?
Component of a papers recommender system in a cross-lingual and multidisciplinary scope.
Developed as coursework for the MBA in Data Science and Analytics, USP / ESALQ, 2020-2022.
Designed to be customizable in several ways:
- the sentence-transformer model
- the maximum number of candidate articles evaluated for semantic similarity
- the document type: any document that has bibliographic references is accepted
Dependencies
- sentence-transformers
- celery
- mongoengine
Model
The adopted algorithm combines graph-based and content-based filtering, the latter using semantic similarity.
The identification of the relationship between scientific articles is made during the document's entry into the system through the common bibliographic references. Subsequently, the documents are ranked by semantic similarity and recorded in a database.
The recommendation system works in two steps: creating links between articles via common citations and assigning a similarity coefficient for a selection of these linked articles.
The system itself does not establish which articles should be recommended.
The recommendation system client defines which articles to present as a recommendation depending on the criticality of the use case.
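The two steps above can be illustrated with a minimal, self-contained sketch. It is not the package's implementation: the function names, the `max_candidates` parameter, the reference identifiers, and the two-dimensional "embeddings" are all made up for illustration. In the real system the vectors would come from a sentence-transformer model and the results would be stored in the database.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def link_by_common_references(papers):
    """Step 1: map each paper ID to the IDs of papers sharing at least one reference."""
    cited_by = defaultdict(set)
    for pid, refs in papers.items():
        for ref in refs:
            cited_by[ref].add(pid)
    links = defaultdict(set)
    for pids in cited_by.values():
        for a, b in combinations(sorted(pids), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def score_links(pid, links, embeddings, max_candidates=10):
    """Step 2: score up to max_candidates linked papers, highest similarity first."""
    candidates = sorted(links.get(pid, ()))[:max_candidates]
    scored = [(other, cosine(embeddings[pid], embeddings[other])) for other in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): reference IDs per paper and hand-written vectors.
papers = {"A": {"r1", "r2"}, "B": {"r2", "r3"}, "C": {"r3"}, "D": {"r9"}}
embeddings = {"A": [1.0, 0.0], "B": [1.0, 2.0], "C": [0.0, 1.0], "D": [1.0, 0.0]}

links = link_by_common_references(papers)
ranked_b = score_links("B", links, embeddings)
```

Here paper D shares no reference with any other paper, so it never becomes a candidate, however similar its vector is. Choosing which of the scored papers to show remains the client's decision.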
Installation
pip install -U xlingual_papers_recommender
Configurations
export DATABASE_CONNECT_URL=mongodb://my_user:my_password@127.0.0.1:27017/my_db
export CELERY_BROKER_URL="amqp://guest@0.0.0.0:5672//"
export CELERY_RESULT_BACKEND_URL="rpc://"
export MODELS_PATH=sentence_transformers_models
export DEFAULT_MODEL=paraphrase-xlm-r-multilingual-v1
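Before starting the services, it can help to confirm that these variables are set. The check below is only a sketch: `EXPECTED_VARS` and `missing_settings` are hypothetical names, and treating every variable as mandatory is an assumption, since the package may define defaults for some of them.

```python
import os

# Variable names taken from the export lines above.
EXPECTED_VARS = (
    "DATABASE_CONNECT_URL",
    "CELERY_BROKER_URL",
    "CELERY_RESULT_BACKEND_URL",
    "MODELS_PATH",
    "DEFAULT_MODEL",
)


def missing_settings(environ=None):
    """Return the expected variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in EXPECTED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
```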
Celery
Start service
celery -A xlingual_papers_recommender.core.tasks worker -Q default,low_priority,high_priority --pool=solo --autoscale 8,4 --loglevel=DEBUG
Clean queue
celery worker -Q low_priority,default,high_priority --purge
Usage
Register new paper
xlingual_papers_recommender receive_paper [--skip_update SKIP_UPDATE] source_file_path log_file_path
positional arguments:
source_file_path /path/document.json
log_file_path /path/registered.jsonl
optional arguments:
-h, --help show this help message and exit
--skip_update SKIP_UPDATE
if it is already registered, skip_update do not update
Examples of source_file_path:
docs
└── examples
    ├── document1.json
    ├── document2.json
    ├── document3.json
    ├── document4.json
    ├── document5.json
    ├── document51.json
    ├── document6.json
    ├── document6_2.json
    ├── document7.json
    └── document7_2.json
Reference attributes:
- pub_year
- vol
- num
- suppl
- page
- surname
- organization_author
- doi
- journal
- paper_title
- source
- issn
- thesis_date
- thesis_loc
- thesis_country
- thesis_degree
- thesis_org
- conf_date
- conf_loc
- conf_country
- conf_name
- conf_org
- publisher_loc
- publisher_country
- publisher_name
- edition
- source_person_author_surname
- source_organization_author
Get paper recommendations
usage: xlingual_papers_recommender get_connected_papers [-h] [--min_score MIN_SCORE] pid
positional arguments:
pid pid
optional arguments:
-h, --help show this help message and exit
--min_score MIN_SCORE
min_score
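When calling this command from a script, the argument list can be assembled as in the sketch below. The helper name `get_connected_papers_cmd` and the sample `pid` are hypothetical, and the accepted range of `--min_score` is not stated above, so the value shown is only a placeholder.

```python
import shlex


def get_connected_papers_cmd(pid, min_score=None):
    """Build the argument list for the get_connected_papers subcommand."""
    cmd = ["xlingual_papers_recommender", "get_connected_papers"]
    if min_score is not None:
        cmd += ["--min_score", str(min_score)]
    cmd.append(pid)
    return cmd


cmd = get_connected_papers_cmd("my-paper-id", min_score=0.7)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the package is installed and configured.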
Load papers data from datasets
Register parts
usage: xlingual_papers_recommender_ds_loader register_paper_part [-h] [--skip_update SKIP_UPDATE] [--pids_selection_file_path PIDS_SELECTION_FILE_PATH]
{abstracts,references,keywords,paper_titles,articles} input_csv_file_path output_file_path
positional arguments:
{abstracts,references,keywords,paper_titles,articles}
part_name
input_csv_file_path CSV file with papers part data
output_file_path jsonl output file path
optional arguments:
-h, --help show this help message and exit
--skip_update SKIP_UPDATE
True to skip if paper is already registered
--pids_selection_file_path PIDS_SELECTION_FILE_PATH
Selected papers ID file path (CSV file path which has "pid" column)
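The usage above can be wrapped in a small helper that validates the part name before building the command. This is a sketch: `register_paper_part_cmd` and its keyword parameters are hypothetical names, and passing the literal string `True` to `--skip_update` is an assumption based on the help text.

```python
# Part names copied from the usage line above.
PART_NAMES = ("abstracts", "references", "keywords", "paper_titles", "articles")


def register_paper_part_cmd(part_name, input_csv, output_jsonl,
                            skip_update=False, pids_selection_csv=None):
    """Build the argument list for the register_paper_part subcommand."""
    if part_name not in PART_NAMES:
        raise ValueError(f"unknown part name: {part_name!r}")
    cmd = ["xlingual_papers_recommender_ds_loader", "register_paper_part"]
    if skip_update:
        cmd += ["--skip_update", "True"]
    if pids_selection_csv:
        cmd += ["--pids_selection_file_path", pids_selection_csv]
    cmd += [part_name, input_csv, output_jsonl]
    return cmd


cmd = register_paper_part_cmd("articles", "articles.csv", "articles.jsonl")
```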
Register articles
Example:
xlingual_papers_recommender_ds_loader register_paper_part articles articles.csv articles.jsonl
Required columns
- pid
- main_lang
- uri
- subject_areas
- pub_year
- doi (optional)
- network_collection (optional)
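An articles CSV with these columns could be produced as in the following sketch. The helper name and the sample row are hypothetical. In particular, the expected format of `subject_areas` is not specified above, so the value shown is only a placeholder.

```python
import csv
import io

# Column names from the list above; the last two are marked optional there.
ARTICLE_COLUMNS = ["pid", "main_lang", "uri", "subject_areas", "pub_year",
                   "doi", "network_collection"]
OPTIONAL_COLUMNS = {"doi", "network_collection"}


def write_articles_csv(rows, stream):
    """Write rows (dicts) as CSV, rejecting rows that lack a required column."""
    writer = csv.DictWriter(stream, fieldnames=ARTICLE_COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        missing = [name for name in ARTICLE_COLUMNS
                   if name not in OPTIONAL_COLUMNS and not row.get(name)]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        writer.writerow(row)


# Hypothetical sample row; optional columns are left empty.
sample_row = {
    "pid": "paper-1",
    "main_lang": "en",
    "uri": "https://example.org/paper-1",
    "subject_areas": "Health Sciences",
    "pub_year": "2021",
}
buffer = io.StringIO()
write_articles_csv([sample_row], buffer)
```

To write to disk instead, open the target with `open("articles.csv", "w", newline="")` and pass the file object as `stream`.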
Register abstracts
Example:
xlingual_papers_recommender_ds_loader register_paper_part abstracts /inputs/abstracts.csv /outputs/abstracts.jsonl
Columns
- pid
- lang
- original
- text (standardized)
The same columns apply to the paper_titles and keywords datasets.
Register references
Example:
xlingual_papers_recommender_ds_loader register_paper_part references /inputs/references.csv /outputs/references.jsonl
Columns
- pub_year
- vol
- num
- suppl
- page
- surname
- organization_author
- doi
- journal
- paper_title
- source
- issn
- thesis_date
- thesis_loc
- thesis_country
- thesis_degree
- thesis_org
- conf_date
- conf_loc
- conf_country
- conf_name
- conf_org
- publisher_loc
- publisher_country
- publisher_name
- edition
- source_person_author_surname
- source_organization_author
Merge papers parts
usage: xlingual_papers_recommender_ds_loader merge_parts [-h] [--split_into_n_papers SPLIT_INTO_N_PAPERS] [--create_paper CREATE_PAPER]
input_csv_file_path output_file_path
positional arguments:
input_csv_file_path Selected papers ID file path (CSV file path which has "pid" column)
output_file_path jsonl output file path
optional arguments:
-h, --help show this help message and exit
--split_into_n_papers SPLIT_INTO_N_PAPERS
True to create one register for each abstract
--create_paper CREATE_PAPER
True to register papers
Example:
xlingual_papers_recommender_ds_loader merge_parts pids.csv output.jsonl
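The `pids.csv` input only needs a `pid` column, according to the help text above. A minimal sketch for generating such a file follows; the function name and the sample IDs are hypothetical.

```python
import csv
import io


def write_pid_selection(pids, stream):
    """Write a one-column CSV with a "pid" header and one paper ID per row."""
    writer = csv.writer(stream)
    writer.writerow(["pid"])
    for pid in pids:
        writer.writerow([pid])


# In-memory demonstration; use open("pids.csv", "w", newline="") for a real file.
buffer = io.StringIO()
write_pid_selection(["paper-1", "paper-2"], buffer)
```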
Register papers from loaded datasets
usage: xlingual_papers_recommender_ds_loader register_paper [-h] [--skip_update SKIP_UPDATE] input_csv_file_path output_file_path
positional arguments:
input_csv_file_path Selected papers ID file path (CSV file path which has "pid" column)
output_file_path jsonl output file path
optional arguments:
-h, --help show this help message and exit
--skip_update SKIP_UPDATE
True to skip if paper is already registered
Example:
xlingual_papers_recommender_ds_loader register_paper pids.csv output.jsonl
Generate reports from papers, sources and connections
usage: xlingual_papers_recommender_reports all [-h] reports_path
positional arguments:
reports_path /path
optional arguments:
-h, --help show this help message and exit
Example:
xlingual_papers_recommender_reports all /reports