Generate a scientific survey with just a query
Project description
Auto-Research
A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.
Requires:
- Python 3.7 or above
- poppler-utils
- the packages listed in requirements.txt
- 8 GB of disk space
- 13 GB of CUDA (GPU) memory - for a survey of 100 searched papers (max_search) and 25 selected papers (num_papers)
Steps to run (pip coming soon):
apt install -y poppler-utils libpoppler-cpp-dev
git clone https://github.com/sidphbot/Auto-Research.git
cd Auto-Research/
pip install -r requirements.txt
python src/Surveyor.py [options] <your_research_query>
Artifacts generated (zipped):
- Detailed survey draft paper as a txt file
- A curated list of the top 25+ papers as PDFs and txts
- Images extracted from the above papers as JPEGs, BMPs, etc.
- Heading/section-wise highlights extracted from the above papers as a reusable pure-Python joblib dump (see the loading sketch after this list)
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a reusable pure-Python joblib dump
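The joblib dumps can be reloaded directly in Python for downstream use. A minimal sketch (the file name below is an illustrative assumption - check your dump directory for the dump actually produced by a run):

```python
import joblib

# Hypothetical file name - look inside dump_dir (default: arxiv_dumps/)
# for the dump actually written by your run.
highlights = joblib.load('arxiv_dumps/highlights.dump')
print(type(highlights))  # inspect the structure before re-use
```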
Example run #1 - Python utility
python src/Surveyor.py 'multi-task representation learning'
Example run #2 - Python class
from Surveyor import Surveyor
mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
Access/Modify defaults:
- inside code
from Surveyor import DEFAULTS
from pprint import pprint
pprint(DEFAULTS)
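For example, a minimal sketch of tweaking a default in code (this assumes the DEFAULTS keys mirror the parameter names documented under "At runtime (code)" below, and that Surveyor reads them at initialization):

```python
from Surveyor import DEFAULTS, Surveyor

# Assumption: DEFAULTS keys mirror the runtime parameter names
# documented below; verify with pprint(DEFAULTS) first.
DEFAULTS['num_papers'] = 10   # analyse fewer papers for a quicker run
DEFAULTS['max_search'] = 50   # skim less metadata

mysurveyor = Surveyor()       # picks up the mutated defaults, if read at init
```

If mutating the dict at runtime does not take effect, the static defaults.py route below is the safer option.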
or,
- Modify the static config file defaults.py
or,
- At runtime (utility)
python src/Surveyor.py --help
usage: Surveyor.py [-h] [--max_search max_metadata_papers]
[--num_papers max_num_papers] [--pdf_dir pdf_dir]
[--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
[--dump_dir dump_dir] [--models_dir save_models_dir]
[--title_model_name title_model_name]
[--ex_summ_model_name extractive_summ_model_name]
[--ledmodel_name ledmodel_name]
[--embedder_name sentence_embedder_name]
[--nlp_name spacy_model_name]
[--similarity_nlp_name similarity_nlp_name]
[--kw_model_name kw_model_name]
[--refresh_models refresh_models] [--high_gpu high_gpu]
query_string
Generate a survey just from a query!
positional arguments:
query_string your research query/keywords
optional arguments:
-h, --help show this help message and exit
--max_search max_metadata_papers
maximum number of papers to gaze at - defaults to 100
--num_papers max_num_papers
maximum number of papers to download and analyse -
defaults to 25
--pdf_dir pdf_dir pdf paper storage directory - defaults to
arxiv_data/tarpdfs/
--txt_dir txt_dir text-converted paper storage directory - defaults to
arxiv_data/fulltext/
--img_dir img_dir image storage directory - defaults to
arxiv_data/images/
--tab_dir tab_dir tables storage directory - defaults to
arxiv_data/tables/
--dump_dir dump_dir output directory for all artifacts - defaults to
arxiv_dumps/
--models_dir save_models_dir
directory to save models (> 5GB) - defaults to
saved_models/
--title_model_name title_model_name
title model name/tag in hugging-face, defaults to
'Callidior/bert2bert-base-arxiv-titlegen'
--ex_summ_model_name extractive_summ_model_name
extractive summary model name/tag in hugging-face,
defaults to 'allenai/scibert_scivocab_uncased'
--ledmodel_name ledmodel_name
led model(for abstractive summary) name/tag in
hugging-face, defaults to 'allenai/led-
large-16384-arxiv'
--embedder_name sentence_embedder_name
sentence embedder name/tag in hugging-face, defaults
to 'paraphrase-MiniLM-L6-v2'
--nlp_name spacy_model_name
spacy model name/tag in hugging-face (if changed -
needs to be spacy-installed prior), defaults to
'en_core_sci_scibert'
--similarity_nlp_name similarity_nlp_name
spacy downstream model(for similarity) name/tag in
hugging-face (if changed - needs to be spacy-installed
prior), defaults to 'en_core_sci_lg'
--kw_model_name kw_model_name
keyword extraction model name/tag in hugging-face,
defaults to 'distilbert-base-nli-mean-tokens'
--refresh_models refresh_models
Refresh model downloads with the given names (needs
at least one model name param above), defaults to False
--high_gpu high_gpu High GPU usage permitted, defaults to False
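For instance, to skim fewer papers and analyse a smaller selection (flags as documented above; the query is only an example):

python src/Surveyor.py --max_search 50 --num_papers 10 'graph neural networks'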
- At runtime (code)
During Surveyor object initialization with surveyor_obj = Surveyor():
- pdf_dir : String, pdf paper storage directory - defaults to arxiv_data/tarpdfs/
- txt_dir : String, text-converted paper storage directory - defaults to arxiv_data/fulltext/
- img_dir : String, image storage directory - defaults to arxiv_data/images/
- tab_dir : String, tables storage directory - defaults to arxiv_data/tables/
- dump_dir : String, output directory for all artifacts - defaults to arxiv_dumps/
- models_dir : String, directory to save the large models (> 5 GB) - defaults to saved_models/
- title_model_name : String, title model name/tag in hugging-face - defaults to Callidior/bert2bert-base-arxiv-titlegen
- ex_summ_model_name : String, extractive summary model name/tag in hugging-face - defaults to allenai/scibert_scivocab_uncased
- ledmodel_name : String, LED model (for abstractive summary) name/tag in hugging-face - defaults to allenai/led-large-16384-arxiv
- embedder_name : String, sentence embedder name/tag in hugging-face - defaults to paraphrase-MiniLM-L6-v2
- nlp_name : String, spacy model name/tag in hugging-face (if changed, it must be spacy-installed first) - defaults to en_core_sci_scibert
- similarity_nlp_name : String, spacy downstream model (for similarity) name/tag in hugging-face (if changed, it must be spacy-installed first) - defaults to en_core_sci_lg
- kw_model_name : String, keyword extraction model name/tag in hugging-face - defaults to distilbert-base-nli-mean-tokens
- high_gpu : Bool, high GPU usage permitted - defaults to False
- refresh_models : Bool, refresh model downloads with the given names (needs at least one model name param above) - defaults to False
During survey generation with surveyor_obj.survey(query="my_research_query"):
- max_search : int, maximum number of papers to gaze at - defaults to 100
- num_papers : int, maximum number of papers to download and analyse - defaults to 25
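Putting the two together, a minimal sketch (the overridden values are illustrative; every parameter shown is documented above):

```python
from Surveyor import Surveyor

# Initialization-time configuration (all optional; defaults listed above)
surveyor_obj = Surveyor(
    models_dir='saved_models/',
    dump_dir='arxiv_dumps/',
    high_gpu=False,
)

# Generation-time configuration
surveyor_obj.survey(query='my_research_query', max_search=50, num_papers=10)
```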
Download files
Source Distribution: Auto-Research-1.0.tar.gz (46.0 kB)
Built Distribution: Auto_Research-1.0-py3-none-any.whl (50.7 kB)
File details

Details for the file Auto-Research-1.0.tar.gz:
- Size: 46.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes:

Algorithm | Hash digest
---|---
SHA256 | e72ac3167a8b1c38bad7b3389204da396efb13ee7ed15927aef0941fddfbd72d
MD5 | 4370e9e71bad2d11e79817dd617a707f
BLAKE2b-256 | d318f8af98eca66236b03896c4cb55ca8a2eebb4bc81cc26c024ef4f0188b4a2
Details for the file Auto_Research-1.0-py3-none-any.whl:
- Size: 50.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes:

Algorithm | Hash digest
---|---
SHA256 | f52ffe851cfbd5e37fc58d57457064786853868600ff32128462f98e5ca7927e
MD5 | fd13c276d9a7f8dd11de048e471f8d65
BLAKE2b-256 | b90c9e0832051981e0735d84874bc0cfd7f7df9ae75cf45a41b9a70418d5b4e2