Generate a scientific survey from just a query

Project description

Auto-Research

A no-code utility that generates a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.

Requires:

Python 3.7 or above
poppler-utils
the packages listed in requirements.txt
8 GB of disk space
13 GB of CUDA (GPU) memory for a survey of 100 searched papers (max_search) and 25 selected papers (num_papers)
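
The requirements above can be checked before installing anything. The following is a minimal sketch, not part of Auto-Research: the function name `check_prerequisites` is hypothetical, and the choice of `pdftotext` and `pdfimages` as the probes for poppler-utils is an assumption (both ship with poppler-utils, but the source does not say which tools the project calls). GPU memory is not checked here because that would need a CUDA-aware library.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)          # "Python 3.7 or above"
MIN_FREE_DISK_GB = 8         # "8 GB of disk space"
# Assumption: these two poppler-utils binaries are used as a presence probe.
POPPLER_TOOLS = ("pdftotext", "pdfimages")


def check_prerequisites(path="."):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []

    if sys.version_info < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {found} is older than 3.7")

    for tool in POPPLER_TOOLS:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH (install poppler-utils)")

    # Free space on the filesystem that contains `path`, in GiB.
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(f"only {free_gb:.1f} GB free, {MIN_FREE_DISK_GB} GB needed")

    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```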

Steps to run (pip coming soon):

apt install -y poppler-utils libpoppler-cpp-dev
git clone https://github.com/sidphbot/Auto-Research.git

cd Auto-Research/
pip install -r requirements.txt
python src/Surveyor.py [options] <your_research_query>

Artifacts generated (zipped):

Detailed survey draft paper as a txt file
A curated list of the top 25+ papers as pdf and txt files
Images extracted from the above papers as jpeg, bmp, etc.
Heading/section-wise highlights extracted from the above papers as a reusable pure-Python joblib dump
Tables extracted from the papers (optional)
Corpus of metadata highlights/text of the top 100 papers as a reusable pure-Python joblib dump
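
Since the artifacts arrive as a zip archive, a quick inventory by file type helps to see what a run produced. The sketch below uses only the standard library; the function name `inventory_by_extension` and the archive path in the closing comment are hypothetical, as the source does not state how the zip file is named.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def inventory_by_extension(zip_path):
    """Group the files inside a zip archive by lower-cased file extension."""
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep files only
            suffix = PurePosixPath(info.filename).suffix.lower()
            groups[suffix or "(no extension)"].append(info.filename)
    return dict(groups)


# Hypothetical usage; the real archive name depends on the run:
# for suffix, names in inventory_by_extension("arxiv_dumps/my_survey.zip").items():
#     print(suffix, len(names))
```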

Example run #1 - python utility

python src/Surveyor.py 'multi-task representation learning'

Example run #2 - python class

from Surveyor import Surveyor
mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')

Access/Modify defaults:

Inside code

from Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)

or,

Modify static config file - defaults.py

or,

At runtime (utility)

python src/Surveyor.py --help

usage: Surveyor.py [-h] [--max_search max_metadata_papers]
                   [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                   [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                   [--dump_dir dump_dir] [--models_dir save_models_dir]
                   [--title_model_name title_model_name]
                   [--ex_summ_model_name extractive_summ_model_name]
                   [--ledmodel_name ledmodel_name]
                   [--embedder_name sentence_embedder_name]
                   [--nlp_name spacy_model_name]
                   [--similarity_nlp_name similarity_nlp_name]
                   [--kw_model_name kw_model_name]
                   [--refresh_models refresh_models] [--high_gpu high_gpu]
                   query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximium number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximium number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   all_output_dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model(for abstractive summary) name/tag in
                        hugging-face, defaults to 'allenai/led-
                        large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model(for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        Refresh model downloads with given names (needs
                        atleast one model name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False
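
When the utility is driven from another script, the option names in the help text above can be assembled into a command line programmatically. This is a sketch under assumptions: the helper `build_surveyor_command` is hypothetical, and it only handles options of the form `--name value` as shown in the usage line.

```python
import sys


def build_surveyor_command(query, script="src/Surveyor.py", **options):
    """Build an argv list: interpreter, script, --name value pairs, then the query."""
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    command.append(query)  # the positional query_string comes last
    return command


command = build_surveyor_command(
    "multi-task representation learning",
    max_search=50,
    num_papers=10,
)
print(command[1:])
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the repository root; passing a list avoids shell quoting problems with multi-word queries.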

At runtime (code)

During Surveyor object initialization, with surveyor_obj = Surveyor():
- pdf_dir: String, pdf paper storage directory - defaults to arxiv_data/tarpdfs/
- txt_dir: String, text-converted paper storage directory - defaults to arxiv_data/fulltext/
- img_dir: String, image storage directory - defaults to arxiv_data/images/
- tab_dir: String, tables storage directory - defaults to arxiv_data/tables/
- dump_dir: String, all_output_dir - defaults to arxiv_dumps/
- models_dir: String, directory in which to save the large models (> 5GB), defaults to saved_models/
- title_model_name: String, title model name/tag in hugging-face, defaults to Callidior/bert2bert-base-arxiv-titlegen
- ex_summ_model_name: String, extractive summary model name/tag in hugging-face, defaults to allenai/scibert_scivocab_uncased
- ledmodel_name: String, led model (for abstractive summary) name/tag in hugging-face, defaults to allenai/led-large-16384-arxiv
- embedder_name: String, sentence embedder name/tag in hugging-face, defaults to paraphrase-MiniLM-L6-v2
- nlp_name: String, spacy model name/tag in hugging-face (if changed, it needs to be spacy-installed beforehand), defaults to en_core_sci_scibert
- similarity_nlp_name: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed, it needs to be spacy-installed beforehand), defaults to en_core_sci_lg
- kw_model_name: String, keyword extraction model name/tag in hugging-face, defaults to distilbert-base-nli-mean-tokens
- high_gpu: Bool, high GPU usage permitted, defaults to False
- refresh_models: Bool, refresh model downloads with the given names (needs at least one model name param above), defaults to False

During survey generation, with surveyor_obj.survey(query="my_research_query"):
- max_search: int, maximum number of papers to gaze at - defaults to 100
- num_papers: int, maximum number of papers to download and analyse - defaults to 25
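
The two groups of parameters above map onto two calls: one set goes to the constructor, the other to `survey`. The sketch below keeps them in separate dictionaries. It assumes that the listed names are accepted as keyword arguments, which the list suggests but the source does not show explicitly; the wrapper `run_survey` and the option values are illustrative only.

```python
def run_survey(surveyor_cls, query, init_options=None, survey_options=None):
    """Create a surveyor with init-time options, then run one survey with run-time options."""
    surveyor = surveyor_cls(**(init_options or {}))
    return surveyor.survey(query=query, **(survey_options or {}))


# Init-time options (assumed to be keyword arguments of Surveyor()).
INIT_OPTIONS = {"dump_dir": "arxiv_dumps/", "high_gpu": False}
# Run-time options (assumed to be keyword arguments of survey()).
SURVEY_OPTIONS = {"max_search": 50, "num_papers": 10}

# With Auto-Research installed, the call would look like:
# from Surveyor import Surveyor
# run_survey(Surveyor, "quantum entanglement", INIT_OPTIONS, SURVEY_OPTIONS)
```

Keeping the dictionaries separate makes it harder to pass a run-time option such as max_search to the constructor by mistake.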

Project details

Release history

This version

1.0

Jul 11, 2021

Download files

Source Distribution

Auto-Research-1.0.tar.gz (46.0 kB)

Uploaded Jul 11, 2021 Source

Built Distribution

Auto_Research-1.0-py3-none-any.whl (50.7 kB)

Uploaded Jul 11, 2021 Python 3

File details

Details for the file Auto-Research-1.0.tar.gz.

File metadata

Download URL: Auto-Research-1.0.tar.gz
Upload date: Jul 11, 2021
Size: 46.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for Auto-Research-1.0.tar.gz
Algorithm	Hash digest
SHA256	`e72ac3167a8b1c38bad7b3389204da396efb13ee7ed15927aef0941fddfbd72d`
MD5	`4370e9e71bad2d11e79817dd617a707f`
BLAKE2b-256	`d318f8af98eca66236b03896c4cb55ca8a2eebb4bc81cc26c024ef4f0188b4a2`
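
A downloaded file can be compared against the published digest before installing it. The sketch below uses the standard `hashlib` module and the SHA256 value listed above for Auto-Research-1.0.tar.gz; the function names and the local file path in the closing comment are hypothetical.

```python
import hashlib

# SHA256 digest published above for Auto-Research-1.0.tar.gz.
EXPECTED_SHA256 = "e72ac3167a8b1c38bad7b3389204da396efb13ee7ed15927aef0941fddfbd72d"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Hypothetical usage, assuming the archive was saved to the current directory:
# print(matches_expected("Auto-Research-1.0.tar.gz", EXPECTED_SHA256))
```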

File details

Details for the file Auto_Research-1.0-py3-none-any.whl.

File metadata

Download URL: Auto_Research-1.0-py3-none-any.whl
Upload date: Jul 11, 2021
Size: 50.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for Auto_Research-1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f52ffe851cfbd5e37fc58d57457064786853868600ff32128462f98e5ca7927e`
MD5	`fd13c276d9a7f8dd11de048e471f8d65`
BLAKE2b-256	`b90c9e0832051981e0735d84874bc0cfd7f7df9ae75cf45a41b9a70418d5b4e2`
