
data-to-paper: Backward-traceable AI-driven scientific research

Backward-traceable AI-driven Research

License: MIT

data-to-paper is an automation framework that systematically navigates interacting AI agents through complete end-to-end scientific research, starting from raw data alone and concluding with transparent, backward-traceable, human-verifiable scientific papers (Example AI-created paper, Copilot App DEMO).

Try it out

pip install data-to-paper

then run: data-to-paper

See INSTALL for dependencies.

Key features

  • End-to-end field-agnostic research. The process navigates through the entire scientific path, from data exploration, literature search and ideation, through data analysis and interpretation, to the step-by-step writing of a complete research paper.

  • Traceable "data-chained" manuscripts. Tracing information flow, data-to-paper creates backward-traceable and verifiable manuscripts, where any numeric value can be click-traced all the way up to the specific code lines that created it (data-chaining DEMO).

  • Autopilot or Copilot. The platform can run fully autonomously, or can be human-guided through the Copilot App, allowing users to:

    • Oversee, Inspect and Guide the research

    • Set research goals, or let the AI autonomously raise and test hypotheses

    • Provide review, or invoke on-demand AI-reviews

    • Rewind the process to prior steps

    • Record and replay runs

    • Track API costs

  • Coding guardrails. Standard statistical packages are overridden with multiple guardrails to minimize common LLM coding errors.
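The "data-chained" idea from the feature list above can be illustrated with a minimal sketch. This is not data-to-paper's implementation: `TracedValue` and `traced` are hypothetical names, and the sketch only assumes that each reported number is stored together with the file and line of the code that produced it.

```python
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedValue:
    """A number bundled with the source location that produced it (hypothetical type)."""
    value: float
    label: str
    filename: str
    lineno: int

    def cite(self) -> str:
        # Render the value with a pointer back to the code line that created it.
        return f"{self.label} = {self.value:.3g} [{self.filename}:{self.lineno}]"


def traced(value: float, label: str) -> TracedValue:
    """Record the caller's file name and line number alongside the value."""
    frame = sys._getframe(1)  # the frame of the code that called traced()
    return TracedValue(value, label, frame.f_code.co_filename, frame.f_lineno)


# Hypothetical analysis step: the mean is computed and tagged in one place.
ages = [34, 41, 29, 50]
mean_age = traced(sum(ages) / len(ages), "mean_age")
print(mean_age.cite())
```

A manuscript generator built along these lines could turn the recorded location into a link, which is one way a reader could click from a number in the text back to the analysis code.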



https://github.com/Technion-Kishony-lab/data-to-paper/assets/31969897/0f3acf7a-a775-43bd-a79c-6877f780f2d4

Motivation: Building a new standard for Transparent, Traceable, and Verifiable AI-driven Research

The data-to-paper framework was created as a research project to understand the capacities and limitations of LLM-driven scientific research, and to develop ways of harnessing LLMs to accelerate research while maintaining, and even enhancing, key scientific values such as transparency, traceability and verifiability, and while allowing scientists to oversee and direct the process (see also: living guidelines).

Implementation

Towards this goal, data-to-paper systematically guides interacting LLM and rule-based agents through the conventional scientific path, from annotated data, through creating research hypotheses, conducting literature search, writing and debugging data analysis code, interpreting the results, and ultimately the step-by-step writing of a complete research paper.
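The sequence of steps described above can be pictured as a chain of stages, each consuming the products of earlier ones. The sketch below is an illustration only: the stage names, the `run_pipeline` function and the placeholder lambdas are assumptions made for this example and do not reflect data-to-paper's internal API, where each step is carried out by LLM and rule-based agents.

```python
from typing import Callable, Dict, List, Tuple

Products = Dict[str, str]
Stage = Tuple[str, Callable[[Products], str]]

# Each stage reads earlier products and returns a new one.
# The lambdas are placeholders standing in for LLM or rule-based agents.
STAGES: List[Stage] = [
    ("hypothesis", lambda p: f"hypothesis about {p['data']}"),
    ("literature", lambda p: f"papers related to: {p['hypothesis']}"),
    ("analysis_code", lambda p: f"code testing: {p['hypothesis']}"),
    ("results", lambda p: f"output of: {p['analysis_code']}"),
    ("paper", lambda p: f"paper citing {p['literature']} and reporting {p['results']}"),
]


def run_pipeline(data_description: str) -> Products:
    """Run the stages in order, accumulating every intermediate product."""
    products: Products = {"data": data_description}
    for name, step in STAGES:
        products[name] = step(products)
    return products


products = run_pipeline("annotated dataset")
print(list(products))
```

Keeping every intermediate product, rather than only the final paper, is what would make it possible to inspect a run or rewind it to an earlier step.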

Reference

The data-to-paper framework is described in the following pre-print:

  • Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay and Roy Kishony, "Autonomous LLM-driven research from data to human-verifiable research papers", arXiv:2404.17605

Examples

We ran data-to-paper on the following test cases:

  • Health Indicators (open goal). A clean unweighted subset of CDC’s Behavioral Risk Factor Surveillance System (BRFSS) 2015 annual dataset (Kaggle). Here is an example Paper created by data-to-paper.

Try out:

data-to-paper diabetes
  • Social Network (open goal). A directed graph of Twitter interactions among the 117th Congress members (Fink et al). Here is an example Paper created by data-to-paper.

Try out:

data-to-paper social_network
  • Treatment Policy (fixed-goal). A dataset on treatment and outcomes of non-vigorous infants admitted to the Neonatal Intensive Care Unit (NICU), before and after a change to treatment guidelines was implemented (Saint-Fleur et al). Here is an example Paper created by data-to-paper.

Try out:

data-to-paper npr_nicu
  • Treatment Optimization (fixed-goal). A dataset of pediatric patients who received mechanical ventilation after undergoing surgery, including an X-ray-based determination of the optimal tracheal tube intubation depth and a set of personalized patient attributes to be used in machine learning and formula-based models to predict this optimal depth (Shim et al). Here is an example Paper created by data-to-paper.

We defined three levels of difficulty for the research question for this paper.

  1. easy: Compare two ML methods for predicting optimal intubation depth
    Try out:
data-to-paper ML_easy
  2. medium: Compare one ML method and one formula-based method for predicting optimal intubation depth
    Try out:
data-to-paper ML_medium
  3. hard: Compare 4 ML methods with 3 formula-based methods for predicting optimal intubation depth
    Try out:
data-to-paper ML_hard

Contributing

We invite people to try out data-to-paper with their own data and are eager for feedback and suggestions. It is currently designed for relatively simple research goals and simple datasets, where the aim is to raise and test a statistical hypothesis.

We also invite people to help develop and extend the data-to-paper framework in science or other fields.

Important notes

Disclaimer. By using this software, you agree to assume all risks associated with its use, including but not limited to data loss, system failure, or any other issues that may arise, especially, but not limited to, the consequences of running LLM-created code on your local machine. The developers of this project do not accept any responsibility or liability for any losses, damages, or other consequences that may occur as a result of using this software.

Accountability. You are solely responsible for the entire content of created manuscripts including their rigour, quality, ethics and any other aspect. The process should be overseen and directed by a human-in-the-loop and created manuscripts should be carefully vetted by a domain expert. The process is NOT error-proof and human intervention is necessary to ensure accuracy and the quality of the results.

Compliance. It is your responsibility to ensure that any actions or decisions made based on the output of this software comply with all applicable laws, regulations, and ethical standards. The developers and contributors of this project shall not be held responsible for any consequences arising from using this software. Further, data-to-paper manuscripts are watermarked for transparency as AI-created. Users should not remove this watermark.

Token Usage. Please note that using language models through external APIs, especially GPT-4, can be expensive because charges scale with token usage. By utilizing this project, you acknowledge that you are responsible for monitoring and managing your own token usage and the associated costs. It is highly recommended to check your API usage regularly and set up any necessary limits or alerts to prevent unexpected charges.
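As one way to act on this recommendation, a small local budget guard can stop a run before it overspends. The sketch below is hypothetical and independent of data-to-paper's own cost tracking: `CostTracker`, `BudgetExceeded` and the price per 1,000 tokens are placeholders chosen for illustration, not real API pricing.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request would push spending past the budget."""


class CostTracker:
    """Accumulate estimated API cost and refuse requests that exceed a budget."""

    def __init__(self, budget_usd: float, usd_per_1k_tokens: float) -> None:
        self.budget_usd = budget_usd
        self.usd_per_1k_tokens = usd_per_1k_tokens  # placeholder rate, not real pricing
        self.spent = 0.0

    def add(self, tokens: int) -> float:
        """Register a request of `tokens` tokens and return the running total in USD."""
        cost = tokens / 1000 * self.usd_per_1k_tokens
        if self.spent + cost > self.budget_usd:
            # Refuse before spending, so the total never passes the budget.
            raise BudgetExceeded(
                f"request of ${cost:.2f} would exceed the ${self.budget_usd:.2f} budget"
            )
        self.spent += cost
        return self.spent


tracker = CostTracker(budget_usd=1.0, usd_per_1k_tokens=0.5)
print(tracker.add(1000))
print(tracker.add(500))
```

A local guard like this complements, but does not replace, the usage limits and alerts offered by the API provider.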

Related projects

Here are some other cool multi-agent related projects:

And also this curated list of awesome-agents.
