A library for processing textual datasets with large language models.

gpt_scientist

gpt_scientist is a lightweight Python library for processing tabular data stored in Google Sheets (or CSV files) using OpenAI models (like GPT-4o, GPT-4o-mini, etc).

The library is designed primarily for social science researchers and other users without extensive programming experience who want to run AI-based textual analysis over tabular data with just a few lines of Python code.

The library is best used in Google Colab for processing Google Sheets. However, it can also be used locally with CSV files.

Feedback and Collaboration

If you use gpt_scientist for your project, we would love to hear about it! Your feedback helps us improve the library and better support real-world research and activist work.

Feel free to open an issue on GitHub, or reach out by email.

Installation

pip install gpt-scientist

Quick Example

from gpt_scientist import Scientist

# Create a Scientist
sc = Scientist(api_key='YOUR_OPENAI_API_KEY')
# (or set it via the OPENAI_API_KEY environment variable)

# Set the system prompt that describes the general capabilities you need:
sc.set_system_prompt("You are an assistant helping to analyze customer reviews.")
# Or, if the system prompt is long (e.g. it contains the theoretical frame of your research study), you can load it from a Google Doc:
# sc.load_system_prompt_from_google_doc('your-google-doc-id')


# Define the task prompt
prompt = "Analyze the review and provide the overall sentiment from 1 (very negative) to 5 (very positive), together with a short explanation."

# Analyze a Google Sheet
sc.analyze_google_sheet(
    sheet_key='your-google-sheet-key', # you can use the full URL or just the part between /d/ and the next /
    prompt=prompt,
    input_fields=['review_text'],
    output_fields=['sentiment', 'explanation'],
    rows='2:12',  # optional: analyze only rows 2 to 12 in the sheet
)

This will:

Read the first worksheet from your Google Sheet
Create the sentiment and explanation columns in that sheet if they don't exist
For each row in the specified range (2 to 12):
- Read the content of the review_text column
- Call the OpenAI model with the prompt and the review text
- Write the results (sentiment and explanation) back into the sheet
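
To make the sheet_key and rows arguments above more concrete, here is a minimal sketch of how such values could be interpreted. The helper names extract_sheet_key and parse_rows are hypothetical and are not part of gpt_scientist; the sketch assumes the row range is inclusive on both ends, as the "2 to 12" description suggests.

```python
import re


def extract_sheet_key(sheet_key_or_url: str) -> str:
    """Hypothetical helper: return the part between /d/ and the next /.

    If the input does not look like a URL containing /d/, it is assumed
    to be a bare key already and is returned unchanged.
    """
    match = re.search(r"/d/([^/]+)", sheet_key_or_url)
    return match.group(1) if match else sheet_key_or_url


def parse_rows(rows: str) -> list[int]:
    """Hypothetical helper: turn '2:12' into [2, 3, ..., 12] (inclusive)."""
    start, end = (int(part) for part in rows.split(":"))
    return list(range(start, end + 1))


key = extract_sheet_key("https://docs.google.com/spreadsheets/d/abc123/edit#gid=0")
row_numbers = parse_rows("2:12")
```

With these inputs, key is 'abc123' and row_numbers covers the 11 rows from 2 to 12.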

Important: Google Sheets can only be accessed from Google Colab, so you need to run this code in a Colab notebook. To use the library locally with CSV files, call sc.analyze_csv(...) instead of sc.analyze_google_sheet(...) (see example).

Notes

The library will write to the sheet as it goes, so even if you stop the execution, you will have the results for the rows that were already processed.
The library processes multiple rows in parallel (by default, 100 at a time). This makes processing much faster, but it also means the output cells may be filled in out of order.
The library will also show you the cost of the API calls so far, so you can keep track of your spending (only for those models whose price it knows).
If the output columns already exist, the library will skip those rows where the outputs are already filled in (unless you specify overwrite=True).
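
The behaviour described in these notes (skipping rows that already have outputs, and writing results in whatever order they complete) can be illustrated with a small standalone sketch. This is not the library's implementation: process_rows and fake_model_call are hypothetical names, the rows are plain dictionaries rather than sheet rows, and the model call is replaced by a made-up function so the example runs without an API key.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_model_call(text: str) -> dict:
    """Stand-in for an API call; returns a made-up sentiment score."""
    return {"sentiment": len(text) % 5 + 1}


def process_rows(rows, output_field="sentiment", parallel_rows=100, overwrite=False):
    """Fill output_field for each row, skipping rows that already have a value."""
    # Select only the rows that still need an output (unless overwrite is set).
    todo = [
        i for i, row in enumerate(rows)
        if overwrite or row.get(output_field) in (None, "")
    ]
    with ThreadPoolExecutor(max_workers=parallel_rows) as pool:
        futures = {pool.submit(fake_model_call, rows[i]["review_text"]): i for i in todo}
        # as_completed yields futures as they finish, not in submission order,
        # so each result is written back as soon as it is available.
        for future in as_completed(futures):
            i = futures[future]
            rows[i][output_field] = future.result()[output_field]
    return todo


rows = [
    {"review_text": "great", "sentiment": ""},
    {"review_text": "bad", "sentiment": 4},  # already filled in, so it is skipped
    {"review_text": "okay-ish", "sentiment": ""},
]
processed = process_rows(rows, parallel_rows=2)
```

Because each result is stored as soon as its call finishes, stopping partway through would still leave the completed rows filled in.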

Advanced Features

Document Processing

Often, you may want to analyze not just short text fields, but longer documents — for example, interview transcripts. If your input cell in the Google Sheet contains a link to a Google Doc, the library will automatically open the document and feed its full content to the language model.

Important: If Google Sheets automatically converted your link into a "smart chip" (those clickable document previews), the library will not recognize it. You must ensure the spreadsheet cell contains a plain hyperlink, not a chip.
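
As an illustration of what a "plain hyperlink" means here, the following sketch checks whether a cell's text contains a Google Doc URL and pulls out the document ID. The helper google_doc_id is hypothetical and only shows the general idea; how the library itself detects links, and how a smart chip appears to it, is not covered by this sketch.

```python
import re

# Google Doc URLs have the form https://docs.google.com/document/d/<id>/...
DOC_LINK = re.compile(r"https://docs\.google\.com/document/d/([A-Za-z0-9_-]+)")


def google_doc_id(cell_value: str):
    """Hypothetical helper: return the document ID if the cell holds a Doc link, else None."""
    match = DOC_LINK.search(cell_value.strip())
    return match.group(1) if match else None


linked = google_doc_id("https://docs.google.com/document/d/1AbC_dEf-123/edit")
plain = google_doc_id("The service was slow.")
```

Here linked is '1AbC_dEf-123', while plain is None, so that cell would be treated as ordinary text.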

Quote Verification

One of the useful applications of GPT-based analysis is extracting quotes on specific topics from documents. The library includes a helper function to verify extracted quotes against the original source text.

You can call this after your call to analyze_google_sheet:

sc.check_quotes_google_sheet(
    sheet_key='your-google-sheet-key',
    input_fields=['transcript'],
    output_field='gpt_extracted_quote',
    rows='4:5'  # optional: specify which rows to process
)

This function will:

Create a new column 'gpt_extracted_quote_verified' (if it doesn't exist yet).
For each row, search for the extracted quote in any of the input fields (in this case, in the transcript).
If it finds an exact or approximate match, it will write the exact version of the quote into 'gpt_extracted_quote_verified'.
Otherwise, it will insert 'QUOTE NOT FOUND'.

This helps verify that the quotes generated by the model actually correspond to the original document, improving the reliability of automated extraction.
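
To show what "exact or approximate match" can mean in practice, here is a rough standalone sketch of quote verification. It is an assumption-laden stand-in and not the library's algorithm: verify_quote is a hypothetical name, the fuzzy matching uses Python's difflib over a sliding window of words, and the 0.85 threshold is an arbitrary choice.

```python
from difflib import SequenceMatcher


def verify_quote(quote: str, source: str, threshold: float = 0.85) -> str:
    """Return the source's version of the quote, or 'QUOTE NOT FOUND'."""
    if quote in source:
        return quote  # exact match
    quote_words = quote.split()
    source_words = source.split()
    n = len(quote_words)
    best_ratio, best_text = 0.0, ""
    # Slide a window of the same word count over the source and keep the
    # window that is most similar to the quote.
    for start in range(max(1, len(source_words) - n + 1)):
        candidate = " ".join(source_words[start:start + n])
        ratio = SequenceMatcher(None, quote.lower(), candidate.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_text = ratio, candidate
    return best_text if best_ratio >= threshold else "QUOTE NOT FOUND"


transcript = (
    "I think the new policy was, honestly, a disaster for our neighborhood. "
    "Nobody asked us."
)
approximate = verify_quote(
    "the new policy was honestly a disaster for our neighborhood", transcript
)
missing = verify_quote("we loved the policy from the start", transcript)
```

In this example the first quote differs from the transcript only in punctuation, so approximate holds the transcript's own wording, while missing is 'QUOTE NOT FOUND'.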

Other Settings

Select a different worksheet

If your input data is not on the first sheet, add worksheet_index=n (e.g. worksheet_index=1) to the parameters of your analyze_google_sheet call. Indexing starts from 0, so 1 is the second sheet.

Change the model

The default model is gpt-4o-mini: it is cheap and good enough for most tasks. But you can use any model that is enabled for your OpenAI API key. Just make this call before your call to analyze_google_sheet:

sc.set_model('gpt-4o')

Write results to a new sheet

If you don't want to modify the input sheet, add in_place=False to the parameters of your analyze_google_sheet call. This will create a new worksheet for the output.

Load a system prompt from a file

sc.load_system_prompt_from_file('content/system.txt')

Control parallel processing

Depending on your OpenAI account tier and the model you are using, you might be limited in how many requests you can make per minute. You can control the number of parallel requests with:

sc.set_parallel_rows(10)

The default is 100.

Set model parameters

sc.set_model_params({
    'top_p': 0.5,
    'max_completion_tokens': 100
})

Pass any parameters supported by the OpenAI API. Common options:

top_p: Controls diversity via nucleus sampling (0.0-1.0). Lower values make responses more deterministic.
max_completion_tokens: Limits response length. Protects from excessive costs if the model generates very long outputs.
temperature: Controls randomness (0.0-2.0).
reasoning_effort: For reasoning models ('none', 'low', 'medium', 'high').

Adjust retries and batch sizes

sc.set_num_results(5)
sc.set_num_retries(20)

set_num_retries controls how many times the library retries after a bad response (default: 10).
set_num_results controls how many completions are requested at once. This is useful if the input is much larger than the output and the responses are often bad.
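
The interplay of the two settings can be sketched as follows. This is only an illustration under stated assumptions: the source does not define what counts as a "bad response", so the sketch assumes it means text that is not valid JSON or lacks an output field; call_with_retries, first_valid and request_fn are hypothetical names; and num_retries is treated here as the total number of attempts.

```python
import json


def first_valid(candidates, output_fields):
    """Return the first candidate that parses as JSON and has all output fields."""
    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and all(field in data for field in output_fields):
            return data
    return None


def call_with_retries(request_fn, output_fields, num_results=1, num_retries=10):
    """Ask for num_results completions per attempt; stop at the first usable one."""
    for _ in range(num_retries):
        result = first_valid(request_fn(num_results), output_fields)
        if result is not None:
            return result
    return None


# A fake request function that returns bad responses on the first two attempts.
calls = []

def fake_request(n):
    calls.append(n)
    if len(calls) < 3:
        return ["not json"] * n
    return ["oops", '{"sentiment": 4, "explanation": "fine"}'][:n]


result = call_with_retries(
    fake_request, ["sentiment", "explanation"], num_results=2, num_retries=5
)
```

Requesting several completions per attempt means the input is sent once while there are several chances at a usable output, which is why it helps when inputs are long and outputs are short.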

Customize token pricing

sc.set_pricing({'gpt-3.5-turbo': {'input': 1.5, 'output': 2}})

If you are using a model not included in the built-in pricing table, or if token prices have changed, you can define your own (in dollars per million tokens).
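
For reference, this is how a price table in dollars per million tokens translates into a cost figure. The function estimate_cost and the token counts are hypothetical and are used only to show the arithmetic; the pricing dictionary has the same shape as the one passed to set_pricing above.

```python
def estimate_cost(pricing, model, input_tokens, output_tokens):
    """Return the cost in dollars, or None if the model's price is unknown."""
    price = pricing.get(model)
    if price is None:
        return None
    # Prices are per million tokens, so divide the total by 1,000,000.
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


pricing = {"gpt-3.5-turbo": {"input": 1.5, "output": 2}}
cost = estimate_cost(pricing, "gpt-3.5-turbo", input_tokens=200_000, output_tokens=50_000)
```

Here 200,000 input tokens cost $0.30 and 50,000 output tokens cost $0.10, so cost comes to $0.40.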

Acknowledgements

This library has been created as a result of my collaboration with the Hannah Arendt Research Center, and the idea is due to the Center's founder, Mariia Vasilevskaia.

Project details

Version 0.1.35, released Jan 14, 2026.
