Write a survey report using a supported news site and GPT-4 LLM
newssurvey
newssurvey is a proof-of-concept Python 3.12 application to write a survey report about a question or concern using a single supported news site. The news site is used to conduct searches and read articles. Currently only two sites are supported. Numerous calls are made to OpenAI LLMs, namely gpt-4o-mini and gpt-4o, to formulate the response. A funded OpenAI API key is required.
Sources
The supported sources are:
Name | Type | Observed LLM cost range per report in USD |
---|---|---|
medicalxpress | medical | 1 to 6 |
physorg | science | 1 to 19 |
The LLM cost per report varies by the number of source articles and output sections for the submitted user query. The cost is approximately 1 USD per 100 source articles per 10 output sections. Strictly speaking, the cost is unbounded and must be monitored and restricted via the OpenAI usage dashboard. The generation time per report is expected to be under an hour, also depending on the number of source articles.
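The approximate cost rule above can be turned into a quick estimate. This is a minimal sketch, not part of the package: `estimate_cost_usd` is a hypothetical helper, and it assumes the rule scales multiplicatively in both the number of source articles and the number of output sections. Actual costs depend on current OpenAI pricing and should be checked on the usage dashboard.

```python
def estimate_cost_usd(num_articles: int, num_sections: int) -> float:
    """Rough LLM cost estimate in USD.

    Assumes about 1 USD per 100 source articles per 10 output sections,
    scaled linearly in both quantities (an assumption, not a guarantee).
    """
    if num_articles < 0 or num_sections < 0:
        raise ValueError("counts must be non-negative")
    return 1.0 * (num_articles / 100) * (num_sections / 10)


# Example: 300 source articles and 20 output sections.
print(estimate_cost_usd(300, 20))  # prints 6.0
```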
Links
Caption | Link |
---|---|
Repo | https://github.com/impredicative/newssurvey |
Changelog | https://github.com/impredicative/newssurvey/releases |
Package | https://pypi.org/project/newssurvey |
Approach
Each step in this workflow corresponds to an action taken by the LLM.
- Get search terms: Search terms for the given user query and site are listed by the LLM. The user query is a question or concern applicable to the user chosen news site. Additional search terms are also obtained until convergence.
- Get filtered search results: For each search term, a single page of search results is retrieved. More than one search type may be supported by the site, in which case all supported search types are used. Each result is composed of a title and possibly a blurb. The search results are filtered, one page at a time, for relevance by the LLM. This step is repeated for additional pages of search results until there are no relevant results for the page. After this, the full texts of all filtered search results are read.
- List section names: The list of article titles is presented to the LLM, ordered by distance to the user query. The LLM provides a coherent single-level list of section names. The list is then refined until convergence.
- Rate articles for sections: For each article, the LLM numerically rates on a scale of 0 to 100 how well the article can contribute to each section.
- Condense article by section: For each article and section pairing, limited to ones with nonzero ratings, the LLM condenses the article text.
- Get text by section: For each section, its condensed articles are concatenated together, ordered by their corresponding ratings, up to the maximum input context length of the LLM. The LLM formulates the text for each section. The section-specific citation numbers are replaced by globally consistent numbers.
- Get response title: The LLM provides the response title using the list of section names.
The workflow is intended to be as simple as necessary, and without cycles between steps.
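The last part of the "Get text by section" step, replacing section-specific citation numbers with globally consistent ones, can be sketched as follows. This is only an illustration under stated assumptions, not the package's implementation: it assumes citations appear as `[n]` in the section text, that `n` is a 1-based index into that section's own source list, and that global numbers are assigned in order of first appearance. The function name `renumber_citations` is hypothetical.

```python
import re


def renumber_citations(sections: list[tuple[str, list[str]]]) -> tuple[list[str], list[str]]:
    """Map section-local citation numbers to global ones.

    Each input item is (section_text, local_sources), where a citation [k]
    in the text refers to local_sources[k - 1]. Returns the rewritten texts
    and the global source list, whose 1-based positions are the new numbers.
    """
    global_sources: list[str] = []
    global_index: dict[str, int] = {}
    renumbered: list[str] = []

    for text, local_sources in sections:

        def replace(match: re.Match) -> str:
            source = local_sources[int(match.group(1)) - 1]
            if source not in global_index:
                # First time this source is cited anywhere in the report.
                global_sources.append(source)
                global_index[source] = len(global_sources)
            return f"[{global_index[source]}]"

        renumbered.append(re.sub(r"\[(\d+)\]", replace, text))

    return renumbered, global_sources
```

With this scheme, an article cited as `[1]` in one section and `[2]` in another ends up with the same number throughout the report.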
Limitations
Because the LLM's context window is limited to 128K tokens, only up to about 400 condensed articles can be used for writing a section. Among the articles relevant to a section, the most highly rated ones that fit in this window are used.
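The selection of articles for a section can be pictured as a greedy packing by rating. The sketch below is illustrative only: the `Condensed` class, the `approx_tokens` heuristic of roughly four characters per token, and the choice to stop at the first article that does not fit are all assumptions, not the package's actual logic. As a rough check on the figures above, 128,000 tokens across 400 articles would allow about 320 tokens per condensed article if the whole window were available for them.

```python
from dataclasses import dataclass


@dataclass
class Condensed:
    """A condensed article for one section (hypothetical structure)."""
    title: str
    rating: int  # 0 to 100, as rated for this section
    text: str


def approx_tokens(text: str) -> int:
    # Crude heuristic: about 4 characters per token. A real tokenizer would differ.
    return max(1, len(text) // 4)


def pack_for_section(articles: list[Condensed], budget_tokens: int) -> list[Condensed]:
    """Pick the highest-rated articles whose combined size fits the budget."""
    selected: list[Condensed] = []
    used = 0
    for article in sorted(articles, key=lambda a: a.rating, reverse=True):
        if article.rating == 0:
            continue  # Only nonzero ratings are considered.
        cost = approx_tokens(article.text)
        if used + cost > budget_tokens:
            break  # Stop at the first article that no longer fits.
        selected.append(article)
        used += cost
    return selected
```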
Samples
These generated samples are available in HTML format.
Source | User query (simplified) | Output title |
---|---|---|
medicalxpress | nutrition for anxiety | Nutritional Interventions and Supplements for Managing Anxiety in Adults |
medicalxpress | daytime drowsiness | Understanding and Addressing Daytime Drowsiness: Causes, Strategies, and Solutions |
medicalxpress | acid reflux treatments | Comprehensive Management and Treatment Approaches for Gastroesophageal Reflux Disease (GERD): Integrating Pharmacological, Non-Pharmacological, and Lifestyle Interventions |
physorg | dark matter theories | Comprehensive Exploration of Dark Matter Theories and Alternatives |
physorg | multiverse theories | Exploring Multiverse Theories: Concepts, Evidence, and Implications |
physorg | synthetic biology software | Software Tools and Technological Advances in Synthetic Biology |
As additional news sources are supported, samples based on them are intended to be added.
Setup
Common setup
- In the working directory, create a file named `.env`, with the intended environment variable `OPENAI_API_KEY=<your OpenAI API key>`, or set it in a different way.
- Continue the setup via GitHub or PyPI as below.
Setup via GitHub using devcontainer
- Continue from the common setup steps.
- Clone or download this repo.
- Build and provision the defined devcontainer.
Setup via GitHub manually
- Continue from the common setup steps.
- Clone or download this repo.
- Ensure that `rye` is installed and available.
- In the repo directory, run `rye sync --no-lock`.
Setup via PyPI
- Continue from the common setup steps.
- Create and activate a Python 3.12 devcontainer or virtual environment.
- Install via PyPI: `pip install -U newssurvey`.
Usage
Usage can be as a command-line application or as a Python library.
Usage considerations
- Refining the query text over a few iterations is often essential for receiving a sufficiently tailored response.
- Run only a single instance of the application at a time. Otherwise, the source website and OpenAI can impose aggressive throttling. This is also enforced at the application level by the use of a lock file.
- Do not browse the source website from the same IP address when a search is running, as this will result in throttling errors.
Usage as application
In the simplest case, run `python -m newssurvey` to interactively start the application. You will be prompted for the necessary information.
For non-interactive use, the usage help is copied below:
$ python -m newssurvey -h
Usage: python -m newssurvey [OPTIONS]
Generate and write a response to a question or concern using a supported news source.
A single instance of this method is enforced.
Options:
-s, --source TEXT Name of supported news source. If not given, the user is prompted for it.
-q, --query TEXT Question or concern answerable by the news source. If a path to a file, the file
text is read as text. If not given, the user is prompted for it.
-m, --max-sections INTEGER RANGE
Maximum number of sections to include in the response, between 5 and 100. Its
recommended value, also the default, is 100. [5<=x<=100]
-f, --output-format TEXT Output format of the response. It can be txt (for text), md (for markdown), gfm.md
(for GitHub Flavored markdown), html, pdf, or json. If not specified, but if an
output filename is specified via '--output-path', it is determined automatically
from the file extension. If not specified, and if an output filename is not
specified either, its default is txt.
-o, --output-path PATH Output directory path or file path. If intended as a directory path, it must exist,
and the file name is auto-determined. If intended as a file path, its extension can
be txt (for text), md (for markdown), gfm.md (for GitHub Flavored markdown), html,
pdf, or json. If not specified, the output file is written to the current working
directory with an auto-determined file name. The response is written to the file
except if there is an error.
-c, --confirm / -nc, --no-confirm
Confirm as the workflow progresses. If `--confirm`, a confirmation is interactively
sought as each step of the workflow progresses, and this is the default. If `--no-
confirm`, the workflow progresses without any confirmation.
-h, --help Show this message and exit.
Usage examples:
$ python -m newssurvey -s medicalxpress -q "safe strategies for weight loss" -f txt -o ~ -nc
$ python -m newssurvey -s medicalxpress -q ./my_detailed_medical_concern.txt -f html -o ~/output.html -c
$ python -m newssurvey -s physorg -q ./my_science_query.txt -f pdf -o ./work/ -m 10
Usage as library
>>> from newssurvey import generate_response
>>> import inspect
>>> print(inspect.signature(generate_response))
(source: str, query: str, max_sections: int = 100, output_format: Optional[str] = 'txt', confirm: bool = False) -> newssurvey.types.Response
>>> print(inspect.getdoc(generate_response))
Return a response for the given source and query.
The returned response contains the attributes: format, title, response.
The progress is printed to stdout.
Params:
* `source`: Name of supported news source.
* `query`: Question or concern answerable by the news source.
* `max_sections`: Maximum number of sections to include in the response, between 5 and 100. Its recommended value, also the default, is 100.
* `output_format`: Output format. It can be txt (for text), md (for markdown), gfm.md (for GitHub Flavored markdown), html, pdf, or json. Its default is txt.
* `confirm`: Confirm as the workflow progresses. If true, a confirmation is interactively sought as each step of the workflow progresses. Its default is false.
If failed, a subclass of the `newssurvey.exceptions.Error` exception is raised.
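A call based on the signature and docstring above might look like the following sketch. It has not been verified against the package. The helpers `report_filename` and `write_report` are hypothetical, and the code assumes that `response.response` is a string and `response.format` is the format name when the output format is `txt`. Running `write_report` needs a funded `OPENAI_API_KEY`, network access, and incurs LLM cost.

```python
from pathlib import Path


def report_filename(title: str, output_format: str = "txt") -> str:
    """Build a filesystem-friendly file name from a report title (hypothetical helper)."""
    stem = "".join(c if c.isalnum() else "_" for c in title).strip("_")
    return f"{stem or 'report'}.{output_format}"


def write_report(source: str, query: str, out_dir: Path) -> Path:
    """Generate a text report and write it into out_dir."""
    # Imported lazily so this module loads even without newssurvey installed.
    from newssurvey import generate_response
    from newssurvey.exceptions import Error

    try:
        response = generate_response(
            source=source,
            query=query,
            max_sections=10,  # Fewer sections than the default of 100, to limit cost.
            output_format="txt",
            confirm=False,
        )
    except Error as exc:
        raise SystemExit(f"newssurvey failed: {exc}")

    # Assumption: for the txt format, response.response is a str.
    path = out_dir / report_filename(response.title, response.format)
    path.write_text(response.response, encoding="utf-8")
    return path
```

For example, `write_report("physorg", "dark matter theories", Path.cwd())` would write the report to the current working directory, with progress printed to stdout.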
Cache
An extensive disk cache is stored locally to cache website and LLM outputs with a fixed expiration period. This is in the `[src]/newssurvey/.diskcache` directory. The expiration period is 1 week for website searches and 52 weeks for everything else, also subject to separate disk usage limits. To reuse the cache, rerun the same user query within this period. To bypass the cache, alter the user query or delete the appropriate cache subdirectory. Updates to the LLM prompts will also bypass the cache.
The LLM is prompted to always output in a basic text format. Following this, the text is structured into the user-requested output format without using the LLM. Rewriting the output into a new format is therefore possible offline until the earliest cache expiration, typically for 1 week.
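The two expiration periods explain why offline reformatting typically works for 1 week: a rerun needs every cached item to still be valid, and website searches expire first. The sketch below only illustrates that rule. The kind names (`"search"`, `"llm"`) and the `is_fresh` helper are hypothetical and do not reflect how the package stores or checks its cache.

```python
from datetime import datetime, timedelta

# Expiration periods as described above; the kind names are assumptions.
EXPIRY_BY_KIND = {"search": timedelta(weeks=1)}
DEFAULT_EXPIRY = timedelta(weeks=52)


def is_fresh(kind: str, stored_at: datetime, now: datetime) -> bool:
    """Return True if a cached item of the given kind has not yet expired."""
    return now - stored_at < EXPIRY_BY_KIND.get(kind, DEFAULT_EXPIRY)


stored = datetime(2024, 1, 1)
later = stored + timedelta(days=8)
print(is_fresh("search", stored, later))  # prints False
print(is_fresh("llm", stored, later))     # prints True
```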
Disclaimer
This software is provided as a proof-of-concept application and is distributed under the LGPL license. It is offered without any guarantees or warranties, either expressed or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
Users are responsible for ensuring that they have the necessary API keys, permissions, and access to third-party services such as the OpenAI API, which are required for full functionality. The costs associated with using the OpenAI API, including those outlined in this documentation, are subject to change and must be monitored independently by the user.
The software relies on third-party services and content from news sites. The availability, accuracy, or relevance of content from these external sources cannot be guaranteed, nor can the continued accessibility of these services be ensured in the future. The accuracy and reliability of reports generated by the software depend on the quality of input queries, availability of articles, and the performance of language models, all of which are subject to change and influenced by external factors beyond the control of the software.
While efforts have been made to optimize the performance and output of this software, users should independently verify any information generated, particularly if it is intended for use in professional, medical, scientific, technical, legal, or other high-stakes contexts. Use of this software is at your own risk. This software should not be used as the sole basis for any serious, life-impacting decisions. Always consult relevant professionals or authoritative sources directly for such purposes.
By using this software, you agree that its developers and contributors shall not be held liable for any damages, costs, or losses arising from its use, including but not limited to direct, indirect, incidental, consequential, or punitive damages. Users are encouraged to thoroughly review its source code to understand the workings of the application and assess its suitability for their intended use.
The authors do not claim ownership of any content generated using this software. Responsibility for the use of any and all generated content rests with the user. Users should exercise caution and due diligence to ensure that generated content does not infringe on the rights of third parties.
This disclaimer is subject to change without notice. It is your responsibility to review it periodically for updates.