File search tool using OpenAI assistant.
Work in progress: a few features are still being polished, and initial feedback is being gathered.
Installation (pip)
pip install lumei
Usage
Example
The following example processes a list of PDF files and extracts the vendor and price data from them. The command requires an OpenAI API key, which can be obtained from https://platform.openai.com/account/api-keys.
lumei \
    --input-files ~/folder_1/*.pdf,~/folder_2/*.pdf \
    --output-file ~/output.json \
    --openai-api-key=<OPENAI_API_KEY> \
    --query="[
        {'name': 'vendor', 'search': 'Name of the vendor who issued the invoice.'},
        {'name': 'price', 'search': 'Total bill from the invoice.'},
        {'name': 'file path', 'attribute': 'FILE_PATH'},
        {
            'names': {'variable 1': 'var1', 'variable 2': 'var2' },
            'command': 'stat -f %SB %input_file_path% && var1=1 && var2=2'
        }
    ]"
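When the same call has to be issued from a script, the argument list can be assembled in Python. The following is a minimal sketch, not part of lumei: `build_lumei_command` is a hypothetical helper, and it assumes that the CLI accepts the query in the single-quoted form that Python's `repr()` produces, as the example above suggests. The API key flag is left out on the assumption that `OPENAI_API_KEY` is set in the environment.

```python
import shlex


def build_lumei_command(input_patterns, output_file, query):
    """Return the argument list for the CLI call shown above (hypothetical helper)."""
    return [
        "lumei",
        "--input-files", ",".join(input_patterns),
        "--output-file", output_file,
        # repr() of a list of dicts yields the single-quoted form used in this README.
        f"--query={query!r}",
    ]


query = [
    {"name": "vendor", "search": "Name of the vendor who issued the invoice."},
    {"name": "price", "search": "Total bill from the invoice."},
    {"name": "file path", "attribute": "FILE_PATH"},
]
command = build_lumei_command(
    ["~/folder_1/*.pdf", "~/folder_2/*.pdf"], "~/output.json", query
)
# Print a shell-quoted version for inspection before running it.
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)`. No shell is involved in that case, so `~` and `*` reach the tool unexpanded; whether lumei expands them itself is not stated here.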
Input Parameters
--input-files
Source files to process. Multiple inputs can be provided, separated by a comma (","). Each input can be a path to a single file or a regex.
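To preview which files a given value would cover, the comma-separated input can be expanded by hand. This is only a sketch under an assumption: it treats each entry as a shell-style wildcard, as in the `~/folder_1/*.pdf` example, and may not match how lumei itself interprets patterns. `expand_input_files` is a hypothetical name.

```python
import glob
import os


def expand_input_files(spec):
    """Split a comma-separated spec and expand each entry to concrete paths."""
    paths = []
    for entry in spec.split(","):
        entry = os.path.expanduser(entry.strip())
        if not entry:
            continue  # ignore empty entries such as a trailing comma
        matches = sorted(glob.glob(entry))
        # Keep an entry that matched nothing so the caller can report it.
        paths.extend(matches if matches else [entry])
    return paths
```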
--output-file
Path of the file that the results will be written to. The value must be a path to a single file. Supported file formats are ".csv", ".xlsx", and ".json". The output file is written only after all results have been obtained.
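Because the output is written only at the end of a run, it can be worth checking the output path before starting. A minimal sketch of such a check follows; `check_output_file` is a hypothetical helper, and treating the extension case-insensitively is an assumption.

```python
from pathlib import Path

# Formats listed in this README.
SUPPORTED_OUTPUT_SUFFIXES = {".csv", ".xlsx", ".json"}


def check_output_file(path):
    """Return the expanded path, or raise if the extension is not supported."""
    candidate = Path(path).expanduser()
    # Assumption: extensions are compared case-insensitively.
    if candidate.suffix.lower() not in SUPPORTED_OUTPUT_SUFFIXES:
        raise ValueError(f"unsupported output format: {candidate.suffix!r}")
    return candidate
```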
--openai-api-key [Optional]
API key for OpenAI, required for the file search functionality. The key can be obtained from https://platform.openai.com/account/api-keys.
Alternatively, set the key as the "OPENAI_API_KEY" environment variable.
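A calling script can mirror the two ways of supplying the key. The sketch below assumes that an explicitly passed value takes priority over the environment variable; the README does not state which one lumei prefers, and `resolve_api_key` is a hypothetical helper.

```python
import os


def resolve_api_key(cli_value=None, env=os.environ):
    """Pick the API key from an explicit value or the OPENAI_API_KEY variable."""
    # Assumption: an explicit value wins over the environment variable.
    key = cli_value or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("no OpenAI API key provided")
    return key
```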
--query
Names and descriptions of the data to search for.
Input should be an array of JSON objects. Each object can use the following keys:
name is the name of the data to search for. It becomes the column name in the result dataset.
search is the description of the data to search for.
attribute is a piece of metadata related to the query. The list of possible attributes can be found below.
command is a bash command whose output is used as the data. The command can reference the file path through the %input_file_path% variable.
names is a map of column names to environment variable names. It is only supported for commands.
Example:
[
    {
        'name': 'vendor',
        'search': 'Name of the vendor who issued the invoice.'
    },
    {
        'name': 'price',
        'search': 'Total bill from the invoice.'
    },
    {
        'name': 'file path',
        'attribute': 'FILE_PATH'
    },
    {
        'names': {'variable 1': 'var1', 'variable 2': 'var2' },
        'command': 'stat -f %SB %input_file_path% && var1=1 && var2=2'
    }
]
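A query can be checked before it is handed to the CLI. The sketch below is not lumei's own parser: it reads the single-quoted form with `ast.literal_eval`, and it assumes that each entry carries exactly one of `search`, `attribute`, or `command`, which the README implies but does not state. `validate_query` is a hypothetical name.

```python
import ast

SOURCES = ("search", "attribute", "command")
ATTRIBUTES = {
    "FILE_PATH", "START_TIMESTAMP", "END_TIMESTAMP",
    "START_DATETIME", "END_DATETIME",
}


def validate_query(text):
    """Parse a query string and check each entry against the rules above."""
    entries = ast.literal_eval(text)  # accepts the single-quoted form
    if not isinstance(entries, list):
        raise ValueError("query must be a list of objects")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not an object")
        present = [key for key in SOURCES if key in entry]
        # Assumption: exactly one data source per entry.
        if len(present) != 1:
            raise ValueError(f"entry {index} needs exactly one of {SOURCES}")
        if "names" in entry and present[0] != "command":
            raise ValueError(f"entry {index}: 'names' is only supported for commands")
        if "name" not in entry and "names" not in entry:
            raise ValueError(f"entry {index} has no 'name' or 'names'")
        if present[0] == "attribute" and entry["attribute"] not in ATTRIBUTES:
            raise ValueError(f"entry {index}: unknown attribute")
    return entries
```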
Possible Attributes
FILE_PATH, START_TIMESTAMP, END_TIMESTAMP, START_DATETIME, END_DATETIME
Standalone Methods
openai_file_search
Example of using the file search method directly, without the CLI.
from lumei import openai_file_search
from typing import Optional

results: Optional[dict[str, str]] = openai_file_search(
    openai_api_key="<OPENAI_API_KEY>",
    input_file_path="~/example_invoice_file.pdf",
    file_search_query={
        "vendor": "Name of the vendor who issued the invoice.",
        "price": "Total bill from the invoice.",
    }
)
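The return type above is Optional, so a script that processes several files has to allow for a missing result. The sketch below shows one way to do that. `search_files` is a hypothetical helper that takes the search function as a parameter, and treating `None` as "no result for this file" is an assumption, since the README does not say when `None` is returned.

```python
from typing import Callable, Optional

SearchFn = Callable[[str], Optional[dict[str, str]]]


def search_files(paths: list[str], search_fn: SearchFn):
    """Run search_fn on each path; return (results by path, paths without a result)."""
    results: dict[str, dict[str, str]] = {}
    missing: list[str] = []
    for path in paths:
        row = search_fn(path)
        if row is None:
            # Assumption: None means the search produced nothing for this file.
            missing.append(path)
        else:
            results[path] = row
    return results, missing


# Possible wiring with lumei, using the call from the example above:
#
#   from lumei import openai_file_search
#
#   def search(path):
#       return openai_file_search(
#           openai_api_key="<OPENAI_API_KEY>",
#           input_file_path=path,
#           file_search_query={"vendor": "Name of the vendor who issued the invoice."},
#       )
#
#   results, missing = search_files(["~/example_invoice_file.pdf"], search)
```

Passing the search function in keeps the loop testable without network access or an API key.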