Auctus Search
Discover and load datasets in your notebook with ease.
Search for datasets using Auctus and integrate them seamlessly into your Notebook exploration!
[!IMPORTANT]
- We highly recommend exploring the `examples/` folder for Jupyter Notebook-based tutorials 🎉
- This library is under active development and is not yet stable. Expect bugs and frequent changes!
- Marimo is not yet supported or tested, but it is in discussion for future releases.
🌆 Auctus Search –– In a Nutshell
Auctus Search is a lightweight library that connects to the
Auctus API,
allowing easy search, filtering, and loading of datasets.
It offers an easy way to find datasets (`.search_datasets(search_query="Taxis")`),
preview them interactively (`.display()`), optionally filter them (`.with_types(["spatial"])` or `.with_score_greater_than(20)`, to name a few),
and integrate them into your notebook workflow
as `pandas.DataFrame` or `geopandas.GeoDataFrame` objects (`.load_selected_dataset()`).
For more advanced usage, you can even call `.profile_selected_dataset()`, which uses
Data Profile Vis under the hood. See the API section further below.
For developers, you can also integrate it all into your own project: have a look at the Auctus Search mixin
in OSMNxMapping, where it is fully integrated so users benefit from
Auctus Search's capabilities and, most importantly, the great Auctus API as a whole.
See further notebook-based examples in the examples/ directory. 📓
🥐 Installation
We highly recommend using uv to install from source: it avoids the hassle of Conda or other package managers, is among
the fastest Python package managers available today, and handles dependencies seamlessly without manual environment
activation (biggest flex!). If you prefer not to use uv, no problem; alternatives will be covered in the upcoming
documentation.
First, ensure uv is installed on your machine by
following these instructions.
Prerequisites
- Install `uv` as described above.
- Clone Auctus Search (required for alpha development) into your desired directory:

  ```bash
  git clone git@github.com:VIDA-NYU/auctus_search.git
  ```

  This step ensures `pyproject.toml` builds `auctus_search` from source during installation, though we plan for `auctus_search` to become a PyPI package (`uv add auctus_search` or `pip install auctus_search`) in future releases.
Steps
1. Jump into the Auctus Search repository:

   ```bash
   cd auctus_search
   ```

2. Lock and sync dependencies with `uv`:

   ```bash
   uv lock
   uv sync
   ```

3. (Recommended) Install Jupyter extensions for interactive features requiring Jupyter widgets:

   ```bash
   uv run jupyter labextension install @jupyter-widgets/jupyterlab-manager
   ```

4. Launch Jupyter Lab to explore Auctus Search (way faster than running Jupyter without `uv`):

   ```bash
   uv run --with jupyter jupyter lab
   ```
[!NOTE]
Future versions will simplify this process: `auctus_search` will move to PyPI, removing the need for manual cloning, and Jupyter extensions will auto-install via `pyproject.toml` configuration.
Voila 🥐! You’re all set to explore Auctus Search in Jupyter Lab.
Getting Started!
Below is a concise, step-by-step example of how to use the Auctus Search library in a Jupyter notebook.
Cell 1: Import the Library
```python
from auctus_search import AuctusSearch

# This imports the main `AuctusSearch` class, which provides all the functionality we'll use.
```
Cell 2: Initialise an AuctusSearch Instance

```python
search = AuctusSearch()

# Create an instance of `AuctusSearch` to start searching for datasets.
# This object will handle all interactions with the Auctus API and dataset management.
```
Cell 3: Search for Datasets
```python
collection = search.search_datasets(search_query="Taxis", display_initial_results=True)

# Search for datasets related to "Taxis" (very broad, right?). The `search_datasets` method queries the Auctus API
# and returns a `DatasetCollection`. Setting `display_initial_results=True` shows the initial results interactively
# in the notebook, allowing you to see available datasets right away.
# More parameters, such as `page` and `size` for pagination, are available, but we'll stick to the defaults for now;
# check the API section below for more details.
```
Cell 4: Filter the Dataset Collection
```python
filtered_collection = (
    collection
    # Refine the search results to only include datasets that have at least a spatial component.
    .with_types(["spatial"])
    # Refine further (after `with_types`) to only include datasets with more than 100,000 rows.
    .with_number_of_rows_greater_than(100000)
)
```
Cell 5: Display Filtered Datasets Interactively
```python
filtered_collection.display()

# Display the filtered datasets in an interactive grid. Each dataset is shown as a card with details like name, source,
# and size. You can click "Select This Dataset" on any card to choose one for further use.
```
Cell 6: Load the Selected Dataset
```python
dataset = search.load_selected_dataset()

# After selecting a dataset in the previous step, this loads it into memory as a `pandas.DataFrame` (or a
# `geopandas.GeoDataFrame` if spatial). By default, it also displays an interactive table preview of the dataset.
```
Getting the idea? Auctus Search is a lightweight, Jupyter-focussed wrapper around the Auctus API.
Want more filtering actions? Need more advanced usage? Check the API below for more details on how to filter datasets.
Enjoy! 🥐
🗺️ Roadmap / Future Work
[!NOTE]
For more about future work, explore the issues tab above!
- From labs to more general communities, we want to advance Auctus Search by attaining large unit-test coverage, integrating CI routines via GitHub Actions, and producing thorough documentation for users all around.
- It would be very interesting to explore abstracting the management of the Auctus API so that alternatives to Auctus could be plugged in, yielding a larger library able to target multiple dataset-collection APIs, such as: https://lil.law.harvard.edu/blog/2025/02/06/announcing-data-gov-archive/
- We are also looking forward to seeing more examples in the examples/ directory; in the meantime, we are happy to welcome you to contribute to the library 🎄
🌁 API
[!IMPORTANT] The following project is fully type-annotated and uses the great @beartype! This should reduce side effects and improve library usability on the user's end.
The Auctus Search API is split into two main parts: the AuctusSearch class for searching, profiling, and loading datasets, and the AuctusDatasetCollection class for filtering and displaying results. Here's the rundown:
AuctusSearch
Your main entry point for searching, profiling, and loading datasets.
search_datasets(search_query, page=1, size=10, display_initial_results=False)
- Purpose: Searches the Auctus API for datasets matching your query.
- Parameters:
  - `search_query` (str or list): Search term(s) (e.g., `"Taxis"` or `["Taxis", "NYC"]`; could also be `"Taxis NYC"`).
  - `page` (int, default=1): Page number of results for pagination. Works with `size`; a higher `size` means fewer pages, while a lower `size` increases the number of pages.
  - `size` (int, default=10): Number of results per page.
  - `display_initial_results` (bool, default=False): If `True`, displays initial results in a Jupyter notebook cell.
- Returns: An `AuctusDatasetCollection` object containing the search results.
- Example:

  ```python
  from auctus_search import AuctusSearch

  search = AuctusSearch()
  # Fetch up to 100 "Taxis" results in a single page (may take longer and require
  # scrolling). Adjust `size` and `page` as needed.
  collection = search.search_datasets(search_query="Taxis", page=1, size=100)
  ```
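As a side note, the `page`/`size` interplay described above is just a ceiling division. A small sketch (the total result count used here is hypothetical, purely for illustration):

```python
import math

def pages_needed(total_results: int, size: int) -> int:
    """Number of pages required to cover `total_results` at `size` results per page."""
    return math.ceil(total_results / size)

# A hypothetical 43 matching datasets at the default size of 10 span 5 pages;
# bumping size to 100 fits everything on a single page.
print(pages_needed(43, 10))   # -> 5
print(pages_needed(43, 100))  # -> 1
```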
profile_selected_dataset()
- Purpose: Displays an interactive data profile summary of the selected dataset using the Data Profile Vis library. Requires a dataset to be selected (via `search_datasets(...)` followed by selection) and its metadata to be available.
- Parameters: None.
- Returns: None (displays the profile interactively in the notebook).
- Raises: `ValueError` if no dataset is selected or if metadata is missing.
- Example:

  ```python
  from auctus_search import AuctusSearch

  search = AuctusSearch()
  collection = search.search_datasets(search_query="Taxis")
  collection.display()  # Displays dataset cards; select one by clicking "Select This Dataset"
  search.profile_selected_dataset()  # Shows the interactive profile
  ```
Note that a `profile_edit_selected_dataset(...)` method could most probably see the light of day soon. See https://github.com/soniacq/DataProfileVis for further details.
load_selected_dataset(display_table=True)
- Purpose: Downloads and loads the dataset you selected from the collection (after clicking "Select This Dataset").
- Parameters:
  - `display_table` (bool, default=True): If `True`, shows a preview table using Skrub.
- Returns: A `pandas.DataFrame` or `geopandas.GeoDataFrame` (currently supports CSV; more formats coming soon!).
- Raises: `ValueError` if no dataset is selected.
- Example:

  ```python
  # Ensure a dataset is selected first, or this raises a ValueError.
  dataset = search.load_selected_dataset()
  ```
interactive_table_display(dataframe, n_rows=10, order_by=None, title="Table Report", column_filters=None, verbose=1)
- Purpose: Displays an interactive table of your loaded dataset in Jupyter.
- Parameters:
  - `dataframe` (pandas.DataFrame or geopandas.GeoDataFrame): The dataset to display.
  - `n_rows` (int, default=10): Number of rows to show.
  - `order_by` (str or list, optional): Column(s) to sort by.
  - `title` (str, optional): Table title.
  - `column_filters` (dict, optional): Filters for columns (e.g., `{"city": {"eq": "NYC"}}`).
  - `verbose` (int, default=1): Verbosity level.
- Returns: None (displays the table in the notebook).
- Example:

  ```python
  search.interactive_table_display(dataset, n_rows=5, title="Taxis Data")
  ```
AuctusDatasetCollection
A helper class to filter and explore datasets returned from a search. It supports chaining filter methods, making it ideal for interactive use in Jupyter notebooks compared to parameter-heavy alternatives.
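To illustrate why chaining works well here: each filter returns a new collection rather than mutating the current one. Below is a tiny, hypothetical sketch of that pattern in plain Python (a toy stand-in, not the library's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TinyCollection:
    """Toy stand-in for AuctusDatasetCollection: every filter returns a new object."""
    datasets: tuple = field(default_factory=tuple)

    def with_score_greater_than(self, min_score):
        return TinyCollection(tuple(d for d in self.datasets if d["score"] > min_score))

    def with_number_of_rows_greater_than(self, min_rows):
        return TinyCollection(tuple(d for d in self.datasets if d["rows"] > min_rows))

collection = TinyCollection((
    {"name": "taxi_trips", "score": 30, "rows": 200_000},
    {"name": "taxi_zones", "score": 10, "rows": 50_000},
))
filtered = collection.with_score_greater_than(20).with_number_of_rows_greater_than(100_000)
print([d["name"] for d in filtered.datasets])  # -> ['taxi_trips']
# `collection` itself is untouched, so you can branch off it with different filter chains.
```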
Filtering Methods
- `with_types(types)`
  - Purpose: Filters datasets by dataset types (e.g., `"spatial"`, `"temporal"`, `"numerical"`, `"categorical"`).
  - Parameters: `types` (list): List of desired types, e.g., `["spatial", "temporal"]`.
  - Returns: A new `AuctusDatasetCollection`.
  - Example: `filtered = collection.with_types(["spatial"])`

- `with_number_of_rows_greater_than(min_rows)`
  - Purpose: Keeps datasets with more than `min_rows` rows.
  - Parameters: `min_rows` (int): Minimum number of rows.
  - Returns: A new `AuctusDatasetCollection`.
  - Example: `filtered = collection.with_number_of_rows_greater_than(500)`

- `with_number_of_rows_less_than(max_rows)`
  - Purpose: Keeps datasets with fewer than `max_rows` rows.
  - Parameters: `max_rows` (int): Maximum number of rows.
  - Returns: A new `AuctusDatasetCollection`.

- `with_number_of_rows_between(min_rows, max_rows)`
  - Purpose: Filters datasets with rows between `min_rows` and `max_rows`.
  - Parameters: `min_rows` (int): Minimum number of rows. `max_rows` (int): Maximum number of rows.
  - Returns: A new `AuctusDatasetCollection`.

- `with_number_of_columns_greater_than(min_columns)`
  - Purpose: Keeps datasets with more than `min_columns` columns.
  - Parameters: `min_columns` (int): Minimum number of columns.
  - Returns: A new `AuctusDatasetCollection`.

- `with_number_of_columns_less_than(max_columns)`
  - Purpose: Keeps datasets with fewer than `max_columns` columns.
  - Parameters: `max_columns` (int): Maximum number of columns.
  - Returns: A new `AuctusDatasetCollection`.

- `with_number_of_columns_between(min_columns, max_columns)`
  - Purpose: Filters datasets with columns between `min_columns` and `max_columns`.
  - Parameters: `min_columns` (int): Minimum number of columns. `max_columns` (int): Maximum number of columns.
  - Returns: A new `AuctusDatasetCollection`.

- `with_score_greater_than(min_score)`
  - Purpose: Keeps datasets with a relevancy score above `min_score`.
  - Parameters: `min_score` (int or float): Minimum score.
  - Returns: A new `AuctusDatasetCollection`.
  - Example: `filtered = collection.with_score_greater_than(20)`

- `with_score_less_than(max_score)`
  - Purpose: Keeps datasets with a score below `max_score`. (Less useful since higher scores indicate better relevancy, but included for flexibility.)
  - Parameters: `max_score` (int or float): Maximum score.
  - Returns: A new `AuctusDatasetCollection`.

- `with_score_between(min_score, max_score)`
  - Purpose: Filters datasets with scores between `min_score` and `max_score`.
  - Parameters: `min_score` (int or float): Minimum score. `max_score` (int or float): Maximum score.
  - Returns: A new `AuctusDatasetCollection`.
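The `between` variants read as the composition of their one-sided counterparts. A small, hypothetical sketch of that equivalence on plain dicts (assuming strict bounds, mirroring the "more than"/"fewer than" wording above; the library's actual bound inclusivity may differ):

```python
def rows_greater_than(datasets, min_rows):
    return [d for d in datasets if d["rows"] > min_rows]

def rows_less_than(datasets, max_rows):
    return [d for d in datasets if d["rows"] < max_rows]

def rows_between(datasets, min_rows, max_rows):
    return [d for d in datasets if min_rows < d["rows"] < max_rows]

data = [{"rows": 100}, {"rows": 500}, {"rows": 2_000}]

# Chaining the one-sided filters yields the same result as the `between` form.
print(rows_between(data, 100, 2_000))                       # -> [{'rows': 500}]
print(rows_less_than(rows_greater_than(data, 100), 2_000))  # -> [{'rows': 500}]
```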
preview()
- Purpose: Prints a summary of the dataset collection (search query, filters, and count).
- Returns: None (prints to console).
- Example:
filtered.preview()
display()
- Purpose: Shows an interactive grid of dataset cards in Jupyter for you to select one.
- Returns: None (displays in notebook).
- Example:
filtered.display()
📓 Examples
Check out the examples/ directory in the Auctus Search repo for more
detailed Jupyter notebook examples.
Licence
Auctus Search is released under the MIT Licence.
File details
Details for the file auctus_search-0.1.0.tar.gz.
File metadata
- Download URL: auctus_search-0.1.0.tar.gz
- Upload date:
- Size: 27.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `976047a12e0ca53e2f673b3918ed47acff5c7de798665c5f52872e43a80309a0` |
| MD5 | `ffd90e61434cf9a372487ab2ca4e9527` |
| BLAKE2b-256 | `c26129e79c310cc47eda4a0fdf31a99d9c75e9f3da24cd479d080ca29e1fca00` |
File details
Details for the file auctus_search-0.1.0-py3-none-any.whl.
File metadata
- Download URL: auctus_search-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e0f81d44faee23f749d90e89cd5918acd526b8774df0340033371c262be609d0` |
| MD5 | `e8903f8d40d3067cbb2e3c8e7ff52ec9` |
| BLAKE2b-256 | `c5b1f037b09a504447ebd24972041a489faf140c72094e116fb481e06ce75439` |