Skip to main content

No project description provided

Project description

Auctus Search

Discover and Load Datasets in your Notebook

with ease-of-use

Python 3.9+ Skrub Jupyter Version Unstable

Search for datasets using Auctus and integrate them seamlessly into your Notebook exploration!


Auctus Search Cover

[!IMPORTANT]

  1. We highly recommend to explore the /example folder for Jupyter Notebook-based tutorials 🎉
  2. The following library is under active-development and is not yet stable. Expect bugs & frequent changes!
  3. Marimo is not yet supported/nor-tested but is in discussion for future releases.

🌆 Auctus Search –– In a Nutshell

Auctus Search is a lightweight library that connects to the Auctus API, allowing easy search, filtering, and loading of datasets.

It offers an easy way to find datasets .search_datasets(search_query="Taxis"), preview them interactively .display(), optionally filter them .with_types(["spatial"]) or .with_score_greater_than(20) to name a few, and integrate them into your notebook workflow as pandas.DataFrame or geopandas.GeoDataFrame objects, .load_selected_dataset().

For a more advanced usage, you can even .profile_selected_dataset() which uses Data Profile Vis under the hood. See further in the API's section.

For developers, it also allows you to integrate it all into your project, have a look at the Auctus Search Mixin in the OSMNxMapping – It is fully integrated for the user to benefits from the Auctus Search capabilities and most importantly the great Auctus API as a whole.

See further notebook-based examples in the examples/ directory. 📓


🥐 Installation

We highly recommend using uv for installation from source to avoid the hassle of Conda or other package managers. It is also the fastest known to date on the OSS market and manages dependencies seamlessly without manual environment activation (Biggest flex!). If you do not want to use uv, there are no issues, but we will cover it in the upcoming section; but in the incoming documentation.

First, ensure uv is installed on your machine by following these instructions.

Prerequisites

  • Install uv as described above.
  • Clone Auctus Search (required for alpha development) into your desired directory. Use:
    git clone git@github.com:VIDA-NYU/auctus_search.git
    
    This step ensures pyproject.toml builds auctus_search from source during installation, though we plan for auctus_search to become a PyPi package (uv add auctus_search or pip install auctus_search) in future releases.

Steps

  1. Jump into the Auctus Search repository:
    cd auctus_search
    
  2. Lock and sync dependencies with uv:
    uv lock
    uv sync
    
  3. (Recommended) Install Jupyter extensions for interactive features requiring Jupyter widgets:
    uv run jupyter labextension install @jupyter-widgets/jupyterlab-manager
    
  4. Launch Jupyter Lab to explore Auctus Search (Way faster than running Jupyter without uv):
    uv run --with jupyter jupyter lab
    

[!NOTE]
Future versions will simplify this process: auctus_search will move to PyPi, removing the need for manual cloning, and Jupyter extensions will auto-install via pyproject.toml configuration.

Voila 🥐! You’re all set to explore Auctus Search in Jupyter Lab.


Getting Started!

Below is a concise, step-by-step example of how to use the Auctus Search library in a Jupyter notebook.

Cell 1: Import the Library

from auctus_search import AuctusSearch
# This imports the main `AuctusSearch` class, which provides all the functionality we'll use.

Cell 2: Initialise An AuctusSearch Instance

search = AuctusSearch()  # Create an instance of `AuctusSearch` to start searching for datasets. This object will handle all interactions with the Auctus API and dataset management.

Cell 3: Search for Datasets

collection = search.search_datasets(search_query="Taxis", display_initial_results=True)

# Search for datasets related to "Taxis" (very broad right!). The `search_datasets` method queries the Auctus API and returns a
# `DatasetCollection`. Setting `display_initial_results=True` shows the initial results interactively in the notebook,
# allowing you to see available datasets right away.

# More parameters such as page and size for pagination are available, but we'll stick to the defaults for now. Readers are instructed to check the API below for more details.

Cell 4: Filter the Dataset Collection

filtered_collection = (
    collection
    .with_types(["spatial"])  
    # Refine the search results to only include datasets that at least have a spatial component.
    .with_number_of_rows_greater_than(100000)
    # Refine further to – after the with_types– only include datasets with more than 100,000 rows.
)

Cell 5: Display Filtered Datasets Interactively

filtered_collection.display()

# Display the filtered datasets in an interactive grid. Each dataset is shown as a card with details like name, source,
# and size. You can click "Select This Dataset" on any card to choose one for further use.

Cell 6: Load the Selected Dataset

dataset = search.load_selected_dataset()

# After selecting a dataset in the previous step, this loads it into memory as a `pandas.DataFrame` (or
# `geopandas.GeoDataFrame` if spatial). By default, it also displays an interactive table preview of the dataset.

Are you coping with the idea of Auctus Search a lightweight jupyter-focussed wrapper around the Auctus API?

Want more filtering actions? Have more advanced usage? Check the API below for more details on how to filter datasets.

Enjoy! 🥐


🗺️ Roadmap / Future Work

[!NOTE]
For more about future works, explore the issues tab above!

  1. From labs to more general communities, we want to advance Auctus Search by attaining large unit-test coverage, integrating routines via G.Actions, and producing thorough documentation for users all around.

  2. It would be very interesting to explore interfacing the whole management of the Auctus API so that we could add any alternative to Auctus to have a pretty large library being able to target multiple dataset collection APIs. Such as: https://lil.law.harvard.edu/blog/2025/02/06/announcing-data-gov-archive/

We are also looking forward to seeing more examples in the examples/ directory; Yet in the meantime, we are happy to welcome you to contribute to the library 🎄


🌁 API

[!IMPORTANT] The following project is fully python-typed safe and uses the great @beartype! It should reduce side effects and better library usability on the user end side.

The Auctus Search API is split into two main parts: the AuctusSearch class for searching, profiling, and loading datasets, and the AuctusDatasetCollection class for filtering and displaying results. Here's the rundown:

AuctusSearch

Your main entry point for searching, profiling, and loading datasets.

search_datasets(search_query, page=1, size=10, display_initial_results=False)
  • Purpose: Searches the Auctus API for datasets matching your query.
  • Parameters:
    • search_query (str or list): Search term(s) (e.g., "Taxis" or ["Taxis", "NYC"] – could also be "Taxis NYC").
    • page (int, default=1): Page number of results for pagination. Works with size; a higher size means fewer pages, while a lower size increases the number of pages.
    • size (int, default=10): Number of results per page.
    • display_initial_results (bool, default=False): If True, displays initial results in a Jupyter notebook cell.
  • Returns: An AuctusDatasetCollection object containing the search results.
  • Example:
    from auctus_search import AuctusSearch
    search = AuctusSearch()
    collection = search.search_datasets(search_query="Taxis", page=1, size=100)  # Fetches all "Taxis" data without pagination (may take longer and require scrolling). Adjust `size` and `page` as needed.
    
profile_selected_dataset()
  • Purpose: Displays an interactive data profile summary of the selected dataset using the Data Profile Viz library. Requires a dataset to be selected (via search_datasets(.)) and its metadata to be available.
  • Parameters: None
  • Returns: None (displays the profile interactively in the notebook)
  • Raises:
    • ValueError if no dataset is selected or if metadata is missing.
  • Example:
    from auctus_search import AuctusSearch
    search = AuctusSearch()
    collection = search.search_datasets(search_query="Taxis")
    collection.display()  # Displays dataset cards; select one by clicking "Select This Dataset"
    search.profile_selected_dataset()  # Shows the interactive profile
    

Note that most probably, an profile_edit_selected_dataset(.) could soon see the light of day. See further in https://github.com/soniacq/DataProfileVis.

load_selected_dataset(display_table=True)
  • Purpose: Downloads and loads the dataset you selected from the collection (after clicking Select This Dataset).
  • Parameters:
    • display_table (bool, default=True): If True, shows a preview table using Skrub.
  • Returns: A pandas.DataFrame or geopandas.GeoDataFrame (currently supports CSV; more formats coming soon!).
  • Raises: ValueError if no dataset is selected.
  • Example:
    dataset = search.load_selected_dataset()  # Ensure a dataset is selected first, or it raises a ValueError.
    
interactive_table_display(dataframe, n_rows=10, order_by=None, title="Table Report", column_filters=None, verbose=1)
  • Purpose: Displays an interactive table of your loaded dataset in Jupyter.
  • Parameters:
    • dataframe (pandas.DataFrame or geopandas.GeoDataFrame): The dataset to display.
    • n_rows (int, default=10): Number of rows to show.
    • order_by (str or list, optional): Column(s) to sort by.
    • title (str, optional): Table title.
    • column_filters (dict, optional): Filters for columns (e.g., {"city": {"eq": "NYC"}}).
    • verbose (int, default=1): Verbosity level.
  • Returns: None (displays the table in the notebook).
  • Example:
    search.interactive_table_display(dataset, n_rows=5, title="Taxis Data")
    

AuctusDatasetCollection

A helper class to filter and explore datasets returned from a search. It supports chaining filter methods, making it ideal for interactive use in Jupyter notebooks compared to parameter-heavy alternatives.

Filtering Methods
  • with_types(types)

    • Purpose: Filters datasets by dataset types (e.g., "spatial", "temporal", "numerical", "categorical").
    • Parameters:
      • types (list): List of desired types, e.g., ["spatial", "temporal"].
    • Returns: A new AuctusDatasetCollection.
    • Example:
      filtered = collection.with_types(["spatial"])
      
  • with_number_of_rows_greater_than(min_rows)

    • Purpose: Keeps datasets with more than min_rows rows.
    • Parameters:
      • min_rows (int): Minimum number of rows.
    • Returns: A new AuctusDatasetCollection.
    • Example:
      filtered = collection.with_number_of_rows_greater_than(500)
      
  • with_number_of_rows_less_than(max_rows)

    • Purpose: Keeps datasets with fewer than max_rows rows.
    • Parameters:
      • max_rows (int): Maximum number of rows.
    • Returns: A new AuctusDatasetCollection.
  • with_number_of_rows_between(min_rows, max_rows)

    • Purpose: Filters datasets with rows between min_rows and max_rows.
    • Parameters:
      • min_rows (int): Minimum number of rows.
      • max_rows (int): Maximum number of rows.
    • Returns: A new AuctusDatasetCollection.
  • with_number_of_columns_greater_than(min_columns)

    • Purpose: Keeps datasets with more than min_columns columns.
    • Parameters:
      • min_columns (int): Minimum number of columns.
    • Returns: A new AuctusDatasetCollection.
  • with_number_of_columns_less_than(max_columns)

    • Purpose: Keeps datasets with fewer than max_columns columns.
    • Parameters:
      • max_columns (int): Maximum number of columns.
    • Returns: A new AuctusDatasetCollection.
  • with_number_of_columns_between(min_columns, max_columns)

    • Purpose: Filters datasets with columns between min_columns and max_columns.
    • Parameters:
      • min_columns (int): Minimum number of columns.
      • max_columns (int): Maximum number of columns.
    • Returns: A new AuctusDatasetCollection.
  • with_score_greater_than(min_score)

    • Purpose: Keeps datasets with a relevancy score above min_score.
    • Parameters:
      • min_score (int or float): Minimum score.
    • Returns: A new AuctusDatasetCollection.
    • Example:
      filtered = collection.with_score_greater_than(20)
      
  • with_score_less_than(max_score)

    • Purpose: Keeps datasets with a score below max_score. (Less useful since higher scores indicate better relevancy, but included for flexibility.)
    • Parameters:
      • max_score (int or float): Maximum score.
    • Returns: A new AuctusDatasetCollection.
  • with_score_between(min_score, max_score)

    • Purpose: Filters datasets with scores between min_score and max_score.
    • Parameters:
      • min_score (int or float): Minimum score.
      • max_score (int or float): Maximum score.
    • Returns: A new AuctusDatasetCollection.
preview()
  • Purpose: Prints a summary of the dataset collection (search query, filters, and count).
  • Returns: None (prints to console).
  • Example:
    filtered.preview()
    
display()
  • Purpose: Shows an interactive grid of dataset cards in Jupyter for you to select one.
  • Returns: None (displays in notebook).
  • Example:
    filtered.display()
    

📓 Examples

Check out the examples/ directory in the Auctus Search repo for more detailed Jupyter notebook examples.


Licence

Auctus Search is released under the MIT Licence.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

auctus_search-0.1.0.tar.gz (27.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

auctus_search-0.1.0-py3-none-any.whl (20.8 kB view details)

Uploaded Python 3

File details

Details for the file auctus_search-0.1.0.tar.gz.

File metadata

  • Download URL: auctus_search-0.1.0.tar.gz
  • Upload date:
  • Size: 27.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.11

File hashes

Hashes for auctus_search-0.1.0.tar.gz
Algorithm Hash digest
SHA256 976047a12e0ca53e2f673b3918ed47acff5c7de798665c5f52872e43a80309a0
MD5 ffd90e61434cf9a372487ab2ca4e9527
BLAKE2b-256 c26129e79c310cc47eda4a0fdf31a99d9c75e9f3da24cd479d080ca29e1fca00

See more details on using hashes here.

File details

Details for the file auctus_search-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for auctus_search-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e0f81d44faee23f749d90e89cd5918acd526b8774df0340033371c262be609d0
MD5 e8903f8d40d3067cbb2e3c8e7ff52ec9
BLAKE2b-256 c5b1f037b09a504447ebd24972041a489faf140c72094e116fb481e06ce75439

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page