A library for providing a unified asyncio API for various data sources
asyncrepo
asyncrepo provides a unified async interface for retrieving data from a variety of sources.
Installation
pip install asyncrepo
Usage
For now, just check out the live tests for some examples:
- AWS S3 Buckets Test
- AWS S3 Objects Test
- Confluence Pages Test
- File CSV Rows Test
- GitHub Repos Test
- Greenhouse Jobs Test
- Jira Issues Test
Motivation
To provide tooling for developers of unified and federated search platforms.
Currently supported repositories
- aws.s3_buckets.S3Buckets: AWS S3 buckets belonging to the current user.
- aws.s3_objects.S3Objects: AWS S3 objects belonging to a bucket.
- confluence.pages.Pages: Confluence pages belonging to a given organization.
- file.csv_rows.CSVRows: CSV rows within a given file specified by filepath or URL.
- github.repos.Repos: GitHub repositories belonging to a given user or organization.
- greenhouse.jobs.Jobs: Greenhouse jobs belonging to a given board.
- jira.issues.Issues: JIRA issues belonging to a given organization.
Supported repository operations
- .get(id: str): Get an item from the repository by its ID.
- .list(): Get an iterator for all items in the repository.
- .list_pages(): Get a paginated iterator for all items in the repository.
- .search(query: str): Get an iterator for all items in the repository that match the query.
- .search_pages(query: str): Get a paginated iterator for all items in the repository that match the query.
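A minimal sketch of how the five operations listed above might fit together. `InMemoryRepo` and this `Item` are hypothetical stand-ins written for the example, not classes from asyncrepo. The sketch assumes that list and search return async iterators and that each page is a list of items; check the live tests for the real constructors and signatures.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Dict, List


@dataclass
class Item:
    """Hypothetical stand-in for an asyncrepo item."""
    id: str
    document: Dict[str, str]


class InMemoryRepo:
    """Hypothetical repository exposing get, list, list_pages, search, search_pages."""

    def __init__(self, rows: Dict[str, Dict[str, str]], page_size: int = 2):
        self._items = [Item(id=key, document=doc) for key, doc in rows.items()]
        self._page_size = page_size

    async def get(self, id: str) -> Item:
        for item in self._items:
            if item.id == id:
                return item
        # asyncrepo documents ItemNotFound for this case; LookupError keeps the sketch standalone.
        raise LookupError(id)

    async def list_pages(self) -> AsyncIterator[List[Item]]:
        for start in range(0, len(self._items), self._page_size):
            yield self._items[start:start + self._page_size]

    async def list(self) -> AsyncIterator[Item]:
        async for page in self.list_pages():
            for item in page:
                yield item

    async def search_pages(self, query: str) -> AsyncIterator[List[Item]]:
        async for page in self.list_pages():
            matches = [item for item in page if query in str(item.document)]
            if matches:
                yield matches

    async def search(self, query: str) -> AsyncIterator[Item]:
        async for page in self.search_pages(query):
            for item in page:
                yield item


async def main() -> List[str]:
    repo = InMemoryRepo(
        {"1": {"name": "alpha"}, "2": {"name": "beta"}, "3": {"name": "alphabet"}}
    )
    found = await repo.get("2")
    names = [item.document["name"] async for item in repo.search("alpha")]
    return [found.document["name"], *names]
```

Calling `asyncio.run(main())` fetches one item by ID and then collects the names of every item matching the query.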
Exceptions
- asyncrepo.exceptions.ItemNotFound: Raised by .get(id: str) if the item does not exist in the repository.
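One way to handle a missing item is to catch the exception and fall back to a default. In the sketch below, `ItemNotFound` is a local stand-in for `asyncrepo.exceptions.ItemNotFound` so the example runs without the library installed, and `FakeRepo` and `get_or_default` are hypothetical names introduced only for illustration.

```python
import asyncio
from typing import Any, Optional


class ItemNotFound(Exception):
    """Local stand-in for asyncrepo.exceptions.ItemNotFound."""


class FakeRepo:
    """Hypothetical repository holding two documents."""

    def __init__(self) -> None:
        self._docs = {"a": {"n": 1}, "b": {"n": 2}}

    async def get(self, id: str) -> Any:
        try:
            return self._docs[id]
        except KeyError:
            raise ItemNotFound(id) from None


async def get_or_default(repo: Any, id: str, default: Optional[Any] = None) -> Any:
    """Return the item with the given ID, or the default if it does not exist."""
    try:
        return await repo.get(id)
    except ItemNotFound:
        return default
```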
Support by repository
Repository | .get | .list | .search | Non-blocking IO | Authentication |
---|---|---|---|---|---|
aws.s3_buckets.S3Buckets | Yes | Yes | Naive | Yes | AWS |
aws.s3_objects.S3Objects | Yes | Yes | Yes | Yes | AWS |
confluence.pages.Pages | Yes | Yes | Yes | Yes | Basic |
file.csv_rows.CSVRows | Naive | Yes | Naive | Yes | None |
github.repos.Repos | Yes | Yes | Yes | Yes | Token |
greenhouse.jobs.Jobs | Yes | Yes | Naive | Yes | None |
jira.issues.Issues | Yes | Yes | Yes | Yes | Basic |
Caveats by repository
†: On the roadmap of things to be addressed.
aws.s3_buckets.S3Buckets
- † Only basic metadata is available about buckets.
- † Currently implemented as a single page repository.
aws.s3_objects.S3Objects
- † No options to get the contents of an object.
- † Only basic metadata is available about objects.
- Search is implemented using the prefix search API.
confluence.pages.Pages
- † No options to limit the repository scope to a specific space.
- The search API seems to occasionally return an empty result set for a query that should have results. This results in fragile live tests for the repository. This may only happen under high concurrency.
- Similar to the above, the API will occasionally return a 500 error when querying under high concurrency.
- † There is a simple retry system in place to address the aforementioned 500 error, but it should be abstracted out into a more general retry system that can be applied to other repositories.
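A more general retry system could look roughly like the helper below. This is a sketch, not the retry code currently in the library: the `retry` function, its parameters, and the exponential backoff policy are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Await operation(), retrying with exponential backoff when it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

A repository could then wrap any flaky request, for example `await retry(lambda: fetch_page(url))`, where `fetch_page` is whatever coroutine function performs the request.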
file.csv_rows.CSVRows
- † There are no options for caching the file. If a URL is used, the file is downloaded every time it is queried (e.g. on every get, search, or list operation). In the future, it would be nice to be able to cache the file either in memory or on disk with some sort of TTL.
- Because CSVs have no natural pages, there is a page_size option that can be used to limit the number of rows returned per page. The default is 20. This allows you to load in some data without loading the entire file into memory.
- Because CSV rows have no natural primary key, the id defaults to the row index. You can change this by passing an id to the repository, which expects either the name of a column or a tuple of column names.
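The paging and ID behavior described above can be sketched with the standard library alone. The helper names `row_id` and `csv_pages`, the zero-based row index, and the `-` separator for multi-column IDs are assumptions for this example; the library's own implementation may differ.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional, Tuple, Union

IdSpec = Optional[Union[str, Tuple[str, ...]]]


def row_id(row: Dict[str, str], index: int, id: IdSpec = None) -> str:
    """Derive an item ID: the row index by default, else one or more columns."""
    if id is None:
        return str(index)
    if isinstance(id, str):
        return row[id]
    return "-".join(row[column] for column in id)  # separator is an assumption


def csv_pages(
    text: str, page_size: int = 20, id: IdSpec = None
) -> Iterator[List[Tuple[str, Dict[str, str]]]]:
    """Yield pages of (id, row) pairs, reading rows lazily from the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    page: List[Tuple[str, Dict[str, str]]] = []
    for index, row in enumerate(reader):
        page.append((row_id(row, index, id), dict(row)))
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page  # final, possibly shorter, page
```

Because `csv_pages` is a generator, a caller that only needs the first page never parses the rest of the input.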
github.repos.Repos
- † May need additional work to mitigate rate limiting issues.
- † Uses PyGithub, which is not async on its own, so the repository patches PyGithub to support async (should consider using a different library like Gidgethub).
- † The get operation can retrieve repositories which are out of scope for the user/organization.
greenhouse.jobs.Jobs
- It is a single page repository.
jira.issues.Issues
- † No options to limit the repository scope to a specific project.
- The .get method accepts either keys or IDs, but the .id for items is always the ID. This is because the ID doesn't change, whereas the key could change by moving the issue to a different project.
Repository quirks
Because this library provides a unified interface for very different sources, all repositories will have some quirks. Here are some of them.
Naive search
Search is not natively supported by all sources. As a workaround, some sources fall back to an implementation that performs a text search on the raw data for each item in the repository.
Naive get
Get is not natively supported by all sources. As a workaround, some sources fall back to an implementation that performs a scan of the entire repository to find the item with the given ID.
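A naive get of this kind can be sketched as a linear scan over the list operation. `ItemNotFound` below is a local stand-in for `asyncrepo.exceptions.ItemNotFound`, and the dictionary item shape with an `"id"` key is an assumption made for the example.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


class ItemNotFound(Exception):
    """Local stand-in for asyncrepo.exceptions.ItemNotFound."""


async def list_items(items: List[Dict[str, Any]]) -> AsyncIterator[Dict[str, Any]]:
    """Stand-in for a repository's list operation."""
    for item in items:
        yield item


async def naive_get(items: AsyncIterator[Dict[str, Any]], id: str) -> Dict[str, Any]:
    """Scan the items in order and return the first one with a matching ID."""
    async for item in items:
        if str(item["id"]) == id:
            return item
    raise ItemNotFound(id)
```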
Single page repositories
Some repositories are based on a flat list of items, rather than being paginated. All items from such repositories are returned as the first and only page.
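In terms of the paginated operations, a single page repository behaves like the sketch below: the page iterator yields exactly once. The function names are hypothetical and pages are assumed to be lists.

```python
import asyncio
from typing import Any, AsyncIterator, List


async def single_page(items: List[Any]) -> AsyncIterator[List[Any]]:
    """Yield every item as the first and only page."""
    yield list(items)


async def collect_pages(pages: AsyncIterator[List[Any]]) -> List[List[Any]]:
    """Gather all pages from an async page iterator into a list."""
    return [page async for page in pages]
```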
Items
All items are represented by an Item object. This object has the following attributes:
- id: A string that uniquely identifies the item, which can be passed to Repository.get to retrieve the item.
- document: A dictionary containing the data for the item.
- repository: The repository that contains the item.
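The three attributes could be modeled roughly as follows. This dataclass illustrates the shape described above and is not the library's actual class definition; `DictRepo` and `refetch` are hypothetical helpers showing how the id and repository attributes work together.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Item:
    id: str                    # unique within the repository
    document: Dict[str, Any]   # the data for the item
    repository: "DictRepo"     # the repository the item came from


class DictRepo:
    """Hypothetical repository backed by a dict, used only for this sketch."""

    def __init__(self, docs: Dict[str, Dict[str, Any]]) -> None:
        self._docs = docs

    async def get(self, id: str) -> Item:
        return Item(id=id, document=self._docs[id], repository=self)


async def refetch(item: Item) -> Item:
    """Fetch a fresh copy of an item through its own repository."""
    return await item.repository.get(item.id)
```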
Wish list
The following is a list of things that might be worked on next.
General improvements
- Addressing caveats as indicated (†).
- Making the live tests runnable in GitHub Actions.
- Write mock tests for the various supported repositories.
- Add a meta repository that can combine multiple repositories.
- Support for non-default orderings.
- Stop subclassing ClientSession from aiohttp because it makes the developer sad.
- More enterprise-friendly implementations. Testing is done on cloud-hosted services and those APIs are often different from the on-premise ones. Submit a ticket or a pull request if you want to help.
- Split dependencies into separate packages depending on which repositories are desired.
- Normalize and extensively document repository constructors. For now, look at the tests or the code.
- Naive get as described isn't actually a default fallback for when get isn't implemented. I should add a default implementation that falls back to looking for the item by using an implemented list method.
- Stable API. Right now, the API is unstable and can change at any time.
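As one possible shape for the meta repository mentioned in this list, the sketch below fans a search out to several repositories concurrently and yields the results in repository order. `MetaRepo`, `ListRepo`, and `collect` are hypothetical names, and the design (gather everything, then yield) is only one option; a streaming merge would be another.

```python
import asyncio
from typing import Any, AsyncIterator, List


async def collect(iterator: AsyncIterator[Any]) -> List[Any]:
    """Drain an async iterator into a list."""
    return [item async for item in iterator]


class ListRepo:
    """Hypothetical repository over a list of strings."""

    def __init__(self, values: List[str]) -> None:
        self._values = values

    async def search(self, query: str) -> AsyncIterator[str]:
        for value in self._values:
            if query in value:
                yield value


class MetaRepo:
    """Hypothetical wrapper that searches several repositories at once."""

    def __init__(self, *repositories: Any) -> None:
        self._repositories = repositories

    async def search(self, query: str) -> AsyncIterator[Any]:
        # Run every underlying search concurrently, then yield in repository order.
        results = await asyncio.gather(
            *(collect(repo.search(query)) for repo in self._repositories)
        )
        for items in results:
            for item in items:
                yield item
```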
Potential Operations
- Repository.create(item)
- Repository.update(item)
- Repository.delete(item)
- Item.save(upsert: bool=True)
- Item.delete()
Potential exceptions
- asyncrepo.exceptions.PermissionDenied: When the user is not authorized to perform the operation.
- asyncrepo.exceptions.OperationNotSupported: When the operation is not supported by the repository.
- asyncrepo.exceptions.ItemAlreadyExists: When the item already exists and upsert is False.
Potential properties
These properties could be added to the Item class to provide useful output for search results and other use cases.
- Item.title: Optional[str]: Page titles, ticket summaries, filenames, etc.
- Item.text: Optional[str]: Page content, ticket descriptions, file contents, etc.
- Item.url: Optional[str]: URL to the page, ticket, file, etc.
- Item.image_url: Optional[str]: Ticket status icon, file thumbnail, etc.
- Item.facets: Dict[str, str]: Potentially an extensive list of facets for items. File type, ticket status, etc.
- Item.created_at: Optional[datetime]: Date and time the item was created.
- Item.updated_at: Optional[datetime]: Date and time the item was last updated.
- Item.created_by: Optional[str]: User that created the item.
- Item.updated_by: Optional[str]: User that last updated the item.
- Item.created_by_url: Optional[str]: URL to the user that created the item.
- Item.updated_by_url: Optional[str]: URL to the user that last updated the item.
- Item.created_by_avatar_url: Optional[str]: URL to the avatar of the user that created the item.
- Item.updated_by_avatar_url: Optional[str]: URL to the avatar of the user that last updated the item.
Potential Repositories
There are so many things that there could be repositories for; this is just a very short list I keep track of for inspiration.
- jira.projects.Projects
- confluence.spaces.Spaces
- confluence.blogs.Blogs
- jenkins.jobs.Jobs
- jenkins.builds.Builds
- elastic.indexes.Indexes
- elastic.documents.Documents
- slack.channels.Channels
- slack.users.Users
- slack.messages.Messages
- pypi.packages.Packages
- google.drive.Files
- google.mail.Mail
- google.calendar.Events
- github.code.Code
Contribution Guidelines
Please submit a ticket if you have an idea for a new feature or you've found a bug. You can also submit a pull request if you have a solution to the problem!
And hey, don't feel anxious about contributing. If you're interested in helping improve this library by submitting a pull request, I'd be extremely happy to hear from you.
Bug fixes
- Create a test which fails because of the identified bug
- Fix the bug
- Ensure the test passes
- Submit a pull request
New features
- Create a test which fails because the new feature is not implemented
- Implement the new feature
- Ensure the test passes
- Submit a pull request
New repository checklist
- Add your new repository to the asyncrepo.repositories module.
- At minimum, your repository should support getting and listing. Fallback to naive search is OK if there isn't a better way.
- Add live tests for your repository. If credentials are required and you have to set up the tests against a private server, outline what credentials are required in the env.dist file and make it clear what data is expected to exist in the test environment.
- Ensure the tests pass
- Document your repository in this file.
- Submit a pull request.