Extensible on-demand file loading with cache management.
Project description
title: README
Synopsis
Extensible on-demand data loader with caching. The library currently provides loaders for the following data sources:
- text (file or URL)
- JSON (file or URL)
- YAML (file or URL)
- Pandas DataFrame (file or database connection)
The user can easily add custom loaders for other data sources by subclassing the LoaderBase class or one of its subclasses (e.g. FileLoaderBase, URLLoaderBase. Submissions of new loaders for inclusion in the library are welcome.
Data is managed by the Manager class. The user registers patterns to map data sources to corresponding loaders and then requests the data via the manager. These patterns are compiled to Python regular expressions which are then matched against data requests sent to the manager to determine which loader to use to handle the request. The argument is then passed through to the loader, which will load the data upon the first request and keep it in memory for subsequent requests until the cache is cleared.
The manager provides several methods for managing the cached data, such as clearing everything, clearing by argument pattern, forcing a reload of all data, reloading only data from sources that report a modification, etc.
See the Usage section below for details.
Links
GitLab
Other Repositories
Usage
Basic
In the following example, we instantiate a manager an configure it to load JSON and YAML filepaths via the JSONFileLoader and YAMLFileLoader, respectively.
# Import the manager and loaders.
from dalood.manager import Manager
from dalood.loader.json import JSONFileLoader
from dalood.loader.yaml import YAMLFileLoader
# Instantiate the manager.
man = Manager()
# Register the JSON file loader for all arguments ending with the JSON extension
# (`.json`). The first argument is a Python regular expression, followed by a
# loader instance.
man.register_loader("^.*\.json$", JSONFileLoader())
# The regular expression syntax is a bit complex for new users. Dalood therefore
# supports other pattern types: glob patterns and literal strings.
# Register the YAML file loader for all arguments ending with the YAML extension
# (`.yaml`). Here we use a simpler glob pattern instead of a regular expression
# pattern
man.register_loader("*.yaml", YAMLFileLoader(), pattern_type="glob")
# For comparison, we would have registered the JSON loader with the following statement.
# man.register_loader("*.json", JSONFileLoader(), pattern_type="glob")
# Now that the loaders are registered, we can load JSON and YAML files by simply
# passing their paths to the manager via the `get()` method:
json_data = man.get("/tmp/examples/foo.json")
yaml_data = man.get("/tmp/examples/bar.yaml")
# The data remains in memory within the manager so subsequent requests for the
# same argument via `get()` will not reload the file from the disk. You can
# check which arguments are in memory by iterating over the manager.
for arg in man:
print("Cached argument:", arg)
# Output:
# /tmp/examples/foo.json
# /tmp/examples/bar.yaml.json
# To force a refresh when requesting the data, pass the `reload` argument
# to`get()`:
json_data = man.get("/tmp/examples/foo.json", reload=True).
# You can also request a reload only if the source file reports a modification
# since the data was loaded by the manager:
json_data = man.get("/tmp/examples/foo.json", refresh=True).
# For application that load large amounts of data it may be desirable to
# periodically clear the cache according to different conditions. The
# `clear_cache()` method is provided for this purpose. Without any arguments,
# all cached data is cleared.
man.clear_cache()
for arg in man:
print("Cached argument:", arg)
# Output: empty
# Re-requeesting data after clearing the cache will simply reload the data from
# the source and cache it again.
# Clearing everything is not always desirable so `clear_cache()` provides
# options to clear the cache by a pattern (e.g. all loaded YAML files), by age
# (e.g everything loaded more than an hour ago), or by last access time (e.g.
# everything that was last accessed more than 20 minutes ago). The following
# would clear all JSON files accessed more than 2 minutes ago:
man.clear_cache("*.json", pattern_type="glob", age={"minutes":2}, by_access_time=True)
Literal Patterns And Customized Loaders
In additional to the regular expression and glob patterns, there are also "literal" patterns that will only match the exact string of the pattern. These can be used to load specific arguments with specific loaders. The following example shows how to associate different CSV file loaders with different files that use different separator characters (comma or space).
from dalood.manager import Manager
from dalood.loader.pandas import DataFrameCSVLoader
man = Manager()
comma_csv_loader = DataFrameCSVLoader(sep=",")
space_csv_loader = DataFrameCSVLoader(sep=" ")
man.register_loader("/tmp/example/file1.csv", comma_csv_loader, pattern_type="literal")
man.register_loader("/tmp/example/file2.csv", comma_csv_loader, pattern_type="literal")
man.register_loader("/tmp/example/file1.csv", space_csv_loader, pattern_type="literal")
# This would be tedious for many different files that cannot be summarized via a
# pattern. In that case, a custom function could make this easier:
def register_comma_csv_loader_for_path(path):
man.register_loader(path, comma_csv_loader, pattern_type="literal")
Pattern Classes
All of the methods that accept a pattern string and optional pattern_type
parameter also accept instances of RegexPattern
, GlobPattern
and LiteralPattern
from dalood.regex.
from dalood.manager import Manager
from dalood.loader.text import TextFileLoader
from dalood.regex import GlobPattern
man = Manager()
pattern = GlobPattern("*.txt")
man.register_loader(pattern, TextFileLoader())
User-Loaded Data
Dalood also provides loaders that simply hold references to user-provided data in order to make it accessible via a common API. For example, the user may wich to build a custom object in memory and then access it via the manager using a simple name:
from dalood.manager import Manager
from dalood.loader.memory import MemoryLoader
man = Manager()
# Assume that the user has defined 2 custom objects: "custom_obj1" and
# "custom_obj2". We can map them to arbitrary names either by passing a dict as
# the "mapping" parameter when instantiating MemoryLoader
mem_loader = MemoryLoader(mapping={"obj1": custom_obj1})
# or afterward using the "map" method.
mem_loader.map("obj2", custom_obj2).
Once mapped in the memory loader, we can register them via the manager:
mem_loader.register_patterns(man)
Now we can access the objects via the manager's "get()" method:
new_var_for_custom_obj1 = man.get("obj1")
User-Defined Loaders
The user can define custom loaders and then register them with a manager using custom patterns:
from dalood.manager import Manager
from dalood.loader.file import FileLoaderBase
# Create a custom loader for "foo" files. We'll make it load the first 10 bytes
# from the file.
import pathlib
class FooFileLoader(FileLoaderBase):
def load(self, src):
path = pathlib.Path(src)
with path.open('rb') as handle:
return handle.read(10)
# Register this loader to handle all arguments ending in ".foo".
man.register_loader("*.foo", FooFileLoader(), pattern_type="glob")
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file dalood-2024.7.tar.gz
.
File metadata
- Download URL: dalood-2024.7.tar.gz
- Upload date:
- Size: 18.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e2ab31a33ee49e759ce1497d308e05477061b756b579a153394d27c860f0c1be |
|
MD5 | ef9855e9492d95029671e55af8773f62 |
|
BLAKE2b-256 | ee5a95e685166307b0a37e49e97f68b6375b840244354944524fdd19fc4ae938 |
File details
Details for the file dalood-2024.7-py3-none-any.whl
.
File metadata
- Download URL: dalood-2024.7-py3-none-any.whl
- Upload date:
- Size: 18.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f27923155510c1e52f75752691596360bbe1fc8d0787f13ff67926f2ab7dc99d |
|
MD5 | 896da1b14cb38f171f9f2f5dd82ac493 |
|
BLAKE2b-256 | de4372f8c8ae2e48506682f6cc5ad0c561178231eb835bfeef7779c9d9708fbb |