Docstring parsing for generating argument and config parsers, expediting common coding processes.
Docstr: Python Docstring Parsing for Writing Once
Given properly written docstrings that would be used to create auto-docs through Sphinx, docstr can use those docstrings at runtime to expedite certain programming practices, such as the creation of argument and config parsers and the modular configuration of pipelines of both OOP and functional python components, thereby reducing code duplication.
:warning: WARNING!
Any project that depends upon the tokens resulting from docstr's parsing of some python object's `__doc__` strings will be dependent on the content of the comments in that parsed code! The same applies to any python program that is parsed using the docstr CLI. This is by design, as the point is to use the docstrings to write once and reduce redundancy. Future versions may allow workarounds in certain conditions. Future versions will implement more concise and complete error messages, similar to those of a compiler, for debugging purposes. Support for a 3rd party linter (pylint) will be explored.
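For reference, the kind of Sphinx-ready docstring that docstr is designed to consume looks like the following sketch; the function and its parameters are illustrative, not part of docstr itself:

```python
def fit(data, lr=0.01, epochs=10):
    """Fit a hypothetical model to the given data.

    :param data: The training samples to fit on.
    :type data: list
    :param lr: Learning rate for the optimizer.
    :type lr: float
    :param epochs: Number of passes over the data.
    :type epochs: int
    """
    ...
```

The parameter names, types, and defaults are already written once in the docstring for Sphinx's sake, which is exactly the redundancy docstr aims to exploit at runtime.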
Design Principles
- Write once
  - Reduce redundant code so that common, similar code does not need to be rewritten for the same functionality.
- Keep it simple
  - And functional.
- Modularity
  - Keep as modular as possible such that each unit may be removed, replaced, or taken and plugged into a different system.
- Efficiency
  - Efficient code creation and execution.
Docstr Pipeline Use
- Load Input: give the objects whose docstrings are to be parsed
  - pass those objects as args to a function
  - decorate those objects with a docstr decorator
  - point to those objects via namespace in a configuration file
- Parse the object's docstrings, possibly in a hierarchical fashion if multiple.
- dependent on docutils and sphinx for conversion of styles to rst
- uses regex for parsing and tokenizing the docstrings
- Tokenize: produce tokens from the parsed docstrings
- [-TODO-] Syntax check the docstrings to ensure they are properly written
- may be done live during the parsing process, or afterwards.
- [-TODO-] Optionally syntax check the docstrings to the objects to ensure they match expectations, e.g., the args in doc are as they are expected by a function.
- tokenized docstrings enable colorized docstrings in editors.
- The tokenized objects need to be in a useful and general enough format that eases integration into downstream "compile" software, such as ([-TODO-]) pydantic for type checking, or ConfigArgParse for CLI and config parsers (done).
- [-TODO-] Allow for "pre-parsed" docstrings in the case where a docstring is from a 3rd party and is not supported by existing docstr parsing standards. In config files, this can be specified in its own section.
- "Compile and Run": Using the tokens, setup and perform operations
- Command Line Interface: argparse auto creation
- Configuration file parser: ConfigArgParse auto creation
- Enables Pipeline running of code if docstr is used on a class that has a run() or main().
- [-TODO-] may hook into or rely upon: Dagster, ray, asyncio
- [-TODO-] Decorators may be applied to functions/classes in the config file
- this may allow for applying Ray to functions/classes.
- [-TODO-] Allow for configuration files to link to other config files, if so desired.
- [-TODO-] Meta-programming (Reflection)
  - Enabling type-checking versions of the parsed code (pydantic?).
  - Splat extension for "write once": if docstr parses the code, it may automatically generate the boilerplate code based on the parsed docstring, thus reducing writing redundancy.
    - May require linters (pylint, flake8) to acknowledge this.
  - `docstr.meta.dataclass`: make any class a dataclass given a correct docstring.
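The tokenizing step above is described as regex-based. As a minimal, hypothetical sketch of that idea (not docstr's actual implementation), the `:param:` and `:type:` fields of an RST docstring can be pulled out with the standard library:

```python
import re

DOC = """Example docstring.

:param lr: Learning rate.
:type lr: float
:param epochs: Number of epochs.
:type epochs: int
"""

# One regex per field kind; docstr's real tokenizer is more involved
# and handles full RST via docutils/sphinx.
PARAM_RE = re.compile(r"^:param (\w+): (.+)$", re.MULTILINE)
TYPE_RE = re.compile(r"^:type (\w+): (.+)$", re.MULTILINE)

params = dict(PARAM_RE.findall(DOC))
types = dict(TYPE_RE.findall(DOC))
print(params)  # {'lr': 'Learning rate.', 'epochs': 'Number of epochs.'}
print(types)   # {'lr': 'float', 'epochs': 'int'}
```

Each argument's description and type are then available as tokens that downstream "compile" steps, such as an argument-parser generator, can consume.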
Prototype
The prototype provides the following features, albeit within constrained use cases, as it is a prototype. First is the pipeline as followed by the prototype:
- Load Input:
  - When `docstr` is installed, a console script is included. Executing `docstr path/to/a_python_program_config.yaml [arguments for the python program]` will run the python program based on docstr's parsing of the docstrings and configuration file. The prototype will read in the config, adjust docstr's parser settings to those under the `docstr:` options section, and parse the docstrings starting with the entry object (the first python object in the configuration). The parsed docstring tokens will be used to make the `ConfigArgParse.ArgumentParser` for the python program, whose values are then updated using the given args and config file. The program then runs with those values. Look at `docstr/cli/cli.py:docstr_cap()` for the function that the `docstr` command runs.
- Parse and Tokenize
  - Uses Sphinx autodoc parsing (relying on docutils too) with the napoleon extension to support NumPy and Google docstring styles.
  - Creates a tree of configurable objects consisting of python classes and functions. Only whitelisted objects are parsed, which is currently inferred from the python namespace imports under the `docstr` section of the yaml config. Note that the namespace imports expect these objects to be accessible within the current python environment.
  - Supports subclasses of dataclass and NamedTuple, albeit without unit test coverage.
- Compile: ConfigArgParse Generation
  - Generates the ConfigArgParse (CAP) for CLI and configuration creation based on the given python program's yaml file.
  - This is then usable to configure and run the python program through the `docstr` CLI.
- Initialize and Run
  - Initializes the objects from the generated CAP, starting from the leaves and going up to the root of the python program, based on the given configuration yaml file.
  - Once initialized, the python program will run using the entry object and the given `main` string indicator of which function/method is the main method.
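Putting the steps above together, a program's configuration file might look roughly like the following sketch. The exact keys and layout here are illustrative assumptions based on the description above (a `docstr:` options section, namespace imports for the whitelist, an entry object, and a `main` indicator), not docstr's documented schema:

```yaml
docstr:
  style: numpy                       # docstr parser settings (illustrative key)
  namespace:
    - from my_package import Pipeline  # whitelist via namespace imports
my_package.Pipeline:                 # entry object: first python object in the config
  main: run                          # which function/method is the main method
  lr: 0.01                           # values for the entry object's arguments
```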
Included Specific Features
These are the specific features already included by the prototype, albeit in a limited form. For all of these specific features, basic unit tests exist.
- Auto-generation of a ConfigArgParse parser supporting CLI, config files, and env variables (untested in prototype). The CAP provides a basic nested namespace and carries the configurable object to be instantiated.
- functions and class methods
- classes
- Basic linking of docstrings:
  - Recursive linking: an object that is an argument of the object currently being parsed will itself be parsed if it is included in the namespace/whitelist of the `docstr.parsing.DocstringParser`.
  - This enables hierarchical pipelines as specified by the given config yaml file.
- Supports reStructuredText, Google, and NumPy styled docstrings
  - Support is through Sphinx autodoc with the napoleon extension.
- (Scuffed, but it works for a single program) Hierarchical pipeline support as specified by a yaml config file
  - This allows programmers to focus on OOP design and create their pipelines as modules to be easily chained sequentially or parallelized, and it lets the programmer avoid writing the boilerplate commonly used in pipelines.
  - This borders on an extra feature, and programmers would still be able to write the main scripts or call these functions in their own existing programs.
  - The end result: given a yaml config of a pipeline consisting of python objects in the active namespace that all have parsable docstrings, generate the pipeline-running main script with its CLI/configuration parser.
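As an illustration of the docstring linking that enables such hierarchical pipelines, consider two classes (the names and Google-style docstrings below are hypothetical) where one argument's type is another documented class; with both classes whitelisted, a parser can recurse from `Pipeline` into `Optimizer` to build the nested configuration:

```python
class Optimizer:
    """A configurable optimizer component.

    Args:
        lr (float): Learning rate.
    """
    def __init__(self, lr=0.01):
        self.lr = lr


class Pipeline:
    """A pipeline whose docstring links to another configurable class.

    Args:
        optimizer (Optimizer): The optimizer component; its own
            docstring supplies the nested configuration options.
    """
    def __init__(self, optimizer):
        self.optimizer = optimizer

    def run(self):
        # Stand-in for the program's main method.
        return self.optimizer.lr
```

Because `optimizer`'s type names a whitelisted class, its configuration options need only be documented once, on `Optimizer` itself.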
Desired Features Under Development for Version 1.0.0
The following are more desired features for a complete docstr version 1.0.0. This adds more detail to what is specified in the docstr pipeline section, all of which is expected in version 1.0.0:
- General: [-TODO-]
  - Comprehensive unit tests to ensure everything functions as expected.
    - Along with basic CI/CD on github to check build status.
  - Parallelization and async support: many processes in `docstr` could be parallelized and possibly benefit from async support. This is a speed-up option.
- Load Input:
- Parse and Tokenize [-TODO-]
  - Further support for parsing the docstrings of dataclasses' `__post_init__`. This is crucial.
    - Unit tests are required.
  - Need to further test and support the use of the namespace of the module an object is within, including aliases, so `import pandas as pd` enables `pd.DataFrame` in that module's docs, and likewise `import torch; nn = torch.nn`.
  - Some cases where a docstring is either unnecessary or only partially required:
    - By default, complete docs are desired, but when the docs are out of the user's control, handling partial docs to run the python program should be supported and be an option the user can specify.
    - docstr configs specific to a set of modules/packages may be beneficial to support.
    - NamedTuple-inheriting classes don't need docstrings beyond adding descriptions to the arguments/attributes, as the rest of the format is exactly the same as if it were to be parsed.
      - (done) NamedTuple is explicitly supported through duck typing.
      - Are there other such python classes/factories that have typing without abc?
  - Far more informative exception/error messages from docstr during the parsing process, informing the user exactly what file, line of code, and object is being parsed, and why it is an error, such as what was expected given the parsing context and the unexpected thing found.
    - The error messages should be informative enough in themselves that the user does not have to enter a debugger to observe the things a normal compiler, or even JIT python, would report.
- Compile and Run [-TODO-]
  - Auto-generation of CAP
    - Support Iterable(type|MultiType), so that a type can be written as `list(str|int)` for a list of strs or ints.
      - This absolutely needs unit testing once implemented.
    - Traverse and build CAPs for the configurable objects within MultiTypes.
    - Allow the CAP to be generated and informed more by the config when a docstring specifies a parent class and the config gives a child class that inherits from that parent and is still configurable. This is a common and important use case to support.
      - This absolutely needs unit testing once implemented.
    - Option to ignore the docs (especially when there are none in a 3rd party) and instead be informed by the type hinting and defaults used in the python declaration of the function/method.
      - The point is to write once with docstrings, but if 3rd parties do not provide complete docs because they rely on python type hinting, then we should support that too for completeness. All that really matters is that it is written once somewhere; when we don't have control over the code, we either support common trends or allow the user to wrap such objects with their own written docstring, where the latter is a currently supported workaround at the expense of the user's time.
      - This absolutely needs unit testing once implemented.
    - Doc linking, e.g., `see module.sub_module.class.method`.
      - Further doc linking support is necessary, with more tests.
    - Support for custom Sphinx napoleon configurations.
    - Support for custom docstring parsers through Sphinx extensions.
    - Given an object whose docstring is fully parsed by docstr, output the template of the YAML config file to be edited.
    - The initialized program needs the option of (possibly defaulting to) lazy initialization, such that when the python interpreter gets to an object, it may then be instantiated from the docstr CAP and values. This is especially good for code that is massive in memory post-init or takes a long time, when you want to execute it only once necessary.
- Code redundancy reduction using the docstrings:
  - Decorator for specifying the args and kwargs of a function written as `def func(*args, **kwargs):` to avoid redundant writing of args, defaults, etc.
  - Notably, "protected" attributes may be marked as such in the docs, so the user doesn't have to write an `@property` for each variable name that returns the protected `_variable`.
- Type Checking based on Docstrings:
  - Decorator for optional runtime type checking of variable values against the docstring's specified types.
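A decorator of the kind proposed here could look roughly like the following sketch. It checks call arguments against a preset mapping of expected types; in docstr's envisioned feature those types would come from the parsed docstring, whereas here they are supplied explicitly, and the decorator name and example function are hypothetical:

```python
import functools
import inspect

def check_types(**expected):
    """Check named arguments against expected types at call time.

    `expected` maps argument names to types; in docstr this mapping
    would be derived from the parsed docstring rather than passed in.
    """
    def decorator(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, typ in expected.items():
                value = bound.arguments[name]
                if not isinstance(value, typ):
                    raise TypeError(
                        f"{name} expected {typ.__name__}, "
                        f"got {type(value).__name__}"
                    )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@check_types(lr=float, epochs=int)
def train(lr=0.1, epochs=3):
    return lr * epochs
```

Calling `train(0.5, 2)` succeeds, while `train(lr="x")` raises a `TypeError` before the function body runs.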
- [-TODO-] Extra features: features that are unnecessary, but beneficial
  - Parameter iteration and searching (possibly using `pyrameter` or `SHADHO`) to run multiple versions of the same pipeline using different parameters.
    - This is specifically useful to Machine Learning (ML) researchers, but may be nice for those who either want to exhaustively run variations of their pipeline or have some objective function they want to optimize over different variations of their pipeline.
    - In ML, the pipeline and its parameter search would be entirely contained in one yaml file, or multiple through optional config linking. This would result in a straightforward overview of the ML experiment pipeline, easily swapped with existing modules written in code as classes.
  - Possibly rely upon Apache AirFlow DAGs for the pipelining.
    - `dag-factory`: an AirFlow extension for making dynamic DAGs.
    - An alternative to AirFlow is `dagster`.
Versioning
Docstr uses Semantic Versioning 2.0.0. Docstr's version will remain < 1.0.0 until adequate unit test coverage exists.
License
The docstr project is licensed under the MIT license, which is provided in LICENSE.txt