Caching of Task Executions (Client-side)
The HQS Tasks client is equipped with a (client-side) caching mechanism. Despite what the name might suggest, the purpose of this cache is not only to speed things up: it plays a primary role in the user's workflow and experience when using tasks.
The key point to understand is that invoking a task client function (usually the statement with the await keyword) does not necessarily submit a new task execution. Instead, the client first checks whether an execution has already been submitted before (which might be finished or still pending); only if that is not the case, a new execution is submitted. The client (while running) then tracks the state of the execution in the backend and stores the last reported state locally: in the cache.
This makes it possible to "detach" your client from the actual execution backend at any time by simply cancelling your (locally running) script. When re-running the exact same script at any later point, it automatically finds the previously submitted tasks and "re-attaches" to them. By re-attaching we mean that the status is fetched, the local cache is updated accordingly, and your script can continue, since only now does the awaited task client function return the task's output.
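As a minimal sketch of this workflow (the task client function add_numbers and its module are hypothetical, for illustration only):

```python
import asyncio

from my_tasks_client import add_numbers  # hypothetical task client function


async def main():
    # First run: submits a new execution and waits for its result.
    # After cancelling and re-running the script: re-attaches to the same
    # execution (found via the cache) instead of submitting a new one.
    result = await add_numbers(a=1, b=2)
    print(result)


asyncio.run(main())
```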
Cache Location (and Types)
There are different types of caches (or, to be more precise, storage implementations, i.e. where and how the data is stored). This is independent of the logic described above and of the cache modes described below.
Shelve
This is currently the default cache type. The data is stored in (a triplet of) files, normally located in the same folder as the script you are running, but this location can be adjusted.
Note that we do not give any guarantees for cache files being compatible between different versions of the HQS Tasks implementation.
To enable (and configure) this cache type, add the following configuration code to your client script:
```python
from hqs_tasks_execution.config import global_config, CacheConfigurationShelve

global_config.cache = CacheConfigurationShelve(
    file="my_custom_cache_file",  # Note: This is optional.
)
```
SQLite
This uses a (file-based) SQLite database. The database file is normally located in the same folder as the script you are running, but this can be adjusted.
To enable this cache type, add the following configuration code to your client script:
```python
from hqs_tasks_execution.config import global_config, CacheConfigurationSQLite

global_config.cache = CacheConfigurationSQLite(
    file="my_custom_cache_file.db",  # Note: This is optional.
)
```
Chained
This enables you to use several caches (of the same or of different types). The idea is that if the first cache does not hold the execution we are about to run, the second cache is consulted. If that one contains it, we are lucky: the information is then also stored (written back) in the first cache.
Use cases are when you want a fast local cache combined with a network-shared cache that you share with your colleagues, or when you change your cache type but want to "fall back" to your existing cache (i.e., migrate the cache on demand as entries are accessed).
To enable this cache type, add the following configuration code to your client script (this demonstrates the second use case described above):
```python
from hqs_tasks_execution.config import (
    global_config,
    CacheConfigurationChained,
    CacheConfigurationShelve,
    CacheConfigurationSQLite,
)

global_config.cache = CacheConfigurationChained(
    caches=[
        # New cache: Primarily used.
        CacheConfigurationSQLite(),
        # Old cache: Used if not found in primary; will then write back to primary.
        CacheConfigurationShelve(),
    ]
)
```
Cache Reliability (Disclaimer)
Be aware that the cache consists of "just local files", which you could easily delete by accident (for example, when cleaning up your script folder). You should therefore never fully rely on the cache for storing valuable results. We also do not guarantee that there will never be an implementation error leading to accidental deletion of cache entries, or that the cache will never end up corrupted for whatever reason.
We recommend storing valuable results additionally in a safe location, a structured database, or some archive, or at least keeping a backup of the cache files, depending on your needs.
However, depending on the concrete backend you use to run tasks, it might come with a fully-featured database that serves this purpose. For example, when using the REST backend, you can access the REST API separately to browse and retrieve all of your task executions and their results. This is, however, beyond the scope of this documentation.
Identifying Task Executions
Above we claimed that, for the caching mechanism to work, re-running requires the exact same script. That is only half the story: executions are identified per task invocation. Whenever you invoke the same task client function with the same input as before (and in the same version as before), the execution is identified in the cache and the previous execution can be utilized.
You might observe a somewhat cryptic string in the gray caching-related log lines: the so-called "cache key". This is the identifier for the cache, composed of the task name, the task version (which usually equals the Python package version), and a hash value of the input. Note that the actual task input is a JSON serialization of the Python object(s) passed to the function, but usually the same (Python) values result in the same JSON value and therefore in the same hash value.
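To illustrate the principle (this is only a sketch; the actual composition and hash function used by the implementation are internal and may differ):

```python
import hashlib
import json


def illustrative_cache_key(task_name: str, task_version: str, task_input) -> str:
    # JSON-serialize the input deterministically, then hash it.
    input_json = json.dumps(task_input, sort_keys=True)
    input_hash = hashlib.sha256(input_json.encode("utf-8")).hexdigest()
    # Cache key: task name + task version + input hash.
    return f"{task_name}:{task_version}:{input_hash}"


# Same input values -> same JSON -> same hash -> same cache key.
print(illustrative_cache_key("add_numbers", "1.2.0", {"a": 1, "b": 2}))
```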
For tasks that have stochastic behavior (i.e., implement randomized algorithms), it cannot be guaranteed that passing the same input will lead to the same result. When retrieving a previous execution of such a task from the cache, the corresponding result remains the same, even though a new execution of the task would lead to a different result. Hence, it is recommended to provide tasks exhibiting random behavior with a "random seed" parameter as an additional input argument. By passing different values for that parameter, multiple task executions are submitted, because the input is technically different, and therefore so are the hash value and consequently the cache key by which executions are identified. Other than that, the seed value is usually "just any value". Taking the caching mechanism into account, we thereby achieve deterministic behavior. (Without the cache, or when clearing it, the behavior would only be deterministic if the task guarantees it.)
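As a sketch (the stochastic task client function sample_observable is hypothetical): each seed value produces a distinct cache key, so ten separate executions are submitted, and re-running the script re-attaches to all ten.

```python
async def run_samples(model):
    # sample_observable is a hypothetical stochastic task client function
    # with an explicit seed parameter. Each distinct seed yields a distinct
    # input hash, hence its own cache key and its own cached execution.
    results = []
    for seed in range(10):
        results.append(await sample_observable(model=model, seed=seed))
    return results
```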
Cache Modes
The cache can operate in different modes. This can be configured using global_config.cache.mode or scoped_config(cache_mode=...). The modes are found in the enum CacheMode from hqs_tasks_execution.config and are briefly explained in the following; a configuration sketch follows the list:
- FULLY_CACHED (default): Cache everything. Utilize the cache for everything (submitted or finished). Update the cache entry with every received update.
- RESTART_FAILED: Ignore failed executions. Utilize the cache for submitted and successful, but not for failed executions. Update the cache entry with every received update.
- REATTACH_ONLY: Ignore finished executions. Utilize the cache only for submitted, but not for finished (successful or failed) executions. Update the cache entry with every received update.
- WRITE_ONLY: Ignore cache, but write results to it. Never utilize the cache, but still update the cache entry with every received update.
- DISABLED: Ignore cache, also never write to it. The cache is disabled completely (neither utilized nor written/updated).
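For example (assuming scoped_config is importable from hqs_tasks_execution.config like the other configuration helpers, and that it can be used as a context manager; both are assumptions here):

```python
from hqs_tasks_execution.config import CacheMode, global_config, scoped_config

# Set the mode globally for the whole script ...
global_config.cache.mode = CacheMode.RESTART_FAILED

# ... or only within a scope (assuming scoped_config is a context manager).
with scoped_config(cache_mode=CacheMode.REATTACH_ONLY):
    ...  # task invocations here ignore finished executions in the cache
```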
In most cases you want FULLY_CACHED, which is the default mode. Even when a task execution has failed, in most cases (exceptions are discussed below) this is solved by changing the input, or, if the failure is due to a bug or a missing feature in the task implementation, by an update of the task client package, in which case the version of the task changes. In both cases the cache key will be different, so a new execution will be submitted.
In the following cases you do not want to utilize the cache for a failed execution:
- When there was a fault in the task execution system. There are several possible technical reasons, which we cannot list exhaustively here. Note that in some cases an error might already have been raised on the client side before anything was actually submitted to the backend, in which case no cache entry is written.
- When the resources were insufficient (usually memory or time) to successfully execute the task for that particular input. The hardware provisioning options are not part of the cache entry identifier, so increasing them would still utilize the cache entry in the default cache mode.
- For similar reasons, when you want to run performance benchmarks or just want to "play around" with different hardware parameters.

This list is, of course, not complete.
In these cases, when you are not sure which of the other cache modes is the "right" choice, we recommend going through them in the order listed above. Keep in mind that the less the cache is utilized, the more resources are consumed and hence the higher the cost. We therefore recommend experimenting carefully with this option rather than disabling the cache completely just because something was utilized from the cache that you did not want.
If you have a long-running task and experience internet connection issues (e.g. a ConnectionError exception), you can re-attach to the submitted task by executing the same task again. Note that this only works for the cache modes FULLY_CACHED, RESTART_FAILED, and REATTACH_ONLY.
"Cache-Only" Mode
While it is technically not a cache mode, you can run your script in a mode in which new task executions are never submitted. The client still tries to re-attach to existing (pending) task executions in order to fetch their results.
This mode can be enabled by setting global_config.throttling.disable_submissions = True for the whole script, or scoped_config(disable_submissions=True) for a scope.
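For example, to enable it for the whole script (using the attribute named above):

```python
from hqs_tasks_execution.config import global_config

# Never submit new executions; only re-attach to previously submitted
# (pending) executions and fetch their results.
global_config.throttling.disable_submissions = True
```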