Jobs
To get an overview of how jobs are used, check out this introduction to bulk actions.
Implementation overview
Each job represents an action that should be performed on several entities at once.
All jobs are saved to the API's database, and each of them should contain enough information on its own to run properly.
Once a job is saved, it is meant to stay as read-only as possible, to make debugging clearer. Only the job's status and output should be updated to reflect its current state.
The overall idea is that there are many types of jobs, defined by job codes. For each job code, there is a job handler which contains the corresponding method used to run the job. A jobs executor service orchestrates the execution of all jobs.
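As a rough illustration of this model, a stored job might look like the following sketch; the field names here are assumptions, not the actual schema:

```typescript
// Illustrative sketch of a stored job; field names are assumptions,
// not the actual database schema.
enum JobStatus {
  Waiting = "waiting",
  InProgress = "inProgress",
  Success = "success",
  Failure = "failure",
}

interface JobRecord {
  id: string;
  code: string;          // job code: which handler should run this job
  status: JobStatus;     // updated as the job progresses
  parameters: string;    // JSON string, parsed by the handler
  output?: string;       // result or error message written by the executor
  globalJobId?: string;  // first job of the bulk action (see "Jobs relations")
  parentJobId?: string;  // job that created this sub job
  lastLock?: Date;       // refreshed while the executor is processing the job
  retryCount?: number;   // only set in a retry context
  retriedJobId?: string; // only set in a retry context
}
```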
Job breakdown
To keep track more precisely of what happens during bulk actions, each job can create sub jobs for additional actions it should carry out. For example, with a device configuration patch, there are separate sub jobs to:
- save the device configuration changes to the database
- send the change request to the IoT Hub device twins (as an IoT Hub job)
- check on the IoT Hub job whether the twins have been updated as expected
Each job and sub job has its own job code. The corresponding handler has a method to define what the job code should do.
Jobs can be separated into sub jobs for various reasons:
- separation of concerns: saving to the database (job 1) is different from saving to IoT Hub Twins (job 2), so it is clearer to do these things in separate handlers.
- time constraints: if jobs 2 and 3 from the example above were put together in one job, that job would spend a lot of time waiting. When these two tasks are put into separate jobs, the API can go on and run other jobs while waiting for the IoT Hub job to finish.
- job type reusability: each job can create other jobs of any code. If another bulk action needs to update IoT Hub Twins, it does not need to implement it again; it only needs to create a job with the same code as for job 2.
- retrying jobs: when jobs are cut into smaller tasks, it is easier to retry only the exact task that failed.
Jobs executor service
The `JobsExecutorService` is the service that orchestrates running the jobs. It selects the jobs to run, calls their job handlers to start them, and saves any future jobs that the handlers returned.
It also manages the retry context for jobs if needed (more detail on that below).
This service starts processing jobs thanks to a method call in the API's main function, as a never-ending background task. We need to make sure all the errors are caught in a `try ... catch`, otherwise an uncaught error could make the service method crash and the whole API would need to be restarted in order to start processing jobs again.
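A minimal sketch of such a loop, with hypothetical class and method names, could look like this:

```typescript
// Minimal sketch of the never-ending background task; class and method names
// are assumptions for illustration.
class JobsExecutorLoop {
  async processJobsForever(pollIntervalMs: number): Promise<void> {
    while (true) {
      try {
        await this.processNextJobs();
      } catch (error) {
        // Catch everything: an uncaught error here would stop job processing
        // until the whole API is restarted.
        console.error("Jobs executor iteration failed", error);
      }
      await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
    }
  }

  private async processNextJobs(): Promise<void> {
    // Select runnable jobs, lock them, call their handlers, then save the
    // handlers' outputs and any new sub jobs.
  }
}

// In the API's main function, started once and not awaited:
// void new JobsExecutorLoop().processJobsForever(5_000);
```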
Locking jobs
Multiple jobs executor methods could be running concurrently to process more jobs at the same time. This could happen if the method is called multiple times in the API's main function, or if multiple API instances are running alongside each other.
In that case, we need to make sure that the same job cannot be started twice. For this, the jobs executor "locks" the job while it is being processed:
- the job is set as `in progress` with a `lastLock` timestamp (current time)
- the `lastLock` timestamp is refreshed regularly

If a job is in progress but its `lastLock` is older than the refresh interval, something probably went wrong and the jobs executor can safely select this job to run it again.
Endpoints
Reading jobs
There are two ways to read jobs:
- listing all jobs in a tenant: this provides an overview of all the jobs in the tenant, of any type. It does not include sub jobs.
- reading one job's details: this provides more detailed information for one specific job. It also shows the sub jobs.
Creating jobs
There is no single endpoint for creating all kinds of jobs. Instead, we provide a specific endpoint for each type of bulk action.
This makes the API's endpoints clearer to use, and makes it easier to validate the input.
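As an illustration, a dedicated bulk-action endpoint might validate its specific input and then create the corresponding global job. The payload shape, job code and persistence helper below are hypothetical:

```typescript
// Hypothetical endpoint handler for one specific bulk action; the payload,
// job code and persistence helper are illustrative, not the real API surface.
interface PatchDeviceConfigurationsRequest {
  tenantId: string;
  deviceIds: string[];
  configurationPatch: Record<string, unknown>;
}

async function postDeviceConfigurationsPatch(
  request: PatchDeviceConfigurationsRequest
): Promise<{ jobId: string }> {
  // Input validation is specific to this bulk action, which is why each
  // bulk action has its own endpoint instead of one generic "create job".
  if (request.deviceIds.length === 0) {
    throw new Error("At least one device must be selected");
  }

  // Create the global job; its handler will create the sub jobs.
  const jobId = await createJob({
    code: "patchDeviceConfigurations",
    parameters: JSON.stringify(request),
  });
  return { jobId };
}

// Stand-in for the real jobs repository: persist the job and return its id.
async function createJob(job: { code: string; parameters: string }): Promise<string> {
  return `job-${Date.now()}`;
}
```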
Authorization
Authorization for reading and creating jobs can be given on a tenant level.
However, before creating the first job of a bulk action request, there is usually more authorization to check besides the ability to create a job. We need to make sure that the user is allowed to perform all the operations the bulk action might perform, on each of the entities in the request.
Make sure not to forget these additional authorization checks when adding a new bulk action endpoint!
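A sketch of what these additional checks might look like; the permission names and the authorization helper are made up:

```typescript
// Illustrative per-entity authorization check; the permission names and the
// authorization helper are made up.
async function assertCanPatchDeviceConfigurations(
  userId: string,
  tenantId: string,
  deviceIds: string[]
): Promise<void> {
  // Being allowed to create jobs in the tenant is not enough...
  if (!(await userCan(userId, "jobs:create", tenantId))) {
    throw new Error("Not allowed to create jobs in this tenant");
  }
  // ...the user must also be allowed to perform the underlying operation
  // on every entity targeted by the bulk action.
  for (const deviceId of deviceIds) {
    if (!(await userCan(userId, "devices:updateConfiguration", deviceId))) {
      throw new Error(`Not allowed to update configuration of device ${deviceId}`);
    }
  }
}

// Stand-in for the real authorization helper; always allows in this sketch.
async function userCan(userId: string, permission: string, resourceId: string): Promise<boolean> {
  return true;
}
```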
Job codes and parameters
Jobs can find the details of what they should do in their job parameters. Each job code expects a certain format of parameters. This format is set by an interface, and the parameters are stored as a JSON string. Each job handler simply parses this string with an interface it chooses.
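For example, a handler for a hypothetical device configuration patch job code might parse its parameters like this; the interface shape is illustrative:

```typescript
// Hypothetical parameters interface for one job code; the shape is illustrative.
interface PatchDeviceConfigurationsParameters {
  tenantId: string;
  deviceIds: string[];
  configurationPatch: Record<string, unknown>;
}

// Inside the corresponding handler: the JSON string stored on the job is
// parsed with the interface this handler expects.
function parseParameters(raw: string): PatchDeviceConfigurationsParameters {
  return JSON.parse(raw) as PatchDeviceConfigurationsParameters;
}
```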
Jobs relations
Global jobs
The first job created by the bulk action request is called a global job. It does not have a parent job, and does not belong to any other global job.
All of the sub jobs created from this global job (including sub jobs created by sub jobs) will reference this same job as their global job.
A global job is in status `waiting` as long as it still has unfinished sub jobs. Once all its sub jobs are finished, the jobs executor service updates its status accordingly.
Parent jobs
All sub jobs have a parent job to keep track of which job was created by which other job.
It also helps determine the status of the global job: if all the leaf jobs (jobs with no children) succeed, then the global job also succeeds. It does not matter if another job failed along the way, as long as it has been retried and the retries succeeded.
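A sketch of that rule, assuming jobs carry the hypothetical shape below and that a failed job which has been retried is no longer a leaf (its retry job is its descendant):

```typescript
// Sketch of deriving the global job's final status from its leaf jobs;
// the job shape is an assumption.
interface JobNode {
  id: string;
  parentJobId?: string;
  status: "waiting" | "inProgress" | "success" | "failure";
}

function globalJobSucceeds(jobs: JobNode[]): boolean {
  const parentIds = new Set(
    jobs.map((job) => job.parentJobId).filter((id): id is string => id !== undefined)
  );
  // Leaf jobs are the jobs that created no children.
  const leaves = jobs.filter((job) => !parentIds.has(job.id));
  // The global job succeeds if every leaf succeeded, even if an intermediate
  // job failed and was successfully retried along the way.
  return leaves.every((leaf) => leaf.status === "success");
}
```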
Priority groups
Jobs can belong to a priority group. Other jobs can then depend on a priority group to express that they need those jobs to be finished before they can start.
Priority groups are not really used yet, since no job has needed them so far; this logic has yet to be implemented.
Job handlers
A job handler contains a method to execute for one specific job code.
Once completed, the handler returns an output which the jobs executor service will use to further orchestrate the jobs.
All the handlers should provide the same output information:
- new job status
- new job output (optional)
- new sub jobs (optional)
- new retry job (optional)
The handlers do not save any of this to the database themselves; it is up to the jobs executor service to handle and save the handler output.
When explicitly returning a `failure` status, make sure to also provide a message in the `output` to explain it.
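A sketch of this shared output shape, with assumed names:

```typescript
// Illustrative shape of what every handler returns; names are assumptions.
interface NewJobRequest {
  code: string;       // job code of the new job
  parameters: string; // JSON string matching that job code's parameters interface
}

interface JobHandlerOutput {
  status: string;            // new status of the job that just ran, e.g. "success" or "failure"
  output?: string;           // message; expected when the status is a failure
  subJobs?: NewJobRequest[]; // new sub jobs for the executor to create
  retryJob?: NewJobRequest;  // a job to create in a retry context
}

// The handlers only return this object; the jobs executor service is the one
// that persists the status, output and any new jobs.
interface JobHandler {
  handle(parameters: string): Promise<JobHandlerOutput>;
}
```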
Adding a new handler
To add a new job code and handler to the API:
- add a new value to the `JobCode` enum
- create a new handler class in `src/jobs/job-handlers` which implements the `JobHandler` interface. Reference the expected job parameters format in your new handler, with an existing parameters interface or a new one
- update the `JobHandlersPicker` to make it pick your new handler for the new job code
Then you can use your new job code as a sub job in your other jobs, or as a global job in a new bulk action request.
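Put together, a new handler could look roughly like the sketch below; apart from the `src/jobs/job-handlers` directory, every name and shape here is an assumption about the real interfaces:

```typescript
// Hypothetical handler following the steps above; names and shapes are
// assumptions about the real interfaces.
interface JobHandlerOutput {
  status: string; // e.g. "success" or "failure"
  output?: string;
  subJobs?: { code: string; parameters: string }[];
}

interface JobHandler {
  handle(parameters: string): Promise<JobHandlerOutput>;
}

// 1. A new value is added to the JobCode enum, e.g. "exportDeviceReport".

// 2. A new handler class in src/jobs/job-handlers parses its own parameters format.
interface ExportDeviceReportParameters {
  tenantId: string;
  deviceIds: string[];
}

class ExportDeviceReportHandler implements JobHandler {
  async handle(rawParameters: string): Promise<JobHandlerOutput> {
    const parameters = JSON.parse(rawParameters) as ExportDeviceReportParameters;
    // ... perform the action, or return sub jobs for the executor to create ...
    return {
      status: "success",
      output: `Exported a report for ${parameters.deviceIds.length} devices`,
    };
  }
}

// 3. The JobHandlersPicker is updated to return this handler for the new job
//    code, e.g. by mapping { exportDeviceReport: new ExportDeviceReportHandler() }.
```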
Retrying jobs
Failed jobs can be retried, in case the error resolves itself when given another try. There is a retry limit set by the jobs executor service.
Jobs have retry count and retried job properties. In normal cases, neither contains any value.
When a job fails and we want to retry it, the next jobs are in a "retry context" where the two retry properties contain values to explain what is being retried.
A job can be retried for two reasons:
- the job handler threw an exception that was not expected. The job will be marked as failed, and copied to be created and tried again with a retry context
- the job handler finished, but provided a `retryJob` in the output. This `retryJob` is created with a retry context
For the first job that is being retried, the `retriedJob` is the parent job (the first job that needed a retry). For all the next jobs that already carry a retry context, only the retry count is increased by 1 each time there is an additional failure. The `retriedJob` will still always reference the first job that needed the retry.
If the retry limit is reached, the next retry job will be discarded and the last failed job will contain a message in its output to display that it reached the retry limit.
A job is said to succeed the retry if all these conditions are met:
- the job has status `success`
- the job has a retry context
- the `retriedJob` of the succeeded job has the same job code
In this case, the retry context is cleared for the next jobs.
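A sketch of how the executor might build a retry job from a failed one; the field names and the retry limit are assumptions:

```typescript
// Sketch of how a retry job could be derived from a failed job; the field
// names and the retry limit are assumptions.
const RETRY_LIMIT = 3;

interface RetryableJob {
  id: string;
  code: string;
  parameters: string;
  retryCount?: number;   // only set in a retry context
  retriedJobId?: string; // first job that needed the retry
}

function buildRetryJob(failedJob: RetryableJob): Omit<RetryableJob, "id"> | undefined {
  const retryCount = (failedJob.retryCount ?? 0) + 1;
  if (retryCount > RETRY_LIMIT) {
    // Instead of creating another retry job, the executor writes a
    // "retry limit reached" message to the failed job's output.
    return undefined;
  }
  return {
    code: failedJob.code,
    parameters: failedJob.parameters,
    retryCount,
    // Always keep pointing at the first job that needed the retry.
    retriedJobId: failedJob.retriedJobId ?? failedJob.id,
  };
}
```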