Design of the Job Executor improvement

javahippie · December 8, 2024, 9:15am

I wanted to start a new thread to discuss the way the new locking mechanism for the job executor will be implemented (discussed here: Job-execution-rejected - #63 by jradesenv)

Context
The job executor queries the ACT_RU_JOB table periodically to retrieve the newest jobs which are not yet worked on. After retrieving the list, it adds a logical lock via update to the fetched row, so future job executor queries to the same table won’t retrieve them again.

If there are multiple concurrent job executors querying the same table, e.g. multiple nodes, it can happen that a row is retrieved, but only if the job executor tries to lock it, realizes that it was already locked by a competing job executor. This leads to the job executor not scaling well, as competing executors might produce so much traffic on the ACT_RU_JOB table, that they are mainly occupied with retrieving already locked rows.

Solution
All of the supported databases except DB2 and MSSQL support the FOR UPDATE SKIP LOCKED clause in SQL. This is not a part of the SQL standard, but widely accepted. If we add this clause to the job executor query, it will create a row level lock in the database for every retrieved row as long as the transaction is active. The transaction is finished as soon as all logical locks in the table are created via UPDATE and the jobs are started in the engine. It does not wait for the jobs to finish.

The SKIP LOCKED part of the clause prevents the query from including locked rows into the result set, avoiding the collisions previously described. In the thread linked above @jradesenv was able to verify this in several experiments.

Testing
We need to add comprehensive tests for this feature, in all supported databases and isolation levels. We already know that DB2 and MSSQL will not support this feature and users who are on these RDBMSs will not be able to benefit from this feature (as long as we cannot find a different way to implement this for these DB).

Proposals
All of these proposals are open for discussion, I’d be happy about feedback

I would like to introduce this behavior as an opt-in configuration first. I would not like to enable this behavior by default for supported databases, as it fundamentally changes the way the job executor works which might not be a desired behavior, even if can be seen as an improvement.
If we enable the configuration for a non-supported RDBMS I would like to log a warning stating this, but not let the engine fail on startup or runtime.
I would like to introduce a new property for the engine configuration to enable this feature. I would not like to add experimental or beta to the property path, because this would force early adopters to change their configuration once the feature is stable. The property path should not be broken in the future. Instead I would like to propose a new preview-features property, which can be enabled to activate all experimental features in the engine at once. If an experimental feature was activated with preview-features being false, we should print a warning into the logs. This prevents users from accidentally enabling features which are not stable, yet, and at the same time not break the property paths of experimental features.
I would like to name the property job-executor-acquire-with-skip-locked to mimic the already existing job-executor-acquire-by-priority on the root level. Logically it could also be a sub-property of the job executor configuration.

Julian · December 8, 2024, 9:33am

First off, let me say: great work on this feature! Thank you!

I wanted to share my perspective on one particular aspect:

When it comes to database features that may not be supported, I believe the engine should adopt a proactive approach.

Specifically:

If the database doesn’t support a required feature, the engine should refuse to start outright or, perhaps even better, start with that feature disabled and provide a clear warning.

Implementing this could be straightforward. For example, a compliance test during startup could run a specific query to verify support for the feature. If an exception occurs, the engine knows the feature isn’t available.

Alternatively, we could hardcode known incompatibilities for specific databases, ensuring predictable behavior.

What’s crucial, in my opinion, is that the engine must not appear “ready” (e.g., monitoring systems showing it’s up and running) and then fail later when it tries to acquire a job due to an unsupported feature.

This proactive handling would not only improve reliability but also give users immediate feedback about their setup, preventing potential issues from arising during runtime.

jradesenv · December 8, 2024, 5:46pm

For me it would be better to fail to boot and not start if the configuration is not correct. This way you will avoid cases where they go to production and only discover that the configuration does not work when a huge load arrives. If you check the configuration and it fails to start, you will fail the dev/test environment early and this is the best warning to have without relying on someone having seen a log.

javahippie · December 9, 2024, 8:38pm

Thanks for your opinions! I agree that refusing to start up might be the better choice, will implement it that way

javahippie · December 15, 2024, 7:06pm

To keep track: I think that we might be able to bring this feature to all databases compatible with Operaton.

MariaDB, MySQL, Postgres, Oracle DB: FOR UPDATE ... SKIP LOCKED
IBM DB2: SKIP LOCKED DATA
MS SQL: READPAST

Features are to be compared and tested, this is just after reading the documentation superficially