I wanted to start a new thread to discuss the way the new locking mechanism for the job executor will be implemented (discussed here: Job-execution-rejected - #63 by jradesenv)
Context
The job executor periodically queries the ACT_RU_JOB table to retrieve the newest jobs that are not yet being worked on. After retrieving the list, it adds a logical lock to each fetched row via UPDATE, so future job executor queries against the same table won’t retrieve them again.
If there are multiple concurrent job executors querying the same table, e.g. on multiple nodes, it can happen that an executor retrieves a row and only realizes, when it tries to lock it, that a competing job executor has already locked it. This leads to the job executor not scaling well, as competing executors might produce so much traffic on the ACT_RU_JOB table that they are mainly occupied with retrieving already locked rows.
Solution
All of the supported databases except DB2 and MSSQL support the FOR UPDATE SKIP LOCKED clause in SQL. This is not part of the SQL standard, but it is widely supported. If we add this clause to the job executor query, the database creates a row-level lock for every retrieved row for as long as the transaction is active. The transaction finishes as soon as all logical locks in the table are created via UPDATE and the jobs are started in the engine. It does not wait for the jobs to finish.
The SKIP LOCKED part of the clause prevents the query from including locked rows in the result set, avoiding the collisions described above. In the thread linked above, @jradesenv was able to verify this in several experiments.
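As a minimal sketch of what adding the clause could look like, the Python function below appends FOR UPDATE SKIP LOCKED to a simplified acquisition query. The function name, the database type identifiers and the shape of the base query are assumptions for illustration only; the actual engine query is more involved.

```python
# Databases the post names as not supporting the clause.
# The lowercase identifiers are hypothetical, not engine constants.
SKIP_LOCKED_UNSUPPORTED = frozenset({"db2", "mssql"})

# Simplified acquisition query; column names are assumptions.
BASE_ACQUISITION_QUERY = "SELECT ID_ FROM ACT_RU_JOB WHERE LOCK_OWNER_ IS NULL"


def build_acquisition_query(database_type, use_skip_locked):
    """Return the acquisition SQL, with the row-locking clause where it applies."""
    supported = database_type.lower() not in SKIP_LOCKED_UNSUPPORTED
    if use_skip_locked and supported:
        # Rows locked by another open transaction are left out of the
        # result set instead of being returned or blocking the query.
        return BASE_ACQUISITION_QUERY + " FOR UPDATE SKIP LOCKED"
    # Otherwise keep the existing behavior: a plain select followed by
    # logical locking via UPDATE.
    return BASE_ACQUISITION_QUERY
```

The query has to run in the same transaction as the UPDATE statements that write the logical locks, since the row-level locks are released when that transaction ends.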
Testing
We need to add comprehensive tests for this feature, covering all supported databases and isolation levels. We already know that DB2 and MSSQL do not support this clause, so users on these RDBMSs will not be able to benefit from the feature (as long as we cannot find a different way to implement it for these databases).
Proposals
All of these proposals are open for discussion, and I’d be happy to get feedback.
- I would like to introduce this behavior as an opt-in configuration first. I would not like to enable it by default for supported databases, as it fundamentally changes the way the job executor works, which might not be desired even if it can be seen as an improvement.
- If we enable the configuration for a non-supported RDBMS, I would like to log a warning stating this, but not let the engine fail on startup or at runtime.
- I would like to introduce a new property for the engine configuration to enable this feature. I would not like to add experimental or beta to the property path, because this would force early adopters to change their configuration once the feature is stable. The property path should not be broken in the future. Instead I would like to propose a new preview-features property, which can be enabled to activate all experimental features in the engine at once. If an experimental feature is activated while preview-features is false, we should print a warning into the logs. This prevents users from accidentally enabling features which are not yet stable, and at the same time does not break the property paths of experimental features.
- I would like to name the property job-executor-acquire-with-skip-locked to mimic the already existing job-executor-acquire-by-priority on the root level. Logically it could also be a sub-property of the job executor configuration.
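To help the discussion, here is one possible reading of these proposals as a small Python sketch. The property names come from the proposals above; everything else (the function and class names, the database identifiers, the warning texts) is hypothetical. It also assumes that a feature activated without preview-features stays active and only produces a warning, which is one interpretation and open for debate.

```python
from dataclasses import dataclass, field

# Databases named in this post as lacking the clause; identifiers are hypothetical.
UNSUPPORTED_DATABASES = frozenset({"db2", "mssql"})


@dataclass
class SkipLockedResolution:
    active: bool
    warnings: list = field(default_factory=list)


def resolve_skip_locked(properties, database_type):
    """Decide whether skip-locked acquisition is used and which warnings to log."""
    requested = bool(properties.get("job-executor-acquire-with-skip-locked", False))
    preview = bool(properties.get("preview-features", False))
    warnings = []

    # Opt-in only: the feature is off unless it is requested directly
    # or through the global preview switch.
    active = requested or preview

    if requested and not preview:
        # Assumption: warn but keep the feature active.
        warnings.append(
            "job-executor-acquire-with-skip-locked is experimental "
            "but preview-features is false"
        )

    if active and database_type.lower() in UNSUPPORTED_DATABASES:
        # Never fail on startup or at runtime; warn and keep the old behavior.
        warnings.append(
            "skip-locked acquisition is not supported on " + database_type
        )
        active = False

    return SkipLockedResolution(active=active, warnings=warnings)
```

For example, `resolve_skip_locked({"job-executor-acquire-with-skip-locked": True}, "postgres")` yields an active feature plus one warning about the missing preview-features switch, while the same configuration on DB2 falls back to the existing acquisition with a warning instead of an error.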