Job-execution-rejected

Just saw that this query is defined in this resource file: operaton/engine/src/main/resources/org/operaton/bpm/engine/impl/mapping/entity/Job.xml at b12e80c7ded6db496c36759c423c2938d86a689b · operaton/operaton · GitHub

Wouldn’t it be possible to just override the resource file? Will test it out.

You are right, patching the engine is not possible for most people :wink: Overriding it on the classpath could work, but it might be pretty unstable and break with every patch version update.

If you want to try it, I added FOR UPDATE SKIP LOCKED as a new line after the ${limitAfter} in line 230.
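For orientation, here is a simplified sketch of what the acquisition query effectively becomes with this change. This is not the literal content of Job.xml (which is assembled from MyBatis fragments such as ${limitBefore}/${limitAfter}); the conditions are abbreviated and the ordering depends on the engine configuration:

```sql
-- simplified sketch, not the literal statement from Job.xml
SELECT RES.*
FROM ACT_RU_JOB RES
WHERE RES.RETRIES_ > 0
  AND (RES.DUEDATE_ IS NULL OR RES.DUEDATE_ <= CURRENT_TIMESTAMP)
  AND (RES.LOCK_OWNER_ IS NULL OR RES.LOCK_EXP_TIME_ < CURRENT_TIMESTAMP)
  AND RES.SUSPENSION_STATE_ = 1
-- (an ORDER BY may be generated here, depending on acquire-by-due-date / priority settings)
LIMIT 3                   -- maxJobsPerAcquisition, normally injected via ${limitAfter} on PostgreSQL
FOR UPDATE SKIP LOCKED;   -- the added clause: lock the selected rows, let other nodes skip them
```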

Edit:
Can confirm it works.


2 Likes

@jradesenv can you try this and see if and how this affects your test scenario? Would be really interesting to see if you can reproduce the performance improvements.

I read the documentation about the SKIP LOCKED SQL command and according to the docs it‘s made exactly for this use case.

It‘s of course a hack; you need to copy and patch the file from the exact matching C7 version you‘re using.
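To illustrate what the docs describe, here is a sketch of two job executor nodes acquiring at the same time (PostgreSQL syntax, conditions abbreviated):

```sql
-- two acquisition transactions running concurrently

-- node A
BEGIN;
SELECT ID_ FROM ACT_RU_JOB
WHERE RETRIES_ > 0
LIMIT 3
FOR UPDATE SKIP LOCKED;   -- locks, say, jobs 1-3 until COMMIT

-- node B, at the same time
BEGIN;
SELECT ID_ FROM ACT_RU_JOB
WHERE RETRIES_ > 0
LIMIT 3
FOR UPDATE SKIP LOCKED;   -- skips the rows node A holds and returns jobs 4-6
                          -- instead of blocking or returning the same jobs

-- each node then writes its logical lock (LOCK_OWNER_, LOCK_EXP_TIME_) and commits
```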

1 Like

Yeah, I tried overriding the file in my resources and it worked! I will execute the tests again with FOR UPDATE SKIP LOCKED and come back here to show the results.

2 Likes

10k process instances / 30k jobs / 3 containers

  1. job executor default + replacing query with FOR UPDATE SKIP LOCKED
    it works and there's not a single job acquisition failure anymore.

  2. job executor with pagination + replacing query with FOR UPDATE SKIP LOCKED
    I can still see some improvement in jobs per minute with these 2 changes together

just to compare…

  1. job executor default and WITHOUT the addition of FOR UPDATE SKIP LOCKED on the query:
    the job acquisition failures happen as expected, but I wasn't expecting it to have better performance than with FOR UPDATE…

  2. job executor with pagination and WITHOUT the addition of FOR UPDATE SKIP LOCKED on the query:
    By applying paging/offset I also managed to avoid job acquisition failures, and it still appears to give the best performance, but perhaps in a larger test with 100k instances the use of SKIP LOCKED would reduce database overhead by avoiding the unnecessary updates that generate failures.

1 Like

Would be interesting to see how it behaves with 5 or 10 containers now

1 Like

AWS RDS Performance Insights of all four tests (running on PostgreSQL 15.8 on a t4g.medium):

2 Likes

Thanks for doing those comparisons! I'm not sure if I read your summary images right; where can I see the execution time? I understood that adding FOR UPDATE SKIP LOCKED reduces the acquisition failures but decreases performance in general?

On Saturday my test environment was a mess, and today I executed it in a better way.
We can really see the improvement from this setting here…

100k process instances / 300k jobs / 10 pods

Default implementation:
40 minutes total and 760k job acquisition failures

Query updated with FOR UPDATE SKIP LOCKED:
16 minutes total and ZERO job acquisition failures

Finally managed to use 95-100% of my database CPU in a good way. In the default implementation I was capped at 30% CPU, stuck in job acquisition failures.

Just to show some data, here are the RDS waits:

Here you can see that there was almost no job execution in the first test, and much more job select/acquisition, while in the second test it was almost all job inserts.

3 Likes

Thanks for sharing these numbers again! So the "16 minutes total and ZERO job acquisition failures" is just the FOR UPDATE SKIP LOCKED without your custom paging logic, right? That behaves even better than I would have hoped for :+1:

I think we could add this as an option to the engine:

  • Create a new property in the engine configuration (+ Spring Boot + Quarkus), opt in
  • Allow users to enable this locking mechanism on the job executor
  • Get many people to try it :sweat_smile:
3 Likes

Yep! I will try it plus my custom paging to see if it is still relevant.
But the idea of creating a property to enable this behavior seems like a great step forward, as we don't expect users to override those MyBatis XML files too often :+1:

I'm still curious to test it against MySQL too, as I have many projects on it as well. Will post the results here when it's done.

3 Likes

For reference:

3 Likes

@jradesenv thanks for all your tests! Looking forward to seeing the MySQL results.

Finally tried it against a t4g.medium MySQL 8 database and it also worked great (PostgreSQL still performs better, as expected):

We achieved around 15k jobs/minute.

The same tests took 20 minutes to complete.

2 Likes

I decided to use only the updated query instead of my paging solution, but I still used the logic that sets a unique index for each of my containers to configure a different priority range for the job executor.
This way I have 15 pods total with no concurrency problems (with FOR UPDATE SKIP LOCKED) and put two of them to work on low-priority jobs only (jobPriorityRangeMax(90)). All my priority tasks are set from 91 to 100, and the others are left at zero.
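For anyone who wants to reproduce this setup: as far as I understand, the priority range simply restricts which rows a pod's acquisition query considers. A hedged sketch of the effective predicate for the two low-priority pods (the real SQL is generated by the engine; bounds and conditions here are illustrative):

```sql
-- the two "low priority" pods (jobPriorityRangeMax = 90) effectively acquire like this:
SELECT RES.*
FROM ACT_RU_JOB RES
WHERE RES.PRIORITY_ <= 90        -- only jobs with priority 0..90
  AND RES.RETRIES_ > 0
  AND (RES.DUEDATE_ IS NULL OR RES.DUEDATE_ <= CURRENT_TIMESTAMP)
LIMIT 3
FOR UPDATE SKIP LOCKED;          -- combined with the patched query, still no duplicate acquisition

-- the other 13 pods acquire without the upper bound, so they also pick up the priority 91-100 jobs
```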

Yesterday I tested my 500k+ scenario for the first time.
It took 35 minutes to start all the instances (the full logic of taking objects from my MongoDB into the process engine and updating their status).
Then I restarted the containers with the job executor enabled, and it took 1:30 to conclude all the 500k+ process instances (a complex process with 3 subprocesses, parallel gateways, and send/receive messages with async before/after).

So, I'm happy with this throughput and it will work :slight_smile:

Just stopping by to say thanks for all the sharing we had here, and to say that I will keep passing by to discuss new challenges with you all :fist_right::fist_left:

2 Likes

@jradesenv I'm really happy to hear this! The approach with different priority ranges sounds sensible. Thanks for testing this with MySQL, too!

I'm currently preparing a pull request that adds an engine config property allowing users to enable FOR UPDATE SKIP LOCKED in the engine itself as an experimental feature for Operaton. If this works and we get some positive feedback, I think I'll offer it as a contribution to Camunda, too, in the spirit of OSS :wink:

Also, I think I will write a blog post about this on our company's blog. Would it be fine with you if I used the numbers you provided here and mentioned you? If not, that's totally fine; I think I can recreate those experiments.

Yeah, of course you can share everything we discussed here :love_you_gesture:

1 Like

Just a small update: Implementing this in the engine is not that easy.

After digging deeper into the job executor with @kthoms today, I am a little surprised that it worked that well for you. Currently Camunda does not set any row-level locks in the database, but sets a logical lock instead. The FOR UPDATE SKIP LOCKED clause obviously does not know about those logical locks, but obtains its own physical row lock. As long as this is active, other job executor instances will not retrieve this row.

By default, the Job Executor queries the candidate rows and then sets the logical lock in the same transaction. My assumption is that in the delta between "selecting" and "setting the lock", other job executors can retrieve the same rows and then try to set the lock too, which can fail. The additional physical lock prevents this from happening.
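To make the failure mode concrete, here is a simplified sketch of the "setting the lock" step, guarded by the revision column (table and columns from the engine schema; values are illustrative):

```sql
-- simplified sketch of the logical lock write after a job row was selected
UPDATE ACT_RU_JOB
SET LOCK_OWNER_    = 'node-1',                                   -- this executor's lock owner id
    LOCK_EXP_TIME_ = CURRENT_TIMESTAMP + INTERVAL '5 minutes',   -- corresponds to lockTimeInMillis
    REV_           = REV_ + 1
WHERE ID_  = '42'
  AND REV_ = 1;   -- optimistic lock: if another node already bumped REV_, 0 rows are updated
                  -- and the engine reports a job acquisition failure for this job
```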

So the new row-level physical lock does not lock the job in the database for the whole processing time, but only long enough that the logical lock can be safely written.

Everything written here is a hypothesis yet to be proven; I just wanted to give you an update.

Hmm, I don't know if I see the "problem" with this, because it's exactly what I expected to happen. By default, Camunda has this concurrency problem where several containers run the same query, get the same jobs, and the first one manages to set the lock owner and increment the rev_, while the others get a job acquisition failure because they no longer find the row with the same rev_ (zero rows updated).
FOR UPDATE then tells the database that these rows must stay locked at the DB level until the same transaction performs the update, while SKIP LOCKED is precisely there to skip those already-locked rows. Perhaps the READ COMMITTED isolation level will then help eliminate whatever contention may still occur.
Even if some job acquisitions still failed, I understand it would be minimal, a matter of milliseconds of concurrency, and thus much better than the default (where 100k jobs generated 740k failures with 10 containers); and the logical lock that is done via the update still maintains safety through rev_ against the little contention that may remain.

1 Like

You're right about that; I rewrote my post several times and in the end it was confusing.

The tricky thing is testing. We are now adding row-level locking for the first time; until now, the whole "locking" concept in the engine has been entirely logical locks, so we need to make sure it plays well with all the other configurations. That point definitely got lost in my edits, sorry for the confusion :wink: