Job-execution-rejected

I'm doing some performance tests on the way the job executor behaves, and I just can't find any documentation on what this “job-execution-rejected” metric tells me.

My process has a message receiver with async after and two more delegates with async after too, so each time I correlate a message it will execute 3 more jobs.

After a test doing correlations with 2 parallel threads, I can see that the job-execution-rejected number is greater than the job-successful metric.

I've set every log to debug level and just can't find any query or commit that has failed.

What is this job-execution-rejected metric, and what does a higher number tell me?

At the moment, I'm testing with embedded Spring Boot, 2 pods/containers and this config:

    job-execution:
      queue-capacity: 5
      core-pool-size: 3
      max-pool-size: 5
      lock-time-in-millis: 300000
      wait-time-in-millis: 500
      max-wait: 2000
      max-jobs-per-acquisition: 5

Hi @jradesenv

What work is happening in your delegates, and how long does it take? I think your wait-time-in-millis and max-wait are rather short (at least shorter than the defaults); what happens if you increase them?

I have seen this behavior when the worker threads were all occupied and there were none left to accept new jobs. If you could share your repo, I'd be able to take a look at it.

Thanks for the answer, @javahippie. These delegates are only for test purposes and have only a single logger.info line.
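
For context, the delegates in this test look roughly like this (just a sketch; the class name and log message are examples, only the single logger.info call is what matters):

    import org.camunda.bpm.engine.delegate.DelegateExecution;
    import org.camunda.bpm.engine.delegate.JavaDelegate;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Example test delegate: no real work, just one log line per execution
    public class LoggingTestDelegate implements JavaDelegate {

        private static final Logger LOGGER = LoggerFactory.getLogger(LoggingTestDelegate.class);

        @Override
        public void execute(DelegateExecution execution) {
            LOGGER.info("executed activity {} for process instance {}",
                    execution.getCurrentActivityId(), execution.getProcessInstanceId());
        }
    }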

If my queue capacity is equal to my max pool size, how can this happen? Should I test with other suggested values?

I'm trying some things with an old project I had for similar reasons; I may need some time.

One additional question: Are your Asynchronous continuations marked as “Exclusive” in the modeler?

Yep, they are all exclusive. This makes my job execution success count higher than my job acquisition success count, because the next exclusive jobs tend to run on the same worker, I think.

I agree, I’d expect the behavior you describe from the exclusive property, too.

Tested here with the minimum numbers:

    queue-capacity: 1
    core-pool-size: 1
    max-pool-size: 1
    lock-time-in-millis: 300000
    wait-time-in-millis: 500
    max-wait: 2000
    max-jobs-per-acquisition: 1

and with only 10 message correlations I still get a large number of job-execution-rejected.

Grasping at straws here, because remote-debugging this is kind of tricky: Have you checked the database for locks on the job execution table?

Yes, this is a clean environment I created only for this performance test. My database is a t3.medium PostgreSQL instance on AWS created just for this POC project, with nothing else running on it.

I tried again now with only one pod, to see if this was concurrency between containers, but even with a single container, core-pool-size 1 and max-jobs-per-acquisition 1, I still see a large number of job rejections.

I still don't understand what this metric is… Is this an issue where the job executor executes the delegate and then fails to commit the completed job? Or is it an error before the job executor gets it?

This research topic really caught my interest! Recently, someone mentioned to me that the ‘job executor’ might be considered ‘outdated’ and could benefit from a refresh. However, he didn't provide any specific details, so it's up to us to dive in and build our knowledge around it. Excited to explore this area and see what improvements we can bring to the table.

The JavaDoc for the metric claims this:

/**
   * Number of jobs that were submitted for execution but were rejected due to
   * resource shortage. In the default job executor, this is the case when
   * the execution queue is full.
   */

So the scheduler, which cyclically queries the job execution table, is not able to put the retrieved jobs into the execution queue.

As your queue is exactly the same size as your max-jobs-per-acquisition, this kind of makes sense to me.
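
For reference, this counter can also be read at runtime through the engine's ManagementService metrics query. A minimal sketch (assuming a Spring-managed ManagementService bean; the class name here is made up):

    import org.camunda.bpm.engine.ManagementService;
    import org.camunda.bpm.engine.management.Metrics;

    public class RejectedJobsProbe {

        private final ManagementService managementService;

        public RejectedJobsProbe(ManagementService managementService) {
            this.managementService = managementService;
        }

        /** Sums the "job-execution-rejected" metric over all reported intervals. */
        public long rejectedJobs() {
            return managementService.createMetricsQuery()
                    .name(Metrics.JOB_EXECUTION_REJECTED)
                    .sum();
        }

        /** For comparison: the "job-successful" metric. */
        public long successfulJobs() {
            return managementService.createMetricsQuery()
                    .name(Metrics.JOB_SUCCESSFUL)
                    .sum();
        }
    }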

I'm trying to complete my message correlations this way, using a semaphore and a new thread for each, so maybe it was due to a lack of free threads?

But it occurs even with a parallelism of 2.
And if my queue is full, why wouldn't the job acquisition stop acquiring more jobs?

    // runtimeService is an injected org.camunda.bpm.engine.RuntimeService
    public void continueByMessage(String message, int quantity, int parallel) throws InterruptedException {
        // Fetch up to 'quantity' open subscriptions for the given message name
        List<EventSubscription> events = runtimeService.createEventSubscriptionQuery()
                .eventName(message)
                .listPage(0, quantity);

        // Allow at most 'parallel' correlations to run at the same time
        Semaphore semaphore = new Semaphore(parallel);

        for (EventSubscription event : events) {
            semaphore.acquire();

            // Correlate each message in its own thread, releasing the permit when done
            new Thread(() -> {
                try {
                    runtimeService.messageEventReceived(message, event.getExecutionId());
                } finally {
                    semaphore.release();
                }
            }).start();
        }
    }

I tried now with these values, with a larger queue than my max-jobs-per-acquisition, and parallel 2 on the message correlation code I posted before:

    queue-capacity: 10
    core-pool-size: 3
    max-pool-size: 5
    lock-time-in-millis: 300000
    wait-time-in-millis: 5000
    max-wait: 10000
    max-jobs-per-acquisition: 3

Thanks for trying that. Currently I cannot estimate the impact of starting the messages by spawning new threads guarded by a semaphore. At this point, I would start debugging the threads with Java Flight Recorder. Would it be possible to share a JFR recording here?

I think it could be a lack of threads caused by the semaphore + new thread. I moved the correlation code to a container with the job executor disabled, executed it again with parallel 2, and the other container executed every job without job rejections.

But then I tried again with 500 process instances instead of 100, completing the messages with parallel 10 (still in the other project with the job executor disabled), and got many job rejections again.

So even with no other resource-wasting threads in this project, just job acquisition and the job executor, it still gets job rejections when there are many jobs waiting in the database.

I'm very curious about how it works now… If there is no thread free to execute, I was expecting the job to wait in the internal job queue until lock expiration time, and when that queue is full I expected job acquisition to stop acquiring more jobs.

Maybe it would be nice to dig into the source code to understand more of what it is and when this job execution rejection behavior happens.

I now tested completing all 500 message correlations with the job executor disabled, letting the new async jobs wait in the database,

and only then started my other container with the job executor enabled.
I also configured my single job executor with a single core, like this:

    job-execution:
      queue-capacity: 10
      core-pool-size: 1
      max-pool-size: 1
      lock-time-in-millis: 300000
      wait-time-in-millis: 5000
      max-wait: 10000
      max-jobs-per-acquisition: 1

and as it starts I already see a lot of job rejections.

I'm doing all these tests because I'm on a project that will have to execute 500k process instances like this in a single day (but with heavier Java delegates), so I'm trying to understand how it works under the hood and what the best configuration is.

I was thinking about problems with concurrency between multiple containers during job acquisition, many failed job acquisitions and so on, but I'm surprised to see these problems in a single-container scenario :no_mouth:

After a quick call with @amzill discussing some of the aspects here, there is one question left: Are all the jobs you expect to be processed actually finished?

In the end, job-execution-rejected is just information from the engine that no thread was able to process an acquired job. This is not a bad thing per se, but a feature of the engine. The engine should react by increasing the wait time before the next job acquisition, to try and make better use of the resources.
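
To illustrate the idea (this is not the engine's actual code, just a minimal java.util.concurrent sketch with numbers borrowed from your first config; the class and field names are made up):

    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Hypothetical acquisition loop: submit acquired jobs to a bounded worker pool,
    // count rejections, and back off before the next acquisition cycle.
    class AcquisitionLoopSketch {

        private final ThreadPoolExecutor workers = new ThreadPoolExecutor(
                3, 5, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(5)); // core-pool-size, max-pool-size, queue-capacity

        private long waitTimeMillis = 500;  // wait-time-in-millis
        private final long maxWait = 2000;  // max-wait

        void acquisitionCycle(List<Runnable> acquiredJobs) throws InterruptedException {
            int rejected = 0;
            for (Runnable job : acquiredJobs) {
                try {
                    workers.execute(job);        // hand the job to a worker thread
                } catch (RejectedExecutionException e) {
                    rejected++;                  // this is what "job-execution-rejected" counts
                }
            }
            // React to rejections by waiting longer before acquiring the next batch
            waitTimeMillis = rejected > 0 ? Math.min(waitTimeMillis * 2, maxWait) : 500;
            Thread.sleep(waitTimeMillis);
        }
    }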

I did not have time yet to reconstruct your example; maybe on the weekend I can get a little more hands-on.

This is the info I was looking for. Thanks.
I can see that every job has been executed perfectly, but I was afraid that if jobs were being rejected, there might be some wrong configuration wasting resources, locking excessive jobs, and impacting my throughput with the large volume I will have in the real scenario of 500k processes.

Does it occur because I got more jobs in my last job acquisition poll than I have free space in the local queue?
When this job-execution-rejected occurs, does the job stay locked by this owner and in the local queue anyway, or does it clear the owner in the database so another container can try to lock it?

Reading through the source code, it seems the engine tries to execute the job as soon as it is acquired, and only if it gets rejected is it added to the current job queue.

It acquires jobs and calls the executor here,

tries to execute the job here and rejects it,

and after the rejection, the rejectedJobsHandler adds it to the current job queue here.

If my reading is correct, then it's very normal for this to occur and isn't a problem at all, as you've said @javahippie; it only means that the job queue we configured is being used as expected (holding some jobs until a job executor thread gets free for them).
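
Not the engine's actual implementation, but the general java.util.concurrent pattern this maps to, as a sketch: when the pool rejects a task, a handler places it back on the bounded work queue (blocking until a worker frees up) instead of dropping it. All names here are made up:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.RejectedExecutionHandler;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    class RequeueOnRejectSketch {

        static ThreadPoolExecutor boundedPool() {
            // When the pool cannot take the task, block until there is room in the
            // queue instead of dropping the job.
            RejectedExecutionHandler requeue = (task, executor) -> {
                try {
                    executor.getQueue().put(task);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            return new ThreadPoolExecutor(
                    1, 1, 60, TimeUnit.SECONDS,
                    new ArrayBlockingQueue<>(10),  // queue-capacity
                    requeue);
        }
    }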

Happy to hear that!

If you could share your findings, that would be amazing. Some time ago I wrote a script which progressively changed the Job Executor configuration to see how the performance differed.

I also switched out the job worker threads for Java's virtual threads as an experiment once, to remove the thread limit entirely, and it became pretty obvious that at some point the locking on the job executor table became the bottleneck. On the other hand, I have worked with clients processing 2 billion activities a year in the process engine; those limits are not reached easily.
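
If anyone wants to try something along those lines, the experiment was roughly about replacing a fixed worker pool with a virtual-thread-per-task executor (Java 21+). A rough sketch only, not the actual wiring into the engine; the class and the printed "job" are placeholders:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class VirtualThreadWorkersSketch {

        public static void main(String[] args) {
            // One virtual thread per job: the pool itself never rejects work,
            // so the database locking becomes the limiting factor instead.
            try (ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    int jobId = i;
                    workers.submit(() -> System.out.println("executing job " + jobId));
                }
            } // close() waits for the submitted tasks to finish
        }
    }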
