Issue with maximum concurrent runs and job status

I have a very simple Glue ETL job configured with a maximum of 1 concurrent run. This job works fine when run manually from the AWS console and CLI.

I have some Python code that is designed to run this job periodically against a queue of work that results in different arguments being passed to the job. The Python code starts the job and waits for it to enter the SUCCEEDED state but will abort if it stops, fails, times out, etc.

The relevant snippet is:

        # Start the run; the response carries the JobRunId used for polling.
        start_response = self.client.start_job_run(JobName=self.jobname, Arguments=formatted_arguments)

        if not wait:
            return

        jobid = start_response['JobRunId']
        log.info("Waiting for glue job %s (%s)", self.jobname, jobid)
        while True:
            state = self.job_status(jobid)
            if state == 'SUCCEEDED':
                log.info("Glue job %s (%s) completed", self.jobname, jobid)
                return
            # A run that is stopping, stopped, failed, or timed out aborts the wait.
            if state in ['STOPPED', 'FAILED', 'TIMEOUT', 'STOPPING']:
                raise RuntimeError("Glue job %s (%s) %s" % (self.jobname, jobid, state))
            # Anything other than STARTING or RUNNING at this point is unexpected.
            if state not in ['STARTING', 'RUNNING']:
                raise RuntimeError("Glue job %s (%s) is in unknown state %s" % (self.jobname, jobid, state))

            log.debug("Waiting for %s (%s), which is %s", self.jobname, jobid, state)
            time.sleep(GLUE_STATUS_INTERVAL)

Unfortunately, seemingly without fail, if I start that same job again as soon as the previous run enters the SUCCEEDED state, Glue claims I've hit the maximum concurrent runs (1) for the job in question:

ConcurrentRunsExceededException: An error occurred (ConcurrentRunsExceededException) when calling the StartJobRun operation: Concurrent runs exceeded for <job>

When I look at the console, the job is SUCCEEDED and there are no others running, for this job or otherwise.

I can work around this by sleeping in the right place, but that feels like papering over what smells like a bug somewhere. I noticed that for one example run, the last log entry, written after the run was reported as SUCCEEDED, was at 9:54:15, but the console says the end time was 9:55 (with no seconds).

Any ideas why I can't start another run immediately after the previous one completes? Is there some sort of cool-down period?

Edited by: jhart-r7 on Jun 20, 2018 10:18 AM

asked 6 years ago, 3833 views
2 Answers

I've been able to reliably work around this by sleeping a minute between consecutive runs of this job. It really feels like either there is another state after SUCCEEDED that isn't described in https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-runs.html, or there is a cool-down period between runs that isn't documented.
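
As an alternative to a fixed one-minute sleep, the start call itself can be retried while Glue still reports the concurrency limit. The following is only a rough sketch, not a confirmed fix: the helper name `start_job_run_with_retry` and its `attempts`, `delay`, and `sleep` parameters are made up for illustration, and it assumes `client` is a boto3 Glue client (which exposes `client.exceptions.ConcurrentRunsExceededException`).

```python
import time


def start_job_run_with_retry(client, job_name, arguments,
                             attempts=6, delay=10.0, sleep=time.sleep):
    """Start a Glue job run, retrying while the concurrency limit is reported.

    Hypothetical helper: `attempts` and `delay` are illustrative defaults,
    not values recommended by AWS.
    """
    for attempt in range(1, attempts + 1):
        try:
            response = client.start_job_run(JobName=job_name, Arguments=arguments)
            return response["JobRunId"]
        except client.exceptions.ConcurrentRunsExceededException:
            # The previous run has apparently not been released yet.
            if attempt == attempts:
                raise  # give up and surface the original error
            sleep(delay)


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   glue = boto3.client("glue")
#   run_id = start_job_run_with_retry(glue, "my-job", {"--key": "value"})
```

This only waits as long as Glue actually refuses the new run, so it should be no slower than the fixed sleep and often faster.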

Has anyone else experienced this? Ideas to solve this?

answered 6 years ago

I faced a similar issue recently. In my case I have to run the same job 100 times with different parameter values. Is there any solution for such a case? I am using a Lambda function for this, and it has a maximum timeout of 15 minutes, so I could not even finish 3 runs. Please suggest an alternative if there is one.
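
One pattern that might fit the Lambda timeout, sketched here under assumptions rather than offered as a tested solution: instead of waiting for each run inside a single invocation, each invocation tries to start at most one run and returns immediately, leaving the rest of the work for the next invocation. The function name `dispatch_next` and the in-memory `pending` list are hypothetical; in practice the pending parameter sets would have to live somewhere durable, and the function would need to be invoked repeatedly (for example on a schedule). `client` is assumed to be a boto3 Glue client.

```python
def dispatch_next(client, job_name, pending):
    """Try to start one run for the first pending argument set.

    Returns the JobRunId if a run was started, or None if there is nothing
    left to do or Glue still reports the concurrency limit.
    `pending` is a hypothetical list of argument dicts, oldest first.
    """
    if not pending:
        return None
    try:
        response = client.start_job_run(JobName=job_name, Arguments=pending[0])
    except client.exceptions.ConcurrentRunsExceededException:
        # A previous run is still counted as active; leave the work queued
        # and let a later invocation try again.
        return None
    # Only remove the work item once Glue has accepted the run.
    pending.pop(0)
    return response["JobRunId"]
```

Because no invocation ever waits for a run to finish, the 15-minute limit applies only to the start call, not to the total time for all 100 runs.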

answered 3 years ago
