[Scheduler] Intermittent Runs #51646

Closed
rodnaph opened this issue Sep 13, 2023 · 9 comments · Fixed by #51651
Comments

@rodnaph
Contributor

rodnaph commented Sep 13, 2023

Symfony version(s) affected

6.3.4

Description

When using the scheduler with a single daily recurring message, we often see days where the scheduler appears not to have run.

Here is the schedule config:

return (new Schedule())
    ->add(RecurringMessage::every('1 day', new OurDailyMessage(), from: '10:30'))
    ->stateful($this->cache)
;

We have configured caching for the scheduler provider using the database, to allow it to catch up if the task were to crash. The servers run in UTC.
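For context, the cache passed to stateful() is a database-backed pool, roughly along these lines (simplified sketch only; the namespace and table name are placeholders rather than our real config, and $connection is the injected Doctrine DBAL connection):

use Symfony\Component\Cache\Adapter\DoctrineDbalAdapter;

// Database-backed cache pool, so scheduler state survives worker restarts.
// Namespace and table name are placeholders, not our real values.
$this->cache = new DoctrineDbalAdapter($connection, 'scheduler', 0, ['db_table' => 'scheduler_cache']);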

We have added some middleware to give us some visibility into scheduler activity; it simply logs whenever a message is dispatched by the schedule.

use Doctrine\DBAL\Connection;
use Doctrine\DBAL\Types\Types;
use Symfony\Component\Messenger\Envelope;
use Symfony\Component\Messenger\Middleware\MiddlewareInterface;
use Symfony\Component\Messenger\Middleware\StackInterface;
use Symfony\Component\Scheduler\Messenger\ScheduledStamp;

class SchedulerLoggingMiddleware implements MiddlewareInterface
{
    public function __construct(
        private readonly Connection $connection,
    ) {
    }

    public function handle(Envelope $envelope, StackInterface $stack): Envelope
    {
        // Only log messages that were dispatched by the scheduler.
        if (null !== $envelope->last(ScheduledStamp::class)) {
            $this->connection->insert('scheduler_log', [
                'date_created' => new \DateTimeImmutable(),
                'message' => $envelope->getMessage()::class,
            ], [
                Types::DATETIME_IMMUTABLE,
                \PDO::PARAM_STR,
            ]);
        }

        return $stack->next()->handle($envelope, $stack);
    }
}

This is then run in production alongside our other Messenger transports (e.g. async, another_transport) in an ECS task.

bin/console messenger:consume scheduler_default async another_transport etc --limit=1000 --time-limit=3600 

As you can see from the logs, it appears some days are missing (i.e. the 10th and the 7th in this example).

[screenshot: scheduler_log entries showing the missing days]

I know this is a tricky one, as I don't have an exact reproduction and don't really know where to start debugging this. I'm hoping that in raising an issue there is perhaps something obvious that is misconfigured, or that someone can point to an area to look at to try to debug or reproduce this.

How to reproduce

As indicated above, I do not have a reproduction; running the scheduler locally on shorter timeframes seems to behave as intended, and debugging a daily job is hard, so I'm hoping for some pointers before starting on that.
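For reference, when testing locally on shorter timeframes we run essentially the same schedule with a tighter recurrence, roughly like this (sketch only; the one-minute interval is just for local debugging):

// Local debugging variant: same schedule, but recurring every minute
// instead of daily, so behaviour can be observed quickly.
return (new Schedule())
    ->add(RecurringMessage::every('1 minute', new OurDailyMessage()))
    ->stateful($this->cache)
;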

Possible Solution

Unsure if this is related: #51384

Additional Context

  1. Is running the scheduler in the same command as other messenger transports a valid use? I thought the cache should allow the scheduler to catch up if it happens to be busy at that moment... but should the scheduler always be run as the sole transport?
@kbond
Member

kbond commented Sep 13, 2023

Difficult to debug indeed...

  1. Is it at all possible the cache could have been cleared around the time it was supposed to be run?

  2. I know this would take some days to confirm, but could you run the schedule without the cache (to see if caching is the problem)? There's a sketch of what I mean at the end of this comment. I also know the job might be dropped if the worker is restarting during the trigger time – hopefully logs can rule this out?

  3. Is running the scheduler in the same command as other messenger transports a valid use?

    This is totally valid as far as I know, but again, maybe test running it in its own worker, with and without the cache?

Hopefully we can confirm it is a caching issue; if so, #51384 seems like it would be related.
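To be concrete about (2), something like this is what I mean: the same schedule with the stateful() call dropped, so no persisted state is involved at all (sketch only):

// Same schedule, but without stateful($this->cache): no persisted state,
// so a missed run can only come from no worker being up at trigger time.
return (new Schedule())
    ->add(RecurringMessage::every('1 day', new OurDailyMessage(), from: '10:30'))
;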

@rodnaph
Contributor Author

rodnaph commented Sep 13, 2023

Thanks for taking the time to respond!

Is it at all possible the cache could have been cleared around the time it was supposed to be run?

It's unlikely; we don't currently have anything set up to clear the cache, not on deploy or anywhere else... it would take manual action.

I know this would take some days to confirm but could you run the schedule without the cache

That's a good suggestion. I think it's fairly unlikely anyway that the ECS task won't be running, so that should not affect testing this. We'll give it a try and see how it goes. As you said, this might take a week or two before we have useful results, but I'll come back to this ticket when I have enough data to share.

Thanks again.

@valtzu
Contributor

valtzu commented Sep 13, 2023

@rodnaph If it's possible, could you try if #51651 fixes your issue?

to allow it to catchup

At least this seems to work with the fix.

@rodnaph
Contributor Author

rodnaph commented Sep 14, 2023

@rodnaph If it's possible, could you try if #51651 fixes your issue?


Hi - thanks for the suggestion. Unfortunately I'm not sure it will be possible to include your PR in our build process to test it; I'll let you know if I am able to, though.

@kbond
Member

kbond commented Sep 14, 2023

If you confirm the problem is the cache, as suggested above, that will strongly suggest #51651 is the fix.

@fabpot fabpot closed this as completed in a90eca6 Sep 16, 2023
@fabpot
Member

fabpot commented Sep 16, 2023

This bug has been fixed in 6.4, not 6.3, as we had to introduce a BC break to fix it properly.

@rodnaph
Contributor Author

rodnaph commented Sep 19, 2023

I see this issue has been closed, but we've continued to log our scheduler activity with the cache disabled, and noticed an apparently missing run yesterday which, if not caused by some other part of our setup, could indicate the linked cache change is not the cause here.

[screenshot: log of instances where the scheduler has run for the above config]

I'm going to add some more logging to our handlers to check whether our middleware is buggy. I'll also upgrade to 6.4 when the other fix becomes available in a release.
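The extra handler-side logging will probably look something like this (sketch; OurDailyMessageHandler and the injected logger are illustrative rather than our exact production code):

use Psr\Log\LoggerInterface;
use Symfony\Component\Messenger\Attribute\AsMessageHandler;

// Illustrative handler-side logging, independent of the middleware, so we can
// cross-check whether the middleware itself is dropping log entries.
#[AsMessageHandler]
final class OurDailyMessageHandler
{
    public function __construct(private readonly LoggerInterface $logger)
    {
    }

    public function __invoke(OurDailyMessage $message): void
    {
        $this->logger->info('OurDailyMessage handled', [
            'at' => (new \DateTimeImmutable())->format(\DateTimeImmutable::ATOM),
        ]);

        // ... existing handler logic
    }
}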

@valtzu
Contributor

valtzu commented Sep 19, 2023

Without cache, there are several possibilities where a run may be skipped, for example:

  1. A worker is brought down during deployment, which can leave a gap with no consumer online; if that gap covers the time of the run, the run is skipped
  2. You may have --memory-limit or --time-limit set, which causes the worker to restart, and the restart may just happen to coincide with the trigger time – which means the run is skipped
  3. A worker may exit because of an exception/error and then be brought back up by the restart policy, but if this happens at the time of the trigger – again, the run is skipped

If your case falls under any of those scenarios, I don't think it's a bug in the scheduler component.

If you're sure that there was a scheduler worker alive at the time when the run was meant to happen, then there could be an issue indeed 🤔
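If it helps, one way to check that is to log the worker lifecycle events, for example with a subscriber along these lines (sketch only; the class name and log messages are illustrative):

use Psr\Log\LoggerInterface;
use Symfony\Component\EventDispatcher\EventSubscriberInterface;
use Symfony\Component\Messenger\Event\WorkerStartedEvent;
use Symfony\Component\Messenger\Event\WorkerStoppedEvent;

// Illustrative subscriber that logs when a messenger worker starts and stops,
// so gaps around the trigger time can be spotted in the logs.
final class WorkerLifecycleLogger implements EventSubscriberInterface
{
    public function __construct(private readonly LoggerInterface $logger)
    {
    }

    public static function getSubscribedEvents(): array
    {
        return [
            WorkerStartedEvent::class => 'onWorkerStarted',
            WorkerStoppedEvent::class => 'onWorkerStopped',
        ];
    }

    public function onWorkerStarted(WorkerStartedEvent $event): void
    {
        $this->logger->info('Messenger worker started');
    }

    public function onWorkerStopped(WorkerStoppedEvent $event): void
    {
        $this->logger->info('Messenger worker stopped');
    }
}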

@rodnaph
Contributor Author

rodnaph commented Sep 19, 2023

You're absolutely correct @valtzu; I did note in my previous comment that while there is a small chance the worker might not be running, I thought it unlikely enough not to affect these results.

But it seems from our logs that in this case the worker happened to stop 19 seconds before the scheduled run, and then not restart until 20 seconds after it. 🤦

I'll continue monitoring and return here if an issue does seem to exist.
