[Console] Added Parallelization trait #27518
I have done something similar, but a bit more generic: https://github.com/saxulum/saxulum-processes-executor
All this logic requires tests of course (and as this involves subprocesses, I think this might even require some functional tests)
    $exception->getTraceAsString()
));

$this->getContainer()->reset();
This will cause issues for commands using DI, as the dependencies injected into the command will still be the old instances (and so may differ from the ones injected into services instantiated lazily after the reset).
And using DI in commands is the recommended way in 3.4+.
foreach ($elements as $element) {
    // Close the input stream if the segment is full
    if (null !== $currentInputStream && $numberOfStreamedElements >= $this->segmentSize) {
This will send all the input to one of the processes before starting the second process, right?
Wouldn't it be better to loop over the running processes instead (start the N processes for the first N elements, then go back to the first process to send the next elements to each process in turn, and so on)?
I don't think that makes any difference as writing in the input stream should not be blocking, i.e. I expect the OS to buffer data in the input as long as the process is busy. I didn't test it in that detail though.
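The two distribution strategies discussed in this thread can be compared with a minimal sketch in plain PHP. It does not touch the trait or the Process component; the function names `distributeSequentially()` and `distributeRoundRobin()` are hypothetical and only illustrate which process would receive which element, assuming positive segment sizes and process counts.

```php
<?php
declare(strict_types=1);

/**
 * Sequential strategy: fill one segment completely before starting the next.
 *
 * @param string[] $elements
 * @return string[][] one list of elements per segment
 */
function distributeSequentially(array $elements, int $segmentSize): array
{
    return array_chunk($elements, $segmentSize);
}

/**
 * Round-robin strategy: element number i goes to process (i mod N).
 *
 * @param string[] $elements
 * @return string[][] one list of elements per process
 */
function distributeRoundRobin(array $elements, int $numberOfProcesses): array
{
    $queues = array_fill(0, $numberOfProcesses, []);
    foreach (array_values($elements) as $index => $element) {
        $queues[$index % $numberOfProcesses][] = $element;
    }

    return $queues;
}

$ids = ['1', '2', '3', '4', '5', '6'];
print_r(distributeSequentially($ids, 3)); // segments: [1, 2, 3] and [4, 5, 6]
print_r(distributeRoundRobin($ids, 2));   // queues:   [1, 3, 5] and [2, 4, 6]
```

Whether the second strategy is actually faster depends on how the input streams are buffered, which is the open question in the reply above.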
->addOption(
    'processes',
    'p',
    InputOption::VALUE_OPTIONAL,
This should be VALUE_REQUIRED. Passing only -p does not make sense.
👍
    }
} else {
    // Distribute if we have multiple segments
    $commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
this is missing escaping, which will cause issues in case arguments have spaces or other special chars in them. Use the 3.4+ array syntax of the Process constructor instead.
👍
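A minimal sketch of the array form suggested in the comment above, in plain PHP. The helper `buildChildCommand()` and the command name `import:movies` are hypothetical. The `--no-debug` flag is left out on purpose, since whether to add it is debated separately in this review.

```php
<?php
declare(strict_types=1);

/**
 * Builds the child command as a list of arguments instead of one string.
 * Each entry reaches the child unchanged, so no shell quoting is needed.
 *
 * @param string[] $arguments
 * @return string[]
 */
function buildChildCommand(
    string $phpExecutable,
    string $consoleScript,
    string $commandName,
    array $arguments,
    string $environment
): array {
    return array_merge(
        [$phpExecutable, $consoleScript, $commandName],
        array_values($arguments),
        ['--child', '--env='.$environment, '--verbose']
    );
}

// An argument containing a space stays a single argument.
$command = buildChildCommand(PHP_BINARY, $_SERVER['PHP_SELF'], 'import:movies', ['my file.csv'], 'prod');
print_r($command);
```

The resulting array is what the reviewer proposes to hand to the Process constructor in place of the `sprintf()` string.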
    }
} else {
    // Distribute if we have multiple segments
    $commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
This is also broken if the bin file is not bin/console. Use $_SERVER['PHP_SELF'] to detect it instead, as done in the help.
👍
    }
} else {
    // Distribute if we have multiple segments
    $commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
--no-debug should be added only if the current command has it too, to make this trait compatible with the debug mode.
It is here all the time since not having --no-debug has severe performance implications. I don't mind too much though.
Then it should at least check $input->hasOption() to handle applications that do not have that option (i.e. applications other than the FrameworkBundle one).
 */
private static function getWorkingDirectory(ContainerInterface $container): string
{
    return dirname($container->getParameter('kernel.root_dir'));
This should run in the current working directory instead. Otherwise, it might break commands taking an argument that is a relative filesystem path (as it would change the folder to which the path is relative).
This is also necessary in case you switch to $_SERVER['PHP_SELF'], as it might be a relative path too.
'PATH' => getenv('PATH'),
'HOME' => getenv('HOME'),
'SYMFONY_DEBUG' => $container->getParameter('kernel.debug'),
'SYMFONY_ENV' => $container->getParameter('kernel.environment'),
Isn't the env inherited already anyway?
That would remove the need to couple this to the container here.
Not AFAIR. I wouldn't have added this if it hadn't been necessary.
Well, the environment has been inherited by default since a 3.x release (I think it was 3.2, but I'm not sure). Older releases required opting in. On which Symfony version was this code initially written?
self::getEnvironmentVariables($this->getContainer()),
$numberOfProcesses,
$segmentSize,
$this->getContainer()->get('logger'),
you should make the retrieval of the logger configurable instead, as not all commands are container aware
 *
 * @return string A single character
 */
private static function getProgressSymbol()
should not be static if you call it non-statically
$commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
    self::detectPhpExecutable(),
    $this->getName(),
    implode(' ', array_slice($input->getArguments(), 1)),
This forwards arguments, but it does not forward options. It should forward all options except the ones belonging to the trait, IMO.
And why the slicing? If this is meant to remove the element argument, you cannot assume it is the first argument.
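Forwarding options while skipping the trait's own ones could look roughly like the sketch below. It is plain PHP and assumes the options arrive as a name-to-value array, similar in shape to what `$input->getOptions()` returns. The helper name `forwardOptions()` and the sample option names are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Converts parsed options into command-line tokens for the child process,
 * skipping the options that belong to the parallelization trait itself.
 *
 * @param array<string, mixed> $options  option name => parsed value
 * @param string[]             $excluded option names that must not be forwarded
 * @return string[]
 */
function forwardOptions(array $options, array $excluded): array
{
    $tokens = [];
    foreach ($options as $name => $value) {
        if (in_array($name, $excluded, true) || null === $value || false === $value) {
            continue; // trait-owned option, or option that was not set
        }
        if (true === $value) {
            $tokens[] = '--'.$name; // flag without a value
            continue;
        }
        foreach ((array) $value as $item) { // array options are repeated
            $tokens[] = '--'.$name.'='.$item;
        }
    }

    return $tokens;
}

$tokens = forwardOptions(
    ['processes' => '4', 'env' => 'prod', 'force' => true, 'tag' => ['a', 'b'], 'dry-run' => false],
    ['processes']
);
print_r($tokens); // --env=prod, --force, --tag=a, --tag=b
```

Each token is a separate list entry, so the result combines naturally with an argument list rather than a concatenated command string.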
Lovely, thanks for all the feedback @stof. I'll wait for a general decision before working any more on this.
This will not solve #8454. The comments in that issue already said that Spork did not fit their needs because it can only handle PHP processes, and this implementation is even stricter, as it runs the same Symfony command in all sub-processes based on a batch of elements (so only one very specific use case among those supported by a process manager). I haven't had the need for this trait yet (cases where I have heavy processing to perform on a bunch of elements are implemented by dispatching a bunch of messages to our RabbitMQ and letting our workers do the work). But I'm not opposed to having this in core if it works well.
    implode(' ', array_slice($input->getArguments(), 1)),
    $input->getOption('env')
);
$terminalWidth = current($this->getApplication()->getTerminalDimensions());
Application::getTerminalDimensions() has been deprecated since version 3.2 and was removed in 4.0.
 */
private function startProcess(InputStream $inputStream): void
{
    $process = new Process(
That means each element to process will be executed in a separate process, and each of them will need to bootstrap the application again and again.
Can we use threads instead?
No, each process executes a segment of elements (50, 100, 1000... you decide), for which it bootstraps the application.
Threads would be an option, but more complicated and AFAIK not as easy to implement in an interoperable fashion. I didn't try, but I'm happy if you want to give it a shot.
Happy to see you again @webmozart.
@nicolas-grekas IMO the use case is not that rare, proof is a few people liking the feature and it's also not the first time this feature has been requested. I also know a couple of people that quite like it but never submitted anything because the feature is far from being trivial and they didn't have the confidence to propose a PR. The issues with xargs and parallel:
The major benefit of it is that the best alternative right now would be to leverage queues, but in a lot of cases that introduces a lot of complexity, and you might not have queues in your app in the first place. But that's only my opinion about the feature, not so much about the implementation :)
I'm also 👎 to merge this one in core.
Thanks for proposing this patch. I hope you or someone else will maintain it as a standalone package.
@webmozart Are there any plans to publish this as a bundle? I already had plans to create the exact same trait on my own before I found your solution. Your parallelization trait solves a very common issue.
@andreas-gruenwald Yes, we can do so. We're continually using and working on this trait in our company, so having it in a package would make sense. /cc @theofidry
For whoever cares, here's the code as a standalone package: https://github.com/webmozarts/console-parallelization
I propose to add a Parallelization trait, which we have been using internally for a while now, to the core Console component.
What's it about
The Parallelization trait adds parallelization capabilities to Console commands in a very simple but efficient manner. These commands can then be launched with a --processes option to determine in how many processes the command should be executed.
How to use
- Add the Parallelization trait to your command
- Override configure() and call configureParallelization()
- Implement fetchElements() to return an iterable of strings (e.g. database IDs)
- Implement runSingleCommand() to process the workload for a single element
- Implement getElementName() to return the human-readable name of one element
Example (simple)
By default, this command is executed in a single process like a regular Symfony command. However, it can be changed to use multiple processes if that speeds up performance:
Now the same workload is distributed among 4 processes.
Segments and batches
When distributing a workload among processes, each process receives a segment of the processed data. The size of that segment can be configured by overriding getSegmentSize(). Depending on the task, different segment sizes may be optimal.
Within a process, the work is distributed in batches. You can override runBeforeBatch() or runAfterBatch() in order to prepare or finish the batch, for example in order to flush the database only after a batch has been completed.
By default, the batch size is the segment size, hence each process also processes a single batch. For very large segment sizes (e.g. 1000), it might make sense to reduce the batch size (e.g. 100) to improve memory usage and IO throughput. You can do that by overriding getBatchSize().
Example (advanced)
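As a rough illustration of how segments and batches relate, here is a minimal sketch in plain PHP. It does not use the trait: the helper `planWork()` is hypothetical, and it assumes that batches are cut within each segment and never span two segments.

```php
<?php
declare(strict_types=1);

/**
 * Splits the workload into segments (one per child process run) and each
 * segment into batches (the units between the before/after batch hooks).
 *
 * @param string[] $elements
 * @return string[][][] list of segments, each holding a list of batches
 */
function planWork(array $elements, int $segmentSize, int $batchSize): array
{
    $plan = [];
    foreach (array_chunk($elements, $segmentSize) as $segment) {
        $plan[] = array_chunk($segment, $batchSize);
    }

    return $plan;
}

$ids = array_map('strval', range(1, 10));

foreach (planWork($ids, 4, 2) as $segmentIndex => $batches) {
    foreach ($batches as $batchIndex => $batch) {
        // A real command would flush the database here, once per batch.
        printf("segment %d, batch %d: %s\n", $segmentIndex, $batchIndex, implode(',', $batch));
    }
}
```

With 10 elements, a segment size of 4 and a batch size of 2, this yields three segments, the last of which holds a single batch of two elements.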
Hooks
- runBeforeFirstCommand(): executed in the master process at the very beginning
- runAfterLastCommand(): executed in the master process at the very end
- runBeforeBatch(): executed in the child process before a batch
- runAfterBatch(): executed in the child process after a batch
Pending decision
Before we proceed with CS details, implementation feedback etc.: Is there any interest in adding this to core?