[Console] Added Parallelization trait #27518

webmozart · 2018-06-06T10:06:14Z

Q	A
Branch?	master
Bug fix?	no
New feature?	yes
BC breaks?	no
Deprecations?	no
Tests pass?	yes
Fixed tickets	#8454 ?
License	MIT
Doc PR	TODO

I propose adding a Parallelization trait, which we have been using internally for a while now, to the core Console component.

What's it about

The Parallelization trait adds parallelization capabilities to Console commands in a very simple but efficient manner. These commands can then be launched with a --processes option that determines how many processes the command is executed in.

How to use

Add the Parallelization trait to your command
Implement configure() and call configureParallelization()
Implement fetchElements() to return an iterable of strings (e.g. database IDs)
Implement runSingleCommand() to process the workload for a single element
Implement getElementName() to return the human readable name of one element

Example (simple)

class ImportContactsCommand extends ContainerAwareCommand
{
    use Parallelization;

    protected function configure()
    {
        $this->setName('import:contacts');

        $this->configureParallelization();
    }

    protected function fetchElements(InputInterface $input): iterable
    {
        // I am executed in the master process
        $sheet = $this->openExcelSheet();
        
        foreach ($sheet as $row) {
            yield serialize($row);
        }
    }

    protected function runSingleCommand(string $element, InputInterface $input, OutputInterface $output)
    {
        // I am executed in the child process when using parallelization
        // I am executed in the master process when not using parallelization
        $row = unserialize($element);
        
        $this->importContact($row);
    }

    protected function getElementName(int $count)
    {
        return 1 === $count ? 'contact' : 'contacts';
    }

    // ...
}

By default, this command is executed in a single process like a regular Symfony command. However, it can be changed to use multiple processes if that improves performance:

bin/console import:contacts --processes 4

Now the same workload is distributed among 4 processes.

Segments and batches

When distributing a workload among processes, each process receives a segment of the processed data. The size of that segment can be configured by overriding getSegmentSize(). Depending on the task, different segment sizes may be optimal.

Within a process, the work is distributed in batches. You can override runBeforeBatch() or runAfterBatch() in order to prepare or finish the batch, for example in order to flush the database only after a batch has been completed.

By default, the batch size is the segment size, hence each process also processes a single batch. For very large segment sizes (e.g. 1000), it might make sense to reduce the batch size (e.g. 100) to improve memory usage and IO throughput. You can do that by overriding getBatchSize().
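To make the relationship between segments and batches concrete, here is a minimal, self-contained sketch. It is not the trait's implementation: splitIntoSegmentsAndBatches() is a hypothetical helper, and it only assumes what is described above, namely that segments are consecutive chunks of the element list and batches are consecutive chunks of a segment.

```php
<?php

/**
 * Hypothetical helper (not part of the trait): splits the elements into
 * segments and each segment into batches.
 *
 * @param string[] $elements
 *
 * @return string[][][] segments => batches => elements
 */
function splitIntoSegmentsAndBatches(array $elements, int $segmentSize, int $batchSize): array
{
    $segments = [];

    foreach (array_chunk($elements, $segmentSize) as $segment) {
        // One segment is what a single child process receives; the child
        // then works through it batch by batch.
        $segments[] = array_chunk($segment, $batchSize);
    }

    return $segments;
}

// 10 elements, segment size 4, batch size 2
$elements = array_map('strval', range(1, 10));
$plan = splitIntoSegmentsAndBatches($elements, 4, 2);

foreach ($plan as $index => $batches) {
    printf("segment %d: %d batch(es)\n", $index, count($batches));
}
```

With these numbers the last segment holds only two elements and therefore a single batch, so a hook such as runAfterBatch() would run once for it.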

Example (advanced)

class ImportContactsCommandWithBatchFlush extends ContainerAwareCommand
{
    use Parallelization;

    protected function configure()
    {
        $this->setName('import:contacts');

        $this->configureParallelization();
    }

    protected function fetchElements(InputInterface $input): iterable
    {
        $sheet = $this->openExcelSheet();

        foreach ($sheet as $row) {
            yield serialize($row);
        }
    }

    protected function runSingleCommand(string $element, InputInterface $input, OutputInterface $output)
    {
        $em = $this->getContainer()->get('doctrine')->getManagerForClass(Contact::class);
        
        $row = unserialize($element);

        $contact = $this->createContact($row);
        
        $em->persist($contact);
    }
    
    protected function runAfterBatch(InputInterface $input, OutputInterface $output)
    {
        $em = $this->getContainer()->get('doctrine')->getManagerForClass(Contact::class);

        // Persist in a batch
        $em->flush();
        $em->clear();
    }

    protected function getElementName(int $count)
    {
        return 1 === $count ? 'contact' : 'contacts';
    }
}

Hooks

runBeforeFirstCommand() - executed in the master process at the very beginning
runAfterLastCommand() - executed in the master process at the very end
runBeforeBatch() - executed in the child process before a batch
runAfterBatch() - executed in the child process after a batch

Pending decision

Before we proceed with CS details, implementation feedback etc.: Is there any interest to add this to core?

dominikzogg · 2018-06-06T12:20:59Z

I've done something similar, but a bit more generic: https://github.com/saxulum/saxulum-processes-executor

stof

All this logic requires tests of course (and as this involves subprocesses, I think this might even require some functional tests)

stof · 2018-06-06T11:41:15Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+                $exception->getTraceAsString()
+            ));
+
+            $this->getContainer()->reset();


this will cause issues for commands using DI, as the dependencies injected in the command will still be the old instances (and so may be different from the one injected into services instantiated lazily after the reset).
And using DI in commands is the recommended way in 3.4+
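The concern can be illustrated with a small sketch. TinyContainer is a hypothetical stand-in written for this example, not a Symfony class; it only assumes that a reset makes the container forget its shared instances.

```php
<?php

// Hypothetical stand-in for a container: lazily creates one shared
// instance per service id and can forget all of them on reset().
final class TinyContainer
{
    private $services = [];

    public function get(string $id): object
    {
        if (!isset($this->services[$id])) {
            $this->services[$id] = new \stdClass();
        }

        return $this->services[$id];
    }

    public function reset(): void
    {
        $this->services = [];
    }
}

$container = new TinyContainer();

// A command using DI receives its dependency once, at construction time.
$injectedIntoCommand = $container->get('entity_manager');

$container->reset();

// Anything instantiated after the reset gets a fresh instance...
$createdAfterReset = $container->get('entity_manager');

// ...so the command and the newer services no longer share one object.
var_dump($injectedIntoCommand === $createdAfterReset); // bool(false)
```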

stof · 2018-06-06T11:45:21Z

src/Symfony/Component/Console/Parallelization/ProcessLauncher.php

+
+        foreach ($elements as $element) {
+            // Close the input stream if the segment is full
+            if (null !== $currentInputStream && $numberOfStreamedElements >= $this->segmentSize) {


this will send all the input to one of the processes before starting the second process, right ?

wouldn't it be better to actually loop over running process (start the N processes for the first N elements, and then start again at the beginning to send next elements to each processes and so on) ?

I don't think that makes any difference as writing in the input stream should not be blocking, i.e. I expect the OS to buffer data in the input as long as the process is busy. I didn't test it in that detail though.
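For readers comparing the two strategies discussed in this thread, here is a rough sketch of how the same elements would be assigned under each. Both functions are hypothetical and only model the assignment order; they say nothing about buffering or timing, which is the open question above.

```php
<?php

/** Sequential: fill one segment completely before opening the next one. */
function assignSequentially(array $elements, int $segmentSize): array
{
    return array_chunk($elements, $segmentSize);
}

/** Round-robin: hand the elements to the N processes in turn. */
function assignRoundRobin(array $elements, int $numberOfProcesses): array
{
    $assignments = array_fill(0, $numberOfProcesses, []);

    foreach (array_values($elements) as $index => $element) {
        $assignments[$index % $numberOfProcesses][] = $element;
    }

    return $assignments;
}

$ids = ['a', 'b', 'c', 'd', 'e', 'f'];

$sequential = assignSequentially($ids, 3); // [[a, b, c], [d, e, f]]
$roundRobin = assignRoundRobin($ids, 2);   // [[a, c, e], [b, d, f]]

print_r($sequential);
print_r($roundRobin);
```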

stof · 2018-06-06T11:47:11Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+            ->addOption(
+                'processes',
+                'p',
+                InputOption::VALUE_OPTIONAL,


should be VALUE_REQUIRED. Passing only -p does not make sense.

stof · 2018-06-06T11:48:05Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+            }
+        } else {
+            // Distribute if we have multiple segments
+            $commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',


this is missing escaping, which will cause issues in case arguments have spaces or other special chars in them. Use the 3.4+ array syntax of the Process constructor instead.
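A minimal sketch of what that could look like, assuming the command is assembled as a list of arguments rather than through sprintf(). buildChildCommand() and the sample file name are hypothetical; the flags are the ones from the quoted line.

```php
<?php

/**
 * Hypothetical helper: builds the child command as a list of arguments
 * instead of one interpolated string, so no manual escaping is needed.
 *
 * @param string[] $arguments
 *
 * @return string[]
 */
function buildChildCommand(string $php, string $console, string $name, array $arguments, string $env): array
{
    return array_merge(
        [$php, $console, $name],
        array_values($arguments),
        ['--child', '--env='.$env, '--verbose', '--no-debug']
    );
}

$command = buildChildCommand(PHP_BINARY, 'bin/console', 'import:contacts', ['path' => 'my contacts.xlsx'], 'prod');

// Following the suggestion above, this array would be handed to the
// Process constructor as-is: new Process($command)

// For comparison, the string form needs every part escaped by hand:
echo implode(' ', array_map('escapeshellarg', $command)), "\n";
```

The argument containing a space stays a single array element, which is the point of the array form.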

stof · 2018-06-06T11:50:15Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+            }
+        } else {
+            // Distribute if we have multiple segments
+            $commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',


this is also broken if the bin file is not bin/console. Use $_SERVER['PHP_SELF'] to detect it instead, as done in the help

stof · 2018-06-06T11:51:25Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+            }
+        } else {
+            // Distribute if we have multiple segments
+            $commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',


--no-debug should be added only if the current command has it too, to make this trait compatible with the debug mode.

It is here all the time since not having --no-debug has severe performance implications. I don't mind too much though.

then it should at least check $input->hasOption() to handle applications not having that option (not FrameworkBundle one)

stof · 2018-06-06T11:54:10Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+     */
+    private static function getWorkingDirectory(ContainerInterface $container): string
+    {
+        return dirname($container->getParameter('kernel.root_dir'));


this should run in the current working directory instead. Otherwise, this might break commands taking an argument which is a relative filesystem path (as it would change the folder to which it is relative).
This is also necessary in case you switch to $_SERVER['PHP_SELF'] as it might be a relative path too
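As a sketch of the two suggestions combined (the invoked script from $_SERVER['PHP_SELF'] and the current working directory instead of kernel.root_dir), assuming a hypothetical helper named describeCurrentInvocation():

```php
<?php

/**
 * Hypothetical helper: collects what a child process would need in order
 * to be started the same way as the current one.
 *
 * @return array{script: string, cwd: string}
 */
function describeCurrentInvocation(): array
{
    $cwd = getcwd();

    if (false === $cwd) {
        throw new \RuntimeException('Could not determine the current working directory.');
    }

    return [
        // On the CLI this is the script as it was invoked, e.g. "bin/console".
        // It may be a relative path, which is why the working directory matters.
        'script' => (string) ($_SERVER['PHP_SELF'] ?? ''),
        'cwd' => $cwd,
    ];
}

$invocation = describeCurrentInvocation();

printf("%s (in %s)\n", $invocation['script'], $invocation['cwd']);
```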

stof · 2018-06-06T11:54:48Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+            'PATH' => getenv('PATH'),
+            'HOME' => getenv('HOME'),
+            'SYMFONY_DEBUG' => $container->getParameter('kernel.debug'),
+            'SYMFONY_ENV' => $container->getParameter('kernel.environment'),


isn't the env inherited already anyway ?

This would remove the need to couple it to the container here

Not AFAIR. I wouldn't have added this if it hadn't been necessary.

Well, the environment is inherited by default since a 3.x release (I think it was 3.2 but not sure). Older releases were requiring to opt-in. On which Symfony version was this code written initially ?

stof · 2018-06-06T11:56:12Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+                self::getEnvironmentVariables($this->getContainer()),
+                $numberOfProcesses,
+                $segmentSize,
+                $this->getContainer()->get('logger'),


you should make the retrieval of the logger configurable instead, as not all commands are container aware

stof · 2018-06-06T12:04:58Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+     *
+     * @return string A single character
+     */
+    private static function getProgressSymbol()


should not be static if you call it non-statically

stof · 2018-06-06T12:29:15Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+            $commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
+                self::detectPhpExecutable(),
+                $this->getName(),
+                implode(' ', array_slice($input->getArguments(), 1)),


this is forwarding arguments, but it is not forwarding options. It should forward all options except the ones belonging to the trait IMO.

and why slicing ? If this is to remove the element argument, you cannot assume it is the first argument.
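One possible shape for such forwarding, as a sketch only. forwardOptions() is a hypothetical helper; it assumes the trait's own options are named "processes" and "child", and that the input array looks like the result of $input->getOptions().

```php
<?php

/**
 * Hypothetical helper: turns parsed options back into CLI tokens, skipping
 * the options that belong to the parallelization trait itself.
 *
 * @param array<string, mixed> $options  e.g. the result of $input->getOptions()
 * @param string[]             $excluded option names owned by the trait (assumed)
 *
 * @return string[]
 */
function forwardOptions(array $options, array $excluded = ['processes', 'child']): array
{
    $tokens = [];

    foreach ($options as $name => $value) {
        // Skip the trait's own options and options that were not set.
        if (in_array($name, $excluded, true) || null === $value || false === $value) {
            continue;
        }

        // Flags without a value
        if (true === $value) {
            $tokens[] = '--'.$name;

            continue;
        }

        // Scalar values and array options (one token per item)
        foreach ((array) $value as $item) {
            $tokens[] = '--'.$name.'='.$item;
        }
    }

    return $tokens;
}

$tokens = forwardOptions([
    'processes' => '4',
    'env' => 'prod',
    'verbose' => true,
    'tag' => ['a', 'b'],
    'no-debug' => false,
]);

print_r($tokens); // --env=prod, --verbose, --tag=a, --tag=b
```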

webmozart · 2018-06-06T12:57:16Z

Lovely, thanks for all the feedback @stof. I'll wait for a general decision before working any more on this.

stof · 2018-06-06T14:39:55Z

This will not solve #8454. The comments in this issue were already saying that Spork was not fitting their need due to being able to handle only PHP processes, and this implementation is even more strict, as it runs the same Symfony command in all sub-processes based on a batch of elements (so only one very specific use case of the ones supported by a process manager).
https://github.com/saxulum/saxulum-processes-executor is more in line with #8454, but is not equivalent to this trait (building the logic of this trait with this processes executor would still require to reimplement most of the trait).

I haven't had the need for this trait yet (cases where I have heavy processing to perform on a bunch of elements are implemented by dispatching a bunch of messages in our RabbitMQ and letting our workers do the work). But I'm not opposed to having this in core if it works well.

pborreli · 2018-06-06T15:27:30Z

src/Symfony/Component/Console/Parallelization/Parallelization.php

+                implode(' ', array_slice($input->getArguments(), 1)),
+                $input->getOption('env')
+            );
+            $terminalWidth = current($this->getApplication()->getTerminalDimensions());


Application::getTerminalDimensions() has been deprecated since version 3.2 and was removed in 4.0

keradus · 2018-06-08T07:41:27Z

src/Symfony/Component/Console/Parallelization/ProcessLauncher.php

+     */
+    private function startProcess(InputStream $inputStream): void
+    {
+        $process = new Process(


That means each element to process will be executed in separated process, each of them will need to bootstrap application again and again.
Can we use threads instead?

No, each process executes a segment of elements (50, 100, 1000... you decide), for which it bootstraps the application.

Threads would be an option, but more complicated and AFAIK not as easy to implement in an interoperable fashion. I didn't try, but I'm happy if you want to give it a shot.
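The bootstrapping cost can be estimated with a bit of arithmetic. The sketch below assumes one child process start (and therefore one application bootstrap) per segment, as described in the reply above; countChildProcesses() is a hypothetical helper.

```php
<?php

/**
 * Hypothetical helper: number of child processes (and application
 * bootstraps) needed, assuming one child process per segment.
 */
function countChildProcesses(int $numberOfElements, int $segmentSize): int
{
    return (int) ceil($numberOfElements / $segmentSize);
}

echo countChildProcesses(10000, 50), "\n";   // 200
echo countChildProcesses(10000, 1000), "\n"; // 10
```

Under that assumption, larger segments mean fewer bootstraps, at the price of coarser distribution among the processes.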

nicolas-grekas · 2018-06-28T06:47:40Z

Happy to see you again @webmozart.
I'm afraid that honestly, I'm not sure this belongs to core...
This is a non-trivial amount of code to maintain, for a situation that doesn't feel common enough to me, and that can be solved with xargs or parallel most of the time. Not with the exact same behavior of course, but that's still narrowing down the use case even further...
I might be wrong of course.

theofidry · 2018-06-28T07:54:27Z

@nicolas-grekas IMO the use case is not that rare, proof is a few people liking the feature and it's also not the first time this feature has been requested. I also know a couple of people that quite like it but never submitted anything because the feature is far from being trivial and they didn't have the confidence to propose a PR.

The issues with xargs and parallel:

It doesn't work on Windows
It requires you to have two commands: one to provide the arguments and a second to process it
You get the bootstrapping process as an overhead for each element since it's no longer a long-living process unless you add more stuff to have the processing command acting as a server

The major benefit of it is that the best alternative right now would be to leverage queues, but in a lot of cases it introduces a lot of complexity and you might not have queues at all in your app in the first place.

But that's only my opinion about the feature, not so much about the implementation :)

fabpot · 2018-08-02T10:06:20Z

I'm also 👎 to merge this one in core.

nicolas-grekas · 2018-08-15T20:32:51Z

Thanks for proposing this patch. I hope you or someone else will maintain it as a standalone package.

andreas-gruenwald · 2019-10-22T15:33:19Z

@webmozart Are there any plans to publish this as a bundle? I already had plans to create the exact same trait on my own, before I found your solution. Your parallelization trait is solving a very common issue.

webmozart · 2019-10-22T15:58:57Z

@andreas-gruenwald Yes we can do so. We're continually using and working on this trait in our company, so having it in a package would make sense. /cc @theofidry

webmozart · 2019-10-31T09:48:50Z

For whoever cares, here's the code as standalone package: https://github.com/webmozarts/console-parallelization

Added Parallelization trait

091c9c0

carsonbot added Status: Needs Review Feature labels Jun 6, 2018

stof requested changes Jun 6, 2018

carsonbot added Status: Needs Work and removed Status: Needs Review labels Jun 6, 2018

stof reviewed Jun 6, 2018

pborreli reviewed Jun 6, 2018

nicolas-grekas added this to the next milestone Jun 7, 2018

keradus reviewed Jun 8, 2018

nicolas-grekas changed the title ~~Added Parallelization trait~~ [Console] Added Parallelization trait Jun 19, 2018

chalasr added the Console label Jun 20, 2018

nicolas-grekas closed this Aug 15, 2018

nicolas-grekas modified the milestones: next, 4.2 Nov 1, 2018
