[Console] Added Parallelization trait #27518

Closed
wants to merge 1 commit into from

Conversation

webmozart
Contributor

Q               A
Branch?         master
Bug fix?        no
New feature?    yes
BC breaks?      no
Deprecations?   no
Tests pass?     yes
Fixed tickets   #8454 ?
License         MIT
Doc PR          TODO

I propose adding a Parallelization trait, which we have been using internally for a while now, to the core Console component.

What's it about

The Parallelization trait adds parallelization capabilities to Console commands in a very simple but efficient manner. These commands can then be launched with a --processes option that determines how many processes the command is executed in.

How to use

  1. Add the Parallelization trait to your command
  2. Implement configure() and call configureParallelization()
  3. Implement fetchElements() to return an iterable of strings (e.g. database IDs)
  4. Implement runSingleCommand() to process the workload for a single element
  5. Implement getElementName() to return the human readable name of one element

Example (simple)

class ImportContactsCommand
{
    use Parallelization;

    protected function configure()
    {
        $this->setName('import:contacts');

        $this->configureParallelization();
    }

    protected function fetchElements(InputInterface $input): iterable
    {
        // I am executed in the master process
        $sheet = $this->openExcelSheet();
        
        foreach ($sheet as $row) {
            yield serialize($row);
        }
    }

    protected function runSingleCommand(string $element, InputInterface $input, OutputInterface $output)
    {
        // I am executed in the child process when using parallelization
        // I am executed in the master process when not using parallelization
        $row = unserialize($element);
        
        $this->importContact($row);
    }

    protected function getElementName(int $count)
    {
        return 1 === $count ? 'contact' : 'contacts';
    }

    // ...
}

By default, this command is executed in a single process like a regular Symfony command. However, it can be switched to multiple processes if that speeds up execution:

bin/console import:contacts --processes 4

Now the same workload is distributed among 4 processes.

Segments and batches

When distributing a workload among processes, each process receives a segment of the processed data. The size of that segment can be configured by overriding getSegmentSize(). Depending on the task, different segment sizes may be optimal.

Within a process, the work is distributed in batches. You can override runBeforeBatch() or runAfterBatch() in order to prepare or finish the batch, for example in order to flush the database only after a batch has been completed.

By default, the batch size is the segment size, hence each process also processes a single batch. For very large segment sizes (e.g. 1000), it might make sense to reduce the batch size (e.g. 100) to improve memory usage and IO throughput. You can do that by overriding getBatchSize().
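The relationship between segments and batches can be sketched in plain PHP. This is only an illustration of the arithmetic described above, not the trait's implementation; planWork() is a hypothetical helper and the sizes are arbitrary.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the proposed trait): splits the elements
 * into segments and each segment into batches.
 *
 * @param string[] $elements
 *
 * @return string[][][] a list of segments, each being a list of batches
 */
function planWork(array $elements, int $segmentSize, ?int $batchSize = null): array
{
    // As described above, the batch size defaults to the segment size,
    // so each segment is processed as a single batch.
    $batchSize = $batchSize ?? $segmentSize;

    $plan = [];
    foreach (array_chunk($elements, $segmentSize) as $segment) {
        $plan[] = array_chunk($segment, $batchSize);
    }

    return $plan;
}

// 10 elements, segments of 4 elements, batches of 2 elements
$ids = array_map('strval', range(1, 10));
$plan = planWork($ids, 4, 2);

foreach ($plan as $index => $batches) {
    printf("segment %d: %d batch(es)\n", $index, count($batches));
}
```

With these numbers, a hook like runAfterBatch() would fire after every second element, which is where a flush would take place.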

Example (advanced)

class ImportContactsCommandWithBatchFlush
{
    use Parallelization;

    protected function configure()
    {
        $this->setName('import:contacts');

        $this->configureParallelization();
    }

    protected function fetchElements(InputInterface $input): iterable
    {
        $sheet = $this->openExcelSheet();

        foreach ($sheet as $row) {
            yield serialize($row);
        }
    }

    protected function runSingleCommand(string $element, InputInterface $input, OutputInterface $output)
    {
        $em = $this->getContainer()->get('doctrine')->getManagerForClass(Contact::class);
        
        $row = unserialize($element);

        $contact = $this->createContact($row);
        
        $em->persist($contact);
    }
    
    protected function runAfterBatch(InputInterface $input, OutputInterface $output)
    {
        $em = $this->getContainer()->get('doctrine')->getManagerForClass(Contact::class);

        // Persist in a batch
        $em->flush();
        $em->clear();
    }

    protected function getElementName(int $count)
    {
        return 1 === $count ? 'contact' : 'contacts';
    }
}

Hooks

  • runBeforeFirstCommand() - executed in the master process at the very beginning
  • runAfterLastCommand() - executed in the master process at the very end
  • runBeforeBatch() - executed in the child process before a batch
  • runAfterBatch() - executed in the child process after a batch

Pending decision

Before we proceed with CS details, implementation feedback etc.: Is there any interest in adding this to core?

@dominikzogg

I have done something similar, but a bit more generic: https://github.com/saxulum/saxulum-processes-executor

Member

@stof left a comment

All this logic requires tests of course (and as this involves subprocesses, I think this might even require some functional tests)

$exception->getTraceAsString()
));

$this->getContainer()->reset();
Member

this will cause issues for commands using DI, as the dependencies injected in the command will still be the old instances (and so may be different from the ones injected into services instantiated lazily after the reset).
And using DI in commands is the recommended way in 3.4+


foreach ($elements as $element) {
// Close the input stream if the segment is full
if (null !== $currentInputStream && $numberOfStreamedElements >= $this->segmentSize) {
Member

this will send all the input to one of the processes before starting the second process, right ?

wouldn't it be better to actually loop over the running processes (start the N processes for the first N elements, and then start again at the beginning to send the next elements to each process and so on) ?
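The two distribution strategies discussed here can be contrasted with a small sketch. Both functions are hypothetical and only assign elements to lists; the actual trait writes to process input streams, which is not reproduced here.

```php
<?php

declare(strict_types=1);

/**
 * Sequential strategy (as in the reviewed code): fill one segment completely
 * before moving on to the next one.
 */
function distributeSequentially(iterable $elements, int $segmentSize): array
{
    $segments = [];
    $current = [];

    foreach ($elements as $element) {
        $current[] = $element;

        if (count($current) >= $segmentSize) {
            $segments[] = $current;
            $current = [];
        }
    }

    if ([] !== $current) {
        $segments[] = $current;
    }

    return $segments;
}

/**
 * Round-robin strategy (as suggested in the review): the i-th element goes
 * to process number i modulo N.
 */
function distributeRoundRobin(iterable $elements, int $numberOfProcesses): array
{
    $queues = array_fill(0, $numberOfProcesses, []);
    $index = 0;

    foreach ($elements as $element) {
        $queues[$index % $numberOfProcesses][] = $element;
        ++$index;
    }

    return $queues;
}

$elements = ['a', 'b', 'c', 'd', 'e'];

print_r(distributeSequentially($elements, 2));
print_r(distributeRoundRobin($elements, 2));
```

With round-robin, every process receives its first element right away, whereas the sequential variant only starts feeding the second process once the first segment is full.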

Contributor Author

I don't think that makes any difference as writing in the input stream should not be blocking, i.e. I expect the OS to buffer data in the input as long as the process is busy. I didn't test it in that detail though.

->addOption(
'processes',
'p',
InputOption::VALUE_OPTIONAL,
Member

should be VALUE_REQUIRED. Passing only -p does not make sense.

Contributor Author

👍

}
} else {
// Distribute if we have multiple segments
$commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
Member

this is missing escaping, which will cause issues in case arguments have spaces or other special chars in them. Use the 3.4+ array syntax of the Process constructor instead.
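A minimal sketch of the difference, in plain PHP: the command is assembled as a list of arguments rather than through sprintf(). buildChildCommand() and its parameters are hypothetical; per the comment above, such a list is what the array syntax of the Process constructor would receive.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: builds the child command as a list of arguments
 * instead of a single, unescaped string.
 *
 * @param string[] $arguments
 *
 * @return string[]
 */
function buildChildCommand(string $php, string $script, string $commandName, array $arguments): array
{
    return array_merge([$php, $script, $commandName], array_values($arguments), ['--child']);
}

// An argument containing a space stays a single list entry
$command = buildChildCommand(PHP_BINARY, 'bin/console', 'import:contacts', ['my file.xlsx']);

// If a single command line is needed after all, every part is escaped on its own
$commandLine = implode(' ', array_map('escapeshellarg', $command));

echo $commandLine, "\n";
```

In the sprintf() variant, "my file.xlsx" would reach the child process as two separate arguments.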

Contributor Author

👍

}
} else {
// Distribute if we have multiple segments
$commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
Member

this is also broken if the bin file is not bin/console. Use $_SERVER['PHP_SELF'] to detect it instead, as done in the help

Contributor Author

👍

}
} else {
// Distribute if we have multiple segments
$commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
Member

--no-debug should be added only if the current command has it too, to make this trait compatible with the debug mode.

Contributor Author

It is here all the time since not having --no-debug has severe performance implications. I don't mind too much though.

Member

then it should at least check $input->hasOption() to handle applications not having that option (not FrameworkBundle one)

*/
private static function getWorkingDirectory(ContainerInterface $container): string
{
return dirname($container->getParameter('kernel.root_dir'));
Member

this should run in the current working directory instead. Otherwise, this might break commands taking an argument which is a relative filesystem path (as it would change the folder to which it is relative).
This is also necessary in case you switch to $_SERVER['PHP_SELF'] as it might be a relative path too

'PATH' => getenv('PATH'),
'HOME' => getenv('HOME'),
'SYMFONY_DEBUG' => $container->getParameter('kernel.debug'),
'SYMFONY_ENV' => $container->getParameter('kernel.environment'),
Member

isn't the env inherited already anyway ?

This would remove the need to couple it to the container here

Contributor Author

Not AFAIR. I wouldn't have added this if it hadn't been necessary.

Member

Well, the environment is inherited by default since a 3.x release (I think it was 3.2 but not sure). Older releases required opting in. On which Symfony version was this code written initially ?
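Environment inheritance, together with the earlier point about the working directory, can be illustrated with PHP's native proc_open() rather than the Process component. This is a sketch under assumptions: DEMO_PARALLEL_VAR is a made-up variable name, and the quoting assumes a POSIX shell.

```php
<?php

declare(strict_types=1);

// Set a variable in the environment of the parent process (made-up name)
putenv('DEMO_PARALLEL_VAR=from-parent');

// The child simply prints the variable it sees
$command = escapeshellarg(PHP_BINARY).' -r '.escapeshellarg('echo getenv("DEMO_PARALLEL_VAR");');

// Only stdout is captured through a pipe
$descriptors = [1 => ['pipe', 'w']];

// cwd: the current working directory, so relative paths keep their meaning
// env: null, so the child uses the same environment as the current PHP process
$process = proc_open($command, $descriptors, $pipes, getcwd() ?: null, null);

if (!is_resource($process)) {
    throw new RuntimeException('Could not start the child process.');
}

$output = stream_get_contents($pipes[1]);
fclose($pipes[1]);
proc_close($process);

echo $output, "\n";
```

Nothing here depends on the container, which is the decoupling the review asks about.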

self::getEnvironmentVariables($this->getContainer()),
$numberOfProcesses,
$segmentSize,
$this->getContainer()->get('logger'),
Member

you should make the retrieval of the logger configurable instead, as not all commands are container aware

*
* @return string A single character
*/
private static function getProgressSymbol()
Member

should not be static if you call it non-statically

$commandTemplate = sprintf('%s bin/console %s %s --child --env=%s --verbose --no-debug',
self::detectPhpExecutable(),
$this->getName(),
implode(' ', array_slice($input->getArguments(), 1)),
Member

this is forwarding arguments, but it is not forwarding options. It should forward all options except the ones belonging to the trait IMO.

and why slicing ? If this is to remove the element argument, you cannot assume it is the first argument.
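Forwarding options while leaving out the trait's own could look roughly like the sketch below. forwardOptions() is hypothetical; it assumes an array shaped like the result of $input->getOptions() and treats processes and child as the options owned by the trait.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: turns an option array into command line tokens,
 * skipping the options that belong to the trait itself.
 *
 * @param array<string, mixed> $options  e.g. the result of $input->getOptions()
 * @param string[]             $excluded option names owned by the trait (assumed)
 *
 * @return string[]
 */
function forwardOptions(array $options, array $excluded = ['processes', 'child']): array
{
    $forwarded = [];

    foreach (array_diff_key($options, array_flip($excluded)) as $name => $value) {
        // Skip options that were not passed and flags that are disabled
        if (null === $value || false === $value) {
            continue;
        }

        // Flags without a value
        if (true === $value) {
            $forwarded[] = '--'.$name;

            continue;
        }

        // Single values and array options (one token per item)
        foreach ((array) $value as $item) {
            $forwarded[] = '--'.$name.'='.$item;
        }
    }

    return $forwarded;
}

print_r(forwardOptions([
    'processes' => '4',
    'env' => 'prod',
    'verbose' => true,
    'tag' => ['a', 'b'],
    'quiet' => false,
    'child' => null,
]));
```

The resulting tokens are separate list entries, so they can be appended to an argument list without any manual escaping.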

@webmozart
Contributor Author

Lovely, thanks for all the feedback @stof. I'll wait for a general decision before working any more on this.

@stof
Member

stof commented Jun 6, 2018

This will not solve #8454. The comments in that issue were already saying that Spork did not fit their needs because it can only handle PHP processes, and this implementation is even stricter, as it runs the same Symfony command in all sub-processes based on a batch of elements (so only one very specific use case among the ones supported by a process manager).
https://github.com/saxulum/saxulum-processes-executor is more in line with #8454, but is not equivalent to this trait (building the logic of this trait with this processes executor would still require reimplementing most of the trait).

I haven't had the need for this trait yet (cases where I have heavy processing to perform on a bunch of elements are implemented by dispatching a bunch of messages in our RabbitMQ and letting our workers do the work). But I'm not opposed to having this in core if it works well.

implode(' ', array_slice($input->getArguments(), 1)),
$input->getOption('env')
);
$terminalWidth = current($this->getApplication()->getTerminalDimensions());
Contributor

Application::getTerminalDimensions() has been deprecated since version 3.2 and was removed in 4.0

@nicolas-grekas nicolas-grekas added this to the next milestone Jun 7, 2018
*/
private function startProcess(InputStream $inputStream): void
{
$process = new Process(
Member

That means each element to process will be executed in a separate process, and each of them will need to bootstrap the application again and again.
Can we use threads instead?

Contributor Author

No, each process executes a segment of elements (50, 100, 1000... you decide), for which it bootstraps the application.

Threads would be an option, but more complicated and AFAIK not as easy to implement in an interoperable fashion. I didn't try, but I'm happy if you want to give it a shot.

@nicolas-grekas nicolas-grekas changed the title Added Parallelization trait [Console] Added Parallelization trait Jun 19, 2018
@nicolas-grekas
Member

nicolas-grekas commented Jun 28, 2018

Happy to see you again @webmozart.
I'm afraid that honestly, I'm not sure this belongs to core...
This is a non-trivial amount of code to maintain, for a situation that doesn't feel common enough to me, and that can be solved with xargs or parallel most of the time. Not with the exact same behavior of course, but that still narrows down the use case even further...
I might be wrong of course.

@theofidry
Contributor

@nicolas-grekas IMO the use case is not that rare, proof is a few people liking the feature and it's also not the first time this feature has been requested. I also know a couple of people that quite like it but never submitted anything because the feature is far from being trivial and they didn't have the confidence to propose a PR.

The issues with xargs and parallel:

  • It doesn't work on Windows
  • It requires you to have two commands: one to provide the arguments and a second to process it
  • You get the bootstrapping process as an overhead for each element since it's no longer a long-living process unless you add more stuff to have the processing command acting as a server

The major benefit of it is that the best alternative right now would be to leverage queues, but in a lot of cases it introduces a lot of complexity and you might not have queues at all in your app in the first place.

But that's only my opinion about the feature, not so much about the implementation :)

@fabpot
Member

fabpot commented Aug 2, 2018

I'm also 👎 to merge this one in core.

@nicolas-grekas
Member

Thanks for proposing this patch. I hope you or someone else will maintain it as a standalone package.

@nicolas-grekas nicolas-grekas modified the milestones: next, 4.2 Nov 1, 2018
@andreas-gruenwald

@webmozart Are there any plans to publish this as a bundle? I already had plans to create the exact same trait on my own, before I found your solution. Your parallelization trait solves a very common issue.

@webmozart
Contributor Author

webmozart commented Oct 22, 2019

@andreas-gruenwald Yes we can do so. We're continually using and working on this trait in our company, so having it in a package would make sense. /cc @theofidry

@webmozart
Contributor Author

For whoever cares, here's the code as standalone package: https://github.com/webmozarts/console-parallelization
