
Crash when overriding MRJob.jobconf() and not overriding MRJob.steps() #656

@irskep


Long explanation of the problem

You can specify jobconfs for either the entire job or each individual step...sort of.

Command line options, config file options, and the MRJob.jobconf() method all refer to the same job-level set of jobconf values.

The step object also accepts a jobconf kwarg which it expects to be a dict.

But there's a lurking issue: if you try to override MRJob.jobconf(), without also overriding MRJob.steps(), everything explodes. (Except not in local/inline mode, due to #655.) The explosion symptoms are documented in #585.

Example demonstrating exact way to reproduce:

import os
from mrjob.job import MRJob

class Job(MRJob):

    # comment this method out to get a crash
    def steps(self):
        return [self.mr(mapper=self.mapper)]

    def jobconf(self):
        return {}

    def mapper(self, _, value):
        # emit every environment variable visible to the task
        for env_key, env_value in os.environ.items():
            yield env_key, env_value

if __name__ == '__main__':
    Job().run()

By default, steps() returns [MRJobStep(mapper=self.mapper, ..., jobconf=self.jobconf)], where each key is only included if it has been overridden in the subclass.
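To make the failure mode concrete, here is a toy model of that default behavior. It is only a sketch: `ToyJob`, `ToyStep`, `_STEP_KEYS`, and the `example.key` setting are hypothetical names made up for illustration, not mrjob's actual code. It shows that when `jobconf` is collected like any other overridable method, the step ends up holding a bound method where it expects a dict.

```python
class ToyStep(object):
    """Hypothetical stand-in for a step object; not mrjob's real class."""

    def __init__(self, mapper=None, jobconf=None):
        self.mapper = mapper
        self.jobconf = jobconf  # a real step would expect a dict here


class ToyJob(object):
    """Hypothetical stand-in for MRJob with the default steps() described above."""

    _STEP_KEYS = ('mapper', 'jobconf')

    def mapper(self, key, value):
        raise NotImplementedError

    def jobconf(self):
        return {}

    def _is_overridden(self, name):
        # walk the MRO; a definition found before ToyJob means a subclass overrode it
        for klass in type(self).__mro__:
            if klass is ToyJob:
                return False
            if name in vars(klass):
                return True
        return False

    def steps(self):
        # include each key only if the subclass overrides the method
        kwargs = dict((name, getattr(self, name))
                      for name in self._STEP_KEYS if self._is_overridden(name))
        return [ToyStep(**kwargs)]


class Job(ToyJob):
    def jobconf(self):
        return {'example.key': '1'}

    def mapper(self, _, value):
        yield _, value


step = Job().steps()[0]
print(isinstance(step.jobconf, dict))  # False: the step got a method, not a dict
print(callable(step.jobconf))          # True
```

Any code that later treats `step.jobconf` as a dict would fail at that point, which is consistent with the crash only showing up when `steps()` is left at its default.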

But including jobconf in that list has never been valid, and apparently no test cases exist for it. Oops! One could make an argument that jobconf() should be treated the same as mapper(), i.e. only applying to the first step and only if steps() isn't implemented, but its behavior for the entire lifespan of mrjob has been to return job-level jobconf values, not step-level.

Suggested course of action

  1. Remove jobconf from the list of keys taken from the MRJob class if steps() is not specified.
  2. Write test cases for overriding jobconf() and any other methods that are missing test coverage in this context.
  3. Thoroughly document when a user-specified jobconf value is job-level vs step-level.
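Items 1 and 2 could look roughly like the following. This is a sketch against a toy model, not a patch for mrjob: `ToyJob`, `STEP_KEYS`, `JobconfOnlyJob`, and the `example.key` setting are hypothetical names, and the default step is represented as a plain dict of keyword arguments. The point is only that `jobconf` is left out of the keys copied into the default step, and that a test pins this down.

```python
# Toy model only: ToyJob and STEP_KEYS are hypothetical names, not mrjob's API.
STEP_KEYS = ('mapper', 'reducer')  # 'jobconf' is intentionally absent


class ToyJob(object):
    def mapper(self, key, value):
        raise NotImplementedError

    def reducer(self, key, values):
        raise NotImplementedError

    def jobconf(self):
        # job-level values; never handed to a step
        return {}

    def _is_overridden(self, name):
        # walk the MRO; a definition found before ToyJob means a subclass overrode it
        for klass in type(self).__mro__:
            if klass is ToyJob:
                return False
            if name in vars(klass):
                return True
        return False

    def steps(self):
        # default single step, represented here as a plain dict of kwargs
        return [dict((name, getattr(self, name))
                     for name in STEP_KEYS if self._is_overridden(name))]


class JobconfOnlyJob(ToyJob):
    def jobconf(self):
        return {'example.key': '1'}


def test_jobconf_override_stays_job_level():
    job = JobconfOnlyJob()
    # overriding jobconf() alone must not leak anything into the default step
    assert job.steps() == [{}]
    # the override is still available as the job-level value
    assert job.jobconf() == {'example.key': '1'}


test_jobconf_override_stays_job_level()
```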

It seems reasonable that a step-level jobconf argument would have to be a dictionary, not a function. It's necessarily evaluated at job start time, not at task time, even if it's specific to a step.

Ping @tarnfeld and @DavidMarin to confirm everything I just said.
