Long explanation of the problem
You can specify jobconfs either for the entire job or for each individual step...sort of.
Command line options, config file options, and the MRJob.jobconf() method all refer to the same job-level set of jobconf values.
The step object also accepts a jobconf kwarg which it expects to be a dict.
But there's a lurking issue: if you override MRJob.jobconf() without also overriding MRJob.steps(), everything explodes. (Except not in local/inline mode, due to #655.) The explosion symptoms are documented in #585.
Example demonstrating exact way to reproduce:

    import os

    from mrjob.job import MRJob


    class Job(MRJob):

        # comment this method out to get a crash
        def steps(self):
            return [self.mr(mapper=self.mapper)]

        def jobconf(self):
            return {}

        def mapper(self, _, v):
            for k, v in os.environ.iteritems():
                yield k, v


    if __name__ == '__main__':
        Job().run()

By default, steps() returns [MRJobStep(mapper=self.mapper, ..., jobconf=self.jobconf)], where each key is only included if it has been overridden in the subclass.
But including jobconf in that list has never been valid, and apparently no test cases exist for it. Oops! One could make an argument that jobconf() should be treated the same as mapper(), i.e. only applying to the first step and only if steps() isn't implemented, but its behavior for the entire lifespan of mrjob has been to return job-level jobconf values, not step-level.
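To make the failure mode concrete, here's a rough toy model of that default behavior. It does not use mrjob at all: `BaseJob`, `STEP_KEYS`, and `default_step_kwargs()` are hypothetical names invented for this sketch, and `'example.key'` is a placeholder. The point is only that collecting "every overridden method" hands the step a bound method under `jobconf` where a dict is expected.

```python
# Toy model only. BaseJob, STEP_KEYS and default_step_kwargs are hypothetical
# names for illustration; they are not mrjob's real classes or internals.
class BaseJob(object):
    # Methods that the default steps() would look for in the subclass.
    STEP_KEYS = ('mapper', 'reducer', 'jobconf')

    def mapper(self, key, value):
        raise NotImplementedError

    def reducer(self, key, values):
        raise NotImplementedError

    def jobconf(self):
        # Job-level jobconf values.
        return {}

    def default_step_kwargs(self):
        """Collect every STEP_KEYS method that the subclass has overridden."""
        kwargs = {}
        for name in self.STEP_KEYS:
            if getattr(type(self), name) != getattr(BaseJob, name):
                kwargs[name] = getattr(self, name)
        return kwargs


class Job(BaseJob):
    def mapper(self, key, value):
        yield key, value

    def jobconf(self):
        return {'example.key': 'value'}  # placeholder key, not a real setting


kwargs = Job().default_step_kwargs()
print(sorted(kwargs))
# The step receives a bound method here, not the dict it expects.
print(isinstance(kwargs['jobconf'], dict), callable(kwargs['jobconf']))
```

Dropping `'jobconf'` from `STEP_KEYS` in this model is the equivalent of the first suggestion below.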
Suggested course of action
- Remove jobconf from the list of keys taken from the MRJob class if steps() is not specified.
- Write test cases for overriding jobconf() and any other methods that are missing test coverage in this context.
- Thoroughly document when a user-specified jobconf value is job-level vs step-level.
It seems reasonable that a step-level jobconf argument would have to be a dictionary, not a function. It's necessarily evaluated at job start time, not at task time, even if it's specific to a step.
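A minimal sketch of what "dict only, resolved at job start" could look like. `resolve_step_jobconf()` is a hypothetical helper, not part of mrjob, and the precedence rule (step-level values win over job-level ones on a conflict) is my assumption; the issue doesn't settle that.

```python
def resolve_step_jobconf(job_jobconf, step_jobconf=None):
    """Combine job-level and step-level jobconf once, at job start time.

    Hypothetical helper for illustration. Assumes step-level values take
    precedence over job-level values with the same key.
    """
    if step_jobconf is None:
        step_jobconf = {}
    if not isinstance(step_jobconf, dict):
        # Reject callables such as a bound jobconf() method up front,
        # instead of failing later while building the Hadoop command.
        raise TypeError('step-level jobconf must be a dict, got %s'
                        % type(step_jobconf).__name__)
    merged = dict(job_jobconf)   # copy so the job-level dict is not mutated
    merged.update(step_jobconf)  # assumed precedence: step overrides job
    return merged
```

Under those assumptions, `resolve_step_jobconf({'a': '1', 'b': '2'}, {'b': '3'})` gives `{'a': '1', 'b': '3'}`, and passing a function instead of a dict raises `TypeError` immediately.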
Ping @tarnfeld and @DavidMarin to confirm everything I just said.