Support for s3n:// in S3Filesystem.ls() #672

Merged
coyotemarin merged 1 commit into Yelp:master from duedil-ltd:feature/s3n-s3-fs on Jul 21, 2013

@tarnfeld

Contributor

I've added support for the s3n:// scheme when calling the ls method on S3Filesystem. This lets us use a composite filesystem with S3Filesystem in front of HadoopFilesystem for better performance (boto is much faster than Hadoop for S3).

This fix transparently passes s3n:// through as s3:// for boto and then converts back to s3n:// in the list of URIs returned.
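
For readers who want a concrete picture of the round trip, here is a minimal sketch of the idea. It is not the code from this pull request: `to_boto_uri`, `restore_scheme`, `ls`, and the `list_keys` callable are hypothetical names, and `list_keys` merely stands in for whatever boto-backed listing the filesystem performs.

```python
def to_boto_uri(uri):
    """Split off the scheme and return (s3:// form of the URI, original scheme).

    Hypothetical helper: only s3:// and s3n:// are accepted.
    """
    scheme, sep, rest = uri.partition('://')
    if not sep or scheme not in ('s3', 's3n'):
        raise ValueError('not an S3 URI: %r' % (uri,))
    return 's3://' + rest, scheme


def restore_scheme(uri, scheme):
    """Rewrite a URI so that it carries the scheme the caller originally used."""
    _, _, rest = uri.partition('://')
    return scheme + '://' + rest


def ls(path_glob, list_keys):
    """Yield listed URIs using the same scheme as path_glob.

    list_keys is an assumed callable that takes an s3:// URI and
    returns an iterable of s3:// URIs (a stand-in for the boto listing).
    """
    boto_uri, scheme = to_boto_uri(path_glob)
    for uri in list_keys(boto_uri):
        yield restore_scheme(uri, scheme)
```

With a listing function that only understands s3://, a caller who passes s3n://bucket/out/ gets s3n:// URIs back, which is what lets a composite filesystem try S3Filesystem first without the caller noticing the translation.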

Comment thread mrjob/fs/s3.py

Collaborator

Use mrjob.parse.urlparse() since the standard library's urlparse() is buggy in some old versions of Python 2.6. Otherwise looks good!

Collaborator

Took care of it myself. :)

@coyotemarin merged commit 2136d2e into Yelp:master on Jul 21, 2013
scottknight added a commit to timtadh/mrjob that referenced this pull request Oct 10, 2013
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)