Tags: twitter/scalding
Add support for cogroups in beam-backend (#1945) This change adds support for `HashCoGroup` and `CoGroupedPipe`. To evaluate HashCoGroup we create a ParDo transformation on the larger pipe, with the smaller pipe as a side input. To evaluate CoGroupedPipe we use the `MultiJoinFunction` to produce the final output iterator. This change also includes a minor refactor for code > 100 lines. TESTS: Added unit tests for both HashCoGroup and CoGroupedPipe.
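The hash-join idea described above can be sketched with plain Scala collections. This is only an illustration, not the beam-backend code: `HashJoinSketch` and `hashJoin` are hypothetical names, the in-memory `Map` merely stands in for a Beam side input, and the sketch assumes inner-join semantics.

```scala
object HashJoinSketch {
  // Hypothetical helper: joins a large keyed stream against a small keyed
  // collection that fits in memory (the role the side input plays above).
  def hashJoin[K, A, B](large: Iterator[(K, A)], small: Seq[(K, B)]): Iterator[(K, (A, B))] = {
    // Index the smaller side by key once, up front.
    val lookup: Map[K, Seq[B]] =
      small.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

    // Stream over the larger side and emit one output per matching small-side value.
    large.flatMap { case (k, a) =>
      lookup.getOrElse(k, Seq.empty[B]).iterator.map(b => (k, (a, b)))
    }
  }

  def main(args: Array[String]): Unit = {
    val large = Iterator(1 -> "a", 2 -> "b", 3 -> "c")
    val small = Seq(1 -> 10, 1 -> 11, 3 -> 30)
    hashJoin(large, small).foreach(println)
  }
}
```

The larger side is never materialized; only the smaller side is held in memory, which is the reason for choosing it as the side input.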
Fix race between jvm shutdown and `writer.finished` (#1938) Currently `writer.finished` runs in an `onComplete` callback on the result `Future` in `Execution`. However, `onComplete` callbacks are not run before the future is resolved; they are scheduled asynchronously after it resolves, which leads to a race and a runtime error:
- User's code executes an `Execution` as the last operation in `main`
- `onComplete` with `writer.finished` is scheduled
- the result `Future` gets resolved and the jvm starts to shut down
- `writer.finished` starts to execute and, in the case of the cascading backend, adds a shutdown hook, which is not permitted during jvm shutdown and fails
To fix this behaviour I made the `onComplete` logic happen before the result future gets resolved by changing `onComplete` to `andThen`.
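The difference between the two combinators can be sketched with the standard library alone. This is a minimal illustration, not the Scalding code: `FakeWriter` is a hypothetical stand-in for the real writer, and `runWithOnComplete` / `runWithAndThen` are made-up names.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AndThenOrdering {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for the writer; `finished` is deliberately slow.
  final class FakeWriter {
    val finishedCalled = new AtomicBoolean(false)
    def finished(): Unit = {
      Thread.sleep(50)
      finishedCalled.set(true)
    }
  }

  // Racy shape: the returned future may resolve before the callback has run,
  // so a caller that awaits it (and then exits) can overtake `finished`.
  def runWithOnComplete(writer: FakeWriter): Future[Int] = {
    val result = Future(42)
    result.onComplete(_ => writer.finished())
    result
  }

  // Fixed shape: `andThen` returns a new future that resolves only after
  // the side effect has run, with the original result passed through.
  def runWithAndThen(writer: FakeWriter): Future[Int] =
    Future(42).andThen { case _ => writer.finished() }

  def main(args: Array[String]): Unit = {
    val writer = new FakeWriter
    val value = Await.result(runWithAndThen(writer), 5.seconds)
    // By the time the awaited future resolves, `finished` has already run.
    assert(writer.finishedCalled.get)
    println(s"value=$value finished=${writer.finishedCalled.get}")
  }
}
```

With `runWithOnComplete`, the same assertion is not guaranteed to hold, which is the window in which the jvm can begin shutting down.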
Alternative implementation to DeprecatedParquetInputFormat with fix (#1937) When combining N Parquet files, the first record of each of files 2 to N is skipped and the last record of the previous file is returned in its place. As a result some records are lost while others are duplicated, which is quite bad. This was fixed a month ago in apache/parquet-java#844, but we would need to update the dependencies. Should we take this approach or work towards updating deps?
Remove mapreduce.input.fileinputformat.inputdir setting in memory source (#1936) Memory source sets the mapreduce.input.fileinputformat.inputdir property to a random UUID value. In clusters with HDFS federation, paths like that are often not valid namespaces. The path is not usually checked, since this is a memory source, but in clusters where Kerberos is enabled Hadoop lists the input dirs of a job to get delegation tokens. Since this path is not valid, this results in a FileNotFoundException on a Kerberized cluster. This patch removes the setting from Scalding memory sources, since the path is not valid anyway. Co-authored-by: Navin Viswanath
[temporary] bring twitter internal changes to release
Add type ascriptions to serialization code (#1926) Scalding macros expand into a large amount of code, most of which contained few or no type ascriptions, leaving a lot of unnecessary work to the compiler. By explicitly adding these type ascriptions in the generated code, we can reduce compilation times.
Add explicit type annotation to make code compatible with Scala 2.12