
Tags: twitter/scalding


tlazaro/twitter/20210917

Add support for cogroups in beam-backend (#1945)

In this change we add support for `HashCoGroup` and `CoGroupedPipe`.
To evaluate `HashCoGroup`, we create a `ParDo` transformation on the larger pipe, with the smaller pipe as a side input.
To evaluate `CoGroupedPipe`, we use the `MultiJoinFunction` to produce the final output iterator.

As part of this change, we also do a minor refactor of code exceeding 100 lines.

TESTS: Added unit tests for both `HashCoGroup` and `CoGroupedPipe`.
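
A rough sketch of the side-input pattern described above, written against the Beam Java API from Scala (hypothetical names; this is not the actual scalding beam-backend code):

```scala
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{DoFn, ParDo, View}
import org.apache.beam.sdk.values.{KV, PCollection, PCollectionView}

object HashJoinSketch {
  // Materialize the smaller pipe as a multimap side input and stream the
  // larger pipe through a ParDo that probes it per element. (A real pipeline
  // would also need a Coder registered for the Scala tuple output.)
  def hashJoin[K, V, W](
      larger: PCollection[KV[K, V]],
      smaller: PCollection[KV[K, W]]): PCollection[KV[K, (V, W)]] = {
    val side: PCollectionView[java.util.Map[K, java.lang.Iterable[W]]] =
      smaller.apply(View.asMultimap[K, W]())

    larger.apply(
      ParDo
        .of(new DoFn[KV[K, V], KV[K, (V, W)]] {
          @ProcessElement
          def process(c: DoFn[KV[K, V], KV[K, (V, W)]]#ProcessContext): Unit = {
            val key = c.element.getKey
            val matches = c.sideInput(side).get(key) // null when the key is absent
            if (matches != null) {
              matches.forEach(w => c.output(KV.of(key, (c.element.getValue, w))))
            }
          }
        })
        .withSideInputs(side))
  }
}
```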

twitter/20210415

Fix race between jvm shutdown and `writer.finished` (#1938)

Currently `writer.finished` happens in an `onComplete` callback on the `Future` result in `Execution`. However, since `onComplete` is not called before the future is resolved, but asynchronously after it is resolved, this leads to a race and a runtime error:
- The user's code executes an `Execution` as the last operation in `main`
- `onComplete` with `writer.finished` is scheduled
- The result `Future` gets resolved and the JVM starts to shut down
- `writer.finished` starts to execute and, in the case of the cascading backend, adds a shutdown hook
- Adding a shutdown hook is not permitted during JVM shutdown, so it breaks

To fix this behaviour, I made the `onComplete` logic happen before the result future is resolved, by changing `onComplete` to `andThen`.
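
A minimal standalone illustration of the difference (not Scalding code): `andThen` ties the side effect into the returned future, while `onComplete` only schedules it.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object AndThenVsOnComplete extends App {
  val result: Future[Int] = Future(42)

  // onComplete merely schedules the callback: `result` can already be
  // observed as resolved (and main can return, starting JVM shutdown)
  // before the callback runs.
  result.onComplete(_ => println("cleanup via onComplete (may race shutdown)"))

  // andThen returns a new future that resolves only after the side effect
  // has run, so anyone waiting on `chained` sees the cleanup happen first.
  val chained: Future[Int] =
    result.andThen { case _ => println("cleanup via andThen") }

  Await.result(chained, Duration.Inf)
}
```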

twitter/20210128

Alternative implementation to DeprecatedParquetInputFormat with fix (#1937)

When combining N Parquet files, the first record of files 2 to N gets skipped, while the last record from the previous file is returned instead. This means some records are lost while others are duplicated, which is quite bad.

This was fixed a month ago in apache/parquet-java#844, but we would need to update the dependencies.

Should we do this approach or work towards updating deps?
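
To make the failure mode concrete, here is a hypothetical sketch (not the actual `DeprecatedParquetInputFormat` code) of how a reader that concatenates files can emit a stale record at each file boundary:

```scala
// Hypothetical reader interface standing in for Hadoop's RecordReader.
trait RecordReaderLike[T] {
  def nextKeyValue(): Boolean
  def getCurrentValue: T
}

// Sketch of a concatenating reader. The buggy variant returns `true` at a
// file boundary without refreshing `value`, so the last record of file i is
// emitted again and the first record of file i+1 is silently consumed.
final class ConcatReader[T](readers: Iterator[RecordReaderLike[T]]) {
  private var current: RecordReaderLike[T] = readers.next()
  private var value: Option[T] = None

  def getCurrentValue: Option[T] = value

  def nextKeyValue(): Boolean =
    if (current.nextKeyValue()) {
      value = Some(current.getCurrentValue)
      true
    } else if (readers.hasNext) {
      current = readers.next()
      nextKeyValue() // correct: advance into the new file before emitting
      // buggy variant: `current.nextKeyValue(); true` -- leaves the stale
      // `value` in place and skips the new file's first record
    } else {
      false
    }
}
```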

twitter/20210121

Remove mapreduce.input.fileinputformat.inputdir setting in memory source (#1936)

The memory source sets the mapreduce.input.fileinputformat.inputdir property to
a random UUID value. In clusters with HDFS federation, paths like that are often
not valid namespaces. While this path is usually not checked, since this is a
memory source, in clusters where Kerberos is enabled Hadoop lists a job's input
dirs to get delegation tokens. Since this path is not valid, this results in a
FileNotFoundException on a Kerberized cluster.

This patch removes the setting in Scalding memory sources, since the paths are not valid anyway.

Co-authored-by: Navin Viswanath <[email protected]>
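
A minimal sketch of the change being described, assuming Hadoop's standard `Configuration` API (the helper name is hypothetical):

```scala
import org.apache.hadoop.conf.Configuration

object MemorySourceConfSketch {
  // Hypothetical helper: instead of pointing the input dir at a bogus UUID
  // path (which a Kerberized cluster will try to list to get delegation
  // tokens), leave the property unset for in-memory sources.
  def configureMemorySource(conf: Configuration): Unit = {
    // Previously, something along the lines of:
    //   conf.set("mapreduce.input.fileinputformat.inputdir",
    //            java.util.UUID.randomUUID.toString)
    // Now: clear the property if present, and never set it.
    conf.unset("mapreduce.input.fileinputformat.inputdir")
  }
}
```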

twitter/20200929

[temporary] bring twitter internal changes to release

twitter/20200601

Add more guards on ReferencedClassFinder (#1931)

twitter/20200508

Add type ascriptions to serialization code (#1926)

Scalding macros expand into a large amount of code, most of which
contains few or no type ascriptions, leaving a lot of unnecessary
work to the compiler. By explicitly adding type ascriptions to the
generated code, we can reduce compilation times.
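
A hedged before/after illustration of the idea (hypothetical code, not the actual macro output):

```scala
object AscriptionSketch {
  // Stand-in for a macro-generated comparison in serialization code.
  def compareTuples(a: (Long, String), b: (Long, String)): Int =
    Ordering.Tuple2[Long, String].compare(a, b)

  // Without an ascription, the compiler must infer the type of every such
  // val node in the expanded tree:
  val cmp = compareTuples((1L, "a"), (2L, "b"))

  // With an explicit ascription, the inferencer is told the answer up front:
  val cmpAscribed: Int = compareTuples((1L, "a"), (2L, "b"))
}
```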

twitter/20190422

Add explicit type annotation to make code compatible with Scala 2.12