Conversation

@ianoc (Collaborator) commented Feb 21, 2014

...oding the underlying transport

@johnynek, @jcoveney This is somewhat here as an RFC; going to do more tests with this new branch/build.

Part of this is just a clean-up of Timestamp bleeding further into places it isn't needed, so I could output from a final flatMap without a timestamp. We basically partition the keyspace into a set of buckets before passing off to Storm.
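Roughly, the idea is a stable hash of each key into a small bucket space, with the bucket becoming the grouping key Storm sees. A minimal sketch in Scala (the names here are illustrative, not the actual PR code):

// Illustrative sketch of pre-partitioning the keyspace before handing data to Storm.
// numBuckets would be some multiple of the number of downstream consumers.
def bucketOf[K](key: K, numBuckets: Int): Int =
  ((key.hashCode % numBuckets) + numBuckets) % numBuckets // stable and non-negative

// Group a flatMap/cache output by bucket so each bucket travels as one Map.
def partition[K, V](pairs: List[(K, V)], numBuckets: Int): Map[Int, Map[K, V]] =
  pairs.groupBy { case (k, _) => bucketOf(k, numBuckets) }
    .map { case (bucket, kvs) => bucket -> kvs.toMap }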

@johnynek (Collaborator)

I think we should tackle these two items separately:

  1. Can we remove the time and just batch right when we come in? Maybe, but at the same time, let's keep in mind why we would do that: I guess performance. Otherwise, keeping the timestamps could be useful for debugging (what is the most recent timestamp seen?)

  2. What about doing some batching so that 1 value does not correspond to one storm tuple. This seems like a chance for a huge win, especially at the output of large fan outs from flatMaps or from cache dumps.

Breaking this into two issues (the second being the highest priority) seems like a good way to go.

@ianoc (Collaborator, Author) commented Feb 21, 2014

--> By "when we come in", you mean into the FFM bolt?

  1. seems like a potentially large win, a further generalization of this code.
    -> The issue/trick here might be how we build the original batch.
    -> How does left join work? (single future/multiGet; a slow tuple would affect a whole batch, which might be totally fine)
    This effectively means we are batched when we are keyed to Storm, but there is of course a potential win in being batched during all operations. It might require customizing a spout to test properly... but that's easy too.

Collaborator

this should no longer be needed. This was added to Algebird as the default for tuples.

…l-partition

Conflicts:
	summingbird-storm/src/main/scala/com/twitter/summingbird/storm/StormPlatform.scala
…l-partition

Conflicts:
	summingbird-storm/src/main/scala/com/twitter/summingbird/storm/StormPlatform.scala
@johnynek (Collaborator)

going to wait to review this until #455 is in so we can look at a clean diff.

Collaborator

why is this better than the default Semigroup?

…eature/manual-partition

Conflicts:
	summingbird-online/src/main/scala/com/twitter/summingbird/online/executor/FinalFlatMap.scala
	summingbird-online/src/main/scala/com/twitter/summingbird/online/executor/Summer.scala
	summingbird-storm/src/main/scala/com/twitter/summingbird/storm/StormPlatform.scala
Collaborator

this is still unclear to me.

So, from the above, it seems we are using this number as the max number of keys in a cache on the flatmappers, and then relying on the fact that Maps can be summed so they can be emitted as a single batch.

Can you explain more the whole algorithm here?

I would prefer if we could just get the effect of batching here. For instance, the cache emits blocks of results; those could be written as a List and sent over (perhaps chunked with a max size).
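A minimal sketch of that chunking idea (the helper name and max size are made up for illustration):

// Illustrative only: break a cache dump into bounded List chunks so one Storm
// tuple can carry many key/value pairs instead of exactly one.
def chunk[K, V](cacheDump: List[(K, V)], maxSize: Int): List[List[(K, V)]] =
  cacheDump.grouped(maxSize).toList

// e.g. chunk(dump, 1000).foreach(emitAsOneTuple)  // emitAsOneTuple is hypothetical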

Collaborator Author

So we can't just emit the cache as a list of keys; we need an outer key grouping. Since we are at a key-grouped level, we need a stable mapping of K -> OuterK. Here we are configuring how big a space we should map the original K into, a multiple of the number of available consumers. I don't think generic batching would really solve this.
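To connect this with the "Maps can be summed" point above: once keys are stably mapped into outer buckets, each bucket's values arrive as Maps, and the downstream summer can merge them with the value Semigroup. A sketch, assuming Algebird's Semigroup (the function name is made up):

import com.twitter.algebird.Semigroup

// Illustrative: merge the Map-valued batches that arrive for one outer key,
// combining values for the same inner key with the value Semigroup.
def sumBatches[K, V](batches: Seq[Map[K, V]])(implicit sg: Semigroup[V]): Map[K, V] =
  batches.foldLeft(Map.empty[K, V]) { (acc, m) =>
    m.foldLeft(acc) { case (a, (k, v)) =>
      a.updated(k, a.get(k).map(sg.plus(_, v)).getOrElse(v))
    }
  }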

Collaborator

shouldn't this go in AllOpts and get a comment? I really don't know what this does.

Collaborator Author

It's not there because it's not a user-set variable; I'm not really sure where it should go. I didn't want to pass around a raw integer. It contains the size of the space into which the key's hash is modded (calculated in the Storm platform).

Collaborator

Should it be private in that case?

Collaborator Author

Yep, will change

Collaborator

Why wrap S here rather than leave it opaque? It seems the previous PR did the opposite: it moved from (Timestamp, T) to T, and here we are taking what (I think) is an opaque parameter S and adding structure we don't use. Why not just instantiate with S = InputState[S1] for some S1?

Collaborator Author

I now use the InputState portion further down in the Summer class for fan-out. I could constrain Summer's type parameter to S <: InputState[_]?
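For reference, the bound under discussion would look roughly like this (shapes simplified; InputState is stubbed so the sketch stands alone):

// Stub standing in for summingbird-online's InputState, just to keep the sketch self-contained.
class InputState[T](val state: T)

// Simplified Summer shape: the point is the S <: InputState[_] bound, which lets
// Summer use InputState's bookkeeping while staying agnostic about the inner type.
class Summer[K, V, S <: InputState[_]] {
  def handle(state: S, batch: Map[K, V]): Unit = () // real implementation elided
}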

Collaborator

Ahh... here is where you use the concrete nature of InputState. This is new, right?

Collaborator Author

Yep exactly. S <: InputState[_] compiles nicely though, so we could use it?

Collaborator

BTW: This seems super easy to get wrong. fanOut(size) seems more intuitive.

Collaborator

Also, state.fanOut can throw, but that will raise an exception rather than return a Future.exception. What about:

try {
  // code you have above here
} catch {
  case t: Throwable => Future.exception(t)
}
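Spelled out a bit more as a self-contained sketch (com.twitter.util.Future assumed; asFuture is a made-up helper name):

import com.twitter.util.Future

// Wrap any synchronous call that may throw (e.g. state.fanOut(size)) so the
// failure surfaces as a Future.exception rather than a raw exception.
def asFuture[A](call: => A): Future[A] =
  try {
    Future.value(call)
  } catch {
    case t: Throwable => Future.exception(t)
  }

// Usage: asFuture(state.fanOut(size)) gives a Future that fails cleanly on a throw.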

@ianoc (Collaborator, Author) commented Feb 27, 2014

I've done all the comments except the S <: InputState[_] change so far, which I have locally. If you'd rather, I'll push it in too. I kind of like it, as it makes it obvious we don't care what the inner type is.

Collaborator

why is 1 not enough here? Can you add a comment.

Collaborator Author

This really came from Aaron; I think the notion was to guard against the keyspace getting lopsided in some manner. I can't really see why it shouldn't be 1 if everything is well behaved.

…the output of increasing its size.

Wrap the fan out code in a try catch for a future exception

Collaborator

size > 0 or we would not have received anything, right? Can we put:

assert(size > 0, "Input maps must be non-empty")

Otherwise, we are not going to do the acking correctly, right?

johnynek added a commit that referenced this pull request Feb 27, 2014
Manually block up sections of output from caches into lists to avoid flooding the underlying transport
johnynek merged commit 1e9ceb9 into develop Feb 27, 2014
johnynek deleted the feature/manual-partition branch February 27, 2014 04:05
Collaborator

.nonEmpty
