problems with _foreach #1384

cholin · 2013-03-02T15:42:38Z

Hey,

About a year ago or more, there were some design changes in the way iteration is handled in libgit2. There are now two ways of getting all the entries of a container such as reflogs, trees or references. For a reflog you do it in the old for-loop style: you find out the size of the container and then access each entry by index. For notes, in contrast, you have to provide a callback function; libgit2 iterates through the container and calls the callback at each step. Some examples follow:

*_entrycount / *_entry_byindex

git_reflog_entry_byindex / git_reflog_entrycount
git_tree_entry_byindex / git_tree_entrycount

*_foreach

git_reference_foreach
git_note_foreach
git_diff_foreach
...

In pygit2 the latter style of iteration is getting us into trouble. As some may know, Python, unlike Ruby for example, does not have blocks. For iterating you use for-loops or list comprehensions, so you generally do not pass anonymous functions (callbacks) for each iteration step. It is therefore really hard to provide an efficient binding for these new _foreach functions. We have to iterate through the whole list, build a PyList object out of it and hand it back to Python. So if a user only wants the first two note entries, this is a huge disadvantage (it is not lazy). For _entrycount/_entry_byindex, however, we can build a generator object, because there we only have to provide a next() method. This is not possible for the foreach iterators because we cannot jump out of our callback back to our caller. For an implementation example, have a look at RefLogIter_iternext() in src/reflog.c and Diff_changes__get__ in src/diff.c.
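To make the difference concrete, here is a minimal pure-Python sketch. It does not call pygit2 or libgit2: `foreach_notes`, `note_count` and `note_by_index` are hypothetical stand-ins for a callback-style and an index-style C API over the same data, and `visited` only exists to count how many entries the "library" touched.

```python
from itertools import islice

# Hypothetical stand-in data; in a real binding this would live inside libgit2.
_NOTES = ["note-a", "note-b", "note-c", "note-d"]
visited = []  # records which entries the "library" actually touched


def foreach_notes(callback):
    """Callback style: the library drives the loop and pushes each entry."""
    for note in _NOTES:
        visited.append(note)
        callback(note)


def note_count():
    return len(_NOTES)


def note_by_index(i):
    visited.append(_NOTES[i])
    return _NOTES[i]


def notes_eager():
    """Binding over the callback API: has to collect everything first."""
    result = []
    foreach_notes(result.append)
    return result


def notes_lazy():
    """Binding over the index API: a generator that fetches on demand."""
    for i in range(note_count()):
        yield note_by_index(i)


# The caller only wants the first two entries in both cases.
first_two_eager = notes_eager()[:2]
eager_visits = len(visited)   # every entry was visited

visited.clear()
first_two_lazy = list(islice(notes_lazy(), 2))
lazy_visits = len(visited)    # only the requested entries were visited
```

Both bindings return the same two entries, but the callback-based one walks the whole container before the caller sees anything.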

So the resulting question is: is there any way to use these foreach iterators in the old-fashioned way? (Even with setjmp.h it does not seem possible, not that I would want to abuse it for this...) Or would it be possible to provide _entrycount/_entry_byindex functions alongside the _foreach functions?

I think several bindings could run into this issue (unless they are callback-based or have yield at the binding layer).

Greetings
Nico


arrbee · 2013-03-02T16:40:14Z

Check out the git_diff_patch APIs for an indexed version of diff iteration. It was co-designed with @brianmario for binding in Rugged, so it should be easier for you to bind as well.

Regarding the other cases, usually we fall back to foreach style iterators if externalizing state as an index value is too expensive or complicated. In the case of refs for example, you may be iterating through compressed refs or directory entries on disk where the state cannot easily be presented as just a number.

That's not to say it can't be fixed. For example, you could create a ref iterator object that encapsulated the in-progress state and allowed a non-callback based interface (although the recent work to allow plug in ref stores will probably make that much harder /cc @ethomson ). PRs are always welcome!

cholin · 2013-03-02T16:53:40Z

Well, the iterator does not have to be index-based; a next method would fit as well. The problem is that you can't "pause" the foreach iterator (no context switch is possible in the other direction). I'll look into git_diff_patch, but I think the main issue with foreach is a different one. Maybe I can find some time to dive a little deeper into the libgit2 source code, but it will occur again for other topics.

carlosmn · 2013-03-02T17:06:58Z

Thinking about the refs... both the hash implementation and the filesystem support iterating through them, so even though we don't know up front how many refs there are on the filesystem, we could create an iterator, as both khash and the syscalls let us implement _next().

Presumably any sensible database implementation you'd use as a backend would let you have a similar iterator.
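As a rough illustration of that idea, the sketch below models a pull-style iterator that walks two sources behind a single next() call. It is plain Python, not libgit2 code: `RefIterator`, `packed` and `loose` are hypothetical names, a dict stands in for the hash table and a list for the names found on disk. Skipping names that were already returned is an assumption of this sketch, not something stated in the thread.

```python
class RefIterator:
    """Hypothetical pull-style iterator over two ref sources."""

    def __init__(self, packed, loose):
        # Each source only needs to support stepping to its next element.
        self._sources = [iter(loose), iter(packed)]
        self._seen = set()

    def next(self):
        """Return the next unseen ref name, or None when exhausted."""
        while self._sources:
            # Resumes where the current source left off on the previous call.
            for name in self._sources[0]:
                if name not in self._seen:
                    self._seen.add(name)
                    return name
            self._sources.pop(0)  # current source is drained, move on
        return None


refs = RefIterator(
    packed={"refs/heads/master": "aaa", "refs/tags/v1": "bbb"},
    loose=["refs/heads/master", "refs/heads/topic"],
)
first = refs.next()
second = refs.next()
third = refs.next()
fourth = refs.next()  # None: both sources are exhausted
```

The caller never learns the total count; all in-progress state lives inside the iterator object.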

carlosmn · 2013-03-02T17:32:36Z

These iterations would have to be supported by each backend, but that shouldn't be that much of an issue. I keep being confused about yield in C#, @nulltoken would C# also benefit from such an iteration?

nulltoken · 2013-03-02T19:33:29Z

> @nulltoken would C# also benefit from such an iteration?

Sir. Yes, Sir!

With _foreach, libgit2 pushes the data toward the caller.
With next(), the caller pulls the data.

Implementing yield with next() would indeed be pretty trivial in LibGit2Sharp.

I can think of only one potential issue related to the next() based implementation. Because of a command line operation (or because of another action being executed on a different thread), the content of the workdir and/or the index may change between two calls to next().

The current foreach-based implementation of course suffers from the same risk, but with a much narrower window of exposure, as each entry is stuffed into a private List<> on each foreach callback. In this case, the consumer only enumerates a cached version of what was in the workdir/index.
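The trade-off between a cached list and a live cursor can be shown with a small pure-Python sketch. The `entries` list is a hypothetical stand-in for the workdir/index contents; a real backend may react differently to concurrent changes (for example by invalidating the iterator), so this only illustrates the window of exposure.

```python
# Hypothetical stand-in for a workdir/index listing that another
# thread or a command-line operation may change at any time.
entries = ["a.txt", "b.txt"]

# foreach-style binding: copy everything into a private list up front.
snapshot = list(entries)

# next()-style binding: keep a live cursor into the underlying data.
live = iter(entries)
first_live = next(live)

# Something modifies the underlying data in the middle of the iteration.
entries.append("c.txt")

rest_live = list(live)         # the live cursor observes the new entry
rest_snapshot = snapshot[1:]   # the cached copy does not
```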

arrbee · 2013-03-02T20:45:07Z

> I'll look into git_diff_patch but I think the main issue with foreach is another.

I may not have communicated clearly that I understand your complaint about the _foreach pattern, but I think if you look at it, the git_diff_patch related APIs will allow you to access diff contents by index. For reference:

Call git_diff_index_to_workdir(), or whichever API you want, to make a git_diff_list.
You can now access individual deltas by index, from 0 to git_diff_num_deltas() - 1.
Call git_diff_get_patch() with the diff list and the index value to get the git_diff_delta structure and, optionally, a git_diff_patch object that represents the text diff.
You can also access the hunks and lines in the patch by index using git_diff_patch_get_hunk() and git_diff_patch_get_line_in_hunk().

Iterating over the diff can be index based because we have definitive knowledge of the number of changed files (and the number of hunks and lines in the text diff).
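The shape of such a binding can be sketched as nested index loops wrapped in a generator. The classes below are toy stand-ins written for this example only; `ToyDiff` and `ToyPatch` and their methods are hypothetical and do not reproduce the signatures of the libgit2 functions named above.

```python
class ToyPatch:
    """Hypothetical stand-in for a text patch: a list of (header, lines)."""

    def __init__(self, hunks):
        self._hunks = hunks

    def num_hunks(self):
        return len(self._hunks)

    def hunk_header(self, h):
        return self._hunks[h][0]

    def num_lines_in_hunk(self, h):
        return len(self._hunks[h][1])

    def line_in_hunk(self, h, n):
        return self._hunks[h][1][n]


class ToyDiff:
    """Hypothetical stand-in for a diff list: a list of (path, ToyPatch)."""

    def __init__(self, deltas):
        self._deltas = deltas

    def num_deltas(self):
        return len(self._deltas)

    def get_patch(self, i):
        return self._deltas[i]


def iter_lines(diff):
    """Lazily yield (path, hunk header, line) using only counts and indexes."""
    for d in range(diff.num_deltas()):
        path, patch = diff.get_patch(d)
        for h in range(patch.num_hunks()):
            header = patch.hunk_header(h)
            for n in range(patch.num_lines_in_hunk(h)):
                yield path, header, patch.line_in_hunk(h, n)


diff = ToyDiff([
    ("a.c", ToyPatch([("@@ -1 +1 @@", ["-old", "+new"])])),
    ("b.c", ToyPatch([("@@ -3 +3,2 @@", [" ctx", "+added"])])),
])
lines = iter_lines(diff)
first = next(lines)          # nothing beyond the first line is fetched yet
remaining = list(lines)
```

Because every level exposes a count and an accessor, the generator can stop at any point without the library having to suspend a callback.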

Regarding the many other uses of _foreach, as you can see, it appears that @carlosmn has an interest in building an iterator object for you for refs. The other cases are of varying levels of difficulty. Let's take a quick look...

git_attr_foreach() is relatively easy to encapsulate as an iterator because the internal design already has a function (collect_attr_files()) that assembles most of the iteration data into a structure. I'm guessing that one could make a git_attr_iterator along with related functions in fewer than 100 lines of code. That might be a good project for someone who wanted to get started contributing to libgit2. This is a case where the number of items is not known in advance, so an iterator object with an advance() function is needed.
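From the binding side, an iterator object with an advance() function is easy to consume. The sketch below is pure Python under stated assumptions: `AttrIterator` and `attributes` are hypothetical names, not existing libgit2 or pygit2 types, and the list passed to the constructor merely stands in for whatever state the library would gather up front.

```python
class AttrIterator:
    """Hypothetical iterator object; not an existing libgit2 type."""

    def __init__(self, pairs):
        # Stand-in for iteration state assembled once, before iterating.
        self._pairs = list(pairs)
        self._pos = 0

    def advance(self):
        """Return the next (name, value) pair, or None when finished."""
        if self._pos >= len(self._pairs):
            return None
        item = self._pairs[self._pos]
        self._pos += 1
        return item


def attributes(pairs):
    """Expose the advance() object through Python's iterator protocol."""
    it = AttrIterator(pairs)
    # Two-argument iter(): call it.advance() until it returns None.
    return iter(it.advance, None)


attrs = attributes([("text", "auto"), ("eol", "lf")])
first = next(attrs)
remaining = list(attrs)
```

The caller never needs the number of items in advance; the end of the sequence is signalled by the sentinel value.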

Making a git_status_iterator is somewhat more complicated to implement, just because of the increased complexity of the inner loop of the iterator and the indirection between the top level function and the invocation of the callback, but it is definitely a manageable task with few unknowns. Although it is based on diff data internally, I would probably recommend an iterator object and advance() because direct indexing is complicated here.

Looking over the notes code, I think it could be improved by creating an iterator object. Currently it allocates on each invocation of the callback function, whereas the natural iterator implementation would, I think, reuse the data buffer for each successive output. From what I can see, it would not be too hard to implement.

Anyhow... That's just a few items. Sounds like some great small-ish projects in there.

BTW, I think most of the usage of the _foreach style comes from new contributors copying the most recently written examples and then that propagates through the library. Probably half of the active contributors have started participating in just the last 12 months. When I initially wrote the attributes code, for example, there were no examples in the public API of an iterator object, and since I couldn't do indexed access, I just copied the foreach pattern.

cholin · 2013-03-07T13:26:59Z

As the note iterator is already merged and carlos is working on a reference iterator, I think we can close this issue.

carlosmn mentioned this issue Mar 2, 2013

Introduce a refs iterator #1385

Merged

arrbee mentioned this issue Mar 2, 2013

Update contributing and conventions #1386

Merged

This was referenced Mar 3, 2013

generator implementation issue with libgit2 libgit2/pygit2#183

Closed

[RFC] basic note iterator implementation #1396

Merged

cholin closed this as completed Mar 7, 2013

cholin mentioned this issue Mar 12, 2013

[RFC] new git_config enumeration API #1409

Closed
