problems with _foreach #1384

Closed
cholin opened this issue Mar 2, 2013 · 7 comments

@cholin

cholin commented Mar 2, 2013

Hey,

About a year ago or more, there were some design changes in the way iteration is handled in libgit2. There are now two ways of getting all the entries of a container such as reflogs, trees or references. For reflogs you do it in the old for-loop style: find out the size of the container and then access each entry by index. For notes, in contrast, you have to provide a callback function; libgit2 iterates through the container and calls the callback at each step. Some examples follow:

*_entrycount / *_entry_byindex

  • git_reflog_entry_byindex / git_reflog_entrycount
  • git_tree_entry_byindex / git_tree_entrycount

*_foreach

  • git_reference_foreach
  • git_note_foreach
  • git_diff_foreach
  • ...

In pygit2 the latter style of iteration is getting us into trouble. As some may know, Python, unlike Ruby for example, does not have blocks. For iterating you use for-loops or list comprehensions, so you generally do not pass anonymous functions (callbacks) for each iteration step. Therefore it is really hard to provide an efficient binding for these new _foreach functions. We have to iterate through the whole list, build a PyList object out of it and return it to Python. If a user only wants the first two note entries, this is a huge disadvantage (it is not lazy). For _entrycount/_entry_byindex, however, we can build a generator object, because there we only have to provide a next() method. This is not possible for the foreach iterators, because we cannot jump out of our callback back to our caller. For an implementation example, have a look at RefLogIter_iternext() in src/reflog.c and Diff_changes__get__ in src/diff.c.
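
To make the difference concrete, here is a minimal Python sketch, not pygit2 code. `FakeContainer` and its methods `entrycount`, `entry_byindex` and `foreach` are hypothetical stand-ins for the two libgit2 API styles, and the `reads` counter exists only to show how many entries each approach touches.

```python
from itertools import islice


class FakeContainer:
    """Hypothetical stand-in for a libgit2 container (reflog, notes, ...)."""

    def __init__(self, entries):
        self._entries = list(entries)
        self.reads = 0  # counts how many entries were actually fetched

    # Index style, modelled on *_entrycount / *_entry_byindex.
    def entrycount(self):
        return len(self._entries)

    def entry_byindex(self, index):
        self.reads += 1
        return self._entries[index]

    # Callback style, modelled on *_foreach: the container drives the loop.
    def foreach(self, callback):
        for entry in self._entries:
            self.reads += 1
            callback(entry)


def iter_by_index(container):
    """Lazy: each entry is fetched only when the consumer asks for it."""
    for index in range(container.entrycount()):
        yield container.entry_byindex(index)


def list_from_foreach(container):
    """Eager: the callback cannot suspend the loop, so collect everything."""
    collected = []
    container.foreach(collected.append)
    return collected


indexed = FakeContainer(range(1000))
first_two_lazy = list(islice(iter_by_index(indexed), 2))

pushed = FakeContainer(range(1000))
first_two_eager = list_from_foreach(pushed)[:2]
```

Both variants return the same two entries, but in this sketch the index-based generator fetches two entries while the callback version visits all 1000 before Python sees any of them.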

So the resulting question is: is there any way to use these foreach iterators in the old-fashioned way? (Even with setjmp.h it does not seem possible, not that I would want to abuse it for this...) Or would it be possible to provide _entrycount/_entry_byindex functions alongside the _foreach functions?

I think several bindings could run into this issue (unless they are callback-based or have yield at the binding layer).

Greetings
Nico

@arrbee
Member

arrbee commented Mar 2, 2013

Check out the git_diff_patch APIs for an indexed version of diff iteration that will hopefully be more easily bound. It was co-designed with @brianmario for binding in Rugged, so hopefully it will be easier to use for you.

Regarding the other cases, usually we fall back to foreach style iterators if externalizing state as an index value is too expensive or complicated. In the case of refs for example, you may be iterating through compressed refs or directory entries on disk where the state cannot easily be presented as just a number.

That's not to say it can't be fixed. For example, you could create a ref iterator object that encapsulated the in-progress state and allowed a non-callback based interface (although the recent work to allow plug in ref stores will probably make that much harder /cc @ethomson ). PRs are always welcome!

@cholin
Author

cholin commented Mar 2, 2013

Well, the iterator does not have to be index-based; a next method would fit as well. The problem is that you can't "pause" the foreach iterator (no context switch is possible in the other direction). I'll look into git_diff_patch, but I think the main issue with foreach is a different one. Maybe I can find some time to dive a little deeper into the libgit2 source code, but the problem will occur again for other topics.

@carlosmn
Member

carlosmn commented Mar 2, 2013

Thinking about the refs... both the hash implementation and the filesystem support iterating through them, so even though we don't quite know the amount we have on the filesystem, we could create an iterator, as both khash and the syscalls let us implement _next().

Presumably any sensible database implementation you'd use as a backend would let you have a similar iterator.
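
A rough Python sketch of that idea, not libgit2 code: an iterator object keeps the in-progress state of two sources, an in-memory table and a directory walk, and exposes only a next step. The class `RefIterator`, the `packed` dict and the one-file-per-ref directory layout are assumptions made for illustration. The total number of entries is never needed.

```python
import os
import tempfile


class RefIterator:
    """Hypothetical pull-style iterator over two kinds of reference storage."""

    def __init__(self, packed, loose_dir):
        # The saved state is two ordinary iterators, not a single index.
        self._packed = iter(sorted(packed.items()))
        self._loose = os.scandir(loose_dir)

    def __iter__(self):
        return self

    def __next__(self):
        # Drain the in-memory table first.
        item = next(self._packed, None)
        if item is not None:
            return item
        # Then read one directory entry per call.
        for entry in self._loose:
            if entry.is_file():
                with open(entry.path) as handle:
                    return entry.name, handle.read().strip()
        self._loose.close()
        raise StopIteration


with tempfile.TemporaryDirectory() as loose_dir:
    for name, target in (("feature", "bbbb"), ("main", "aaaa")):
        with open(os.path.join(loose_dir, name), "w") as handle:
            handle.write(target + "\n")

    refs = RefIterator({"v1.0": "cccc"}, loose_dir)
    first = next(refs)        # served from the in-memory table
    remaining = sorted(refs)  # directory order is unspecified, so sort
```

The consumer can stop after `first` without the directory ever being read past its current position, which is what the callback style cannot offer.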

@carlosmn
Member

carlosmn commented Mar 2, 2013

These iterations would have to be supported by each backend, but that shouldn't be that much of an issue. I keep being confused about yield in C#, @nulltoken would C# also benefit from such an iteration?

@nulltoken
Member

@nulltoken would C# also benefit from such an iteration?

Sir. Yes, Sir!

  • With _foreach, libgit2 pushes the data toward the caller
  • With next(), the caller pulls the data.

Implementing yield with next() would indeed be pretty trivial in LibGit2Sharp.
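
As an illustration of how thin such a wrapper can be, here is a sketch in Python rather than C#. `FakeNativeIterator`, its `next()` method and the `ITEROVER` sentinel are hypothetical stand-ins for a C-style pull API, not existing libgit2 or LibGit2Sharp names.

```python
ITEROVER = object()  # hypothetical "no more items" sentinel


class FakeNativeIterator:
    """Hypothetical stand-in for a C-style iterator with a next() call."""

    def __init__(self, items):
        self._items = list(items)
        self._position = 0

    def next(self):
        if self._position == len(self._items):
            return ITEROVER
        item = self._items[self._position]
        self._position += 1
        return item


def enumerate_refs(native):
    """Pull one item per step and hand it to the consumer with yield."""
    while True:
        item = native.next()
        if item is ITEROVER:
            return
        yield item


native = FakeNativeIterator(["refs/heads/main", "refs/tags/v1.0"])
refs = enumerate_refs(native)
first = next(refs)
```

After `first` has been read, the native iterator has advanced by exactly one position; nothing else is touched until the consumer asks again.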

I can think of only one potential issue related to the next() based implementation. Because of a command line operation (or because of another action being executed on a different thread), the content of the workdir and/or the index may change between two calls to next().

The current foreach-based implementation of course suffers from the same risk, but with a much narrower window of exposure, as each entry is stuffed into a private List<> on each foreach callback. In this case, the consumer only enumerates a cached version of what was in the workdir/index.
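
The trade-off can be illustrated with a small Python sketch. `workdir` here is just a list standing in for state that another process or thread may change, and both helper functions are hypothetical.

```python
def live_entries(source):
    """Pull style: reads `source` again on every step."""
    index = 0
    while index < len(source):
        yield source[index]
        index += 1


def snapshot_entries(source):
    """Foreach-then-cache style: copy everything first, then enumerate."""
    cached = list(source)  # the only window in which changes can interfere
    return iter(cached)


workdir = ["a.txt", "b.txt", "c.txt"]
live = live_entries(workdir)
snapshot = snapshot_entries(workdir)

seen_live = [next(live)]
seen_snapshot = [next(snapshot)]

# Simulate an external change between two calls to next().
workdir.remove("b.txt")
workdir.append("d.txt")

seen_live.extend(live)
seen_snapshot.extend(snapshot)
```

In this sketch the live iterator never reports `b.txt` and picks up `d.txt`, while the snapshot still reports the three original names.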

@arrbee
Member

arrbee commented Mar 2, 2013

I'll look into git_diff_patch, but I think the main issue with foreach is a different one.

I may not have communicated clearly that I understand your complaint about the _foreach pattern, but I think if you look at it, the git_diff_patch related APIs will allow you to access diff contents by index. For reference:

  1. Call the git_diff_index_to_workdir() or whatever API you want to make a git_diff_list.
  2. You can now access individual deltas by index from 0 to git_diff_num_deltas()
  3. Call git_diff_get_patch() with the diff list and the index value to get the git_diff_delta structure and optionally a git_diff_patch object that represents the text diff
  4. You also access the hunks and lines in the patch by index using git_diff_patch_get_hunk() and git_diff_patch_get_line_in_hunk()

Iterating over the diff can be index based because we have definitive knowledge of the number of changed files (and the number of hunks and lines in the text diff).
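
The four steps map onto plain nested loops. The sketch below is Python over hypothetical stand-in classes (`FakeDiffList`, `FakePatch`), not libgit2 or pygit2 calls. Their method names only mirror the functions listed above, and the counting methods on the patch are assumed for illustration.

```python
class FakePatch:
    """Hypothetical stand-in for a git_diff_patch: hunks made of lines."""

    def __init__(self, hunks):
        self._hunks = hunks

    def num_hunks(self):
        return len(self._hunks)

    def num_lines_in_hunk(self, hunk_index):
        return len(self._hunks[hunk_index])

    def get_line_in_hunk(self, hunk_index, line_index):
        return self._hunks[hunk_index][line_index]


class FakeDiffList:
    """Hypothetical stand-in for a git_diff_list: one patch per delta."""

    def __init__(self, patches):
        self._patches = patches  # list of (path, FakePatch) pairs

    def num_deltas(self):
        return len(self._patches)

    def get_patch(self, delta_index):
        return self._patches[delta_index]


def iter_lines(diff):
    """Lazily yield (path, hunk index, line) using only index lookups."""
    for delta_index in range(diff.num_deltas()):
        path, patch = diff.get_patch(delta_index)
        for hunk_index in range(patch.num_hunks()):
            for line_index in range(patch.num_lines_in_hunk(hunk_index)):
                yield path, hunk_index, patch.get_line_in_hunk(hunk_index, line_index)


diff = FakeDiffList([
    ("README", FakePatch([["-old", "+new"]])),
    ("main.c", FakePatch([["+int x;"], ["-return 1;", "+return 0;"]])),
])
lines = iter_lines(diff)
first_line = next(lines)
rest = list(lines)
```

Because every level is addressed by index, the generator can be suspended after any line and resumed later, which is what makes this style straightforward to bind.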

Regarding the many other uses of _foreach, as you can see it appears that @carlosmn has an interest in building an iterator object for you for refs. Many of the other cases are of varying level of difficulty. Let's take a quick look...

git_attr_foreach() is relatively easy to encapsulate as an iterator because the internal design already has a function (collect_attr_files()) that assembles most of the iteration data into a structure. I'm guessing that one could make a git_attr_iterator along with related functions in fewer than 100 lines of code. That might be a good project for someone who wanted to get started contributing to libgit2. This is a case where the number of items is not known in advance, so an iterator object with an advance() function is needed.

Making a git_status_iterator is somewhat more complicated to implement, just because of the increased complexity of the inner loop of the iterator and the indirection between the top level function and the invocation of the callback, but it is definitely a manageable task with few unknowns. Although it is based on diff data internally, I would probably recommend an iterator object and advance() because direct indexing is complicated here.

Looking over the notes code, I think it could be improved by creating an iterator object. Currently it allocates on each invocation of the callback function, whereas the natural iterator implementation would reuse the data buffer for each successive output, I think. From what I can see, it would not be too hard to implement.

Anyhow... That's just a few items. Sounds like some great small-ish projects in there.

BTW, I think most of the usage of the _foreach style comes from new contributors copying the most recently written examples and then that propagates through the library. Probably half of the active contributors have started participating in just the last 12 months. When I initially wrote the attributes code, for example, there were no examples in the public API of an iterator object, and since I couldn't do indexed access, I just copied the foreach pattern.

@cholin
Author

cholin commented Mar 7, 2013

As the note iterator is already merged and carlos is working on a reference iterator, I think we can close this issue.
