Remove `git_buf` as a public-facing structure #5534

ethomson · 2020-05-29T09:59:37Z

We often want to provide callers with a buffer that they can control the lifetime of (for example, several configuration functions will take a git_buf that they write the value into). This is a very useful pattern but git_buf is also a utility type that we use internally.

It's useful to separate these concerns, especially if we were to make a more general-purpose set of utility classes/functions. Intermingling git_buf as a public type and as a general utility makes it difficult to manage the memory. If git_buf_dispose is an exported symbol then it must live in the libgit2 library. That means that other assemblies can't allocate memory in a git_buf and then use git_buf_dispose, since on some systems (e.g., Windows) allocators are per-assembly.

Splitting git_buf into a true utility class with no exported symbols allows us to keep it internal.
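The allocator concern can be sketched as a rule: whichever module allocates a buffer also exposes the function that frees it. The block below is a minimal illustration under that assumption; `demo_outbuf`, `demo_lib_get_value` and `demo_lib_dispose` are hypothetical names, not libgit2 API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a library-owned output buffer. */
typedef struct {
	char *ptr;
	size_t size;
} demo_outbuf;

/* "Library side": allocates the result with the library's allocator. */
int demo_lib_get_value(demo_outbuf *out, const char *value)
{
	size_t len = strlen(value);
	char *copy = malloc(len + 1);

	if (!copy)
		return -1;

	memcpy(copy, value, len + 1); /* includes the trailing NUL */
	out->ptr = copy;
	out->size = len;
	return 0;
}

/* "Library side": frees with the same allocator that allocated. */
void demo_lib_dispose(demo_outbuf *out)
{
	free(out->ptr);
	out->ptr = NULL;
	out->size = 0;
}
```

A caller in another module never calls `free` on `ptr` directly; it only hands the buffer back to `demo_lib_dispose`, so a per-module allocator mismatch cannot occur.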

This adds a git_userbuf that is used for returning data to users. For API compatibility, we provide a typedef and macros that are enabled unless GIT_DEPRECATE_HARD is set. git_userbuf has the same definition as git_buf so that we can simply cast it and work with it using the git_buf functions internally.

This also adds a GIT_DEPRECATE_BUF option that will remove the deprecated git_buf definitions. This is useful for the library itself, which would not want the buffer deprecation layer, since it uses actual git_buf functions.

There are a few issues here that I did not address:

- There are a few functions that don't give users data in the form of a git_buf, but instead take data from them. We should evaluate these carefully, but it seems very unlikely to me that this is "correct". We should take data from users in a NUL-terminated const char * or a buffer and length pair and then make a copy.
- The filter functionality is wacky. I'd forgotten about this "we'll give you a git_buf and you give us a git_buf and it could be the same for efficiency's sake". That should have been deprecated the moment that we had filter streams, which are themselves complex but are not this cannon aimed at your foot.
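The first item (take input as a NUL-terminated string or a pointer and length pair, then copy) might look roughly like the sketch below. All `demo_` names are hypothetical and only illustrate the ownership rule; they are not functions from this PR.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical internal buffer; not libgit2's real definition. */
typedef struct {
	char *ptr;
	size_t size;
} demo_buf;

/* Copy caller-owned bytes into a library-owned, NUL-terminated buffer. */
int demo_buf_copy_in(demo_buf *dst, const char *data, size_t len)
{
	char *copy = malloc(len + 1);

	if (!copy)
		return -1;

	if (len)
		memcpy(copy, data, len);
	copy[len] = '\0';

	dst->ptr = copy;
	dst->size = len;
	return 0;
}

/* Convenience wrapper for NUL-terminated input. */
int demo_buf_copy_in_str(demo_buf *dst, const char *str)
{
	return demo_buf_copy_in(dst, str, strlen(str));
}

/* Release the library-owned copy. */
void demo_buf_free(demo_buf *buf)
{
	free(buf->ptr);
	buf->ptr = NULL;
	buf->size = 0;
}
```

Because the library works on its own copy, the caller can reuse or free its input immediately, and the library never has to guess who allocated the memory.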

If we can safely deprecate these (and I think that we can) then we can get git_userbuf to a place where it's strictly written by libgit2 and consumed and then freed by users, which would reduce some of the needless complexity. But that's a different pull request.

`git__getenv` belongs in a class instead of being a top-level function; move it into the `git_buf` class as `git_buf_getenv`.

Introduce a new user-facing buffer struct that is compatible with `git_buf`. This will allow us to keep our `git_buf` implementation private, to disentangle the notion of public and private types. But since it's compatible, it's trivially castable.

The `git_buf` type is now no longer a publicly available structure, and the `git_buf` family of functions are no longer exported. The deprecation layer adds a typedef for `git_buf` (as `git_userbuf`) and macros that define `git_buf` functions as `git_userbuf` functions. This provides API (but not ABI) compatibility with libgit2 1.0's buffer functionality. Within libgit2 itself, we take care to avoid including those deprecated typedefs and macros, since we want to continue using the `git_buf` type and functions unmodified. Therefore, a `GIT_DEPRECATE_BUF` guard now wraps the buffer deprecation layer. libgit2 will define that.
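A guarded compatibility layer of this kind could be sketched as follows. This is an illustration only: the `demo_`/`DEMO_` names are hypothetical stand-ins for the real identifiers, and the field order is an assumption based on the discussion in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* New user-facing type (hypothetical stand-in for git_userbuf). */
typedef struct {
	char *ptr;
	size_t asize;
	size_t size;
} demo_userbuf;

/* New user-facing function. */
void demo_userbuf_dispose(demo_userbuf *buf)
{
	free(buf->ptr);
	buf->ptr = NULL;
	buf->asize = 0;
	buf->size = 0;
}

/*
 * Compatibility layer: old names map onto the new ones. A build that
 * defines DEMO_DEPRECATE_BUF (as the library itself would) skips this
 * block and keeps the old names free for its internal implementation.
 */
#ifndef DEMO_DEPRECATE_BUF
typedef demo_userbuf demo_buf;
# define demo_buf_dispose(buf) demo_userbuf_dispose(buf)
#endif
```

Existing callers that still spell the old names keep compiling, which is the source-level (API) compatibility described above; because the old symbols are no longer exported, binaries built against the old library would still need a rebuild.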

pks-t · 2020-06-01T10:45:53Z

> If git_buf_dispose is an exported symbol then it must live in the libgit2 library. That means that other assemblies can't allocate memory in a git_buf and then use git_buf_dispose, since on some systems (e.g., Windows) allocators are per-assembly.

@ethomson I honestly don't quite get the concern. Isn't the same true for the new git_userbuf type now? If there's multiple versions of libgit2 linked into a program I'd expect bad things to happen anyway, so I'm not sure I'm a big fan of the added complexity of adding git_userbuf to fix a scenario that is broken and not completely fixed by this change. Probably I'm misunderstanding the underlying issue, though.

ethomson · 2020-06-01T12:34:06Z

> If there's multiple versions of libgit2 linked into a program I'd expect bad things to happen anyway

That's not the thing that I'm trying to solve for -- instead, every C program that's slightly more than trivial needs some of the same pieces of functionality: string handling, array handling, etc. For us, that's git_buf and git_vector. And if I wanted to build some other tool in C, I would reach for git_buf - and that's especially true if I were going to build something that's sort of part of libgit2, like a CLI.

But you can't simply copy buffer.c into your source directory. Assume for a minute that buffer.c had no other dependencies on any libgit2 internals (it does, in git__malloc, but that could be abstracted away). Even then, it won't work, because you can't call git_buf_dispose - at least, not if you link to libgit2.

That's because git_buf_dispose is a symbol that is published by libgit2. So... which one do you call? Yours, or libgit2's? For at least some systems, the answer is libgit2's, and for any system where you're using different allocators, your app is going to go boom.

You could, I suppose, s/git_buf/cli_buf/g and call it a day, but now you've got two implementations to maintain. Could you automate this? Mayyyybe? This is annoying since some of the declarations happen in include/git2/buf.h and are GIT_EXTERN and some are not.

As I mentioned, this is very much top of mind to the CLI. Because no matter how trivial this CLI is, it probably needs the moral equivalent of git_buf. But putting all that aside, I've always loathed that we make asize a public member of the struct. We shouldn't. How much we have allocated, and how, is a leaky abstraction that we shouldn't burden the user with. We only did that as a (bad) mechanism to signal whether we had allocated the memory or whether we were taking data from a user. In retrospect, this was sort of gross, and we should have had two different types.

This was the "needless complexity" that I mentioned in the PR description. I don't address this here but I think that this is a good first step towards that.

ethomson · 2020-06-02T08:43:23Z

To give you an idea of what I would like to do in the long-term, I would like to change any code that takes a git_buf from the user so that it doesn't anymore. These should all be const char * pointers. I don't know why they weren't, and in at least one of the two cases, it was my suggestion, and one that I regret in hindsight.

Then I would like to change git_buf to be:

typedef struct {
	char   *ptr;
	size_t size;
	size_t asize;
} git_buf;

(moving the asize to the end) and then git_userbuf to be:

typedef struct {
	char   *ptr;
	size_t size;
} git_userbuf;

Now we can cast our git_bufs to git_userbufs that we give to the user, and they are blissfully unaware of the asize.
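As a rough sketch of that idea, the two layouts can be checked at compile time before casting. The `demo_` names are hypothetical. Note also that this relies on the two structs sharing a layout prefix, which works on common compilers but is not something strictly conforming C guarantees for a plain pointer cast; embedding the user struct as the first member of the internal one would be the conservative alternative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Internal type: allocation size moved to the end (hypothetical name). */
typedef struct {
	char *ptr;
	size_t size;
	size_t asize;
} demo_buf;

/* User-facing type: only the prefix the user needs (hypothetical name). */
typedef struct {
	char *ptr;
	size_t size;
} demo_userbuf;

/* Fail the build if the shared members ever drift apart. */
_Static_assert(offsetof(demo_buf, ptr) == offsetof(demo_userbuf, ptr),
	"ptr offset differs");
_Static_assert(offsetof(demo_buf, size) == offsetof(demo_userbuf, size),
	"size offset differs");

/* Hand the internal buffer to the user as the narrower type. */
demo_userbuf *demo_buf_as_userbuf(demo_buf *buf)
{
	return (demo_userbuf *)buf;
}
```

The user sees `ptr` and `size` only; `asize` stays behind the cast.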

pks-t · 2020-06-15T13:00:54Z

> As I mentioned, this is very much top of mind to the CLI. Because no matter how trivial this CLI is, it probably needs the moral equivalent of git_buf.

Fair enough. I'm not sure this really is worth adding this complexity, though. I'm on the same page that having the CLI is a cool thing, but changing our own API to make implementing it easier seems kind of backward to me. While it's true that we should learn from the mistakes that we unearth by dogfooding our own interfaces, I'd still like to remain super cautious when deprecating interfaces.

> To give you an idea of what I would like to do in the long-term, I would like to change any code that takes a git_buf from the user so that it doesn't anymore. These should all be const char * pointers. I don't know why they weren't, and in at least one of the two cases, it was my suggestion, and one that I regret in hindsight.

Yup, that's definitely true and one thing I've thought about, too. The only concern I have is backwards-compatibility.

One thing we should keep in mind is that the CLI will most likely be used by others as a reference to implement their own applications that make use of similar structures. And if we start using internals of libgit2 (even if exposed via a separate, internal-only library), people might want to use the same non-external helpers, too. Which is why I really think there should be a hard separation between CLI and libgit2: the CLI will only ever use what's exposed by libgit2's official API and nothing more. It helps others in that they can re-use code, it helps us as we start dogfooding our own code and it doesn't introduce new internal libraries.

ethomson · 2020-06-15T13:29:04Z

> people might want to use the same non-external helpers, too.

Yes! I think this is a good thing. :)

I'm not at all trolling. If I'm coding a C application, I want to reach for git_bufs and git_vectors because they're familiar to me and well tested. Teasing the utility code out of libgit2 means that I can do exactly that.

This is akin to GNOME/glib coming from gimp. I suspect that any sufficiently large C application will create its own utility class and eventually export it. I'm not really proposing that we create a general purpose utility library and encourage a bunch of other people to use it. But I am proposing that we shard out our general purpose utility library so that they could.

> it doesn't introduce new internal libraries.

OK - serious question then - how do we do string manipulation in the CLI? I think that we can either:

1. Use an off-the-shelf utility library, like glib. I think that there's probably nothing wrong with glib, but it's a new dependency, and it recreates code that we already have in an unfamiliar way.
2. Roll a new set of array, string, etc, functions. This is a lot of new code, in C, for what seems like very little payback.
3. Use the utility library that we do have.

pks-t · 2020-06-15T15:06:43Z

> Yes! I think this is a good thing. :)

Haha, well, that comes unexpected.

> OK - serious question then - how do we do string manipulation in the CLI? I think that we can either:

It's definitely a good question and I think there's no one right answer.

> Use an off-the-shelf utility library, like glib. I think that there's probably nothing wrong with glib, but it's a new dependency, and it recreates code that we already have in an unfamiliar way.

Agreed.

> Roll a new set of array, string, etc, functions. This is a lot of new code, in C, for what seems like very little payback.

So ideally, the planned higher-level interfaces should already allow us to not do much string manipulation anyway. I guess I'm being naive though and that you're right, but I would've thought that most string handling would be to just print it to stdout/stderr. Using the printf family would be perfectly fine for this purpose.

> Use the utility library that we do have.

So with your framing of "Maybe we do want to expose these helpers" this doesn't sound too bad to me. As you say, we have them anyway and they've proven to be quite stable, so I could also see us directly exposing them via the normal libgit2 library. So instead of going the way of splitting up git_buf into our internal and the git_userbuf part, we could just give users a way to properly handle git_bufs by exposing its interface. We might consider putting these interfaces into a new separate namespace "git2/util" or something like that that doesn't get included by default, but if you ask me it should stay part of the main library in that case.

To me, this does feel like a small break with our existing "mission". I mostly took libgit2 as the core library that most nobody uses directly anyway because everybody uses bindings instead, and adding such low-level helpers to our interface doesn't help bindings at all. I'd be careful with what we expose, but doing this for git_buf and git_vector doesn't sound too bad to me. And implementing a CLI as part of the libgit2 project expands our "mission" anyway.

ethomson · 2020-06-23T09:02:45Z

> To me, this does feel like a small break with our existing "mission". I mostly took libgit2 as the core library that most nobody uses directly anyway because everybody uses bindings instead, and adding such low-level helpers to our interface doesn't help bindings at all. I'd be careful with what we expose, but doing this for git_buf and git_vector doesn't sound too bad to me. And implementing a CLI as part of the libgit2 project expands our "mission" anyway.

This is definitely not what I was proposing. 😄

I don't think that we should be exposing these publicly as part of libgit2. I don't want git_buf or git_vector to be a contract that we have to abide by in the library. More importantly, this plan doesn't actually help with the goal of using this utility code as utility code.

If I'm building a new tool in C and want to use these handy utility classes, I want to just pull them in to my tool and use them directly. I do not want to have to link to libgit2 to use them. If I'm building something that uses libgit2 that might be useful, but if I'm building a totally random tool, then linking to libgit2 to get string manipulation is a non-starter for me.

I have in the past taken parts of git_buf and git_vector (for example) in other tools, and it was painful and frustrating because things are so interdependent. #5507 attacked some of these problems, and gets things basically to a point where if I were building a totally random C application that is unrelated to libgit2, I could pull the util directory out and s/git_/my_prefix_/ on all the filenames and symbols, and go.

(In a perfect world, we might have a separate prefix for the utility code, gu_buf or something, so that nobody needs to sed through the codebase.)

Is this really in our mission? No, I suppose not, but

- encapsulation is good! Even if we weren't trying to produce a utility library, untangling separate concerns is good. 😁
- it's not really much work to enable it, and we can do it in a backwards compatible way, and
- it would actually help people - even if, selfishly, that person is me. 😁

There's obviously a lot of precedent here, thinking of how glib came from GIMP and libchrome came from chrome.

I wouldn't want to take this out of our tree or commit to an API but this is where I'm coming from, to give you some more insight into my thought process.

Now, having said all that, what about this issue? Even beyond all that ☝️, I think that it's useful to disentangle our git_bufs that we use internally from the things that we give to users of the library. This isn't a breaking change, and although it doesn't "finish the job", so to speak, of removing things that the user doesn't want or need (and that may even be actively harmful to expose), like the allocation size, it is a step in that direction.

I guess I don't understand yet what you don't like about this change? Is it the overall thinking (not giving users internal types) or is it the implementation? There are obviously many ways to go about the implementation, but I think we need to be aligned on the goals first.

pks-t · 2020-06-26T07:16:08Z

include/git2/userbuf.h

+ * @param target_size The desired available size
+ * @return 0 on success, -1 on allocation failure
+ */
+GIT_EXTERN(int) git_userbuf_grow(git_userbuf *buffer, size_t target_size);


So if we're going to introduce this type, to me it should really be limited to providing output to the user. As you said, separation of concerns makes sense, and we're forced to hand character arrays to the user in a wrapper as opposed to handing memory ownership over to the caller directly. Otherwise there might be mismatches in malloc/free implementations etc.

So the scope of git_userbuf should be getting information to the user so that he may access it and then properly dispose of it by calling git_userbuf_dispose. To me, git_userbuf_grow and git_userbuf_set don't fit into this scope and should not be provided.

What's missing though are git_userbuf_len and git_userbuf_ptr functions to access the structure, mostly because I'd prefer the structure to be opaque to the user. I don't think we'll ever be able to actually make it opaque without a major backwards compatibility breakage (which I definitely don't want to pursue), but at least allowing users to treat the structure as opaque would make sense to me. We should probably also provide git_userbuf_init, as the macro won't work in all situations.
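The accessor-style interface suggested here might look like the sketch below. These functions are hypothetical (the `demo_` prefix marks them as illustrations, not as part of the PR), and the struct layout is an assumption. The heap-allocated case shows one situation where an initializer macro cannot be used, since such a macro is only valid in a declaration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-facing buffer; the field order is an assumption. */
typedef struct {
	char *ptr;
	size_t asize;
	size_t size;
} demo_userbuf;

/* Initializer macro: only usable where the variable is declared. */
#define DEMO_USERBUF_INIT { NULL, 0, 0 }

/* Function form of the initializer, usable on any existing object. */
void demo_userbuf_init(demo_userbuf *buf)
{
	buf->ptr = NULL;
	buf->asize = 0;
	buf->size = 0;
}

/* Read-only accessors so callers can treat the struct as opaque. */
const char *demo_userbuf_ptr(const demo_userbuf *buf)
{
	return buf->ptr;
}

size_t demo_userbuf_len(const demo_userbuf *buf)
{
	return buf->size;
}

/* A heap-allocated buffer cannot use the macro, so it calls init. */
demo_userbuf *demo_userbuf_new(void)
{
	demo_userbuf *buf = malloc(sizeof(*buf));

	if (buf)
		demo_userbuf_init(buf);
	return buf;
}
```

Callers that stick to the accessors never touch `asize`, which leaves room to change the layout later without breaking them at the source level.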

pks-t · 2020-06-26T07:16:48Z

> I guess I don't understand yet what you don't like about this change? Is it the overall thinking (not giving users internal types) or is it the implementation? There are obviously many ways to go about the implementation, but I think we need to be aligned on the goals first.

It's really hard to nail down. I think what I don't like about it is that to me, it feels like code duplication with the intent to help others stop duplicating code if they're copying the code from us. I'm sure we're kind of talking past each other and that I'm just misunderstanding, but this motivation feels... weird to me. And it's definitely not all of the motivation you spell out anyway.

Anyway, I won't block this change. It's not like I'm 100% against it, I just still don't quite get the point. How about we just get a third opinion on this? Maybe that'd help us understand each other's motivations better, either if that third party has the same concerns as I do or if that person is able to bring across the point in a way that even I am able to understand it :P

I've also commented on the interface to make it easier to find some common ground and agree on the scope of git_userbuf.

carlosmn · 2020-12-02T11:02:45Z

I get the appeal of using the utility functions as we have done this at work. I don't know how much of an effort the library should make to make that possible. I do agree with the concern of git_buf exposing too many details and functions. I was surprised by some of the functions we export for it. An external user of the library should only be able to read the pointer+size or set it to pass in data.

I don't find it that bad that we accept a git_buf in order to accept a pointer+size for a buffer from the user. It does conflate things a bit, but if we do get rid of things like asize and git_userbuf_grow then it just ends up being a pointer+length tuple anyway, which seems more convenient for us anyway, and should be ~free for the caller to allocate on their stack. This is not quite what this PR addresses but it's in the OP.

We should be careful to balance the ease of extraction with the inconvenience and extra work for everyone else utterly uninterested in any of this, particularly now that we're post-1.0 and we've promised stability. Deprecating some of the other functions is still fine by me unless we can figure out why on earth we expose so much random stuff for the buffer.

As far as creating this new type... I wonder if it's possible to achieve this extractability (i.e. not mixing libgit2's functions and the extracted ones) differently. IIUC this is just an issue when we want to copy-paste git_buf and then also use libgit2, as then, most importantly I suppose, git_buf_dispose is going to cause all sorts of havoc when the C compiler randomly chooses which one to use.

It feels to me like the onus can be more on the extractor's side. After all they're already doing some work to copy and adjust their build system in order to save themselves the work of building an equivalent buffer/string handling library/utility functions. In some cases I suppose it might involve pointing at a libgit2 source dir they're using already. Maybe we can take a page out of klib's solution to potentially being included multiple times. As terrible as it is on many levels, not least reading and writing the code... maybe we should make the prefix configurable, so you define LIBGIT2_BUF_PREFIX=contoso_buf in your build system and poof, you get your own library as though you had done the copying and sed s/git_buf/contoso_buf/. This should also be something we would find useful for our extraction.
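A configurable prefix of this sort could be built with token pasting. The sketch below is only an illustration of the mechanism: `DEMO_BUF_PREFIX` and the helper macros are hypothetical (they stand in for the suggested LIBGIT2_BUF_PREFIX), and the buffer functions are minimal placeholders rather than the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The build system would normally pass -DDEMO_BUF_PREFIX=contoso_buf. */
#ifndef DEMO_BUF_PREFIX
# define DEMO_BUF_PREFIX contoso_buf
#endif

/* Two-step concatenation so DEMO_BUF_PREFIX is expanded before pasting. */
#define DEMO_CONCAT_(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_(a, b)
#define DEMO_BUF_FN(name) DEMO_CONCAT(DEMO_BUF_PREFIX, _##name)

/* The type takes the prefix as its name, e.g. contoso_buf. */
typedef struct {
	char *ptr;
	size_t size;
} DEMO_BUF_PREFIX;

/* Expands to e.g. contoso_buf_set: replace the contents with a copy. */
int DEMO_BUF_FN(set)(DEMO_BUF_PREFIX *buf, const char *str)
{
	size_t len = strlen(str);
	char *copy = malloc(len + 1);

	if (!copy)
		return -1;

	memcpy(copy, str, len + 1);
	free(buf->ptr);
	buf->ptr = copy;
	buf->size = len;
	return 0;
}

/* Expands to e.g. contoso_buf_dispose. */
void DEMO_BUF_FN(dispose)(DEMO_BUF_PREFIX *buf)
{
	free(buf->ptr);
	buf->ptr = NULL;
	buf->size = 0;
}
```

With the default prefix, callers write `contoso_buf_set` and `contoso_buf_dispose`, so the extracted copy no longer shares symbol names with the library it was copied from. The cost, as noted above, is that the source itself becomes harder to read and write.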

So maybe this is a sensible compromise to the work involved from the different parties (while leaving it open for us to restrict how much git_buf utility we expose to the user), or maybe it's the terrible option that makes it so the others seem better.

ethomson · 2020-12-02T11:12:46Z

Yeah, I think that you've identified the two problems I have with git_buf being exposed, and you've also correctly identified that they are orthogonal.

1. We expose too much (asize, grow, etc)
2. Can't re-use it easily

At this point I'm more concerned about the first point than the second, to be honest. This has rotted quite a bit, but I'll turn the crank here on this PR and iterate to separate the concerns. I'll make sure that we're focusing on that first point instead of the second.

ethomson added 25 commits May 28, 2020 12:04

getenv: move into the buffer class

d3ed31f

`git__getenv` belongs in a class instead of being a top-level function; move it into the `git_buf` class as `git_buf_getenv`.

diff: user-facing functions write to a userbuf

ce7e71b

config: user-facing functions write to a userbuf

a364d86

blob: user-facing functions write to a userbuf

6e74a4e

branch: remove unused git_branch_upstream__name decl

abd4e9d

branch: user-facing functions write to a userbuf

95634d1

commit: user-facing functions write to a userbuf

79dea82

describe: user-facing functions write to a userbuf

ca92646

settings: user-facing functions write to a userbuf

375c1ce

repository: user-facing functions write to a userbuf

f68ec19

worktree: user-facing functions write to a userbuf

ed529ca

tree: user-facing functions write to a userbuf

fd40feb

remote: user-facing functions write to a userbuf

4031103

message: user-facing functions write to a userbuf

56b37fb

notes: user-facing functions write to a userbuf

143f545

diff: user-facing functions write to a userbuf

8d0d318

object: user-facing functions write to a userbuf

52c57d3

refspec: user-facing functions write to a userbuf

067bb2a

packbuilder: user-facing functions write to a userbuf

e77bca2

submodule: user-facing functions write to a userbuf

93698e7

merge: user-facing functions write to a userbuf

c6da855

filter: user-facing functions write to a userbuf

5d01201

examples: use git_userbuf

05c7796

ethomson force-pushed the ethomson/userbuf branch from 9e200a3 to bab51e2 Compare May 29, 2020 10:34

pks-t reviewed Jun 26, 2020

Base automatically changed from master to main January 7, 2021 10:09

ethomson mentioned this pull request Sep 28, 2021

git_buf: now a public-only API (git_str is our internal API) #6078

Merged

ethomson closed this Sep 28, 2021

ethomson deleted the ethomson/userbuf branch February 13, 2022 16:08
