libctf deduplicator design

CTF spec

http://www.esperi.org.uk/~oranix/ctf/ctf-spec.pdf

IDs and parents

Types in CTF are stored in dictionaries. CTF types are identified by IDs (ctf_id_t), which are unique within a dictionary but not outside it. Dictionaries can be in a one-level parent/child relationship, where the parent contributes types that can be used by types in the child (and indeed in many children).

The parent is associated with the child after ctf_open(), via ctf_import(): while in theory this relationship is dynamic because nothing stops you picking a different dictionary as parent every time you ctf_import, in practice it is fixed because types in the child will refer to types in the parent, and nothing will work well if those types change! The cth_parname in the CTF dictionary header identifies the parent associated with a given dictionary (if any), but it's up to the client calling ctf_open() to decide how to interpret this, where to get the parent from etc. Usually a single shared parent contributes types to many children.

Output from GNU ld (and the ctf_link API) typically arranges to have parents and children in a single ctf_archive_t in the .ctf section (if there are no children, a raw ctf_file_t is used instead): the shared parent is in the archive member named ".ctf", while the children are associated with single input translation units and are stored in members named after the translation units from which they are derived.

The ctf_id_t which identifies types appears in CTF dictionaries' types sections when types refer to other types: e.g. when a pointer wants to declare the type it points to. Usually this is done via the ctf_type_t.ctt_type member. Type IDs are converted into indexes by calling CTF_*_TYPE_TO_INDEX (CTF-version-specific: libctf abstracts over this). In all existing file formats, you turn a type ID found in a child dictionary into an index by masking off the high bit: i.e. parent IDs start at 0, child IDs start at 0x80000000, partitioning the space into two equal halves.
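
As a rough illustration of the ID split described above, the following C sketch tests and masks the high bit of a type ID. The names sketch_id_is_child, sketch_id_to_index and sketch_index_to_child_id are invented for this page and are not libctf API: real code should go through the format-version-specific CTF_*_TYPE_TO_INDEX macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant for this sketch: child IDs start here.  */
#define SKETCH_CHILD_BIT 0x80000000u

/* A type ID with the high bit set belongs to the child dictionary.  */
bool
sketch_id_is_child (uint32_t id)
{
  return (id & SKETCH_CHILD_BIT) != 0;
}

/* Mask off the high bit to get the index into the type section.  */
uint32_t
sketch_id_to_index (uint32_t id)
{
  return id & ~SKETCH_CHILD_BIT;
}

/* The reverse direction for a child: the index must fit in 31 bits.  */
uint32_t
sketch_index_to_child_id (uint32_t idx)
{
  assert (idx < SKETCH_CHILD_BIT);
  return idx | SKETCH_CHILD_BIT;
}
```

Because the split depends only on the high bit, a parent never needs to know whether any child will later be attached to it.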

This means that parent dictionaries need not know they are parents when they are constructed: you can associate a child with a parent long after the parent is built, if you want to. (But all known CTF deduplicators build both the parent and at least some children in one go). The indexes are array indexes into the type section: each entry is variable-length and a mapping from index to offset is constructed at ctf_open() time (via ctf_file_t.ctf_txlate).

What are child dictionaries used for? Most links (including all done by GNU ld) are in link mode CTF_LINK_SHARE_UNCONFLICTED. This mode shares all types it possibly can: even if a type appears in only one translation unit, it is shared if nothing stops it from being shared, on the grounds that this is most useful to debugger users. Types are shared by being moved into the parent dictionary: any lookup from the scope of the parent or any child will find them there.

But not all types are shareable like that. Imagine this case:

a.c:

  int wombat;
  struct foo
  {
    int bar;
  };
  struct quux
  {
    struct foo *bar;
  };
  struct bar;

b.c:

  long wombat;
  struct foo
  {
    int baz;
  };
  struct quux
  {
    struct foo *bar;
  };
  struct bar
  {
    int baz;
  };

In the case above, 'struct bar' is shareable, because there is only one unambiguous definition of 'struct bar' (if there were other identical definitions, they would be considered "the same" as well): a definition is unambiguous if it is a named type that is identical across TUs and all types it directly or indirectly relates to are also identical (right now we determine this via SHA-1 hashing).

But 'struct foo' is not shareable, because the two structs foo have members with different names: and neither is the variable 'wombat', because it's int in one container and long in the other. You cannot add both of these to one dictionary, so wombat and foo are considered conflicting, and will not usually be placed in a shared parent dict.

Further, because parents are associated with many children, types in the parent dictionary cannot cite types in any child: from our perspective, this means that unconflicting types cannot cite conflicting types. However, we can avoid this for things that cite structures that are marked conflicting: in the example above, we can consider 'struct quux' unconflicting by pointing it at a shared opaque forward to 'struct foo'.

The deduplicator makes extensive use of hash values stored in an atoms table (derived from the hash of a type and all types it recursively references) to tell whether types are identical, and global type IDs (a squashing of an offset in an array of inputs and a ctf_id_t) to compactly look up types in CTF input dicts. Global type IDs are 64 bits (32 bits for a ctf_id_t and a bunch more bits for an offset into an array of input dictionaries). So we can encode them in a pointer on 64-bit systems: on 32-bit, we stuff them in a hash table and pass around a pointer to the hash value. As elsewhere in CTF, we have an extended form of name called a "decorated name" which puts "s ", "u " or "e " on the front of names, to distinguish names in different C namespaces without having to carry around more than a simple string.
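
A minimal sketch of name decoration, assuming that names outside the struct, union and enum namespaces are left undecorated. The enum sketch_ns and the function sketch_decorate_name are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical namespace tags for this sketch only.  */
enum sketch_ns
{
  SKETCH_NS_NONE,
  SKETCH_NS_STRUCT,
  SKETCH_NS_UNION,
  SKETCH_NS_ENUM
};

/* Write the decorated form of NAME into BUF: "s ", "u " or "e " is
   prepended for struct, union and enum tags, and other names are
   copied unchanged.  Returns the length the decorated name needs,
   exactly as snprintf does.  */
int
sketch_decorate_name (char *buf, size_t len, enum sketch_ns ns,
                      const char *name)
{
  static const char *const prefix[] = { "", "s ", "u ", "e " };

  assert (buf != NULL && name != NULL);
  assert ((unsigned) ns <= SKETCH_NS_ENUM);
  return snprintf (buf, len, "%s%s", prefix[ns], name);
}
```

With this scheme 'struct foo' and a typedef named 'foo' become the distinct strings "s foo" and "foo", so a single string-keyed table can hold both.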

Algorithm

Type deduplication is a three-phase process:

1. Come up with unambiguous hash values for all types: no two types may have the same hash value, and any given type should have only one hash value (for optimal deduplication).
2. Mark those distinct types with names that collide (and thus cannot be declared simultaneously in the same translation unit) as conflicting, and recursively mark all types that cite one of those types as conflicting as well. Possibly mark all types cited in only one TU as conflicting, if the CTF_LINK_SHARE_DUPLICATED link mode is active.
3. Emit all the types, one hash value at a time. Types not marked conflicting are emitted once, into the shared dictionary: types marked conflicting are emitted once per TU into a dictionary corresponding to each TU in which they appear. Structs marked conflicting get at the very least a forward emitted into the shared dict so that other dicts can cite it if needed.

This all works over an array of inputs, and works fine if one of the inputs is a parent of the others: we don't use the ctf_link_inputs hash directly because it is convenient to be able to address specific input types as a global type ID or 'GID', a pair of an array offset and a ctf_id_t. Since both are already 32 bits or less or can easily be constrained to that range, we can pack them both into a single 64-bit hash word for easy lookups, which would be much more annoying to do with a ctf_file_t * and a ctf_id_t. (On 32-bit platforms, we must do that anyway, since pointers, and thus hash keys and values, are only 32 bits wide.)
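
The packing of a GID can be sketched as follows. The layout chosen here (input array offset in the high 32 bits, ctf_id_t in the low 32 bits) is an assumption made for illustration, and the sketch_gid_* names are hypothetical rather than libctf functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack an offset into the array of inputs and a 32-bit type ID into
   one 64-bit word.  The choice of which half holds which value is an
   assumption of this sketch.  */
uint64_t
sketch_gid_pack (size_t input_num, uint32_t type)
{
  /* The offset must be constrained to 32 bits for the packing to work.  */
  assert ((uint64_t) input_num <= UINT32_MAX);
  return ((uint64_t) input_num << 32) | type;
}

/* Recover the input array offset from a GID.  */
uint32_t
sketch_gid_input (uint64_t gid)
{
  return (uint32_t) (gid >> 32);
}

/* Recover the type ID from a GID.  */
uint32_t
sketch_gid_type (uint64_t gid)
{
  return (uint32_t) (gid & 0xffffffffu);
}
```

A single integer of this sort can serve directly as a hash key or value on 64-bit platforms, which is what makes it more convenient than a (ctf_file_t *, ctf_id_t) pair.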

There are a few subtleties here that make this more complex than it seems.

Hashing

Hashing proceeds recursively, mixing in the properties of each input type (including its name, if any), and then adding the hash values of every type cited by that type. The result is stashed in cd_type_hashes so other phases can find the hash values of input types given their IDs, and so that if we encounter this type again while hashing we can just return its hash value: it is also stashed in the output mapping, a mapping from hash value to the set of GIDs corresponding to that type in all inputs. We also keep track of the GID of the first appearance of the type in any input (in cd_output_first_tu), and the GID of structs, unions, and forwards that only appear in one TU (in cd_struct_origin). See below for where these things are used.

We have to do something about potential cycles in the type graph. We'd like to avoid emitting forwards in the final output if possible, because forwards aren't much use: they have no members. We are mostly saved from needing to worry about this at emission time by ctf_add_struct*() automatically replacing newly-created forwards when the real struct/union comes along. So we only have to avoid getting stuck in cycles during the hashing phase, while also not confusing types that cite members that are structs with each other. It is easiest to solve this problem by noting two things:

- all cycles in C depend on the presence of tagged structs/unions
- all tagged structs/unions have a unique name they can be disambiguated by

This means that we can break all cycles by ceasing to hash in cited types at every tagged struct/union and instead hashing in a stub consisting of the struct/union's decorated name, which is the name preceded by "s " or "u " depending on the namespace. Forwards are decorated identically (so a forward to "struct foo" would be represented as "s foo"): this means that a citation of a forward to a type and a citation of a concrete definition of a type with the same name ends up getting the same hash value.
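
The cycle-breaking rule can be sketched over a toy type graph. Everything here is an assumption for illustration: struct sketch_type is a made-up stand-in for real CTF types, only structs are modelled (so the stub prefix is always "s "), and FNV-1a is used in place of the SHA-1 hashing mentioned earlier, purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_kind { SK_INT, SK_POINTER, SK_STRUCT, SK_FORWARD };

struct sketch_type
{
  enum sketch_kind kind;
  const char *name;                    /* NULL if unnamed.  */
  const struct sketch_type *cited[4];  /* Cited types, NULL-terminated.  */
};

#define SKETCH_FNV_BASIS 0xcbf29ce484222325ULL
#define SKETCH_FNV_PRIME 0x100000001b3ULL

/* FNV-1a over a string: a stand-in for SHA-1, not what libctf uses.  */
static uint64_t
sketch_mix (uint64_t h, const char *s)
{
  for (; *s != '\0'; s++)
    h = (h ^ (unsigned char) *s) * SKETCH_FNV_PRIME;
  return h;
}

uint64_t
sketch_hash_type (const struct sketch_type *t)
{
  static const char *const kind_name[] = { "int", "ptr", "struct", "fwd" };
  uint64_t h = SKETCH_FNV_BASIS;

  assert (t != NULL);
  h = sketch_mix (h, kind_name[t->kind]);
  h = sketch_mix (h, t->name != NULL ? t->name : "");

  for (size_t i = 0; i < 4 && t->cited[i] != NULL; i++)
    {
      const struct sketch_type *c = t->cited[i];
      uint64_t sub;

      if ((c->kind == SK_STRUCT || c->kind == SK_FORWARD) && c->name != NULL)
        /* Tagged struct or forward: hash in the stub "s <name>" and stop
           recursing, which is what breaks every cycle.  */
        sub = sketch_mix (sketch_mix (SKETCH_FNV_BASIS, "s "), c->name);
      else
        sub = sketch_hash_type (c);

      /* Fold the cited type's hash into our own.  */
      h = (h ^ sub) * SKETCH_FNV_PRIME;
    }
  return h;
}
```

Hashing a self-referential 'struct list' terminates because the pointer member hashes the stub "s list" rather than descending into the struct again, and a pointer to a forward 'struct list' hashes identically to a pointer to the full definition.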

Of course, it is quite possible to have two TUs with structs with the same name and different definitions, but that's OK because when we scan for types with ambiguous names we will identify these and mark them conflicting.

We populate one thing to help conflictedness marking. No unconflicted type may cite a conflicted one, but this means that conflictedness marking must walk from types to the types that cite them, which is the opposite of the usual order. We can make this easier to do by constructing a citers graph in cd_citers, which points from types to the types that cite them: because we emit forwards corresponding to every conflicted struct/union, we don't need to do this for citations of structs/unions by other types. This is very convenient for us, because that's the only type we don't traverse recursively: so we can construct the citers graph at the same time as we hash, rather than needing to add an extra pass. (This graph is a dynhash of type hash values, so it's small: in effect it is automatically deduplicated.)

Collisional marking

We identify types whose names collide during the hashing process, and count the rough number of uses of each name (caching may throw it off a bit: this doesn't need to be accurate). We then mark the less-frequently-cited types with each name as conflicting: the most-frequently-cited one goes into the shared type dictionary, while all others are duplicated into per-TU dictionaries, named after the input TU, that have the shared dictionary as a parent. For structures and unions this is not quite good enough: we'd like to have citations of forwards to ambiguously named structures and unions stay as citations of forwards, so that the user can tell that the caller didn't actually know which structure definition was meant: but if we put one of those structures into the shared dictionary, it would supplant and replace the forward, leaving no sign. So structures and unions do not take part in this popularity contest: if their names are ambiguous, they are just duplicated, and only a forward appears in the shared dict.
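
The popularity contest might be sketched like this. struct sketch_candidate and sketch_mark_name_collisions are hypothetical names, and the tie-breaking rule (the first candidate with the highest count wins) is an assumption of the sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One distinct type (hash value) that uses a given name.  */
struct sketch_candidate
{
  unsigned long uses;   /* Rough citation count for this type.  */
  bool conflicted;
};

/* Given all N distinct types sharing one name, mark every type except
   the most-used one as conflicted.  Structs and unions do not take
   part in the contest: if ambiguous, all of them are conflicted.  */
void
sketch_mark_name_collisions (struct sketch_candidate *c, size_t n,
                             bool is_struct_or_union)
{
  size_t winner = 0;

  assert (c != NULL);
  if (n < 2)
    return;                     /* Only one definition: no collision.  */

  for (size_t i = 1; i < n; i++)
    if (c[i].uses > c[winner].uses)
      winner = i;

  for (size_t i = 0; i < n; i++)
    c[i].conflicted = is_struct_or_union || i != winner;
}
```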

The process of marking types conflicted is itself recursive: we recursively traverse the cd_citers graph populated in the hashing pass above and mark everything that we encounter conflicted (without wasting time re-marking anything that is already marked). This naturally terminates just where we want it to (at types that are cited by no other types, and at structures and unions) and suffices to ensure that types that cite conflicted types are always marked conflicted.
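
A minimal sketch of this recursive marking, using a small adjacency matrix in place of the cd_citers dynhash. struct sketch_graph and sketch_mark_conflicted are illustrative names only, and types are identified by small integers rather than hash values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NTYPES 5

struct sketch_graph
{
  /* citers[i][j] is true if type j cites type i.  Citations of
     structs and unions are deliberately never recorded here.  */
  bool citers[SKETCH_NTYPES][SKETCH_NTYPES];
  bool conflicted[SKETCH_NTYPES];
};

/* Mark TYPE conflicted, then everything that (transitively) cites it.  */
void
sketch_mark_conflicted (struct sketch_graph *g, size_t type)
{
  assert (g != NULL && type < SKETCH_NTYPES);

  if (g->conflicted[type])
    return;                     /* Already marked: do not re-mark.  */
  g->conflicted[type] = true;

  for (size_t citer = 0; citer < SKETCH_NTYPES; citer++)
    if (g->citers[type][citer])
      sketch_mark_conflicted (g, citer);
}
```

If type 0 is a conflicting typedef, type 1 a pointer to it, type 2 a struct with a member of type 1, and type 3 a pointer to that struct, then marking type 0 reaches types 1 and 2 but not type 3, because the citation of the struct by the pointer was never recorded in the graph.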

When linking in CTF_LINK_SHARE_DUPLICATED mode, we would like all types that are used in only one TU to end up in a per-TU dict. The easiest way to do that is to mark them conflicted. ctf_dedup_conflictify_unshared does this, traversing the output mapping and using ctf_dedup_multiple_input_dicts to check the number of input dicts each distinct type hash value came from: types that only came from one get marked conflicted. One caveat here is that we need to consider both structs and forwards to them: a struct that appears in one TU and has a dozen citations to an opaque forward in other TUs should not be considered to be used in only one TU, because users would find it useful to be able to traverse into opaque structures of that sort: so we use cd_struct_origin to check both structs/unions and the forwards corresponding to them.
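
The core of that check can be sketched as below. This is not the real ctf_dedup_conflictify_unshared: the sketch_* names are hypothetical, the GID layout (input number in the high 32 bits) is an assumption, and the struct/forward caveat handled via cd_struct_origin is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry in a toy output mapping: a hash value's set of input GIDs.  */
struct sketch_output
{
  const uint64_t *gids;   /* Input GIDs that hashed to this value.  */
  size_t ngids;
  bool conflicted;
};

/* True if the GIDs come from more than one input dictionary.  */
bool
sketch_multiple_inputs (const struct sketch_output *out)
{
  assert (out != NULL && out->gids != NULL && out->ngids > 0);

  for (size_t i = 1; i < out->ngids; i++)
    if ((out->gids[i] >> 32) != (out->gids[0] >> 32))
      return true;
  return false;
}

/* Mark every type that came from only one input as conflicted.  */
void
sketch_conflictify_unshared (struct sketch_output *outs, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (!sketch_multiple_inputs (&outs[i]))
      outs[i].conflicted = true;
}
```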

Emission

Emission involves another walk of the entire output mapping, this time traversing everything other than struct members, in recursive order. Types are emitted from leaves to trunk, emitting all types a type cites before emitting the type itself. We sort the output mapping before traversing it, for reproducibility and also correctness: the input dicts may have parent/child relationships, so we simply sort all types that first appear in parents before all children, then sort types that first appear in dicts appearing earlier on the linker command line before those that appear later. (This is where we use cd_output_first_tu, collected above.)
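
The ordering rule can be sketched as a qsort comparator. struct sketch_first_tu is a hypothetical stand-in for whatever is derived from cd_output_first_tu, and any further tie-breaking a reproducible sort would need (for example on type ID) is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Where a type first appears among the inputs.  */
struct sketch_first_tu
{
  bool in_parent;     /* First appearance is in a parent dict.  */
  size_t input_num;   /* Position of that input on the command line.  */
};

/* Order parents before children, then earlier inputs before later.  */
int
sketch_cmp_first_tu (const void *a_, const void *b_)
{
  const struct sketch_first_tu *a = a_;
  const struct sketch_first_tu *b = b_;

  assert (a != NULL && b != NULL);
  if (a->in_parent != b->in_parent)
    return a->in_parent ? -1 : 1;
  if (a->input_num != b->input_num)
    return a->input_num < b->input_num ? -1 : 1;
  return 0;
}

/* Sort N entries in place using the comparator above.  */
void
sketch_sort_outputs (struct sketch_first_tu *v, size_t n)
{
  qsort (v, n, sizeof *v, sketch_cmp_first_tu);
}
```

Sorting parents first guarantees that by the time a type first seen in a child is emitted, anything it cites from the parent has already been emitted.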

The walking is done using a recursive traverser which arranges to not revisit any type already visited and to call its callback once per input GID for input GIDs corresponding to conflicted output types. The traverser only finds input types and calls a callback for them: it doesn't try to figure out anything about where the output might go. That's done by the callback.

The callback in question is the sole callback for ctf_dedup_walk_output_mapping. Conflicted types have all necessary dictionaries created, and then we emit the type into each dictionary in turn, working over each input CTF type corresponding to each hash value and using ctf_dedup_id_to_target to map each input ctf_id_t into the corresponding type in the output (dealing with input ctf_id_t's with parents in the process by simply chasing to the parent dict if the type we're looking up is in there). Emitting structures involves simply noting that the members of this structure need emission later on: because you cannot cite a single structure member from another type, we avoid emitting the members at this stage to keep recursion depths down a bit.

At this point, if we have by some mischance decided that two different types with child types that hash to different values have in fact got the same hash value themselves and not marked it conflicting, the type walk will walk only one of them and in all likelihood we'll find that we are trying to emit a type into some child dictionary that references a type that was never emitted into that dictionary and assertion-fail. This always indicates a bug in the conflictedness marking machinery or the hashing code, or both.

ctf_dedup_id_to_target does one extra thing, alluded to above: if this is a conflicted tagged structure or union, and the target is the shared dict (i.e., the type we're being asked to emit is not itself conflicted so can't just point straight at the conflicted type), we instead synthesise a forward with the same name, emit it into the shared dict, record it in cd_output_emission_conflicted_forwards so that we don't re-emit it, and return it. This means that cycles that contain conflicts do not cause the entire cycle to be replicated in every child: only that piece of the cycle which takes you back as far as the closest tagged struct/union needs to be replicated. This trick means that no part of the deduplicator needs a cycle detector.

The final stage of emission is to walk over all structures with members that need emission and emit all of them. Every type has been emitted at this stage, so emission cannot fail.

Finally, we update the input -> output type ID mappings used by the ctf-link machinery to update all the other sections. This is surprisingly expensive and may be replaced with a scheme which lets the ctf-link machinery extract the needed info directly from the deduplicator.
