@Peter0x44
Contributor

Adds support for displaying the target machine type of PE files with a new -m command line option. I tested this using llvm-mingw for its various target triples:

armv7-w64-mingw32-clang:
MACHINE TYPE
ARMNT (0x01c4)

aarch64-w64-mingw32-clang:
MACHINE TYPE
ARM64 (0xaa64)

i686-w64-mingw32-clang:
MACHINE TYPE
i386 (0x014c)

x86_64-w64-mingw32-clang:
MACHINE TYPE
x64 (0x8664)

@Peter0x44
Contributor Author

Turns out the mingw-w64 headers are actually missing some entries. I'll use the list from Microsoft's documentation:
https://learn.microsoft.com/en-us/windows/win32/debug/pe-format#machine-types
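
For illustration, a minimal sketch of such a lookup in C. The function name `machine_name` and the choice of short names (the `IMAGE_FILE_MACHINE_*` constants with the prefix dropped) are assumptions made for this example, not the code in the PR, and only a subset of the documented values is listed.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: map the COFF Machine field to a short name.
// The values are a subset of the machine types in Microsoft's PE
// format documentation. Returns NULL for values not in this table.
const char *machine_name(uint16_t machine)
{
    switch (machine) {
    case 0x0000: return "UNKNOWN";
    case 0x014c: return "I386";
    case 0x01c0: return "ARM";
    case 0x01c4: return "ARMNT";
    case 0x0200: return "IA64";
    case 0x5032: return "RISCV32";
    case 0x5064: return "RISCV64";
    case 0x8664: return "AMD64";
    case 0xaa64: return "ARM64";
    default:     return NULL;
    }
}
```

Returning NULL for unlisted values leaves the caller free to decide how an unrecognized type is displayed.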

@Peter0x44 Peter0x44 force-pushed the peports_arch_detection branch from d37afc9 to 9b69799 Compare August 14, 2025 17:57
@skeeto
Owner

skeeto commented Aug 14, 2025 via email

@avih

avih commented Aug 14, 2025

  1. Should -m behave like -i and -e? That is, disable other output not explicitly requested?

I think yes. I.e. -m would print the machine type and exit.

"\machine AMD64 0x8664"?

I think this is overengineering. The parser of -m will not be the same parser of other outputs from peports. If you want a unified content format, then just use JSON, but I think this would make the parsing more involved than necessary, and less useful with traditional POSIX utility pipelines (mainly referring to the imports/exports lists).

We have one raw value with -m, so this should be the sole output IMO. Users of this feature can interpret these values according to current or future type maps.

If we want it more human readable, then maybe -M could provide that, with some title, and mapping of raw value to readable name where we can.

If we want both under one option, then I think a form of 0xHHHH (MTYPE) parses more easily by scripts: the raw value, which is the only value guaranteed to identify the type (unlike UNKNOWN, which can print for different values), comes first and can be extracted easily by removing everything starting at the first space.

In general, I don't think there's much value in printing both imports and exports together, and it makes the output harder to make unambiguous. It can be convenient, but personally I like explicit and simple.

I think it's best to default to -i (imports). It's also possible to default to exports for dll and imports for exe, but I don't like the guesswork involved, and it can have tricky edge cases.
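
The "raw value first" argument can be sketched in C: a consumer reads the hex digits and stops at the first space, so it never has to look at the name. The function name `parse_machine_line` is hypothetical, and the line format is assumed to be `0xHHHH` followed by an optional space and name.

```c
#include <ctype.h>
#include <stdint.h>

// Parse the leading hex value of a line such as "0xAA64 ARM64" or
// "0xAA64 (ARM64)". Everything from the first space on is ignored.
// Returns 0 and stores the value in *out, or -1 if the line does not
// begin with a hex number of one to four digits.
int parse_machine_line(const char *line, uint16_t *out)
{
    if (line[0] != '0' || line[1] != 'x') {
        return -1;
    }
    unsigned value = 0;
    int ndigits = 0;
    const char *p = line + 2;
    for (; isxdigit((unsigned char)*p); p++) {
        int c = tolower((unsigned char)*p);
        int digit = c >= 'a' ? c - 'a' + 10 : c - '0';
        value = value << 4 | (unsigned)digit;
        if (++ndigits > 4) {
            return -1;  // too large for the 16-bit Machine field
        }
    }
    if (ndigits == 0) {
        return -1;
    }
    if (*p != ' ' && *p != '\n' && *p != '\0') {
        return -1;  // something other than a separator follows
    }
    *out = (uint16_t)value;
    return 0;
}
```

The same function accepts the form with and without parentheses, because the name is never inspected.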

@Peter0x44
Contributor Author

Peter0x44 commented Aug 14, 2025

  1. Should -m behave like -i and -e? That is, disable other output not
    explicitly requested?

Yeah, agreed, I think it should do this.

I am leaning towards >machine. No major preference either way.

  1. There's no "fat PE" (right?) and so an image can only have a single
    machine type. For consistency it's formatted like a list though it can
    only have one item. Perhaps it should have a different syntax entirely?
    "\machine AMD64 0x8664"?

Yes, an image can only have a single machine type.

The Mac approach of bundling multiple executables in one file and selecting one by architecture isn't used on Windows. The closest is the aforementioned ARM64EC.
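
The single machine type follows from the file layout: the COFF header has exactly one Machine field. A minimal sketch of reading it from an in-memory image, assuming the documented layout (the `MZ` stub, the PE header offset at 0x3c, the `PE\0\0` signature, then a little-endian 16-bit Machine field). The function name `pe_machine` is hypothetical and this is not how peports itself is structured.

```c
#include <stddef.h>
#include <stdint.h>

// Read the Machine field of a PE image held in memory.
// Returns the 16-bit value, or -1 if the buffer is not a plausible PE.
long pe_machine(const unsigned char *buf, size_t len)
{
    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z') {
        return -1;
    }
    // Offset of the PE signature, stored little-endian at 0x3c.
    uint32_t off = (uint32_t)buf[0x3c]       |
                   (uint32_t)buf[0x3d] <<  8 |
                   (uint32_t)buf[0x3e] << 16 |
                   (uint32_t)buf[0x3f] << 24;
    if (off > len || len - off < 6) {
        return -1;  // signature plus Machine field would not fit
    }
    if (buf[off] != 'P' || buf[off+1] != 'E' ||
        buf[off+2] != 0 || buf[off+3] != 0) {
        return -1;
    }
    return (long)buf[off+4] | (long)buf[off+5] << 8;
}
```

Every offset is checked against the buffer length before use, which matters when the input may be a malformed or hostile image.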

@Peter0x44 Peter0x44 force-pushed the peports_arch_detection branch from 9b69799 to f92d448 Compare August 14, 2025 22:23
@Peter0x44
Contributor Author

One more clarification: Do you have a preference if -m works by default (with no arguments), or would you prefer if it's only enabled if requested?

@Peter0x44
Contributor Author

Peter0x44 commented Aug 14, 2025

Hmm. Is there any reason the number of exports is counted, but the number of imports isn't?
I've given it some deeper thought, and here is what feels "good" to me:

How about we allow any combination of -mei and separate them like this:
The default output would be of -mei, but passing any of those individually disables any that aren't explicitly passed

>machine
    0xAA64 (ARM64)
>exports
    whatever.dll
        1 whatever_function
>imports
    msvcrt.dll
        1 printf
        2 malloc
    somelib.dll
        1 somelib_func

exports would be blank if there aren't any (probably the case for a majority of exes).

Or is this becoming too complicated or similar to gendef?

@avih

avih commented Aug 15, 2025

An idea about options usage:

  1. Each of -i/-e/-m do just one thing and output a simple form which is easy to process [in a pipeline] with traditional tools (the current -i/-e forms are good). The output formats can be documented at the -h page (e.g. using printf format string) so that parsing it doesn't require guesswork.
  2. If -i/-e/-m are mutually exclusive, then peports should probably error if more than one is provided. If they're not mutually exclusive, then the outputs should be in the order of the given options, and can be separated by an empty line, and maybe error out if the same option is used more than once.
  3. Without options, it prints some human-readable summary, which is not intended for machine-parsing, and is relatively small so that it's easy to grasp at a glance. For instance the arch, the import modules (without symbol names, or just with the number of symbols per module), and the number of exported symbols.
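
The non-exclusive variant in point 2 (outputs in option order, error on repeats) could look roughly like this in C. It is only a sketch: `parse_outputs` is a hypothetical name, and it takes the clustered letters of a single argument such as `em` instead of walking `argv`.

```c
#include <string.h>

// Collect output selectors in the order given, e.g. "em" selects
// exports and then machine. Returns the number of selectors stored
// in `order`, or -1 for an unknown or repeated letter. `order` needs
// room for three letters.
int parse_outputs(const char *opts, char order[3])
{
    int n = 0;
    for (; *opts; opts++) {
        if (!strchr("eim", *opts)) {
            return -1;  // not one of -e, -i, -m
        }
        if (memchr(order, *opts, (size_t)n)) {
            return -1;  // same option given twice
        }
        order[n++] = *opts;
    }
    return n;
}
```

Because repeats are rejected before storing, the three-element array cannot overflow.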

@skeeto
Owner

skeeto commented Aug 16, 2025

would you prefer if it's only enabled if requested?

I prefer it only enabled on request, and per (1) requesting it puts peports in requested-outputs-only mode. When I casually point peports at an EXE or DLL, the current behavior is pretty much exactly what I want. With a (typical) EXE I get a listing of imports and nothing else, and the machine type is probably already known or unimportant. With a DLL, the exports are most important, and they come first.

Thinking about it more, maybe even better default behavior would be: show only exports, but if there are none then show imports. So then you get only the most important information for either "kind" of PE. Don't worry about that in this MR, though.

Is there any reason the number of exports is counted, but the number of imports isn't?

I don't understand. What do you mean?

Or is this becoming too complicated or similar to gendef?

While I was initially writing peports, I started with hierarchical printing and I didn't like it.

I am leaning towards >machine.

Alright, let's go with that, and so >exports, too. Other than the extra >imports nesting, and the parentheses, I like your example output. That resolves (2). So that means:

$ peports -m example.dll
>machine
    0xAA64 ARM64

$ peports example.dll
>exports
    1 whatever_function
msvcrt.dll
    1 printf
    2 malloc
somelib.dll
    1 somelib_func

$ peports -mei example.dll
>machine
    0xAA64 ARM64
>exports
    1 whatever_function
msvcrt.dll
    1 printf
    2 malloc
somelib.dll
    1 somelib_func

And with that I think it's ready to merge.

outputs should be at the order of the given options

I agree, that's a good point. (It doesn't necessarily need to happen in this MR.)

The parser of -m will not be the same parser of other outputs from peports.

If the machine type is checked in a separate call from extracting exports/imports, then the PE will be handled twice (i.e. opened and read twice). Better to request everything at once, and if the machine type is wrong, discard the rest. So that means parsing all the output together, even if the machine type is known to list first.

0xHHHH (MTYPE)

The order isn't important to me (hex or name first) at least so long as the name is only a single word, but I really do not like the parentheses. They're extraneous and do not disambiguate output. Parsers that want the name have to do work to remove the parentheses.

In listings, <…> is used for <NONAME> because a name with these bytes would instead print as \x3cNONAME\x3e. The angle brackets have meaning. Same for forwarders naming their target modules. The angle brackets separate the name from the module. Space isn't enough because both names and modules can contain spaces.

That's a good point about UNKNOWN, that machine processing of the machine type (e.g. to match up modules) should use the hex, not the name.

then just use JSON

I know you're not actually proposing JSON, but as is more often the case than people appreciate, JSON is limited to unicode text and cannot itself represent byte strings. That's a substantial handicap in systems programming. For example, JSON cannot represent arbitrary file names, either on unix or Windows. In PE images, names and modules are supposed to be ASCII-only, though in practice they're really arbitrary byte strings. So they may not be representable in JSON. Storing PE listings as JSON would require an extra layer of encoding within JSON strings, which is mostly back to square one.

@avih

avih commented Aug 16, 2025

If the machine type is checked in a separate call from extracting exports/imports, then the PE will be handled twice (i.e. opened and read twice). Better to request everything at once, and if the machine type is wrong, discard the rest. So that means parsing all the output together, even if the machine type is known to list first.

You don't know the contexts where it's going to be used, or whether the actual arch is important, regardless of the need for deps to have the same arch.

There's a reason why "do one thing well" exists. Because it's easier to handle one kind of thing at a time. It might be theoretically more efficient to process all the outputs together, but this necessarily complicates the output handling.

It can be preferable for the caller to call it twice, once for each form of output. They may or may not care about performance, or they may, but not enough in this case (the delta of one extra invocation is likely negligible in the grand scheme of things, while more complex code to handle it is not negligible).

At least make this exception:

If only one output type is requested, then don't decorate it with title and other unrelated outputs. The invoker knows what they requested, and extra decoration only makes it more cumbersome to handle.

@avih

avih commented Aug 16, 2025

The order isn't important to me (hex or name first) at least so long as the name is only a single word

You know it's a single word, but does the user know that too? If it's not documented, then they can't assume what chars it contains, including spaces and parentheses and whatever. Would you make such an assumption about some 3rd party tool which prints the architecture of a binary?

But users can assume safely and logically that a hex value is a single word.

I know you're not actually proposing JSON

Correct, and I also don't know enough about these names, their constraints, and their potential encoding. Keep in mind that if they're arbitrary bytes then a \n would likely also throw off any parser, so regardless of whether it's handled or not (it might be escaped?) it should be documented. There's no manpage, and it's not documented at the -h output how non-ALNUM bytes are handled. Source comments don't really count even if they cover everything. Users don't read source code typically to understand what a program does, except if they intend to modify it.

As for JSON, if it was on the table, then it could be encoded, like base64.

@avih

avih commented Aug 16, 2025

By the way, how are file names (modules) encoded? (at the binary itself which needs to load the modules, and at the printout)

@skeeto
Owner

skeeto commented Aug 16, 2025

If only one output type is requested, then don't decorate it with title and other unrelated outputs.

Hmm, that sounds reasonable. So then maybe:

$ peports -m example.dll
0xAA64 ARM64

$ peports -e example.dll
    1 whatever_function

$ peports -mei example.dll
>machine
    0xAA64 ARM64
>exports
    1 whatever_function
msvcrt.dll
    1 printf
    2 malloc
somelib.dll
    1 somelib_func

(Again, this doesn't have to happen in this MR.)
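
A sketch of the selection rule implied by these examples, in C: with no options the defaults are exports and imports, any explicit option switches to requested-outputs-only, and headings such as `>machine` appear only when more than one output is printed. The struct and function names are hypothetical.

```c
#include <stdbool.h>

struct outputs {
    bool machine, exports, imports;
    bool titles;  // print ">machine"/">exports" headings and indent
};

// m, e, i: whether -m, -e, -i were given on the command line.
struct outputs choose_outputs(bool m, bool e, bool i)
{
    struct outputs o = {m, e, i, false};
    if (!m && !e && !i) {
        // No explicit request: default to exports plus imports.
        o.exports = true;
        o.imports = true;
    }
    int count = o.machine + o.exports + o.imports;
    o.titles = count > 1;  // a single output is printed undecorated
    return o;
}
```

Under this rule `peports -m` prints a bare `0xAA64 ARM64` line, while the default invocation and `-mei` keep their headings.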

There's no manpage, and it's not documented

True, but that's just a couple sentences of documentation from being settled. A motivation for peports was that none of the existing tools of which I'm aware handle weird inputs well. They crash (see #135), or byte strings are decoded in some ambiguous way, or output is simply missing due to imprecise decoding/encoding. (It's also, in general, dangerous to use link.exe or Binutils on untrusted PE images, but safe with peports.) So faced with an edge case, I might wonder, "What exactly did the linker produce for this unicode function name?" The only way to inspect weird export/import tables was tediously via a hexdump.

For instance, the situation where Windows dynamic linking depends on the active code page is essentially only debuggable with peports or a manual hexdump inspection. The unambiguous output was first and foremost for human inspection, without any code page contamination. It just so happens that unambiguous output is also useful for robust machine consumption.

By the way, how are file names (modules) encoded?

Officially, it's ASCII-only:
https://learn.microsoft.com/en-us/windows/win32/debug/pe-format#the-edata-section-image-only

This has the funny situation that non-ASCII DLL names are disallowed, or at least impractical, e.g. cálculo.dll. Like many important parts of Windows' behavior, it's undocumented what happens for non-ASCII bytes. Per my article, in practice Windows translates the byte string to UTF-16 (for file system lookup) using the current code page, and so how it decodes depends on who's looking at it. Any dependency walker must go through a similar procedure to turn these byte strings into concrete paths.

Hence the importance of utilities like peports not decoding it at all! If a program did link a cálculo.dll then it's important to distinguish if the linker (or whatever) encoded using its own code page (ex. CP1252: c\xe1lculo.dll in peports) or, say, UTF-8 (c\xc3\xa1lculo.dll in peports). Again, I was not anticipating programs actually decoding this stuff to recover the byte strings, but intending to communicate clearly for human inspection.

There's of course the general problem with representing control characters (like the aforementioned \n), which peports always renders escaped. This prevents malicious PE images from "log injecting" forged outputs.
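
A minimal C sketch of this style of escaping. The exact set of bytes peports escapes is not spelled out in this thread, so the rule below is an assumption: control bytes, DEL, non-ASCII bytes, backslash, and the two angle brackets become `\xNN`, and everything else (including space) is copied as-is. The name `escape_bytes` is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

// Escape a byte string for unambiguous display. Writes a
// NUL-terminated result into dst and returns its length, or -1 if
// dst (capacity cap) is too small.
int escape_bytes(const unsigned char *src, size_t len,
                 char *dst, size_t cap)
{
    size_t n = 0;
    if (cap == 0) {
        return -1;
    }
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        // Assumed rule: printable ASCII passes through, except the
        // characters that carry meaning in the listing syntax.
        int plain = c >= 0x20 && c < 0x7f &&
                    c != '\\' && c != '<' && c != '>';
        size_t need = plain ? 1 : 4;
        if (cap - n <= need) {
            return -1;  // always keep room for the terminator
        }
        if (plain) {
            dst[n] = (char)c;
        } else {
            snprintf(dst + n, 5, "\\x%02x", (unsigned)c);
        }
        n += need;
    }
    dst[n] = '\0';
    return (int)n;
}
```

With this rule the CP1252 bytes of cálculo.dll come out as `c\xe1lculo.dll`, and an embedded newline becomes `\x0a`, so it cannot start a forged output line.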

@avih

avih commented Aug 16, 2025

(Again, this doesn't have to happen in this MR.)

Yeah. No hurry.

There's no manpage, and it's not documented

True, but that's just a couple sentences of documentation from being settled.

Yeah. It's seemingly a tiny thing which is trivial to add, but it makes it much more reassuring to use. As suggested previously, printf format string is used commonly for such things where possible, so this would help for the "single output" mode with all 3 outputs, as well as the non-printable (and non-space?) escapes where applicable (I presume that's "\\x%02x"?).

This has the funny situation that non-ASCII DLL names are disallowed, or at least impractical, e.g. cálculo.dll. Like many important parts of Windows' behavior, it's undocumented what happens for non-ASCII bytes...

Interesting. Go Windows!

$ peports -m example.dll
0xAA64 ARM64

Yes.

$ peports -e example.dll
    1 whatever_function

Maybe. Does it need the indent? Isn't that inconsistent with the -m output above, which is unindented?

Granted, in my use case I don't care for the numbered parts (I only need import modules list, and the arch), but what are the numbers good for? I presume not for machine processing, because then it can simply count if it cares about the numeric index of elements... unless the numbers can be non-sequential?

Similarly, both questions apply to the imports output on its own.

I think that except for imports which is inherently indented internally (symbols per module), the output itself doesn't need the "title indentation" with one output, i.e. like you did with the -m output is OK, but I'd guess it was accidental in your example rather than intentional?

And also it doesn't need the numbering in such case IMO, because it doesn't help automated processing as far as I can tell.

$ peports -mei example.dll
>machine
    0xAA64 ARM64
>exports
    1 whatever_function
msvcrt.dll
    1 printf
    2 malloc
somelib.dll
    1 somelib_func

Maybe, but to be honest I don't think anyone would use this for automatic processing, because it's just more cumbersome and error-prone to process. So if that's for humans, then it can be anything. I still don't find the numbering very useful, but I don't mind them either.

As suggested elsewhere, and assuming the numbers are always sequential, an alternative might be to print only the total per numbered list (at the same line as the list title), though this might make the code more complex if the number of entries is unknown when printing the title (exports) or module (imports).

@avih

avih commented Aug 17, 2025

Re the indentation, I would think, for consistency, that each item should be indented from its parent.

So with multiple outputs, I would think each should have a title (MACHINE/EXPORTS/IMPORTS or however you want to name them), then one tab inside is the content - which may itself be indented further (only imports).

This also removes the ambiguity, because first column is always parent or indent, and the line data itself never contains indent (tab - or space?) because those are escaped.

And each single output is without the title, and one level of indentation removed, and possibly without the numbering too.

@skeeto
Owner

skeeto commented Aug 17, 2025

but what are the numbers good for?

On exports they're export ordinals, and on imports they're hints (with a name) or ordinal imports (without a name, <NONAME>). It's mostly a holdover from the 16-bit era to enable faster dynamic linking, so that not even a binary search is necessary:

https://learn.microsoft.com/en-us/cpp/build/exporting-functions-from-a-dll-by-ordinal-rather-than-by-name

If a module imports by ordinal, there's no string comparison nor search. It just links the Nth export from the DLL. This also enables obfuscation within a piece of software: A module can link its private DLLs by ordinal without revealing function names. A few older Win32 functions have stable ordinals, and so permit ordinal imports of those functions.

GNU ld has --out-implib to produce an import library when building a module (even EXE's can export!). Every export has an ordinal whether or not it's actually stable across builds. Unless you specify ordinals (via a DEF) to the linker, it just numbers them 1-indexed, monotonically. The ordinals in this import library are the true ordinals for the corresponding module.

Traditionally MSVC lib.exe when producing an import library from a DEF, given no ordinal preference, simply assigns ordinals monotonically as a blind guess. Binutils dlltool does the same. These ordinals are virtually never correct. I've patched dlltool to produce zero ordinals ("null") instead of guessing, including when building w64dk itself, which is visible in all the system import libraries. This removes pointless noise from images and improves build reproducibility.

When importing by name, the ordinals listed in an import library are used as hints. The dynamic linker first tries the hint (in practice usually wrong), then reverts to a search. Because these are all zero in w64dk, you'll see zeroes for all these hints in w64dk-linked modules.

I like seeing what's going on with hints and ordinals, so they're in the output. In particular, the hints fingerprint a build revealing what toolchain linked it. If you see all zeroes you know it's w64dk, or at least something unusual (Go). Otherwise you can distinguish what version of Mingw-w64 or Visual Studio are behind a particular image. It's pretty nifty for detective work.
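
The "try the hint, then fall back to a search" behavior for imports by name can be sketched in C. This is a simplified model, not the Windows loader: it assumes the hint is an index into a lexically sorted table of export names, and the function name `find_export` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

// Return the index of `name` in the sorted table `names`, or -1 if
// it is absent. The hint is tried first, so a correct hint resolves
// with a single string comparison and a wrong one costs only that.
long find_export(const char *const *names, size_t count,
                 size_t hint, const char *name)
{
    if (hint < count && strcmp(names[hint], name) == 0) {
        return (long)hint;
    }
    // Hint missed: binary search over the sorted names.
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(names[mid], name);
        if (cmp == 0) {
            return (long)mid;
        }
        if (cmp < 0) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return -1;
}
```

In this model an all-zero hint is right only for the first name in the table, and every other lookup simply takes the search path, which is why zeroed hints cost nothing in correctness.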

@avih

avih commented Aug 17, 2025

Thanks. I was not aware of any of these. So I guess the numbering can be useful.

@Peter0x44
Contributor Author

Peter0x44 commented Aug 17, 2025

Is there any reason the number of exports is counted, but the number of imports isn't?

I don't understand. What do you mean?

Sorry for the confusion, I didn't actually know what an ordinal was and that w64devkit has a patch to zero them.
https://github.com/skeeto/w64devkit/blob/master/src/binutils-dlltool-zero-ordinals.patch

I figured there was code in peports that was counting the number of exports/imports so you can know if you are close to exceeding the 65535 limit or so. But that's not the case.

Adds support for displaying the target machine type of PE files with
a new -m command line option.
@skeeto skeeto force-pushed the peports_arch_detection branch from 6c15412 to b88e8b3 Compare August 17, 2025 21:28
@skeeto skeeto merged commit b88e8b3 into skeeto:master Aug 17, 2025
@skeeto
Owner

skeeto commented Aug 17, 2025

Thanks! I made a couple of small tweaks (I want the option listing in alphabetical order) and squashed.

@Peter0x44 Peter0x44 deleted the peports_arch_detection branch August 17, 2025 21:43
@Peter0x44
Contributor Author

Thanks

(I want the option listing in alphabetical order)

Any reason for this btw?

@skeeto
Owner

skeeto commented Aug 17, 2025 via email

@wesinator
Contributor

default: return s8("UNKNOWN");

unknown should correspond only to 0x0

if the machine type is unrecognized (not in the known list, i.e. the default case), then you could return the string of the machine u16 hex value.
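
A sketch of that suggestion in plain C (peports itself uses its own `s8` string type, which is not reproduced here). The function name `machine_label` and its parameters are hypothetical: `known` is whatever name the lookup table produced, or NULL when the value is not in the table.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Pick the text to show for a machine type. "UNKNOWN" is reserved
// for the value 0x0; any other unrecognized value is shown as hex,
// formatted into the caller's buffer.
const char *machine_label(uint16_t machine, const char *known,
                          char *buf, size_t cap)
{
    if (known) {
        return known;
    }
    if (machine == 0) {
        return "UNKNOWN";
    }
    snprintf(buf, cap, "0x%04x", (unsigned)machine);
    return buf;
}
```

A buffer of 7 bytes is enough for the `0xHHHH` form plus its terminator.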

wesinator added a commit to wesinator/w64devkit that referenced this pull request Aug 19, 2025