
LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 24226 - Constant not propagated into inline assembly, results in "constraint 'I' expects an integer constant expression"


Status:	NEW

Alias:	None

Product:	clang
Classification:	Unclassified
Component:	Driver
Version:	3.6
Hardware:	All All

Importance:	P normal
Assignee:	Unassigned Clang Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-07-22 23:28 PDT by Jeffrey Walton
Modified:	2016-02-22 17:19 PST
CC List:	6 users

See Also:
Fixed By Commit(s):


Description Jeffrey Walton 2015-07-22 23:28:29 PDT

Attempting to compile the following program using the integrated assembler results in:

$ clang++ -g2 -O3 clang-test.cpp -o clang-test.exe
clang-test.cpp:16:11: error: invalid operand for inline asm constraint 'I'
        __asm__ ("rorl %1, %0" : "+mq" (value) : "I" ((unsigned char)(rot...
                 ^
1 error generated.

It appears the integrated assembler does not receive the const value "2". Even 2%32 is constant because the preprocessor can perform the math.

The program compiles fine on Linux systems using GCC/GAS.

**********

// clang++ -g2 -O3 clang-test.cpp -o clang-test.exe
unsigned int RightRotate(unsigned int value, unsigned int rotate);

int main(int argc, char* argv[])
{
	return RightRotate(argc, 2);
}

unsigned int RightRotate(unsigned int value, unsigned int rotate)
{
    // x = value; y = rotate
    // The I constraint ensures we use the immediate-8 variant of the
    // rotate amount y. However, y must be in [0, 31] inclusive. We
    // rely on the preprocessor to propagate the constant and perform
    // the modular reduction so the assembler generates the instruction.
    __asm__ ("rorl %1, %0" : "+mq" (value) : "I" ((unsigned char)(rotate%32)));
    return value;
}

**********

Applies to both:

$ /usr/local/bin/clang++ -v
clang version 3.6.0 (tags/RELEASE_360/final)
Target: x86_64-apple-darwin12.6.0


And

$ clang++ -v
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin12.6.0

**********

We are jumping through these hoops because the cryptographers often call out specs that are not sympathetic to hardware and standards.


**********

The real code is hairier, and it involves a well-defined template implementation that avoids branching (i.e., all instructions always execute, and the C/C++ code avoids undefined behavior):

// Well defined for all y, near constant time
template <class T> inline T rotrImmediateMod(T x, unsigned int y)
{
	static const unsigned int THIS_SIZE = sizeof(T)*8;
	y %= THIS_SIZE;
	return T((x>>y) | (x<<((THIS_SIZE-y) % THIS_SIZE)));
}

Combined with specializations:

template<> inline word32 rotrImmediateMod<word32>(word32 x, unsigned int y)
{
	__asm__ ("rorl %1, %0" : "+g" (x) : "I" ((unsigned char)(y%32)));
	return x;
}
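
As a quick sanity check of the generic template above, here is a minimal self-contained sketch. The `word32` typedef is an assumption (the report does not show its definition), and `check_rotrImmediateMod` is a hypothetical helper added only for illustration:

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t word32;  // assumption: a 32-bit unsigned type

// Generic template as posted in the report.
template <class T> inline T rotrImmediateMod(T x, unsigned int y)
{
	static const unsigned int THIS_SIZE = sizeof(T)*8;
	y %= THIS_SIZE;
	return T((x>>y) | (x<<((THIS_SIZE-y) % THIS_SIZE)));
}

// Hypothetical helper: exercises the edge cases the template is meant to handle.
inline void check_rotrImmediateMod()
{
	assert(rotrImmediateMod<word32>(0x80000001u, 1)  == 0xC0000000u);
	assert(rotrImmediateMod<word32>(0x12345678u, 8)  == 0x78123456u);
	assert(rotrImmediateMod<word32>(0x80000001u, 0)  == 0x80000001u); // left count reduces to 0, not 32
	assert(rotrImmediateMod<word32>(0x80000001u, 32) == 0x80000001u); // 32 % 32 == 0
}
```

The second modular reduction, `(THIS_SIZE-y) % THIS_SIZE`, is what keeps the left shift count below the type width when `y` is 0.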

**********

I'll send a case of Heinekens anywhere in the world to the first person who provides a near constant time intrinsic for left- and right-rotate. I'm amazed Clang does not provide one (http://llvm.org/docs/LangRef.html#intrinsic-functions).

With the intrinsic, we avoid all the C/C++ undefined behavior, we avoid the branching, we get the 1 ASM instruction speedup, and we avoid all the hassles of inline assembly.

Comment 1 Sean Silva 2015-07-22 23:57:10 PDT

So there's two points here:
1. Difficulties in getting the compiler to emit a rotate instruction
2. Constant folding of inline asm operands

For 1., the following compiles to a single ror instruction when optimization is turned on:

unsigned ror(unsigned x, int amt) {
  amt &= (32 - 1);
  return (x >> amt) | (x << (32 - amt));
}

Does that address your concern?

For 2., I've been bitten by this in the past. IIRC, when I ran into this, even a constexpr expression or a template argument didn't help (this is anecdotal).

I'm not sure if the cases of a function argument to a non-constexpr function, as in the OP, are palatable; I suspect this is classic "GCC's definition of 'constant' is 'whatever it can fold'" and clang deliberately does not follow this.

Comment 2 Jeffrey Walton 2015-07-23 00:25:21 PDT

(In reply to comment #1)
> So there's two points here:
> 1. Difficulties in getting the compiler to emit a rotate instruction
> 2. Constant folding of inline asm operands
> 
> For 1., the following compiles to a single ror instruction when optimization
> is turned on:
> 
> unsigned ror(unsigned x, int amt) {
>   amt &= (32 - 1);
>   return (x >> amt) | (x << (32 - amt));
> }

I *think* that has undefined behavior when `amt = 0`. If we check for `amt == 0`, then we introduce a branch.
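
To make the concern concrete: with `amt == 0` the left-shift count `32 - amt` is 32, which equals the width of a 32-bit `unsigned`, and C and C++ leave a shift by the full width undefined. A minimal sketch, assuming a 32-bit `unsigned`; all helper names here are hypothetical:

```cpp
#include <cassert>

// These helpers compute only the left-shift COUNT; they never perform the shift.
constexpr unsigned left_count_unmasked(unsigned amt) { return 32u - (amt & 31u); }
constexpr unsigned left_count_masked(unsigned amt)   { return (32u - (amt & 31u)) & 31u; }

static_assert(left_count_unmasked(0) == 32, "count equals the type width: undefined shift");
static_assert(left_count_masked(0) == 0,    "masking brings the count back into [0, 31]");
static_assert(left_count_masked(5) == 27,   "other amounts are unchanged by the mask");

// Branch-free rotate that stays defined for every amt (assumes 32-bit unsigned).
inline unsigned ror_masked(unsigned x, unsigned amt)
{
    amt &= 31u;
    return (x >> amt) | (x << left_count_masked(amt));
}

inline void check_ror_masked()
{
    assert(ror_masked(0x80000001u, 0) == 0x80000001u);
    assert(ror_masked(0x80000001u, 1) == 0xC0000000u);
}
```

Masking the second count avoids both the undefined shift and the `amt == 0` branch.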

Intel's ICC is particularly ruthless about dropping the statement in its entirety. (Been there, done that, got the t-shirt, and used Clang/UBsan to isolate the problem).

That intrinsic keeps looking better and better. Are you thirsty?

Comment 3 Jeffrey Walton 2015-07-23 00:47:54 PDT

(In reply to comment #1)
> ...

I forgot to say thanks, so this is a less than useful reply.

> I'm not sure if the cases of a function argument to a non-constexpr
> function, as in the OP, are palatable; I suspect this is classic "GCC's
> definition of 'constant' is 'whatever it can fold'" and clang deliberately
> does not follow this.

And let me ask again: are you thirsty yet ;)

Comment 4 Sean Silva 2015-07-23 02:01:39 PDT

(In reply to comment #2)
> (In reply to comment #1)
> > So there's two points here:
> > 1. Difficulties in getting the compiler to emit a rotate instruction
> > 2. Constant folding of inline asm operands
> > 
> > For 1., the following compiles to a single ror instruction when optimization
> > is turned on:
> > 
> > unsigned ror(unsigned x, int amt) {
> >   amt &= (32 - 1);
> >   return (x >> amt) | (x << (32 - amt));
> > }
> 
> I *think* that has undefined behavior when `amt = 0`. If we check for `amt
> == 0`, then we introduce a branch.

Ok, yeah I can see why this is tricky. At least for the rotate by immediate case you could do something like:

template <int N>
unsigned rorImpl(unsigned x) {
  return (x >> N) | (x << (32 - N));
}

template <>
unsigned rorImpl<0>(unsigned x) {
  return x;
}

template <int N>
unsigned ror(unsigned x) {
  return rorImpl<N & (32 - 1)>(x);
}


Ugly, but it should work (provided you have enough faith in the optimization of `(x >> N) | (x << (32 - N))`).
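
For reference, a minimal sketch of how those templates would be called, assuming a 32-bit `unsigned`. The templates are copied from the comment above (with `inline` added to the specialization); `check_ror_templates` is a hypothetical helper:

```cpp
#include <cassert>

template <int N>
unsigned rorImpl(unsigned x) {
  return (x >> N) | (x << (32 - N));
}

// The N == 0 case is specialized so the generic body never shifts by 32.
template <>
inline unsigned rorImpl<0>(unsigned x) {
  return x;
}

template <int N>
unsigned ror(unsigned x) {
  return rorImpl<N & (32 - 1)>(x);
}

// Hypothetical helper: the rotate amount is a template argument, not a
// function argument, so call sites look different from RightRotate(value, 2).
inline void check_ror_templates() {
  assert(ror<8>(0x12345678u)  == 0x78123456u);
  assert(ror<0>(0x12345678u)  == 0x12345678u);
  assert(ror<32>(0x12345678u) == 0x12345678u);  // 32 & 31 == 0
  assert(ror<40>(0x12345678u) == 0x78123456u);  // 40 & 31 == 8
}
```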

Comment 5 Reid Kleckner 2015-07-23 11:53:24 PDT

(In reply to comment #2)
> (In reply to comment #1)
> > So there's two points here:
> > 1. Difficulties in getting the compiler to emit a rotate instruction
> > 2. Constant folding of inline asm operands
> > 
> > For 1., the following compiles to a single ror instruction when optimization
> > is turned on:
> > 
> > unsigned ror(unsigned x, int amt) {
> >   amt &= (32 - 1);
> >   return (x >> amt) | (x << (32 - amt));
> > }
> 
> I *think* that has undefined behavior when `amt = 0`. If we check for `amt
> == 0`, then we introduce a branch.
> 
> Intel's ICC is particularly ruthless about dropping the statement in its
> entirety. (Been there, done that, got the t-shirt, and used Clang/UBsan to
> isolate the problem).
> 
> That intrinsic keeps looking better and better. Are you thirsty?

Check out clang's Intrin.h, we actually provide this intrinsic for compatibility with MSVC:

static __inline__ unsigned int __DEFAULT_FN_ATTRS
_rotr(unsigned int _Value, int _Shift) {
  _Shift &= 0x1f;
  return _Shift ? (_Value >> _Shift) | (_Value << (32 - _Shift)) : _Value;
}

At -O2, if you give it a constant it will pick the immediate variant of roll. It generates correct but crappy -O0 code, though. =/

Comment 6 Eric Christopher 2015-07-24 21:41:22 PDT

You could also use a macro for the function with the inline asm as well FWIW as that would work too. In general though I'd try really hard to get the optimizer to produce the rotate you want ala Sean's comment.
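
A rough sketch of what such a macro might look like. `ROR32_IMM` and `ror32_portable` are hypothetical names, the `"+r"` constraint and `"cc"` clobber are choices made for this sketch rather than taken from the report, and the GNU statement-expression extension is assumed. Because the macro expands at the call site, a literal rotate amount reaches the `"I"` constraint as a constant expression:

```cpp
#include <cassert>
#include <cstdint>

// Portable fallback for non-x86 targets; also used as a reference result.
inline std::uint32_t ror32_portable(std::uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
// The rotate argument must be an integer constant expression, e.g. a literal.
#define ROR32_IMM(value, rotate)                                \
    __extension__ ({                                            \
        std::uint32_t ror32_tmp_ = (value);                     \
        __asm__ ("rorl %1, %0"                                  \
                 : "+r" (ror32_tmp_)                            \
                 : "I" ((unsigned char)((rotate) % 32))         \
                 : "cc");                                       \
        ror32_tmp_;                                             \
    })
#else
#define ROR32_IMM(value, rotate) ror32_portable((value), (rotate))
#endif

inline void check_ror32_imm()
{
    assert(ROR32_IMM(0x80000001u, 1) == 0xC0000000u);
    assert(ROR32_IMM(0x12345678u, 8) == ror32_portable(0x12345678u, 8));
}
```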

Comment 7 Jeffrey Walton 2015-07-24 21:48:27 PDT

(In reply to comment #6)
> You could also use a macro for the function with the inline asm as well FWIW
> as that would work too.

I like the idea, but it's an existing library. There are callers in the field using it. So I think it would be hard to change.

> In general though I'd try really hard to get the
> optimizer to produce the rotate you want ala Sean's comment.

Yeah, I agree about trying to get the compiler to generate the rotate. But there are no guarantees the compiler will recognize a "near-constant time rotate without undefined behavior".

If I could settle for undefined behavior, then the classic example of ((x << y)|(x >> 32-y)) would work nicely. I know nearly every compiler recognizes it because I've examined the generated code. But I also know ICC rejects it :(

Comment 8 Sean Silva 2015-07-25 18:30:40 PDT

(In reply to comment #7)
> (In reply to comment #6)
> > You could also use a macro for the function with the inline asm as well FWIW
> > as that would work too.
> 
> I like the idea, but it's an existing library. There are callers in the field
> using it. So I think it would be hard to change.
> 
> > In general though I'd try really hard to get the
> > optimizer to produce the rotate you want ala Sean's comment.
> 
> Yeah, I agree about trying to get the compiler to generate the rotate. But
> there are no guarantees the compiler will recognize a "near-constant time
> rotate without undefined behavior".

There is no way for us to provide a "guarantee" besides taking on the maintenance burden of defining a new builtin for this with the associated guaranteed semantics. That is asking a lot, especially since the desired behavior can already be obtained in practice. If you need extra assurance, you can have your build configuration system perform a check that the compiler optimizes your chosen construct as desired.
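
One way such a configure-time check could be set up is with a tiny probe translation unit like the sketch below. The file and function names are hypothetical. The build system would compile it to assembly (for example with `-O2 -S`) and search the output for a rotate mnemonic; whether a particular compiler version emits a single rotate is exactly what the probe is meant to find out.

```cpp
// probe_ror.cpp (hypothetical file name)
#include <cassert>

// extern "C" keeps the symbol unmangled so it is easy to locate in the
// generated assembly. Assumes a 32-bit unsigned.
extern "C" unsigned probe_ror(unsigned x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((0u - n) & 31u));
}

// Optional runtime sanity check, independent of what the optimizer emits.
inline void check_probe_ror()
{
    assert(probe_ror(0x80000001u, 1) == 0xC0000000u);
    assert(probe_ror(0x80000001u, 0) == 0x80000001u);
}
```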

> 
> If I could settle for undefined behavior, then the classic example of ((x <<
> y)|(x >> 32-y)) would work nicely. I know nearly every compiler recognizes
> it because I've examined the generated code. But I also know ICC rejects it
> :(

I think in this thread we have already established that a "near-constant time rotate without undefined behavior" can be obtained in the case where the rotation amount is a compile-time constant. Both Reid's solution and my template solution work with clang (as does a macro with inline asm); I have not tested but I am quite certain that both solutions work in ICC as well (or any sane optimizing compiler).

Judging from the inline-asm in your first comment, the compile-time-constant rotation amount seems to be the case you are interested in. Are you also interested in the case where the rotation amount is not constant? If not, I think we can close this bug.

Comment 9 Jeffrey Walton 2015-07-25 20:49:07 PDT

(In reply to comment #8)
> ...
> Judging from the inline-asm in your first comment, the compile-time-constant
> rotation amount seems to be the case you are interested in. Are you also
> interested in the case where the rotation amount is not constant? If not, I
> think we can close this bug.

Yes, I think I have some options.

We also have the non-const rotate that is easier on everyone (me and the compiler) because the constraint is "cI". It seems that when the C register is allowed, I effectively get the constant propagation behavior.

Thanks to Sean and everyone.

(And I'll still ship you or anyone else those Heiny's for that intrinsic - I'm getting ready to repeat this exercise for ARM and MIPS. And I already know ARM does not have an I or J constraint for Machine Specific constraints).

Comment 10 Jeffrey Walton 2015-08-09 14:14:44 PDT

(In reply to comment #7)
> (In reply to comment #6)
> > In general though I'd try really hard to get the
> > optimizer to produce the rotate you want ala Sean's comment.
> 
> Yeah, I agree about trying to get the compiler to generate the rotate. But
> there are no guarantees the compiler will recognize a "near-constant time
> rotate without undefined behavior".

Sorry to dig up an old report.

I recently found GCC Bug 57157, "Poor optimization of portable rotate idiom" (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157). It appears this is the pattern GCC is now recognizing:

    (x << n) | (x >> ((-n) & 31))
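
Generalized to other unsigned widths, that pattern might be wrapped as in the sketch below. The function names are hypothetical, and the code assumes `T` is an unsigned integer type whose bit width is a power of two. Whether a given compiler turns it into a single rotate instruction still has to be checked in the generated code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Both shift counts are masked into [0, bits-1], so no shift is ever undefined.
template <class T>
inline T rotl_portable(T x, unsigned n)
{
    const unsigned mask = std::numeric_limits<T>::digits - 1u;
    return T((x << (n & mask)) | (x >> ((0u - n) & mask)));
}

template <class T>
inline T rotr_portable(T x, unsigned n)
{
    const unsigned mask = std::numeric_limits<T>::digits - 1u;
    return T((x >> (n & mask)) | (x << ((0u - n) & mask)));
}

inline void check_portable_rotates()
{
    assert(rotl_portable<std::uint32_t>(0x80000001u, 1)  == 0x00000003u);
    assert(rotr_portable<std::uint32_t>(0x80000001u, 1)  == 0xC0000000u);
    assert(rotr_portable<std::uint32_t>(0x80000001u, 0)  == 0x80000001u);
    assert(rotl_portable<std::uint32_t>(0x80000001u, 32) == 0x80000001u);
}
```

Writing the negation as `0u - n` keeps the arithmetic unsigned, where wraparound is well defined.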