Conversation

@lczyk (Contributor) commented Jan 15, 2025

I wanted to switch one of my projects to cglm, but couldn't because I needed the Perlin noise implementation. Hence this PR.

So far this is just an implementation of float perlin_vec4(vec4 point), hence the draft status. It is based on glm::perlin. I'm currently planning to add perlin_vec3 and perlin_vec2 as well.

Also, I've come across some missing vec4-ext functions which, for now, I've put in perlin.h, but I'd like to move them to vec4-ext (and check whether any other vec-ext's want them). These are:

void glm_vec4_floor(vec4 x, vec4 dest) // and maybe ceil too
void glm_vec4_mods(vec4 x, float y, vec4 dest) // mod with scalar
void glm_vec4_steps(vec4 edge, float x, vec4 dest) // step with x as scalar
void glm_vec4_sets(vec4 v, float x) // and maybe glm_vec4_set too actually
void glm_vec4_muls(vec4 x, float y, vec4 dest) // mul with scalar
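For concreteness, here is a rough scalar sketch of how the first two could look, following the glm_vec4_* conventions (illustration only, not the final cglm code; names are suffixed _sketch, and mods assumes GLSL-style mod semantics, i.e. x - y * floor(x / y), which is what the noise code relies on):

```c
#include <math.h>
#include <cglm/cglm.h>

/* component-wise floor */
CGLM_INLINE
void
glm_vec4_floor_sketch(vec4 x, vec4 dest) {
  dest[0] = floorf(x[0]);
  dest[1] = floorf(x[1]);
  dest[2] = floorf(x[2]);
  dest[3] = floorf(x[3]);
}

/* component-wise GLSL-style mod with a scalar divisor */
CGLM_INLINE
void
glm_vec4_mods_sketch(vec4 x, float y, vec4 dest) {
  dest[0] = x[0] - y * floorf(x[0] / y);
  dest[1] = x[1] - y * floorf(x[1] / y);
  dest[2] = x[2] - y * floorf(x[2] / y);
  dest[3] = x[3] - y * floorf(x[3] / y);
}
```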

@lczyk (Contributor Author) commented Jan 15, 2025

Have a look at the glm_perlin_test folder in my perlin-wip branch. There is a small testing script for comparing glm::perlin and glm_perlin_vec4. Here is a screenshot:

[Screenshot 2025-01-15: comparison of glm::perlin and glm_perlin_vec4]

The difference is within GLM_FLT_EPSILON (as can also be seen from the tests).
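For context, the check in the tests is essentially of this shape (a hedged sketch, not the actual cglm test harness; the reference values come from glm::perlin on the C++ side, and the real tolerance macro lives in the test suite):

```c
#include <assert.h>
#include <math.h>
#include <cglm/cglm.h>   /* assumes the noise header is pulled in here */

#define PERLIN_TOL 1e-5f /* stand-in for the epsilon used by the tests */

void check_perlin_vec4(vec4 p, float reference) {
  float got = glm_perlin_vec4(p);
  assert(fabsf(got - reference) <= PERLIN_TOL);
}
```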

@lczyk (Contributor Author) commented Jan 17, 2025

Updated comparison, now including glm_perlin_vec3. The delta is smaller due to, I believe, the smaller number of flops.

[Screenshot 2025-01-17: updated comparison including glm_perlin_vec3]

@lczyk (Contributor Author) commented Jan 17, 2025

Updated comparison, now including glm_perlin_vec2:

[Screenshot 2025-01-17: updated comparison including glm_perlin_vec2]

@lczyk (Contributor Author) commented Jan 17, 2025

There is also a bit of a speed difference vs. glm::perlin (glm version 1.0.1):

Timing (in clock ticks)
GLM vec4:  106784
CGLM vec4: 68371 (x1.56 speedup)
GLM vec3:  55447
CGLM vec3: 14427 (x3.84 speedup)
GLM vec2:  22182
CGLM vec2: 8953 (x2.48 speedup)

Timed with a spiritual equivalent of:

#define N 1000000

clock_t start = clock();

for (size_t i = 0; i < N; i++) {
    vec3 p = {(float)i / N, (float)i / N, (float)i / N};
    glm_perlin_vec3(p);
}

clock_t end = clock();

Compiled with:

zig c++ \
        -std=c++11 -O3 \
        -Wall -Wextra -Wpedantic -Wno-null-conversion -Wno-unused-variable -Werror \
        -o glm_perlin_test glm_perlin_test.cpp -lglm -L/opt/homebrew/opt/glm/lib

@lczyk marked this pull request as ready for review on January 17, 2025, 20:36
lczyk added 7 commits January 18, 2025 20:10
_glm_noiseDetail_mod289
_glm_noiseDetail_permute
_glm_noiseDetail_fade_vec4
_glm_noiseDetail_fade_vec3
_glm_noiseDetail_fade_vec2
_glm_noiseDetail_taylorInvSqrt
_glm_noiseDetail_gradNorm_vec4
_glm_noiseDetail_gradNorm_vec3
_glm_noiseDetail_gradNorm_vec2
_glm_noiseDetail_i2gxyzw
_glm_noiseDetail_i2gxyz
_glm_noiseDetail_i2gxy
@lczyk (Contributor Author) commented Jan 18, 2025

Ok, so:

> 1. glm_vec4_scale() can be used instead of glm_vec4_muls()

done. also replaced _glm_vec4_sets with the already existing glm_vec4_fill.

> 2. Some missing useful vec functions can be moved to vec[2|3|4] or -ext.h files

done. moved _floor and _mods. renamed _steps to _stepr (steps thresholds a vector by a scalar edge; stepr thresholds a scalar by a vector of edges; see the sketch below, after these points) and moved both to ext. deprecated step_uni in favour of steps (with all the compatibility macros added) since, from what i can tell, the 's' suffix is much more prevalent than _uni for the scalar version of a function.

> 3. No // comment pls

done

> 4. Two spaces instead of 4

done

> 5. Not sure about _glm_ functions since they all will be visible to user, maybe macro then #undef macro at end of file? if there is no better way to handle them

done. had no better idea than #define and then #undef, so did that (static would confine the functions to a compilation unit, but they would still be visible in header-only mode). see the sketch at the end of this comment.

> 6. If possible, same coding style as other files (or I can do some small edits later)

unsure what bits you mean tbh. i've looked at formatting in #433, and as far as i can tell it matches. please feel free to point out bits and i'm more than happy to change them 👍

> EDIT: In the future additional optimizations can be made if it could be possible

unsure what you mean(?)
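To make the steps / stepr distinction from point 2 concrete, a rough sketch (names suffixed _sketch since this is only an illustration, and the exact argument order of the real functions is an assumption):

```c
#include <cglm/cglm.h>

/* steps: threshold the components of a vector by a single scalar edge */
CGLM_INLINE
void
glm_vec4_steps_sketch(float edge, vec4 x, vec4 dest) {
  dest[0] = x[0] < edge ? 0.0f : 1.0f;
  dest[1] = x[1] < edge ? 0.0f : 1.0f;
  dest[2] = x[2] < edge ? 0.0f : 1.0f;
  dest[3] = x[3] < edge ? 0.0f : 1.0f;
}

/* stepr: threshold a single scalar by a vector of edges */
CGLM_INLINE
void
glm_vec4_stepr_sketch(vec4 edge, float x, vec4 dest) {
  dest[0] = x < edge[0] ? 0.0f : 1.0f;
  dest[1] = x < edge[1] ? 0.0f : 1.0f;
  dest[2] = x < edge[2] ? 0.0f : 1.0f;
  dest[3] = x < edge[3] ? 0.0f : 1.0f;
}
```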

Also, while moving things to ext, i found a bug in the swizzle tests for vec3 and vec4 (they called glm_vec3_swizzle directly instead of the test macro GLM(vec3_swizzle), so they were not testing the export to the static lib). Fixed that.

Also also, found that vec2_step and vec2_swizzle were missing. added both, with tests. although they are not used in noise, this way the api is consistent between vec2 and vec3/vec4.
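And for point 5, the pattern being described is roughly the following (a sketch only; the real macro names and bodies in the noise header differ, this just shows the define-then-undef shape):

```c
#ifndef noise_sketch_h
#define noise_sketch_h

/* internal helper implemented as a macro so it can be removed again below;
   the fade polynomial is the one quoted later in this thread */
#define _glm_noiseDetail_fade_vec4(t, dest) {                            \
  dest[0] = (t[0]*t[0]*t[0]) * (t[0] * (t[0]*6.0f - 15.0f) + 10.0f);     \
  dest[1] = (t[1]*t[1]*t[1]) * (t[1] * (t[1]*6.0f - 15.0f) + 10.0f);     \
  dest[2] = (t[2]*t[2]*t[2]) * (t[2] * (t[2]*6.0f - 15.0f) + 10.0f);     \
  dest[3] = (t[3]*t[3]*t[3]) * (t[3] * (t[3]*6.0f - 15.0f) + 10.0f);     \
}

/* ... public glm_perlin_* functions that use the helper go here ... */

/* undefine at the end of the header so the helper does not leak to users */
#undef _glm_noiseDetail_fade_vec4

#endif /* noise_sketch_h */
```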

@recp (Owner) commented Jan 19, 2025

@MarcinKonowalczyk many thanks! Some tests (vec3) are failing on some platforms. Are they due to floating point errors, or is there something that can be improved?

[Screenshot 2025-01-19: failing vec3 test output]

> EDIT: In the future additional optimizations can be made if it could be possible
>
> unsure what you mean(?)

Even now, there is a lot of room for optimizations:

```c
(t*t*t) * (t * (t * 6 - 15) + 10)
```

can be re-written as:

```c
t * (t*t * (t*6 - 15) + 10)
```

which reduces one vector mul, and similar operations can be optimized even for the scalar version:

```c
  dest[0] = (t[0] * t[0] * t[0]) * (t[0] * (t[0] * 6.0f - 15.0f) + 10.0f); \
  dest[1] = (t[1] * t[1] * t[1]) * (t[1] * (t[1] * 6.0f - 15.0f) + 10.0f); \
  dest[2] = (t[2] * t[2] * t[2]) * (t[2] * (t[2] * 6.0f - 15.0f) + 10.0f); \
  dest[3] = (t[3] * t[3] * t[3]) * (t[3] * (t[3] * 6.0f - 15.0f) + 10.0f); \
```
Using glm_vec4_mul() may give better results since it is optimized with SIMD, though well-known compilers may auto-vectorize this anyway; I'm not sure about that.

In cglm, some operations may be grouped (if possible) so that the whole function can be optimized with SIMD, since vec3/vec4 are used a lot internally.

> _stepr

In _steps the s stands for scalar, but what does the r stand for?

My hope was to keep swizzle as a macro so it could use builtin shuffle / blend / permute ... to keep it lightweight, but 🤷‍♂️ anyway, thanks for the fixes.

> please feel free to point out bits and i'm more than happy to change them 👍

no prob. I can do some small style changes later, e.g. indents, declaring variables at the beginning of functions (or at least of the scope, even though C99+ doesn't require it) where possible... I'd probably prefer glm__ over _glm_ for internal macros, temp definitions, functions ...

After the tests pass, we can merge the PR.

@lczyk (Contributor Author) commented Jan 20, 2025

> Are they due to floating point errors, or is there something that can be improved?

looking into it.

> (t*t*t) * (t * (t * 6 - 15) + 10)
>
> can be re-written as:
>
> t * (t*t * (t*6 - 15) + 10)

i'm not sure it can.. (see this on wolfram alpha: just subtracting one eq from the other). i've also tried it in code and it produces a wrong noise pattern.
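For the record, expanding both sides shows why: (t*t*t) * (t * (t*6 - 15) + 10) = 6t^5 - 15t^4 + 10t^3, which is the Perlin fade polynomial, while t * (t*t * (t*6 - 15) + 10) = 6t^4 - 15t^3 + 10t, so the two are not equivalent.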

> using glm_vec4_mul() may give better results

i've tried

#define _glm_noiseDetail_fade_vec4(t, dest) { \
    glm_vec4_mul(t, t, dest); /* dest = t * t */ \
    glm_vec4_mul(dest, t, dest); /* dest *= t */ \
    vec4 temp; \
    glm_vec4_scale(t, 6.0f, temp); /* temp = t * 6.0f */ \
    glm_vec4_subs(temp, 15.0f, temp); /* temp -= 15.0f */ \
    glm_vec4_mul(t, temp, temp); /* temp *= t */ \
    glm_vec4_adds(temp, 10.0f, temp); /* temp += 10.0f */ \
    glm_vec4_mul(dest, temp, dest); /* dest *= temp */ \
}

but did not see any appreciable difference in speed. Happy to do a pass like this over the helper functions if you'd prefer though.

generally, so far i've not thought that much about optimisations, just about a correct / readable implementation faithful to glm::perlin. the rough benchmarking looked good, so i left it at that. maybe, given that the benchmarks look fine, let's keep this as the initial implementation and do benchmarking / optimisation as a separate piece of work?

> In _steps the s stands for scalar, but what does the r stand for?

'reverse' 😅 not the best name, i admit. i just named it something that made sense in my head and then forgot to go back to it. maybe, instead, steps -> stepsv and stepr -> stepvs? or, indeed, maybe stepr should just be an internal helper of noise as opposed to an ext func? happy with whatever you'd like there 👍

> probably prefer glm__ over _glm_ for internal macros

done


Looking at tests in more detail...

@lczyk (Contributor Author) commented Jan 20, 2025

I've managed to reproduce the CMake/ubuntu-22.04/clang-15 error in Docker. Dockerfile:

FROM ubuntu:22.04
WORKDIR /cglm

# Install dependencies
RUN apt-get update -y
RUN apt-get install -y cmake clang-15 ninja-build

# Copy source code, remove build dir if exists
COPY . .
RUN rm -rf build

RUN cmake \
    -B build \
    -GNinja \
    -DCMAKE_C_COMPILER=clang-15 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCGLM_STATIC=ON \
    -DCGLM_USE_TEST=ON

RUN cmake --build build

CMD ["bash"]

which follows pretty much exactly what the pipeline does.

Then

DOCKER_DEFAULT_PLATFORM=linux/amd64 docker build -t cglm_test . && docker run --rm -it cglm_test /cglm/build/tests

fails and

DOCKER_DEFAULT_PLATFORM=linux/arm64 docker build -t cglm_test . && docker run --rm -it cglm_test /cglm/build/tests

succeeds (the only difference being amd64 vs arm64). This pretty much confirms that it's due to the fact that i'm working on an ARM Mac and generated the test values in test_noise.h with a native build.

I will investigate further to see how big the difference is and adjust the tests accordingly.

@lczyk (Contributor Author) commented Jan 20, 2025

So, this is wild...

[Screenshot 2025-01-20: per-platform comparison of deltas on amd64]

Yeah, there are big differences on amd64, but only sometimes! I think it might have something to do with vec3 having a slightly different implementation in glm::perlin than the vec2 and vec4 versions. Will have a deeper look into it. Might actually be a bug in glm::perlin, but I will try to match ours first, and then potentially submit a patch there.

For reference, this was compiled in the same kind of container as above (i.e. with clang-15).

@recp (Owner) commented Jan 20, 2025

> t * (t*t * (t*6 - 15) + 10)
>
> i'm not sure it can.. (see this on wolfram alpha: just subtracting one eq from the other). i've also tried it in code and it produces a wrong noise pattern.

What was I thinking when simplifying it 🫣 But the point was that some operations can be optimized by simplifying, some with SIMD and some with ILP....

As I mentioned before ("In the future additional optimizations can be made if it could be possible"), it is not expected in this PR :)

Let's skip micro-optimizations for now and merge the PR after the tests pass.

> maybe stepr should just be an internal helper of noise as opposed to an ext func?

makes sense for now 👍

> Will have a deeper look into it. Might actually be a bug in glm::perlin, but I will try to match ours first, and then potentially submit a patch there.

thanks

@lczyk (Contributor Author) commented Jan 22, 2025

@recp done. see the comment in the patch for an explanation, and the image below for the comparison.

i think this is something i will raise with glm, and if they end up changing it there i can do the check-and-match once again. 👍 I've ended up semi-automating the build for amd64 and arm64 with Docker + a Makefile, and then just going through the source of both noise.h and noise.inl (on the glm side), returning early with partial values and seeing where the delta is. See the perlin-wip branch for the code. tldr, this should show you the diff:

git clone https://github.com/MarcinKonowalczyk/cglm
cd cglm
git checkout 14d14be8fac739666bb48c61eda9eff97b8dfd3a  # commit on the perlin-wip branch
cd glm_perlin_test
make
python plot.py --suffix arm # or amd

there is also make test, which runs all the tests in the containers.

writing this here partially to document it for myself when i inevitably need to do this again 😅


[Screenshot 2025-01-22: comparison after the fix]

@lczyk (Contributor Author) commented Jan 22, 2025

so, while i had everything set up, i thought i'd do a comparison of the other intermediate values and found a couple more bugs / inconsistencies. you know how the delta for amd64 was just noise at ~1e-7 but for arm64 it had structure at ~1e-6? well, now they're both just noise (see below; that's on arm64). as part of that i've vectorised some of the intermediate functions, and found a missing SIMD intrinsic in glm_vec4_divs.

btw, tests pass in both the arm64 and amd64 containers 👍


[Screenshot 2025-01-22: deltas now just noise on both architectures]

@recp merged commit e8c791e into recp:master on Jan 22, 2025 (49 of 74 checks passed)
@recp (Owner) commented Jan 22, 2025

@MarcinKonowalczyk the PR is merged, many thanks for your contributions 🚀

@recp (Owner) commented Jan 22, 2025

> you know how the delta for amd64 was just noise at ~1e-7 but for arm64 it had structure at ~1e-6?

Hmm, in the tests I've used GLM_FLT_EPSILON, which is 1e-5 but can be configured thanks to these changes. Also, FP precision may decrease after a lot of FP ops; this is why I mentioned floating point errors before. Because of this, we may need a lower precision for the comparisons in the tests, I guess :/ Fused math may reduce the errors where available; cglm tries to take advantage of fused math where possible. Maybe by using glm_vec4_* internally instead of expecting auto-vectorization, the compiler can use fma to reduce FP errors where possible. A compiler which can't optimize inline functions may generate a lot of MOVs, which is why I think we should manually optimize each function with SIMD where possible in the future...
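(Small illustration of the fused-math point, not cglm code: fmaf computes a*b + c with a single rounding, whereas a separate multiply and add round twice.)

```c
#include <math.h>
#include <stdio.h>

int main(void) {
  float t = 0.73f;
  float fused   = fmaf(t, 6.0f, -15.0f); /* t*6 - 15, rounded once       */
  float unfused = t * 6.0f - 15.0f;      /* rounded after each operation */
  printf("%.9g %.9g\n", fused, unfused);
  return 0;
}
```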

> as part of that i've vectorised some of the intermediate functions

IIRC, there wasn't too much of a diff before, but okay 👍


Anyway, many thanks for your contributions 🚀

@lczyk (Contributor Author) commented Jan 22, 2025

nice 🥳

tomorrow / the day after, i will have another look over the commits and write up some bullet points for easy inclusion in the release notes for the next version (given that i've also fixed a couple of bugs and added a couple of ext functions).

@gottfriedleibniz commented

vdivq_f32 is available only on A64, so compiling for ARMv7 may lead to issues. In this instance it may be easier to use the existing definition of glmm_div.
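For illustration, the shape of the guard being suggested (a sketch only, not the actual glmm_div):

```c
#include <arm_neon.h>

/* vdivq_f32 exists on AArch64 only; plain ARMv7 NEON has no divide
   instruction, so a reciprocal estimate plus Newton-Raphson refinement
   is the usual fallback. */
static inline float32x4_t div_sketch(float32x4_t a, float32x4_t b) {
#if defined(__aarch64__)
  return vdivq_f32(a, b);
#else
  float32x4_t r = vrecpeq_f32(b);        /* initial estimate of 1/b */
  r = vmulq_f32(r, vrecpsq_f32(b, r));   /* refinement step 1       */
  r = vmulq_f32(r, vrecpsq_f32(b, r));   /* refinement step 2       */
  return vmulq_f32(a, r);
#endif
}
```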

@recp (Owner) commented Jan 24, 2025

Hi @gottfriedleibniz,

Many thanks for the catch.

EDIT: 441f265...70a1a94 should fix this.


We must add an ARMv7 build to CI asap to catch these more quickly (maybe -ffast-math too).

@lczyk (Contributor Author) commented Jan 25, 2025

cheers, good catch! having had a closer look at the code, i guess we should transition most of the ext stuff to call through glmm_ where possible, and then handle intrinsic selection in glmm_, right?

> ... (maybe -ffast-math too)

There might be some issues with that around the noise tests, given how small numerical differences can get amplified there (see the /7 vs *(1/7) case), but we could disable fast math for those particular tests. I could not find a convenient pre-defined flag, but something like -ffast-math -DCGLM_FAST_MATH=0 should work. To be clear, i mean this only for the numerically vulnerable tests; i think, as a baseline, the tests should pass under -ffast-math.
