Can't compile on Jetson Orin NX with "LLAMA_CUBLAS=1" #1455

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.

# Expected Behavior
Compiling on the Jetson Orin works as expected, but with the "LLAMA_CUBLAS=1" flag I get errors.

# Current Behavior
Compilation fails with 2 errors:
```
/usr/local/cuda/include/crt/sm_80_rt.hpp(141): error: more than one instance of overloaded function "__nv_associate_access_property_impl" has "C" linkage

ggml.h(217): error: identifier "__fp16" is undefined

2 errors detected in the compilation of "ggml-cuda.cu".
make: *** [Makefile:133: ggml-cuda.o] Error 1
```
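
The `__fp16` error can probably be isolated from llama.cpp entirely. The sketch below assumes the failure comes from nvcc not accepting `__fp16` as a type name in a `.cu` file; what `ggml.h` line 217 actually contains is not confirmed here. The function name `check_fp16` and the probe file name are hypothetical.

```shell
#!/usr/bin/env bash
# Minimal probe: does nvcc accept __fp16 as a type in a .cu file?
# Assumption: the ggml.h(217) error is caused by a plain use of __fp16.
check_fp16() {
    # Return 2 when nvcc is not on PATH, so "rejected" and "not installed" differ.
    command -v nvcc >/dev/null 2>&1 || { echo "nvcc not found" >&2; return 2; }

    local dir status
    dir=$(mktemp -d) || return 2

    # Hypothetical one-line probe file that only names the type.
    printf 'typedef __fp16 probe_fp16_t;\n' > "$dir/fp16_probe.cu"

    # Compile only; nvcc's exit status is passed back to the caller.
    nvcc -c "$dir/fp16_probe.cu" -o "$dir/fp16_probe.o"
    status=$?

    rm -rf "$dir"
    return "$status"
}
```

A non-zero status from `check_fp16` on the Orin would suggest the problem is independent of the llama.cpp Makefile. A status of 0 would point back at how `ggml.h` is being compiled.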


# Environment and Context

Fresh install of JetPack on a fresh Ubuntu installation, on a Jetson Orin NX 16GB.

```
$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-3
Off-line CPU(s) list:            4-7
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
Vendor ID:                       ARM
Model:                           1
Model name:                      ARMv8 Processor rev 1 (v8l)
Stepping:                        r0p1
CPU max MHz:                     1984,0000
CPU min MHz:                     115,2000
BogoMIPS:                        62.50
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        1 MiB
L3 cache:                        2 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dc
                                 pop asimddp uscat ilrcpc flagm
```

* Operating System, e.g. for Linux:
```
$ uname -a
Linux orin-desktop 5.10.104-tegra #1 SMP PREEMPT Tue Jan 24 15:09:44 PST 2023 aarch64 aarch64 aarch64 GNU/Linux
```

* SDK version, e.g. for Linux:
```
$ python3 --version
Python 3.10.10

$ make --version
GNU Make 4.2.1

$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

$ nvcc -V
Built on Wed_Jun__8_16:59:16_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
```
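
For anyone comparing against another Jetson, the same version information can be gathered in one pass. This is a minimal sketch; `report_versions` is a hypothetical helper name, and it queries every tool with `--version` (nvcc accepts that as the long form of `-V`).

```shell
#!/usr/bin/env bash
# Print the --version output of each named tool, or note that it is missing.
report_versions() {
    local tool
    for tool in "$@"; do
        printf '$ %s --version\n' "$tool"
        if command -v "$tool" >/dev/null 2>&1; then
            # Some tools print their version on stderr, so capture both streams.
            "$tool" --version 2>&1
        else
            echo "$tool: not found"
        fi
        echo
    done
}
```

Calling `report_versions python3 make g++ nvcc` covers the four tools listed in this section.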


# Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

# Steps to Reproduce

1. Run `make LLAMA_CUBLAS=1` (full output under Failure Logs).

# Failure Logs

```
$ make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  aarch64
I UNAME_M:  aarch64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
llama.cpp: In function ‘size_t llama_set_state_data(llama_context*, const uint8_t*)’:
llama.cpp:2686:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2686 |             kin3d->data = (void *) inp;
      |                                    ^~~
llama.cpp:2690:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2690 |             vin3d->data = (void *) inp;
      |                                    ^~~
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
/usr/local/cuda/include/crt/sm_80_rt.hpp(141): error: more than one instance of overloaded function "__nv_associate_access_property_impl" has "C" linkage

ggml.h(217): error: identifier "__fp16" is undefined

2 errors detected in the compilation of "ggml-cuda.cu".
make: *** [Makefile:133: ggml-cuda.o] Error 1
```
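
The flags in the log include `/opt/cuda/...` and `/targets/x86_64-linux/...` directories, which look out of place on an aarch64 board. Whether they have anything to do with the two errors is not established. The sketch below only reports which `-I` and `-L` directories from a flags string exist on the machine; `check_flag_dirs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List every -I/-L directory in a flags string and report whether it exists.
check_flag_dirs() {
    local flag dir
    # Word splitting on $1 is intentional: one flag per word.
    # Assumes the flags contain no glob characters or quoted spaces.
    for flag in $1; do
        case "$flag" in
            -I?*|-L?*)
                dir=${flag#-[IL]}   # strip the leading -I or -L
                if [ -d "$dir" ]; then
                    echo "ok      $dir"
                else
                    echo "missing $dir"
                fi
                ;;
        esac
    done
}
```

Passing the `CXXFLAGS` and `LDFLAGS` values from the build info above as a single quoted string prints one `ok` or `missing` line per directory.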

