Commit f2c0dfb

Use fp32 in cuBLAS V100 to avoid overflows, env variables to override cuBLAS compute type (ggml-org#19959)
* Update ggml-cuda.cu
* Update ggml-cuda.cu
* Update build.md
* Update build.md
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Update ggml-cuda.cu
* Update build.md
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Update build.md
* Update ggml-cuda.cu
* Update ggml-cuda.cu

---------

Co-authored-by: Johannes Gäßler <[email protected]>
1 parent 9789c4e commit f2c0dfb

2 files changed

Lines changed: 64 additions & 17 deletions


docs/build.md

Lines changed: 9 additions & 1 deletion
@@ -269,6 +269,14 @@ The environment variable [`CUDA_SCALE_LAUNCH_QUEUES`](https://docs.nvidia.com/cu
 
 Consider setting `CUDA_SCALE_LAUNCH_QUEUES=4x`, which increases the CUDA command buffer to 4 times its default size. This optimization is particularly beneficial for **Multi-GPU setups with pipeline parallelism**, where it significantly improves prompt processing throughput by allowing more operations to be enqueued across GPUs.
 
+#### GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F
+
+Use `GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F` environment variable to use FP32 compute type on all GPUs in FP16 cuBLAS for preventing possible numerical overflows in exchange for slower prompt processing (small impact on RTX PRO/Datacenter products and significant on GeForce products).
+
+#### GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F
+
+Use `GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` environment variable to force use FP16 compute type (instead of default FP32) in FP16 cuBLAS for V100, CDNA and RDNA4.
+
 ### Unified Memory
 
 The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as `System Memory Fallback`.
@@ -280,7 +288,7 @@ The following compilation options are also available to tweak performance:
280288
| Option | Legal values | Default | Description |
281289
|-------------------------------|------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
282290
| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, CDNA and RDNA3+). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
283-
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for CDNA and RDNA4) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000). |
291+
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for V100, CDNA and RDNA4 which use FP32 compute type by default) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000). |
284292
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
285293
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
286294

ggml/src/ggml-cuda/ggml-cuda.cu

Lines changed: 55 additions & 16 deletions
@@ -1242,6 +1242,34 @@ static cudaError_t ggml_cuda_cpy_tensor_2d(
     }
 }
 
+struct cublas_force_compute_type {
+    bool fp32 = false;
+    bool fp16 = false;
+};
+
+static const cublas_force_compute_type & ggml_cuda_cublas_get_force_compute_type() {
+    static const cublas_force_compute_type compute_type = [] {
+        cublas_force_compute_type result;
+
+        const bool ggml_cuda_force_cublas_compute_32f_env = getenv("GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F") != nullptr;
+        const bool ggml_cuda_force_cublas_compute_16f_env = getenv("GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F") != nullptr;
+
+        GGML_ASSERT(ggml_cuda_force_cublas_compute_16f_env == false || ggml_cuda_force_cublas_compute_32f_env == false);
+
+        if (ggml_cuda_force_cublas_compute_32f_env) {
+            GGML_LOG_INFO("Detected GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F\n");
+            result.fp32 = true;
+        } else if (ggml_cuda_force_cublas_compute_16f_env) {
+            GGML_LOG_INFO("Detected GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F\n");
+            result.fp16 = true;
+        }
+
+        return result;
+    }();
+
+    return compute_type;
+}
+
 static void ggml_cuda_op_mul_mat_cublas(
     ggml_backend_cuda_context & ctx,
     const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i,
@@ -1324,7 +1352,13 @@ static void ggml_cuda_op_mul_mat_cublas(
 
         CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream));
 
-        if (GGML_CUDA_CC_IS_CDNA(cc) || GGML_CUDA_CC_IS_RDNA4(cc)) {
+        const auto & force_compute_type = ggml_cuda_cublas_get_force_compute_type();
+
+        if (!force_compute_type.fp16 && (GGML_CUDA_CC_IS_CDNA(cc)
+            || GGML_CUDA_CC_IS_RDNA4(cc)
+            || cc == GGML_CUDA_CC_VOLTA
+            || force_compute_type.fp32))
+        {
             const float alpha = 1.0f;
             const float beta = 0.0f;
             CUBLAS_CHECK(
@@ -1923,10 +1957,23 @@ static void ggml_cuda_mul_mat_batched_cublas_impl(ggml_backend_cuda_context & ct
     cudaDataType_t cu_data_type_b = traits::data_type;
     const void * alpha = traits::get_alpha();
     const void * beta = traits::get_beta();
-    const float alpha_f32 = 1.0f;
-    const float beta_f32 = 0.0f;
 
-    if (dst->op_params[0] == GGML_PREC_DEFAULT) {
+    const auto & force_compute_type = ggml_cuda_cublas_get_force_compute_type();
+
+    int id = ggml_cuda_get_device();
+    const int cc = ggml_cuda_info().devices[id].cc;
+    static constexpr bool is_src0_type_f16 = src0_type == GGML_TYPE_F16;
+
+    // bf16 and fp32 are already being computed in fp32 (ensure it using static_assert),
+    // so checking necessity of forced fp32 only for fp16 src0_type
+    static_assert(is_src0_type_f16 || traits::compute_type == CUBLAS_COMPUTE_32F);
+
+    const bool need_compute_32f = is_src0_type_f16 && !force_compute_type.fp16 && (GGML_CUDA_CC_IS_CDNA(cc)
+        || GGML_CUDA_CC_IS_RDNA4(cc)
+        || cc == GGML_CUDA_CC_VOLTA
+        || force_compute_type.fp32);
+
+    if (dst->op_params[0] == GGML_PREC_DEFAULT && !need_compute_32f) {
         if constexpr (src0_type == GGML_TYPE_F32) {
             dst_t = (char *) dst_ddf; // Direct F32 output
         } else {
@@ -1936,18 +1983,10 @@ static void ggml_cuda_mul_mat_batched_cublas_impl(ggml_backend_cuda_context & ct
         }
     } else {
         dst_t = (char *) dst_ddf;
-        cu_compute_type = CUBLAS_COMPUTE_32F;
-        cu_data_type = CUDA_R_32F;
-        alpha = &alpha_f32;
-        beta = &beta_f32;
-    }
-
-    int id = ggml_cuda_get_device();
-    const int cc = ggml_cuda_info().devices[id].cc;
-    if (GGML_CUDA_CC_IS_CDNA(cc) || GGML_CUDA_CC_IS_RDNA4(cc)) {
-        cu_compute_type = CUBLAS_COMPUTE_32F;
-        alpha = &alpha_f32;
-        beta = &beta_f32;
+        cu_compute_type = batched_mul_mat_traits<GGML_TYPE_F32>::compute_type;
+        cu_data_type = batched_mul_mat_traits<GGML_TYPE_F32>::data_type;
+        alpha = batched_mul_mat_traits<GGML_TYPE_F32>::get_alpha();
+        beta = batched_mul_mat_traits<GGML_TYPE_F32>::get_beta();
     }
 
     GGML_ASSERT(ne12 % ne02 == 0);
