Closed
Description
Machine details
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 0
BogoMIPS: 4399.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clf
lush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_
good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fm
a cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hyp
ervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsb
ase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsa
veopt arat md_clear arch_capabilities
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 55 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Mitigation; PTE Inversion
Mds: Vulnerable; SMT Host state unknown
Meltdown: Vulnerable
Mmio stale data: Vulnerable
Retbleed: Vulnerable
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swap
gs barriers
Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Srbds: Not affected
Tsx async abort: Vulnerable
total used free shared buff/cache available
Mem: 50Gi 825Mi 47Gi 1.0Mi 3.1Gi 49Gi
Swap: 0B 0B 0B
Linux 7f27df5cdb2f 6.1.58+ #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Bug details:
When I run the below CLI command
!./bin/embedding -m /content/nomic-embed-text-v1.Q4_0.gguf -p """Elon Reeve Musk (/ˈiːlɒn/; EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX;"""
with 50GB RAM, its way faster like 2x
llama_print_timings: load time = 34.47 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 180.80 ms / 50 tokens ( 3.62 ms per token, 276.55 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 302.62 ms / 51 tokens
with 12GB RAM,
llama_print_timings: load time = 56.80 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 598.79 ms / 50 tokens ( 11.98 ms per token, 83.50 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 732.38 ms / 51 tokens
Observations:
The Used memory doesn't change or increase merely allocating more memory makes it run faster why.
This is not just one run, I have done several runs in a loop with different texts the behaviour persists.
Expected behaviour
Runtime should not change based on allocated memory ?