[CPU Embedding Inference] 2x slower when **Allocated** (not used) memory / RAM is lower.

**Machine details**

```python
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            0
    BogoMIPS:            4399.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clf
                         lush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_
                         good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fm
                         a cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hyp
                         ervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsb
                         ase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsa
                         veopt arat md_clear arch_capabilities
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    55 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable; SMT Host state unknown
  Meltdown:              Vulnerable
  Mmio stale data:       Vulnerable
  Retbleed:              Vulnerable
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swap
                         gs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable
               total        used        free      shared  buff/cache   available
Mem:            50Gi       825Mi        47Gi       1.0Mi       3.1Gi        49Gi
Swap:             0B          0B          0B
Linux 7f27df5cdb2f 6.1.58+ #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```

**Bug details:**

When I run the below CLI command 

```sh
!./bin/embedding -m /content/nomic-embed-text-v1.Q4_0.gguf -p """Elon Reeve Musk (/ˈiːlɒn/; EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX;"""
```

**with 50GB RAM, its way faster like 2x**

```python
llama_print_timings:        load time =      34.47 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     180.80 ms /    50 tokens (    3.62 ms per token,   276.55 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     302.62 ms /    51 tokens
```

**with 12GB RAM,** 

```python
llama_print_timings:        load time =      56.80 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     598.79 ms /    50 tokens (   11.98 ms per token,    83.50 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     732.38 ms /    51 tokens
```

**Observations:**

The Used memory doesn't change or increase merely allocating more memory makes it run faster why.
This is not just one run, I have done several runs in a loop with different texts the behaviour persists.

**Expected behaviour**
Runtime should not change based on allocated memory ?




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CPU Embedding Inference] 2x slower when Allocated (not used) memory / RAM is lower. #5846

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[CPU Embedding Inference] 2x slower when **Allocated** (not used) memory / RAM is lower. #5846

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[CPU Embedding Inference] 2x slower when Allocated (not used) memory / RAM is lower. #5846