
[CPU Embedding Inference] 2x slower when **Allocated** (not used) memory / RAM is lower. #5846

Closed
PrithivirajDamodaran opened this issue Mar 3, 2024 · 5 comments

Comments

@PrithivirajDamodaran

Machine details

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            0
    BogoMIPS:            4399.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    55 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable; SMT Host state unknown
  Meltdown:              Vulnerable
  Mmio stale data:       Vulnerable
  Retbleed:              Vulnerable
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable
               total        used        free      shared  buff/cache   available
Mem:            50Gi       825Mi        47Gi       1.0Mi       3.1Gi        49Gi
Swap:             0B          0B          0B
Linux 7f27df5cdb2f 6.1.58+ #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Bug details:

When I run the CLI command below

!./bin/embedding -m /content/nomic-embed-text-v1.Q4_0.gguf -p """Elon Reeve Musk (/ˈiːlɒn/; EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX;"""

with 50 GB RAM, it is roughly 2x faster:

llama_print_timings:        load time =      34.47 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     180.80 ms /    50 tokens (    3.62 ms per token,   276.55 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     302.62 ms /    51 tokens

with 12 GB RAM:

llama_print_timings:        load time =      56.80 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     598.79 ms /    50 tokens (   11.98 ms per token,    83.50 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     732.38 ms /    51 tokens

Observations:

The used memory doesn't change or increase; merely allocating more memory makes it run faster. Why?
This is not just one run: I have done several runs in a loop with different texts, and the behaviour persists.

Expected behaviour
Runtime should not change based on allocated memory, should it?
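
One way to check whether memory pressure is actually involved (a quick sketch, assuming GNU time is installed as /usr/bin/time) is to wrap the run and compare peak RSS and major page faults on both instances:

/usr/bin/time -v ./bin/embedding -m /content/nomic-embed-text-v1.Q4_0.gguf -p "..."

If the reported "Maximum resident set size" and the major page fault count are similar on the 50 GB and 12 GB machines, memory itself is unlikely to be the cause of the slowdown.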

@ggerganov
Member

By "allocating more memory" do you mean creating a VM with more memory?

@PrithivirajDamodaran
Author

Right. I just figured out the pattern: on certain cloud instances, more memory comes with more vCPUs, hence the better performance.

So the real problem is that embedding via llama.cpp takes far more CPU cycles when run in CPU mode. All my attempts are on x86_64 instances.

Is this something you have experienced? I am asking because these models are built for resource-constrained devices, but this behaviour is the exact opposite of that.

Btw, thanks for all your work!
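
A quick way to test the vCPU hypothesis (a sketch, assuming the usual llama.cpp -t / --threads option is available on the embedding binary) is to sweep the thread count on the large instance and see whether prompt eval speed tracks it:

for t in 1 2 4 8; do ./bin/embedding -m /content/nomic-embed-text-v1.Q4_0.gguf -p "..." -t $t; done

If running with a thread count matching the smaller instance's vCPUs roughly reproduces its timings, that would point at CPU count rather than memory.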

@iamlemec
Collaborator

iamlemec commented Mar 3, 2024

I tried running the same command as above with a systemd process-level memory limit. It was fine all the way down to around 150M, where the speed went down by 3x, presumably due to swapping.

Maybe you could try this on your high memory instance and see if it's the same? If you have systemd, the command is:

systemd-run --scope -p MemoryMax=150M --user ./embedding [...]
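
To separate CPU count from memory, a similar scope can cap the CPUs instead (a sketch, assuming a systemd new enough to support the cgroup v2 AllowedCPUs= property):

systemd-run --scope -p AllowedCPUs=0-3 --user ./embedding [...]

If halving the allowed CPUs on the 50 GB instance roughly reproduces the 12 GB timings, that would support the vCPU explanation rather than a memory one.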

@PrithivirajDamodaran
Author

PrithivirajDamodaran commented Mar 4, 2024

Sorry if I was not clear: as mentioned in the last comment, it's more to do with CPU cycles. (It looked like more memory helped, but it's really the extra CPUs that came with the extra memory; see my machine details above, we have 8 CPUs.)

Here is the hypothesis I am working on:


Token for token, GGML models should be the best for the $ spent on compute if we run end-to-end inference in C++ with no Python bindings. So I am comparing 4-bit quantised GGML models against 8-bit quantised ONNX models.

  • With ONNX, inference runs in Python.
  • With GGML, inference runs end to end in C++.

What I did: to test this, I wrote C++ wrappers to serve bert.cpp (and tried the llama.cpp server as well) on serverless platforms: AWS Lambda (Lambda-specific) and Google Cloud Run (HTTP server).

Observation: for text with the same token count (I tried 64 to 512 tokens), the q4_0 model takes far more CPU cycles than the 8-bit ONNX model. The same behaviour can be seen when running the C++ binaries from the CLI: ./embedding (in llama.cpp) or ./main (in bert.cpp).
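
To put numbers on "far more CPU cycles", one option (a sketch, assuming Linux perf is available; the ONNX script name is a hypothetical placeholder for the Python path) is to measure both runs under perf stat and compare cycles and task-clock:

perf stat -e cycles,instructions,task-clock ./embedding -m /content/nomic-embed-text-v1.Q4_0.gguf -p "..."
perf stat -e cycles,instructions,task-clock python onnx_embed.py   # hypothetical 8-bit ONNX runner

Comparing cycles per token between the two would make the cost difference concrete.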

Variants I tried before concluding

To save time I loaded the model only once and instrumented the code to get tokenize + infer time.

  - Mono app instance, model loaded once, tokenize & encode threads = 8
  - Pool of app instances, one model loaded per instance + async HTTP request processing, tokenize & encode threads = 8
  - Pool of app instances + HTTP thread pool, tokenize & encode threads = 8, plus 5 HTTP threads

The heavy compute appetite of GGML models is consistent across OSes, serverless offerings, and compute instances. I am not saying this is the right conclusion, but if it is true, GGML models are not economically viable (at least for embeddings).

Any pointers? Anything I am missing?

@github-actions github-actions bot added the stale label Apr 4, 2024
@github-actions
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.
