Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Deploying Qwen2 (or any other GGUF models) into AWS Lambda

Notifications You must be signed in to change notification settings

BuddyLim/qwen2-in-a-lambda

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Qwen in a Lambda

Updated at 11/09/2024

(Marking the date because of how fast LLM APIs in Python move and may introduce breaking changes by the time anyone else reads this!)

Intro:

  • This is a minor research on how we can put Qwen GGUF model files into AWS Lambda using Docker and SAM CLI

  • Adapted from https://makit.net/blog/llm-in-a-lambda-function/

    • As of September '24, some required OS packages are not included in the above guide and subsequently in the Dockerfile as potentially the llama-cpp-python @ 0.2.90 does not include the required OS packages (?)
    • Who knows if there's anything new and breaking that will appear in the future :shrugs:

Motivation:

  • I wanted to find out if I can reduce my AWS spending by only leveraging on the capabilities of Lambda and not Lambda + Bedrock as both services would incur more costs in the long run.

  • The idea was to fit a small language model which wouldn't be as resource intensive relatively speaking and to, hopefully, receive subsecond to second latency on a 128 - 256 mb memory configuration

  • I wanted to use also GGUF models to use different levels of quantization to find out which is the best performance / file size to be loaded into memory

    • My experimentation lead to me using Qwen2 1.5b Q5_K_M as it had the best "performance" and "latency" locally to receive prompt and spit out JSON structure using llama-cpp

Prerequisites:

  • Docker
  • AWS SAM CLI
  • AWS CLI
  • Python 3.11
  • ECR permissions
  • Lambda permissions
  • Download qwen2-1_5b-instruct-q5_k_m.gguf into qwen_fuction/function/
    • Or download any other .gguf models that you'd like and change your model path in app.y / LOCAL_PATH

Setup Guide:

  • Install pip packages under qwen_function/function/requirements.txt (preferably in a venv/conda env)
  • Run sam build / sam validate
  • Run sam local start-api to test locally
  • Run curl --header "Content-Type: application/json" \ --request POST \ --data '{"prompt":"hello"}' \ http://localhost:3000/generate to prompt the LLM
    • Or use your preferred API clients
  • Run sam deploy --guided to deploy to AWS
  • This will deploy a cloudformation stack consisting of an API gateway and a Lambda function

Metrics

  • Localhost - Macbook M3 Pro 32 GB

alt text

  • AWS

    • Initial config - 128mb, 30s timeout

      • Lambda timed out! Cold start was timing out the lambda
    • Adjusted config #1 - 512mb, 30s timeout

      • Lambda timed out! Cold start was timing out the lambda
    • Adjusted config #2 - 512mb, 30s timeout

      • Lambda timed out! Cold start was timing out the lambda

alt text

  • Adjusted config #3 - 3008mb, 30s timeout - cold start

alt text

  • Adjusted config #3 - 3008mb, 30s timeout - warm start

alt text

Observation

  • Referring back to the pricing structure of Lambda,

    • Pricing
    • 1536 MB / 1.465 s / $0.024638 over 1000 Lambda invocations
      • Qwen2 1.5b had me cranking up the memory to 3008mb just to not time out and receive 4 - 11 seconds latency response!
    • Claude 3 Haiku / $0.00025 / $0.00125 over 1000 input tokens & 1000 output tokens / Asia - Tokyo
  • It may be cheaper to just use a hosted LLM using AWS Bedrock, etc.. on the cloud as the pricing structure for Lambda w/ Qwen does not look more competitive compared to Claude 3 Haiku

  • Furthermore, the API gateway timeout is not easily configurable beyond the 30s timeout, depending on your usecase, this may not be very ideal

  • Results via local is dependant on your machine specs!! and may heavily skew your perception, expectation vs reality

  • Depending on your use case also, the latency per lambda invocation and responses might incur poor user experiences

Conclusion

All in all, I think this was a fun little experiment even though it didn't quite pan out to the budget & latency requirement via Qwen 1.5b for my side project. Thanks to @makit again for the guide!

About

Deploying Qwen2 (or any other GGUF models) into AWS Lambda

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published