add resources limit for proper scaling #128
Conversation
Thanks @mag009! I'll let @scjody review this thing (I don't know much about the deployment side). One thing to keep in mind: I'd like to upgrade the Electron version we're using (see #125) in the short term. So maybe run your tests using the newer Electron as well.
I was able to stress test and so far so good: no crashes in 30 minutes for a total of 26k requests, at an average rate of 17 req/s. I'm using preemptible VMs with shared CPUs, which can handle ~2 req/s. I still have a minor issue: I'm getting connection refused when it adds a container, so I probably need to adjust the health check. The test below was performed with a 384K file (ab run: 10000 requests completed, Document Path: /, Concurrency Level: 15).
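For reference, "adjusting the health check" would presumably mean adding or tuning a readiness probe on the imageserver container so the load balancer only routes traffic once the port accepts connections. A minimal sketch; the probe type and timings are assumptions, only the port comes from this PR:

```yaml
# Hypothetical readiness probe for the imageserver container; port 9091 is the
# containerPort used elsewhere in this PR, the timings are placeholder values.
readinessProbe:
  tcpSocket:
    port: 9091
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```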
Nice!
Can you point us to that file? It might be nice to run the tests on a collection of plotly.js mocks or using "real-life" image server requests. Last spring, @scjody used this thing, which could be useful to you.
Thanks for your work on this so far!
Some general comments:
- This should definitely be tested with many real-world requests, including requests known to fail.
- It would be helpful to split your changes across multiple commits, with details on why a change is being made in the commit text. For example, the `podAffinity` section could be its own commit. This makes reviewing easier, and makes it easier to understand why something was done in a certain way months or years down the road. (`git commit --patch` and related commands can help here.)
- Based on what you said on Slack, I believe this is a WIP. Please label WIP PRs as WIP in the description when you open the PR. (You can edit this afterwards to remove the WIP when the PR is ready.)
- It looks like you're missing `prod` versions of all these changes. (We could certainly move this to a Helm chart in a future PR to remove this duplication, but for now this needs to be done.)
deployment/kube/stage/frontend.yaml (outdated)
@@ -17,6 +19,15 @@ spec:
        tier: frontend
    spec:
      affinity:
        podAffinity:
I don't understand what this is doing. Can you please explain or point to some documentation?
Since we have local storage mounted, it was preventing the autoscaler from scaling down and deleting the node.
Actually, I just realised your question was about the podAffinity.
It's actually wrong: it should be `podAntiAffinity`. It's there to make sure that when it scales down, the remaining pods stay spread across multiple zones.
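For context, a zone-spreading `podAntiAffinity` rule typically looks something like the sketch below; the app label mirrors the labels in these manifests, but the exact rule in this PR may differ:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: imageserver
          # Prefer spreading imageserver pods across zones so that after a
          # scale-down the remaining pods are not all in one zone.
          topologyKey: failure-domain.beta.kubernetes.io/zone
```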
OK, that makes more sense, thanks.
deployment/kube/stage/frontend.yaml (outdated)
          resources:
            limits:
              cpu: 600m
              memory: 1Gi
Why limit to so little memory? The nodes have 3.75 GB available, and only one imageserver pod should be running on each node.
I'm testing with preemptible instances, the 1.7 GB ones, just so I don't spend too much money on testing the autoscaling. I will adjust the memory accordingly once I'm content with my PR.
OK, sounds good.
deployment/kube/stage/frontend.yaml (outdated)
@@ -43,6 +54,13 @@ spec:
        ports:
          - name: http-server
            containerPort: 9091
        resources:
          limits:
            cpu: 600m
Do we need to limit CPU usage? Why not let the pod use as much CPU as is available?
No, I was testing at that time. I will remove the CPU limit.
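Putting the two review comments together (memory headroom and no CPU limit), the resources block would presumably end up looking something like the sketch below; the numbers are placeholders, not the values that landed in the PR:

```yaml
resources:
  requests:
    # Request enough CPU/memory for the scheduler to place one imageserver
    # pod per node; these numbers are placeholders.
    cpu: 500m
    memory: 2Gi
  limits:
    # No CPU limit: the pod may use whatever CPU is free on the node.
    # A memory limit below the node's 3.75 GB means a leaking process gets
    # OOM-killed and replaced instead of starving the node.
    memory: 3Gi
```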
@@ -14,5 +14,5 @@ spec:
   # Set this to 3x "min-nodes":
   minReplicas: 3
   # Set this to 3x "max-nodes":
-  maxReplicas: 3
+  maxReplicas: 6
Either the comment needs to be updated, or something else...
noted!
deployment/run_server (outdated)
@@ -26,7 +26,7 @@ fi
 pkill Xvfb
 pkill node

-xvfb-run --auto-servernum --server-args '-screen 0 640x480x24' ./bin/orca.js serve --request-limit=1000 --safe-mode $PLOTLYJS_ARG $@ 1>/proc/1/fd/1 2>/proc/1/fd/2 &
+xvfb-run --auto-servernum --server-args '-screen 0 640x480x24' ./bin/orca.js serve --safe-mode $PLOTLYJS_ARG $@ 1>/dev/stdout 2>/dev/stderr &
Is there a reason to change to `/dev/stdout` and `/dev/stderr`? This wrapper is being run via monit, so `stdout` and `stderr` of the monit process are not necessarily the right place for this output.
yes, never mind that.
I'm also concerned by the idea of using preemptible VMs. According to this document:
My understanding of this is that with preemptible instances, we could lose all our nodes and have no replacement nodes. Do you have any sources that contradict this understanding?
You're right about that. Even with autoscaling, there's a chance that we lose all instances at the same time in every zone. Slim chance, but still. I guess for stage we don't really care if that happens, but for prod we can't take that chance. What I'd like to do is scale with preemptible instances but keep a minimum of 3 running on non-preemptible VMs, but I guess that should be in a separate issue.
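If that mixed setup were ever pursued, one way to express "prefer non-preemptible nodes" is a node-affinity preference on GKE's preemptible node label; this is a sketch of the idea only, not something in this PR:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            # GKE labels preemptible nodes with cloud.google.com/gke-preemptible;
            # prefer nodes that do not carry that label.
            - key: cloud.google.com/gke-preemptible
              operator: DoesNotExist
```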
Updated: #41
Let's stick with regular instances for now. If we run into significant cost issues we can consider preemptible instances, but we don't need them right now. Autoscaling alone should provide significant savings. Please let me know when you're ready for a re-review on this!
@scjody ready for review. Just so you know, I've created a dedicated pool for the imageserver. The reason is that I want to avoid mixing kube-system with the default services. For example, heapster is a critical component for autoscaling; I ran into an issue when I introduced load where heapster stopped responding and the autoscaling stopped. Procedure to deploy in prod:
I don't understand the reasons for all your changes. In future, please make smaller commits and explain your reasoning in the commit comments. As a guideline, any time you use "and" in a commit message, that's a sign it should be split up 😸
Have you tested this with a variety of real-world requests? Can you please include details of your testing somewhere?
I'm also not sure it's a good idea to create a new pool for these nodes. This will mean we have 3 nodes that exist just to serve Kubernetes internal purposes, which is pretty wasteful. Are you sure there isn't another solution? People were using Kubernetes with autoscaling for a while before pools were implemented.
deployment/kube/prod/frontend.yaml (outdated)
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 1
Would it be reasonable to make this higher? In streambed we upgrade 25% of our nodes at a time (or that was the intention anyway), and if we lose an availability zone that's 33% of our nodes.
If we have 3 pods running, it will spin up 3 new ones with the latest image and then kill the old ones one by one.
OK, but if we have 15 pods running it still kills them one by one, right? Wouldn't it make sense to kill more at a time?
Yes, it's either a fixed number or a percentage, so we can probably set it to 50%.
30% would be safer, unless you can guarantee that Kubernetes will wait for all the new nodes to become available before starting to remove nodes. (We wouldn't want to end up with 50% of the required number of nodes.)
It actually waits for the new pods to be ready before it starts killing the old ones. The default is 25%; we can also leave it at the default.
OK, the default sounds reasonable.
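For reference, keeping the default would mean the strategy ends up roughly like the sketch below; Kubernetes waits for surged pods to become Ready and never takes more than maxUnavailable old pods out of service at a time:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 100%        # may create up to one full extra set of pods during a rollout
    maxUnavailable: 25%   # Deployment default: at most 25% of desired pods down at once
```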
  template:
    metadata:
      labels:
        app: imageserver
        tier: frontend
    spec:
      affinity:
Don't we still need antiAffinity to prevent two pods from ending up on the same node? Or are you counting on resource limits to do that? (Smaller commits, and explaining your reasoning in the commit comment would help here...)
Counting on the resource limits to do that. And if we do switch to larger instances, then we won't care if they spin up on the same server.
When I designed this initially I couldn't find a way to set the resource requests such that one and only one imageserver process could occupy a node, but also allow Kubernetes internal pods to occupy that node.
Is there a way to do it now? This will be an issue if we want to have imageservers in the default node pool, and I think we do.
It's the case now; that's why I'm using a new pool, like we have for Redis. This way kube-system won't be allowed to run there, so we won't affect our critical pods.
The only way I saw it done was with tolerations + taints (see the sketch below).
I don't see why we would want imageservers to run on the default pool. Any reason?
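Sketch of the taint + toleration approach mentioned above: taint the nodes of the dedicated pool, then add a matching toleration to the imageserver pod spec. The key and value names here are illustrative, not taken from this PR:

```yaml
# Node side (run once per node in the dedicated pool), for example:
#   kubectl taint nodes <node-name> dedicated=imageserver:NoSchedule
#
# Pod side: only pods carrying a matching toleration can be scheduled there.
tolerations:
  - key: dedicated
    operator: Equal
    value: imageserver
    effect: NoSchedule
```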
I explained my concerns about adding a new node pool briefly in the last paragraph here: #128 (review)
You don't need tolerations and taints to prevent two pods from occupying the same node. That's what the `podAntiAffinity` statement you're removing was doing, and it was working.
- If I use `podAntiAffinity` to keep the imageserver off kube-system nodes, we'll end up with the same thing: dedicated machines for kube-system, but sharing the same pool. I don't mind doing that; it's just simpler to use a "backend" pool and put everything related in there.
- If I set a `podAntiAffinity` for the imageserver, we must make sure to apply it to every app; this is where I think a pool makes sense.
- Another situation is that when it scales down, it evicts the kube-system service on that node and it restarts elsewhere. In the case of heapster, we lose 5 minutes of metrics. Not a big deal, but we might end up with lots of gaps in our graphs.
I'm just suggesting using `podAntiAffinity` to prevent two imageserver pods from occupying the same node, like we do now.
I still don't understand why having Kubernetes internal pods on the same nodes as the imageserver pods is a big deal. They don't use significant amounts of resources, do they? I do understand your concern about losing metrics, but I think it's worth trying anyway. I'm surprised Kubernetes isn't designed to scale down by terminating other nodes rather than these, but it sounds like something we have to live with.
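A sketch of what that per-host rule typically looks like: a required `podAntiAffinity` keyed on the node hostname, so two imageserver pods never share a node. The label selector mirrors the app label in these manifests; the exact block in the repo may differ:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: imageserver
        # One imageserver pod per node: a new pod cannot be scheduled onto a
        # node that already runs a pod labelled app=imageserver.
        topologyKey: kubernetes.io/hostname
```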
- I've changed it back the way it was, to use the default-pool, and re-added the `podAntiAffinity` to prevent two imageserver pods on the same host.
- I've re-run a stress test and saw no issue with heapster.
deployment/kube/prod/hpa.yaml (outdated)
-  minReplicas: 12
-  # Set this to 3x "max-nodes":
+  minReplicas: 3
+  # Set this to 12x "max-nodes":
Can you please explain the reason for this change?
This is for scaling down; we want to have a minimum of 3 instances when the load is low.
I mean changing `maxReplicas` to be 12x "max-nodes".
I might have to increase it; we peaked at 16 last night.
Why are we changing this from 3x "max-nodes" to 12x or 16x "max-nodes"? Unless something else changed, the "max-nodes" variable set in GKE sets the maximum number of nodes per zone, and so with 3 zones we want to multiply this number by 3 to get "maxReplicas".
"minReplicas" works the same way except for "min-nodes".
If things have changed (as a result of some change elsewhere in GKE, or as a result of your work), please explain what's changed.
It's still 3x "max-nodes"; I'll fix the comment 👍
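With the comment fixed, the prod hpa.yaml would presumably keep the existing convention, roughly like the sketch below; the API version, names, threshold, and concrete numbers are illustrative (assuming min-nodes=1 and max-nodes=2 per zone, with 3 zones):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: imageserver              # assumed name, for illustration only
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: imageserver            # assumed target Deployment
  # Set this to 3x "min-nodes" (min-nodes is per zone, and there are 3 zones):
  minReplicas: 3
  # Set this to 3x "max-nodes" (max-nodes is per zone, and there are 3 zones):
  maxReplicas: 6
  targetCPUUtilizationPercentage: 50   # placeholder threshold
```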
PR #130 fixes the Travis CI build.
Force-pushed from 24fc895 to a470af2.
deployment/kube/prod/frontend.yaml (outdated)
                  - us-central1-a
                  - us-central1-b
                  - us-central1-c
          podAffinity:
Is this right? We discussed it over here: #128 (comment) and you said it should be `podAntiAffinity`.
If this is right as written, can you please explain what it's doing and how it works?
I'm going to check with @etpinard and make sure the CI passes before merging and deploying.
I've tested with the following examples: https://drive.google.com/open?id=19_bM6OPBQ-T74qZbz32uSpD50DDloXJs and the folder jody-imageserver-test:/home/scjody/full/
You're right about the AntiAffinity; I've just committed the change.
ref #130 (comment)
#130 should get merged soon; merging master into this branch after that should suffice to get the tests to pass again.
Alternatively, you can cherry-pick the 4 commits off #130 into this branch.
Thanks for the `podAntiAffinity` fix.
💃 if you're completely confident in the testing you've performed (I still don't feel like I have enough information to evaluate your results myself, but as long as you're confident I'll trust that), and once CI has been fixed.
Tests are now ✅.
Force-pushed from 0cd58a7 to 1bb3cd2.
@scjody I've processed the success folder and all of them returned a 200, except files that are too large, which returned a 400: textPayload":"400 - invalid or malformed request syntax (figure data is likely to make exporter hang, rejecting request.
@mag009 did you test this out using Electron v2 after all?
@etpinard yes, I did, but the memory issue is still present in 2.0.9. I've only tested a few files, not the entire success folder.
OK great. Well, if the memory issues aren't worse using Electron 2.0.9, we should be updating. @mag009 can you test the entire success folder using Electron 2.0.9, or write down the steps to do so?
See plotly/streambed#9865 and plotly/streambed#11037
The reason for removing the --request-limit is that currently, when we hit the 1000-request limit, the server exits but the container stays up; monit handles the restart, and during that time the container still looks healthy to the LB. So it is possible that a client connects and hits a connection refused.
To avoid that, I'm limiting the container's CPU and memory resources, so that if the app has a memory leak the container gets killed, becomes unavailable to the LB, and a new container is spun up.
The annotation is required for scale-down: if a container spins up on a node where kube-system is running, it tells the autoscaler it's okay to kill that node.
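The annotation referred to here is presumably the cluster-autoscaler eviction hint set in the pod template metadata; a sketch:

```yaml
template:
  metadata:
    labels:
      app: imageserver
      tier: frontend
    annotations:
      # Hint to the cluster autoscaler that this pod may be evicted, so the
      # node it lands on can still be drained and removed during a scale-down.
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```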
Preemptible instances; tested with:
ab -r -c 100 -n 100000 -p 86eac25f-4de9-4da2-82a9-0c7d28db1454_200.json http://10.128.0.17:9091/