add resources limit for proper scaling #128


Merged
merged 8 commits into master from autoscale_k8s on Sep 26, 2018

Conversation

mag009
Contributor

@mag009 mag009 commented Sep 11, 2018

See plotly/streambed#9865 and plotly/streambed#11037

The reason for removing --request-limit is that currently, when we hit 1000 requests, the server exits but the container stays up; monit handles the restart, and while that happens the container still looks healthy to the LB. So it's possible for a client to connect and hit a connection refused.

To avoid that, I'm limiting the container's CPU and memory resources, so that if the app has a memory leak the container gets killed, becomes unavailable to the LB, and a new container is simply spun up.

The annotation is required for scale-down: if a container spins up on a node where kube-system is running, the annotation tells the autoscaler it's okay to kill that node.
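For illustration, a minimal sketch of the relevant parts of the pod template (names and numbers are placeholders, not the final values being tuned in this PR):

```yaml
# Illustrative sketch only; limits are still being tuned in this PR.
template:
  metadata:
    annotations:
      # Tells the cluster autoscaler this pod may be evicted, so its node can
      # still be removed on scale-down.
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  spec:
    containers:
      - name: imageserver
        resources:
          limits:
            memory: 1Gi   # a leaking process gets OOM-killed and replaced with a fresh container
          requests:
            cpu: 600m
            memory: 1Gi
```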

  • issues with scale-down: it does not down-scale nodes that it should
  • fine-tuning the resources
  • preemptible instances
  • dealing with evicted instances

Tested with:
ab -r -c 100 -n 100000 -p 86eac25f-4de9-4da2-82a9-0c7d28db1454_200.json http://10.128.0.17:9091/

@mag009 mag009 self-assigned this Sep 11, 2018
@mag009 mag009 requested review from scjody and etpinard September 11, 2018 18:28
@etpinard
Contributor

etpinard commented Sep 12, 2018

Thanks @mag009 !

I'll let @scjody review this thing (I don't know much about the deployment/ folder).

One thing to keep in mind: I'd like to upgrade the Electron version we're using (see #125) in the short term. So maybe you could run your tests using the electron_2.0.8 orca image on quay.io? That would be much appreciated. Thanks!

@mag009
Contributor Author

mag009 commented Sep 12, 2018

I was able to stress test and so far so good. No crashes in 30 minutes, for a total of 26k requests at an average rate of 17 req/s.

I'm using preemptible VMs with shared CPUs, which can each handle ~2 req/s.

I still have a minor issue: I'm getting connection refused when it adds a container, so I probably need to adjust the health check.
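One common way to handle that (a sketch only, assuming a probe on the http-server port, not a description of the current config) is a readiness probe that only marks the pod ready once the port actually accepts connections:

```yaml
# Illustrative readiness probe; delays and thresholds are assumptions to tune.
readinessProbe:
  tcpSocket:
    port: 9091          # the http-server containerPort
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```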

The test below was performed with a 384K file.

Completed 10000 requests
Completed 20000 requests

Server Software:
Server Hostname: 10.128.0.17
Server Port: 9091

Document Path: /
Document Length: Variable

Concurrency Level: 15
Time taken for tests: 1481.105 seconds
Complete requests: 26407
Failed requests: 0
Total transferred: 3235762787 bytes
Total body sent: 10379777012
HTML transferred: 3232488319 bytes
Requests per second: 17.83 [#/sec] (mean)
Time per request: 841.314 [ms] (mean)
Time per request: 56.088 [ms] (mean, across all concurrent requests)
Transfer rate: 2133.49 [Kbytes/sec] received
6843.88 kb/s sent
8977.37 kb/s total

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 7 20.1 2 292
Processing: 137 833 1475.3 315 8102
Waiting: 133 822 1467.5 306 8032
Total: 140 840 1482.5 321 8180

Percentage of the requests served within a certain time (ms)
50% 321
66% 421
75% 530
80% 616
90% 2638
95% 5146
98% 6062
99% 6462
100% 8180 (longest request)

@etpinard
Contributor

I was able to stress test and so far so good.

Nice!

The test below was performed with a 384K file

Can you point us to that file? It might be nice to run the tests on a collection of plotly.js mocks or using "real-life" image server requests.

Last spring, @scjody used this thing, which could be useful to you.

Contributor

@scjody scjody left a comment

Thanks for your work on this so far!

Some general comments:

  • This should definitely be tested with many real-world requests, including requests known to fail.
  • It would be helpful to split your changes across multiple commits, with details on why a change is being made in the commit text. For example the podAffinity section could be its own commit. This makes reviewing easier, and makes it easier to understand why something was done in a certain way months/years down the road. (git commit --patch and related commands can help here.)
  • Based on what you said on Slack I believe this is a WIP. Please label WIP PRs as WIP in the description when you open the PR. (You can edit this afterwards to remove the WIP when the PR is ready.)
  • It looks like you're missing prod versions of all these changes. (We could certainly move this to a Helm chart in a future PR to remove this duplication, but for now this needs to be done.)

@@ -17,6 +19,15 @@ spec:
tier: frontend
spec:
affinity:
podAffinity:
Contributor

I don't understand what this is doing. Can you please explain or point to some documentation?

Contributor Author

Since we have local storage mounted, it was preventing the autoscaler from scaling down and deleting the node.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node

Contributor Author

Actually, I just realised your question was about the podAffinity.

It's actually wrong; it should be podAntiAffinity. It's there to make sure that when it scales down, the pods stay spread across multiple zones.
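For reference, a sketch of that kind of rule, assuming the existing app: imageserver label and the zone label GKE nodes carried at the time:

```yaml
# Prefer spreading imageserver pods across zones instead of packing one zone.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: imageserver
          topologyKey: failure-domain.beta.kubernetes.io/zone
```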

Contributor

OK, that makes more sense, thanks.

resources:
limits:
cpu: 600m
memory: 1Gi
Contributor

Why limit to so little memory? The nodes have 3.75 GB available, and only one imageserver pod should be running on each node.

Contributor Author

I'm testing with the preemptible 1.7 GB instances, just so I don't spend too much money testing the autoscaling. I'll adjust the memory accordingly once I'm happy with my PR.

Contributor

OK, sounds good.

@@ -43,6 +54,13 @@ spec:
ports:
- name: http-server
containerPort: 9091
resources:
limits:
cpu: 600m
Contributor

Do we need to limit CPU usage? Why not let the pod use as much CPU as is available?

Contributor Author

No, I was testing at the time. I will remove the CPU limit.

@@ -14,5 +14,5 @@ spec:
# Set this to 3x "min-nodes":
minReplicas: 3
# Set this to 3x "max-nodes":
maxReplicas: 3
maxReplicas: 6
Contributor

Either the comment needs to be updated, or something else...

Contributor Author

noted!

@@ -26,7 +26,7 @@ fi
pkill Xvfb
pkill node

xvfb-run --auto-servernum --server-args '-screen 0 640x480x24' ./bin/orca.js serve --request-limit=1000 --safe-mode $PLOTLYJS_ARG $@ 1>/proc/1/fd/1 2>/proc/1/fd/2 &
xvfb-run --auto-servernum --server-args '-screen 0 640x480x24' ./bin/orca.js serve --safe-mode $PLOTLYJS_ARG $@ 1>/dev/stdout 2>/dev/stderr &
Contributor

Is there a reason to change to /dev/stdout and /dev/stderr? This wrapper is being run via monit, so stdout and stderr of the monit process are not necessarily the right place for this output.

Contributor Author

yes, never mind that.

@scjody
Contributor

scjody commented Sep 13, 2018

I'm also concerned by the idea of using preemptible VMs. According to this document:

[...] any or all of your Compute Engine instances might be preempted and become unavailable. There are no guarantees as to when new instances become available.

My understanding of this is that with preemptible instances, we could lose all our nodes and have no replacement nodes. Do you have any sources that contradict this understanding?

@mag009 mag009 changed the title add resources limit for proper scaling WIP add resources limit for proper scaling Sep 13, 2018
@mag009
Contributor Author

mag009 commented Sep 13, 2018

I'm also concerned by the idea of using preemptible VMs. According to this document:

[...] any or all of your Compute Engine instances might be preempted and become unavailable. There are no guarantees as to when new instances become available.

My understanding of this is that with preemptible instances, we could lose all our nodes and have no replacement nodes. Do you have any sources that contradict this understanding?

You're right about that. Even with autoscaling there's a chance that we lose all instances at the same time in every zone. A slim chance, but still. I guess for stage we don't really care if that happens, but for prod we can't take that chance.

What I'd like to do is scale with preemptible instances and keep a minimum of 3 running on non-preemptible VMs, but I guess that should be a separate issue.

@mag009
Contributor Author

mag009 commented Sep 13, 2018

I still have a minor issue: I'm getting connection refused when it adds a container, so I probably need to adjust the health check.

Updated: #41

@scjody
Contributor

scjody commented Sep 13, 2018

Let's stick with regular instances for now. If we run into significant cost issues we can consider preemptible instances, but we don't need it right now. Autoscaling alone should provide significant savings.

Please let me know when you're ready for a re-review on this!

@mag009
Contributor Author

mag009 commented Sep 17, 2018

@scjody ready for review.

Just so you know, I've created a dedicated pool for the imageserver. The reason is I want to avoid mixing kube-system with the default services. For example, heapster is a critical component for autoscaling: when I introduced load, I ran into an issue where heapster stopped responding and autoscaling stopped.

Procedure to deploy in prod:

  • create a new pool with gcloud named: imageserver
  • kubectl replace -f kube/prod

@mag009 mag009 changed the title WIP add resources limit for proper scaling add resources limit for proper scaling Sep 17, 2018
Contributor

@scjody scjody left a comment

I don't understand the reasons for all your changes. In future, please make smaller commits and explain your reasoning in the commit comments. As a guideline, any time you use "and" in a commit message is a sign it should be split up 😸

Have you tested this with a variety of real-world requests? Can you please include details of your testing somewhere?

I'm also not sure it's a good idea to create a new pool for these nodes. This will mean we have 3 nodes that exist just to serve Kubernetes internal purposes, which is pretty wasteful. Are you sure there isn't another solution? People were using kubernetes with autoscaling for a while before pools were implemented.

strategy:
rollingUpdate:
maxSurge: 100%
maxUnavailable: 1
Contributor

Would it be reasonable to make this higher? In streambed we upgrade 25% of our nodes at a time (or that was the intention anyway), and if we lose an availability zone that's 33% of our nodes.

Contributor Author

If we have 3 pods running, it will spin up 3 new ones with the latest image and then kill the old ones one by one.

Contributor

OK, but if we have 15 pods running it still kills them one by one, right? Wouldn't it make sense to kill more at a time?

Contributor Author

Yes, it's either a fixed number or a percentage, so we can probably set it to 50%.

Contributor

30% would be safer, unless you can guarantee that Kubernetes will wait for all the new nodes to become available before starting to remove nodes. (We wouldn't want to end up with 50% of the required number of nodes.)

Contributor Author

It actually waits for the new pods to be ready before it starts killing the old ones. The default is 25%; we could also just leave it at the default.
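For reference, the knobs being discussed look roughly like this (25% is the Kubernetes default for both fields when they are left unset):

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # extra pods allowed above the desired count during a rollout
    maxUnavailable: 25%  # pods allowed to be unavailable during a rollout
```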

Contributor

OK, the default sounds reasonable.

template:
metadata:
labels:
app: imageserver
tier: frontend
spec:
affinity:
Contributor

Don't we still need antiAffinity to prevent two pods from ending up on the same node? Or are you counting on resource limits to do that? (Smaller commits, and explaining your reasoning in the commit comment would help here...)

Contributor Author

I'm counting on the resource limits to do that, and if we do switch to larger instances then we won't care if two spin up on the same server.

Contributor

When I designed this initially I couldn't find a way to set the resource requests such that one and only one imageserver process could occupy a node, but also allow Kubernetes internal pods to occupy that node.

Is there a way to do it now? This will be an issue if we want to have imageservers in the default node pool, and I think we do.
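One possible approach (a sketch with assumed numbers, not a tested config): request just over half of a node's allocatable memory, so a second imageserver can never be scheduled on the same node while small kube-system pods still fit.

```yaml
# Assumption: roughly 2.7 GB allocatable per 3.75 GB node after system reservations.
# Requesting ~1.5 Gi means two imageserver pods cannot fit on one node,
# while leaving headroom for kube-system pods (heapster, kube-proxy, etc.).
resources:
  requests:
    memory: 1536Mi
  limits:
    memory: 1536Mi
```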

Contributor Author

That's the case now; that's why I'm using a new pool, like we have for redis. This way kube-system won't be allowed to run there, so we won't affect our critical pods.

The only way I saw it done was with a toleration + taint.

I don't see why we would want imageservers to run on the default pool. Any reason?
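For context, the taint + toleration approach mentioned above would look roughly like this (pool and key names are illustrative):

```yaml
# The imageserver node pool is tainted, e.g.:
#   kubectl taint nodes <node-name> dedicated=imageserver:NoSchedule
# and only pods that tolerate the taint can be scheduled there:
tolerations:
  - key: dedicated
    operator: Equal
    value: imageserver
    effect: NoSchedule
```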

Contributor

I explained my concerns about adding a new node pool briefly in the last paragraph here: #128 (review)

You don't need tolerations and taints to prevent two pods from occupying the same node. That's what the podAntiAffinity statement you're removing was doing, and it was working.

Contributor Author

  • If I use podAntiAffinity to keep the imageserver off the kube-system nodes, we'll end up with the same thing: dedicated machines for kube-system, but sharing the same pool. I don't mind doing that; it's just simpler to use a "backend" pool and put everything related in there.

  • If I set a podAntiAffinity for the imageserver, we have to make sure to apply it to every app, which is where I think a pool makes sense.

  • Another situation: when it scales down, it evicts the kube-system service that is on the node and it restarts elsewhere; in the case of heapster we lose 5 minutes of metrics. Not a big deal, but we might end up with lots of gaps in our graphs.

Contributor

I'm just suggesting using podAntiAffinity to prevent two imageserver pods from occupying the same node like we do now.

I still don't understand why having Kubernetes internal pods on the same nodes as the imageserver pods is a big deal. They don't use significant amounts of resources, do they? I do understand your concern about losing metrics, but I think it's worth trying anyway. I'm surprised Kubernetes isn't designed to scale down by terminating other nodes rather than these, but it sounds like something we have to live with.

Contributor Author

I've changed it back to the way it was, using the default-pool, and re-added the podAntiAffinity to prevent two imageservers on the same host.

  • I've re-run the stress test and saw no issues with heapster.
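For reference, a minimal sketch of that rule (matching on the existing app: imageserver label):

```yaml
# Allow at most one imageserver pod per node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: imageserver
        topologyKey: kubernetes.io/hostname
```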

minReplicas: 12
# Set this to 3x "max-nodes":
minReplicas: 3
# Set this to 12x "max-nodes":
Contributor

Can you please explain the reason for this change?

Contributor Author

This is for scale-down: we want to have a minimum of 3 instances when the load is low.

Contributor

I mean changing maxReplicas to be 12x "max-nodes".

Contributor Author

I might have to increase it; we peaked at 16 last night.

Contributor

Why are we changing this from 3x "max-nodes" to 12x or 16x "max-nodes"? Unless something else changed, the "max-nodes" variable set in GKE sets the maximum number of nodes per zone, and so with 3 zones we want to multiply this number by 3 to get "maxReplicas".

"minReplicas" works the same way except for "min-nodes".

If things have changed (as a result of some change elsewhere in GKE, or as a result of your work), please explain what's changed.
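A sketch of that convention, assuming min-nodes=1 and max-nodes=2 per zone across the 3 zones (the CPU target is a placeholder, not the value used in this repo):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: imageserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: imageserver
  # Set this to 3x "min-nodes" (3 zones x 1 node):
  minReplicas: 3
  # Set this to 3x "max-nodes" (3 zones x 2 nodes):
  maxReplicas: 6
  targetCPUUtilizationPercentage: 80
```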

Contributor Author

It's still 3x "max-nodes"; I'll fix the comment 👍

@mag009
Contributor Author

mag009 commented Sep 18, 2018

PR #130 fixes the Travis CI.

Contributor

@scjody scjody left a comment

Thanks for the changes!

Looks like you'll need to work with @etpinard to get #130 merged first, in order to build the image you need to deploy this.

Have you tested this with a variety of real-world requests? Can you please include details of your testing somewhere?

- us-central1-a
- us-central1-b
- us-central1-c
podAffinity:
Contributor

Is this right? We discussed it over here: #128 (comment) and you said it should be podAntiAffinity.

If this is right as written, can you please explain what it's doing and how it works?

Contributor Author

I'm going to check with @etpinard and make sure the CI passes before merging and deploying.

I've tested with the following examples: https://drive.google.com/open?id=19_bM6OPBQ-T74qZbz32uSpD50DDloXJs and the folder jody-imageserver-test:/home/scjody/full/

You're right about AntiAffinity. I just committed the change.

Contributor

ref #130 (comment)

#130 should get merged soon, merging master into this branch after that should suffice to get the tests to pass again.

Alternatively, you can cherry-pick the 4 commits off #130 into this branch.

Contributor

Thanks for the podAntiAffinity fix.

Contributor

@scjody scjody left a comment

💃 if you're completely confident in the testing you've performed (I still don't feel like I have enough information to evaluate your results myself, but as long as you're confident I'll trust that), and once CI has been fixed.

@etpinard
Contributor

Tests are now ✅ on master.

@mag009
Contributor Author

mag009 commented Sep 26, 2018

@scjody I've processed the success folder and all of the files returned a 200, except files that are too large, which returned a 400: textPayload: "400 - invalid or malformed request syntax (figure data is likely to make exporter hang, rejecting request

@mag009 mag009 merged commit eb75817 into master Sep 26, 2018
@etpinard
Contributor

@mag009 did you test this out using Electron v2 after all?

@mag009
Contributor Author

mag009 commented Sep 26, 2018

@etpinard yes, I did, but the memory issue is still present in 2.0.9. I've only tested a few files, not the entire success folder.

@etpinard
Contributor

etpinard commented Sep 26, 2018

OK, great. Well, if the memory issues aren't worse using Electron 2.0.9, we should be updating.

@mag009 can you test the entire success folder using Electron 2.0.9, or write down the steps to do so?

@etpinard etpinard deleted the autoscale_k8s branch December 28, 2018 17:47