SRE Cloud Native
Job Description:
Mandatory Qualifications:
• Bachelor's and/or Master's degree in CS/EE or a related field.
• 5+ years of hands-on experience as an SRE with a focus on cloud-native technologies.
• Well versed in preparing implementation plans, code walk-throughs, and solution designs.
• Well versed in developing code with a distributed team and resolving code merge issues.
• Hands-on experience deploying, managing and troubleshooting Kubernetes clusters and
components.
• Strong experience configuring and administering Linux systems in cloud/SaaS production
environments.
• A systematic approach to troubleshooting and a desire to address the root cause of
common problems in 24x7 environments.
Preferred Qualifications:
• Experience delivering infrastructure as code with Ansible, Terraform, Git, Jenkins, Helm, and Argo CD.
• Good understanding of test-driven development, continuous integration and delivery.
• Good understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP,
and of troubleshooting network performance issues.
• Experience with monitoring and logging systems such as Prometheus, Grafana, Nagios, and ELK, and
the ability to identify new technologies as appropriate.
• Experience tuning and optimizing storage solutions including Object Storage and NFS.
• Knowledge of virtualization and multiple hypervisor technologies, as well as cloud computing
platforms such as AWS, Azure, and GCP.
• Experience configuring and maintaining web servers, load balancers, databases, storage systems, and
messaging systems.
Good To Have:
• Extensive experience with bare-metal servers.
• Hands-on programming experience in one or more languages, including Golang/Python.
How you will make an impact:
• Assume broad responsibility for the successful delivery of services in a hybrid model, including but not
limited to deployment, configuration, integrations, and ongoing operations.
• Deploy, administer, and manage multiple Kubernetes clusters, both on-prem and in private cloud
environments.
• Lead efforts to triage, debug and fix issues related to network, storage, scheduling, applications, and
systems, for proactive and reactive incident resolution and root cause analysis.
• Develop and continuously improve platform capabilities for observability, monitoring, notifications,
logging, tracing and continuous delivery with reduced toil.
• Develop standard solutions that enable consistency in service delivery and engage with multiple
cross-functional teams to solve problems that impact service levels.
• Collaborate with platform engineers on continuous automation of fleet-wide infrastructure and
application deployments.
• Determine and set SLOs for the service, build the processes and tools to measure and implement
them, and prevent recurring problems and undesirable service conditions.
• Participate in on-call rotation responsibilities.