Welcome to my profile ^^
I'm Aymen and I'm:
- Dedicated Platform / Site Reliability Engineering Leader with over a decade of experience in architecting, building, deploying, and managing large-scale Cloud platforms and Kubernetes/Containers environments.
- Proven track record in driving Cloud Native transformations, enhancing SRE practices, and ensuring secure, efficient, and resilient systems.
kubernetes-sigs member, Gateway API Inference Extension member, Knative Serving WG member
Open for new gigs
- Cloud Systems: Proficient in architecting, deploying, and managing distributed systems on Azure, AWS, GCP, and OpenStack (8 years)
- Kubernetes/Containers: 8 years of impactful production involvement in constructing and maintaining Container and Kubernetes platforms, harnessing the full power of highly scalable Cloud Native Apps.
- Strong Leadership: A rich background encompassing 5 years of leadership experience, including roles as Tech Lead, Team Lead, and Group Lead, showcasing adept people management and exceptional interpersonal skills.
- Demonstrated track record in applying SRE principles, encompassing SLOs, Observability, Monitoring, Alerting, and Incident Management, to optimize system reliability.
- Cloud Native Ecosystem: A 5-year immersion in the Cloud Native landscape, adeptly utilizing Service Mesh, GitOps, Network Policies, Admission Controllers, API Gateways, and more.
- DevSecOps: Well-versed in safeguarding Cloud-Native applications and adept at implementing DevSecOps practices to ensure a robust security posture.
- Modern Platform Engineering: 5 years of solid experience in enabling self-service, GitOps, and internal developer platforms, leveraging tools such as Backstage and Keptn to streamline processes.
- Configuration Management and Infrastructure as Code: A seasoned practitioner, proficient in leveraging Terraform (8 years), Helm, Ansible, Salt, and Chef to orchestrate large-scale and complex (Cloud) infrastructural components.
- Programming: Adept with 8 years of Python development and 3 years of Go development, contributing to toil automation and infra tooling
- 2016 – Built a Ceph-based storage-as-a-service on multi-cloud (AWS, GCP, Azure, OpenStack) delivering an S3-compatible API; automated federated multisite cluster deployment with Python, PostgreSQL, Boto3, and Ansible.
- 2017–2019 – Delivered Kubernetes platforms on GKE for connected commerce at ~150-node scale; stood up CI/CD with GitLab, Helm, and Terraform; shipped multi-cloud Vault/Consul secret management operating across ~500 VMs and containers.
- 2019–2020 – Ran ~1000 microservice instances on Azure AKS for Deutsche Bank’s Yunar App; implemented Istio service mesh, GitOps, edge gateways, and full SLI/SLO practice with Prometheus, Grafana, and EFK.
- 2020–2022 – Led ING/Lendico’s cloud-native transformation on Azure: owned AKS, platform security, and SRE; built in-house infra tooling, GitOps, and 24×7 operations with SLOs and incident management.
- 2022–2025 –
  - Operated 300+ customer AWS environments across >3,000 services.
  - Rebuilt observability to OpenTelemetry + Grafana/Prometheus/Loki/Tempo, saving >$1.5M annually.
  - Designed AI infrastructure for multi-GPU, multi-agent, multi-LLM workloads.
  - Built the Spryker Monitoring Integration Product based on OpenTelemetry.
  - Delivered an internal developer platform with Terraform, GitHub Actions, Atlantis, and ArgoCD.
  - Implemented anomaly detection to cut alert fatigue and speed diagnosis in incident management.
- Terraform
- Python
- Kubernetes Operators