Kirill Kashin

Experience

SRE Engineer – Observability Aug 2023 – Present

Criteo Limassol, Cyprus

As part of the observability team, led the migration and consolidation of infrastructure during the merger of two companies. Migrated over 50 teams and 600 alerts per hour from Zabbix, unifying the codebase, monitoring tools, and alerting system.
Designed and implemented a CLI application to consolidate over 30 cron jobs, standardize the codebase, migrate secrets to Vault, and integrate with Kubernetes using vault-secrets-webhook for secure secret management.
Developed an SLO framework based on Sloth, integrated with a Kubernetes operator written in Go on the CRD SDK to manage vmalert instances.
Resolved an rsyslog throughput collapse on a bare-metal multi-DC log pipeline by diagnosing uneven NIC IRQ distribution and queue-parallelism-induced memory pressure (perf top showing page-fault dominance), then retuning imptcp and queue dequeue parameters.

SRE Engineer – TCRM Platform Jul 2021 – Aug 2023

Tinkoff Moscow

Migrated more than 30 Java services of the TCRM product to Kubernetes, adopting canary deployments. Assisted in transitioning more than 15 Python services for the T-Messenger product to Kubernetes and GCE, implementing blue-green deployments.
Defined SLAs and reconfigured monitoring around service-specific metrics, achieving 99.9% and 99.95% compliance. Migrated the logging and metrics stack from self-hosted ELK/Prometheus to the Sage platform. Established postmortems for user-impacting incidents and gated on-call rotation behind a qualification test.
Built and maintained an Envoy-based load balancer with control-plane reconfiguration at hundreds of thousands of RPS; shipped Java SDKs for ACL, logging, metrics, profiling, and Swagger.

DevOps Engineer – TCRM Platform Jul 2019 – Jun 2021

Tinkoff Moscow

Participated in the migration from TeamCity and Bitbucket to GitLab. Helped scale services to handle more than ten times the load. Prepared the migration plan from Rancher to Kubernetes.
Consolidated the team’s codebase into a unified CLI tool, streamlining pipeline operations. Established a repository service with configuration files for over 30 services, simplifying deployments.
Designed and implemented a canary release tool supporting gradual application rollouts, partial traffic switching, real-time error monitoring, and automated rollback when alerts fire. Reduced incidents caused by new releases by 90%.

Technical Support Team Lead May 2017 – Jul 2019

Tinkoff Moscow

Led a 24/7 incident-management team supporting the bank's website, mobile apps, and payment services; scaled the team from 5 to 30 across in-office and distributed tiers; built the knowledge base, incident playbooks, and a two-week onboarding program (tests, video, real cases) that brought 20+ engineers to operational level.

Education

National Research Nuclear University Moscow

Bachelor of Computer and Information Systems Security/Information Assurance Aug 2011 – Jul 2015

Skills

Languages: Python, Java, Go

Monitoring: Prometheus, VictoriaMetrics, Grafana, ELK, Loki, OpenTelemetry, Jaeger, rsyslog, Vector

Automation/Scripting: Bash, Jinja2, Jsonnet, Makefiles

CI/CD: GitLab, TeamCity, Jenkins

Containers/Orchestration: Docker, Docker Compose, Kubernetes, Helm, Kustomize

Cloud Platforms: Google Cloud, OpenStack

Databases/Messaging: PostgreSQL, Cassandra, Redis, RabbitMQ

Infrastructure: Ansible, Terraform, Puppet

Load Balancing/Networking: HAProxy, Nginx, Envoy, Istio

Practices: Incident/Alert Management, Postmortems, IaC, SLO/SLI Metrics Design