Senior Site Reliability Engineer (SRE)

Salary: not specified

Experience: 3–6 years

Full-time employment

Schedule: 5/2

Working hours: 8

Work format: remote

Keydev is on the lookout for a Site Reliability Engineer to join our Infrastructure and Operations Department.

Responsibilities:

  • Design, deploy, and maintain observability platforms including Zabbix, Grafana, and the OpenSearch stack (OpenSearch, Logstash, Kibana).
  • Implement and maintain metrics, logs, traces, and synthetic monitoring across infrastructure and applications.
  • Integrate Prometheus, Alertmanager, and OpenTelemetry where applicable to achieve unified observability.
  • Maintain monitoring coverage for Linux, network devices, applications, and cloud services.
  • Maintain and enhance the overall monitoring and logging infrastructure, including capacity, performance, and reliability.
  • Develop meaningful dashboards and alerting logic to ensure timely and actionable incident notifications.
  • Optimize alerting systems: reduce noise, tune thresholds, and focus on critical business and technical metrics.
  • Improve observability processes and implement predictive failure analysis and early-warning signals.
  • Analyze incidents, identify patterns, and drive proactive monitoring improvements.
  • Define and maintain KPIs, SLIs, SLOs, and SLA measurement processes in coordination with service owners.
  • Enhance reliability through structured incident management and post-mortem analysis.
  • Automate deployment and configuration of monitoring components using Ansible and Terraform, following IaC principles.
  • Manage configuration templates and Zabbix host provisioning through the same automation tools (Ansible, Terraform).
  • Leverage APIs and scripting (e.g., Python, Go) for data collection, integrations, and automation.
  • Collaborate closely with Developers, System Engineers, DevOps, and IT Operations teams to improve system reliability and reduce MTTR.
  • Establish and evolve the Monitoring & Diagnostics foundation for the in-house 24/7 App Support team, including tooling, processes, knowledge base, training, runbooks, and troubleshooting guides.
  • Create intelligent, step-by-step troubleshooting instructions to speed up incident resolution.

Requirements:

  • 4+ years of experience as an SRE, Monitoring Engineer, or similar role in production environments.
  • Advanced Linux user with strong command-line and diagnostic skills.
  • Strong understanding of monitoring, logging, and observability concepts (metrics, logs, traces, SLIs/SLOs, alerting).
  • Hands-on experience with at least several of the following: Zabbix, Prometheus, Grafana, Elastic Stack (ELK), Alertmanager, OpenTelemetry.
  • Experience managing both cloud-based and on-premise environments.
  • Automation skills using Python or Go.
  • Proficiency with configuration management / IaC tools (Ansible, Terraform or similar).
  • Solid grasp of networking principles and protocols (TCP/IP, HTTP, DNS, load balancing, etc.).
  • Experience with CI/CD pipelines (GitLab, Jenkins or similar).
  • Familiarity with container orchestration (Kubernetes, Rancher).
  • Experience documenting workflows and training support teams.
  • Proven skills in incident analysis, pattern recognition, and driving preventive improvements.
  • Good communication skills and the ability to work with cross-functional teams.

Nice to Have:

  • Experience with synthetic monitoring tools and user-experience monitoring.
  • Background in capacity planning and performance tuning.
  • Advanced knowledge of ML-driven monitoring and predictive analysis.
  • Experience with automated incident response (self-healing systems).

Benefits:

  • Take advantage of 25 paid calendar vacation days to explore, relax, and unwind.
  • Join us for exciting corporate events that foster team spirit and fun!
  • Indulge in a variety of snacks available in the office.

We will tell you more about all the benefits at the interview :)

This is a prospective position: it is planned but has not been created yet.

Vacancy published on 17 November 2025 in Minsk