Design, deploy, and maintain observability platforms including Zabbix, Grafana, and the OpenSearch Stack (OpenSearch, Logstash, Kibana).
Implement and maintain metrics, logs, traces, and synthetic monitoring across infrastructure and applications.
Integrate Prometheus, Alertmanager and OpenTelemetry where applicable to achieve unified observability.
Maintain monitoring coverage for Linux, network devices, applications, and cloud services.
Maintain and enhance the overall monitoring and logging infrastructure, including capacity, performance, and reliability.
Develop meaningful dashboards and alerting logic to ensure timely and actionable incident notifications.
Optimize alerting systems: reduce noise, tune thresholds, and focus on critical business and technical metrics.
Improve observability processes and implement predictive failure analysis and early-warning signals.
Analyze incidents, identify patterns, and drive proactive monitoring improvements.
Define and maintain KPIs, SLIs, SLOs, and SLA measurement processes in coordination with service owners.
Enhance reliability through structured incident management and post-mortem analysis.
Automate deployment and configuration of monitoring components using Ansible and Terraform, following IaC principles.
Manage configuration templates and Zabbix host provisioning through automation tools (Ansible, Terraform), following IaC principles.
Leverage APIs and scripting (e.g., Python, Go) for data collection, integrations, and automation.
Collaborate closely with Developers, System Engineers, DevOps, and IT Operations teams to improve system reliability and reduce MTTR.
Establish and evolve the Monitoring & Diagnostics foundation for the in-house 24/7 App Support team, including tooling, processes, knowledge base, training, runbooks, and troubleshooting guides.
Create intelligent, step-by-step troubleshooting instructions to speed up incident resolution.

Requirements:

4+ years of experience as an SRE or Monitoring Engineer, or in a similar role, in production environments.
Advanced Linux user with strong command-line and diagnostic skills.
Strong understanding of monitoring, logging, and observability concepts (metrics, logs, traces, SLIs/SLOs, alerting).
Hands-on experience with several of the following: Zabbix, Prometheus, Grafana, Elastic Stack (ELK), Alertmanager, OpenTelemetry.
Experience managing both cloud-based and on-premise environments.
Automation skills using Python or Go.
Proficiency with configuration management / IaC tools (Ansible, Terraform or similar).
Solid grasp of networking principles and protocols (TCP/IP, HTTP, DNS, load balancing, etc.).
Experience with CI/CD pipelines (GitLab, Jenkins or similar).
Familiarity with container orchestration (Kubernetes, Rancher).
Experience documenting workflows and training support teams.
Proven skills in incident analysis, pattern recognition, and driving preventive improvements.
Good communication skills and the ability to work with cross-functional teams.

Nice to Have:

Experience with synthetic monitoring tools and user-experience monitoring.
Background in capacity planning and performance tuning.
Advanced knowledge of ML-driven monitoring and predictive analysis.
Experience with automated incident response (self-healing systems).

Benefits:

Take advantage of 25 paid calendar vacation days to explore, relax, and unwind.
Join us for exciting corporate events that foster team spirit and fun!
Indulge in a variety of snacks available in the office.

We will tell you more about all the benefits at the interview :)

This is a prospective position that is planned to be opened.

Vacancy published on 17 November 2025 in Minsk

Senior Site Reliability Engineer (SRE)
