Keydev is on the lookout for a Site Reliability Engineer to join our Infrastructure and Operations Department.
Responsibilities:
- Design, deploy, and maintain observability platforms, including Zabbix, Grafana, and the OpenSearch stack (OpenSearch, Logstash, Kibana).
- Implement and maintain metrics, logs, traces, and synthetic monitoring across infrastructure and applications.
- Integrate Prometheus, Alertmanager, and OpenTelemetry where applicable to achieve unified observability.
- Maintain monitoring coverage for Linux, network devices, applications, and cloud services.
- Maintain and enhance the overall monitoring and logging infrastructure, including capacity, performance, and reliability.
- Develop meaningful dashboards and alerting logic to ensure timely and actionable incident notifications.
- Optimize alerting systems: reduce noise, tune thresholds, and focus on critical business and technical metrics.
- Improve observability processes and implement predictive failure analysis and early-warning signals.
- Analyze incidents, identify patterns, and drive proactive monitoring improvements.
- Define and maintain KPIs, SLIs, SLOs, and SLA measurement processes in coordination with service owners.
- Enhance reliability through structured incident management and post-mortem analysis.
- Automate deployment and configuration of monitoring components using Ansible and Terraform, following IaC principles.
- Manage configuration templates and Zabbix host provisioning through automation tools (Ansible, Terraform), following IaC principles.
- Leverage APIs and scripting (e.g., Python, Go) for data collection, integrations, and automation.
- Collaborate closely with Developers, System Engineers, DevOps, and IT Operations teams to improve system reliability and reduce MTTR.
- Establish and evolve the Monitoring & Diagnostics foundation for the in-house 24/7 App Support team, including tooling, processes, knowledge base, training, runbooks, and troubleshooting guides.
- Create intelligent, step-by-step troubleshooting instructions to speed up incident resolution.
Requirements:
- 4+ years of experience as an SRE, Monitoring Engineer, or similar role in production environments.
- Advanced Linux user with strong command-line and diagnostic skills.
- Strong understanding of monitoring, logging, and observability concepts (metrics, logs, traces, SLIs/SLOs, alerting).
- Hands-on experience with several of the following: Zabbix, Prometheus, Grafana, Elastic Stack (ELK), Alertmanager, OpenTelemetry.
- Experience managing both cloud-based and on-premise environments.
- Automation skills using Python or Go.
- Proficiency with configuration management / IaC tools (Ansible, Terraform or similar).
- Solid grasp of networking principles and protocols (TCP/IP, HTTP, DNS, load balancing, etc.).
- Experience with CI/CD pipelines (GitLab, Jenkins or similar).
- Familiarity with container orchestration (Kubernetes, Rancher).
- Experience documenting workflows and training support teams.
- Proven skills in incident analysis, pattern recognition, and driving preventive improvements.
- Good communication skills and the ability to work with cross-functional teams.
Nice to Have:
- Experience with synthetic monitoring tools and user-experience monitoring.
- Background in capacity planning and performance tuning.
- Advanced knowledge of ML-driven monitoring and predictive analysis.
- Experience with automated incident response (self-healing systems).
Benefits:
- Take advantage of 25 paid calendar vacation days to explore, relax, and unwind.
- Join us for exciting corporate events that foster team spirit and fun!
- Indulge in a variety of snacks available in the office.
We will tell you more about all the benefits at the interview :)
This is a prospective position that is planned to be opened.
Vacancy posted on 17 November 2025 in Minsk