305 King St W
Suite 1100
Kitchener, ON N2G 1B9
Canada
Senior Site Reliability Engineer Remote
Description
We are seeking a Senior Site Reliability Engineer to join our team. The role focuses on independently managing complex tasks, including enhancing infrastructure and automating development and deployment processes.
The ideal candidate will have several years of experience, demonstrate deep troubleshooting skills, and excel at resolving platform-related issues. This position provides the chance to participate in sprint planning and story grooming sessions, contributing insights on implementation challenges. Under the direction of the Engineering Manager, this role is crucial for maintaining high operational standards and improving our engineering processes.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Responsibilities
- Independently investigate and resolve platform-related issues
- Analyze, create, and improve automation processes
- Craft scripts that automate different tasks
- Participate actively in sprint meetings and story estimation
- Monitor production APM tools, such as Datadog
- Handle the collection and analysis of application logs
- Monitor application and instance alerts to maintain site reliability
- Engage in architecture discussions regarding infrastructure
- Keep applications and libraries used on the platform updated
- Manage code deployment methods and the servers used for deployment
- Provide mentorship and support to fellow engineers
- Conduct code reviews
Requirements
- A minimum of 3 years in a Site Reliability Engineer role or similar
- Proficiency in TypeScript, NodeJS/NestJS, React Native
- Strong background in Python/Django, familiarity with PostgreSQL, Redis
- Competency in CircleCI, Spinnaker, Expo
- Background in administering production application workloads on AWS Cloud
- Understanding of Cloud networks and VPC peering
- Skills in Cloud computing including EC2, SNS/SQS, RDS
- Knowledge of containerization and orchestration with Docker, Kubernetes, EKS
- Expertise in provisioning and configuration tools like Terraform, Ansible
- Proficiency in Linux or Windows server administration
- Capability to integrate monitoring, logging, and alerting into systems
- Excellent at collaboratively debugging complex issues
- Experience with HIPAA compliance and similar standards
- Flexibility to learn quickly and adapt to change
Nice to have
- Knowledge of monitoring tools similar to Datadog
- Familiarity with scripting languages such as Python, Groovy, PowerShell, or Ruby
- Passion for or experience in the health services industry
We offer
- Connectivity Bonus (ARS 15,000 paid at the end of each month with the salary receipt, as a non-remunerative item)
- Private Health Insurance (Medicina Prepaga; covers the employee and their immediate family)
- Paternity Leave (Two days in addition to those established by law, for a total of 4 days)
- Discount card
- English Training (English lessons, twice per week)
- Training Program (Access to multiple customized training plans according to the needs of each role within the company)
- Marriage Bonus (The company doubles the statutory allowance paid by ANSES)
- Referral Program (A referral bonus is paid when a candidate referred by an employee joins the company)
- External Agreements and Discounts
- Vacations: 14 calendar days a year
By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.