SENIOR SITE RELIABILITY ENGINEER

Svitla Systems Inc. is looking for a Senior Site Reliability Engineer for a full-time position (40 hours per week) in Argentina. Our client is a leading expert network that gives business and government professionals the opportunity to consult industry and subject-matter experts on their research questions. Their customers consult with these experts over the phone, by teleconference, or in person at conferences, custom events, and workshops, or gather primary research data through surveys, polls, and web-based data offerings. Experts are categorized into six main industry sectors: healthcare; financial and business services; consumer goods and services; energy, industrials, and basic materials; tech, media, and telecom; and legal and regulatory. Since 2003, the company has provided its customers with primary research services, helping professionals understand a topic comprehensively before making significant investment or business decisions. Their multinational client list includes nine of the top 10 consulting firms, hundreds of hedge funds, and many of the largest private equity firms and Fortune-ranked companies.

We are seeking a skilled Site Reliability Engineer (SRE) with experience managing production-level SaaS applications hosted on Azure. The ideal candidate will be adept at monitoring, analyzing, and troubleshooting application- and infrastructure-related issues under time pressure.

Requirements

5 years of professional experience.
Proven experience with SaaS applications hosted on Azure.
Proficiency in using Datadog for real-time monitoring, alerts, anomaly detection, and incident management.
Expertise in debugging and troubleshooting production issues using logs and monitoring dashboards.
Strong knowledge of Azure services, including Azure Functions, SQL Server, and Azure App Service, as well as event-driven architecture built on message brokers such as RabbitMQ.
Ability to work under pressure to resolve critical production issues quickly and effectively.
Knowledge of .NET codebases, Single Page Applications (SPA), REST APIs, and event-driven systems.
Strong analytical and problem-solving skills, particularly in high-stakes production environments.

Nice to have

Experience working with event-driven architectures built on message brokers such as RabbitMQ (RMQ).
Familiarity with DevOps practices, CI/CD pipelines, and Infrastructure as Code (IaC) principles.
Experience with Azure DevOps or similar platforms for managing builds and releases.
Ability to work in a fast-paced, collaborative environment while managing multiple priorities.

Responsibilities

Production monitoring & incident management:

Manage, monitor, and maintain production SaaS applications hosted on Azure.
Quickly identify and resolve production issues using Datadog and other monitoring tools.
Perform root cause analysis of production incidents to improve the system’s reliability and performance.
Troubleshoot production systems, analyzing logs and error stacks to diagnose and resolve code- or environment-related issues.

Performance optimization:

Identify bottlenecks and performance issues in the production environment and implement solutions.
Work closely with development teams to ensure applications are optimized for performance and scalability.

Environment debugging and support:

Analyze and debug errors using logs, monitoring dashboards, and performance metrics.
Engage with cross-functional teams, including software development, DevOps, and product teams, to resolve issues.

Azure expertise:

Leverage your knowledge of various Azure services to diagnose and mitigate environment-related issues.
Monitor, manage, and optimize services such as Azure Functions, SQL Server, and event-driven architectures (e.g., RabbitMQ).

Collaboration:

Collaborate with the software development team to ensure the efficient deployment and continuous operation of SaaS applications.
Participate in on-call rotations and respond to incidents, ensuring high availability of services.

Automation & improvement:

Identify areas of improvement in monitoring, alerting, and logging, and implement automation where possible.
Create and maintain reliable infrastructure and deployment pipelines to enhance the stability and performance of services.

We offer

US and EU projects based on advanced technologies.
Competitive compensation based on skills and experience.
Annual performance appraisals.
Remote-friendly culture and no micromanagement.
Bonuses for recommendations of new employees.
Bonuses for article writing, public talks, and other activities.
15 vacation days, 10 national holidays, and sick leave.
Unlimited Platzi training account.
Free webinars, meetups and conferences organized by Svitla.
Fun corporate celebrations and activities.
Awesome team, friendly and supportive community!

About Svitla

Svitla Systems is a trusted global IT solutions company headquartered in California, with business and development offices throughout the US, Latin America, Europe, and Asia. Svitla is an outspoken advocate of workplace flexibility, best known for its well-established remote culture, its individual approach to each teammate’s professional and personal growth, and its family-like environment.

Since 2003, Svitla has served a wide range of clients, from innovative start-ups in California to large corporations such as Ingenico, Amplience, InvoiceASAP, and Global Citizen. At Svitla, developers work directly with clients’ teams, building lasting and successful partnerships through seamless integration with on-site processes.

Svitla Systems’ global mission is to build a business that contributes to the well-being of our partners, personnel and their families, improves our communities, and makes a lasting difference in the world. Join us!

If you are interested in our vacancy, please send your CV.
We will be happy to welcome you to our friendly team :)

Let's meet in person

Mariia Hunchak

Recruiter

Email: m.hunchak@svitla.com

Skype: live:.cid.4f941cafb19a1fb7

Phone: +380633941492

LinkedIn: Mariia Hunchak
