About the role
CircleCI is seeking a Staff Site Reliability Engineer to work closely with our Software Engineers to deliver and manage the high-performance, scalable infrastructure underlying our multi-tenant Cloud offering as well as our Server-installed, on-premises solution. You will automate and optimize infrastructure by building appropriate tooling, and you will also help software engineers during the design phase to optimize their services for scale in our production environment.
The CircleCI SRE team is globally distributed and remote-friendly. We take advantage of multiple time zones to manage a platform for our global customer base.
What will make you successful:
- Experience managing a container-based microservice architecture, including orchestration, service discovery, monitoring, and debugging
- Understanding of standard networking protocols and concepts such as TCP/IP, HTTP, DNS, ICMP, the OSI model, subnetting, and load balancing
- In-depth knowledge of operating systems (processes, threads, IPC, concurrency, locks, mutexes, semaphores, etc.)
- Proficiency in one or more of: C, C++, Java, Python, Go
- Comprehensive knowledge of the internal workings of at least one of Postgres, Mongo, Redis
- Systematic problem-solving approach, coupled with a strong sense of ownership and drive
- Track record of working cooperatively with software engineering teams
- Focus on security in the delivery of all levels of a system
- Passion for modern software development and operation, including agile, CI/CD, and infrastructure-as-code
- Desire to learn and grow
- 6+ years of experience
What you'll do:
- Design and deliver solutions to improve the availability, scalability, latency, and efficiency of CircleCI’s services
- Engage in service capacity planning and demand forecasting, anticipating performance bottlenecks
- Diagnose and resolve production issues in conjunction with software engineering teams
- Architect and implement shared infrastructure used by all services within the CircleCI platform, for both SaaS and on-prem configurations
- Support and advise software engineering teams in the design of scalable services
- Build and maintain tools for deployment, monitoring, and debugging
- Plan and execute disaster recovery drills
- Participate in rotating on-call duties, including incident management