1. MUST HAVE — At least 5 years of system administration experience with a high-usage, web-based software service, ideally one built from open-source software components
2. MUST HAVE — Knowledge of AWS services and APIs, including EC2, S3, VPC, and IAM
3. MUST HAVE — Familiarity with alerting and monitoring tools and with system management tools for Linux environments (including Datadog, Nginx, New Relic, Cloudflare, MySQL/PostgreSQL, Apache, iptables, and the ELK stack)
4. NICE TO HAVE — Familiarity with configuration management tools, including Ansible, Chef, or Puppet
5. NICE TO HAVE — Knowledge of deploying, troubleshooting, and tuning Ruby on Rails applications (Passenger, Capistrano, Sidekiq, Bundler)
6. NICE TO HAVE — Knowledge of type-1 hypervisor virtualization (Xen, vSphere)
1. MUST HAVE — Strong communication skills and the ability to coordinate incident response with urgency
2. MUST HAVE — Proper remote presence & etiquette (acknowledging requests over Slack in a timely fashion and never leaving a request unacknowledged)
3. MUST HAVE — Tagging the appropriate person and persistently reminding them every 24 hours until a full resolution is achieved (not having things fall through the cracks)
4. MUST HAVE — Effective adherence to operating procedures (organizing day-to-day work and large-scale tasks in a calm manner with priority-driven sequencing)
— Competitive salary.
— Career and professional growth.
— Cozy, fully equipped office in Ivano-Frankivsk.
— Great work-life balance: flexible working hours, free office lunches, and a remote-friendly setup.
— Paid vacation and a stipend for language classes, gym, IT events, etc.
— Dedicated AWS account (or bare-metal servers, if you prefer) for infrastructure automation testing, development, and general learning
— Retina MacBook Pro or another laptop of your specification, peripherals and displays included
— Books, library & conference budget
— Reliably automate the server provisioning process to reduce the labour of our R&D team
— Build scalable infrastructure that handles high-load, concurrent sessions and supports ~50 million monthly page views and 500k+ active users
— Drive the company through “Disaster Recovery Tests”, in which we manually take down pieces of infrastructure to test overall product resiliency to failures
— Implement the systems and processes that Product Developers use to deploy their software into production
— Build an auto-remediation system to automatically resolve production incidents before escalating them to on-call Developers
— Because of the nature of SRE work, you should also be prepared for on-call shifts and potential “all-hands-on-deck” situations at any hour of the day or night. Minimizing those situations is part of your job!