Serve as Shift Lead for first responders (L1/L2) handling infrastructure service requests and incidents, internal and external calls, and application events or impacts detected in the PayPal Command Center by various monitoring systems.
Take ownership of P0/P1 incidents in PPCC operations and drive them to resolution.
Assess system and service impacts, create recovery plans, manage flex-ups/flex-downs, and approve execution steps for backline teams during critical incident restoration activities and changes.
Work independently and within a team to triage and remediate production system and application incidents.
Develop self-remediation or self-healing tools to protect production infra services. Engage with internal and external partners and enable them to build the right signals/tools for site restoration.
Serve as Shift Lead of the Cloud Services Engineering (CSE) team and work closely with the PayPal Command Center TDO team to investigate the cause of outages and to mitigate and resolve issues affecting hypervisors, physical and virtual infrastructure assets, and infrastructure services.
Shift timings are daytime to early evening IST. Shifts are 12 hours each, alternating between 3 and 4 days per week. Be available as a floater or work flexible hours whenever there is a strong need to cover or augment Cloud Services or IRE team shifts.
Identify recurring problems and work with Problem Management and/or other stakeholders to drive permanent solutions.
Develop and maintain standard operating procedure (SOP) documentation for use by all PayPal operations teams that interact with the live site.
Work in partnership with the Lead SRE to ensure incident details are documented according to the critical incident process.
Technology Skills & Requirements
The System Tech Lead must be well versed in Linux/Unix and virtual infrastructure (public and private cloud). Good understanding of cloud container orchestration platforms, OpenStack, KVM, REST, object-oriented technologies, automation tools such as Ansible/Puppet, load balancers, monitoring tools (Splunk/Nagios/Zabbix/Sensu), and build and process tools (Git, Jenkins, Jira, ServiceNow), plus programming skills in one or more of C, Perl, Python, and Go.
Deep understanding of load balancer pools, availability zones and fail-away principles, and core networking and storage SRE infrastructure operations.
Knowledge of the principles behind Incident Management KPIs: TTD, TTA, TTR, average incident response time, percentage of incidents resolved within a defined timeframe, etc.
Team player with an energetic personality: quality-minded, focused, committed, and able to work independently in a fast-paced, changing environment.
A BS in Computer Engineering, a Master’s in a technology field, or a degree in a related technical field involving infrastructure support, with 10-15 years of computer industry experience, is required.