Job Id: 20201006018
Company: Phenom People
Job Role: Command Center Lead
Experience: 7+ Years
Job Location: Hyderabad
Salary: Best in Industry
Vacancies: Not Mentioned
Job Description: Phenom People Careers job vacancies for Command Center Lead in October 2020:
● Responsible for monitoring an organization’s servers, networks, and computer systems for irregularities and performance issues.
● Assess system data and error logs, along with user reports, to determine areas for improvement or repair. In this aspect of the role, the lead may also determine when systems or servers are due for upgrades.
● Monitor environments, technical assets and/or services for behavior or performance outside of standards or SLAs. Identify potential causes and evaluate impact on infrastructure, delivery or services. Determine appropriate next steps (e.g. closer monitoring, further review or immediate action). Alert appropriate team (per process) when a threshold has been reached or a change/failure has occurred. Provide advice and guidance to others in monitoring and analysis of assets, systems and services.
● Provide oversight, technical direction, and expertise to the other Command Center teams on data analysis, monitoring tools and processes, and event detection.
● Responsible for major IT systems incident management from initiation until an acceptable workaround is in place or the incident is resolved.
● Responsible for training team members and putting processes and procedures in place to support the system and handle critical incidents.
● Coordinate appropriate resources to resolve critical incidents in accordance with service level agreements and operational level agreements.
● Own all communication during a major system outage, ensuring IT management and the businesses are kept updated until the incident is resolved.
● With thorough understanding of technology assets/environments/services, business needs and SLAs, lead the creation, revision and implementation of monitoring tools, processes and reports.
● Regularly review and identify process improvement opportunities and implement changes in collaboration with process owners and other technology functions. Champion and provide oversight to ensure adherence to established processes, tools and methodologies.
● Help establish availability, reliability, and maintainability requirements for environments, technical assets, and services.
● Review availability information and identify developing issues and opportunities for improvement. Ensure effective hand-offs with appropriate technology function(s). Provide input into and drive availability improvement plans.
● Document concerns and findings, collecting all pertinent data (including a comparison of exception data and normal data). Ensure incident/event tracking tools are current (per established guidelines and procedures). Review, improve and champion the accuracy and maintenance of knowledge base content and the known error database.
● Develop and implement on-call schedules for the Command Center team.
● Broad experience in troubleshooting large-scale distributed systems covering application, OS, networking and storage areas.
● Self-motivated and proactive, with demonstrated creative and critical thinking capabilities
● Lead a team of tight-knit, super smart engineers passionate about large-scale distributed systems
● Seek out every path to support and improve your team’s happiness, engagement, and effectiveness
● Champion a culture of learning, continuous improvement, and blameless retrospection within your team and across the company
● Mentor and grow your junior engineers, and empower and unblock your senior ones
● Strategic relationship and partnership building skills
● Excellent time management, organizational, and communication skills.
● Familiarity with cloud support engineering practices.
● Conversant with a wide range of relevant tools and technologies! We don’t expect managers to be writing code in their day-to-day job, but your interests should include some or all of: AWS, GCP, C++, Go, Kubernetes, CI/CD, distributed systems, Terraform, and Puppet. Some familiarity with compliance environments like SOC 2 and ISO.
● Well versed in AWS and Azure cloud environments and their management, including direct work with customer support on maintenance and repair requests.
● Ability to fail over and handle datacenter region outages.
● Hands-on experience with any of these technologies: MSSQL, Grafana, Sumo Logic, Nagios, SaltStack, Zenoss, Confluence, Jira, PagerDuty.
● Working experience in macOS, Linux, and Windows-based production environments and strong knowledge of fundamentals and internals: file systems, memory management, threads and processes.
● Working experience in scripting languages such as Python, Groovy, and Ruby; a strong developer background is preferred.
● Strong understanding of networking protocols, IP packets, DNS, OSI layers and load balancing.
● Experience with system monitoring and alerting for availability, reliability and performance.
● Excellent analytical and problem-solving skills.
● Respond to service incidents and publish root cause analysis (RCA) reports.
● Ability to solve operations-related challenges through automation or process improvements.
● Ability to develop and plan longer-term projects that directly improve the relationship between the Command Center and the Line of Business (LOB), as well as our understanding of and ability to support the related products.