This job has expired; please see additional jobs below
Site Reliability Engineer - Hadoop / Data Platforms
Entertainment & Media Industry Company
San Francisco, CA, United States
Job Details
Who We Are
SREs work on improving the availability, scalability, performance, and reliability of Company’s production services. Come join us.
About This Job
As a Site Reliability Engineer (SRE) on Company’s Hadoop team, you will work to improve the reliability and performance of Hadoop clusters and data management services. The Hadoop clusters at Company are among the largest in the world. We manage data used by millions of people as they connect, explore, and interact with information and one another. You will work shoulder-to-shoulder with our engineering teams to design, build, and operate our clusters and services. Your focus will be on debugging, automation, availability, performance, and above all efficiency at ‘reach-every-user-on-the-planet’ scale. We have a wide range of opportunities for varying skill levels and experience.
Responsibilities
• Work with the engineering team to design, build, and maintain Hadoop clusters and data services
• Participate in and build tools to:
◦ Diagnose and troubleshoot complex distributed systems handling tens of petabytes of data, and develop solutions that have a significant impact at our massive scale.
◦ Troubleshoot issues across the entire stack - hardware, software, application, and network.
◦ Test, monitor, administer, and operate multiple clusters across data centers, primarily in Python and Java.
• Collaborate with teams such as Application Services, Linux Kernel, JVM, Capacity Planning, Hardware, Network, and Datacenter Operations to design next-generation storage platforms.
• Take part in a 24x7 on-call rotation
• Interact with the open source community
Qualifications
• 2+ years of experience managing services in a distributed, internet-scale *nix environment.
• Familiarity with systems management tools (Puppet, Chef, Capistrano, etc.)
• Demonstrable knowledge of Linux operating system internals, filesystems, disk/storage technologies, storage protocols, and the networking stack.
• Hands-on operational experience managing JVM-based services.
• Practical knowledge of shell scripting and at least one scripting language (Python, Ruby, Perl).
• Ability to prioritize tasks and work independently
• Track record of practical problem solving, excellent communication, and documentation skills
• BS or MS degree in Computer Science or Engineering, or equivalent experience.
Desired
• Understanding of Hadoop and YARN
• Understanding of Scalding and Parquet
• Customer-oriented