Job Description
Job Title: Site Reliability Engineer
Location: Bellevue, WA / Frisco, TX / Atlanta, GA / Overland Park, KS (Hybrid)
Term: Contract
Key Responsibilities:
- Kubernetes Management: Deploy, manage, and optimize Kubernetes clusters in production and staging environments, ensuring high availability and efficient resource utilization.
- AWS Infrastructure: Leverage AWS cloud services (EC2, S3, RDS, EKS, Lambda, etc.) to build, manage, and scale cloud-native infrastructure.
- Automation & Infrastructure as Code: Develop and maintain automated workflows using Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible to provision, configure, and manage cloud infrastructure.
- CI/CD Pipeline Support: Build, optimize, and maintain CI/CD pipelines to enable seamless code delivery and deployments, using tools like Jenkins, GitLab CI, or CircleCI.
- Monitoring & Observability: Implement and maintain monitoring, alerting, and logging solutions using tools such as Prometheus, Grafana, CloudWatch, or ELK stack to ensure system health and availability.
- Incident Response: Lead and support incident response efforts, conduct root cause analysis, and run post-incident reviews to improve system resilience.
- Performance Optimization: Identify and resolve performance bottlenecks, improve system efficiency, and ensure applications and infrastructure are optimized for both cost and performance.
- Security & Compliance: Work with security teams to implement best practices for securing Kubernetes clusters, AWS resources, and platform infrastructure, including access controls, network policies, and encryption.
- Collaboration & Documentation: Work closely with development, DevOps, and infrastructure teams to align on best practices, improve automation, and document procedures for infrastructure management and troubleshooting.
Required Qualifications:
- Kubernetes Expertise: Strong expertise in managing and scaling Kubernetes clusters, including experience with Kubernetes networking, storage, and multi-cluster architectures.
- AWS Cloud Expertise: Proficiency with AWS services such as EC2, S3, EKS, RDS, VPC, Lambda, IAM, CloudWatch, and others. Experience with AWS best practices for scalability, security, and cost management.
- Infrastructure as Code (IaC): Hands-on experience with IaC tools such as Terraform, AWS CloudFormation, or Ansible for provisioning and managing cloud infrastructure.
- CI/CD Pipelines: Experience building and maintaining continuous integration and continuous deployment (CI/CD) pipelines using Jenkins, GitLab CI, or similar tools.
- Scripting & Automation: Proficiency in scripting languages such as Python, Bash, or Go to automate operational tasks and improve workflows.
- Monitoring & Logging: Experience with monitoring, logging, and alerting tools like Prometheus, Grafana, CloudWatch, ELK stack, or similar tools.
- Troubleshooting & Incident Management: Ability to troubleshoot complex issues in distributed systems, conduct root cause analysis, and implement solutions to prevent recurrence.
- Collaboration Skills: Strong communication skills with the ability to work collaboratively with developers, operations, and product teams.
Key Skills:
Kubernetes, AWS Cloud, Site Reliability Engineering, CI/CD, IaC
Job Tags
Contract work