AI Infrastructure Engineer

Location
Remote (U.S.)

About Topia

Topia builds scalable, secure infrastructure for enterprise deployments that process billions of encrypted data connections annually. We are expanding our AI capabilities to power the next generation of our infrastructure-as-a-service platform.

Position Summary

In this role, you will design, deploy, and manage the GPU infrastructure that powers Topia's AI and machine learning workloads. You'll work across on-premises clusters and cloud environments to build reliable, scalable compute infrastructure. The position combines hands-on systems work with architecture decisions that directly impact our AI capabilities.

Key Responsibilities
  • Design and deploy GPU clusters for training and inference workloads across on-premises and cloud environments
  • Configure and optimize NVIDIA drivers, CUDA, container runtimes, and ML frameworks
  • Build and maintain orchestration for GPU workloads using Kubernetes, Slurm, or similar schedulers
  • Implement monitoring, alerting, and capacity planning for GPU resources (a minimal sketch of this kind of tooling follows this list)
  • Optimize infrastructure costs through efficient resource allocation and scheduling
  • Troubleshoot hardware, driver, and software issues across the GPU stack
  • Collaborate with ML engineers to understand workload requirements and optimize performance
  • Document infrastructure architecture, runbooks, and operational procedures
  • Evaluate and integrate new GPU hardware and cloud offerings
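To give candidates a concrete feel for the monitoring work above, here is a minimal sketch (in Python) of a GPU utilization poller built on nvidia-smi. The poll interval, alert threshold, and print-based "alerting" are illustrative assumptions, not a description of our production stack.

    #!/usr/bin/env python3
    """Minimal GPU utilization poller built on nvidia-smi (illustrative sketch).

    The poll interval, memory threshold, and print-based alerting below are
    placeholder assumptions, not production choices.
    """
    import subprocess
    import time

    POLL_INTERVAL_S = 30   # assumed polling interval
    MEM_ALERT_PCT = 90.0   # assumed memory-pressure threshold

    QUERY = "index,utilization.gpu,memory.used,memory.total"

    def sample_gpus():
        """Return one (index, util_pct, mem_used_mib, mem_total_mib) tuple per GPU."""
        out = subprocess.check_output(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            text=True,
        )
        rows = []
        for line in out.strip().splitlines():
            idx, util, used, total = (field.strip() for field in line.split(","))
            rows.append((int(idx), float(util), float(used), float(total)))
        return rows

    if __name__ == "__main__":
        while True:
            for idx, util, used, total in sample_gpus():
                mem_pct = 100.0 * used / total
                print(f"gpu{idx}: util={util:.0f}% mem={mem_pct:.0f}%")
                if mem_pct > MEM_ALERT_PCT:
                    # A real deployment would page or push to a metrics backend here.
                    print(f"ALERT: gpu{idx} memory at {mem_pct:.0f}%")
            time.sleep(POLL_INTERVAL_S)

In practice this signal would typically flow through an exporter such as NVIDIA DCGM into a metrics backend rather than a standalone script; the sketch only illustrates the shape of the problem.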

Qualifications
  • 3–5 years of experience in infrastructure engineering, DevOps, or systems administration
  • Hands-on experience with NVIDIA GPUs, CUDA, and container runtimes such as Docker or containerd
  • Strong Linux systems administration skills
  • Experience with GPU offerings from at least one major cloud provider such as AWS, GCP, or Azure
  • Familiarity with Kubernetes or other container orchestration platforms (see the sketch after this list)
  • Solid understanding of networking fundamentals including TCP/IP, DNS, and load balancing
  • Scripting proficiency in Python, Bash, or similar languages
  • Self-directed problem solver comfortable working autonomously
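
For a concrete sense of the Kubernetes familiarity above, here is a minimal sketch that schedules a single-GPU pod through the official Kubernetes Python client. The pod name, namespace, and container image are illustrative assumptions, and the cluster is assumed to run the NVIDIA device plugin, which exposes GPUs as the nvidia.com/gpu extended resource.

    """Minimal sketch: request one GPU for a pod via the Kubernetes Python client.

    Assumptions: the pod name, namespace, and image are illustrative, and the
    cluster runs the NVIDIA device plugin, which exposes the "nvidia.com/gpu"
    extended resource.
    """
    from kubernetes import client, config

    def launch_gpu_smoke_test():
        config.load_kube_config()  # use load_incluster_config() when running in-cluster
        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(name="cuda-smoke-test"),  # assumed name
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="cuda",
                        image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # assumed tag
                        command=["nvidia-smi"],
                        # One GPU, requested through the device plugin's resource.
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"},
                        ),
                    )
                ],
            ),
        )
        client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

    if __name__ == "__main__":
        launch_gpu_smoke_test()

The same request is more commonly written as a YAML manifest; the client form is shown here only to keep the examples in one language.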

Nice to Have

  • Experience with GPU cluster management and scheduling tools such as Slurm or Ray
  • Knowledge of ML training workflows and frameworks such as PyTorch or TensorFlow
  • Experience with InfiniBand or other high-speed networking technologies
  • Familiarity with infrastructure-as-code tools such as Terraform or Pulumi
  • Background in performance optimization and benchmarking
  • Experience with multi-node distributed training setups using libraries such as PyTorch DDP or DeepSpeed (see the sketch after this list)
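
And for the distributed-training item, here is a minimal PyTorch sketch of multi-node setup. It assumes launch via torchrun, which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each process; the hostname and port in the comment are illustrative.

    """Minimal sketch of multi-node distributed setup with torch.distributed.

    Assumes launch via torchrun, e.g.:
        torchrun --nnodes=2 --nproc_per_node=8 \
            --rdzv_backend=c10d --rdzv_endpoint=head-node:29500 train.py
    which sets RANK, LOCAL_RANK, and WORLD_SIZE for each process. The
    "head-node" hostname and port are illustrative assumptions.
    """
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # NCCL is the standard backend for GPU collectives.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy model; a real job would construct its actual network here.
        model = torch.nn.Linear(1024, 1024).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])

        # One gradient-synchronized step, just to exercise the collectives.
        out = model(torch.randn(32, 1024, device=f"cuda:{local_rank}"))
        out.sum().backward()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()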

Compensation & Benefits

Competitive salary, stock options, health/dental/vision insurance, 401(k), and home office stipend.