Your mission
- Develop distributed systems comprising node-level daemons, dynamic library interceptors, and system-level components that enable GPU workload co-location and checkpoint/restore on Kubernetes nodes.
- Integrate solutions into Kubernetes-based GPU clusters with custom scheduling behavior.
- Build lightweight HTTP/gRPC services for interacting with the various components, exporting metrics, and providing custom views.
- Engineer novel GPU co-location and GPU checkpoint/restore pipelines that control workloads' access to GPUs without terminating or restarting them.