OpenDILab RL Kubernetes Custom Resource and Operator Lib

Overview

DI Orchestrator

DI Orchestrator manages DI (Decision Intelligence) jobs on Kubernetes using a custom resource and an operator.

Prerequisites

  • A well-prepared Kubernetes cluster. Follow the instructions to create a Kubernetes cluster, or create a local Kubernetes node with kind or minikube.
  • cert-manager. To install it on Kubernetes, refer to the cert-manager docs, or install it with the following command:
kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install di-operator and di-server with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in the di-system namespace.

$ kubectl get pod -n di-system
NAME                           READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# get pod and you will see coordinator is created by di-operator
# a few seconds later, you will see collectors and learners created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

See the user-guide. For the Chinese version, see the Chinese manual (中文手册).

Contributing

See the developer-guide.

Contact us through [email protected]

Comments
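  • Add cluster information inside the Pod

    When a job is submitted as dijob replicas, each pod should be able to see the host information of the whole replica set and its own startup order. Add the following environment variables:

    1. The FQDNs of all pods in the replica, sorted by startup order
    2. The FQDN of the current pod
    3. The ordinal index of the current pod

    DI-engine will use these variables to set up the corresponding network connections, so the attach-to generation logic can be removed from di-orchestrator.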
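
The three values proposed in this issue could be derived roughly as in the Go sketch below. It is only an illustration under stated assumptions: the variable names printed in main (DI_NODES, DI_NODE_FQDN), the "<job>-<index>" pod hostname scheme, and the helper names are hypothetical and not taken from di-orchestrator. DI_RANK is a name mentioned elsewhere on this page, but its use here for the ordinal index is also an assumption. The FQDN layout follows the general Kubernetes DNS form for pods that set hostname and subdomain.

```go
package main

import (
	"fmt"
	"strings"
)

// PodEnv holds the three values proposed in the issue for one pod.
type PodEnv struct {
	AllFQDNs string // comma-separated FQDNs of every pod, in startup order
	SelfFQDN string // FQDN of the current pod
	Rank     int    // ordinal index of the current pod
}

// podFQDN uses the Kubernetes DNS form for a pod with hostname and subdomain:
// <hostname>.<subdomain>.<namespace>.svc.<cluster-domain>.
func podFQDN(hostname, subdomain, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.%s.svc.%s", hostname, subdomain, namespace, clusterDomain)
}

// BuildPodEnvs returns one PodEnv per replica. Hostnames are assumed to be
// "<job>-<index>", a hypothetical naming scheme chosen for this sketch.
func BuildPodEnvs(job, subdomain, namespace, clusterDomain string, replicas int) []PodEnv {
	fqdns := make([]string, replicas)
	for i := range fqdns {
		fqdns[i] = podFQDN(fmt.Sprintf("%s-%d", job, i), subdomain, namespace, clusterDomain)
	}
	all := strings.Join(fqdns, ",")

	envs := make([]PodEnv, replicas)
	for i := range envs {
		envs[i] = PodEnv{AllFQDNs: all, SelfFQDN: fqdns[i], Rank: i}
	}
	return envs
}

func main() {
	// Variable names below are illustrative only.
	for _, e := range BuildPodEnvs("cartpole-dqn", "cartpole-dqn", "default", "cluster.local", 2) {
		fmt.Printf("DI_RANK=%d DI_NODE_FQDN=%s DI_NODES=%s\n", e.Rank, e.SelfFQDN, e.AllFQDNs)
	}
}
```

Computing the full list once and sharing it across pods keeps every pod's view of the startup order identical, which is what lets each pod work out its peers without a separate attach-to step.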

    enhancement 
    opened by sailxjx 3
  • add tasks to dijob spec

    1. goal

    A dijob defines only one pod template, so we cannot define different commands or resources for different components of di-engine such as collector, learner and evaluator. We therefore need a more general way to define the dijob custom resource.

    2. design

    Inspired by VolcanoJob, we define spec.tasks to describe the different components of di-engine. spec.tasks is a list, which allows us to define multiple tasks. Each task can set task.type to label itself as one of collector, learner, evaluator or none. none is the default value and means the task is a general task.

    After the change, a dijob can be defined as follows:

    apiVersion: diengine.opendilab.org/v2alpha1
    kind: DIJob
    metadata:
      name: job-with-tasks
    spec:
      priority: "normal"  # job priority, which is a reserved field for allocator
      backoffLimit: 0  # restart count
      cleanPodPolicy: "Running"  # the policy to clean pods after job completion
      preemptible: false  # whether the job is preemptible
      minReplicas: 2  
      maxReplicas: 5
      tasks:
      - replicas: 1
        name: "learner"
        type: learner
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c"]
              args: 
              - |
                ditask --label learner xxx
              resources:
                requests:
                  cpu: "1"
                  nvidia.com/gpu: 1
            restartPolicy: Never
      - replicas: 1
        name: "evaluator"
        type: evaluator
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c"]
              args: 
              - |
                ditask --label evaluator xxx
            restartPolicy: Never
      - replicas: 2
        name: "collector"
        type: collector
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c"]
              args: 
              - |
                ditask --label collector xxx
            restartPolicy: Never
    status:
      conditions:
      - lastTransitionTime: "2022-05-26T07:25:11Z"
        lastUpdateTime: "2022-05-26T07:25:11Z"
        message: job created.
        reason: JobPending
        status: "False"
        type: Pending
      - lastTransitionTime: "2022-05-26T07:25:11Z"
        lastUpdateTime: "2022-05-26T07:25:11Z"
        message: job is starting since all pods are created.
        reason: JobStarting
        status: "False"
        type: Starting
      phase: Starting
      profilings: {}
      readyReplicas: 0
      replicas: 4
      taskStatus:
        learner:
          Pending: 1
        evaluator:
          Pending: 1
        collector:
          Pending: 2
      reschedules: 0
      restarts: 0
    

    task definition:

    type Task struct {
    	Name string `json:"name,omitempty"`
    
    	Type TaskType `json:"type,omitempty"`
    
    	Replicas int32 `json:"replicas,omitempty"`
    
    	Template corev1.PodTemplateSpec `json:"template,omitempty"`
    }
    
    type TaskType string
    
    const (
    	TaskTypeLearner TaskType = "learner"
    
    	TaskTypeCollector TaskType = "collector"
    
    	TaskTypeEvaluator TaskType = "evaluator"
    
    	TaskTypeNone TaskType = "none"
    )
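
As an illustration of how the task types above might be defaulted and checked, here is a minimal Go sketch. It is not the project's validator: the NormalizeTasks name is hypothetical, the Task struct is trimmed (Template is left out so the example compiles without the Kubernetes API packages), and rejecting duplicate task names is an assumption, motivated only by status.taskStatus being keyed by task name.

```go
package main

import "fmt"

type TaskType string

const (
	TaskTypeLearner   TaskType = "learner"
	TaskTypeCollector TaskType = "collector"
	TaskTypeEvaluator TaskType = "evaluator"
	TaskTypeNone      TaskType = "none"
)

// Task is a trimmed copy of the struct above, without the pod template.
type Task struct {
	Name     string
	Type     TaskType
	Replicas int32
}

// NormalizeTasks applies the default type and rejects unknown types.
// The duplicate-name check is an assumption of this sketch.
func NormalizeTasks(tasks []Task) ([]Task, error) {
	seen := make(map[string]bool, len(tasks))
	out := make([]Task, 0, len(tasks))
	for i, t := range tasks {
		if t.Type == "" {
			t.Type = TaskTypeNone // "none" is the documented default
		}
		switch t.Type {
		case TaskTypeLearner, TaskTypeCollector, TaskTypeEvaluator, TaskTypeNone:
			// known type, nothing to do
		default:
			return nil, fmt.Errorf("tasks[%d]: unknown type %q", i, t.Type)
		}
		if seen[t.Name] {
			return nil, fmt.Errorf("tasks[%d]: duplicate name %q", i, t.Name)
		}
		seen[t.Name] = true
		out = append(out, t)
	}
	return out, nil
}

func main() {
	tasks, err := NormalizeTasks([]Task{
		{Name: "learner", Type: TaskTypeLearner, Replicas: 1},
		{Name: "worker", Replicas: 2}, // type omitted, becomes "none"
	})
	if err != nil {
		fmt.Println("invalid:", err)
		return
	}
	for _, t := range tasks {
		fmt.Printf("%s type=%s replicas=%d\n", t.Name, t.Type, t.Replicas)
	}
}
```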
    
    

    status.taskStatus definition:

    type DIJobStatus struct {
    	// Phase defines the observed phase of the job
    	// +kubebuilder:default=Pending
    	Phase Phase `json:"phase,omitempty"`
    
    	// ...
    
    	// TaskStatus maps task.name to the TaskStatus of that task
    	TaskStatus map[string]TaskStatus `json:"taskStatus,omitempty"`
    
    	// ...
    }
    
    // TaskStatus counts the pods of a task by pod phase
    type TaskStatus map[corev1.PodPhase]int32
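
To make the shape of status.taskStatus concrete, the Go sketch below groups pods by task name and counts them by phase. It is a hedged illustration, not code from di-orchestrator: PodPhase stands in for corev1.PodPhase so that no Kubernetes imports are needed, and PodInfo and Summarize are hypothetical names introduced for this example.

```go
package main

import "fmt"

// PodPhase stands in for corev1.PodPhase in this self-contained sketch.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

// TaskStatus counts the pods of a task by pod phase.
type TaskStatus map[PodPhase]int32

// PodInfo is a hypothetical minimal view of a pod: its task and its phase.
type PodInfo struct {
	Task  string
	Phase PodPhase
}

// Summarize returns the per-task phase counts and the total number of pods.
func Summarize(pods []PodInfo) (map[string]TaskStatus, int32) {
	status := make(map[string]TaskStatus)
	var replicas int32
	for _, p := range pods {
		if status[p.Task] == nil {
			status[p.Task] = make(TaskStatus)
		}
		status[p.Task][p.Phase]++
		replicas++
	}
	return status, replicas
}

func main() {
	// Mirrors the sample status above: one learner, one evaluator, two collectors.
	pods := []PodInfo{
		{Task: "learner", Phase: PodPending},
		{Task: "evaluator", Phase: PodPending},
		{Task: "collector", Phase: PodPending},
		{Task: "collector", Phase: PodPending},
	}
	status, replicas := Summarize(pods)
	fmt.Println("replicas:", replicas)
	fmt.Println("collector pending:", status["collector"][PodPending])
}
```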
    
    enhancement 
    opened by konnase 1
  • new version for di-engine new architecture

    release notes

    features

    • v1.0.0 for DI-engine new architecture
    • remove webhook
    • manage commands with cobra
    • refactor orchestrator architecture, inspired by adaptdl
    • use gin to rewrite di-server
    • update di-server http interface
    enhancement 
    opened by konnase 1
  • v0.2.0

    • [x] split webhook and operator
    • [x] add dockerfile.dev
    • [x] update CleanPolicyALL to CleanPolicyAll
    • [x] remove k8s service related operations from server, and operator is responsible for managing services
    • [x] add e2e test
    enhancement 
    opened by konnase 1
  • refactor job spec

    • refactor job spec definition and add spec.tasks to support multi tasks #20
    • add DI_RANK to pod env and remove engineFields in job.spec #16
    • add e2e test
    • add validator to validate the correctness of dijob spec
    • change job.phase to Pending when job replicas scaled to 0
    • implement a processor to process di-server requests
    • refactor project structure
    enhancement 
    opened by konnase 0
  • Release/v1.0

    release notes

    features

    • v1.0.0 for DI-engine new architecture
    • remove webhook
    • manage commands with cobra
    • refactor orchestrator architecture, inspired by adaptdl
    • use gin to rewrite di-server
    • update di-server http interface
    enhancement 
    opened by konnase 0
  • fix: job failed submit when collector/learner missed

    Job submission failed when collector/learner was missing, because the webhook creates an empty dijob and the Go builder adds default values to some fields of collector/learner, which results in an invalid type error. Solved by making coordinator/collector/learner pointers.

    bug 
    opened by konnase 0
  • Feat/job create event

    • add event handler for dijob, and mark job as Created when the job is submitted
    • mark collector and learner as optional; only coordinator is required (https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)
    • mark job Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94). This is hard to test: the client-go reflector decodes DIJob strictly, so there is no chance to handle the DIJob add event when an incorrect job is submitted
    • version -> v0.2.1
    enhancement 
    opened by konnase 0
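  • Some questions about allocate

    1. In the current allocator logic, the initial allocation for a non-preemptible job only uses minreplicas to modify the replicas attribute. Is the node on which the job's pods are deployed then decided entirely by K8S? Also, in allocator.go of the Release1.13 code, the initial allocation part for non-preemptible jobs does not seem to have been written yet.
    2. What exactly does it mean for a job to be preemptible? Is it equivalent to whether the job can be scheduled?
    3. The Allocate and Optimize methods of the scheduling policy FitPolicy are not implemented either. When can this part be added?
    4. The documentation has many places that do not match the latest code. For example, the DIJob.Spec.Group attribute has already been removed from the code, and the job.spec.minreplicas attribute mentioned in the documentation is not in the code either but in JobInfo. Could the documentation be updated? Thanks!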

    opened by RZ-Q 3
Releases(v1.1.3)
  • v1.1.3(Aug 22, 2022)

  • v1.1.2(Jul 21, 2022)

    bug fixes

    • global cmd flag error (https://github.com/opendilab/DI-orchestrator/pull/23)
    • wrong pod subdomain (https://github.com/opendilab/DI-orchestrator/pull/24)
    • global rank obtained incorrectly (https://github.com/opendilab/DI-orchestrator/pull/25)
  • v1.1.1(Jul 4, 2022)

  • v1.1.0(Jun 30, 2022)

    • refactor job spec definition and add spec.tasks to support multi tasks #20
    • add DI_RANK to pod env and remove engineFields in job.spec #16
    • add e2e test
    • add validator to validate the correctness of dijob spec
    • change job.phase to Pending when job replicas scaled to 0
    • implement a processor to process di-server requests
    • refactor project structure

    see details in https://github.com/opendilab/DI-orchestrator/pull/21

  • v1.0.0(Mar 23, 2022)

  • v0.2.2(Dec 15, 2021)

  • v0.2.1(Oct 12, 2021)

    feature

    • add event handler for dijob, and mark job as Created when the job is submitted (https://github.com/opendilab/DI-orchestrator/pull/13)
    • mark collector and learner as optional; only coordinator is required (https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)
    • mark job Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94). This is hard to test: the client-go reflector decodes DIJob strictly, so there is no chance to handle the DIJob add event when an incorrect job is submitted
  • v0.2.0(Sep 28, 2021)

  • v0.2.0-rc.0(Sep 6, 2021)

    • split webhook and operator
    • add dockerfile.dev
    • update CleanPolicyALL to CleanPolicyAll
    • remove k8s service related operations from server, and operator is responsible for managing services
    • add e2e test
  • v0.1.0(Jul 8, 2021)

    Features

    • Define DIJob CRD to support DI jobs' submission
    • Define AggregatorConfig CRD to support aggregator definition
    • Add webhook to validate DIJob submission
    • Provide http service for DI jobs to request for DI modules
    • Docs to introduce DI-orchestrator architecture
Owner
OpenDILab
Open sourced Decision Intelligence (DI)