Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Latest commit

 

History

History

README.md

Overview

NCX Infra Controller (NICo) is a collection of services that provides site-local, zero-trust bare-metal lifecycle management with DPU-enforced isolation, allowing for deployment of multi-tenant AI infrastructure at scale. NICo enables zero-touch automation and ensures the integrity and separation of workloads at the bare-metal layer.

NICo Operational Principles

NICo has been designed according to the following principles:

  • The machine is untrustworthy.
  • Operating system requirements are not imposed on the machine.
  • After being racked, machines must become ready for use with no human intervention.
  • All monitoring of the machine must be done using out-of-band methods.
  • The network fabric (i.e. Leaf Switches and routers) stays static even during tenancy changes within NICo.

NICo Responsibilities

NICo is responsible for the following tasks in the data-center environment:

  • Maintain hardware inventory of ingested machines.
  • Integrate with RedFish APIs to manage usernames and passwords
  • Perform hardware testing and burn-in.
  • Validate and update firmware.
  • Allocate IP addresses (IPv4).
  • Control power (power on/off/reset).
  • Provide DNS services for managed machines.
  • Orchestrate provisioning, wiping, and releasing nodes.
  • Ensure trust of the machine when switching tenants.

Responsibilities not Covered

NICo is not responsible for the following tasks:

  • Configuration of services and software running on managed machines.
  • Cluster assembly (that is, it does not build SLURM or Kubernetes clusters)
  • Underlay network management

NICo Components and Services

NICo is a service with multiple components that drive actions based on API calls, which can originate from users or as events triggered by machines (e.g. a DHCP boot or PXE request).

Trusted services that are part of NICo deployment communicate with the NICo Core over gRPC using protocol buffers.

The NICo deployment includes a number of core services:

  • NICo Core gRPC API: Allows other components and services to query the state of all objects and to request creation, configuration, and deletion of entities.
  • DHCP: Provides IPs to all devices on underlay networks, including Host BMCs, DPU BMCs, and DPU OOB addresses. It also provides IPs to Hosts on the overlay network.
  • PXE: Delivers images to managed hosts at boot time. Currently, managed hosts are configured to always boot from PXE. If a local bootable device is found, the host will boot it. Hosts can also be configured to always boot from a particular image for stateless configurations.
  • Hardware health: Pulls hardware health and configuration information emitted from a Prometheus /metrics endpoint on port 9009 and reports that state information back to NICo.
  • SSH console: Provides a virtual serial console logging and access over ssh, allowing console access to remote machines deployed on site. The ssh-console also logs the serial console output of each host into the logging system, where it can be queried using tools such as Grafana and logcli.
  • DNS: Provides domain name service (DNS) functionality using two services:
    • carbide-dns: Handles DNS queries from the site controller and managed nodes.
    • unbound: Provides recursive DNS services to managed machines and instances.

This set of services is also referred to as the Site Controller

Component and Service Dependencies

In addition to the NICo core service components, there are other supporting services that must be set up to support the Site Controller. controller nodes.

Kubernetes

NICo requires persistent, durable storage to maintain state for the following components:

  • Hashicorp Vault: Used by Kubernetes for certificate signing requests (CSRs), this vault uses three each (one per K8s control node) of the data-vault and audit-vault 10GB PVs to protect and distribute the data in the absence of a shared storage solution.
  • Postgres: This database is used to store state for any NICo or site controller components that require it, including the main "forgedb". There are three 10GB pgdata PVs deployed to protect and distribute the data in the absence of a shared storage solution. The forgedb database is stored here.
  • Certificate Management Infrastructure: This is a set of components that manage the certificates for the site controller and managed hosts.

Site Management

  • Site Agent: Maintains a northbound Temporal connection to NICo REST (Cloud or centrally deployed or on-Site) to sync data with REST layer DB cache and delegate gRPC requests to NICo Core.
  • Admin CLI: Provides an admin level command line interface into NICo Core using the gRPC API

NICo REST

A collection of microservices that comprise the resource allocation and management backend for NCX Infra Controller, exposed as a REST API. This is the primary interface for end-users to interact with NICo.

The REST layer can be deployed in the datacenter with NCX Infra Controller Core, or deployed anywhere in Cloud with Site Agent connecting from the datacenter. Multiple NCX Infra Controller Cores running in different datacenters can also connect to NCX Infra Controller REST through respective Site Agents.

For details on NICo REST, please refer to NICo REST Github Repository and NICo REST API Schema.

Managed Hosts

The point of having a Site Controller is to administer a Site that has been populated with managed hosts. Each managed host is a pairing of a single BlueField (BF) 2/3 DPU and a host server. During initial deployment, the scout service runs, informing the NICo Core gRPC API of any discovered DPUs. NICo completes the installation of services on the DPU and boots into regular operation mode. Thereafter, the dpu-agent starts as a daemon.

Each DPU runs the dpu-agent which connects to NICo Core gRPC API to retrieve configuration instructions.

Metrics and Logs

NICo collects metrics and logs from the managed hosts and the Site Controller. This information is in Prometheus format and can be scraped by a Prometheus server.