rtgwg                                                            R. Chen
Internet-Draft                                           ZTE Corporation
Intended status: Informational                                    K. Yao
Expires: 30 July 2025                                       China Mobile
                                                            L. Changwang
                                                    New H3C Technologies
                                                                  C. Gao
                                                         ZTE Corporation
                                                         26 January 2025


   A Framework and Definition for Collective Communication Offloading
            draft-chen-rtgwg-cco-framework-and-definition-02

Abstract

   This document provides a definition of the term "Collective
   Communication Offloading" for use within the IETF and specifically as
   a reference for other IETF documents that describe or use aspects of
   Collective Communication Offloading.

   The document also describes the characteristics of IETF Collective
   Communication Offloading, defines related terms and their meanings,
   and discusses the general framework for Collective Communication
   Offloading, including the necessary system components and interfaces.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 30 July 2025.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.




Chen, et al.              Expires 30 July 2025                  [Page 1]

Internet-Draft     A Framework and Definition for CCO       January 2025


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
   2.  Terms and Abbreviations . . . . . . . . . . . . . . . . . . .   3
   3.  Definition of CCO . . . . . . . . . . . . . . . . . . . . . .   3
   4.  Framework . . . . . . . . . . . . . . . . . . . . . . . . . .   4
     4.1.  CCOM  . . . . . . . . . . . . . . . . . . . . . . . . . .   5
     4.2.  Infrastructure Layer  . . . . . . . . . . . . . . . . . .   6
     4.3.  CCOM Southbound Interface . . . . . . . . . . . . . . . .   6
       4.3.1.  Interface between the CCO-member and CCOM . . . . . .   6
       4.3.2.  Interface between the CCO-switch and CCOM . . . . . .   7
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   7
   6.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .   7
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .   7
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   7
     8.1.  Normative References  . . . . . . . . . . . . . . . . . .   7
     8.2.  Informative References  . . . . . . . . . . . . . . . . .   8
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   8

1.  Introduction

   The Collective Communication Offloading (CCO) feature allows
   collective operations to be offloaded to switches.  Distributed
   applications that might benefit from CCO include, but are not
   limited to:

   *  Artificial intelligence (AI).

   *  High performance computing (HPC).

   In-Network Computing (INC) is a relatively common technology.  In
   both AI and HPC networks, the specific usage of INC is Collective
   Communication Offloading (CCO).  The use cases and their
   characteristics are further described in
   [I-D.yao-tsvwg-cco-problem-statement-and-usecases].






   This document provides a definition of the term "Collective
   Communication Offloading" for use within the IETF and specifically as
   a reference for other IETF documents that describe or use aspects of
   Collective Communication Offloading.

   The document also describes the characteristics of IETF Collective
   Communication Offloading, defines related terms and their meanings,
   and discusses the general framework for Collective Communication
   Offloading, including the necessary system components and interfaces.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Terms and Abbreviations

   The terms and abbreviations used in this document are listed below.

   INC: In-Network Computing

   CCO: Collective Communication Offloading

   CCOM: Collective Communication Offloading Manager

   CCO-switch: A device in a network that performs collective
   operations.

   CCO-member: A member of a Collective Group.

   CCO-tree: A tree of CCO-switches used for collective offload for a
   Collective Group.

   Collective Group: A set of workers that participate in a collective
   operation.

3.  Definition of CCO

   The definition of CCO in the IETF context is as follows:










   Collective Communication Offloading (CCO) efficiently and
   controllably utilizes the storage and computing resources of network
   equipment without affecting the normal functions of that equipment.
   The CCO feature offloads collective operations to the CCO-switch to
   improve network performance, for example by reducing latency and
   increasing throughput.

   The following types of collective operations are referred to in this
   draft, and all of them can benefit from CCO:

   *  Broadcast: distribute data from one member to all other members.

   *  AllGather: collect data from all members and distribute it to all
      members.

   *  Reduce: combine data from all members and deliver the result to
      one member.

   *  AllReduce: combine data from all members and distribute the result
      to all members.

   *  ReduceScatter: combine data from all members and scatter the
      result across all members.

   *  Barrier: synchronize across all members.
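
   As a non-normative illustration, the semantics of these operations
   can be sketched in Python.  The sketch models each member's data as
   a list of integers and simulates all members in one process; it
   says nothing about how a CCO-switch implements the operations.  All
   function names, the default "add" operator, and the even split
   assumed by reduce_scatter are assumptions made for this example.

```python
from functools import reduce
from operator import add
from typing import Callable, List, Optional, Sequence

Vector = List[int]
BinaryOp = Callable[[int, int], int]


def _combine(inputs: Sequence[Vector], op: BinaryOp) -> Vector:
    """Element-wise reduction of equal-length vectors, one per member."""
    return [reduce(op, column) for column in zip(*inputs)]


def broadcast(inputs: Sequence[Vector], root: int) -> List[Vector]:
    """Every member ends up with a copy of the root member's data."""
    return [list(inputs[root]) for _ in inputs]


def all_gather(inputs: Sequence[Vector]) -> List[Vector]:
    """Every member ends up with the data of all members, in rank order."""
    gathered = [value for vector in inputs for value in vector]
    return [list(gathered) for _ in inputs]


def reduce_to_root(inputs: Sequence[Vector], root: int,
                   op: BinaryOp = add) -> List[Optional[Vector]]:
    """Only the root member receives the combined result."""
    result = _combine(inputs, op)
    return [result if rank == root else None for rank in range(len(inputs))]


def all_reduce(inputs: Sequence[Vector], op: BinaryOp = add) -> List[Vector]:
    """Every member receives the full combined result."""
    result = _combine(inputs, op)
    return [list(result) for _ in inputs]


def reduce_scatter(inputs: Sequence[Vector],
                   op: BinaryOp = add) -> List[Vector]:
    """Each member receives one slice of the combined result."""
    result = _combine(inputs, op)
    chunk = len(result) // len(inputs)  # assumes an even split
    return [result[r * chunk:(r + 1) * chunk] for r in range(len(inputs))]


def barrier(arrived: Sequence[bool]) -> bool:
    """The barrier releases only once every member has arrived."""
    return all(arrived)
```

   For two members holding [1, 2, 3, 4] and [10, 20, 30, 40],
   all_reduce gives both members [11, 22, 33, 44], while
   reduce_scatter gives the first member [11, 22] and the second
   member [33, 44].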

4.  Framework

   An IETF CCO and its realization involve the following components,
   which are defined here for consistent terminology (see Figure 1).

   *  CCOM: The CCOM discovers CCO-switch capabilities and manages
      CCO-switch resources.  It is mainly responsible for establishing
      collective groups and configuring the resources allocated to a
      group for collective offload.

   *  Infrastructure Layer: It includes the CCO-switch and the
      CCO-member.

   *  CCOM Southbound Interface: It includes the interface between the
      CCO-member and the CCOM and the interface between the CCO-switch
      and the CCOM.










   +-----------------------------------------------------------------+
   |+----------------+ +-------------------+ +----------------------+|
   ||Group Management| |Topology Management| |CCO-switch capability ||
   |+----------------+ +-------------------+ |    Management        ||
   |   CCOM                                  +----------------------+|
   +-----------------+-------------------------------+---------------+
                     |                               |
    I/F between CCO-member and CCOM   I/F between CCO-switch and CCOM
                     |                               |
        +------------+-------------------------------+------------+
        |            |                               |            |
        |     +------+-----+                   +-----+------+     |
        |     | CCO-member |                   | CCO-switch |     |
        |     +------------+                   +------------+     |
        | Infrastructure Layer                                    |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


                   Figure 1: The framework of the CCO

4.1.  CCOM

   The CCOM is mainly responsible for establishing collective groups
   and allocating the necessary CCO resources for each collective
   group.  The CCOM is reachable by the CCO-switches and by the group's
   CCO-members, possibly through different networks.  When a set of
   members decides to form a group, the CCOM determines a CCO-tree to
   assign to the group for collective offload.  The CCOM then
   configures each individual switch via the interface between the
   CCO-switch and the CCOM, and finally returns to the CCO-members all
   information required to communicate with their neighboring
   CCO-switch.

   The CCOM has the following functional modules:

   *  Group Management: It covers group creation and destruction, group
      status query, and the allocation and de-allocation of the CCO
      resources needed by the collective group.

   *  Topology Management: The CCOM obtains or computes a CCO-tree
      connecting the CCO-members and CCO-switches of the group after
      receiving requests from all members.  The CCO-tree is determined
      using the underlay topology and CCO-switch resource information.

   *  CCO-switch capability management: The CCOM MUST obtain the
      CCO-switch capabilities and manage CCO-switch resources.
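
   This document does not specify how a CCO-tree is computed.  Purely
   as an illustration, the following Python sketch shows one possible
   approach: a breadth-first search from a chosen root switch that
   skips switches without free resources, followed by pruning to the
   branches that reach a member.  The function name, the data
   structures, the "free_slots" resource model, and the choice of root
   are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def compute_cco_tree(
    topology: Dict[str, List[str]],  # switch -> neighbouring switches
    free_slots: Dict[str, int],      # switch -> remaining group resources
    attachment: Dict[str, str],      # CCO-member -> switch it attaches to
    root: str,
) -> Optional[Dict[str, Optional[str]]]:
    """Return a node -> parent map covering every member, or None."""
    usable = {sw for sw, slots in free_slots.items() if slots > 0}
    if root not in usable:
        return None

    # Breadth-first search from the root over usable switches only.
    parent: Dict[str, Optional[str]] = {root: None}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbour in topology.get(current, []):
            if neighbour in usable and neighbour not in parent:
                parent[neighbour] = current
                queue.append(neighbour)

    # Keep only the branches that lead to a switch with an attached member.
    tree: Dict[str, Optional[str]] = {}
    for member, switch in attachment.items():
        if switch not in parent:
            return None  # a member cannot be reached, so no tree exists
        tree[member] = switch
        node: Optional[str] = switch
        while node is not None and node not in tree:
            tree[node] = parent[node]
            node = parent[node]
    return tree
```

   In this sketch, a return value of None corresponds to the case in
   which no CCO-tree can be built from the available switch resources.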






4.2.  Infrastructure Layer

   The Infrastructure Layer includes the CCO-switch and the CCO-member.

   *  CCO-switch: A device in a network that performs collective
      operations.  It receives input data from CCO-members, performs a
      reduction operation to produce a single result, and then
      distributes the output data to one or more members, depending on
      the collective group configuration and the particular collective
      operation.

   *  CCO-member: A member of a collective group.  It is the initiator
      of collective operations: it provides input data and accepts
      output data.
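
   The division of work between CCO-members and CCO-switches can be
   illustrated with a small Python simulation of an AllReduce over a
   tree.  Each switch sums the vectors received from its children and
   passes the partial result upward; the root's result is then
   returned to every member.  This is a sketch under simplifying
   assumptions (summation as the only reduction, a static tree, no
   packet loss), and the function names and the "children" map are
   hypothetical.

```python
from typing import Dict, List


def reduce_up(node: str,
              children: Dict[str, List[str]],
              member_data: Dict[str, List[int]]) -> List[int]:
    """Return the element-wise sum of all member data below `node`."""
    if node in member_data:          # a CCO-member contributes its input
        return member_data[node]
    # A CCO-switch combines the partial results of its children.
    partials = [reduce_up(child, children, member_data)
                for child in children[node]]
    return [sum(column) for column in zip(*partials)]


def all_reduce_on_tree(root: str,
                       children: Dict[str, List[str]],
                       member_data: Dict[str, List[int]]
                       ) -> Dict[str, List[int]]:
    """Reduce towards the root, then hand the result to every member."""
    result = reduce_up(root, children, member_data)
    return {member: list(result) for member in member_data}
```

   With this structure, each member sends its data once and receives
   the result once, and the combining work is done inside the tree of
   switches.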

4.3.  CCOM Southbound Interface

   The following communication interfaces enable interworking and
   interoperability between the CCOM, the CCO-switch, and the CCO-member
   and provide a common means of provisioning, operating, and
   monitoring the CCO (see Figure 1).

4.3.1.  Interface between the CCO-member and CCOM

   This is the interface between the CCOM and the CCO-member.  The CCOM
   and the CCO-member use it to exchange group information, including
   requests from a CCO-member to join a group, creation and destruction
   of a group, group status queries, etc.  The main information
   exchanged is as follows:

   *  CCO-member registration: A CCO-member needs to register with the
      CCOM through this interface so that the CCOM knows the member
      exists and can maintain the connection with it.  Registration
      request parameters include the CCO-member's addressing
      information and the MTU supported by the CCO-member.

   *  Group setup: A CCO-member joins a group by providing a set of
      required capabilities to the CCOM.  A group is established after
      all members have attempted to join.  Group creation MUST fail if
      the required network resources or capabilities are not provided.

   *  Group destruction: If the topology changes during the lifetime of
      the group, or once any member has left, the group is no longer
      usable.  The CCOM must then tear down the group and build a new
      CCO-tree.
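
   The registration, group setup, and group destruction exchanges can
   be sketched as a small state model in Python.  This is not a
   protocol definition: the class and method names, the state names
   (PENDING, ACTIVE, FAILED, DESTROYED), and the idea that the
   expected membership is known when the group is created are all
   assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Registration:
    address: str  # the CCO-member's addressing information
    mtu: int      # MTU supported by the CCO-member


@dataclass
class Group:
    expected: Set[str]                               # members that will join
    required: Set[str] = field(default_factory=set)  # requested capabilities
    joined: Set[str] = field(default_factory=set)
    state: str = "PENDING"


class Ccom:
    """Toy CCOM holding member registrations and group state."""

    def __init__(self, offered: Set[str]) -> None:
        self.offered = offered  # capabilities the network can provide
        self.members: Dict[str, Registration] = {}
        self.groups: Dict[str, Group] = {}

    def register(self, member_id: str, address: str, mtu: int) -> None:
        self.members[member_id] = Registration(address, mtu)

    def create_group(self, group_id: str, expected: Set[str]) -> None:
        self.groups[group_id] = Group(expected=set(expected))

    def join(self, group_id: str, member_id: str, required: Set[str]) -> str:
        if member_id not in self.members:
            raise KeyError(f"{member_id} has not registered with the CCOM")
        group = self.groups[group_id]
        group.joined.add(member_id)
        group.required |= required
        # The group is established only after all members have tried to join.
        if group.joined == group.expected:
            ok = group.required <= self.offered
            group.state = "ACTIVE" if ok else "FAILED"
        return group.state

    def leave(self, group_id: str, member_id: str) -> str:
        group = self.groups[group_id]
        group.joined.discard(member_id)
        group.state = "DESTROYED"  # any departure makes the group unusable
        return group.state
```

   In this model, a group whose members request a capability that the
   network does not offer ends in the FAILED state, which mirrors the
   requirement that group creation fail in that case.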





4.3.2.  Interface between the CCO-switch and CCOM

   This is the interface between the CCOM and the CCO-switch.  The CCOM
   uses it to discover CCO-switch capabilities and to manage CCO-switch
   resources.  The main information exchanged is as follows:

   *  Discover CCO-switch capability: The CCOM queries the CCO-switches
      to obtain their capabilities.  These capabilities mainly include
      whether the switch supports in-network computing, the supported
      types of collective operations, the supported number of groups,
      the number of trees, the supported MTU, etc.

   *  Allocate and de-allocate switch resources for a group: To allocate
      resources for a Collective Group, the CCOM first needs to know the
      type of collective operations the group intends to perform,
      because a deployment can contain different types of CCO-switch;
      e.g., some switches can support reduction while others can
      support only data transfer offload.  The CCOM therefore queries
      the CCO-switches to obtain their capabilities.  In this way, the
      CCOM can allocate appropriate resources even when different
      Collective Groups perform different types of collective
      operations.
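
   A minimal Python sketch of capability-aware allocation follows.  The
   fields of SwitchCapability mirror the capability items listed above,
   but the field names, the one-slot-per-group resource model, and the
   all-or-nothing allocation rule are assumptions made for this
   example rather than behavior defined by this document.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class SwitchCapability:
    supports_inc: bool          # whether in-network computing is supported
    operations: FrozenSet[str]  # supported types of collective operations
    max_groups: int             # supported number of groups
    max_trees: int              # supported number of trees
    mtu: int                    # supported MTU


class SwitchResources:
    """Tracks how many group slots are in use on each CCO-switch."""

    def __init__(self, capabilities: Dict[str, SwitchCapability]) -> None:
        self.capabilities = capabilities
        self.in_use: Dict[str, int] = {name: 0 for name in capabilities}

    def eligible(self, operation: str) -> List[str]:
        """Switches that support `operation` and still have a free slot."""
        return sorted(
            name for name, cap in self.capabilities.items()
            if cap.supports_inc
            and operation in cap.operations
            and self.in_use[name] < cap.max_groups
        )

    def allocate(self, switches: List[str], operation: str) -> bool:
        """Reserve one group slot on every switch, or reserve nothing."""
        usable = set(self.eligible(operation))
        if not set(switches) <= usable:
            return False
        for name in switches:
            self.in_use[name] += 1
        return True

    def release(self, switches: List[str]) -> None:
        """Return the group slots when the group is torn down."""
        for name in switches:
            self.in_use[name] = max(0, self.in_use[name] - 1)
```

   Filtering on the requested operation before reserving anything
   keeps a group that needs reduction support from being placed on a
   switch that offers only data transfer offload.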

5.  IANA Considerations

   There are no requests to IANA in this framework document.

6.  Acknowledgements

   TBD.

7.  Security Considerations

   Collective Communication Offloading may raise security and privacy
   concerns.

   CCOM Authentication: Underlay networks must be protected against
   attacks from malicious CCOMs, as such attacks could destabilize
   overall network operations.  A CCOM SHOULD have strong authentication
   with the Infrastructure Layer.  Furthermore, the interface between
   the CCOM and the Infrastructure Layer needs to be secured with robust
   authentication and authorization mechanisms, as well as associated
   auditing mechanisms.

8.  References

8.1.  Normative References





   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2.  Informative References

   [I-D.yao-tsvwg-cco-problem-statement-and-usecases]
              Yao, K., Shiping, X., Li, Y., Huang, H., and D. Kutscher,
              "Collective Communication Optimization: Problem Statement
              and Use cases", Work in Progress, Internet-Draft, draft-
              yao-tsvwg-cco-problem-statement-and-usecases-00, 23
              October 2023, <https://datatracker.ietf.org/doc/html/
              draft-yao-tsvwg-cco-problem-statement-and-usecases-00>.

Authors' Addresses

   Ran Chen
   ZTE Corporation
   Nanjing
   China
   Email: chen.ran@zte.com.cn


   Kehan Yao
   China Mobile
   Beijing
   China
   Email: yaokehan@chinamobile.com


   Changwang Lin
   New H3C Technologies
   Beijing
   China
   Email: linchangwang.04414@h3c.com


   Chenqiang Gao
   ZTE Corporation
   Nanjing
   China
   Email: gao.chenqiang@zte.com.cn



