Network Working Group                                            E. Song
Internet-Draft                                                   Y. Song
Intended status: Informational                                  S. Zhang
Expires: 10 July 2025                                              X. Li
                                                                 J. Zhao
                                                           Alibaba Cloud
                                                          6 January 2025


 Problem Statements of Service Mesh Infrastructure and Requirements of
                                  DMSC
              draft-song-dmsc-promblem-and-requirements-00

Abstract

   Service meshes, as an infrastructure component, are widely used by
   the major public cloud providers.  Their main function is to perform
   policy routing, precise traffic allocation, traffic throttling, and
   similar tasks.  Currently, the design and implementation of service
   meshes take a centralized control approach, which brings various
   challenges for current deployments and further development.  This
   document analyzes the problems that exist in current service mesh
   implementations and provides requirements for a future Distributed
   Micro Service Communication (DMSC) infrastructure.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119] [RFC8174]
   when, and only when, they appear in all capitals, as shown here.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 10 July 2025.



Song, et al.              Expires 10 July 2025                  [Page 1]

Internet-Draft   Service Mesh Problem Statement and DMSC    January 2025


Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Problem Statements of Current Service Mesh Infrastructure . .   3
     2.1.  Service Mesh is highly Coupled with User Service  . . . .   3
     2.2.  Service Mesh Introduces Additional Performance
           Overhead  . . . . . . . . . . . . . . . . . . . . . . . .   3
     2.3.  Service Mesh Results in High Resource Consumption . . . .   4
     2.4.  Service Mesh Incurs Overhead in Control Plane . . . . . .   4
   3.  Requirements of Distributed Micro Services Communication
           (DMSC)  . . . . . . . . . . . . . . . . . . . . . . . . .   5
     3.1.  Non-intrusive Service Mesh for User Applications  . . . .   5
     3.2.  Reduce Control Plane Overhead . . . . . . . . . . . . . .   5
     3.3.  Improve Data Plane Performance  . . . . . . . . . . . . .   5
     3.4.  Implement an Application Mesh that is Not Limited to
           Kubernetes  . . . . . . . . . . . . . . . . . . . . . . .   6
   4.  Security Considerations . . . . . . . . . . . . . . . . . . .   6
   5.  Acknowledgement . . . . . . . . . . . . . . . . . . . . . . .   6
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   6
   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   6
     7.1.  Normative References  . . . . . . . . . . . . . . . . . .   6
     7.2.  Informative References  . . . . . . . . . . . . . . . . .   6
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   7

1.  Introduction

   Service meshes, as an infrastructure component, facilitate
   communication between services.  Major public cloud providers such as
   AWS, Azure, GCP, and Alibaba Cloud have all introduced service mesh-
   based products to simplify the building and management of
   microservices-based applications.  In many service mesh frameworks, a
   key component is the sidecar proxy, which is responsible for managing
   pod traffic and implementing functionalities such as policy routing,
   precise traffic allocation, and traffic throttling.  By decoupling
   network functionalities into the sidecar, flexible traffic management
   can be achieved without altering the user business logic.  However,
   deploying sidecars in production environments reveals certain
   performance bottlenecks, which have also been reported in other
   literature [Dissecting].

   This document analyzes the problems that exist in current service
   mesh implementations and provides requirements for a future
   Distributed Micro Service Communication (DMSC) infrastructure.

2.  Problem Statements of Current Service Mesh Infrastructure

2.1.  Service Mesh is highly Coupled with User Service

   In the model where a sidecar (such as Istio Mesh [Istio]) is deployed
   within each pod, the sidecar is embedded within the user
   application's pod and is responsible for handling the communication
   tasks of the application.  The sidecar coexists with the user
   application, sharing the pod’s resources [SPRIGHT] [CanalMesh].  To
   ensure uninterrupted communication between applications and to avoid
   resource waste caused by isolated sidecars, both the sidecar and the
   application are designed to be created, destroyed, and scaled
   simultaneously, sharing the same life cycle.  However, this design
   introduces stability and security issues; for example, memory leaks
   in the sidecar may lead to application crashes, and upgrading the
   sidecar requires restarting the pod, resulting in interruptions to
   application operation.

2.2.  Service Mesh Introduces Additional Performance Overhead

   Since traffic needs to be processed through the sidecar, the outgoing
   traffic from the user application is redirected to the sidecar (for
   example, using iptables), which introduces additional processing
   steps [SPRIGHT] [CanalMesh].  Specifically, at both the source and
   destination, the traffic redirection introduces two additional
   context switches, memory copying, and protocol stack processing
   overhead [SPRIGHT].  Furthermore, the sidecar is required to perform
   complex Layer 7 (L7) tasks, such as CPU-intensive TLS encryption and
   decryption operations, which may lead to further significant
   performance degradation.


2.3.  Service Mesh Results in High Resource Consumption

   Since the sidecar is deployed within the user pod, it consumes
   resources that would otherwise be allocated to the user application
   [6].  For example, a customer with 500 nodes and 15,000 pods found
   that the sidecars consumed 1,500 CPU cores (10% of the total) and
   5,000 GB of memory (10% of the total) [CanalMesh].  In extreme cases,
   the CPU and memory usage of the sidecar can even exceed that of the
   application itself due to the complex Layer 7 functionalities it
   provides.  This issue has raised concerns among customers, as the pod
   resources they purchased are not fully utilized for running their
   applications.  Additionally, measurement results indicate that to
   achieve optimal performance, it may even be necessary to
   over-provision resources for the sidecar.
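
   The figures quoted above imply cluster totals and per-pod averages
   that the text does not state explicitly.  The following Python
   sketch derives them from the quoted numbers only.  The function
   names are hypothetical, and the per-pod average assumes exactly one
   sidecar per pod, as in the deployment model of Section 2.1.

```python
def implied_total(consumed: int, share_percent: int) -> int:
    """Total implied when `consumed` is `share_percent` percent of it."""
    return consumed * 100 // share_percent


def per_pod_average(consumed: float, pods: int) -> float:
    """Average sidecar consumption per pod (assumes one sidecar per pod)."""
    return consumed / pods


# Figures quoted in the paragraph above.
PODS = 15_000
SIDECAR_CPU_CORES = 1_500
SIDECAR_MEMORY_GB = 5_000
SHARE_PERCENT = 10

if __name__ == "__main__":
    total_cpu = implied_total(SIDECAR_CPU_CORES, SHARE_PERCENT)
    total_mem = implied_total(SIDECAR_MEMORY_GB, SHARE_PERCENT)
    print("implied total CPU cores:", total_cpu)
    print("implied total memory (GB):", total_mem)
    print("CPU cores per sidecar:",
          round(per_pod_average(SIDECAR_CPU_CORES, PODS), 3))
    print("memory (GB) per sidecar:",
          round(per_pod_average(SIDECAR_MEMORY_GB, PODS), 3))
```

   Under these assumptions, the quoted figures correspond to a cluster
   of roughly 15,000 CPU cores and 50,000 GB of memory, with each
   sidecar consuming on average about 0.1 core and 0.33 GB.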

2.4.  Service Mesh Incurs Overhead in Control Plane

   With the growing popularity of service meshes, an increasing number
   of customers are choosing to use them to deploy microservices, which
   has rapidly increased the number of sidecars that the control plane
   needs to manage.  Sidecars can handle many types of configurations;
   however, orchestrating service dependency configurations for each
   sidecar individually is both time-consuming and error-prone, and any
   misconfiguration could potentially affect service continuity.  To
   reduce complexity, a common practice is to distribute the same
   configuration set to all sidecars.  This configuration set contains
   all possibly relevant configurations, ensuring that any pod can
   freely communicate with other pods as needed.  However, pushing the
   complete configuration to all pods during each update significantly
   increases southbound bandwidth overhead: whenever the configuration
   of one sidecar is updated, the update must be pushed to all
   sidecars, even those to which it is unrelated.  In scenarios
   involving cross-region or multi-cloud deployments within a Kubernetes
   cluster (such as on-premises deployments or multi-site disaster
   recovery), this southbound bandwidth overhead may delay
   configuration delivery or even cause configuration loss.  The reason
   is that cross-region/cross-cloud communication requires VPNs or
   dedicated lines, so communication costs are relatively high and most
   customers opt for a conservative bandwidth purchasing strategy.
   Consequently, when managing cross-region or multi-cloud clusters,
   the controller's configuration updates to geographically distributed
   sidecars can deplete the customer's cross-region/cross-cloud
   bandwidth.
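
   The scaling effect described in this section can be illustrated
   with a back-of-the-envelope model.  The Python sketch below is not
   taken from any service mesh implementation: the function names, the
   number of services, the per-service configuration size, and the
   number of dependencies per sidecar are all hypothetical values
   chosen only for illustration.  Only the sidecar count reuses the
   15,000-pod example from Section 2.3.

```python
def full_push_bytes(sidecars: int, services: int,
                    bytes_per_service: int) -> int:
    """Bytes sent when every sidecar receives the complete config set."""
    return sidecars * services * bytes_per_service


def scoped_push_bytes(sidecars: int, deps_per_sidecar: int,
                      bytes_per_service: int) -> int:
    """Bytes sent when each sidecar receives only its dependencies."""
    return sidecars * deps_per_sidecar * bytes_per_service


if __name__ == "__main__":
    sidecars = 15_000          # pod count from the Section 2.3 example
    services = 1_000           # hypothetical number of services
    bytes_per_service = 2_048  # hypothetical config size per service
    deps_per_sidecar = 10      # hypothetical dependencies per sidecar

    full = full_push_bytes(sidecars, services, bytes_per_service)
    scoped = scoped_push_bytes(sidecars, deps_per_sidecar,
                               bytes_per_service)
    print("full push bytes:  ", full)
    print("scoped push bytes:", scoped)
    print("ratio:", full // scoped)
```

   In this model the ratio between the two strategies equals the
   number of services divided by the number of dependencies per
   sidecar, so the cost of full pushes grows with both the number of
   sidecars and the number of services in the mesh.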







3.  Requirements of Distributed Micro Services Communication (DMSC)

3.1.  Non-intrusive Service Mesh for User Applications

   Current mainstream service mesh solutions like Istio and Ambient
   exhibit a high degree of intrusiveness toward user services.  This is
   manifested in components such as sidecars that share the life cycle
   with pods (L4 + L7 proxies), L4 proxies that share resources with
   other pods within the same node, and L7 proxies that share resources
   across all nodes in the Kubernetes cluster.  These components not
   only occupy resources that users allocate for their business
   operations but also introduce potential failure risks.  To ensure
   equivalence in service mesh functionalities, Canal Mesh [CanalMesh]
   still retains lightweight proxies locally.  Therefore, there is a
   pressing need for service meshes to further reduce their
   intrusiveness to users, with the ultimate goal of achieving a
   completely non-intrusive service mesh.

3.2.  Reduce Control Plane Overhead

   The control plane of the service mesh needs to handle tasks such as
   full configuration orchestration and mass sidecar configuration
   pushing.  When the overhead is too high, it can lead to issues such
   as long delays before a configuration takes effect and excessive
   consumption of dedicated line bandwidth during cross-cloud or IDC
   deployments.  Additionally, this overhead is directly proportional
   to the scale of the cluster, which severely hinders the scalable
   deployment of service meshes.  Therefore, there is an urgent need to
   reduce the overhead of the service mesh control plane.  One
   potential solution is the centralized mesh gateway configuration in
   Canal Mesh [CanalMesh].  Moreover, further optimizing the
   configuration orchestration and pushing methods (for example,
   transforming full pushes into incremental pushes) is also a
   potentially viable direction.
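
   The idea of replacing full pushes with incremental pushes can be
   sketched as follows.  This Python example is illustrative only and
   does not describe the mechanism of any existing control plane: the
   configuration is modeled as a flat mapping from a hypothetical
   resource name to its serialized content, and the function names
   config_delta and apply_delta are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

Config = Dict[str, str]  # hypothetical: resource name -> serialized config


def config_delta(old: Config, new: Config) -> Tuple[Config, List[str]]:
    """Return (upserts, removals) needed to turn `old` into `new`."""
    upserts = {name: body for name, body in new.items()
               if name not in old or old[name] != body}
    removals = sorted(name for name in old if name not in new)
    return upserts, removals


def apply_delta(current: Config, upserts: Config,
                removals: List[str]) -> Config:
    """Apply a delta on the receiving side and return the new config."""
    updated = dict(current)
    updated.update(upserts)
    for name in removals:
        updated.pop(name, None)
    return updated


if __name__ == "__main__":
    old = {"svc-a": "v1", "svc-b": "v1", "svc-c": "v1"}
    new = {"svc-a": "v1", "svc-b": "v2", "svc-d": "v1"}
    upserts, removals = config_delta(old, new)
    # Only the changed and removed entries cross the network.
    print("upserts:", upserts)
    print("removals:", removals)
    assert apply_delta(old, upserts, removals) == new
```

   In this sketch the controller sends only the entries that changed
   or disappeared, so the push size follows the size of the change
   rather than the size of the complete configuration set.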

3.3.  Improve Data Plane Performance

   The service mesh takes over the user's advanced network communication
   needs by inserting proxy nodes into the user's communication path.
   While this provides the convenience of allowing users to focus solely
   on business development, redirecting traffic through the proxy
   inevitably affects the data plane transmission latency and
   throughput.  Whether the service mesh proxies are located remotely in
   the cloud or retained locally in a limited capacity, improving the
   data plane performance of the service mesh is crucial.  For example,
   leveraging SmartNICs to offload proxy functions can help reduce the
   performance degradation that deploying a service mesh may bring to
   user applications.  This represents an important direction for
   evolution.

3.4.  Implement an Application Mesh that is Not Limited to Kubernetes

   In addition to Kubernetes users, there are many business scenarios
   that also wish to introduce the concept of service mesh to reduce
   repetitive development for network communication needs.  For example,
   AWS’s VPC Lattice service unifies advanced network communication
   capabilities across various forms such as VMs, bare metal, and
   Kubernetes, providing a broader range of service mesh functionalities
   [1].  Some operators also hope to extend the concept of service mesh
   into the backbone network, offering advanced network features at a
   cloud and IDC granularity through routers [I-D.li-dmsc-architecture].
   In summary, expanding the concept of service mesh beyond Kubernetes
   to achieve a more generalized application mesh is a potential
   research direction.

4.  Security Considerations

   This informational document introduces no additional security
   issues for the Internet.

5.  Acknowledgement

   TBD

6.  IANA Considerations

   None

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2.  Informative References

   [CanalMesh]
              Song, E., Song, Y., Lu, C., Pan, T., Zhang, S., Lu, J.,
              Zhao, J., Wang, X., Wu, X., and M. Gao, "Canal mesh: A
              cloud-scale sidecar-free multitenant service mesh
              architecture", ACM SIGCOMM 2024 Conference, pages 860–875,
              2024.

   [Dissecting]
              Zhu, X., She, G., Xue, B., Zhang, Y., Zhang, Y., Zou, X.,
              Duan, X., He, P., Krishnamurthy, A., and L. Lentz,
              "Dissecting overheads of service mesh sidecars", ACM
              SoCC, pages 142–157, 2023.

   [I-D.li-dmsc-architecture]
              Li, X., Wang, A., Wang, W., and D. KUTSCHER, "Distributed
              Micro Service Communication architecture based on Content
              Semantic", Work in Progress, Internet-Draft, draft-li-
              dmsc-architecture-00, 2 January 2025,
              <https://datatracker.ietf.org/doc/html/draft-li-dmsc-
              architecture-00>.

   [Istio]    Calcote, L. and Z. Butcher, "Istio: Up and running: Using
              a service mesh to connect, secure, control, and observe",
              O’Reilly Media, 2019.

   [SPRIGHT]  Qi, S., Monis, L., Zeng, Z., Wang, I., and K.
              Ramakrishnan, "Extracting the Server from Serverless
              Computing, High-performance eBPF-based Event-driven,
              Shared-memory Processing", ACM SIGCOMM, pages 780–794,
              2022.

Authors' Addresses

   Enge Song
   Alibaba Cloud
   Alibaba Beijing Chaoyang Science & Technology Park
   Beijing
   100124
   China
   Email: enge.seg@alibaba-inc.com


   Yang Song
   Alibaba Cloud
   Alibaba Beijing Chaoyang Science & Technology Park
   Beijing
   100124
   China
   Email: song288954@alibaba-inc.com





   Shaokai Zhang
   Alibaba Cloud
   Alibaba Beijing Chaoyang Science & Technology Park
   Beijing
   100124
   China
   Email: shaokai.zsk@alibaba-inc.com


   Xing Li
   Alibaba Cloud
   Alibaba Beijing Chaoyang Science & Technology Park
   Beijing
   100124
   China
   Email: lixing.lix@aliyun-inc.com


   Jiangu Zhao
   Alibaba Cloud
   Alibaba Beijing Chaoyang Science & Technology Park
   Beijing
   100124
   China
   Email: jiangu.zjg@alibaba-inc.com