Active Queue Management Wolfram Lautenschlaeger and Packet Scheduling (aqm) Nokia Internet Draft Bell Labs Intended status: Expires: November 2016 May 25, 2016 Global Synchronization Protection for Packet Queues draft-lauten-aqm-gsp-03.txt Abstract The congestion avoidance processes of transmission capacity sharing TCP flows tend to be synchronized among each other, so that the rate variations of the individual flows do not compensate. In contrary, they accumulate into large variations of the whole aggregate. The effect is known as global synchronization. Large queuing buffer demand and large latency and jitter are the consequences. Global Synchronization Protection (GSP) is an extension of regular tail drop packet queuing schemes that prevents global synchronization. For large traffic aggregates the de-correlation between the individual flow variations reduces buffer demand and packet sojourn time by an order of magnitude and more. Even though quite simple, the solution has a theoretical background and is not heuristic. It has been tested with a Linux kernel implementation and shows equivalent performance as other relevant AQM schemes. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt Lautenschlaeger Expires November 25, 2016 [Page 1] Internet-Draft Global Synchronization Protection May 2016 The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html This Internet-Draft will expire on November 25, 2016. Copyright Notice Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Table of Contents 1. Introduction...................................................2 2. Conventions used in this document..............................4 3. Root cause of global synchronization...........................4 4. Protecting queues of global synchronization....................5 4.1. Basic algorithm...........................................5 4.2. Interval adaptation at large flow numbers.................5 4.3. Interval adaptation at small RTT..........................7 4.4. Sanity checks and special cases...........................7 5. Delay based operation..........................................8 5.1. Queue delay vs. queue size................................8 5.2. Delay based GSP...........................................8 6. Security Considerations........................................9 7. IANA Considerations............................................9 8. Conclusions....................................................9 9. References.....................................................9 9.1. Normative References......................................9 9.2. Informative References....................................9 10. Acknowledgments..............................................10 1. Introduction The congestion window (CWND) of a particular TCP connection, in combination with the round trip time (RTT), limits the transmission rate of the flow, which enables adaptation of the sending rate to the actual network conditions, [1]. TCP uses a rather coarse congestion control feedback by halving the congestion window in response to Lautenschlaeger Expires November 25, 2016 [Page 2] Internet-Draft Global Synchronization Protection May 2016 packet loss. To fill a bottleneck link by 100% anyway, a packet buffer in front of the link is required. For a single TCP flow a buffer in the range of bottleneck capacity multiplied by the round trip time is required (bandwidth-delay product rule, BDP), [2]. For aggregated traffic of many flows the picture is not so clear. Conservative estimations tend towards BDP of the whole aggregate, i.e. link capacity * RTT. At the other hand, rate reductions due to CWND halving are still only in the range of a particular flow rate. With the assumption of N sharing flows, this yields ideally a buffer size of only (link capacity/N)*RTT. Unfortunately this value cannot be reached in practice. It would require a uniform distribution of rate reductions by the different flows over time. In opposite, rate reductions of bottleneck sharing flows tend to synchronize among each other, which is called global synchronization. In worst case, with all flows synchronized, the buffer demand is back at BDP of the whole traffic, thus confirming the conservative estimation. There are cases where global synchronization does not occur, in particular large number of flows (N>500), large spread of RTT between the different flows, and high frequency of flow renewals. In these cases the buffer size can be reduced to BDP/sqrt(N), which lies between the conservative and overly optimistic estimations above,[3]. Nevertheless there are still doubts, whether the absence of global synchronization is a general reliable design assumption for high capacity links, [4]. Most Active Queue Management (AQM) algorithms are aiming at better control of the queue size (RED [5]) or the queue delay (CoDel [6], PIE [7]), which implies control over global synchronization. Global Synchronization Protection (GSP [10]) goes the other way round. It suppresses the root cause of global synchronization and de-correlates the CWND variations of the competing flows, but it does not try to impact the behavior of a particular flow. This way it moves the buffer size demand down from conservative BDP of the whole link into the direction towards the ideal BDP of only one of the competing flows. Experiments on GSP performance, as reported in [10], show that the stabilizing effect of GSP is equivalent to that of the other AQMs. It is a simple extension to plain tail drop queues. The basic algorithm is memoryless and does not need artificial randomization. Particularly for small numbers of flows it performs better than randomized AQMs like e.g. RED or PIE. Without any special precautions GSP is robust to bursts and never drops at low or empty queues. Automatic parameter adaptation, if needed at all, is external to the basic control loop, which makes it less time critical and less Lautenschlaeger Expires November 25, 2016 [Page 3] Internet-Draft Global Synchronization Protection May 2016 sensitive to traffic conditions than other AQMs with integrated control and adaptation loops. 2. Conventions used in this document In this document, the term "packet drop" is used for congestion notification, silently assuming that congestion marking for ECN could be equally applied. In this document, the term "queue size" is preferably applied in number of bytes, however, the algorithm could be also applied to the number of packets, or even to the queuing delay (milliseconds). 3. Root cause of global synchronization Global synchronization occurs in cases where a number of greedy TCP flows with comparably uniform RTT cross a tail drop queue in front of a shared transmission link. Greedy TCP means, the flow is probing the available capacity on this particular link and is not limited elsewhere (up- or downstream). Tail drop means, a newly arriving packet is placed at the end of the queue if buffer space permits. Otherwise it is dropped. The queue is drained from head of the queue at the speed of the link as long as packets are available. In congestion avoidance state, all senders gradually increase their sending rate, which is, after a while, exceeding the link capacity so that the queue in front of the link is filling up. At some point in time, a first packet is dropped due to lack of buffer space. Ideally, the TCP flow, where the dropped packet belongs to, reduces its sending rate, the queue relaxes, its size goes down, and subsequently arriving packets again can be placed in the buffer. Senders continue to increase their sending rates until the next drop, and so on. Unfortunately not one, but several packets get dropped in such incident for following reason: The rate reduction due to the first dropped packet needs at least one RTT to take effect at the queue entry. During that RTT interval all senders continue to gradually increase their sending rates, whereas the queue is still full. Further packets need to be dropped. It can be shown analytically that for N flows with NewReno and delayed ACK the number of drops is in the range of N/2. Experiments confirm this and show an even higher number with CUBIC. The outcome is that even though the rate reduction by one flow would suffice, not one, but as much as half of the flows are triggered within one RTT to reduce their sending rates - we have global synchronization. A more detailed analytical and experimental investigation of the effect can be found in [8]. Lautenschlaeger Expires November 25, 2016 [Page 4] Internet-Draft Global Synchronization Protection May 2016 4. Protecting queues of global synchronization 4.1. Basic algorithm The basic algorithm works as follows: Set a threshold on queue size below the actual buffer size. If a new packet arrives and the queue size is above the threshold, then immediately drop that packet. After that, ignore any further threshold violation for a timeout interval of 1 - 3 RTT. After expiry of the timeout proceed as above. Algorithm: initialization: interval = e.g. 2 * RTT threshold = e.g. 1/2 * buffer size timeout_expiry = now(), with now() returning the current time at any packet arrival do: if queue_size > threshold && now() > timout_expiry: drop this packet timeout_expiry = now() + interval else enqueue this packet The first dropped packet is triggering the rate reduction by one of the end points. During the timeout the queue is growing further beyond the threshold until the rate reduction takes effect at queue entry. Afterwards the queue size should have dropped below the threshold, so that at timeout expiry the threshold is typically not violated anymore. No explicit action occurs at timeout expiry, which makes the parameter rather insensitive to the actual traffic characteristics. Even if the timeout interval is too short, the algorithm still reduces global synchronization. 4.2. Interval adaptation at large flow numbers The basic algorithm works well for moderate numbers of flows N, i.e. in a range of 1 < N < 20. More precisely, at flow numbers N smaller than the average CWND of one of the sharing flows. At larger numbers the total rate increase during the timeout interval is larger than the subsequent rate reduction by one of the flows. As consequence, after timeout expiry the threshold is still violated, the queue is growing further and further, and, eventually, reaches the buffer limit and enters tail drop operation. The performance is still better Lautenschlaeger Expires November 25, 2016 [Page 5] Internet-Draft Global Synchronization Protection May 2016 than plain tail drop and one could rely on the observation that at large flow numbers global synchronization disappears, anyway. Alternatively the initial timeout interval can be reduced, depending on the actual traffic, in a way, where not just once, but twice, or even more times per RTT the timeout expires. The adaptation criterion is the proportion of time above and below threshold. In regular operation according to the basic algorithm, the queue is most of the time below the threshold. If, however, the queue is more frequently above than below threshold, the interval should be reduced until equilibrium is reached. In this condition the queue is oscillating around the threshold, periodically dropping during times above the threshold, quite similar like PED [9]. Algorithm: initialization: tau = a preset parameter controlling the adaptation speed; e.g. 500 milliseconds or 5 * initial_interval initial_interval - the preset timeout interval as in the basic algorithm cumTime = 0; the cumulative time above/below threshold state = CLEAR; the recent overflow state of the queue at any packet arrival do: if the packet would overflow the buffer (hard tail drop): state = OVERFLOW if state == OVERFLOW && queue is empty: state = DRAIN if state == DRAIN && queue_size > threshold: state = CLEAR update the cumulative time cumTime: account by twice the duration for queue episodes that are entirely ABOVE the threshold if status == CLEAR: account by negated duration for queue episodes that are entirely BELOW the threshold clamp the cumulative time: cumTime = max(cumTime, 0) cumTime = min(cumTime, some sanity limit of several minutes) Lautenschlaeger Expires November 25, 2016 [Page 6] Internet-Draft Global Synchronization Protection May 2016 calculate timeout interval (to be used at next drop decision): interval = initial_interval/(1+cumTime/tau) The adaptation heuristics works as follows: The basic GSP algorithm executes single drops per threshold violation as long as the cumulative time (cumTime) is clamped at zero. As soon as the cumulative time gets positive, the adaptation algorithm implements an integral controller for the drop rate that the basic GSP algorithm executes during times of threshold violation. The transition between both operating conditions is smooth. The adaptation speed is controlled by the parameter tau. Plain adaptation by cumulative times above / below threshold might latch up in circumstances where abrupt traffic changes cause massive buffer overflows. To avoid this, after a hard buffer overflow the accounting for times below threshold is suspended until the queue performed a full cycle of down to empty and back above threshold. 4.3. Interval adaptation at small RTT The RTT is not known exactly but there should be at least a rough idea on the range of RTT for setting up the timeout interval. If this estimation is much too large, a similar situation occurs like in the large flow numbers case. The total rate increase during the timeout interval (which turns out to be multiple RTTs) is larger than the subsequent rate reduction by one flow. The adaptation rule is the same as for large flow numbers, section 4.2. 4.4. Sanity checks and special cases An additional rule can be introduced that prevents large packet bursts from immediately triggering the drop: Restart the timeout not only after a packet drop but also whenever a packet is arriving at an empty queue. at any packet arrival do: if queue is empty: timeout_expiry = now() + interval Lautenschlaeger Expires November 25, 2016 [Page 7] Internet-Draft Global Synchronization Protection May 2016 5. Delay based operation 5.1. Queue delay vs. queue size Recent new AQM proposals ([6], [7]) are focusing on queue delay rather than on queue size in bytes. One reason for this move is that ideally the steady state queue oscillation depends only on the RTT and on the number of sharing TCP flows - if measured in delay. If measured in bytes, however, the queue oscillation additionally depends on the link capacity. The oscillation span sets the minimum queue size for 100% link utilization. (This is where the bandwidth delay product rule comes from.) A larger queue creates only unnecessary delay (standing queue). Obviously it is preferable to stabilize the delay instead of the size. It eliminates the interface rate from the parameter list, which is particularly welcome in circumstances with unknown or variable drain rate. Such situations are typical for low priority queues in front of a priority scheduler and generally in wireless scenarios. 5.2. Delay based GSP In section 4. we silently assumed queue size in bytes. However, the algorithm can be equally applied to the queue delay (packet sojourn time). In this case the threshold has to be in milliseconds, whereas the empty queue condition remains the same as before. While the queue size in bytes or packets is typically maintained by ordinary queue implementations, obtaining the queue delay requires additional effort. Two solutions are available and both are applicable to GSP: Time stamping of packets like in CoDel [6] or estimating the drain rate for a translation of size into delay like in PIE [7]. Algorithm (time stamping): at any arrival of a packet p do: p.time = now() at any departure of a packet p do: queue_delay = now() - p.time The basic algorithm of section 4.1. rephrased to delay based operation: Lautenschlaeger Expires November 25, 2016 [Page 8] Internet-Draft Global Synchronization Protection May 2016 at any packet arrival do: if queue_delay > threshold && now() > timout_expiry: drop this packet timeout_expiry = now() + interval else enqueue this packet Please note that queue_delay is a per queue variable, not per packet, i.e. the drop decision at enqueuing (tail drop) depends on the delay that another, most recently dequeued packet experienced. This approach is a forecast of the expected delay and it is justified by the inherent inertance of the queue itself. 6. Security Considerations Global synchronization is a particular problem of many elastic flows sharing a bottleneck. GSP is there to prevent this. But it does not protect of unresponsive flows. If the congestion notification according to section 4.1. randomly hits an unresponsive flow then the expected rate reduction within the timeout interval might simply not happen, which postpones the notification by one timeout interval. The interval adaptation of section 4.2. automatically ensures that the timeout interval is reduced accordingly, depending on the average fraction of unresponsive traffic. In extreme cases, when the unresponsive traffic alone is exceeding the link capacity, GSP behaves like plain tail drop. 7. IANA Considerations There are no actions for IANA. 8. Conclusions tbc 9. References 9.1. Normative References 9.2. Informative References [1] Van Jacobson, Congestion avoidance and control, Proc. SIGCOMM '88, 1988 Lautenschlaeger Expires November 25, 2016 [Page 9] Internet-Draft Global Synchronization Protection May 2016 [2] M. Mathis, J. Semke, J. Mahdavi, and T. Ott, The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm, Comput. Commun. Rev., 27.3, 1997, pp. 67-82. [3] G. Appenzeller, I. Keslassy, and N. McKeown, Sizing router buffers, Proc. ACM SIGCOMM '04, 2004. [4] G. Vu-Brugier, R. S. Stanojevic, D. J. Leith, R. N. Shorten, A critique of recently proposed buffer-sizing strategies, ACM SIGCOMM Computer Communication Review, 37.1, 2007 [5] S. Floyd, Van Jacobsen, Random Early Detection Gateways for Congestion Avoidance, IEEE/ACM Trans. Networking, 1.4, 1993 [6] K. Nichols, Van Jacobson, "Controlling Queue Delay", ACM Queue - Networks, 2012 [7] R. Pan, P. Natarajan, C. Piglione, M.S. Prabhu, V. Subramanian, F. Baker, B. VerSteeg, PIE: A lightweight control scheme to address the bufferbloat problem, 14th High Performance Switching and Routing (HPSR), 2013 IEEE [8] W. Lautenschlaeger, A deterministic TCP bandwidth sharing model, 2014, online http://arxiv.org/pdf/1404.4173v1 [9] A. Francini, Beyond RED: Periodic Early Detection for On-Chip Buffer Memories in Network Elements, Proc. IEEE High- Performance Switching and Routing Conference (HPSR 2011), Cartagena, Spain, July 4-6, 2011 [10] W. Lautenschlaeger, A. Francini, Global Synchronization Protection for Bandwidth Sharing TCP Flows in High-Speed Links, IEEE HPSR 2015, Budapest, Hungary, July 2015, online http://arxiv.org/pdf/1602.05333 10. Acknowledgments This document was prepared using 2-Word-v2.0.template.dot. Lautenschlaeger Expires November 25, 2016 [Page 10] Internet-Draft Global Synchronization Protection May 2016 Authors' Addresses Wolfram Lautenschlaeger Alcatel-Lucent Deutschland AG Bell Labs Lorenzstrasse 10 70435 Stuttgart Germany Email: Wolfram.Lautenschlaeger@nokia.com Lautenschlaeger Expires November 25, 2016 [Page 11]