p2002_04_417: Talk for Scalable QoS Workshop

I was invited to be a panel speaker at the NSF ITR Workshop on Scalable QoS Solutions, held in Annapolis, MD, April 15th-16th, 2002. The panel in question was "Provider Operational and Deployment Issues", and I represented the perspective of an operator of a University campus network and gigaPoP.

I submitted the following position paper for the panel (my slides were based on this paper):

Position Statement on QoS and Deployment Issues
University of Pennsylvania and the MAGPI GigaPoP
April 15th, 2002
Author: Shumon Huque, shuque@isc.upenn.edu
NSF ITR QoS Workshop, April 2002
Session I: Provider Operational and Deployment Issues

Some University researchers want QoS: for example, people doing research into QoS itself, and people with research applications that require stricter guarantees on latency, bandwidth and jitter.

University and GigaPoP networking staff are interested in some form of QoS to manage exploding bandwidth needs and to enable new classes of applications. Specifically, they would like to be able to (1) prioritize certain mission-critical traffic and delay-sensitive applications like VoIP, and (2) run non-mission-critical traffic (resnet dorm traffic, peer-to-peer recreational file sharing applications, long-running bulk transfers, etc.) at lowered priority.

We're interested in deploying something simple, relatively coarse-grained and scalable. So we're probably looking at the IETF's emerging DiffServ architecture for L3 QoS in the campus routing core and the gigapop network. A large portion of the campus network is switched ethernet, and it seems likely that we'd employ IEEE 802.1p priority there to ensure the appropriate forwarding treatment at the link layer. We thus also need a scheme to map L3 QoS markings to L2 and vice versa. Signalling and admission control are still open issues for us.
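
To make the L3-to-L2 mapping concrete, here is a minimal sketch in Python. The 802.1Q/p header carries only 3 priority bits, so the 6-bit DSCP space has to be collapsed; the specific value assignments below are illustrative assumptions, not a deployed configuration.

    # Illustrative DiffServ (L3) to IEEE 802.1p (L2) priority mapping.
    # The value assignments are assumptions for illustration only.

    DSCP_TO_PCP = {
        46: 5,   # EF (expedited forwarding), e.g. VoIP
        0:  0,   # BE (best effort), the default
        8:  1,   # LBE/scavenger; 802.1p priority 1 is "background",
                 # which 802.1D ranks below best effort (priority 0)
    }

    def dscp_to_pcp(dscp):
        """Map a 6-bit DSCP to a 3-bit 802.1p user priority."""
        return DSCP_TO_PCP.get(dscp, 0)   # unknown markings go to BE

    def pcp_to_dscp(pcp):
        """Reverse mapping, for traffic entering the L3 core from L2."""
        return {5: 46, 1: 8}.get(pcp, 0)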

The types of diffserv forwarding behavior we're most interested in are: EF (expedited forwarding) for the highest priority application traffic, BE (best effort) for most normal traffic, and a less-than-best-effort (LBE) service class, like the QBone Scavenger Service (QBSS) currently in use at Internet2, for the lower priority non-mission-critical traffic. We might also be interested in the low-delay form of the ABE (Alternative Best Effort) service for some applications like video conferencing, which are delay sensitive but tolerant of some loss.
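
A toy sketch of the per-class forwarding treatment we have in mind, assuming a simple strict-priority ordering (a real router implements this in hardware, and production scavenger deployments reserve a small minimum share so LBE traffic is never completely starved):

    from collections import deque

    # DSCP values: 46 is the standard EF codepoint, 0 is best effort,
    # and the QBone Scavenger Service uses 8.
    EF, BE, LBE = 46, 0, 8

    queues = {EF: deque(), BE: deque(), LBE: deque()}

    def enqueue(packet, dscp):
        # Unrecognized markings fall back to best effort.
        queues.get(dscp, queues[BE]).append(packet)

    def dequeue():
        """Strict priority: EF first, then BE, then scavenger."""
        for dscp in (EF, BE, LBE):
            if queues[dscp]:
                return queues[dscp].popleft()
        return None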

We're not terribly optimistic about being able to do inter-domain QoS over the commercial Internet anytime soon. While it's true that some ISPs now have preliminary QoS offerings, the situation is complicated by the presence of many provider networks between the endpoints, the need for them to be running interoperable QoS implementations, and the need for mechanisms to request QoS reservations across multiple administrative domains, possibly leading to very complicated peering and settlement arrangements. As far as we know, no such mechanisms exist today.

We're more optimistic about inter-domain QoS across R&E networks like Internet2. In this case, there are typically only one or a few QoS-enabled backbone networks (like Abilene) in the path. There is usually an agreed-upon QoS architecture, and even some basic resource provisioning procedures in place. And last but certainly not least, there is clear demand from researchers to use Internet2 QoS.

So, we'd like to at least facilitate end-to-end QoS experimentation across the Internet2 backbone. But even this is proving to be quite tough. One of the problems is that, even though Internet2 is a research network, we're not using it exclusively for research. Universities typically use it to transport all production traffic between them, not just traffic associated with "meritorious research applications", which was one of the original ideas behind many R&E networks. Furthermore, many Universities connect to the Internet2 backbone via a regional aggregation network called a GigaPoP. Often the GigaPoP offers both Internet2 and commodity Internet connectivity to the university. So the GigaPoP is a production network, and any architectural changes we make to the network to facilitate QoS cannot be allowed to threaten the performance of existing production traffic.

Router code that provides many of the features needed to enable QoS (e.g. support for diffserv-based marking, remarking, prioritization, policing, appropriate queue scheduling disciplines, etc.) has often been available only in experimental code trains or cutting-edge releases that are unsuitable for deployment on a production network. There are often not enough queues per output port to really support large-scale service differentiation. Sometimes the queueing discipline required to implement a certain service class characteristic is too computationally intensive to implement in hardware, so only software implementations are available; if those are used, they can dramatically reduce the overall performance of the router. And often you can't even run certain queueing algorithms on really high speed interfaces. Of course, we hope this situation will improve in the future as more of the existing code becomes better tested, as newer code is implemented in ASICs, and as we gain operational experience using these features in real networks.
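
To make one of these functions concrete, here is a minimal sketch of a token-bucket policer of the sort a router might apply to an EF aggregate at an ingress interface; the rate and burst figures are illustrative assumptions:

    import time

    class TokenBucketPolicer:
        def __init__(self, rate_bps, burst_bytes):
            self.rate = rate_bps / 8.0    # refill rate, bytes/second
            self.depth = burst_bytes      # maximum bucket depth
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def conforms(self, packet_len):
            """True if the packet conforms to the profile. Out-of-profile
            EF traffic would typically be dropped or remarked to BE."""
            now = time.monotonic()
            self.tokens = min(self.depth,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if packet_len <= self.tokens:
                self.tokens -= packet_len
                return True
            return False

    # e.g. police an EF aggregate to 10 Mb/s with a 15 KB burst:
    policer = TokenBucketPolicer(rate_bps=10_000_000, burst_bytes=15_000)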

Providing a parallel networking infrastructure that would allow us to deploy QoS-enabled routers with experimental code and features, without impacting the production network, is cost prohibitive. Researchers would love to have access to such an infrastructure, of course!

Many QoS policy issues remain to be worked out. Where does initial QoS marking occur? Who's allowed to mark, and how do we validate the markings? A QoS provisioning policy needs to be developed, and mechanisms need to be deployed to manage and enforce this policy: policy servers, bandwidth brokers, etc. Cryptographic authentication and verification of resource allocation requests will be necessary. All of this adds significant complexity to the network infrastructure, and it's going to take time and significant work to figure out how to do these things, and how to do them correctly.
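
As one small example of what enforcement at a trust boundary might look like: packets claiming EF that don't come from an authorized source get remarked to best effort. The prefix list below is an illustrative placeholder, not our actual policy.

    import ipaddress

    EF, BE = 46, 0

    # Hypothetical list of subnets authorized to send EF-marked traffic.
    AUTHORIZED_EF_SOURCES = [ipaddress.ip_network("192.0.2.0/24")]

    def validate_marking(src_ip, dscp):
        """Return the DSCP the packet should carry past the boundary."""
        if dscp == EF:
            src = ipaddress.ip_address(src_ip)
            if any(src in net for net in AUTHORIZED_EF_SOURCES):
                return EF
            return BE   # unauthorized EF marking: remark to best effort
        return dscp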

There are currently no suitable mechanisms available for end-to-end inter-domain signalling of QoS reservations and call admission control. Research work with bandwidth brokers and an interdomain signalling protocol called SIBBS (Simple Interdomain Bandwidth Broker Signalling) is ongoing, but no production-worthy implementations exist today.

In the face of all these challenges, what do we do today to facilitate researchers doing wide area QoS experiments? Although the production network does not include diffserv-enabled routers, we make a conscious effort not to impede such experiments. We provide the researchers with an uncongested path through the campus/gigapop network to the QoS-enabled Abilene backbone, and we make sure that intervening routers are not marking or remarking the DSCP field in packet headers. University researchers have been involved in the early Internet2 QoS research efforts (QBone, Abilene Premium Service, etc.).
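
One simple way to verify this DSCP transparency is to send probe packets with a known marking and inspect what actually arrives at the far end (e.g. with tcpdump). A minimal sender sketch in Python; the receiver host and port are placeholders, and setting IP_TOS this way is platform-dependent:

    import socket

    EF_DSCP = 46

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # The DSCP occupies the upper 6 bits of the (former) IPv4 TOS byte.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF_DSCP << 2)
    sock.sendto(b"dscp-probe", ("probe-receiver.example.net", 9999))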

Aside from research applications, we do have a real need for bandwidth management. Penn, like other Universities, has been experiencing explosive growth in Internet bandwidth consumption, and it's clear that a lot of the growth comes from peer-to-peer file sharing applications. We've addressed the increased consumption in a number of ways. The campus and gigapop networks have been overprovisioned (actually "adequately" provisioned). To relieve clogged commodity Internet pipes, we dramatically ramped up our commodity Internet bandwidth. Where appropriate, we've employed rate limiting to target the community that was consuming a disproportionate share of our external bandwidth. This scheme is working adequately, but we are also looking into whether or not it will be useful to employ lightweight QoS, like the scavenger service, to mark residential network traffic for degraded treatment over Internet2.
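
The scavenger approach would amount to a simple classification step at the gigapop border: traffic sourced from residential prefixes and headed to Internet2 gets remarked to the QBSS codepoint, so it yields to other traffic under congestion. A sketch, with a placeholder prefix list:

    import ipaddress

    QBSS = 8   # QBone Scavenger Service DSCP

    # Hypothetical residential-network prefixes.
    RESNET_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]

    def classify(src_ip, dscp):
        src = ipaddress.ip_address(src_ip)
        if any(src in net for net in RESNET_PREFIXES):
            return QBSS   # degraded treatment under congestion
        return dscp       # leave everything else untouched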

As we deploy high performance network infrastructures, we've frequently found that network applications are unable to use the available bandwidth because of problems on the endstations themselves: poorly designed applications and application-layer protocols, inefficient network protocol stacks, duplex mismatch, MTU mismatch, etc. Having a QoS-enabled network infrastructure does not help address this very common class of performance problems. So endstation tuning issues need to be solved in addition to any QoS mechanisms we investigate for deployment.
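
One of the most common endstation fixes is sizing TCP socket buffers to the path's bandwidth-delay product; without that, a single transfer is capped at roughly window/RTT no matter how much bandwidth (or QoS) the network provides. A sketch with illustrative numbers:

    import socket

    bandwidth_bps = 100_000_000   # assume a 100 Mb/s path
    rtt_seconds = 0.070           # assume a 70 ms round-trip time
    bdp_bytes = int(bandwidth_bps / 8 * rtt_seconds)   # ~875 KB

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)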

In summary, we're interested in QoS and think it has the potential to solve some problems for us. It's probably too early to deploy many reservation-based forms of QoS in production networks. Intra-domain QoS is a near-term possibility, where we have greater control over the entire network. Inter-domain QoS still looks very difficult, because of the complicated signalling and SLA issues.

Shumon Huque
Email: shuque@isc.upenn.edu or shuque@magpi.net
University of Pennsylvania and the MAGPI GigaPoP


Maintained by Shumon Huque
ISC Network Engineering and Services
Last Updated: 2002-04-22