Examples in iproute2 distribution.
This chapter was written by Esteve Camps <[email protected]>.
First of all, it would be a great idea for you to read RFCs written about this (RFC2474, RFC2475, RFC2597 and RFC2598) at IETF DiffServ working Group web site and Werner Almesberger web site (he wrote the code to support Differentiated Services on Linux).
Dsmark is a queueing discipline that offers the capabilities needed in Differentiated Services (also called DiffServ or, simply, DS). DiffServ is one of two actual QoS architectures (the other one is called Integrated Services) that is based on a value carried by packets in the DS field of the IP header.
One of the first solutions in IP designed to offer some QoS level was the Type of Service field (TOS byte) in IP header. By changing that value, we could choose a high/low level of throughput, delay or reliability. But this didn't provide sufficient flexibility to the needs of new services (such as real-time applications, interactive applications and others). After this, new architectures appeared. One of these was DiffServ which kept TOS bits and renamed DS field.
Differentiated Services is group-oriented. I mean, we don't know anything about flows (this will be the Integrated Services purpose); we know about flow aggregations and we will apply different behaviours depending on which aggregation a packet belongs to.
When a packet arrives to an edge node (entry node to a DiffServ domain) entering to a DiffServ Domain we'll have to policy, shape and/or mark those packets (marking refers to assigning a value to the DS field. It's just like the cows :-) ). This will be the mark/value that the internal/core nodes on our DiffServ Domain will look at to determine which behaviour or QoS level apply.
As you can deduce, Differentiated Services involves a domain on which all DS rules will have to be applied. In fact you can think I will classify all the packets entering my domain. Once they enter my domain they will be subjected to the rules that my classification dictates and every traversed node will apply that QoS level.
In fact, you can apply your own policies into your local domains, but some Service Level Agreements should be considered when connecting to other DS domains.
At this point, you maybe have a lot of questions. DiffServ is more than I've explained. In fact, you can understand that I can not resume more than 3 RFCs in just 50 lines :-).
As the DiffServ bibliography specifies, we differentiate boundary nodes and interior nodes. These are two important points in the traffic path. Both types perform a classification when the packets arrive. Its result may be used in different places along the DS process before the packet is released to the network. It's just because of this that the diffserv code supplies an structure called sk_buff, including a new field called skb->tc_index where we'll store the result of initial classification that may be used in several points in DS treatment.
The skb->tc_index value will be initially set by the DSMARK qdisc, retrieving it from the DS field in IP header of every received packet. Besides, cls_tcindex classifier will read all or part of skb->tcindex value and use it to select classes.
But, first of all, take a look at DSMARK qdisc command and its parameters:
... dsmark indices INDICES [ default_index DEFAULT_INDEX ] [ set_tc_index ]What do these parameters mean?
indices: size of table of (mask,value) pairs. Maximum value is 2ˆn, where n>=0.
Default_index: the default table entry index if classifier finds no match.
Set_tc_index: instructs dsmark discipline to retrieve the DS field and store it onto skb->tc_index.
This qdisc will apply the next steps:
If we have declared set_tc_index option in qdisc command, DS field is retrieved and stored onto skb->tc_index variable.
Classifier is invoked. The classifier will be executed and it will return a class ID that will be stored in skb->tc_index variable. If no filter matches are found, we consider the default_index option to determine the classId to store. If neither set_tc_index nor default_index has been declared results may be unpredictable.
After been sent to internal qdiscs where you can reuse the result of the filter, the classid returned by the internal qdisc is stored into skb->tc_index. We will use this value in the future to index a mask- value table. The final result to assign to the packet will be that resulting from next operation:
New_Ds_field = ( Old_DS_field & mask ) | value
Thus, new value will result from "anding" ds_field and mask values and next, this result "ORed" with value parameter. See next diagram to understand all this process:
skb->ihp->tos - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > | | ^ | -- If you declare set_tc_index, we set DS | | <-----May change | value into skb->tc_index variable | |O DS field | A| |R +-|-+ +------+ +---+-+ Internal +-+ +---N|-----|----+ | | | | tc |--->| | |--> . . . -->| | | D| | | | | |----->|index |--->| | | Qdisc | |---->| v | | | | | |filter|--->| | | +---------------+ | ---->(mask,value) | -->| O | +------+ +-|-+--------------^----+ / | (. , .) | | | | ^ | | | | (. , .) | | | +----------|---------|----------------|-------|--+ (. , .) | | | sch_dsmark | | | | | +-|------------|---------|----------------|-------|------------------+ | | | <- tc_index -> | | | |(read) | may change | | <--------------Index to the | | | | | (mask,value) v | v v | pairs table - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -> skb->tc_index
How to do marking? Just change the mask and value of the class you want to remark. See next line of code:
tc class change dev eth0 classid 1:1 dsmark mask 0x3 value 0xb8This changes the (mask,value) pair in hash table, to remark packets belonging to class 1:1.You have to "change" this values because of default values that (mask,value) gets initially (see table below).
Now, we'll explain how TC_INDEX filter works and how fits into this. Besides, TCINDEX filter can be used in other configurations rather than those including DS services.
This is the basic command to declare a TC_INDEX filter:
... tcindex [ hash SIZE ] [ mask MASK ] [ shift SHIFT ] [ pass_on | fall_through ] [ classid CLASSID ] [ police POLICE_SPEC ]
Next, we show the example used to explain TC_INDEX operation mode. Pay attention to bolded words:
tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 set_tc_index tc filter add dev eth0 parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2 tc qdisc add dev eth0 parent 1:0 handle 2:0 cbq bandwidth 10Mbit cell 8 avpkt 1000 mpu 64 # EF traffic class tc class add dev eth0 parent 2:0 classid 2:1 cbq bandwidth 10Mbit rate 1500Kbit avpkt 1000 prio 1 bounded isolated allot 1514 weight 1 maxburst 10 # Packet fifo qdisc for EF traffic tc qdisc add dev eth0 parent 2:1 pfifo limit 5 tc filter add dev eth0 parent 2:0 protocol ip prio 1 handle 0x2e tcindex classid 2:1 pass_on(This code is not complete. It's just an extract from EFCBQ example included in iproute2 distribution).
First of all, suppose we receive a packet marked as EF . If you read RFC2598, you'll see that DSCP recommended value for EF traffic is 101110. This means that DS field will be 10111000 (remember that less significant bits in TOS byte are not used in DS) or 0xb8 in hexadecimal codification.
TC INDEX FILTER +---+ +-------+ +---+-+ +------+ +-+ +-------+ | | | | | | | |FILTER| +-+ +-+ | | | | | |----->| MASK | -> | | | -> |HANDLE|->| | | | -> | | -> | | | | . | =0xfc | | | | |0x2E | | +----+ | | | | | | | . | | | | | +------+ +--------+ | | | | | | . | | | | | | | | | -->| | . | SHIFT | | | | | | | |--> | | . | =2 | | | +----------------------------+ | | | | | | | | | CBQ 2:0 | | | | | +-------+ +---+--------------------------------+ | | | | | | | +-------------------------------------------------------------+ | | DSMARK 1:0 | +-------------------------------------------------------------------------+
The packet arrives, then, set with 0xb8 value at DS field. As we explained before, dsmark qdisc identified by 1:0 id in the example, retrieves DS field and store it in skb->tc_index variable. Next step in the example will correspond to the filter associated to this qdisc (second line in the example). This will perform next operations:
Value1 = skb->tc_index & MASK Key = Value1 >> SHIFT
In the example, MASK=0xFC and SHIFT=2.
Value1 = 10111000 & 11111100 = 10111000 Key = 10111000 >> 2 = 00101110 -> 0x2E in hexadecimal
The returned value will correspond to a qdisc internal filter handle (in the example, identifier 2:0). If a filter with this id exists, policing and metering conditions will be verified (in case that filter includes this) and the classid will be returned (in our example, classid 2:1) and stored in skb->tc_index variable.
But if any filter with that identifier is not found, the result will depend on fall_through flag declaration. If so, value key is returned as classid. If not, an error is returned and process continues with the rest filters. Be careful if you use fall_through flag; this can be done if a simple relation exists between values of skb->tc_index variable and class id's.
The latest parameters to comment on are hash and pass_on. The first one relates to hash table size. Pass_on will be used to indicate that if no classid equal to the result of this filter is found, try next filter. The default action is fall_through (look at next table).
Finally, let's see which possible values can be set to all this TCINDEX parameters:
TC Name Value Default ----------------------------------------------------------------- Hash 1...0x10000 Implementation dependent Mask 0...0xffff 0xffff Shift 0...15 0 Fall through / Pass_on Flag Fall_through Classid Major:minor None Police ..... None
This kind of filter is very powerful. It's necessary to explore all possibilities. Besides, this filter is not only used in DiffServ configurations. You can use it as any other kind of filter.
I recommend you to look at all DiffServ examples included in iproute2 distribution. I promise I will try to complement this text as soon as I can. Besides, all I have explained is the result of a lot of tests. I would thank you tell me if I'm wrong in any point.