Measuring maximum effective throughput of WAN links

Carter Bullard carter at qosient.com
Tue Jun 3 17:19:42 EDT 2014


Hey Ruven,
So these are a much different set of questions altogether, and graphing 5-minute averages isn't going to get you there.

You are interested in the instantaneous demand on the path, and in the nature of the competition for that instantaneous demand.  The time domain for this type of analysis is the 0.1-1.0 second range, not the 5-minute range.

Is A getting all of the instantaneous bandwidth that is available on an arbitrary path?
That is not a trivial analysis.
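
If you want to look at load at that granularity, rabins() can carve the argus data into fixed time bins. As a rough, untested sketch (the exact -M and -m spellings are in rabins.1, and 1 second is just an example bin size):

    rabins -R . -M hard time 1s -m srcid -s stime dur load

That would aggregate everything the probe saw in each hard 1-second bin and print the bin start time, duration, and aggregate load in bits per second.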

You will want to look at tools like rahisto() to get a better view of the demand/performance behavior than simple time-series trending graphs can give you.

Just trying to understand the available bandwidth trends is a pretty interesting awareness problem.  We do it with highly granular argus data, and we classify it based on destination, as each path has different characteristics.

But just graphing all of the data isn't going to give you the answers you want.  Try tools like rahisto().

Here is rahisto() output for a whole day's worth of TCP flow status reports, here at QoSient World HQ.  It shows the frequency distribution, on a logarithmic scale, of the instantaneous load of TCP connections that had more than 10 packets during the flow status interval; the -H load 20L:0-50M option asks for 20 logarithmically spaced load bins spanning 0-50 Mbps.  We generate flow status records every 5 seconds, so we're analyzing each 5-second interval of flow activity.  You want to tally only records that have a reasonable number of packets in them, to remove artifacts from wireline bursting.
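
For reference, the status interval is set on the sensor side; assuming a stock argus.conf, the relevant line would look something like this (5 seconds is just what we happen to run):

    ARGUS_FLOW_STATUS_INTERVAL=5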

thoth:03 carter$ rahisto -R . -H load 20L:0-50M -s mean:16 ploss:14  -- tcp and pkts gt 10
 N = 16996   mean =  210899.281250  stddev = 1106422.625000  max = 31356940.000000  min = 3562.766357
           median =   63132.775391     95% = 455102.531250
 Class      Interval         Freq    Rel.Freq     Cum.Freq                Mean          pLoss 
     1    1.000000e+00          0     0.0000%      0.0000%    
     2    2.426322e+00          0     0.0000%      0.0000%    
     3    5.887040e+00          0     0.0000%      0.0000%    
     4    1.428386e+01          0     0.0000%      0.0000%    
     5    3.465724e+01          0     0.0000%      0.0000%    
     6    8.408964e+01          0     0.0000%      0.0000%    
     7    2.040286e+02          0     0.0000%      0.0000%    
     8    4.950391e+02          0     0.0000%      0.0000%    
     9    1.201124e+03          0     0.0000%      0.0000%    
    10    2.914315e+03         28     0.1647%      0.1647%         5371.159180       0.000000
    11    7.071068e+03        880     5.1777%      5.3424%        12711.680664       0.008106
    12    1.715669e+04       1074     6.3191%     11.6616%        27777.550781       0.028428
    13    4.162766e+04      11388    67.0040%     78.6656%        62848.570312       0.005226
    14    1.010021e+05       2207    12.9854%     91.6510%       131986.562500       0.067182
    15    2.450637e+05        750     4.4128%     96.0638%       385810.000000       0.599786
    16    5.946036e+05        324     1.9063%     97.9701%       902760.375000       1.409191
    17    1.442700e+06        167     0.9826%     98.9527%      2305591.500000       0.973604
    18    3.500455e+06        112     0.6590%     99.6117%      5190583.000000       0.384768
    19    8.493232e+06         52     0.3060%     99.9176%     12485728.000000       0.603903
    20    2.060732e+07         14     0.0824%    100.0000%     24197996.000000       0.483406


So the result from this is that, regardless of destination or service, the maximum bandwidth (in + out) available is 31 Mbps, but we rarely get there.  The normal instantaneous demand per flow is low: the median is 63 Kbps, with the mean running around 211 Kbps.  This is due to a lot of things, but primarily application demand and distance: a lot of polling for email, notification pushing, etc.  I don't run test flows; there's plenty on the wire already to tell me how things are.

Of the 3% or so of flows that get up there, some are definitely experiencing packet loss, so congestion does have an impact on 6-7% of the traffic.  If we look at just the flows between 1-50M, you see that loss does play a role, and you would have to break it down by destination to figure out its significance.  When demand climbs, loss comes into play, and loss has an interesting relationship with load: it increases and then decreases as load climbs, because you just can't go fast if you have loss.

thoth:03 carter$ rahisto -R . -H load 20L:1-50M -s mean:16 ploss:14  -- tcp and pkts gt 10
 N = 453     mean =  4598764.500000  stddev =  5070832.000000  max = 31356940.000000  min = 1000291.375000
           median =  2646114.250000     95% = 15621540.000000
 Class      Interval         Freq    Rel.Freq     Cum.Freq                Mean          pLoss 
     1    1.000000e+06         65    14.3488%     14.3488%      1100025.250000       0.375940
     2    1.216042e+06         47    10.3753%     24.7241%      1343241.125000       2.873105
     3    1.478758e+06         35     7.7263%     32.4503%      1634416.125000       0.391892
     4    1.798231e+06         40     8.8300%     41.2804%      2001053.500000       1.094470
     5    2.186724e+06         44     9.7130%     50.9934%      2435435.500000       0.644238
     6    2.659148e+06         32     7.0640%     58.0574%      2946525.000000       1.713742
     7    3.233635e+06         33     7.2848%     65.3422%      3589139.250000       0.280101
     8    3.932235e+06         38     8.3885%     73.7307%      4258922.500000       0.450252
     9    4.781762e+06         17     3.7528%     77.4834%      5340447.000000       0.308536
    10    5.814823e+06         19     4.1943%     81.6777%      6345330.000000       0.223294
    11    7.071068e+06         18     3.9735%     85.6512%      7712374.500000       0.464576
    12    8.598714e+06         15     3.3113%     88.9625%      9426358.000000       0.645439
    13    1.045640e+07         16     3.5320%     92.4945%     11556370.000000       0.504574
    14    1.271541e+07         11     2.4283%     94.9227%     14028866.000000       0.497002
    15    1.546247e+07          8     1.7660%     96.6887%     17613572.000000       1.132233
    16    1.880302e+07          7     1.5453%     98.2340%     21075482.000000       0.342511
    17    2.286525e+07          7     1.5453%     99.7792%     25578700.000000       0.681431
    18    2.780510e+07          1     0.2208%    100.0000%     31356940.000000       0.000000
    19    3.381217e+07          0     0.0000%    100.0000%    
    20    4.111701e+07          0     0.0000%    100.0000%


Now this doesn't tell you anything about competition, etc., but it does tell you a little about what is going on.
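
As a rough sketch of the per-destination breakdown I mentioned above (untested, and the address below is just a placeholder for one of your destination networks), the same histogram can be narrowed with an ordinary ra filter:

    rahisto -R . -H load 20L:1-50M -s mean:16 ploss:14 -- tcp and pkts gt 10 and dst net 192.0.2.0/24

Run that for each destination of interest and compare the load and pLoss columns; paths that never fill their upper bins, or that show loss well before they do, are the ones worth digging into.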

A practical analysis will be much more complicated than this, but hopefully it shows that you may want to do something a bit more than a set of time-series graphs.

I do this type of work for hire, so if it's serious, give me a holler!

Carter



On Jun 3, 2014, at 3:24 PM, Ruven Gottlieb <ruven.gottlieb at gmail.com> wrote:

> On 6/2/14, Carter Bullard <carter at qosient.com> wrote:
>> Hey Ruven,
>> If your observation domain is say an interface, or a link,
> 
> Yes, it's an interface: the one for the LAN.
> ...
> 
>> You should be able to use the load rate, or whatever. Check out the manpage
>> for rabins.1.
> 
> OK. Now, say I want to produce a graph showing my data rate throughout
> the day, with 5-minute resolution. How do I show which maxima are
> caused by the DSL or Fios line saturating (or, more likely, hitting
> its rate cap) or being slowed by congestion upstream, and which maxima
> are merely due to users not pushing much data over the wire, i.e. the
> line could do 6 Mb/sec but the users are only using about 4 Mb/sec?
> 
> Our typical case is that we have a DSL line that is supposed to
> deliver a maximum of 6 Mb/sec. Sometimes we do speed tests and see
> only, e.g., 2 or 3 Mb/sec. We want a consistent way of monitoring
> our sites to see if and when we aren't getting the quality of
> service we pay for.
> ...
> 
>> ragraph() can also generate a graph for you, but that requires a little bit
>> more, such as having rrd_tool on the machine etc….
> 
> That's fine. I can add it.
> 
> Thanks,
> 
> Ruven Gottlieb
> 
