[Argus] Re: Packet Loss with racluster

Carter Bullard carter at qosient.com
Tue Mar 18 13:38:46 EDT 2008


Hey Nick,
When you merge two records together, the aggregation engine goes through
each DSR (data specific record) in the two argus records and compares
them for applicability, consistency, etc.  If the two corresponding DSRs
are incompatible, the aggregation engine will simply throw that DSR away.

All the TCP information (base sequence numbers, acks, roundtrip times,
window sizes, retransmissions, etc.) is contained in the
ARGUS_NETWORK_DSR, which holds protocol specific information.
If you merge an ICMP flow with a TCP flow, the aggregator just tosses
the ARGUS_NETWORK_DSR away, because the two DSRs have different
meanings and are not compatible, and you lose the TCP specific
information.  This can happen when the flow key is just "-m saddr daddr",
and so flows between A and B, regardless of protocol, get merged together.

If one argus record has, say, ethernet addresses in the ARGUS_MAC_DSR,
but the other record to be merged doesn't have an ARGUS_MAC_DSR, for
whatever reason, we'll toss the ARGUS_MAC_DSR when generating the
resultant merged record.  Now, there are conditions where we are
"preserving" and we keep the DSR, rather than throw it away.  This
happens with ARGUS_AGR_DSRs.  Each DSR has its own set of rules for
what to do.
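
To make the drop rules above concrete, here is a rough Python sketch.
It is a toy model only, not the argus source: the dict layout, the
"kind" field and the merge_dsrs() helper are all made up for
illustration, and the exact condition under which a DSR is "preserved"
is an assumption.

```python
# Toy model only: real argus records are binary structures, and each DSR
# has its own merge rules. The dict layout and field names are assumptions.
PRESERVED = {"ARGUS_AGR_DSR"}  # assumed: kept even if only one record has it


def merge_dsrs(rec_a, rec_b):
    """Return the DSRs that survive merging two toy records."""
    merged = {}
    for name in sorted(rec_a.keys() | rec_b.keys()):
        a, b = rec_a.get(name), rec_b.get(name)
        if a is None or b is None:
            # Present in only one record: tossed unless it is preserved.
            if name in PRESERVED:
                merged[name] = dict(a if a is not None else b)
            continue
        if a["kind"] != b["kind"]:
            # e.g. TCP vs ICMP network DSRs: different meanings, so drop it.
            continue
        merged[name] = dict(a)  # placeholder for the real per-DSR merge rules
    return merged


tcp_flow = {
    "ARGUS_NETWORK_DSR": {"kind": "tcp", "retrans": 3},
    "ARGUS_MAC_DSR": {"kind": "ether", "smac": "00:11:22:33:44:55"},
}
icmp_flow = {
    "ARGUS_NETWORK_DSR": {"kind": "icmp"},
}

# With a "-m saddr daddr" style key these two would be merged together.
print(sorted(merge_dsrs(tcp_flow, icmp_flow)))
```

In this toy case nothing survives: the network DSR is dropped because
TCP and ICMP don't match, and the MAC DSR is dropped because only one
record carries it.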

However, when the DSRs are compatible, but different, you can get
some interesting results.  There are 3 types of ARGUS_NETWORK_DSRs for
TCP data: ArgusTcpInit, ArgusTcpStatus, and ArgusTcpPerf, and only one
has loss statistics.  If you merge an ArgusTcpInit DSR (which only has
base sequence numbers, flags and roundtrip times) with an ArgusTcpPerf
DSR (which has everything), you are supposed to get an ArgusTcpPerf DSR,
with some slight mods to the fields (state values get or'd, flags get
or'd, base sequence numbers are checked to make sure they are the same,
and if not the result is adjusted, total bytes transmitted are summed,
etc.).  The source code is pretty dense in this area, so there is a lot
to talk about.
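
As a rough illustration of the field rules just listed, here is a small
Python sketch.  The TcpDsr class, its field names and the subtype
ranking are hypothetical stand-ins, not the real argus structures; the
adjustment for differing base sequence numbers is deliberately left
out, and summing the retransmission counter is an assumption.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the TCP network DSR; the real structures carry
# many more fields, and these names are illustrative only.
@dataclass
class TcpDsr:
    subtype: str          # "init", "status" or "perf"
    state: int = 0        # bitmask of state values
    flags: int = 0        # bitmask of TCP flags seen
    base_seq: int = 0     # base sequence number
    bytes_sent: int = 0   # total bytes transmitted
    retrans: int = 0      # only meaningful when subtype == "perf"


# Assumed ordering: the richer subtype wins when two DSRs are merged.
RANK = {"init": 0, "status": 1, "perf": 2}


def merge_tcp(a: TcpDsr, b: TcpDsr) -> TcpDsr:
    """Merge two toy TCP DSRs: or the bitmasks, sum the counters."""
    subtype = max(a.subtype, b.subtype, key=RANK.__getitem__)
    return TcpDsr(
        subtype=subtype,                       # init + perf gives perf
        state=a.state | b.state,
        flags=a.flags | b.flags,
        # The real code adjusts the result when the base sequence numbers
        # differ; that adjustment is not modelled, we keep the first one.
        base_seq=a.base_seq,
        bytes_sent=a.bytes_sent + b.bytes_sent,
        retrans=a.retrans + b.retrans,         # assumption: counters sum
    )


init = TcpDsr("init", flags=0x02, base_seq=1000)
perf = TcpDsr("perf", state=0x1, flags=0x18, base_seq=1000,
              bytes_sent=4096, retrans=3)
merged = merge_tcp(init, perf)
print(merged.subtype, hex(merged.flags), merged.bytes_sent, merged.retrans)
```

The point of the sketch is only that the merged DSR should come out as
the "perf" kind, with its retransmission count intact.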

With loss, there is such a thing as negative loss.  We see this with
protocols like ESP and RTP quite often, when packets get out of order.
Argus sees sequence number 23, then 25, and we need to report the
flow, and so we report a loss of 1 packet.  Well, the next packet
that argus sees after sending the status report is packet number 24,
and then 26 and then 27.  We need to report that 24 showed up, and so
when we generate the next flow status record, we report a loss of -1.
Later, when you merge the two status flow records together, the loss
becomes zero.

You won't see that too often with TCP, but you can get that kind of
behavior, especially when the Far Status Interval is below 1 second.
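
The 23, 25 | 24, 26, 27 example can be replayed with a few lines of
Python.  This is a toy counter, not how argus actually computes loss:
it simply counts sequence gaps opened minus gaps filled per status
interval, and loss_per_interval() is a made-up name.

```python
def loss_per_interval(intervals):
    """Return the loss reported for each status interval.

    Loss is counted as sequence gaps opened minus gaps later filled, so a
    late (out-of-order) packet shows up as negative loss. Duplicates are
    not handled; this is a simplification.
    """
    highest = None
    reports = []
    for seqs in intervals:
        loss = 0
        for seq in seqs:
            if highest is None:
                highest = seq
            elif seq > highest:
                loss += seq - highest - 1   # packets skipped over
                highest = seq
            else:
                loss -= 1                   # a missing packet arrived late
        reports.append(loss)
    return reports


# First status record sees 23 and 25; the next one sees 24, 26 and 27.
reports = loss_per_interval([[23, 25], [24, 26, 27]])
print(reports, sum(reports))
```

The first record reports a loss of 1, the second a loss of -1, and the
sum over the merged records is 0.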

I'm thinking that this situation is caused by a bug, where we merge an
ArgusTcpInit and an ArgusTcpPerf DSR together, and fail to redefine
the DSR to ArgusTcpPerf, but leave it as ArgusTcpInit, which of course
doesn't/can't have any retransmission stats.  The newest client code
that is on the server (refreshed yesterday) does have some additional
logic to make it less likely to have this problem, but I have to
double/triple check to see what is actually going on.  Having data that
generates the problem makes that much easier.
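
To show why a wrong label would be enough to make the loss read as 0,
here is a toy Python sketch of the suspected bug.  It is an
illustration of the hypothesis only: the dict layout, merge() and
reported_loss() are made up and are not the client code.

```python
# Toy illustration of the suspected bug; names and layout are assumptions,
# not the argus client source.
def reported_loss(dsr):
    # Only the "perf" subtype carries retransmission statistics.
    return dsr["retrans"] if dsr["subtype"] == "perf" else 0


def merge(a, b, relabel=True):
    merged = {
        "subtype": a["subtype"],
        "retrans": a.get("retrans", 0) + b.get("retrans", 0),
    }
    # The step the bug is suspected of skipping: redefine the DSR as perf.
    if relabel and "perf" in (a["subtype"], b["subtype"]):
        merged["subtype"] = "perf"
    return merged


init = {"subtype": "init"}
perf = {"subtype": "perf", "retrans": 7}

print(reported_loss(merge(init, perf, relabel=False)))  # label stays "init"
print(reported_loss(merge(init, perf)))                 # label becomes "perf"
```

Without the relabel step the merged record reports no loss even though
one of its inputs had retransmissions, which would match ra showing
loss on individual records while racluster prints 0.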

This is a long topic, so keep sending questions, and we'll get
something written down that may make some sense.

Carter


On Mar 18, 2008, at 12:30 PM, Nick Diel wrote:

> Carter,
>
> First, thanks for everything you have done.  Second, thanks for all
> this great info; it has been extremely helpful as I learn Argus.  We
> will need a wiki page just for all the info you have given so far.
>
> Hopefully Stew can anonymize the data, so you can shed some light on  
> what is going on.
>
> Can you tell me/the rest of the list a little bit more about how
> racluster handles IP attributes and TCP attributes?  For instance, if
> racluster is merging based on flow keys, will it attempt to find
> additional retransmitted packets?  For example, if a singleton is
> actually a retransmitted packet for another non-singleton, would
> racluster detect that and increase the loss count after they are
> merged together?
>
> Nick
>
> Carter Bullard wrote:
>>
>> Gentlemen,
>> Well, racluster() does modify the IP attributes and TCP attributes
>> based on the records that are being merged together.  Because you are
>> modifying the flow key, and then merging data together, some data may
>> be ignored.
>>
>> As an example, if you merge a record that is a singleton with a
>> non-singleton, your merged result could retain some singleton
>> properties.  A singleton is a flow with only one packet.  One of the
>> properties of a singleton is that it doesn't have any duration, and it
>> also doesn't have any loss.  Now, if you merge a singleton with a
>> non-singleton you get a non-singleton as the result, so losing things
>> like loss would, of course, be a bug.
>>
>> The best solution is to see if you can ranonymize() the data, and
>> get the same graph.  Could you share that "primitive" data?
>>
>> Primitive data is the set of original flow records directly from
>> argus().
>>
>> What do you think?
>>
>> Carter
>>
>>
>>
>> On Mar 18, 2008, at 12:02 AM, Stewart Gray wrote:
>>
>>> That's right, I'll show the example I'm working with:
>>>
>>> ra -m proto -s loss -r packet-dump-2008-03-18_08\:28.arg - tcp | awk '{total=total+$1;} END {print total;}'
>>> 33244
>>>
>>> racluster -m proto -s loss -r packet-dump-2008-03-18_08\:28.arg - tcp
>>> 0
>>>
>>> Unfortunately I'm not able to distribute the data I'm working with;
>>> it's customers' flow logs.  I'll see if I can replicate the issues
>>> at home so I can provide something to work with.
>>>
>>> Cheers,
>>>
>>> Stew
>>> From: Nick Diel [mailto:ndiel at engr.colostate.edu]
>>> Sent: Tuesday, 18 March 2008 4:53 p.m.
>>> To: Carter Bullard
>>> Cc: Stewart Gray; Argus
>>> Subject: Re: [ARGUS] [Argus] Re: Packet Loss with racluster
>>>
>>> Carter,
>>>
>>> What you are saying makes sense (I think), but I think there is  
>>> something else going on here.
>>>
>>> Stew had a 2 minute file.  If he used ra to look at just this file  
>>> he would see individual records that had positive values for loss  
>>> packet count.  Then he used racluster to merge all status flow  
>>> records and it reported 0 loss packets.  I think Stew was doing  
>>> this one file at a time.
>>>
>>> Basically if a single file (regardless how it was created) has any  
>>> status flows with a positive packet loss count, shouldn't  
>>> racluster be able to report this total for this file?
>>>
>>> ra -s loss -r argus.arg - tcp | awk '{total=total+$1;} END {print total;}'   (total is > 0)
>>> racluster -m proto -s loss -r argus.arg - tcp   (prints 0)
>>>
>>> I may be missing something, but this was how I interpreted Stew's  
>>> problem.
>>>
>>> Nick
>>>
>>> Carter Bullard wrote:
>>>>
>>>> Hey Guys,
>>>> There are a lot of things going on that can affect the "distribution"
>>>> of numbers on a time series graph when using flow data.  Flows are
>>>> not fixed length samples of network activity, and so you have to do
>>>> some statistical mods to make the data generally useful.  Programs
>>>> like rasplit() and rabins() are critical to distributing load, rate,
>>>> packet numbers, loss numbers, jitter, interpacket arrival times,
>>>> etc. correctly into timed bins.  Without the use of either rasplit()
>>>> or rabins(), which are split/aggregate tools, you can end up with
>>>> flows that are longer than the time interval they are supposed to
>>>> represent, which skews the data in weird ways and can generate bins
>>>> with no data in them.
>>>>
>>>> Loss doesn't have to be constant, and so the drop outs may actually
>>>> be real.  And there are no guarantees that there are actually TCP
>>>> connections during those intervals (no TCP, no loss), so we have to
>>>> look at the data to see if there is anything wrong.
>>>>
>>>> Remember, flows from argus() can be as long as the
>>>> ARGUS_FAR_STATUS_INTERVAL.  A flow that starts at 1:59:59.999999 will
>>>> be tallied in the 1:58:00 - 2:00:00 bin, even though its duration
>>>> could extend well into the 2:00:00 - 2:02:00 interval.
>>>>
>>>> The trick is to split the data into strict time slots, and then
>>>> aggregate those slots.  rabins() is very good at this, which is why
>>>> it's at the heart of ragraph().
>>>>
>>>> If I can get some of the data used to generate the graph in the  
>>>> email, I can
>>>> see if using rabins() would remove the drop outs.
>>>>
>>>> Carter
>>>>
>>>>
>>>>
>>>> On Mar 17, 2008, at 8:40 PM, Stewart Gray wrote:
>>>>
>>>>> I just feed the values into cacti; it's a base metric I can use for
>>>>> spotting anomalies.  Even if it's not 100% accurate, the accuracy
>>>>> should be pretty consistent even if argus inflates/deflates the
>>>>> figure slightly on files which have been sliced up.
>>>>>
>>>>> I'm running this argus instance on a busy section of our network
>>>>> and there is a constant flow of between 80-140mb/s.  I ran the
>>>>> rate/load/loss command and got:
>>>>>
>>>>> 17949.785637 94528448 0
>>>>>
>>>>> You can see the blips this morning. The file is actually split  
>>>>> every 2mins on this particular box.
>>>>>
>>>>> <Outlook.jpg>
>>>>>
>>>>> It's a bit unusual, if I run 'ra -m proto -s loss -r argus.arg -  
>>>>> tcp' there are quite a number of losses/retransmits. Might be an  
>>>>> issue with how racluster is aggregating these?
>>>>>
>>>>> Stew
>>>>>
>>>>> From: Nick Diel [mailto:ndiel at engr.colostate.edu]
>>>>> Sent: Tuesday, 18 March 2008 12:10 p.m.
>>>>> To: Stewart Gray
>>>>> Cc: Argus
>>>>> Subject: [Argus] Re: Packet Loss with racluster
>>>>>
>>>>> Stew,
>>>>>
>>>>> I think the first question is what you are using this number
>>>>> for.  If you are just using it as an indicator of congestion or
>>>>> other network problems, then the 5 minute boundary will most
>>>>> likely not be a problem.
>>>>>
>>>>> I believe Argus just counts the number of retransmitted packets
>>>>> to get a loss/drop count; I don't think it is doing any triple
>>>>> duplicate ack or TCP timeout checks (if I am wrong, someone
>>>>> please say so).  Since retransmissions will occur in a time
>>>>> window of a few seconds, you should capture most retransmitted
>>>>> packets in your 5 minute boundaries.  So even if a flow crosses
>>>>> that boundary, you still have a good chance of counting
>>>>> retransmitted packets correctly.
>>>>>
>>>>> For cases where you are receiving a count of 0, I would look at
>>>>> packet rate and bit rate; it is possible the link just doesn't
>>>>> have much traffic on it at that time:
>>>>>
>>>>> racluster -m proto -s rate load loss -r argus.arg - tcp
>>>>>
>>>>> Though I did notice something unusual on my end.  The command I
>>>>> gave you should be a strong estimate, but doesn't account for
>>>>> retransmitted packets over status flow boundaries within the
>>>>> file (though the same argument as above applies).  So to get an
>>>>> exact count on the file (assuming racluster reanalyzes the status
>>>>> flow records for retransmissions) you would need something like:
>>>>>
>>>>> racluster -r argus.arg -w - - tcp | racluster -m proto -s loss -r -
>>>>>
>>>>> (first merge status flow records, then count retransmitted
>>>>> packets).  Though this is the output I get:
>>>>>
>>>>> racluster -m proto -s loss -r argus.out - tcp
>>>>>      62521
>>>>> racluster -r argus.out -w - - tcp | racluster -m proto -s loss -r -
>>>>>      60047
>>>>>
>>>>> At a minimum I would expect the numbers to stay the same (if no
>>>>> retransmitted packets crossed any status flow boundaries, or if
>>>>> racluster doesn't try to find any new retransmitted packets).
>>>>> The number going down doesn't make any sense to me.  Maybe
>>>>> someone can explain to me what is going on.
>>>>>
>>>>> Nick
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Stewart Gray wrote:
>>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> How does racluster handle argus files which have been
>>>>>> periodically split, when producing packet loss statistics?  My
>>>>>> monitoring machine rotates the argus file every 5 minutes.  When
>>>>>> using the following command, how skewed are the figures going
>>>>>> to be as a result of having an incomplete argus file (i.e.
>>>>>> connections that were current when the log file was rotated)?
>>>>>>
>>>>>> I've also noticed that sometimes the resulting figure is 0.  It
>>>>>> only seems to do this in about 1/10 of the argus files I run the
>>>>>> command on.
>>>>>>
>>>>>> racluster -m proto -s loss -r argus.arg - tcp
>>>>>> 0
>>>>>>
>>>>>> racluster -m proto -s loss -r argus.arg - tcp
>>>>>> 33036
>>>>>> Any ideas?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Stew
>>>>>>
>>>>>> From: Nick Diel [mailto:ndiel at engr.colostate.edu]
>>>>>> Sent: Wednesday, 12 March 2008 10:24 a.m.
>>>>>> To: Stewart Gray
>>>>>> Cc: Argus
>>>>>> Subject: Re: [ARGUS] Cheat sheet premiere
>>>>>>
>>>>>> How about:
>>>>>> racluster -m proto -s loss -r argus.arg - tcp
>>>>>>
>>>>>> This should merge all records based on protocol (in this case  
>>>>>> only tcp because of the filter) and then print the loss column  
>>>>>> of all merged records.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> Stewart Gray wrote:
>>>>>>>
>>>>>>> Awesome, that's a really good start.  I've already been playing
>>>>>>> with a few of the options I hadn't toyed with before :)
>>>>>>>
>>>>>>> Is there an easy way to generate a raw count of packets
>>>>>>> lost/retransmitted rather than having it graphed?
>>>>>>>
>>>>>>> I figure we start with:
>>>>>>>
>>>>>>> racluster -s loss -r argus.arg -w -
>>>>>>>
>>>>>>> How are the figures totaled?  Do we pipe it to rasort or ra?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Stewart
>>>>>>>
>>>>>>> From: Stéphane Peters [mailto:stephane.peters at forem.be]
>>>>>>> Sent: Saturday, 8 March 2008 11:06 a.m.
>>>>>>> To: Carter Bullard
>>>>>>> Cc: Stewart Gray; Argus
>>>>>>> Subject: Re: Re: [ARGUS] Cheat sheet premiere
>>>>>>>
>>>>>>> Hi Carter,
>>>>>>>
>>>>>>> I would love to see such a sheet in the distribution,
>>>>>>> and I also was hoping that you could check,
>>>>>>> if those examples made sense or were appropriate.
>>>>>>> So please go on !
>>>>>>>
>>>>>>>
>>>>>>> Some cosmetic work could be done too;
>>>>>>> for example, using some "standard" parameters everywhere, like
>>>>>>> this one:
>>>>>>>     file=argus-eth1.out
>>>>>>>     ra -r $file
>>>>>>> so it is easy to paste the line "as is",
>>>>>>> without forgetting the shell escapes (\$srcid) like in:
>>>>>>>     rasplit -S $argushost  -M 1d -w /path/argus-\$srcid.%Y.%m.%d.log
>>>>>>>
>>>>>>> By the way, as another example given to the list, here are 3
>>>>>>> scripts I use.
>>>>>>> The PATH vars give a nicer ps(1) output.
>>>>>>>
>>>>>>> start-argus
>>>>>>>> #!/bin/sh
>>>>>>>> interf=eth1
>>>>>>>> PATH=/sbin ifconfig $interf | grep UP || PATH=/sbin ifconfig $interf up
>>>>>>>> PATH=/usr/local/sbin argus -d -i $interf -e `hostname` -P 561 -U128 -mRS 30 -w argus-eth1.out
>>>>>>>
>>>>>>> rotate:
>>>>>>>> #!/bin/sh
>>>>>>>>
>>>>>>>> # Rotates server log files, without affecting users who may be
>>>>>>>> # connected to the server.
>>>>>>>>
>>>>>>>> # This can be run as a cron script
>>>>>>>>
>>>>>>>> DATE=`date +%Y-%m%d-%H%M`
>>>>>>>> LOGS='argus-eth1.out'
>>>>>>>>
>>>>>>>>  for i in $LOGS; do
>>>>>>>>    if [ -f $i ]; then
>>>>>>>>      mv $i $i.$DATE
>>>>>>>>      gzip -9 $i.$DATE
>>>>>>>>    fi
>>>>>>>>  done
>>>>>>>
>>>>>>> rotate-daily
>>>>>>>> #!/bin/sh
>>>>>>>> ./rotate
>>>>>>>> sleep 60 # sometimes the preceding command finishes too early
>>>>>>>> echo ./rotate-daily | at 0000 > /tmp/rotate-daily.log
>>>>>>>
>>>>>>> I use at(1) instead of cron(8) to cut the files closer to
>>>>>>> midnight, but rastream(1)'s extended "-w" option seems promising.
>>>>>>> A better solution could be to use argus(8) to preprocess the
>>>>>>> flows, and rastream(1) to write, "rotate" and compress the files.
>>>>>>> Another thread, perhaps.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Carter Bullard wrote :
>>>>>>>>
>>>>>>>> Hey Stephane,
>>>>>>>> This is great!!!!  I'll put this in the distribution, if you  
>>>>>>>> don't mind!!!!
>>>>>>>> And I'll also go through it to make sure that any changes in  
>>>>>>>> the
>>>>>>>> code actually don't break this, and I can add some of the ones
>>>>>>>> that I do.
>>>>>>>>
>>>>>>>> So Russell is asking for a wiki, and we already have one at:
>>>>>>>>
>>>>>>>> http://www.vorant.com/nsmwiki/index.php?title=Argus
>>>>>>>>
>>>>>>>>
>>>>>>>> Carter
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 7, 2008, at 2:24 PM, Stéphane Peters wrote:
>>>>>>>>
>>>>>>>>> Hi Stewart,
>>>>>>>>>
>>>>>>>>> I also think that a cheat sheet would be nice !
>>>>>>>>> Here is a good occasion to show mine...
>>>>>>>>>
>>>>>>>>> Please note, most of the stuff has been collected right from
>>>>>>>>> this argus list, so hopefully you won't have to browse all the
>>>>>>>>> (numerous) past messages.
>>>>>>>>>
>>>>>>>>> Any suggestions ?
>>>>>>>>>
>>>>>>>>> flow filtering on certain port range:
>>>>>>>>>    ra -r file - dst port \( gt 1024 and lt 2048 \)
>>>>>>>>> (...)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Stewart Gray wrote:
>>>>>>>>>>
>>>>>>>>>> Awesome, that's more like what I was after :) Thanks for
>>>>>>>>>> your help again.
>>>>>>>>>>
>>>>>>>>>> As I mentioned earlier, I reckon it'd be neat to have some
>>>>>>>>>> sort of cheat sheet for doing common tasks.  I bet there's
>>>>>>>>>> lots of stuff you know that others don't, having written the
>>>>>>>>>> application yourself.  I don't know what I don't know!
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> -- 
>>>>>>>>> Stephane.Peters at forem.be, Postmaster at forem.be
>>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> -- 
>>>>>>> Stephane.Peters at forem.be
>>>>>>> #####################################################################################
>>>>>>> Important: This electronic message and attachments (if any)  
>>>>>>> are confidential and may be legally privileged. If you are not  
>>>>>>> the intended recipient do not copy, disclose or use the  
>>>>>>> contents in any way. Please let us know by return e-mail  
>>>>>>> immediately and then destroy this message.
>>>>>>> #####################################################################################
>>>>>>
>>>>>
>>>>
>>>> Carter Bullard
>>>> CEO/President
>>>> QoSient, LLC
>>>> 150 E. 57th Street Suite 12D
>>>> New York, New York 10022
>>>>
>>>> +1 212 588-9133 Phone
>>>> +1 212 588-9134 Fax
>>>>
>>>>
>>>>
>>>
>>
>
