normalized appbyte ratio

Fri May 3 18:43:45 EDT 2013

It just occurred to me that the issues with 0 can be finessed by normalizing.
That is, compute the appbyte ratio as (s-d)/(s+d), where s and d are the source
and destination appbytes, with 0/0 returning -0.0

The result is a number ranging from -1 (all inbound) to +1 (all outbound),
which captures any asymmetry directly in the sign.  The use of -0.0 is arguably
a hack, but it captures the edge case in a way that is detectable via signbit()
if desired, yet has no computational impact.
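
As a minimal sketch of the above (in Python rather than C, so math.copysign()
stands in for signbit(); the names norm_abr and no_payload are made up for
illustration, and s and d are assumed to be non-negative appbyte counts):

```python
import math

def norm_abr(s, d):
    """Normalized appbyte ratio (s - d) / (s + d), with 0/0 mapped to -0.0."""
    total = s + d
    if total == 0:
        return -0.0          # neither side sent any payload
    return (s - d) / total   # -1.0 = all inbound, +1.0 = all outbound

def no_payload(r):
    """True only for the -0.0 marker (the equivalent of C's signbit() test)."""
    return r == 0.0 and math.copysign(1.0, r) < 0.0
```

A balanced flow (s == d > 0) comes out as +0.0, so it stays distinguishable
from the no-payload marker even though -0.0 == 0.0 compares true.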

This formulation also permits the existence of both sabr and dabr, in that they
would just define the order of the subtraction, although I'm not sure there's
much utility in that.
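
A rough sketch of what the two fields would look like under this formulation,
again in Python with hypothetical function names.  One wrinkle: negating -0.0
gives +0.0, so dabr would have to be computed from the counts rather than as
-sabr, or the 0/0 marker gets lost.

```python
import math

def _abr(a, b):
    """(a - b) / (a + b), with 0/0 mapped to -0.0; counts assumed non-negative."""
    total = a + b
    return -0.0 if total == 0 else (a - b) / total

def sabr(s, d):
    """Source-oriented ratio: positive when the source sent more payload."""
    return _abr(s, d)

def dabr(s, d):
    """Destination-oriented ratio: same magnitude, subtraction order swapped."""
    return _abr(d, s)
```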

John Gerth      gerth at graphics.stanford.edu  Gates 378   (650) 725-3273 fax 725-6949

On 5/3/2013 11:39 AM, Carter Bullard wrote:
> Hey John,
> So when dealing with the ratio ( [s | d]appbytes / [s | d]bytes) we do end
> up with some issues we have to deal with.  May not seem intuitive, but we
> will have conditions where we end up with ( 0 / X ) and ( 0 / 0 ) as the actual
> values for the metric, and ( 0 / X ) is a completely different state than ( 0 / 0 ).
> While every flow record has to have at least some bytes in it, we can
> easily have ( bytes == 0 ) in one of the directions.   So it is a condition
> we need to convey.  We can return -1 for ( 0 / 0 ) to discriminate that
> condition?
> 
> In dealing with all the zeros that we may get in this new metric, a few
> situations shouldn't exist.  At least we know that when the denominator 
> of ( appbytes / bytes ) is zero, the numerator had better also be zero,
> or something is definitely wrong ;O)
> 
> Carter
> 
> On May 2, 2013, at 1:14 AM, John Gerth <gerth at graphics.stanford.edu> wrote:
> 
>> I'm a big fan of the appbyte metric and have created and used their ratio in the past.
>>
>> One interesting question that comes up is what to do with the 0's. It's important because
>> knowing that one or both sides didn't send any payload can be significant (not to
>> mention what to do when 0 is in the denominator).
>>
>> /J
>>
>> --
>> John Gerth      gerth at graphics.stanford.edu  Gates 378   (650) 725-3273
>>
>> On 5/1/13 5:23 AM, Carter Bullard wrote:
>>> Hey Jesse,
>>> How about we make a new field;  " [ s | d ]abr " for the [ src or dst ] appbyte ratio ?  I'll do that today.
>>>
>>> Not sure what is happening with the multiple addresses showing up. That would seem to be a bug.  Can you share some data so I can try to recreate the
>>> problem ?
>>>
>>> Carter
>>>
>>> On Apr 30, 2013, at 10:44 PM, Jesse Bowling <jessebowling at gmail.com> wrote:
>>>
>>>> Hi Carter,
>>>>
>>>> I've been working through this example; this is a very interesting approach in that you're boiling host network patterns into a single number that
>>>> you can watch over time to indicate a change in the host...This sort of distillation seems like a big win, once you're instrumented to track it! ...
>>>>
>>>> On that subject, I had some difficulties while trying to blindly implement the commands you gave and wanted to send back some notes and questions to
>>>> the list...
>>>>
>>>> * The text states you need "-M rmon" in the first racluster, but the example doesn't include it; I found it should be:
>>>>
>>>> racluster -R argus_dir/ -M rmon -m saddr proto sport -w argus.out - 'ipv4'
>>>>
>>>> * I found I could calculate the ratio of sappbytes/dappbytes (and create a 'label') using awk like:
>>>>
>>>> awk '{ if ($8 + 0 != 0) { LABEL = "Balanced"; RATIO = $7 / $8;
>>>>          if (RATIO > 1.5) { LABEL = "Producer" }; if (RATIO < 0.95) { LABEL = "Consumer" };
>>>>          print $0, RATIO "\t" LABEL } }' ra_text_output_file
>>>>
>>>> However my example is based on the fields in my rarc file, and thus this method isn't very elegant...and will also miss any records that are missing
>>>> a field...It would seem that this metric would be easy to calculate with the clients themselves and would give the added benefit of allowing for
>>>> ralabel'ing to be used on the metric (much more portable and useful I think)...I think this is a feature request... :)
>>>>
>>>> * I wanted to start iterating through various test cases on my data, varying time ranges and networks that I examined. I found that I can get very
>>>> 'off' results based on how I try to filter which networks I want...for instance:
>>>>
>>>> This example will lead to hosts showing up multiple times in the final output
>>>> # /usr/local/bin/racluster -r ${HOUR}* -M rmon -m saddr proto sport -w ${TMP1} - 'ipv4 and src net 10.10.10.0/24'
>>>> # /usr/local/bin/racluster -r ${TMP1} -m saddr -w - | /usr/local/bin/rasort -r - -m sappbytes -s stime dur saddr proto sport sappbytes dappbytes
>>>>
>>>> This example appears to be fine in the final output
>>>> # /usr/local/bin/racluster -r ${HOUR}* -M rmon -m saddr proto sport -w ${TMP1} - 'ipv4 and net 10.10.10.0/24'
>>>> # /usr/local/bin/racluster -r ${TMP1} -m saddr -w - | /usr/local/bin/rasort -r - -m sappbytes -s stime dur saddr proto sport sappbytes dappbytes
>>>>
>>>> I think I have a misunderstanding about how racluster and filters interact; can you explain why the 'src' part in the first example would cause
>>>> multiple entries for individual hosts in the final output?
>>>>
>>>> Thank you for sharing your knowledge and experience to this community!
>>>>
>>>> Cheers,
>>>>
>>>> Jesse
>>>>
>