Reachability monitoring... request for rafilteraddr "negation" method, rabins SIGHUP reload conf?

Mon Jan 6 14:16:47 EST 2014

Hey Matt,
Really big topic.  The easiest way to approach this is to talk in
general about how reachability, connectivity, and availability
metrics work in argus.  I'll keep it short and sweet, and get you to
try out some features of the tools you may not be familiar with.

This all comes from the bi-directional flow support that is pretty
unique to argus.  Flows do have responses, and argus tries really
hard to match both sides of a flow to generate flow records that
have everything you need to know if there is successful networking
going on.

When there is a failure (no response for flows that usually have a
response), with argus data you just need to dig a little to figure
out what is going on.  The key is to find persistent flow attempts
between nodes or networks where you do see both sides of the
traffic.  Once you have a working “beacon”, you can track it
to do your analysis.

Pings are a great example.  Argus has a 6-tuple flow model
for ping traffic, such that argus will generate flow records on each
ping volley, rather than aggregating them over time.  This allows you
to get Round Trip Times, and to track failure on individual ping
attempts.  So argus generates appropriate data to use ping as an
instantaneous availability indicator, just as it's intended to do.
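
To make the per-volley idea concrete, here is a rough Python sketch,
not argus code.  It assumes you have already pulled one record per
ping volley into a hypothetical PingFlow structure (field names
borrowed from the ra column names stime, dur, spkts, dpkts), and it
treats dur as a stand-in for round trip time, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PingFlow:
    """Hypothetical per-volley record; not an argus data structure."""
    stime: float  # start time of the volley, in seconds
    dur: float    # flow duration, used here as an approximate RTT
    spkts: int    # packets from the pinger (echo requests)
    dpkts: int    # packets back from the target (echo replies)


def answered(flow):
    """A volley counts as answered if a request went out and a reply came back."""
    return flow.spkts > 0 and flow.dpkts > 0


def summarize(flows):
    """Return the availability ratio, mean duration of answered volleys,
    and the start times of the volleys that got no reply."""
    ok = [f for f in flows if answered(f)]
    failed_at = [f.stime for f in flows if not answered(f)]
    availability = len(ok) / len(flows) if flows else None
    mean_dur = sum(f.dur for f in ok) / len(ok) if ok else None
    return {"availability": availability,
            "mean_dur": mean_dur,
            "failed_at": failed_at}


if __name__ == "__main__":
    sample = [
        PingFlow(stime=0.0, dur=0.010, spkts=1, dpkts=1),
        PingFlow(stime=1.0, dur=0.000, spkts=1, dpkts=0),  # no reply
        PingFlow(stime=2.0, dur=0.030, spkts=1, dpkts=1),
    ]
    print(summarize(sample))
```

Because each volley is its own record, a single unanswered ping shows
up as its own entry in failed_at rather than being averaged away.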

Now, the real trick to tracking anything in argus is to intelligently
aggregate flow records over time, to see trends, etc.
racluster() has embedded logic that allows you to track availability
for all flow types, using the “-A” option (A for Availability).  Try
it out with a day's or week's worth of pings to a given node.  First
pull the ping flows between the two hosts into a file:

   ra -r argus.data -w /tmp/ping.test.out - echo and host x.y.z.w and host y.z.w.x

Then test this out:

   racluster -Ar /tmp/ping.test.out

and see if the resulting aggregations don’t give you a representation
of when the flow was available, and when it was not.  If there was
an availability failure, the racluster() output should give you a
single flow for the period where the nodes were available, a single
flow for when they weren’t, and another for when they start up again.
With this basic data, you can do all the availability studies you need.
For the flow during the failure, you should be able to figure out what
happened based on whether there are ICMP replies in the flow record
during the outage period.
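
If it helps to picture the shape of that output, here is a minimal
Python sketch of the same idea.  It is an illustration only and not
how racluster() is implemented: it assumes a hypothetical list of
(stime, ltime, dpkts) tuples, one per ping flow, sorted by start
time, and it treats dpkts > 0 as "available".

```python
def availability_periods(records):
    """Collapse consecutive flows with the same availability state.

    records: iterable of (stime, ltime, dpkts) tuples sorted by stime.
    Returns a list of dicts, one per contiguous up or down period.
    """
    periods = []
    for stime, ltime, dpkts in records:
        up = dpkts > 0  # a reply was seen, so the target was available
        if periods and periods[-1]["up"] == up:
            # Same state as the previous flow: extend the current period.
            periods[-1]["end"] = max(periods[-1]["end"], ltime)
            periods[-1]["flows"] += 1
        else:
            # State changed (or first record): start a new period.
            periods.append({"up": up, "start": stime,
                            "end": ltime, "flows": 1})
    return periods


if __name__ == "__main__":
    sample = [(0, 1, 1), (10, 11, 1), (20, 21, 0), (30, 31, 0), (40, 41, 1)]
    for period in availability_periods(sample):
        print(period)
```

With the sample data this yields three periods: available, not
available, and available again, which is the pattern to look for in
the aggregated flows.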

Hopefully you’ll see how this type of aggregation can start to
do the things you’re interested in.
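
On the "hosts NOT found" part of your question, one way to picture
it, independent of any change to rafilteraddr, is a plain set
difference done outside the argus tools.  The sketch below is only
that: it assumes a hypothetical list of monitored addresses and a
hypothetical list of (saddr, daddr) pairs already extracted from the
flow records for one time bin.

```python
import ipaddress


def missing_hosts(monitored, records):
    """Return the monitored addresses that never appear as saddr or daddr.

    monitored: list of address strings to watch.
    records:   iterable of (saddr, daddr) string pairs for one time bin.
    """
    seen = set()
    for saddr, daddr in records:
        # Normalize through ipaddress so equivalent spellings compare equal.
        seen.add(ipaddress.ip_address(saddr))
        seen.add(ipaddress.ip_address(daddr))
    return [h for h in monitored if ipaddress.ip_address(h) not in seen]


if __name__ == "__main__":
    monitored = ["192.168.1.1", "192.168.1.2"]
    records = [("192.168.1.1", "10.0.0.5")]
    print(missing_hosts(monitored, records))
```

Running this once per bin handles the time dimension you were worried
about, at the cost of doing the bookkeeping yourself.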

Carter

On Jan 6, 2014, at 1:48 PM, Matt Brown <matthewbrown at gmail.com> wrote:

> Hello guys,
> 
> I wanted to separate this out into another thread, but this is a semi-continuation from "Under what circumstances will an ICMP flow with state ECO return dbytes of 0?" (http://thread.gmane.org/gmane.network.argus/10140/focus=10142)
> 
> 
> Thanks for the great ideas, Chas.  See below... I'm trying to be super broad for a reachability monitor.  Later I will probably also focus on monitoring "service connectability" and "application availability."
> 
> 
> I'm currently focusing on monitoring four things:
> 
> 1) ICMP error status (concluded: `ra -S 127.0.0.1:561 -s ltime saddr daddr smac dmac spkts dpkts flgs state inode - "icmp and (dst pkts eq 0 or not echo)"` Thanks Carter!)
> 2) loss (`rabins -S 127.0.0.1:561 -B 15s -M 5s - ploss gt 0 or loss gt 0`)
> 3) TCP errors (`rabins -S 127.0.0.1:561 -B 15s -M 5s - frag or retrans or outoforder or winshut`)
> 4) Reachability.
> 
> 
> I figured a very broad method for detecting if a host is unreachable is to set a tolerance for 0 dpkts or 0 spkts (in this case `-M 5m`) and create a filter based on criteria:
> 
> `rabins -B 15s -M 5m hard -S 127.0.0.1:561 | rafilteraddr -f monitoredhosts.in -s ltime saddr daddr smac dmac spkts dpkts flgs state inode - "not fin and not reset and (dpkts gt 0 or spkts gt 0)"`
> 
> 
> How would I go about outputting only hosts from a file that are NOT found?  Can I request a modification to `rafilteraddr` to support this (something like `-M negate`)?
> 
> Given: address.spec = 192.168.1.1, 192.168.1.2
> If the data (from rabins in this case) does NOT include 192.168.1.2 as an saddr or daddr, then zero sum/length stats for 192.168.1.2 are listed.  I think time would be a challenge.
> 
> Carter: Is this something that makes sense to implement?
> 
> 
> Is it feasible to implement a reachability monitor in this broad of a manner?
> 
> 
> 
> Also, is rabins currently handling SIGHUP?  If received, reload the config file?  That would be great.
> - I would probably adjust the bin time lower during peak times so the probe alerts faster.
> - Out of peak times, I would use a heartbeat of a scheduled ping (probably using icinga), so there is at least some pkts to and from monitored nodes.
> 
> 
> 
> Thanks very much as usual,
> 
> Matt
> 
> 
> 
> On Jan 3, 2014, at 4:08 PM, Carter Bullard <carter at qosient.com> wrote:
> 
>> What is node liveliness ???  I can imagine, but it's not a normal metric.
>> So dpkts == 0  ???  Means "no response", no one at home.
>> 
>> But the failure to get a packet back could be that the machine didn't
>> get the Echo, got it but didn't respond, responded but the response
>> was lost.  These equate to Reachable, Available, Connectable.  We
>> differentiate these based on a list of criteria.
>> 
>> You could have gotten a different type of ICMP packet back, such as an
>> Unreachable...  Look at the flgs field for an 'I', showing that the packet
>> got an ICMP mapped to the original packet, but not an Echo.  With that
>> you will get a duration, for the single packet flow, and the field " inode "
>> will have the address of the intermediate node that generated the ICMP.
>> 
>> Carter
>> 
>> 
>> On Jan 3, 2014, at 2:34 PM, Matt Brown <matthewbrown at gmail.com> wrote:
>> 
>>> Hello again,
>>> 
>>> I am investigating how to use argus for "node liveliness detection."
>>> 
>>> Considering leveraging ra() as:
>>> 
>>> ra -S 127.0.0.1:561 -s ltime stime daddr sport sbytes dbytes flgs state - icmp
>>> 
>>> I see dbytes can be 0 when the state of a flow is ECO.
>>> 
>>> Why would this be?
>>> 
>>> 
>>> I have covered this question thoroughly on the network engineering
>>> stackexchange: http://networkengineering.stackexchange.com/q/5683
>>> 
>>> 
>>> I think this is my last question for the day!
>>> 
>>> 
>>> Thanks,
>>> 
>>> Matt
>>> 
>>> 
>>> 
>>> Any assistance is appreciated.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Matt
>>> 
>> 

Carter Bullard
CEO/President
QoSient, LLC
150 E 57th Street Suite 12D
New York, New York  10022

+1 212 588-9133 Phone
+1 212 588-9134 Fax
