Reachability monitoring... request for rafilteraddr "negation" method, rabins SIGHUP reload conf?

Mon Jan 6 13:48:49 EST 2014

Hello guys,

I wanted to separate this out into another thread, but this is a
semi-continuation from "Under what circumstances will an ICMP flow with
state ECO return dbytes of 0?" (
http://thread.gmane.org/gmane.network.argus/10140/focus=10142)

Thanks for the great ideas, Chas.  See below... I'm trying to be super
broad for a reachability monitor.  Later I will probably also focus on
monitoring "service connectability" and "application availability."

I'm currently focusing on monitoring four things:

1) ICMP error status (concluded: `ra -S 127.0.0.1:561 -s ltime saddr daddr
smac dmac spkts dpkts flgs state inode - "icmp and (dst pkts eq 0 or not
echo)"` Thanks Carter!)
2) loss (`rabins -S 127.0.0.1:561 -B 15s -M 5s - ploss gt 0 or loss gt 0`)
3) TCP errors (`rabins -S 127.0.0.1:561 -B 15s -M 5s - frag or retrans or
outoforder or winshut`)
4) Reachability.

I figured a very broad method for detecting if a host is unreachable is to
set a tolerance for 0 dpkts or 0 spkts (in this case `-M 5m`) and create a
filter based on criteria:

`rabins -B 15s -M 5m hard -S 127.0.0.1:561 | rafilteraddr -f
monitoredhosts.in -s ltime saddr daddr smac dmac spkts dpkts flgs state
inode - "not fin and not reset and (dpkts gt 0 or spkts gt 0)"`

How would I go about outputting only hosts from a file that are NOT found?
 Can I request a modification to `rafilteraddr` to support this (something
like `-M negate`)?

Given: address.spec = 192.168.1.1, 192.168.1.2
If the data (from rabins in this case) does NOT include 192.168.1.2 as an
saddr or daddr, then 192.168.1.2 would be listed with zero sum/length
stats.  I think time would be a challenge.
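
A rough sketch of the requested behaviour, done outside the argus clients rather than in `rafilteraddr` itself. It assumes the flow records were printed with only the `saddr` and `daddr` columns, whitespace separated, and that the hosts file lists addresses separated by commas or newlines. The function names and the `#` comment handling are hypothetical and not part of argus:

```python
import re

def parse_hosts(text):
    """Return the set of addresses in a hosts file (comma or whitespace separated)."""
    hosts = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0]          # drop trailing comments (assumed syntax)
        hosts.update(tok for tok in re.split(r"[,\s]+", line) if tok)
    return hosts

def missing_hosts(monitored, flow_lines):
    """Return monitored addresses that never appear as saddr or daddr."""
    seen = set()
    for line in flow_lines:
        fields = line.split()
        if len(fields) >= 2:
            seen.update(fields[:2])           # assumed column order: saddr daddr
    return sorted(monitored - seen)

if __name__ == "__main__":
    monitored = parse_hosts("192.168.1.1, 192.168.1.2")
    flows = ["192.168.1.1 10.0.0.5", "10.0.0.5 192.168.1.1"]
    print(missing_hosts(monitored, flows))    # hosts with no traffic in this batch
```

Run once per bin interval, this would only report hosts with no records at all in that batch. It does not address the timing question raised above.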

Carter: Is this something that makes sense to implement?

Is it feasible to implement a reachability monitor in such a broad
manner?

Also, does rabins currently handle SIGHUP?  If it receives one, does it
reload the config file?  That would be great.
- I would probably adjust the bin time lower during peak times so the probe
alerts faster.
- Out of peak times, I would use a heartbeat of a scheduled ping (probably
using icinga), so there are at least some pkts to and from monitored nodes.

Thanks very much as usual,

Matt

On Jan 3, 2014, at 4:08 PM, Carter Bullard <carter at qosient.com> wrote:

What is node liveliness ???  I can imagine, but it's not a normal metric.
So dpkts == 0  ???  Means "no response", no one at home.

But the failure to get a packet back could be that the machine didn't
get the Echo, got it but didn't respond, or responded but the response
was lost.  These equate to Reachable, Available, Connectable.  We
differentiate these based on a list of criteria.

You could have gotten a different type of ICMP packet back, such as an
Unreachable...  Look at the flgs field for an 'I', showing that the packet
got an ICMP mapped to the original packet, but not an Echo.  With that
you will get a duration, for the single packet flow, and the field " inode "
will have the address of the intermediate node that generated the ICMP.
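
The triage described above could be sketched roughly as follows in Python. It assumes each record has already been parsed into `dpkts`, `flgs` and `inode` values; the function name and the returned labels are illustrative only and are not argus terminology:

```python
def classify_echo_flow(dpkts, flgs, inode=""):
    """Label an ICMP echo flow using the criteria described above."""
    if "I" in flgs:
        # An ICMP message other than an Echo was mapped to the original
        # packet; inode names the intermediate node that generated it.
        return "icmp event from " + (inode or "unknown node")
    if dpkts == 0:
        # Nothing came back at all: "no response".
        return "no response"
    return "answered"
```

The 'I' check comes first because a mapped ICMP error carries more information than the bare absence of reply packets.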

Carter

On Jan 3, 2014, at 2:34 PM, Matt Brown <matthewbrown at gmail.com> wrote:

Hello again,

I am investigating how to use argus for "node liveliness detection."

Considering leveraging ra() as:

ra -S 127.0.0.1:561 -s ltime stime daddr sport sbytes dbytes flgs state -
icmp

I see dbytes can be 0 when the state of a flow is ECO.

Why would this be?

I have covered this question thoroughly on the network engineering
stackexchange: http://networkengineering.stackexchange.com/q/5683

I think this is my last question for the day!

Thanks,

Matt

Any assistance is appreciated.

Thanks,

Matt