Strace output
Carter Bullard
carter at qosient.com
Fri Jun 8 12:08:22 EDT 2012
Hey CS Lee,
Just going through the messages you sent, so we have some dialog on what is going on.
Based on the time stamps, looks like radium disconnects the slow client, and then argus
gets into trouble.
OK, so the radium messages are ok, client not processing, disconnecting. That is normal,
if the client can't read records fast enough. But, radium should just bump that particular
client off, and keep going. However, it looks like one of radium's threads is exiting or
crashing, when it disconnects the client.
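To make "bump that particular client off, and keep going" concrete, here is a minimal Python sketch of the intended behavior. It is not radium's actual code: the Client and Distributor classes, the max_queued limit, and the in-memory queues are all hypothetical stand-ins for illustration, and only the log text is modeled on the radium message quoted below.

```python
from collections import deque


class Client:
    """Hypothetical stand-in for one connected reader (e.g. an ra process)."""

    def __init__(self, name, max_queued):
        self.name = name
        self.max_queued = max_queued  # assumed per-client queue limit
        self.queue = deque()
        self.connected = True

    def drain(self, n):
        """Simulate the client reading up to n records off its queue."""
        taken = []
        while self.queue and len(taken) < n:
            taken.append(self.queue.popleft())
        return taken


class Distributor:
    """Fans records out to clients and drops only the ones that fall behind."""

    def __init__(self):
        self.clients = []
        self.log = []

    def attach(self, client):
        self.clients.append(client)

    def publish(self, record):
        # Iterate over a copy so a client can be removed mid-loop.
        for client in list(self.clients):
            if len(client.queue) >= client.max_queued:
                # Slow reader: disconnect this client only, keep serving the rest.
                client.connected = False
                client.queue.clear()
                self.clients.remove(client)
                self.log.append(
                    f"{client.name}: client not processing: disconnecting"
                )
                continue
            client.queue.append(record)


if __name__ == "__main__":
    dist = Distributor()
    fast = Client("fast", max_queued=4)
    slow = Client("slow", max_queued=4)
    dist.attach(fast)
    dist.attach(slow)

    for i in range(10):
        dist.publish(i)
        fast.drain(2)  # the fast client keeps up; the slow one never reads

    print(dist.log)
    print(fast.connected, slow.connected)
```

The point of the sketch is that a full queue is handled per client: the slow reader is dropped and logged, while the distributor and the remaining clients carry on untouched.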
I believe that argus is doing OK, but gets into trouble when radium gets into trouble.
If we can keep radium happy, all should work well, until I can make some changes,
so that argus can survive all of this transport congestion.
We should focus on what is going on with radium. If you run radium on the linux box,
without the client, all seems to be good ?! If so, great all working as designed. Once a
client attaches, fine, but when it either disconnects, or if the client stays connected, but
doesn't seem to be able to keep up with the load, radium has some big problems.
If you run radium on your linux machine under gdb, when it closes the ra connection,
does gdb tell us something, such as a segfault on one of its threads?
If we can get radium to do the right thing, then that can be a workaround, to get you
going again. I'll focus on fixing radium first, then argus second, if we can get argus
isolated from these issues.
Carter
On Jun 8, 2012, at 11:47 AM, CS Lee wrote:
> hi Carter,
>
> I basically run argus on bivio, and radium on another linux box, but they are connected via direct 10G link.
>
> Now I run everything on the Bivio box. In order for argus to run in the foreground so I can check it, I need to force it to run on 1 CPU. I start argus and radium; nothing much happens and both stay up. However, when I use ra to connect to radium, after a while here's what I get -
>
> argus
> argus[1708.48c93490]: 08 Jun 12 22:05:38.142199 ArgusWriteSocket: write (4, 0x693015e0, 32, ...) -1
> argus[1708.48c93490]: 08 Jun 12 22:05:38.142226 ArgusWriteSocket: write (4, 0x693015e0, 32, ...) -1
> argus[1708.48c93490]: 08 Jun 12 22:05:38.142251 ArgusWriteSocket: write (4, 0x693015e0, 32, ...) -1
> argus[1708.48c93490]: 08 Jun 12 22:05:38.142277 ArgusWriteSocket: write (4, 0x693015e0, 32, ...) -1
> Killed
>
> radium -
> radium[1756]: 22:03:15.953146 connect from localhost
> radium[1756]: 22:03:55.399637 ArgusWriteOutSocket(0x49b5a4e8) client not processing: disconnecting
> radium[1756]: 22:05:47.968393 connect to 10.0.0.1:561 failed 'Connection refused'
>
> ra just quit
>
> By the way, argus is now running on less than 1G of traffic. I used to run argus on a gigabit network and never saw such an issue. Bivio is new to me, though, as I have never used it before.
>
>
>
> On Fri, Jun 8, 2012 at 10:44 PM, Carter Bullard <carter at qosient.com> wrote:
> Hey CS Lee,
> OK, so two things, first there does seem to be a bug in how argus tries
> to gracefully recover from this type of problem. I am working on that now.
> Second, we need to get things such that the argus data flow is stable, then
> add components to see what is causing the problem. Also, we'd like
> to insulate argus from all this, so that it doesn't die.
>
> What seems to be the problem is your clients are connecting, but not reading
> flow data fast enough ( my interpretation of the write failure messages, and
> possibly the "client not ready" messages ). Argus is designed to allow for
> a large number of write errors that are related to client queuing and flow
> control, but the real bug is that argus is not dealing with slow clients very
> well, leaving data in queues, not clearing status quickly enough, and then
> giving up, but not terminating properly.
>
> As a work around to this problem, we need to get the first link in your data
> chain, argus -> radium, so that the channel never back pressures argus.
>
> Does the argus radium connection work without any ra* clients attached?
>
> Where does your radium run? On the Bivio or another machine ?
>
> If radium is not running on Bivio, I would recommend that we do that, so that
> radium is managing the interface that remote clients interact with, and
> argus only sees a single consistent connection from a single radium.
>
> But I will also recommend that you run a radium on the remote machine,
> so that the data chain is [ argus -> radium ] -> [ radium->ra*].
>
> Let's get the data flow going reliably, without ra* clients, and then see what
> is going on when one attaches.
>
> Carter
>
>
> On Jun 8, 2012, at 7:07 AM, CS Lee wrote:
>
>> hi Carter,
>>
>> I'm not sure if this is helpful, but here's the output from strace -
>>
>> strace -c /usr/local/sbin/argus -i s0.e0
>> argus[28208]: 08 Jun 12 17:12:50.271411 started
>> argus[28208]: 08 Jun 12 17:12:50.292235 ArgusGetInterfaceStatus: interface s0.e0 is up
>> argus[28208]: 08 Jun 12 17:14:18.699681 connect from 10.0.0.3
>>
>>
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  99.68   41.720000      164252       254       126 futex
>>   0.17    0.072972         973        75           mmap
>>   0.12    0.050000       50000         1         1 restart_syscall
>>   0.02    0.009062         432        21           munmap
>>   0.00    0.000884          34        26         5 setsockopt
>>   0.00    0.000144           3        46        10 open
>>   0.00    0.000000           0       112           read
>>   0.00    0.000000           0         1           write
>>   0.00    0.000000           0        62           close
>>   0.00    0.000000           0         1           waitpid
>>   0.00    0.000000           0         1           execve
>>   0.00    0.000000           0         4           time
>>   0.00    0.000000           0         1           setuid
>>   0.00    0.000000           0         2           getuid
>>   0.00    0.000000           0         1         1 access
>>   0.00    0.000000           0         5           brk
>>   0.00    0.000000           0         1           getgid
>>   0.00    0.000000           0        56         1 ioctl
>>   0.00    0.000000           0         3           clone
>>   0.00    0.000000           0        28           mprotect
>>   0.00    0.000000           0         3           _llseek
>>   0.00    0.000000           0         1           select
>>   0.00    0.000000           0         1           writev
>>   0.00    0.000000           0         2           sched_get_priority_max
>>   0.00    0.000000           0         2           sched_get_priority_min
>>   0.00    0.000000           0         8           rt_sigaction
>>   0.00    0.000000           0         2           rt_sigprocmask
>>   0.00    0.000000           0         1           getrlimit
>>   0.00    0.000000           0         5           mmap2
>>   0.00    0.000000           0         1           stat64
>>   0.00    0.000000           0        30           fstat64
>>   0.00    0.000000           0         2           getdents64
>>   0.00    0.000000           0         5           fcntl64
>>   0.00    0.000000           0         1           set_tid_address
>>   0.00    0.000000           0       126           clock_gettime
>>   0.00    0.000000           0         1           tgkill
>>   0.00    0.000000           0         1           get_robust_list
>>   0.00    0.000000           0         1           SYS_317
>>   0.00    0.000000           0        27           socket
>>   0.00    0.000000           0         8           bind
>>   0.00    0.000000           0         7         3 connect
>>   0.00    0.000000           0         1           listen
>>   0.00    0.000000           0         5           getsockname
>>   0.00    0.000000           0         4           sendto
>>   0.00    0.000000           0         9           getsockopt
>>   0.00    0.000000           0        11           recvmsg
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00   41.853062                   966       147 total
>>
>> Hopefully this strace is helpful.
>>
>> --
>> Best Regards,
>>
>> CS Lee<geek00L[at]gmail.com>
>>
>> http://geek00l.blogspot.com
>> http://defcraft.net
>> <bivio-argus-strace.log>
>
>
>
>
> --
> Best Regards,
>
> CS Lee<geek00L[at]gmail.com>
>
> http://geek00l.blogspot.com
> http://defcraft.net