Anonymization of argus flow data

Carter Bullard carter at qosient.com
Tue Sep 10 12:33:28 EDT 2013


Well,
On my system 80% of the cycles are being spent doing the address,
port, mac, AS number mappings (managing allocation of a new object
and caching the values), and a small amount on the lookups.

I'll work on profiling the mapping logic to see if we've got
something askew.

Hope all is most excellent,

Carter
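
The mapping step described above (allocate a new object for each value not yet seen, cache it, and look it up on later records) can be sketched roughly as follows. This is a minimal illustration in Python, not the ranonymize() source: the class name `SequentialAddressMapper`, its counters, and the `10.0.0.0` base are all hypothetical, and the sketch deliberately skips prefix and class preservation.

```python
import ipaddress


class SequentialAddressMapper:
    """Hypothetical sketch: map each distinct IPv4 address to the next
    unused address in a pool, caching the result for later lookups."""

    def __init__(self, base="10.0.0.0"):
        # Assumption: anonymized addresses are handed out sequentially
        # starting just above an arbitrary base address.
        self._next = int(ipaddress.IPv4Address(base)) + 1
        self._cache = {}        # original address -> anonymized address
        self.allocations = 0    # new mappings created (the costly path)
        self.lookups = 0        # total calls, cached or not

    def anonymize(self, addr):
        self.lookups += 1
        mapped = self._cache.get(addr)
        if mapped is None:
            # First time this address is seen: allocate and cache it.
            mapped = str(ipaddress.IPv4Address(self._next))
            self._next += 1
            self._cache[addr] = mapped
            self.allocations += 1
        return mapped


mapper = SequentialAddressMapper()
flows = [("192.0.2.10", "198.51.100.7"), ("192.0.2.10", "203.0.113.5")]
anonymized = [(mapper.anonymize(src), mapper.anonymize(dst))
              for src, dst in flows]
```

Counting allocations separately from lookups is one way to check whether time goes into creating new mappings or into finding existing ones, which is the split the profile above points at.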



On Sep 10, 2013, at 12:22 PM, Kaustubh Gadkari <kaustubh at CS.ColoState.EDU> wrote:

> 
> On Sep 10, 2013, at 8:40 AM, Carter Bullard <carter at qosient.com> wrote:
> 
>> Hey Kaustubh,
>> I've been profiling ranonymize() with a lot of data, and
>> while I do see opportunities to improve performance, I don't
>> see many massively inefficient parts of the code, when run
>> against my data sets.  There are still some things for
>> me to look at, so I wanted you to know that I'm working on
>> your problem. 
>> 
> 
> Thanks for looking at this, Carter. 
> 
>> Based on what you've sent me so far, your machine is 85%
>> idle. Is ranonymize() using 100% of a single core, or is it
>> sleeping a lot?
>> 
> 
> top says ranonymize is using 100% of a single core.
> 
>> What kind of machine are you running on??  Can you describe the
>> machine a bit?  CPUs, memory, disks, etc.
>> 
> 
> I've been testing this on two machines. One is a Dell PowerEdge 2970, with 2 quad core AMD Opteron processors. The machine has 32GB RAM, a 130GB system disk and 16 8TB RAID5 partitions. The other machine is a Dell PowerEdge 2950. It has 2 quad core Intel Xeon X5450 CPUs, with 32GB RAM, a 140GB system disk and 3 8TB RAID5 partitions.
> 
> Thanks,
> Kaustubh
> 
>> Carter
>> 
>> 
>> On Sep 3, 2013, at 3:05 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>> 
>>> On Tue, Sep 3, 2013 at 12:33 PM, Carter Bullard <carter at qosient.com> wrote:
>>>> Hmmmmmm, well, you're not using the machine much (85% idle)
>>>> so I'm looking into whether we're making any calls to any
>>>> routines that would add some wait states, like name lookups, or
>>>> sleeping somewhere.
>>>> 
>>>> Let's assume that there is a big problem, and I'll try to make
>>>> some changes to improve your performance.
>>>> 
>>> 
>>> Thanks, Carter.
>>> 
>>> Kaustubh
>>> 
>>>> Carter
>>>> 
>>>> 
>>>> 
>>>> On Sep 3, 2013, at 1:57 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>> 
>>>>> Hey Carter,
>>>>> 
>>>>> On Sep 3, 2013, at 11:36 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>> 
>>>>>> Hey Kaustubh,
>>>>>> If it's still writing records to the output file, it's not in an infinite loop,
>>>>>> although I'm sure that it feels like one.  So, no need to print debug msgs
>>>>>> or run under gdb().
>>>>>> 
>>>>>> Hmmmmmm, you must have a very large number of IP addresses.  racount() isn't doing
>>>>>> anything exotic with the "-M addr" mode.  It's hashing and storing each unique
>>>>>> IP address, so that we can report on how many and what types.
>>>>>> 
>>>>>> My guess is that you must be short on physical memory, and the programs are swapping,
>>>>>> which means that everything on this machine will be going very slowly.
>>>>>> Run " top " to see if one of our programs is eating all the memory, or
>>>>>> use vmstat() or vm_stat() or whatever to see if there is any paging.
>>>>>> 
>>>>> 
>>>>> No, the machine is not running out of memory. ranonymize is the largest memory user, and it is using 42.1% of a total of 32GB RAM. The swap usage is only 205MB, which is OK.  vmstat shows me the following:
>>>>> 
>>>>> kaustubh at proton:~$ sudo vmstat -w
>>>>> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>>>>> r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
>>>>> 1  0     205916    1638176     101636   16287400    0    0   527   342    1    1  14  0  85  1  0
>>>>> 
>>>>> There are no other memory intensive processes running on the box.
>>>>> 
>>>>>> If it is a memory problem, then you will need to subdivide the data based
>>>>>> on size, not on time, using rasplit().  And yes, it's easy to merge split files
>>>>>> back to a single file.
>>>>>> 
>>>>>> UNFORTUNATELY, because the scope of anonymization is the file, anonymizing a
>>>>>> single big file of records will generate different results compared to
>>>>>> anonymizing a set of split files created from the big file.  Address A will be
>>>>>> anonymized potentially to a different address in each file.
>>>>>> 
>>>>>> The configuration provides the means to get consistent results between files,
>>>>>> but it's a bit of work to do so.
>>>>>> 
>>>>>> Do you think you're running out of memory?
>>>>>> 
>>>>> 
>>>>> No, I think I'm ok in terms of memory usage.
>>>>> 
>>>>> Kaustubh
>>>>> 
>>>>>> Carter
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sep 3, 2013, at 1:11 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>> 
>>>>>>> On Tue, Sep 3, 2013 at 8:49 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>> Hmmm, if racount() takes 18min, I would think ranonymize() should take about 20min
>>>>>>>> to complete.   You can run " racount -M addr " to get racount() to print out address
>>>>>>>> information, like how many addresses are in the file.
>>>>>>>> 
>>>>>>> 
>>>>>>> Carter, I ran racount with -M addr, but the process hasn't finished
>>>>>>> yet (it's been running for about 90 min now). I'll let it run for a
>>>>>>> while longer and keep you updated.
>>>>>>> 
>>>>>>>> ranonymize() works on a single argus record at a time, reading a single record,
>>>>>>>> anonymizing all the various data elements, and then writing the anonymized
>>>>>>>> record out to the output file.  If ranonymize() hasn't written out a record recently,
>>>>>>>> then it's possible that it's in an infinite loop, especially if it's running at 100%, and
>>>>>>>> it's been running for a month, and it seems to have stopped writing into the file.
>>>>>>>> What was the last " modified " time on your output file ???
>>>>>>>> 
>>>>>>> 
>>>>>>> It hasn't stopped writing to file .. the last modified time is right
>>>>>>> now, since the process is still running.
>>>>>>> 
>>>>>>> 
>>>>>>>> If you've compiled debug support into your ra* programs, you can send a USR1
>>>>>>>> signal to the running ranonymize() and it will start writing debug information out
>>>>>>>> to stderr().  Send a USR2 to turn debug output off.  Assuming that ranonymize()'s
>>>>>>>> process id is 35122, you can do this:
>>>>>>>> 
>>>>>>>> % kill -USR1 35122
>>>>>>>> % kill -USR2 35122
>>>>>>>> 
>>>>>>>> If you've compiled development support into your programs, you can attach
>>>>>>>> to ranonymize() using gdb(), and then step through the program to see where
>>>>>>>> it is.
>>>>>>>> 
>>>>>>> 
>>>>>>> I haven't compiled my ra* programs with debug or development support.
>>>>>>> If you can tell me what I need to change in the Makefiles, I can do so
>>>>>>> and run ranonymize with gdb and see what's happening.
>>>>>>> 
>>>>>>> Kaustubh
>>>>>>> 
>>>>>>>> % gdb ranonymize 35122
>>>>>>>> 
>>>>>>>> This will attach to the program, and stop the active process.  If this all seems
>>>>>>>> unfamiliar, send more email, and I'll walk you through one of these strategies.
>>>>>>>> 
>>>>>>>> Carter
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sep 3, 2013, at 9:56 AM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> On Tue, Sep 3, 2013 at 7:19 AM, Kaustubh Gadkari
>>>>>>>>> <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>> On Tue, Sep 3, 2013 at 6:00 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>> Hmmmm,
>>>>>>>>>>> There shouldn't be any performance issues with anonymizing a file, if you're
>>>>>>>>>>> just anonymizing the IP addresses.  How many addresses are in the file?
>>>>>>>>>>> What does your ranonymize.conf file look like?   How much memory is it
>>>>>>>>>>> using?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I am not quite sure how many IP addresses there are in the file. My
>>>>>>>>>> ranonymize.conf looks like this:
>>>>>>>>>> 
>>>>>>>>>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>>>>>>>>>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>>>>>>>>> RANON_NET_ANONYMIZATION=sequential
>>>>>>>>>> RANON_HOST_ANONYMIZATION=sequential
>>>>>>>>>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=class
>>>>>>>>>> 
>>>>>>>>>> I took a look at how much memory ranonymize is using .. the usage is
>>>>>>>>>> about 42% on a machine with 32GB RAM.
>>>>>>>>>> 
>>>>>>>>>>> ranonymize() can be a little complex, O(n log n + C), but it should be
>>>>>>>>>>> in the same time frame as racount().  How long does it take for racount()
>>>>>>>>>>> to read the file?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I am running racount right now .. I will post results once it finishes.
>>>>>>>>> 
>>>>>>>>> racount takes about 18min to run on the file:
>>>>>>>>> 
>>>>>>>>> real    17m58.528s
>>>>>>>>> user    17m12.413s
>>>>>>>>> sys     2m0.332s
>>>>>>>>> 
>>>>>>>>> Kaustubh
>>>>>>>>> 
>>>>>>>>>>> Just a rule of thumb. If a ra* program doesn't complete in a few minutes,
>>>>>>>>>>> you should stop it and try to figure out if there is a memory problem or not.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks, I'll keep this in mind :)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Kaustubh
>>>>>>>>>> 
>>>>>>>>>>> Carter
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 2, 2013, at 2:20 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I have a set of argus flow data captured at our data capture vantage point,
>>>>>>>>>>> and I want to anonymize the IP addresses (both source and destination) fully
>>>>>>>>>>> i.e. I want to replace both the addresses, using a prefix preserving
>>>>>>>>>>> technique. I have tried using ranonymize, but it is taking an extremely long
>>>>>>>>>>> time to anonymize the file (I started the process a couple of months ago, on
>>>>>>>>>>> a ~125GB file, and the output file size today is only ~30GB).
>>>>>>>>>>> 
>>>>>>>>>>> Can anyone suggest the right way to go about anonymizing the data set I
>>>>>>>>>>> have? Is ranonymize the right tool for the job?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Kaustubh
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Kaustubh Gadkari
