Anonymization of argus flow data

Kaustubh Gadkari kaustubh at CS.ColoState.EDU
Tue Oct 15 15:27:43 EDT 2013


On Oct 15, 2013, at 1:24 PM, Carter Bullard <carter at qosient.com> wrote:

> Hey Kaustubh,
> I haven't found anything that would generate obvious delays in the algorithms.
> How many IP addresses are we talking about??
> 
> racount -M addr -r input.argus
> 

I am not sure. I have run this before and the process never finished (I let it run for about 2 hours before killing it), so I expect we have quite a large number of IP addresses. I can start racount again and let it run, and I'll post the results when it finishes.

Kaustubh

> Carter
> 
> 
> On Oct 8, 2013, at 2:11 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
> 
>> Hey Carter,
>> 
>> I don't have a .rarc file, and I am not setting RA_PRINT_NAMES explicitly anywhere. My invocation of ranonymize is as follows:
>> 
>> ranonymize -f /path/to/configfile -r input.argus -w output.argus - <filter expression>
>> 
>> The config file has the following entries:
>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>> RANON_NET_ANONYMIZATION=sequential
>> RANON_HOST_ANONYMIZATION=sequential
>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=class
>> 
>> Thanks,
>> Kaustubh
>> 
>> 
>> On Oct 8, 2013, at 10:04 AM, Carter Bullard <carter at qosient.com> wrote:
>> 
>>> Hey Kaustubh,
>>> There is a chance that if you run ranonymize() with the options to
>>> print hostnames, either in the .rarc file or using the -nn option
>>> on the command line, you will hurt ranonymize's performance by doing
>>> bind lookups on each address before the number is translated.
>>> 
>>> Any chance that is going on here?  What is the value of your RA_PRINT_NAMES
>>> variable in your .rarc, and/or how are you calling ranonymize() ?
>>> 
>>> Carter
>>> 
>>> 
>>> On Oct 8, 2013, at 8:45 AM, Carter Bullard <carter at qosient.com> wrote:
>>> 
>>>> Hey Kaustubh,
>>>> I have not had a chance, but thanks for reminding me.
>>>> I'll look at it today !!!!  Keep bugging me !!!
>>>> 
>>>> Carter
>>>> 
>>>>> On Oct 7, 2013, at 12:39 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>>> 
>>>>> Hey Carter,
>>>>> 
>>>>> I just wanted to check if you've found any reasons why ranonymize is taking so long to complete on my dataset?
>>>>> 
>>>>> Thanks,
>>>>> Kaustubh
>>>>> 
>>>>>> On Sep 10, 2013, at 10:40 AM, Kaustubh Gadkari <kaustubh at CS.ColoState.EDU> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Sep 10, 2013, at 9:33 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>> 
>>>>>>> Well,
>>>>>>> On my system 80% of the cycles are being spent doing the address,
>>>>>>> port, mac, AS number mappings (managing allocation of a new object
>>>>>>> and caching the values), and a small amount on the lookups.
>>>>>>> 
>>>>>>> I'll work on profiling the mapping logic to see if we've got
>>>>>>> something askew.
>>>>>> 
>>>>>> Great. Thanks again for the help.
>>>>>> 
>>>>>>> Hope all is most excellent,
>>>>>> 
>>>>>> And with you too :)
>>>>>> 
>>>>>> Kaustubh
>>>>>> 
>>>>>>> Carter
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Sep 10, 2013, at 12:22 PM, Kaustubh Gadkari <kaustubh at CS.ColoState.EDU> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sep 10, 2013, at 8:40 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hey Kaustubh,
>>>>>>>>> I've been profiling ranonymize() with a lot of data, and
>>>>>>>>> while I do see opportunities to improve performance, I don't
>>>>>>>>> see many massively inefficient parts of the code, when run
>>>>>>>>> against my data sets.  There are still some things for
>>>>>>>>> me to look at, so I wanted you to know that I'm working on
>>>>>>>>> your problem.
>>>>>>>> 
>>>>>>>> Thanks for looking at this, Carter. 
>>>>>>>> 
>>>>>>>>> Based on what you've sent me so far, your machine is 85%
>>>>>>>>> idle, is ranonymize() using 100% of a single core, or is it
>>>>>>>>> sleeping a lot?
>>>>>>>> 
>>>>>>>> top says ranonymize is using 100% of a single core.
>>>>>>>> 
>>>>>>>>> What kind of machine are you running on??  Can you describe the
>>>>>>>>> machine a bit?  CPUs, memory, disks, etc….
>>>>>>>> 
>>>>>>>> I've been testing this on two machines. One is a Dell PowerEdge 2970, with 2 quad core AMD Opteron processors. The machine has 32GB RAM, a 130GB system disk and 16 8TB RAID5 partitions. The other machine is a Dell PowerEdge 2950. It has 2 quad core Intel Xeon X5450 CPUs, with 32GB RAM, a 140GB system disk and 3 8TB RAID5 partitions.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Kaustubh
>>>>>>>> 
>>>>>>>>> Carter
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Sep 3, 2013, at 3:05 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Tue, Sep 3, 2013 at 12:33 PM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>> Hmmmmmm, well, you're not using the machine much (85% idle)
>>>>>>>>>>> so I'm looking into whether we're making any calls to any
>>>>>>>>>>> routines that would add some wait states, like name lookups, or
>>>>>>>>>>> sleeping somewhere.
>>>>>>>>>>> 
>>>>>>>>>>> Let's assume that there is a big problem, and I'll try to make
>>>>>>>>>>> some changes to improve your performance.
>>>>>>>>>> 
>>>>>>>>>> Thanks, Carter.
>>>>>>>>>> 
>>>>>>>>>> Kaustubh
>>>>>>>>>> 
>>>>>>>>>>> Carter
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 3, 2013, at 1:57 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hey Carter,
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 3, 2013, at 11:36 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hey Kaustubh,
>>>>>>>>>>>>> If it's still writing records to the output file, it's not in an infinite loop,
>>>>>>>>>>>>> although I'm sure that it feels like one.  So, no need to print debug msgs
>>>>>>>>>>>>> or run under gdb().
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hmmmmmm, you must have a very large number of IP addresses.  racount() isn't doing
>>>>>>>>>>>>> anything exotic with the "-M addr" mode.  It's hashing and storing each unique
>>>>>>>>>>>>> IP address, so that we can report on how many and what types.
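A minimal sketch of the "hash and store each unique address" idea described
above. It is not racount's code; the function name and the input format (an
iterable of source/destination address strings) are assumptions made for
illustration. It shows why memory use grows with the number of distinct
addresses rather than with the number of records.

```python
from collections import Counter
import ipaddress


def count_unique_addresses(flows):
    """Count distinct addresses in an iterable of (src, dst) strings.

    Hypothetical helper: the real tool reads argus records, not tuples.
    """
    seen = set()            # hash set: one entry per distinct address
    by_version = Counter()  # distinct addresses per IP version
    for src, dst in flows:
        for text in (src, dst):
            addr = ipaddress.ip_address(text)
            if addr not in seen:
                seen.add(addr)
                by_version[f"IPv{addr.version}"] += 1
    return len(seen), dict(by_version)


if __name__ == "__main__":
    sample = [("10.0.0.1", "10.0.0.2"), ("10.0.0.1", "2001:db8::1")]
    print(count_unique_addresses(sample))
```

Each new address costs one set entry, so a file with a very large number of
distinct addresses needs a correspondingly large table.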
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My guess is that you must be short on physical memory, and the programs are swapping,
>>>>>>>>>>>>> which means that everything on this machine will be going very slowly.
>>>>>>>>>>>>> Run " top " to see if one of our programs is eating all the memory, or
>>>>>>>>>>>>> use vmstat() or vm_stat() or whatever to see if there is any paging.
>>>>>>>>>>>> 
>>>>>>>>>>>> No, the machine is not running out of memory. ranonymize is the largest memory user, and it is using 42.1% of a total of 32GB RAM. The swap usage is only 205MB, which is OK.  vmstat shows me the following:
>>>>>>>>>>>> 
>>>>>>>>>>>> kaustubh at proton:~$ sudo vmstat -w
>>>>>>>>>>>> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>>>>>>>>>>>> r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
>>>>>>>>>>>> 1  0     205916    1638176     101636   16287400    0    0   527   342    1    1  14  0  85  1  0
>>>>>>>>>>>> 
>>>>>>>>>>>> There are no other memory intensive processes running on the box.
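A short sketch of how the vmstat line above can be read, using the values
exactly as quoted. Two assumptions are labeled in the comments: the machine has
8 cores (2 quad-core CPUs, as described elsewhere in this thread), and the
helper name parse_vmstat is made up for this example. Note also that vmstat
invoked without an interval prints averages since the last boot, so the "id"
column here is not necessarily the current idle figure.

```python
# Column names and values copied from the vmstat output quoted above.
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
SAMPLE = "1 0 205916 1638176 101636 16287400 0 0 527 342 1 1 14 0 85 1 0"


def parse_vmstat(header, row):
    """Pair each vmstat column name with its integer value (hypothetical helper)."""
    return dict(zip(header.split(), (int(v) for v in row.split())))


stats = parse_vmstat(HEADER, SAMPLE)

# si/so are the swap-in/swap-out rates; both zero means no paging in this report.
paging = stats["si"] > 0 or stats["so"] > 0

# Assumption: 8 cores. One fully busy core is then 100 / 8 of total CPU time.
CORES = 8
one_core_share = 100 / CORES

if __name__ == "__main__":
    print("paging:", paging)
    print("user CPU %:", stats["us"], "one core would be %:", one_core_share)
```

Read this way, a user CPU figure in the low teens on an 8-core machine is in the
same range as a single process keeping one core fully busy.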
>>>>>>>>>>>> 
>>>>>>>>>>>>> If it is a memory problem, then you will need to subdivide the data based
>>>>>>>>>>>>> on size, not on time, using rasplit().  And yes, it's easy to merge split files
>>>>>>>>>>>>> back to a single file.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> UNFORTUNATELY, because the scope of anonymization is the file, anonymizing a
>>>>>>>>>>>>> single big file of records will generate different results compared to
>>>>>>>>>>>>> anonymizing a set of split files created from the big file.  Address A will be
>>>>>>>>>>>>> anonymized potentially to a different address in each file.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The configuration provides the means to get consistent results between files,
>>>>>>>>>>>>> but it's a bit of work to do so.
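To illustrate why per-file scope matters, here is a toy sequential mapper. It is
a hypothetical stand-in, not ranonymize's algorithm: it ignores prefix
preservation, and the class name, the 10.0.0.0 base, and the sample addresses
are all assumptions. It only shows that a mapping built in order of first
appearance gives the same real address different pseudonyms when each split
file starts with an empty table.

```python
import ipaddress


class SequentialAnonymizer:
    """Toy mapper: pseudonyms are handed out in order of first appearance."""

    def __init__(self, base="10.0.0.0"):
        self._base = int(ipaddress.IPv4Address(base))
        self.table = {}  # real address -> pseudonym

    def anonymize(self, addr):
        if addr not in self.table:
            next_value = self._base + len(self.table) + 1
            self.table[addr] = str(ipaddress.IPv4Address(next_value))
        return self.table[addr]


def anonymize_file(addrs, anonymizer):
    """Anonymize every address of one 'file' (here just a list of strings)."""
    return [anonymizer.anonymize(a) for a in addrs]


file_a = ["192.0.2.7", "198.51.100.9"]
file_b = ["198.51.100.9", "192.0.2.7"]

# Fresh table per file: 192.0.2.7 gets a different pseudonym in each file.
independent_a = anonymize_file(file_a, SequentialAnonymizer())
independent_b = anonymize_file(file_b, SequentialAnonymizer())

# One table carried across both files: the pseudonyms stay consistent.
shared = SequentialAnonymizer()
shared_a = anonymize_file(file_a, shared)
shared_b = anonymize_file(file_b, shared)

if __name__ == "__main__":
    print(independent_a, independent_b)
    print(shared_a, shared_b)
```

Carrying one table across files is only a way to picture the problem; it is not
the configuration mechanism referred to above.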
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Do you think you're running out of memory?
>>>>>>>>>>>> 
>>>>>>>>>>>> No, I think I'm ok in terms of memory usage.
>>>>>>>>>>>> 
>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>> 
>>>>>>>>>>>>> Carter
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sep 3, 2013, at 1:11 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Sep 3, 2013 at 8:49 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>>>>> Hmmm, if racount() takes 18min, I would think ranonymize() should take about 20min
>>>>>>>>>>>>>>> to complete.   You can run " racount -M addr " to get racount() to printout address
>>>>>>>>>>>>>>> information, like how many addresses are in the file.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Carter, I ran racount with -M addr, but the process hasn't finished
>>>>>>>>>>>>>> yet (it's been running for about 90 min now). I'll let it run for a
>>>>>>>>>>>>>> while longer and keep you updated.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ranonymize() works on a single argus record at a time, reading a single record,
>>>>>>>>>>>>>>> anonymizing all the various data elements, and then writing the anonymized
>>>>>>>>>>>>>>> record out to the output file.  If ranonymize() hasn't written out a record recently,
>>>>>>>>>>>>>>> then it's possible that it's in an infinite loop, especially if it's running at 100%,
>>>>>>>>>>>>>>> it's been running for a month, and it seems to have stopped writing into the file.
>>>>>>>>>>>>>>> What was the last " modified " time on your output file ???
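One way to check the "last modified" time programmatically is sketched below.
The function names and the 300-second threshold are assumptions chosen for the
example; the only fact relied on is that a file's modification time advances
each time a process writes to it.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how long ago (in seconds) the file at path was last written."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_stalled(path, threshold_s=300, now=None):
    """True if the file has not been modified within threshold_s seconds.

    The threshold is an arbitrary illustrative value, not an argus default.
    """
    return seconds_since_modified(path, now) > threshold_s


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        age = seconds_since_modified(sys.argv[1])
        print(f"{sys.argv[1]}: last written {age:.0f}s ago")
```

A recent modification time only shows that records are still being written; it
says nothing about how fast.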
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It hasn't stopped writing to file .. the last modified time is right
>>>>>>>>>>>>>> now, since the process is still running.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If you've compiled debug support into your ra* programs, you can send a USR1
>>>>>>>>>>>>>>> signal to the running ranonymize() and it will start writing debug information out
>>>>>>>>>>>>>>> to stderr().  Send a USR2 to turn debug output off.  Assuming that ranonymize()'s
>>>>>>>>>>>>>>> process id is 35122, you can do this:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> % kill -USR1 35122
>>>>>>>>>>>>>>> % kill -USR2 35122
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If you've compiled development support into your programs, you can attach
>>>>>>>>>>>>>>> to ranonymize() using gdb(), and then step through the program to see where
>>>>>>>>>>>>>>> it is.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I haven't compiled my ra* programs with debug or development support.
>>>>>>>>>>>>>> If you can tell me what I need to change in the Makefiles, I can do so
>>>>>>>>>>>>>> and run ranonymize with gdb and see what's happening.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> % gdb ranonymize 35122
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This will attach to the program, and stop the active process.  If this all seems
>>>>>>>>>>>>>>> unfamiliar, send more email, and I'll walk you through one of these strategies.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Carter
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sep 3, 2013, at 9:56 AM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Sep 3, 2013 at 7:19 AM, Kaustubh Gadkari
>>>>>>>>>>>>>>>> <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>>>>>>>>>> On Tue, Sep 3, 2013 at 6:00 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>>>>>>>>>>>> Hmmmm,
>>>>>>>>>>>>>>>>>> There shouldn't be any performance issues with anonymizing a file, if you're
>>>>>>>>>>>>>>>>>> just anonymizing the IP addresses.  How many addresses are in the file?
>>>>>>>>>>>>>>>>>> What does your ranonymize.conf file look like?   How much memory is it using?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I am not quite sure how many IP addresses there are in the file. My
>>>>>>>>>>>>>>>>> ranonymize.conf looks like this:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>>>>>>>>>>>>>>>>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>>>>>>>>>>>>>>>> RANON_NET_ANONYMIZATION=sequential
>>>>>>>>>>>>>>>>> RANON_HOST_ANONYMIZATION=sequential
>>>>>>>>>>>>>>>>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=class
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I took a look at how much memory ranonymize is using .. the usage is
>>>>>>>>>>>>>>>>> about 42% on a machine with 32GB RAM.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ranonymize() can be a little complex O(nLogN + C), but it should be
>>>>>>>>>>>>>>>>>> in the same time frame as racount().  How long does it take for racount()
>>>>>>>>>>>>>>>>>> to read the file?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I am running racount right now .. I will post results once it finishes.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> racount takes about 18min to run on the file:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> real    17m58.528s
>>>>>>>>>>>>>>>> user    17m12.413s
>>>>>>>>>>>>>>>> sys     2m0.332s
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Just a rule of thumb. If a ra* program doesn't complete in a few minutes,
>>>>>>>>>>>>>>>>>> you should stop it and try to figure out if there is a memory problem or not.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks, I'll keep this in mind :)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Carter
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Sep 2, 2013, at 2:20 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I have a set of argus flow data captured at our data capture vantage point,
>>>>>>>>>>>>>>>>>> and I want to anonymize the IP addresses (both source and destination) fully
>>>>>>>>>>>>>>>>>> i.e. I want to replace both the addresses, using a prefix preserving
>>>>>>>>>>>>>>>>>> technique. I have tried using ranonymize, but it is taking an extremely long
>>>>>>>>>>>>>>>>>> time to anonymize the file (I started the process a couple of months ago, on
>>>>>>>>>>>>>>>>>> a ~125GB file, and the output file size today is only ~30GB).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Can anyone suggest the right way to go about anonymizing the data set I
>>>>>>>>>>>>>>>>>> have? Is ranonymize the right tool for the job?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Kaustubh
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Kaustubh Gadkari
>>>>>>>>>>>> kaustubh at cs.colostate.edu
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -- 
>>>>>>>>>> Kaustubh Gadkari
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Kaustubh Gadkari
>>>>>>>> kaustubh at cs.colostate.edu
>>>>>> 
>>>>>> --
>>>>>> Kaustubh Gadkari
>>>>>> kaustubh at cs.colostate.edu
>>>>> 
>>>>> --
>>>>> Kaustubh Gadkari
>>>>> kaustubh at cs.colostate.edu
>>>>> 
>>>> 
>>> 
>> 
>> --
>> Kaustubh Gadkari
>> kaustubh at cs.colostate.edu
>> 
> 

--
Kaustubh Gadkari
kaustubh at cs.colostate.edu
