Anonymization of argus flow data

Kaustubh Gadkari kaustubh at cs.colostate.edu
Tue Sep 3 15:05:20 EDT 2013


On Tue, Sep 3, 2013 at 12:33 PM, Carter Bullard <carter at qosient.com> wrote:
> Hmmmmmm, well, you're not using the machine much (85% idle)
> so I'm looking into whether we're making any calls to any
> routines that would add some wait states, like name lookups, or
> sleeping somewhere.
>
> Let's assume that there is a big problem, and I'll try to make
> some changes to improve your performance.
>

Thanks, Carter.

Kaustubh

> Carter
>
>
>
> On Sep 3, 2013, at 1:57 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
>
>> Hey Carter,
>>
>> On Sep 3, 2013, at 11:36 AM, Carter Bullard <carter at qosient.com> wrote:
>>
>>> Hey Kaustubh,
>>> If it's still writing records to the output file, it's not in an infinite loop,
>>> although I'm sure that it feels like one.  So, no need to print debug msgs
>>> or run under gdb().
>>>
>>> Hmmmmmm, you must have a very large number of IP addresses.  racount() isn't doing
>>> anything exotic with the "-M addr" mode.  It's hashing and storing each unique
>>> IP address, so that we can report on how many and what types.
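
A minimal sketch of the "hash and store each unique address" idea described above. This is not racount()'s actual code; the function name and the (src, dst) record shape are assumptions made for illustration. It shows why memory grows with the number of distinct addresses rather than with the number of records.

```python
import ipaddress
from collections import Counter


def count_unique_addresses(flows):
    """Count distinct addresses in an iterable of (src, dst) string pairs.

    Hypothetical helper: every address seen is kept in a hash set, so
    memory use is proportional to the number of unique addresses.
    """
    seen = set()
    by_type = Counter()
    for src, dst in flows:
        for text in (src, dst):
            addr = ipaddress.ip_address(text)
            if addr not in seen:
                seen.add(addr)
                by_type[f"IPv{addr.version}"] += 1
    return len(seen), dict(by_type)


# Made-up sample records, not taken from the data set in this thread.
flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.2", "192.168.1.9"),
    ("2001:db8::1", "10.0.0.1"),
]
total, kinds = count_unique_addresses(flows)
print(total, kinds)
```
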
>>>
>>> My guess is that you must be short on physical memory, and the programs are swapping,
>>> which means that everything on this machine will be going very slowly.
>>> Run " top " to see if one of our programs is eating all the memory, or
>>> use vmstat() or vm_stat() or whatever to see if there is any paging.
>>>
>>
>> No, the machine is not running out of memory. ranonymize is the largest memory user, and it is using 42.1% of a total of 32GB RAM. The swap usage is only 205MB, which is OK.  vmstat shows me the following:
>>
>> kaustubh at proton:~$ sudo vmstat -w
>> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>> r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
>> 1  0     205916    1638176     101636   16287400    0    0   527   342    1    1  14  0  85  1  0
>>
>> There are no other memory intensive processes running on the box.
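
For reference, a small sketch that pulls the swap columns out of the vmstat output pasted above. The helper names are made up for illustration. Note that a single vmstat report gives averages since boot, so a run with a delay argument (for example "vmstat 5") is a better check of whether the box is paging right now.

```python
SAMPLE = """\
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
1  0     205916    1638176     101636   16287400    0    0   527   342    1    1  14  0  85  1  0
"""


def parse_vmstat(text):
    """Map vmstat column names to the integer values of the last data row."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    names = lines[-2].split()   # second header row holds the column names
    values = [int(v) for v in lines[-1].split()]
    return dict(zip(names, values))


def is_paging(stats):
    """True if any memory is being swapped in (si) or out (so)."""
    return stats["si"] > 0 or stats["so"] > 0


stats = parse_vmstat(SAMPLE)
print("swap used (KiB):", stats["swpd"])
print("paging:", is_paging(stats))
```
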
>>
>>> If it is a memory problem, then you will need to subdivide the data based
>>> on size, not on time, using rasplit().  And yes, it's easy to merge split files
>>> back to a single file.
>>>
>>> UNFORTUNATELY, because the scope of anonymization is the file, anonymizing a
>>> single big file of records will generate different results compared to
>>> anonymizing a set of split files created from the big file.  Address A will be
>>> anonymized potentially to a different address in each file.
>>>
>>> The configuration provides the means to get consistent results between files,
>>> but it's a bit of work to do so.
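
A toy illustration of the file-scope issue described above. It does not reproduce ranonymize()'s algorithm or its configuration; it uses a simple "first seen gets the next number" table, and all names here are hypothetical. With a fresh table per file the same address can get different pseudonyms, while one table carried across files keeps the mapping consistent.

```python
def make_sequential_anonymizer():
    """Return a function that maps each new address to the next pseudonym."""
    table = {}

    def anonymize(addr):
        if addr not in table:
            table[addr] = f"anon-{len(table) + 1}"
        return table[addr]

    return anonymize


def anonymize_files(files, shared_table):
    """Anonymize each file, optionally reusing one table across all files."""
    anonymize = make_sequential_anonymizer()
    results = []
    for records in files:
        if not shared_table:
            anonymize = make_sequential_anonymizer()  # fresh scope per file
        results.append([anonymize(addr) for addr in records])
    return results


# Two made-up split files that contain the same addresses in a different order.
split_files = [["10.1.1.1", "10.2.2.2"], ["10.2.2.2", "10.1.1.1"]]
per_file = anonymize_files(split_files, shared_table=False)
shared = anonymize_files(split_files, shared_table=True)
print(per_file)  # 10.2.2.2 is anon-2 in the first file but anon-1 in the second
print(shared)    # 10.2.2.2 is anon-2 in both files
```
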
>>>
>>> Do you think you're running out of memory?
>>>
>>
>> No, I think I'm ok in terms of memory usage.
>>
>> Kaustubh
>>
>>> Carter
>>>
>>>
>>>
>>> On Sep 3, 2013, at 1:11 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>
>>>> On Tue, Sep 3, 2013 at 8:49 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>> Hmmm, if racount() takes 18min, I would think ranonymize() should take about 20min
>>>>> to complete.   You can run " racount -M addr " to get racount() to print out address
>>>>> information, like how many addresses are in the file.
>>>>>
>>>>
>>>> Carter, I ran racount with -M addr, but the process hasn't finished
>>>> yet (it's been running for about 90 min now). I'll let it run for a
>>>> while longer and keep you updated.
>>>>
>>>>> ranonymize() works on a single argus record at a time, reading a single record,
>>>>> anonymizing all the various data elements, and then writing the anonymized
>>>>> record out to the output file.  If ranonymize() hasn't written out a record recently,
>>>>> then it's possible that it's in an infinite loop, especially if it's running at 100% CPU, and
>>>>> it's been running for a month, and it seems to have stopped writing into the file.
>>>>> What was the last " modified " time on your output file ???
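
One way to script the "last modified" check suggested above, as a rough sketch. The function names and the 300 second threshold are arbitrary assumptions; the demo uses a temporary file with a forced timestamp rather than a real ranonymize output file.

```python
import os
import tempfile
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime


def looks_stalled(path, threshold_s=300.0, now=None):
    """Guess that the writer is stuck if the file is older than the threshold."""
    return seconds_since_modified(path, now) > threshold_s


# Demo on a throwaway file whose modification time is set to t=1000.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"record")
demo_path = tmp.name
os.utime(demo_path, (1000.0, 1000.0))
print(looks_stalled(demo_path, now=1200.0))  # written 200 s before "now"
print(looks_stalled(demo_path, now=2000.0))  # written 1000 s before "now"
os.remove(demo_path)
```
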
>>>>>
>>>>
>>>> It hasn't stopped writing to file .. the last modified time is right
>>>> now, since the process is still running.
>>>>
>>>>
>>>>> If you've compiled debug support into your ra* programs, you can send a USR1
>>>>> signal to the running ranonymize() and it will start writing debug information out
>>>>> to stderr().  Send a USR2 to turn debug output off.  Assuming that ranonymize()'s
>>>>> process id is 35122, you can do this:
>>>>>
>>>>> % kill -USR1 35122
>>>>> % kill -USR2 35122
>>>>>
>>>>> If you've compiled development support into your programs, you can attach
>>>>> to ranonymize() using gdb(), and then step through the program to see where
>>>>> it is.
>>>>>
>>>>
>>>> I haven't compiled my ra* programs with debug or development support.
>>>> If you can tell me what I need to change in the Makefiles, I can do so
>>>> and run ranonymize with gdb and see what's happening.
>>>>
>>>> Kaustubh
>>>>
>>>>> % gdb ranonymize 35122
>>>>>
>>>>> This will attach to the program, and stop the active process.  If this all seems
>>>>> unfamiliar, send more email, and I'll walk you through one of these strategies.
>>>>>
>>>>> Carter
>>>>>
>>>>>
>>>>> On Sep 3, 2013, at 9:56 AM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>
>>>>>> On Tue, Sep 3, 2013 at 7:19 AM, Kaustubh Gadkari
>>>>>> <kaustubh.gadkari at gmail.com> wrote:
>>>>>>> On Tue, Sep 3, 2013 at 6:00 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>>> Hmmmm,
>>>>>>>> There shouldn't be any performance issues with anonymizing a file, if you're just
>>>>>>>> anonymizing the IP addresses.  How many addresses are in the file?
>>>>>>>> What does your ranonymize.conf file look like?   How much memory is it using?
>>>>>>>>
>>>>>>>
>>>>>>> I am not quite sure how many IP addresses there are in the file. My
>>>>>>> ranonymize.conf looks like this:
>>>>>>>
>>>>>>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>>>>>>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>>>>>> RANON_NET_ANONYMIZATION=sequential
>>>>>>> RANON_HOST_ANONYMIZATION=sequential
>>>>>>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=class
>>>>>>>
>>>>>>> I took a look at how much memory ranonymize is using .. the usage is
>>>>>>> about 42% on a machine with 32GB RAM.
>>>>>>>
>>>>>>>> ranonymize() can be a little complex O(nLogN + C), but it should be
>>>>>>>> in the same time frame as racount().  How long does it take for racount()
>>>>>>>> to read the file?
>>>>>>>>
>>>>>>>
>>>>>>> I am running racount right now .. I will post results once it finishes.
>>>>>>
>>>>>> racount takes about 18min to run on the file:
>>>>>>
>>>>>> real    17m58.528s
>>>>>> user    17m12.413s
>>>>>> sys     2m0.332s
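
A back-of-the-envelope comparison using the figures quoted in this thread (~125GB input, ~30GB of output so far, the racount time above). The 60 days standing in for "a couple of months" and the use of decimal gigabytes are assumptions, and the two rates are not strictly comparable since one is input read and the other is output written. It is only meant to show the order of magnitude of the gap.

```python
GB = 10**9  # assumption: decimal gigabytes

# racount: "real" time reported above, reading the whole ~125GB file.
racount_seconds = 17 * 60 + 58.528
input_bytes = 125 * GB
racount_rate = input_bytes / racount_seconds

# ranonymize: ~30GB written after "a couple of months" (assumed to be 60 days).
assumed_days = 60
ranonymize_seconds = assumed_days * 24 * 3600
output_bytes = 30 * GB
ranonymize_rate = output_bytes / ranonymize_seconds

print(f"racount reads about {racount_rate / 1e6:.1f} MB/s")
print(f"ranonymize writes about {ranonymize_rate / 1e3:.1f} kB/s")
print(f"ratio is roughly {racount_rate / ranonymize_rate:,.0f}x")
```
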
>>>>>>
>>>>>> Kaustubh
>>>>>>
>>>>>>>> Just a rule of thumb. If a ra* program doesn't complete in a few minutes, you
>>>>>>>> should stop it and try to figure out if there is a memory problem or not.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks, I'll keep this in mind :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kaustubh
>>>>>>>
>>>>>>>> Carter
>>>>>>>>
>>>>>>>> On Sep 2, 2013, at 2:20 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a set of argus flow data captured at our data capture vantage point,
>>>>>>>> and I want to anonymize the IP addresses (both source and destination) fully,
>>>>>>>> i.e. I want to replace both addresses, using a prefix-preserving
>>>>>>>> technique. I have tried using ranonymize, but it is taking an extremely long
>>>>>>>> time to anonymize the file (I started the process a couple of months ago, on
>>>>>>>> a ~125GB file, and the output file size today is only ~30GB).
>>>>>>>>
>>>>>>>> Can anyone suggest the right way to go about anonymizing the data set I
>>>>>>>> have? Is ranonymize the right tool for the job?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kaustubh
>>>>>>>>
>>>>>>>> --
>>>>>>>> Kaustubh Gadkari



-- 
Kaustubh Gadkari


