Anonymization of argus flow data
Carter Bullard
carter at qosient.com
Tue Sep 3 14:33:02 EDT 2013
Hmmmmmm, well, you're not using the machine much (85% idle)
so I'm looking into whether we're making any calls to any
routines that would add some wait states, like name lookups, or
sleeping somewhere.
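
The 85% idle figure comes from the vmstat line quoted further down in this thread. As a minimal sketch (Python is an assumption here, and the helper names `parse_vmstat_row` and `summarize` are hypothetical), the row can be split into named columns so the CPU and swap fields are easy to read. Keep in mind that vmstat run without an interval reports averages since boot, so the row may not reflect what ranonymize is doing right now.

```python
# Header and sample row copied from the vmstat output quoted later in this thread.
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
SAMPLE = "1 0 205916 1638176 101636 16287400 0 0 527 342 1 1 14 0 85 1 0"


def parse_vmstat_row(header: str, row: str) -> dict:
    """Pair each vmstat column name with its integer value (hypothetical helper)."""
    names = header.split()
    values = [int(v) for v in row.split()]
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    return dict(zip(names, values))


def summarize(stats: dict) -> str:
    """Report busy CPU (us + sy), idle, I/O wait, and swap traffic (si + so)."""
    busy = stats["us"] + stats["sy"]
    paging = stats["si"] + stats["so"]
    return (f"busy={busy}% idle={stats['id']}% "
            f"iowait={stats['wa']}% swap_in_out={paging}")


stats = parse_vmstat_row(HEADER, SAMPLE)
print(summarize(stats))  # busy=14% idle=85% iowait=1% swap_in_out=0
```

With si and so both at 0 and wa at 1, this sample shows no paging and almost no I/O wait, which is why the next suspects are blocking calls inside the program itself.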
Let's assume that there is a big problem, and I'll try to make
some changes to improve your performance.
Carter
On Sep 3, 2013, at 1:57 PM, Kaustubh Gadkari <kaustubh at cs.colostate.edu> wrote:
> Hey Carter,
>
> On Sep 3, 2013, at 11:36 AM, Carter Bullard <carter at qosient.com> wrote:
>
>> Hey Kaustubh,
>> If it's still writing records to the output file, it's not in an infinite loop,
>> although I'm sure that it feels like one. So, no need to print debug msgs
>> or run under gdb().
>>
>> Hmmmmmm, you must have a very large number of IP addresses. racount() isn't doing
>> anything exotic with the "-M addr" mode. It's hashing and storing each unique
>> IP address, so that we can report on how many and what types.
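
To illustrate the "hash and store each unique address" idea, here is a minimal sketch in Python. It is not racount's code: the function `count_unique_addresses` and the sample flows are hypothetical, and the only "type" it distinguishes is IPv4 versus IPv6. What it does show is that memory grows with the number of distinct addresses, not with the number of records.

```python
import ipaddress
from collections import Counter


def count_unique_addresses(flows):
    """Count distinct source/destination addresses and tally them by IP version.

    `flows` is an iterable of (src, dst) address strings (hypothetical input format).
    """
    seen = set()          # hash set: one entry per unique address
    by_type = Counter()   # how many unique addresses of each kind
    for src, dst in flows:
        for text in (src, dst):
            addr = ipaddress.ip_address(text)
            if addr in seen:
                continue
            seen.add(addr)
            by_type[f"IPv{addr.version}"] += 1
    return len(seen), dict(by_type)


# Made-up sample flows using documentation and private address ranges.
flows = [
    ("10.0.0.1", "192.0.2.7"),
    ("10.0.0.1", "2001:db8::1"),
    ("192.0.2.7", "10.0.0.2"),
]
total, types = count_unique_addresses(flows)
print(total, types)  # 4 {'IPv4': 3, 'IPv6': 1}
```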
>>
>> My guess is that you must be short on physical memory, and the programs are swapping,
>> which means that everything on this machine will be going very slowly.
>> Run " top " to see if one of our programs is eating all the memory, or
>> use vmstat() or vm_stat() or whatever to see if there is any paging.
>>
>
> No, the machine is not running out of memory. ranonymize is the largest memory user, and it is using 42.1% of a total of 32GB RAM. The swap usage is only 205MB, which is OK. vmstat shows me the following:
>
> kaustubh at proton:~$ sudo vmstat -w
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 1 0 205916 1638176 101636 16287400 0 0 527 342 1 1 14 0 85 1 0
>
> There are no other memory intensive processes running on the box.
>
>> If it is a memory problem, then you will need to subdivide the data based
>> on size, not on time, using rasplit(). And yes, it's easy to merge split files
>> back to a single file.
>>
>> UNFORTUNATELY, because the scope of anonymization is the file, anonymizing a
>> single big file of records will generate different results compared to
>> anonymizing a set of split files created from the big file. Address A will be
>> anonymized potentially to a different address in each file.
>>
>> The configuration provides the means to get consistent results between files,
>> but it's a bit of work to do so.
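
The scope problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SequentialAnonymizer` is a hypothetical class that hands out replacement addresses in first-seen order, the 10.0.0.0 base and the sample addresses are made up, and it does not model ranonymize's hierarchy-preserving options or its configuration file.

```python
import ipaddress


class SequentialAnonymizer:
    """Hypothetical: map each newly seen address to the next address after a base."""

    def __init__(self, base: str = "10.0.0.0"):
        self._base = int(ipaddress.IPv4Address(base))
        self.mapping = {}  # real address -> anonymized address

    def anonymize(self, addr: str) -> str:
        if addr not in self.mapping:
            next_value = self._base + len(self.mapping) + 1
            self.mapping[addr] = str(ipaddress.IPv4Address(next_value))
        return self.mapping[addr]


def anonymize_file(records, anonymizer):
    """Anonymize every address in one 'file' (here just a list of addresses)."""
    return [anonymizer.anonymize(addr) for addr in records]


# Two split files that see the same addresses in a different order.
file_a = ["192.0.2.1", "192.0.2.2"]
file_b = ["192.0.2.2", "192.0.2.1"]

# Per-file scope: a fresh mapping for each file, so 192.0.2.1 gets two identities.
per_file_a = anonymize_file(file_a, SequentialAnonymizer())
per_file_b = anonymize_file(file_b, SequentialAnonymizer())
print(per_file_a, per_file_b)  # ['10.0.0.1', '10.0.0.2'] ['10.0.0.1', '10.0.0.2']

# Shared scope: one mapping carried across files keeps each address consistent.
shared = SequentialAnonymizer()
shared_a = anonymize_file(file_a, shared)
shared_b = anonymize_file(file_b, shared)
print(shared_a, shared_b)  # ['10.0.0.1', '10.0.0.2'] ['10.0.0.2', '10.0.0.1']
```

In the per-file run, 192.0.2.1 becomes 10.0.0.1 in the first file and 10.0.0.2 in the second. Carrying one mapping across files removes the inconsistency, at the cost of keeping that state around between runs.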
>>
>> Do you think you're running out of memory?
>>
>
> No, I think I'm ok in terms of memory usage.
>
> Kaustubh
>
>> Carter
>>
>>
>>
>> On Sep 3, 2013, at 1:11 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>
>>> On Tue, Sep 3, 2013 at 8:49 AM, Carter Bullard <carter at qosient.com> wrote:
>>>> Hmmm, if racount() takes 18min, I would think ranonymize() should take about 20min
>>>> to complete. You can run " racount -M addr " to get racount() to printout address
>>>> information, like how many addresses are in the file.
>>>>
>>>
>>> Carter, I ran racount with -M addr, but the process hasn't finished
>>> yet (it's been running for about 90 min now). I'll let it run for a
>>> while longer and keep you updated.
>>>
>>>> ranonymize() works on a single argus record at a time, reading a single record,
>>>> anonymizing all the various data elements, and then writing the anonymized
>>>> record out to the output file. If ranonymize() hasn't written out a record recently,
>>>> then it's possible that it's in an infinite loop, especially if it's running at 100%, and
>>>> it's been running for a month, and it seems to have stopped writing into the file.
>>>> What was the last " modified " time on your output file ???
>>>>
>>>
>>> It hasn't stopped writing to file .. the last modified time is right
>>> now, since the process is still running.
>>>
>>>
>>>> If you've compiled debug support into your ra* programs, you can send a USR1
>>>> signal to the running ranonymize() and it will start writing debug information out
>>>> to stderr(). Send a USR2 to turn debug output off. Assuming that ranonymize()'s
>>>> process id is 35122, you can do this:
>>>>
>>>> % kill -USR1 35122
>>>> % kill -USR2 35122
>>>>
>>>> If you've compiled development support into your programs, you can attach
>>>> to ranonymize() using gdb(), and then step through the program to see where
>>>> it is.
>>>>
>>>
>>> I haven't compiled my ra* programs with debug or development support.
>>> If you can tell me what I need to change in the Makefiles, I can do so
>>> and run ranonymize with gdb and see what's happening.
>>>
>>> Kaustubh
>>>
>>>> % gdb ranonymize 35122
>>>>
>>>> This will attach to the program, and stop the active process. If this all seems
>>>> unfamiliar, send more email, and I'll walk you through one of these strategies.
>>>>
>>>> Carter
>>>>
>>>>
>>>> On Sep 3, 2013, at 9:56 AM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com> wrote:
>>>>
>>>>> On Tue, Sep 3, 2013 at 7:19 AM, Kaustubh Gadkari
>>>>> <kaustubh.gadkari at gmail.com> wrote:
>>>>>> On Tue, Sep 3, 2013 at 6:00 AM, Carter Bullard <carter at qosient.com> wrote:
>>>>>>> Hmmmm,
>>>>>>> There shouldn't be any performance issues with anonymizing a file, if you're just
>>>>>>> anonymizing the IP addresses. How many addresses are in the file?
>>>>>>> What does your ranonymize.conf file look like? How much memory is it using?
>>>>>>>
>>>>>>
>>>>>> I am not quite sure how many IP addresses there are in the file. My
>>>>>> ranonymize.conf looks like this:
>>>>>>
>>>>>> RANON_PRESERVE_ETHERNET_VENDOR=yes
>>>>>> RANON_PRESERVE_BROADCAST_ADDRESS=yes
>>>>>> RANON_NET_ANONYMIZATION=sequential
>>>>>> RANON_HOST_ANONYMIZATION=sequential
>>>>>> RANON_PRESERVE_NET_ADDRESS_HIERARCHY=class
>>>>>>
>>>>>> I took a look at how much memory ranonymize is using .. the usage is
>>>>>> about 42% on a machine with 32GB RAM.
>>>>>>
>>>>>>> ranonymize() can be a little complex O(nLogN + C), but it should be
>>>>>>> in the same time frame as racount(). How long does it take for racount()
>>>>>>> to read the file?
>>>>>>>
>>>>>>
>>>>>> I am running racount right now .. I will post results once it finishes.
>>>>>
>>>>> racount takes about 18min to run on the file:
>>>>>
>>>>> real 17m58.528s
>>>>> user 17m12.413s
>>>>> sys 2m0.332s
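
To put the gap in numbers, here is a rough back-of-the-envelope calculation in Python using figures from this thread. Several inputs are assumptions: "~125GB" and "~30GB" are taken as decimal gigabytes, "a couple of months" is taken as 60 days, and racount is assumed to have read the same ~125GB file. The ranonymize figure is output bytes written, not input bytes read, so the ratio is only an order-of-magnitude indication.

```python
# Figures quoted in this thread; see the assumptions in the paragraph above.
INPUT_BYTES = 125e9                     # "~125GB" input file
OUTPUT_BYTES = 30e9                     # "~30GB" written by ranonymize so far
RACOUNT_REAL_S = 17 * 60 + 58.528       # real 17m58.528s
RACOUNT_USER_S = 17 * 60 + 12.413       # user 17m12.413s
RACOUNT_SYS_S = 2 * 60 + 0.332          # sys 2m0.332s
ASSUMED_RANONYMIZE_DAYS = 60            # assumption for "a couple of months"

racount_rate = INPUT_BYTES / RACOUNT_REAL_S                           # bytes/s read
ranonymize_rate = OUTPUT_BYTES / (ASSUMED_RANONYMIZE_DAYS * 86400)    # bytes/s written
cpu_utilization = (RACOUNT_USER_S + RACOUNT_SYS_S) / RACOUNT_REAL_S

print(f"racount:    {racount_rate / 1e6:.1f} MB/s read")
print(f"ranonymize: {ranonymize_rate / 1e3:.1f} KB/s written")
print(f"racount CPU time / wall time: {cpu_utilization:.2f}")
```

Under these assumptions racount reads on the order of a hundred megabytes per second while ranonymize writes only a few kilobytes per second, a difference of roughly four orders of magnitude. That is far more than the extra per-record work would explain.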
>>>>>
>>>>> Kaustubh
>>>>>
>>>>>>> Just a rule of thumb. If a ra* program doesn't complete in a few minutes, you
>>>>>>> should stop it and try to figure out if there is a memory problem or not.
>>>>>>>
>>>>>>
>>>>>> Thanks, I'll keep this in mind :)
>>>>>>
>>>>>> Thanks,
>>>>>> Kaustubh
>>>>>>
>>>>>>> Carter
>>>>>>>
>>>>>>> On Sep 2, 2013, at 2:20 PM, Kaustubh Gadkari <kaustubh.gadkari at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a set of argus flow data captured at our data capture vantage point,
>>>>>>> and I want to anonymize the IP addresses (both source and destination) fully
>>>>>>> i.e. I want to replace both the addresses, using a prefix preserving
>>>>>>> technique. I have tried using ranonymize, but it is taking an extremely long
>>>>>>> time to anonymize the file (I started the process a couple of months ago, on
>>>>>>> a ~125GB file, and the output file size today is only ~30GB).
>>>>>>>
>>>>>>> Can anyone suggest the right way to go about anonymizing the data set I
>>>>>>> have? Is ranonymize the right tool for the job?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kaustubh
>>>>>>>
>>>>>>> --
>>>>>>> Kaustubh Gadkari
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Kaustubh Gadkari
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kaustubh Gadkari
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Kaustubh Gadkari
>>>
>>
>
> --
> Kaustubh Gadkari
> kaustubh at cs.colostate.edu
>