radark program design

Wed Oct 17 15:53:16 EDT 2007

Gentle people,
Here is a programmatic description of what radark does.
Hopefully this will be helpful.

Radark is a Perl script that takes as input a CIDR address describing
the local address space that is the target of the scan, plus options
that specify the input file (or stream) and any filter options.

Radark generates a directory, in the current working directory, and
creates a number of intermediate files that are used in the production
of the Scanner List for the time period of the data.  The directory name
is generated through hashing the parameters, so if you run the script
multiple times, you don't have to redo most of the processing.
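
To make the caching idea concrete, here is a minimal sketch in Python
(not radark's Perl).  The MD5 digest, the "radark." prefix and the
function name cache_dir_for are all assumptions made for illustration;
the description above does not say how the hash is actually computed.

```python
import hashlib
import os
import tempfile


def cache_dir_for(params, base="."):
    """Return a working directory whose name is derived from the parameters.

    Assumption: MD5 over the joined arguments; the real script may differ.
    """
    digest = hashlib.md5("\0".join(params).encode("utf-8")).hexdigest()
    path = os.path.join(base, "radark." + digest[:12])
    os.makedirs(path, exist_ok=True)  # an existing directory is simply reused
    return path


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    args = ["192.168.0.0/16", "-r", "argus.out"]
    # The same parameters always map to the same directory, so intermediate
    # files written on an earlier run can be found again.
    print(cache_dir_for(args, base) == cache_dir_for(args, base))
```

Changing any parameter gives a different directory, so results from
different runs do not get mixed up.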

Radark runs the standard ra(), racluster() and rafilteraddr() programs
multiple times to eventually generate a report of the host addresses
that have scanned addresses in the target address space.

The first task is to condition the data, using racluster() to
aggregate all the flow status records into single flow reports, and to
correct any direction-oriented semantics that we'll rely on later.

line 102
    `racluster -M norep -w $RADATA/racluster.out @arglist`;

Using this as a starting point, the next step is to generate the list of
active IP addresses in the target address space.  The definition of
active in this case is IP addresses that exchanged user data.  This is
straightforward:

line 130
    `racluster -M norep rmon -m smac saddr daddr -r $RADATA/racluster.out -w - - appbytes gt 0 | \
     racluster -m smac saddr -w $RADATA/lightnet.out - src net $localaddr and not dst net $localaddr and src pkts gt 0`;

The first call to racluster() generates the list of src and dst IP
addresses for flows that had some form of user data exchanged,
creating two entries for each src/dst IP pair: with -M rmon we remove
the src and dst semantics by creating entries for both A -> B and
B -> A.  The second call filters for src/dst IP pairs where the source
is inside the target address space and the dst is outside it.  This
means that the saddr holds internal IP addresses for relationships
with external machines that moved data.  With this as the input
stream, we aggregate on the saddr field, which gives us records that
have the aggregation stats for internal IP addresses, and we cache
that in "lightnet.out".

We do this so we can get the filter for picking flows that touch
non-active IP addresses.  It is trivial for the lightnet.out list to
be maintained over a long period of time.
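
The two racluster() passes that build lightnet.out can be modelled on
toy data.  This is only a sketch of the logic as described above, in
Python: the tuple layout (src, dst, src_pkts, dst_pkts, appbytes), the
function name and the sample addresses are made up for illustration,
and real argus records carry many more fields.

```python
import ipaddress
from collections import defaultdict


def active_local_hosts(flows, local_cidr):
    """Return {local address: packets sent} for hosts that moved user data."""
    net = ipaddress.ip_network(local_cidr)
    pkts_by_host = defaultdict(int)
    for src, dst, src_pkts, dst_pkts, appbytes in flows:
        if appbytes <= 0:
            continue  # "appbytes gt 0": only flows that carried user data
        # Model of -M rmon: consider the record once in each direction.
        for a, b, pkts in ((src, dst, src_pkts), (dst, src, dst_pkts)):
            a_inside = ipaddress.ip_address(a) in net
            b_inside = ipaddress.ip_address(b) in net
            # "src net $localaddr and not dst net $localaddr and src pkts gt 0"
            if a_inside and not b_inside and pkts > 0:
                pkts_by_host[a] += pkts  # aggregate on saddr
    return dict(pkts_by_host)


flows = [
    ("10.1.1.5", "8.8.8.8", 4, 3, 120),       # local client, data exchanged
    ("203.0.113.9", "10.1.1.7", 2, 2, 64),    # local server answered with data
    ("203.0.113.9", "10.1.1.200", 1, 0, 0),   # probe, no user data
    ("10.1.1.5", "10.1.1.7", 5, 5, 300),      # local to local, ignored
]
print(active_local_hosts(flows, "10.1.1.0/24"))
```

Note that 10.1.1.7 counts as active even though the outside host opened
the flow; that is what the direction-free rmon view buys us.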

So the next real step is to grab all the records that involve
non-active local machines; this will be the foundation for our scanner
list.

line 157
       `rafilteraddr -m daddr -vf $RADATA/lightnet.txt -R $RADATA/racluster.out -w - - not src net $localaddr and dst net $localaddr | \
        racluster -m smac saddr -w $RADATA/darkscanners.out`;

So this is pretty straightforward: rafilteraddr() takes the list of
active IP addresses (lightnet.txt) and, with the "-v" option, matches
records that don't involve these addresses in the daddr field
(-m daddr) from the complete set of conditioned data.  We use a
pre-filter for flows that originate from the outside and head to the
inside.  The output stream of this is run through racluster() to give
us the list of external IP addresses, which we write out in ASCII so
we can use it as a filter later.

The reason we use rafilteraddr() is that the list of addresses can be
huge, and building a standard command line filter for ra or racluster
generates poorly performing filters.  rafilteraddr() can handle
millions of addresses at a time.
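
The effect of the inverted address match can be sketched as follows.
This is a Python illustration of the selection logic only: the set
stands in for whatever address structure rafilteraddr() uses
internally (which is not described here), and the function names and
(src, dst) tuple layout are hypothetical.

```python
import ipaddress


def dark_touching_flows(flows, active_addrs, local_cidr):
    """Keep inbound flows whose daddr is NOT in the active list (the -v match)."""
    net = ipaddress.ip_network(local_cidr)
    active = set(active_addrs)  # membership test stays cheap for long lists
    selected = []
    for src, dst in flows:
        # Pre-filter: "not src net $localaddr and dst net $localaddr"
        inbound = (ipaddress.ip_address(src) not in net
                   and ipaddress.ip_address(dst) in net)
        if inbound and dst not in active:
            selected.append((src, dst))
    return selected


def dark_scanners(flows, active_addrs, local_cidr):
    """Aggregate the selected flows on saddr: one entry per external host."""
    hits = dark_touching_flows(flows, active_addrs, local_cidr)
    return sorted({src for src, _dst in hits})


flows = [
    ("203.0.113.9", "10.1.1.7"),      # touches an active host only here
    ("203.0.113.9", "10.1.1.200"),    # touches dark space
    ("198.51.100.4", "10.1.1.5"),     # ordinary client of an active host
    ("10.1.1.5", "8.8.8.8"),          # outbound, dropped by the pre-filter
    ("198.51.100.77", "10.1.1.201"),  # touches dark space
]
print(dark_scanners(flows, ["10.1.1.5", "10.1.1.7"], "10.1.1.0/24"))
```

A single lookup per record is what keeps this workable when the address
list is large, compared with one filter term per address.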

line 158
       `ra -L-1 -r $RADATA/darkscanners.out -s saddr > $RADATA/darkscanners.txt`;

The "-L-1" suppresses printing the column labels, which rafilteraddr()
doesn't really want to see.

OK, now it's really easy.

We know the external IP addresses that touch the dark space that we're
interested in, so we use this list to get the records of interest.  We
aggregate these so that the aggregation stats reflect the number of
hosts that each IP address touched.

line 168

       `rafilteraddr -m saddr -f $RADATA/darkscanners.txt -r $RADATA/racluster.out -w - | \
        racluster -M norep -m smac dmac saddr daddr -w - | \
        racluster -m smac saddr -w - | \
        rasort -m trans -w $RADATA/scanreport.out`;

So what are we doing here?
rafilteraddr() picks out the flows from the original conditioned data,
so we see all the flows these IP addresses were involved in.  This
stream of data is aggregated so that each src and dst address pair
becomes a single flow record (we also preserve the mac addresses, but
this is not necessary), and then we aggregate again, preserving only
the src IP address, which is the address of the scanner.  The
resulting aggregated stream has stats for the number of internal hosts
that each scanner touched, which is in the "trans" field.  So sort on
the "trans" field, and you generate the list of external IP addresses
sorted by the number of unique internal IP addresses that each one
attempted to touch.  Remember this is only for IP addresses that touch
a non-active address during the observation period.  Because we went
back to the original data, we also pick up the active hosts that each
scanner touched, so we get the complete list of addresses that
scanners accessed.
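
The two-stage aggregation is the part that makes "trans" mean "number
of distinct targets", so here is a small Python model of it.  The
function name and the (src, dst) tuples are hypothetical, and sorting
largest-first is an assumption about the order you would want in the
report.

```python
from collections import defaultdict


def scan_report(flows, scanners):
    """Return [(scanner, distinct targets)] ordered by target count."""
    scanner_set = set(scanners)
    # Stage 1 (-m saddr daddr): collapse repeated flows between the same
    # pair of hosts into one record per (saddr, daddr).
    pairs = {(src, dst) for src, dst in flows if src in scanner_set}
    # Stage 2 (-m saddr): merge the pair records per scanner; the number
    # of records merged plays the role of the "trans" field.
    trans = defaultdict(int)
    for src, _dst in pairs:
        trans[src] += 1
    # Sort on trans, largest first (ties broken by address for stability).
    return sorted(trans.items(), key=lambda item: (-item[1], item[0]))


flows = [
    ("203.0.113.9", "10.1.1.7"),
    ("203.0.113.9", "10.1.1.7"),      # repeat of the same pair, counted once
    ("203.0.113.9", "10.1.1.200"),
    ("203.0.113.9", "10.1.1.201"),
    ("198.51.100.77", "10.1.1.201"),
    ("192.0.2.1", "10.1.1.5"),        # not in the scanner list, ignored
]
for saddr, count in scan_report(flows, ["203.0.113.9", "198.51.100.77"]):
    print(saddr, count)
```

Without the first stage, a host that hammered one address a thousand
times would look like it touched a thousand hosts.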

From this final output, we generate the report; Perl is pretty good at
grabbing fields and moving them around.  The real information is
acquired using:

lines 193 and 197

    my @args = "ra -L-1 -r $RADATA/scanreport.out -s saddr dur trans -c , ";
    open(SESAME, "@args |");

Here we just print out the IP addresses, the duration of the
aggregates and the number of aggregations.  Because of how we did
this, the aggregation count equals the number of internal IP addresses
that a particular external address touched.
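
For anyone post-processing that output outside of Perl, reading the
comma-separated fields might look like the sketch below.  It assumes
one "saddr,dur,trans" line per scanner, as the -s and -c options above
suggest; the sample lines are invented, and the function name
parse_scan_report is hypothetical.

```python
import csv
import io


def parse_scan_report(stream):
    """Parse lines of "saddr,dur,trans" into (str, float, int) tuples."""
    rows = []
    for fields in csv.reader(stream):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        saddr, dur, trans = (field.strip() for field in fields)
        rows.append((saddr, float(dur), int(trans)))
    return rows


# Invented sample standing in for the output of the ra command.
sample = io.StringIO(
    "203.0.113.9,3599.2,214\n"
    "\n"
    "198.51.100.77,12.5,3\n"
)
for saddr, dur, trans in parse_scan_report(sample):
    print(f"{saddr} touched {trans} internal addresses over {dur} seconds")
```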

OK, hopefully this is helpful.  If anyone has any questions, please
holler.  This particular version of radark() is in the newest set of
clients that I'll put up later today.

Carter