Argus detecting historical APT1 activity #3 cont

Dave Edelman dedelman at iname.com
Tue Apr 9 23:15:15 EDT 2013


Carter,
 
I've had Argus set to collect the application byte count metrics, so I dug
through the data and compared the source/destination ratios using the
application byte counts and using the byte counts with the associated
overhead. I understand your point about a malicious actor gaming the
system, but would it make sense to calculate the ratio of the ratios,
(sappbytes/dappbytes)/(sbytes/dbytes), as an indicator of malicious
obfuscation? I haven't tried it yet, but if the workloads for a system
didn't change significantly, then I would expect that, over some window of
time, changes in this ratio would be worth investigating.
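
Something like the following (untested) Python sketch is what I have in
mind; the flow field names follow the argus metrics, and the 2.0 / 0.5
alert thresholds are just placeholders:

# Ratio-of-ratios: compare the application-byte balance of a flow to its
# wire-byte balance; values far from 1.0 suggest a payload/overhead mix
# worth a second look.  Field names follow argus conventions; thresholds
# are illustrative assumptions.

def ratio_of_ratios(flow):
    """(sappbytes/dappbytes) / (sbytes/dbytes), guarding division by zero."""
    try:
        app = flow["sappbytes"] / flow["dappbytes"]
        wire = flow["sbytes"] / flow["dbytes"]
        return app / wire
    except ZeroDivisionError:
        return float("inf")

def flag_obfuscation(flows, hi=2.0, lo=0.5):
    """Yield (flow, ratio) for flows whose balance deviates sharply."""
    for f in flows:
        r = ratio_of_ratios(f)
        if r > hi or r < lo:
            yield f, r

# A flow whose wire bytes dwarf its acknowledged payload stands out:
flows = [
    {"sbytes": 120000, "dbytes": 90000, "sappbytes": 100000, "dappbytes": 80000},
    {"sbytes": 500000, "dbytes": 90000, "sappbytes":  20000, "dappbytes": 80000},
]
for f, r in flag_obfuscation(flows):
    print(f"ratio-of-ratios {r:.3f}: {f}")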
 
The other idea that came to mind was determining the producer / consumer
characteristics of an entire subnet.  At least in our environment, subnets
rarely contain mixed populations, and a shift in roles would be very
unusual.
 
--Dave
 
From: argus-info-bounces+dedelman=iname.com at lists.andrew.cmu.edu On
Behalf Of Carter Bullard
Sent: Tuesday, April 02, 2013 5:09 PM
To: Argus
Subject: Re: [ARGUS] Argus detecting historical APT1 activity #3 cont
 
Gentle people,
To continue on the Argus and APT1 discussion, I had written that the
Mandiant APT1 document described two basic nodes in the APT1 attack
infrastructure, the Hop Point and the Target Attack Nodes.  I'm going to
continue to write about Hop Points for a few more emails, because having
one of your machines acting as an APT1 Hop Point is possibly the worst
thing that can happen to you in the APT1 attack life cycle.
 
I suggested that the best strategy for identifying APT1 Hop Points is to
use Time Series Analysis methods, specifically Transfer Function Models
and Intervention Analysis, to realize that a node has been transformed
(Identification), and to realize who, what, when, and how it was
transformed (Attribution).  Now, I'm pretty sure that most people are not
interested in a long discourse on how to use 2nd- and 3rd-order
differentials over different time periods to recognize trending
discontinuities.  This stuff is pretty complicated, and advanced even for
complex Time Series forecasting and control methods.  But that is the
kind of direction you want to go in if you want to apply Machine Learning
methods, or if you want to build unsupervised systems for network
behavioral anomaly detection, which would be a really cool thing to have.
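
Just to give a flavor of the differencing idea, here is a toy sketch (my
illustration only, not one of the actual analytics): take a time series
of hourly producer / consumer ratios, compute its 2nd-order differences,
and flag the points that deviate sharply.  The series and the 3-sigma
rule are made up for the example.

import statistics

# 2nd-order discrete difference: d2[i] = x[i] - 2*x[i-1] + x[i-2]
def second_diff(series):
    return [series[i] - 2 * series[i-1] + series[i-2]
            for i in range(2, len(series))]

# Indices whose 2nd difference deviates more than nsigma from the mean.
def discontinuities(series, nsigma=3.0):
    d2 = second_diff(series)
    mu, sd = statistics.mean(d2), statistics.stdev(d2)
    return [i + 2 for i, v in enumerate(d2) if abs(v - mu) > nsigma * sd]

# A stable consumer ratio (~0.1) that abruptly trends toward 1.0:
hourly_ratio = [0.1] * 48 + [0.3, 0.6, 0.9, 1.0, 1.0, 1.0]
print(discontinuities(hourly_ratio))    # flags the hours around the break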
 
In support of this APT1 Hop Point identification process, however, there
are more direct things you can look at that don't take a lot of math, and
can be done with simple, effective, reliable strategies that are easily
explained and understood.  Let's look briefly at one that should be
useful.
 
Most nodes that can be transformed into an APT1 Hop Point are either
predominantly consumers or producers of transport network data.  User
driven machines are generally transport service consumers (little
requests sent, big responses received), such as those seen in web
browsing and streaming video services.  Machine driven machines, such as
DNS, Web and Database servers, are generally network transport data
producers; they receive little requests and send bigger responses.  Even
in Peer-to-Peer networks, you see stable consumer / producer roles
emerge, where your node, the one you are paying for, generally becomes a
network data producer, providing services to a lot of machines you don't
know.  Peer-to-peer networks do present a challenge to this kind of
strategy, but not always.
 
When a node is transformed into an APT1 Hop Point service provider, and
the stepping stone function is active, the node becomes both a consumer
and a producer of the data it is transporting.  If it moves a large
amount of data, as indicated in the Mandiant report, the overall
producer / consumer properties of the affected node will move from
wherever they are toward 1.0 ... a balanced transport node.
 
Our job is to try to identify when a node is transformed into an APT1 Hop
Point, which means that it will go from whatever it was doing to being a
balanced producer / consumer, accepting data from an attack target and
relaying that data to the attacker.  If, historically, a node can be
determined to be predominantly a producer or consumer, then detecting
when it becomes a large scale balanced producer / consumer will be pretty
easy, as the deviation from its normal behavior will be pretty dramatic.
 
Now, producer / consumer metrics are not a measure of the packets (rate)
or total bytes (load) sent or received on the wire by a node.  Instead,
they are a measure of the transport bytes successfully received and sent.
 
Protocols like TCP generally present a balance of packets sent and received
on the wire.  After the 3-packet TCP setup handshake, one side sends data,
and the other sends TCP overhead ACKs, almost 1-for-1.  So we can't use
packet counts to indicate producer / consumer roles on the network.
 
The total bytes on the wire have a bit more asymmetry that can reveal
consumer and producer relationships, but the noise generated by the TCP
protocol overhead bytes can make the distinction a bit more difficult.
There are tricks, such as PUSHing one byte at a time through a TCP
connection, or reducing the allowable window size on a connection, that
can make the number of transported bytes per packet very small.
Reporting the actually ACK'd data, rather than the total data on the
wire, makes this type of analysis possible.
 
To measure the application bytes received or sent, argus needs to be
configured to generate the metric: set ARGUS_GENERATE_APPBYTE_METRIC=yes
in your /etc/argus.conf file.  Let's assume that your argus is monitoring
your enterprise border interface, so that you monitor all the traffic
going in and out of your site.  The resulting argus data will have the
information needed to determine all the producers and consumers of your
enterprise, i.e. those that are bringing data in and those that are
transporting data out (this is a starting point for developing formal
Transfer Function Models, by the way, when you get there).
 
A simple measure of the producer / consumer role is the ratio of
application data sent (produced) vs the application data received
(consumed).  Using argus data, you can calculate the metric on each
status record, on each aggregated flow, or on any of the various
aggregations that you can perform.  So it's trivial to calculate
"sappbytes / dappbytes", whether it's an instantaneous microflow or an
entire subnet's traffic aggregated into a single argus flow record.
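
As a sketch of the arithmetic (assuming the per-address aggregates have
been exported to a CSV file; the export step, the file name, and the
column layout are assumptions of the sketch, and the thresholds are the
ones I'll apply below):

import csv

# Label an address by its sappbytes/dappbytes ratio: > 1.5 Producer,
# < 0.95 Consumer, Balanced in between.  Thresholds per the walkthrough.
def role(sappbytes, dappbytes):
    if dappbytes == 0:
        return float("inf"), "Producer"
    ratio = sappbytes / dappbytes
    if ratio > 1.5:
        return ratio, "Producer"
    if ratio < 0.95:
        return ratio, "Consumer"
    return ratio, "Balanced"

# Columns assumed: saddr,sappbytes,dappbytes (one row per address).
with open("appbytes.csv") as f:
    for row in csv.DictReader(f):
        r, label = role(int(row["sappbytes"]), int(row["dappbytes"]))
        print(f'{row["saddr"]:>18} {r:12.4f}  {label}')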
 
To start a simple analysis, let's process a day's worth of data from a
single QoSient workstation and see what's up.  Let's measure the sent and
received application bytes of the top IP addresses seen, to assign simple
producer and consumer roles, and try to use those labels as guides, to
see what the trends are and how to interpret the data.  SPOILER: In this
set of data, there are no APT1 Hop Points, but....
 
Let's look at the IP addresses that the QoSient node 192.168.0.68 talked
to on April Fool's Day, 2013.  This node resides in the 192.168.0.0/24
network and is a basic workstation, using shared file systems, with
email, web browsing, automated software updates and cloud services.  What
nodes does this node talk to, outside or inside our own network, and are
they producers or consumers?

This node doesn't provide any services, so we expect all other nodes to
be producers, not consumers, and we expect the node to be a consumer of
network services.  Let's see, but to keep the email short, let's just
look at the top 10 nodes.
 
Grabbing an entire day's worth of data from the collection archive, let's
track individual IP addresses (so we'll use the " -M rmon " option),
preserving the protocol and ports used by each address.  We'll use this
first pass derived data as the starting data for the actual analysis,
which we'll generate to report each individual IP address's total src
application bytes and dst application bytes sent.  We'll take that data
and formulate the sappbytes/dappbytes ratio by hand for this exercise: if
the ratio is > 1.5 we'll label it a Producer, if the ratio is < 0.95
we'll label it a Consumer, and between these numbers we'll call the
transport Balanced.  We'll color the output, so Consumers are in red, and
Producers whose ratios are HUGE we'll color blue.

Let's look at the top 10 SrcAppByte generators, to see how this might
work.  Here we go....
 
 
thoth:01 carter$ racluster -R /archive/192.168.0.68/2013/04/01 -m saddr proto sport -w /tmp/argus.out - ipv4
thoth:01 carter$ racluster -r /tmp/argus.out -m saddr -w - | rasort -m sappbytes \
                     -s stime dur saddr proto sport sappbytes dappbytes -N 10
                 StartTime        Dur            SrcAddr  Proto  Sport    SAppBytes    DAppBytes         Ratio
2013/04/01.00:00:00.847207 86399.101*       192.168.0.66     ip            69805178      1339356       52.1185  Producer
2013/04/01.15:54:08.964340  25.124109      208.59.201.94    tcp   http     27104415          120   225870.1250  Producer
2013/04/01.00:01:16.133367 86285.734*        66.39.3.162    tcp  imaps     12816471      1012491       12.6584  Producer
2013/04/01.00:00:00.847207 86399.101*       192.168.0.68     ip            11872196    120391392        0.0986  Consumer
2013/04/01.17:17:37.184721 528.364441       171.67.72.17    tcp    ssh      4347072        50746       85.6633  Producer
2013/04/01.00:02:51.660475 85447.757*      17.172.208.43    tcp  https      2103142       430417        4.8863  Producer
2013/04/01.15:55:58.919139 28921.785*       192.168.0.78     ip             1399179      7702570        0.1817  Consumer
2013/04/01.09:47:16.282091 43205.253*        17.154.65.1    tcp  https       472376        20531       23.0079  Producer
2013/04/01.00:05:42.767984 85586.210*      192.168.0.127     ip              461937            0           Inf  Producer
2013/04/01.00:29:54.738518 81412.937*      173.194.43.33    tcp      *       413487        18616       22.2114  Producer
 
 
Basically, what this data is saying is that, of the top 10 addresses
sending data on April Fool's Day, most are producers, just as we
expected.  And the workstation itself, 192.168.0.68, is a consumer (the
first line labeled Consumer), with a sent/recv'd ratio of 0.0986.  We've
got some really HUGE producers, which indicates purely one-way transfers,
the kind we're looking for.
 
In this data we're looking for APT1 relay data candidates: large data
transfers from a remote site to an internal node that are then relayed to
another external node, possibly Chinese, possibly not, in real time.
 
None of the producers are sending enough data to represent a LARGE
exfiltration of data,
one of the definitions of being an APT1 Hop Point.  But LARGE is a relative
term, so we
need to analyze any potential APT1 traffic candidate.
 
From the first remote address in the list, 208.59.201.94, our largest
remote producer, we received 27MB of data.  The sent / recv'd ratio of
225,870 is just what you would expect from a large transfer of data into
your infrastructure, and is a good candidate for APT1 style stepping
stone data influx.  Even though it's using HTTP as the protocol, we
should assume the transport technique to be somewhat clever, so whether
it's HTTP, SSH, or a mix of protocols is notable, but potentially
insignificant.
 
For the purposes of this dialog, to identify this flow as part of an APT1
Hop Point action, we need to find an outflow from our workstation that
would transfer the data received from this remote node to another node.
In a simple APT1 Hop Point, our workstation would want to transfer the
27MB to a remote address, which we don't see in the top 10.  Whew !!!
If it's a simple relay, you would expect the outgoing flow to sort
closely to the flow of interest, as they would both be transporting the
same amount of application data.
 
In a slight variation of the basic APT1 Infrastructure, the Hop Point may
relay the exfiltrated
data to another internal node.  Our simple report indicates that the
workstation doesn't
transmit 27MB to any single node, either external or internal.  
 
And in the most complex relay models that could be implemented, where
multiple
endpoints receive portions of the exfiltrated data, our node still does not
look to be
an APT1 Hop Point.  By looking at the entry for 192.168.0.68, our
workstation, we see
that we don't actually send 27MB of total data out of the node for the whole
day !!!
 
                 StartTime        Dur            SrcAddr  Proto  Sport    SAppBytes    DAppBytes         Ratio
2013/04/01.00:00:00.847207 86399.101*       192.168.0.68     ip            11872196    120391392        0.0986  Consumer
 
As you can see, we only sent (SAppBytes) 11.8MB total, to all our
transport endpoints combined, for all of April 1, 2013.  So our candidate
27MB flow does not look to be relayed.  Now in the original data, there
are about 10K individual flows that could be candidates, but the
aggregate analysis generated only a few hundred candidate IP addresses.
 
An automated system would iterate through all potential candidate transfers
and attempt to
find candidate outflows that could support the relay concept.  That would be
the most elegant
of analytics, and not that expensive, if you have a good aggregation model
and
analytic framework.
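
As a sketch of what that iteration could look like (the record layout,
the 10% volume tolerance, and the 1MB floor are all assumptions made for
illustration):

# For every LARGE inflow to an internal node, look for an outflow from
# that node of comparable application-byte volume in an overlapping time
# window.  Records are assumed to carry argus-style stime/ltime fields.
def relay_candidates(inflows, outflows, tol=0.10, floor=1_000_000):
    pairs = []
    for i in inflows:                     # remote saddr -> internal daddr
        if i["sappbytes"] < floor:        # too small to be a LARGE transfer
            continue
        for o in outflows:                # internal saddr -> remote daddr
            if o["saddr"] != i["daddr"]:
                continue                  # must leave the same internal node
            close = abs(o["sappbytes"] - i["sappbytes"]) <= tol * i["sappbytes"]
            overlap = o["stime"] <= i["ltime"] and i["stime"] <= o["ltime"]
            if close and overlap:
                pairs.append((i, o))
    return pairs

# e.g. the 27MB inflow from 208.59.201.94 would pair with any ~27MB
# outflow from 192.168.0.68 seen in an overlapping window.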
 
Now, just looking at an arbitrary day by itself, you can get some
assurance that you aren't supporting an APT1 Hop Point type of relay
service.  But the strength of argus based network activity auditing is
that you have historical data that can support the development of hourly
producer / consumer metrics for every IP address in your archive, which
could be abstractly called a Transfer Function Model for all the assets
in your observable domain.
 
I have done this type of analytic for our workstation over the last 2 years,
and this
workstation has maintained the 0.09 sent vs received application byte ratio
for
almost every day.  It has never gone over 0.17.  So this would be a great
candidate
machine for this type of analysis.  While it may receive a lot of data from
the outside,
it doesn't transfer a lot of data.  And if it did, it would be very easy to
know it.
 
So, all I need is the sent / recv'd ratio for all the endpoints in my
enterprise, and if they have had stable ratios that are >> 10 or << 0.1,
indicating that they are stable producers and/or consumers, then
detecting a significant shift toward 1.0, a balanced consumer / producer
role, is pretty easy.  If you think that the change is significant, then
you can go through the original flow data, looking at the sappbytes and
dappbytes metrics to figure out what happened.  You're looking for new
producer roles for your consumers and new consumer roles for your
producers, that are contributing to the ratio moving toward 1.0.
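
A sketch of that endpoint test (the shape of the history and the "near
1.0" balance band are assumptions of the sketch):

# A node is "stable" if its daily ratio history stayed entirely above
# 10 (producer) or entirely below 0.1 (consumer); alert when a stable
# node's ratio lands near 1.0.
def is_stable(history, hi=10.0, lo=0.1):
    return all(r > hi for r in history) or all(r < lo for r in history)

def shifted_to_balanced(history, today, band=(0.5, 2.0)):
    return is_stable(history) and band[0] <= today <= band[1]

# Our workstation: ~0.09 sent/recv'd almost every day for two years.
history = [0.09, 0.08, 0.09, 0.07, 0.09]
print(shifted_to_balanced(history, today=0.10))  # False: still a consumer
print(shifted_to_balanced(history, today=0.98))  # True: worth investigating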
 
It's a system that can work for the majority of the nodes in your
enterprise.  For the ones that it doesn't, there are more complex
analytics that can be used, but that's enough for a single piece of
email.
 
Reactions, opinions, attitude and flames welcome, 
 
Hope all is most excellent,
 
Carter
 
 
 
On Mar 27, 2013, at 12:09 PM, Carter Bullard <carter at qosient.com> wrote:



Gentle people,
To continue on the Argus and APT1 discussion, I had written that the
Mandiant APT1 document described two basic nodes in the APT1 attack
infrastructure, the Hop Point and the Target Attack Nodes.  I'm going to
continue to write about Hop Points for a few more emails, because having
one of your machines acting as an APT1 Hop Point is possibly the worst
thing that can happen in the APT1 attack life cycle.
 
So far, I've presented that Mandiant's report gives us a lot of detail,
trends and methods that allow us to detect overt APT1 behavior using
argus data.  Trends such as APT1's establishment and use of a well
defined attack infrastructure, the tendency to access that infrastructure
directly, from well defined IP address blocks, using specific access
methods, and a good description of the attacker's intent: exfiltration of
large amounts of data.  These trends lead to a set of very simple tests
for APT1 activity that can be run against argus data archives to help you
realize if you've been had, or not.
 
The APT1 strategies that Mandiant describes are conventional, and the
attack infrastructure itself is simple, direct, almost optimal (minimal
reliable methods, 2-3 hops from attacker to target), suggesting that the
infrastructure has predictable utility, i.e. it may actually work at
scale, and work well enough to be worth the effort.  The ultimate
simplicity of the realized APT1 infrastructure may be the result of a
limit in Mandiant's detection capability (you can only see what you are
looking for), but there is no question that what they describe is real.
 
While Mandiant is very detailed in what it does talk about, there are
huge gaps in what it doesn't talk about.  I'd like to dive deeper into
APT1 Hop Point identification, but we're lacking key information.  What
kind of systems does APT1 use for Hop Points?  Linux workstations?
Windows XP machines?  Web Servers?  Android devices?  Routers?  While we
have some really great patterns to look for, like specific SSH
certificates, there are so many things we don't have: initial penetration
techniques, command and control methods, beaconing patterns, persistent
vs dynamic access.
 
In the absence of real detail, we'll have to develop general strategies
for detection, and if we want to have any success, we'll need to avoid
awareness / detection system pitfalls, such as sampling, sampling bias
(looking only at one protocol or one type of OS), and matching
complexity.
 
One of the simple characteristics that I will try to leverage in my
discussions is the intent of the APT1 attack, and the goal of the APT1
Hop Point: to move a lot of data from a remote site to another remote
site.  If that really is the singular attack goal for APT1, then with
good argus data generation and analytics, we should be able to find any
node that is acting as an APT1 Hop Point, as well as any other APTx Hop
Points that may exist.
 
The approach that I will try to describe in the next set of emails is one
based on a Bell-LaPadula style of analysis, to find nodes that have been
transformed from being one type of network based node to another type of
network node; in the case of APT1, one that is supporting a demanding
network based transport service.
 
I'm going to use Time Series Analysis methods, specifically Transfer
Function Models and Intervention Analysis, to realize that a node is
doing something different.  Transfer Function Models are perfect for
this, as they are generally used to describe input / output dynamic
system response, and Intervention Analysis is all about the notion that
there is an event that motivates a dynamic change in system input /
output.  So I'm going to try to use this strategy to identify a change in
input / output, and then to try to find the event that correlates with
the change.
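
As a toy illustration of the intervention idea (far simpler than a real
Box-Jenkins intervention model, and only meant to convey the shape of the
analysis): test each day as a candidate step change in a ratio series,
and keep the split that most reduces the squared error.

# Find the single step change (intervention) that best explains a series.
def best_intervention(series):
    n = len(series)
    mean = sum(series) / n
    sse0 = sum((x - mean) ** 2 for x in series)   # no-intervention error
    best = (None, 0.0)
    for t in range(1, n):
        pre, post = series[:t], series[t:]
        m1, m2 = sum(pre) / len(pre), sum(post) / len(post)
        sse = sum((x - m1) ** 2 for x in pre) + \
              sum((x - m2) ** 2 for x in post)
        if sse0 - sse > best[1]:
            best = (t, sse0 - sse)
    return best

daily_ratio = [0.09] * 30 + [0.95] * 5    # node transformed on day 30
print(best_intervention(daily_ratio))     # -> (30, ...)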
 
If you can imagine that there is an argus running on every node in an
infrastructure, establishing a generalized network activity audit that
goes back quite a ways, then we should have a very rich set of data to
perform this type of analysis, either automated or by hand.  The goal
will be to realize that a node went from being a specific type of
producer / consumer to a different kind of producer / consumer, over some
period of time.
 
OK, that is going to be my strategy.  Are there any other approaches that
seem appropriate?  More to come.
 
Hope all is most excellent,
 
Carter
 
 