Argus detecting historical APT1 activity #3 cont

Craig Merchant cmerchant at responsys.com
Fri Apr 12 17:56:56 EDT 2013


Carter, thank you very much for writing such thorough and thoughtful emails on this subject.  I've shared them with the entire security team at work.

I've been thinking a bit about how Argus could be used to do more proactive alerting on these kinds of threats.  As you know, I'm not a programmer *at all*.  I had a couple of ideas and was wondering what it would take to implement them.

How difficult is it to reassemble a file from payload data?  I'm imagining being able to reassemble executable files from HTTP connections to external hosts, generating an MD5 of that file, and then using dig to query the malware database from:

http://www.team-cymru.org/Services/MHR/#dns
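A minimal sketch of that lookup, assuming the file has already been carved out of the payload by some other tool (the filename and stand-in contents here are hypothetical); the Team Cymru MHR answers DNS TXT queries named `<md5>.malware.hash.cymru.com`:

```shell
# Hedged sketch: 'carved.exe' is a stand-in for a file reassembled from
# payload data elsewhere; here we just create a dummy file to hash.
printf 'MZ' > carved.exe
hash=$(md5sum carved.exe | awk '{print $1}')   # MD5 of the carved file
query="$hash.malware.hash.cymru.com"           # MHR DNS lookup name
echo "$query"
# dig +short "$query" TXT    # uncomment to actually query the MHR over DNS
```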

For APT threats that don't drop a file and write directly to some vulnerable part of memory, what kind of payload data would be indicative of that kind of attack?  Looking through some of our Snort rules, I see patterns that look like:

Heapspray example:
|5C|x0c|5C|x0c|5C|x0c|5C|x0c|5C|x0c|5C|x0c|5C|x0c|5C|x0c|

Shellcode examples:
|90 90 90 E8 C0 FF FF FF|/bin/sh
|6a 43 59 e8 ff ff ff ff c1 5e 30 4c 0e 07 e2 fa 6b 63 5b 9d|
\x00\x00\x00[\x00\x01].{4}\x00\x01\x00\x01lanattacks_(start_dhcp|reset_dhcp|set_dhcp_option|stop_dhcp|dhcp_log|start_tftp|reset_tftp|add_tftp_file|stop_tftp)

How would you use Argus to detect those types of attacks without generating a ton of false positives?  It doesn't seem like the types of patterns above should be all that common in normal HTTP traffic.

How much of the payload would Argus need to look at to reliably identify those types of attacks?
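As a rough illustration of what matching those patterns against captured payload could look like: argus can retain the first bytes of user data (the `ARGUS_CAPTURE_DATA_LEN` option in argus.conf), which could then be searched. The sketch below only emulates that search with `grep -P` against a stand-in payload file; the bytes are the NOP-sled-plus-CALL shellcode example from above, written as octal escapes:

```shell
# Stand-in payload: \220 = 0x90 (NOP), \350 = 0xE8 (CALL), etc.
printf '\220\220\220\350\300\377\377\377/bin/sh' > payload.bin
# Match three NOPs followed by a CALL opcode, as in the Snort examples:
if grep -qP '\x90{3}\xe8' payload.bin; then
    echo "shellcode-like pattern found"
fi
```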

Thanks.

Craig

From: argus-info-bounces+cmerchant=responsys.com at lists.andrew.cmu.edu [mailto:argus-info-bounces+cmerchant=responsys.com at lists.andrew.cmu.edu] On Behalf Of Carter Bullard
Sent: Tuesday, April 02, 2013 2:09 PM
To: Argus
Subject: Re: [ARGUS] Argus detecting historical APT1 activity #3 cont

Gentle people,
To continue on the Argus and APT1 discussion, I had written that the Mandiant
APT1 document described two basic nodes in the APT1 attack infrastructure,
the Hop Point and the Target Attack Nodes.  I'm going to continue to write about
Hop Points for a few more emails, because having one of your machines acting
as an APT1 Hop Point is possibly the worst thing that can happen to you in the
APT1 attack life cycle.

I suggested that the best strategy for identifying APT1 Hop Points is to use Time
Series Analysis methods, specifically Transfer Function Models, and Intervention
Analysis, to realize that a node has been transformed (Identification), and to
realize who, what, when, and how it was transformed (Attribution).  Now, I'm pretty
sure that most people are not interested in a long discourse on how to use
2nd and 3rd order differentials over different time periods, to recognize
trending discontinuities.  This stuff is pretty complicated, and advanced even
for complex Time Series forecasting and control methods.  But that is the
kind of direction you want to go in if you want to do Machine Learning methods,
or if you want to do unsupervised systems for network behavioral anomaly
detection, which would be a really cool thing to have.

In support of this APT1 Hop Point identification process, however, there are more
direct things you can look at, that don't take a lot of math, and can be done with
simple, effective, reliable strategies that are easily explained and understood.
Let's look briefly at one that should be useful.

Most nodes that can be transformed into an APT1 Hop Point are either predominantly
consumers or producers of transport network data.  User driven machines are
generally transport service consumers, little requests sent, big responses received,
such as those seen in web browsing and streaming video services.  Machine driven
machines, such as DNS, Web and Database servers, are generally network
transport data producers, they receive little requests and send bigger responses.
Even in Peer-to-Peer networks, you see stable consumer / producer roles emerge,
where your node, the one you are paying for, generally becomes a network data
producer, providing services to a lot of machines you don't know.   Peer-to-peer
networks do present a challenge to this kind of strategy, but not always.

When a node is transformed to an APT1 Hop Point service provider, and the
stepping stone function is active, a node will become both a consumer and
a producer of the data it is transporting.  If it moves a large amount of data,
as indicated in the Mandiant report, the overall producer / consumer properties
of the affected node will move from wherever they are, toward 1.0, ...
a balanced transport node.

Our job is to try to identify when a node is transformed to an APT1 Hop Point,
which means that it will go from whatever it was doing, to being a balanced
producer / consumer, accepting data from an attack target, and relaying that
data to the attacker.  If, historically, a node can be determined to be predominantly
a producer or consumer, then detecting when it becomes a large scale
balanced producer / consumer, will be pretty easy, as the deviation from its
normal behavior will be pretty dramatic.

Now, producer / consumer metrics are not a measure of the packets (rate)
or total bytes (load) sent or received on the wire by a node.  Instead, it's a
measure of the transport bytes successfully received and sent.

Protocols like TCP generally present a balance of packets sent and received
on the wire.  After the 3-packet TCP setup handshake, one side sends data,
and the other sends TCP overhead ACKs, almost 1-for-1.  So we can't use
packet counts to indicate producer / consumer roles on the network.

The total byte count on the wire has a bit more asymmetry that can reveal
consumer and producer relationships,  but the noise generated by the TCP
protocol overhead bytes can make the distinction a bit more difficult.
There are tricks, such as PUSHing one byte at a time through a TCP
connection, or reducing the allowable window size on a connection, that
can make the number of transported bytes per packet very small.
Reporting the actually ACK'd data, rather than the total data on the wire,
makes this type of analysis possible.
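To make the one-byte-PUSH point concrete, a toy calculation (header sizes assumed: 14 Ethernet + 20 IP + 20 TCP = 54 bytes per packet): with one payload byte per packet, wire-byte counts look nearly balanced even though all the application data flows one way.

```shell
# Toy arithmetic for the 1-byte-PUSH trick; all numbers are assumptions.
awk 'BEGIN {
  hdr = 54; pkts = 100
  src_wire = pkts * (hdr + 1)   # producer side: 1 app byte per packet
  dst_wire = pkts * hdr         # consumer side: pure ACKs, 0 app bytes
  printf "wire ratio %.2f, app bytes %d vs %d\n",
         src_wire / dst_wire, pkts * 1, 0 }'
```

The wire ratio comes out near 1.0 while the application-byte ratio is infinite, which is why the ACK'd-data metric is the one worth measuring.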

To measure the application bytes received or sent, argus needs to be configured
to generate the metric. Set ARGUS_GENERATE_APPBYTE_METRIC=yes in your
/etc/argus.conf file.  Let's assume that your argus is monitoring your enterprise
border interface, so that you monitor all the traffic going in and out of your site.
The resulting argus data will have the information needed to determine all the
producers and consumers of your enterprise, i.e. those that are bringing data in
and those that are transporting data out (this is a starting point for developing
formal Transfer Function Models, by the way, when you get there).

A simple measure of the producer / consumer role is the ratio of application
data sent (produced) vs the application data received (consumed).  Using argus
data, you can calculate the metric on each status record, on each aggregated
flow, or on any of the various aggregations that you can perform.  So it's trivial to
calculate the  "sappbytes / dappbytes" ratio, whether it's an instantaneous microflow,
or an entire subnet's traffic aggregated into a single argus flow record.
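As a sketch of that calculation and the producer / consumer labeling, here is the classification applied with awk to stand-in `saddr sappbytes dappbytes` rows (the values echo the report further down; the thresholds, > 1.5 Producer, < 0.95 Consumer, otherwise Balanced, are the ones used in the exercise below, and the input would normally come from a racluster pipeline rather than printf):

```shell
# Stand-in input rows: saddr sappbytes dappbytes.  In practice, feed
# awk from something like:
#   racluster -r argus.out -m saddr -s saddr sappbytes dappbytes
printf '%s\n' \
  '208.59.201.94 27104415 120' \
  '192.168.0.68 11872196 120391392' \
  '192.168.0.127 461937 0' |
awk '{ if ($3 == 0) role = "Producer"            # infinite ratio case
       else { r = $2 / $3
              role = (r > 1.5) ? "Producer" : (r < 0.95) ? "Consumer" : "Balanced" }
       print $1, role }'
```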

To start a simple analysis, let's process a day's worth of data from a single QoSient
workstation and see what's up.  Let's measure the sent and received application
bytes of the top IP addresses seen, to assign simple producer and consumer roles,
and try to use those labels as guides, to see what the trends are, and how
to interpret the data. SPOILER: In this set of data, there are no APT1
Hop Points, but....

Let's look at the IP addresses that the QoSient node 192.168.0.68 talked to, on
April Fool's Day, 2013. This node resides in the 192.168.0.0/24 network, and is
a basic workstation, using shared file systems, with email, web browsing, automated
software updates and cloud services.  What nodes does this node talk to, outside
or inside our own network, and are they producers or consumers ?

This node doesn't provide any services, so we expect all other nodes to be producers,
not consumers, and we expect the node to be a consumer of network services.  Let's
see, but to keep the email short, let's just look at the top 10 nodes.

Grabbing an entire day's worth of data from the collection archive, let's track
individual IP addresses (so we'll use the " -M rmon " option), preserving the protocol
and ports used by each address.  We'll use this first-pass derived data as the starting
data for the actual analysis, which we'll generate to report each IP address's total
src application bytes and dst application bytes sent.  We'll take that data, and formulate
the sappbytes/dappbytes ratio, by hand for this exercise, and if the ratio is > 1.5 then
we'll label it as a Producer, if the ratio is < 0.95, we'll label it a Consumer, and between
these numbers, we'll call the transport Balanced.  We'll color the output, so Consumers
are in red, and Producers whose ratios are HUGE, we'll color blue.

Let's look at the top 10 SrcAppByte generators, to see how this might work.
Here we go....


thoth:01 carter$  racluster -R /archive/192.168.0.68/2013/04/01 -m saddr proto sport -w /tmp/argus.out - ipv4
thoth:01 carter$ racluster -r /tmp/argus.out -m saddr -w - |  rasort -m sappbytes \
                       -s stime dur saddr proto sport sappbytes dappbytes -N 10
                 StartTime        Dur            SrcAddr  Proto  Sport    SAppBytes    DAppBytes         Ratio
2013/04/01.00:00:00.847207 86399.101*       192.168.0.66     ip            69805178      1339356       52.1185  Producer
2013/04/01.15:54:08.964340  25.124109      208.59.201.94    tcp http       27104415          120   225870.1250  Producer
2013/04/01.00:01:16.133367 86285.734*        66.39.3.162    tcp imaps      12816471      1012491       12.6584  Producer
2013/04/01.00:00:00.847207 86399.101*       192.168.0.68     ip            11872196    120391392        0.0986  Consumer
2013/04/01.17:17:37.184721 528.364441       171.67.72.17    tcp ssh         4347072        50746       85.6633  Producer
2013/04/01.00:02:51.660475 85447.757*      17.172.208.43    tcp https       2103142       430417        4.8863  Producer
2013/04/01.15:55:58.919139 28921.785*       192.168.0.78     ip             1399179      7702570        0.1817  Consumer
2013/04/01.09:47:16.282091 43205.253*        17.154.65.1    tcp https        472376        20531       23.0079  Producer
2013/04/01.00:05:42.767984 85586.210*      192.168.0.127     ip              461937            0           Inf  Producer
2013/04/01.00:29:54.738518 81412.937*      173.194.43.33    tcp *            413487        18616       22.2114  Producer


Basically, what this data says is that, of the top 10 addresses sending data on April Fool's Day,
most are producers, just as we expected.  And the workstation itself, 192.168.0.68, is
a consumer (first line in red), with a sent/recv'd ratio of 0.0986.  We've got some really
HUGE producers, which indicates purely one-way transfers, the kind we're looking for.

In this data we're looking for APT1 relay data candidates: large data transfers from a
remote site to an internal node, that are then relayed to another external node, possibly
Chinese, possibly not, in real time.

None of the producers are sending enough data to represent a LARGE exfiltration of data,
one of the definitions of being an APT1 Hop Point.  But LARGE is a relative term, so we
need to analyze any potential APT1 traffic candidate.

From the first remote address in the list, 208.59.201.94, our largest remote producer, we
received 27MB of data.  The sent / recv'd ratio of 225,870 is just what you would expect
from a large transfer of data into your infrastructure, and is a good candidate for APT1
style stepping stone data influx.  Even though it's using HTTP as the protocol, we should
assume the transport technique to be somewhat clever, so whether it's HTTP, SSH, or
a mix of protocols is notable, but potentially insignificant.

For the purposes of this dialog, to identify this flow as a part of an APT1 Hop Point action,
we need to find an outflow from our workstation that would transfer the data received from this
remote node to another node.  In a simple APT1 Hop Point, our workstation would want to
transfer the 27MB to a remote address, which we don't see in the top 10.  Whew !!!
If it's a simple relay, you would expect the outgoing flow to sort closely to the flow of
interest, as they would both be transporting the same amount of application data.

In a slight variation of the basic APT1 Infrastructure, the Hop Point may relay the exfiltrated
data to another internal node.  Our simple report indicates that the workstation doesn't
transmit 27MB to any single node, either external or internal.

And in the most complex relay models that could be implemented, where multiple
endpoints receive portions of the exfiltrated data, our node still does not look to be
an APT1 Hop Point.  By looking at the entry for 192.168.0.68, our workstation, we see
that we don't actually send 27MB of total data out of the node for the whole day !!!

                 StartTime        Dur            SrcAddr  Proto  Sport    SAppBytes    DAppBytes         Ratio
2013/04/01.00:00:00.847207 86399.101*       192.168.0.68     ip            11872196    120391392        0.0986  Consumer

As you can see, we only sent (SAppBytes) 11.8MB total, to all our transport endpoints
combined, for all of April 1, 2013.  So our candidate 27MB flow, does not look to be relayed.
Now in the original data, there are about 10K individual flows that could be candidates,
but the aggregate analysis generated only a few hundred candidate IP addresses.

An automated system would iterate through all potential candidate transfers and attempt to
find candidate outflows that could support the relay concept.  That would be the most elegant
of analytics, and not that expensive, if you have a good aggregation model and
analytic framework.
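One way such an iteration could be sketched: pair each large inflow with any outflow of comparable application-byte volume (within 10% here, an arbitrary tolerance), which is what a simple Hop Point relay would show. The addresses and byte counts below are stand-ins:

```shell
# Stand-in flow summary: direction, remote address, appbytes.
cat > flows.txt <<'EOF'
in  208.59.201.94 27104415
out 10.0.0.5        461937
out 198.51.100.7  26950000
EOF
awk '$1 == "in"  { inb[$2]  = $3 }
     $1 == "out" { outb[$2] = $3 }
     END { for (i in inb) for (o in outb) {
             d = inb[i] - outb[o]; if (d < 0) d = -d
             if (d / inb[i] < 0.10)     # volumes within 10% of each other
               print "candidate relay:", i, "->", o } }' flows.txt
```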

Now, just looking at an arbitrary day, by itself, you can get some assurance that you aren't
supporting an APT1 Hop Point type of relay service.  But the strength of argus based network
activity auditing is that you have historical data that can support the development of
hourly producer / consumer metrics for every IP address in your archive, which could
be abstractly called a Transfer Function Model for all the assets in your observable
domain.

I have done this type of analytic for our workstation over the last 2 years, and this
workstation has maintained the 0.09 sent vs received application byte ratio for
almost every day.  It has never gone over 0.17.  So this would be a great candidate
machine for this type of analysis.  While it may receive a lot of data from the outside,
it doesn't transfer a lot of data.  And if it did, it would be very easy to know it.

So, all I need is the sent / recv'd ratio for all the end points in my enterprise, and
if they have had stable ratios that are >> 10 or << 0.1, indicating that they are stable
producers and/or consumers, then detecting a significant shift toward 1.0, a balanced
consumer / producer role, is pretty easy.  If you think that the change is significant,
then you can go through the original flow data, looking at sappbytes and dappbytes
metrics to figure out what happened.  You're looking for new producer roles for your
consumers and new consumer roles for your producers, that are contributing
to the ratio moving toward 1.0.
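A minimal sketch of that check, with made-up daily ratios: establish a baseline from the first few days, then flag any day that jumps from a strongly skewed history toward the balanced region (the 0.1 and 0.5 thresholds here are illustrative, not prescribed):

```shell
# One sent/recv'd appbyte ratio per day for one host (made-up series);
# days 1-4 form the baseline, day 5 simulates a shift toward 1.0.
printf '%s\n' 0.09 0.11 0.08 0.10 0.97 0.09 |
awk 'NR <= 4 { base += $1; n++; next }      # accumulate baseline days
     { mean = base / n                      # historical mean ratio
       if (mean < 0.1 && $1 > 0.5)          # skewed history, balanced today
         print "day " NR ": ratio " $1 " shifted toward balanced" }'
```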

It's a system that can work for the majority of the nodes in your enterprise.  For
the ones that it doesn't, there are more complex analytics that can be used, but
that's enough for a single piece of email.

Reactions, opinions, attitude and flames welcome,

Hope all is most excellent,

Carter



On Mar 27, 2013, at 12:09 PM, Carter Bullard <carter at qosient.com> wrote:


Gentle people,
To continue on the Argus and APT1 discussion, I had written that the Mandiant
APT1 document described two basic nodes in the APT1 attack infrastructure,
the Hop Point and the Target Attack Nodes.  I'm going to continue to write about
Hop Points for a few more emails, because having one of your machines acting
as an APT1 Hop Point is possibly the worst thing that can happen in the APT1
attack life cycle.

So far, I've presented that Mandiant's report gives us a lot of detail, trends and
methods, that allow us to detect overt APT1 behavior using the argus data.  Trends
such as APT1's establishment and use of well defined attack infrastructure and
the tendency to access that infrastructure directly, from well defined IP address
blocks, using specific access methods, and a good description of the attackers
intent, exfiltration of large amounts of data.  These trends lead to a set of very
simple tests for APT1 activity, that can be tested against argus data archives
to help you realize if you've been had, or not.

The APT1 strategies that Mandiant describes are conventional, and the attack
infrastructure itself is simple, direct, almost optimal (minimal reliable methods,
2-3 hops from attacker to target), suggesting that the infrastructure has
predictable utility, i.e. it may actually work to scale, and work well enough to
be worth the effort.  The ultimate simplicity of the realized APT1 infrastructure,
may be the result of a limit in Mandiant's detection capability ( you can only
see what you are looking for ), but there is no question that what they describe
is real.

While Mandiant is very detailed in what it does talk about, there are huge
gaps in what it doesn't talk about.  I'd like to dive deeper into APT1 Hop Point
identification, but we're lacking key information.  What kind of systems does
APT1 use for Hop Points? Linux workstations ? Windows XP machines ?
Web Servers ?  Android devices ?  Routers ?  While we have some really
great patterns to look for, like specific SSH certificates, there are so many
things we don't have: initial penetration techniques, command and control
methods, beaconing patterns, persistent vs dynamic access.

In the absence of real detail, we'll have to develop general strategies for
detection, and if we want to have any success, we'll need to avoid
awareness / detection system pitfalls, such as sampling, and sampling bias
(looking only at one protocol or one type of OS), and matching complexity.

One of the simple characteristics that I will try to leverage in my discussions is
the intent of the APT1 attack, and the goal of the APT1 Hop Point: to move
a lot of data, from a remote site to another remote site.  If that really is the
singular attack goal for APT1, then with good argus data generation and
analytics, we should be able to find any node that is acting as an APT1
Hop Point, as well as any of the other APTx Hop Points that may exist.

The approach that I will try to describe in the next set of emails is one based
on a Bell-LaPadula style of analysis, to find nodes that have been transformed
from being one type of network based node, to another type of network node,
in the case of APT1, one that is supporting a demanding network based transport
service.

I'm going to use Time Series Analysis methods, specifically Transfer Function
Models, and Intervention Analysis to realize that a node is doing something
different.  The Transfer Function Models are perfect for this, as they are
generally used to describe input / output dynamic system response,
and Intervention Analysis is all about the notion that there is an event that
motivates a dynamic change in system input / output.  So I'm going to try to use
this strategy to identify a change in input / output, and then to try to find the
event that correlates with the change.

If you can imagine that there is an argus running on every node in an
infrastructure, establishing a generalized network activity audit, that goes
back quite a ways, then we should have a very rich set of data to perform
this type of analysis, either automated, or by hand.   The goal will be to
realize that a node went from being a specific type of producer / consumer,
to a different kind of producer / consumer, over some period of time.

OK, that is going to be my strategy, any other approaches that seem to be
appropriate?    More to come.

Hope all is most excellent,

Carter

