Applying Predictive Modeling Techniques to Information Security

Sunday, February 13, 2011

Fred Williams


A few months ago, I presented an article on Infosec Island titled: "Using Analytics and Modeling to Predict Attacks". In that article, I wondered if analytics could assist security professionals in predicting future computer attacks. 

After writing a research paper on the subject during my last semester of graduate school, my answer, in a nutshell, is yes... and as Dr. Chuvakin commented on my previous article, "The devil's in the details!" The focus of my paper was on those details.

Basically, analytics can be used in any type of industry that produces and consumes data.  Of course that includes security.

Predictive analytics and data mining may at first seem to mean the same thing, but there are differences. Data mining is the process of exploring large amounts of data for relationships that can be exploited for proactive decision making. Data mining can produce decisions through standard reports that explain what happened, and alerts can be created to flag the times when a reaction is necessary. To me, predictive modeling goes a few steps beyond data mining and therefore adds the most value to a business. Predictive modeling starts with statistical analysis and moves from standard reporting and alerts to forecasting and optimization. Instead of focusing on what happened, predictive modeling lets us look at what will happen next, which trends will continue, and how we can do better.

There are considerable barriers to this field.  For one, analytics involves the use of advanced statistics.  My limited statistical training was certainly a big hurdle for me as I began to put analytics into practice.  I dusted off my grad school business statistics book and began to reread the sections on measures of central tendency, probability theories and Bayesian statistics.  At the same time, I was learning what exactly was meant by "business analytics" and "predictive modeling".  Luckily for me, I work at one of the largest software companies in the world whose focus on business analytics has provided me with a wealth of material and software tools to put this into hands-on practice.

The complex nature of the field leads to the next barrier: you need highly paid, highly skilled modeling professionals. Which leads to yet another barrier: you need people who know how to use modeling software. Since analytics is complicated, the software is complicated too. But even if you know statistics and learn the tools, you may not be able to interpret the results you get. In fact, there is a trend in the industry to combat the field's complexity: several companies are planning to release tools that bring analytics to the novice end user. For my paper, I evaluated two open source packages: R (the statistics package) with the Rattle data mining plugin, and Weka (a data mining package). I compared the open source offerings to SAS Enterprise Miner, an enterprise-strength data mining package with descriptive and predictive modeling capabilities.

In order to apply the techniques to information security I needed datasets.  I used a commonly applied dataset in information security research: The network intrusion dataset from the KDD archive popularly referred to as the KDD 99 Cup set.  The KDD 99 Cup consists of 41 attributes and 345,814 observations gathered from 9 weeks of raw TCP data from simulated United States Air Force network traffic.  The intrusion dataset is quite different from a raw TCP dump.  First of all, the KDD99 Cup dataset has a number of attributes that are not found in raw TCP data.

Secondly, two features that would actually improve intrusion detection models are missing from the dataset: timestamp and source IP address. Web log analysis is built on these two features, and they provide valuable insight into access patterns. The dataset's creators simulated 24 attack types, broken down into 4 classes: Denial of Service (DoS), Remote to Local (R2L), Probing, and User to Root (U2R). I downloaded the dataset in two forms: (1) the raw dataset in CSV format for loading into SAS Enterprise Miner and (2) the dataset in ARFF format as required by Weka. I immediately ran into a major problem with R and Weka: while both packages could load the 400,000 records, they frequently hung when I tried to build models, whereas SAS Enterprise Miner ran like a champ.
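To make the exploration step concrete, here is a minimal sketch of loading KDD99-style records with pandas. The inline sample and the short column list are stand-ins I made up for illustration; the real file has 41 attributes plus the attack label.

```python
# Sketch: loading KDD99-style connection records for exploration.
# The inline CSV and column subset are illustrative stand-ins, not the
# real KDD Cup 99 file (which has 41 attributes plus the label).
import io
import pandas as pd

raw = io.StringIO(
    "0,icmp,ecr_i,SF,1032,0,smurf\n"
    "0,tcp,private,S0,0,0,neptune\n"
    "0,icmp,ecr_i,SF,1032,0,smurf\n"
    "5,tcp,telnet,SF,312,1200,normal\n"
)
columns = ["duration", "protocol_type", "service", "flag",
           "src_bytes", "dst_bytes", "label"]

# The KDD99 CSV ships without a header row, hence header=None.
df = pd.read_csv(raw, header=None, names=columns)
print(df["label"].value_counts())  # counts per attack type
```

Swapping the `io.StringIO` stand-in for the path to the downloaded CSV (and the full 42-name column list) gives the same workflow on the real dataset.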

Next in my paper, I proposed a basic modeling framework.  By using a modeling framework, modelers can apply techniques in an iterative fashion similar to software engineering.  This enables modelers to share models, evaluate them for effectiveness, and determine whether their results are accurate.  My framework starts with data exploration, moves on to model envisioning, followed by iterative modeling, and finally ends with model testing and deployment.  The framework is loosely based upon the Predictive Model Markup Language (PMML) designed by the Data Mining Group.

By starting with data exploration you can use the software to display measures of central tendency. 

Summary Statistics generated by SAS Enterprise Miner for the KDD99Cup Dataset

For example, when I imported the KDD 99 Cup dataset into the software, the summary statistics revealed several interesting things.  For one, 57% of all observations involved Smurf DDoS attacks, and 100% of the Smurf attacks used the ICMP protocol.  In addition, 22% of all Neptune attacks involved TCP traffic.  This reflects how the attacks work: Smurf attacks flood a target with ICMP packets, whereas Neptune attacks are variants of the TCP 3-way handshake.  Overall, the summary statistics showed very irregular data distributions in the KDD 99 Cup dataset. For example, the DDoS records always come in large clusters, whereas the U2R attacks are always represented by isolated records.  This mirrors a common technique among attackers: launch a massive DDoS attack that overwhelms the server, and, hidden in this tremendous amount of traffic, launch the more lucrative U2R and R2L attacks.  The idea is that the security analysts will be so busy mitigating the DDoS attack that they never detect the attempts to gain access through backdoors or password guessing.
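The kind of summary described above (share of observations per attack type, and protocol breakdown within each type) is easy to sketch with pandas. The tiny DataFrame here is invented to mimic the patterns reported from the real dataset; it is not the actual KDD99 data.

```python
# Sketch of the exploratory summary described in the article: what share
# of observations each attack type represents, and which protocols each
# attack type uses. Data below is a synthetic stand-in for KDD99.
import pandas as pd

df = pd.DataFrame({
    "label":    ["smurf", "smurf", "smurf", "neptune", "neptune", "normal"],
    "protocol": ["icmp",  "icmp",  "icmp",  "tcp",     "tcp",     "udp"],
})

# Fraction of all observations per attack type.
share = df["label"].value_counts(normalize=True)

# Protocol mix within each attack type (rows sum to 1.0).
by_proto = pd.crosstab(df["label"], df["protocol"], normalize="index")

print(share)     # smurf dominates this toy sample
print(by_proto)  # every smurf row here is ICMP
```

On the real dataset, the same two lines reproduce the observations above, e.g. that Smurf records are uniformly ICMP while Neptune records are TCP.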

When moving to model envisioning, you use agile software techniques to document candidate models to aid in predictive modeling.  A common model is the decision tree.

Decision tree generated by SAS Enterprise Miner 

When using a decision tree, you identify a target variable in your dataset, and the software uses a series of IF-ELSE rules to divide the data into logical segments. Improvements to the predictive models occur during subsequent iterations, where model effectiveness is measured. In my paper, I kept subdividing the decision tree built in previous iterations by various attributes until I was relatively confident that the results were accurate and useful.  The final phase, model testing and deployment, involves determining whether the predictive models constructed in earlier phases perform effectively.  Cumulative lift charts are an excellent way to visualize the performance of a model.  Lift, a measure of a predictive model's effectiveness, is calculated as the ratio between the results obtained with and without the model.
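A decision tree of the sort described above can be sketched in a few lines. This uses scikit-learn as a stand-in for the modeling tools evaluated in the paper, and the features and connection records are invented for the example.

```python
# Illustrative sketch: a shallow decision tree separating attack from
# normal connections, analogous to the IF-ELSE segmentation described
# above. scikit-learn stands in for the tools used in the paper, and
# the feature values below are invented.
from sklearn.tree import DecisionTreeClassifier

# Each row is [src_bytes, duration]; label 1 = attack, 0 = normal.
X = [[1032, 0], [1032, 0], [0, 0], [312, 5], [287, 4], [150, 3]]
y = [1, 1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The fitted tree is a set of IF-ELSE splits on the two features.
print(tree.predict([[1032, 0]]))  # classifies this record as an attack
```

Each iteration of the framework would deepen or re-split such a tree on additional attributes and re-measure its effectiveness.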

Lift = confidence / expected confidence

Lift chart

Basically, the greater the area between the lift curve and the baseline, the better the model will be at predicting outcomes.
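The lift formula above is simple enough to compute directly. In this sketch, "confidence" is the response rate inside a model-selected segment and "expected confidence" is the overall baseline rate; the counts in the usage line are made-up illustrative numbers.

```python
# Lift = confidence / expected confidence, as defined above.
# "Confidence" is the hit rate within the segment the model selects;
# "expected confidence" is the hit rate across the whole population.

def lift(segment_hits: int, segment_size: int,
         total_hits: int, total_size: int) -> float:
    """Ratio of the segment's response rate to the baseline rate."""
    confidence = segment_hits / segment_size   # with the model
    expected = total_hits / total_size         # without the model
    return confidence / expected

# Illustrative numbers: the model flags 100 connections, 40 of which are
# real attacks, out of 10,000 connections containing 200 attacks overall.
print(lift(40, 100, 200, 10_000))  # -> 20.0
```

A lift of 20 means the model's selected segment is twenty times richer in attacks than random sampling; a lift of 1 would mean the model adds nothing over the baseline.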

There has been an increasing amount of work in the information technology field concerning predictive techniques and the need to uncover patterns in data.  Al-Shayea used artificial neural networks to predict students' academic performance, with the goal of improving scores through preplanned strategic programs.  Fouad, Abdel-Aziz, and Nazmy researched the use of artificial neural networks in IDSs to detect unknown signature patterns in network traffic.

Predictive modeling has proven to be extremely effective in solving a wide array of important business problems.  There are several hurdles to overcome before this process can be effectively used by a wider audience.  One problem is that a trained data analyst who is experienced in modeling techniques and is knowledgeable about the data sources needs to be involved.  A highly automated technology solution that incorporates the framework features presented in this paper exposed as a web service would enable developers and database analysts all over the world to build customizable solutions for their company.

David-Joshua Ginsberg Reading about Predictive Modeling here, it seems that it would mesh well with the newer 'Proactive' style of fraud detection that fraud investigators might prefer to use rather than 'Reactive' style deduction. This means they would have a better chance to detect fraudulent activity before much theft occurs, rather than after.

Predictive Modeling sounds like a really successful way to data mine, taking full population database analysis a step further, rather than a step backward into 'discovery sampling' which only examines a portion of a data sampling, and makes inferences on the entire group, based on the small portion. The flaw with discovery sampling is that it misses a large chunk of the data to analyze, where many anomalies and even incidents of fraud may be detected.
Good work, Lance!
David-Joshua Ginsberg (Sorry... make that "Good work, Fred!" I was invited to this group by Lance Williams, but hadn't noticed that Fred Williams was the contributor of this article.)
Don Turnblade What I learned is that your local Six Sigma Blackbelt has the statistical might to help here.

I love Network and Host Intrusion Detection/Prevention modeling based on electronic signatures only. That said, something vigilant to help InfoSec monitor meaningful signatures is always welcome. I cannot tell you how many units are unable to detect authorized Attack and Penetration efforts until the report prints it for them.

I am working on a Business Impact Forum where we turn technical frequency and impact into dollars and cents in a quantitative way. Consider joining the forum.

Statistics are allowed. Curve fits, Confidence Intervals, reality based frequency and impact models are wanted and encouraged.

Best Wishes

Fred Williams Thanks for your comments. It is true that predictive analytics is being used in industries right now. The credit card industry uses it to detect fraudulent transactions at the point of sale, and then after the fact by combing through databases looking for anomalies. Insurance, government and health care are also heavy users.
Don Turnblade For those of you that want to take it a step further, Monte Carlo based analysis is recognized by Business Leads as a valid decision support metric for Funding Security Projects.

But, you are going to have to get some data that may give you a few more white hairs. Generally, your C-levels will openly thank you however.

Start with collecting Mean Time Between Failures;
Move on to Return On Invested Capital;
Then, finish with situational Breach Costing.