CSE 525 (Winter 2004)
Topic #27: Machine learning in IDS #2
Jason Bittel

T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection", NIST-NCSC National Information Systems Security Conference, 1997, Paper
J. Ryan, M. Lin and R. Miikkulainen, "Intrusion Detection with Neural Networks", MIT Press, 1998, Paper
A. K. Ghosh, A. Schwartzbard and M. Schatz, "Learning Program Behavior Profiles for Intrusion Detection", USENIX Workshop on Intrusion Detection and Network Monitoring, 1999, Paper
D. Endler, "Intrusion detection: Applying machine learning to Solaris audit data", ACSAC'98, Paper


Summary: This paper group covered four papers, which was a lot of material for a single session. However, while each paper came at the problem from a different angle, a number of similarities can be generalized across them.

To begin, there are two major approaches to IDS. The first is misuse detection, in which intrusions are defined beforehand as patterns that are then matched against the data. While this methodology is simple, its main flaw is that it cannot adapt to new intrusions unless its database of patterns is updated. The second is anomaly detection, which creates a baseline pattern of normal system use and then looks for significant deviations from that norm. The main benefit of this approach is that it can abstract information about normal behavior to defend against unknown attacks.
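The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration and is not taken from any of the four papers: the signature list, the choice of "commands per session" as the measured feature, and the three-standard-deviation threshold are all assumptions.

```python
from statistics import mean, stdev

# Hypothetical attack patterns for misuse detection (illustrative only).
SIGNATURES = ["cat /etc/shadow", "rm -rf /"]

def misuse_alert(command):
    """Misuse detection: flag a command that contains a known pattern."""
    return any(sig in command for sig in SIGNATURES)

def build_baseline(samples):
    """Anomaly detection, step 1: summarize normal use as (mean, std dev)."""
    return mean(samples), stdev(samples)

def anomaly_alert(value, baseline, threshold=3.0):
    """Anomaly detection, step 2: flag large deviations from the baseline."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Assumed feature: number of commands a user issues per session.
baseline = build_baseline([40, 42, 38, 45, 41])
print(misuse_alert("cat /etc/shadow"))  # matches a stored pattern
print(misuse_alert("cat /etc/passwd"))  # unknown to the pattern database
print(anomaly_alert(43, baseline))      # close to normal use
print(anomaly_alert(400, baseline))     # far from normal use
```

The misuse detector only catches what is already in its pattern list, while the anomaly detector needs no attack knowledge but depends entirely on how well the baseline captures normal use.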

There are also several important terms to be aware of. Concept drift is the idea that the way a user works with a system changes over time, and therefore the IDS must be able to adapt to that change as well. An on-line IDS is one that receives and processes usage data in real time. Its main drawback is that it is computationally expensive, although it does report intrusions almost as soon as they occur. Conversely, an off-line system is one that analyzes usage data that has been stored for a period of time. Such a system can analyze the data daily, weekly, or at whatever interval has been set.
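One common way to let a profile follow concept drift is to blend each new batch of observations into the stored profile with an exponential decay, so that old habits fade gradually. The sketch below assumes a profile is a dictionary of command frequencies; the function name and the `rate` parameter are hypothetical, and none of the four papers is claimed to use this scheme.

```python
def update_profile(profile, observed, rate=0.1):
    """Blend newly observed command frequencies into a stored profile.

    `rate` (an assumed tuning parameter) controls how quickly the
    profile forgets old behavior: 0 never adapts, 1 keeps only the
    newest observations.
    """
    commands = set(profile) | set(observed)
    return {
        c: (1 - rate) * profile.get(c, 0.0) + rate * observed.get(c, 0.0)
        for c in commands
    }

# A user who used to run only `ls` starts using `vim` instead.
profile = {"ls": 1.0}
profile = update_profile(profile, {"vim": 1.0}, rate=0.5)
print(profile)
```

In an off-line system this update could run once per analysis interval; an on-line system would have to apply it continuously, which is part of the computational cost noted above.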

The first paper focused on using sequences as the fundamental unit of comparison. Data was collected from four different users and parsed from a command stream into a token stream, with each command's arguments replaced by a count of how many arguments were passed to that program. A numerical similarity measure was then computed between sequences, so that sequences with a close resemblance scored as similar. The authors experimented with different analysis methods and found that sequence length and dictionary size contributed most to the accuracy of the system. Ultimately, they determined that the system worked "well", with two caveats: they did not address concept drift at all, and they theorized that novice users could be the most difficult to develop a baseline fingerprint for.
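The tokenization and sequence comparison described above might look roughly like the following sketch. The `<N>` token format, the fixed-length windows, and the particular similarity measure (position-wise matches, with runs of adjacent matches weighted more heavily) are assumptions made for illustration; the paper's exact definitions are not reproduced in this summary.

```python
import shlex

def tokenize(command_line):
    """Turn one shell command into [program, '<argument count>']."""
    parts = shlex.split(command_line)
    if not parts:
        return []
    return [parts[0], f"<{len(parts) - 1}>"]

def token_stream(commands):
    """Flatten a list of command lines into a single token stream."""
    stream = []
    for line in commands:
        stream.extend(tokenize(line))
    return stream

def windows(stream, length):
    """Cut the stream into overlapping fixed-length sequences."""
    return [tuple(stream[i:i + length])
            for i in range(len(stream) - length + 1)]

def similarity(a, b):
    """Score two equal-length sequences; adjacent matches count for more."""
    score, run = 0, 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        score += run
    return score

def similarity_to_profile(seq, profile):
    """Compare a sequence with its closest match in the user's dictionary."""
    return max(similarity(seq, p) for p in profile)

history = token_stream(["ls -l /tmp", "cd /tmp", "vi notes.txt"])
profile = windows(history, 4)  # the user's sequence dictionary
probe = tuple(token_stream(["ls -l /tmp", "cd /tmp"]))
print(similarity_to_profile(probe, profile))
```

Here the window length and the size of `profile` play the roles of the sequence length and dictionary size that the authors found to matter most.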

The second paper focused on NNID (Neural Network Intrusion Detector). This system works off-line and uses the distribution of commands to identify a specific user's behavior. The data was collected and two specific tests were run against it. The system performed well, and all of the false positives were attributed to a single user for whom not enough data could be recorded. The authors of this paper also felt that their system was a success, although they did leave several questions unanswered at the end as a prelude to future work.
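A "distribution of commands" can be represented as a fixed-length vector of relative frequencies, which is a natural input for a neural network. The sketch below shows only that feature-extraction step; the vocabulary, the normalization, and the function name are assumptions, and NNID's actual input encoding is not described in this summary.

```python
from collections import Counter

def command_distribution(commands, vocabulary):
    """Return the relative frequency of each vocabulary command.

    Commands outside the (assumed) vocabulary are ignored, so the
    vector always has len(vocabulary) entries.
    """
    counts = Counter(commands)
    total = sum(counts[c] for c in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[c] / total for c in vocabulary]

vocabulary = ["cd", "ls", "vim"]  # hypothetical command set
day_log = ["ls", "ls", "cd", "cd"]  # one user's commands for one day
print(command_distribution(day_log, vocabulary))
```

A classifier trained on such vectors could then be asked which user a day's vector most resembles, with a mismatch against the logged-in user treated as a possible intrusion.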

The third paper experimented with three different algorithms for the same task of learning program behavior profiles. The three algorithms were table lookup, essentially a matching algorithm; a backpropagation network, which was a neural network implementation; and an Elman network, which is a modified backpropagation network that is able to recall and use older data to make current decisions. To summarize the results, the Elman network outperformed the other two by a significant margin, with fewer false positives and a higher anomaly detection rate.
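Of the three algorithms, table lookup is the simplest to illustrate. One plausible reading, sketched below, is to store every short sequence of events seen during normal runs and then measure how many sequences in a new trace are missing from that table. The event names, the n-gram length, and the use of an "unseen fraction" as the score are assumptions, not details taken from the paper.

```python
def ngrams(trace, n):
    """List every run of n consecutive events in a trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def build_table(normal_trace, n=3):
    """Record every n-gram observed during normal program behavior."""
    return set(ngrams(normal_trace, n))

def anomaly_rate(trace, table, n=3):
    """Fraction of n-grams in a trace that never occurred in training."""
    grams = ngrams(trace, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in table)
    return unseen / len(grams)

# Hypothetical event traces for a single program.
normal = ["open", "read", "write", "close"]
suspect = ["open", "read", "exec", "close"]

table = build_table(normal, n=2)
print(anomaly_rate(normal, table, n=2))
print(anomaly_rate(suspect, table, n=2))
```

A pure lookup like this cannot generalize to sequences it has never seen, which is one reason a network that carries context forward, such as the Elman network, might be expected to do better.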

The fourth paper was specific to Solaris. Audit data was collected from four users with the Basic Security Module. This data was then parsed with a Perl script into distinct files for each user, with simulated intrusions randomly interspersed into the files. These files were then run through a neural network simulator to determine how well the network could detect the intrusions. Overall, the network was able to pick up the majority of the anomalies without many false positives, although no hard figures were given. Ultimately, the author found the best solution to be a combination of both anomaly and misuse detection.

To summarize, the methodology of fingerprinting users by their usage patterns holds promise as a means to perform intrusion detection both on-line and off-line. However, the research has some weaknesses. First, an intruder who breaks in during the learning phase can taint what the network learns. Second, all of these tests were performed with low user counts, so scalability was never tested. Third, there were no real-world tests, as all of the intrusions were merely simulated in the data. With additional work, all of these hurdles could be overcome. Considering that these papers are now relatively old, much of this research may have already been completed.

Presentation: Slides