iF2: An Interpretable Fuzzy Rule Filter for Web Log Post-Compromised Malicious Activity Monitoring

To alleviate the loads of tracking web log file by human effort, machine learning methods are now commonly used to analyze log data and to identify the pattern of malicious activities. Traditional kernel based techniques, like the neural network and the support vector machine (SVM), typically can deliver higher prediction accuracy. However, the user of a kernel based techniques normally cannot get an overall picture about the distribution of the data set. On the other hand, logic based techniques, such as the decision tree and the rule-based algorithm, feature the advantage of presenting a good summary about the distinctive characteristics of different classes of data such that they are more suitable to generate interpretable feedbacks to domain experts. In this study, a real web-access log dataset from a certain organization was collected. An efficient interpretable fuzzy rule filter (iF2) was proposed as a filter to analyze the data and to detect suspicious internet addresses from the normal ones.

The historical information of each internet address recorded in web log file is summarized as multiple statistics. And the design process of iF2 is elaborately modeled as a parameter optimization problem which simultaneously considers 1) maximizing prediction accuracy, 2) minimizing number of used rules, and 3) minimizing number of selected statistics. Experimental results show that the fuzzy rule filter constructed with the proposed approach is capable of delivering superior prediction accuracy in comparison with the conventional logic based classifiers and the expectation maximization based kernel algorithm. On the other hand, though it cannot match the prediction accuracy delivered by the SVM, however, when facing real web log file where the ratio of positive and negative cases is extremely unbalanced, the proposed iF2 of having optimization flexibility results in a better recall rate and enjoys one major advantage due to providing th- user with an overall picture of the underlying distributions.

Share This Post