SE Can't Code

A Tokyo based Software Engineer. Not System Engineer :(

Infrastructure's machine learning.

Suppose you want to detect anomalies in server logs. Using Jubatus, you can build a reporting system that checks each entry of a web site's Apache2 access log for anomalies.


  • Apache2
  • Fluentd
  • Jubatus
  • Python

First, start Fluentd with a config file.

Fluentd lets you unify data collection and consumption for better use and understanding of data. In this case, I use Fluentd to transfer the web site's access log to Jubatus.
So, create a Fluentd config like the one below:

<source>
  type tail
  path /var/log/apache2/access.log
  tag apache.log
  pos_file /tmp/fluent_pos
  format apache2
</source>

<match apache.**>
  type forward
  host localhost
  port 9191
  flush_interval 0.1s
</match>

Save it as apache_forward.conf and start Fluentd with the -c option:

$ fluentd -c apache_forward.conf

Second, create the anomaly checker for the server's access log.

To detect anomalies in the access logs, the checker takes three fields from each Apache2 entry (the request path, the method, and the referer), passes them to Jubatus, and observes the anomaly score as entries arrive. There are four steps:

  • Receive the message, an Apache2 access log entry, from Fluentd
  • Deserialize the received message into tuple data
  • Convert the tuple data into a Datum, Jubatus's data type
  • Calculate the anomaly score

It is easier than it sounds, and all of it can be coded in Python.
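A minimal sketch of steps 2 to 4 might look like the following. It assumes the message has already been decoded from MessagePack (the encoding Fluentd's forward protocol uses) into plain Python lists and dicts, and that the records carry the path, method, and referer keys. The function names iter_records, to_fields, and score_record are hypothetical, packed message bodies are left out, and the Jubatus calls assume the Jubatus Python client (Datum built from a dict, add() returning an object with a score).

```python
FIELDS = ("path", "method", "referer")  # keys assumed to be present in each record


def iter_records(message):
    """Step 2: yield record dicts from an already-decoded forward message."""
    body = message[1]
    if isinstance(body, list):
        # Several events at once: [tag, [[time, record], ...]]
        for _time, record in body:
            yield record
    elif isinstance(body, (bytes, bytearray)):
        # A packed stream of events; decoding it is outside this sketch.
        raise ValueError("packed message bodies are not handled here")
    else:
        # A single event: [tag, time, record]
        yield message[2]


def to_fields(record):
    """Step 3a: keep only the string fields that Jubatus will look at."""
    fields = {}
    for key in FIELDS:
        value = record.get(key)
        fields[key] = "" if value is None else str(value)
    return fields


def score_record(anomaly_client, record):
    """Steps 3b and 4: wrap the fields in a Datum and ask for a score."""
    # Assumption: the Jubatus Python client is installed.
    from jubatus.common import Datum
    result = anomaly_client.add(Datum(to_fields(record)))
    return result.score


# Usage, assuming a Jubatus anomaly server on its default port 9199:
#   from jubatus.anomaly import client
#   anomaly = client.Anomaly("127.0.0.1", 9199, "")
#   for record in iter_records(message):
#       print(score_record(anomaly, record))
```

Keeping the field extraction separate from the Jubatus call makes it possible to check the conversion without a running server.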

Third, set up the Jubatus config.

I configure Jubatus to score each access log entry for anomalies based on its path, method, and referer. The config file also holds the detailed settings of the anomaly algorithm.
The Jubatus config is written as a JSON file, like the one below:

{
    "converter" : {
        "string_types" : {
            "bigram" : { "method": "ngram", "char_num" : "2"}
        },
        "string_rules" : [
            {
                "key" : "path",
                "type" : "bigram",
                "sample_weight" : "bin",
                "global_weight" : "bin"
            },
            {
                "key" : "method",
                "type" : "str",
                "sample_weight" : "bin",
                "global_weight" : "bin"
            },
            {
                "key" : "referer",
                "type" : "bigram",
                "sample_weight" : "bin",
                "global_weight" : "bin"
            }
        ]
    },
    "parameter" : {
        "nearest_neighbor_num" : 3,
        "reverse_nearest_neighbor_num" : 3,
        "method" : "euclid_lsh",
        "parameter" : {
            "hash_num" : 512
        }
    },
    "method" : "light_lof"
}

The anomaly algorithm used here in Jubatus is built on nearest neighbors, a family of methods that supports both unsupervised and supervised neighbors-based learning. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure; standard Euclidean distance is the most common choice. In the config above, nearest_neighbor_num sets the number of neighbors to 3.
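To make the nearest neighbor idea concrete, here is a small sketch in plain Python. It is not Jubatus's light_lof implementation: it only finds the k closest samples by Euclidean distance and uses the mean distance to them as a crude indicator of how isolated a new point is. The function names and the sample points are made up for illustration.

```python
import math


def k_nearest(samples, query, k=3):
    """Return the k samples closest to query as (distance, sample) pairs."""
    ranked = sorted(
        ((math.dist(sample, query), sample) for sample in samples),
        key=lambda pair: pair[0],
    )
    return ranked[:k]


def mean_neighbor_distance(samples, query, k=3):
    """Average Euclidean distance from query to its k nearest samples."""
    neighbors = k_nearest(samples, query, k)
    return sum(distance for distance, _ in neighbors) / len(neighbors)


# Four "normal" points on a unit square (illustrative data only).
normal = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

# A point inside the square sits close to its neighbors,
# while a far-away point gets a much larger mean distance.
print(mean_neighbor_distance(normal, (0.5, 0.5)))
print(mean_neighbor_distance(normal, (10.0, 10.0)))
```

A point whose neighbors are all far away is a candidate anomaly, which is the intuition that density-based scores refine.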

You can also see ngram in the config, which is a feature extraction function. With char_num set to 2, it splits the given string into overlapping two-character terms. N-grams have many use cases, for example natural language processing in search engines. In this case, Jubatus judges whether an entry is anomalous using these two-character terms as feature values. Using ngram, you may be able to catch shell code and SQL injection in the URLs of HTTP requests. In such cases, Jubatus can help you guard against unknown attacks.
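The two-character split can be sketched in a few lines of plain Python. This is an illustration of character bigrams, not Jubatus's own converter; the function name char_ngrams and the example URLs are made up.

```python
def char_ngrams(text, n=2):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


normal_path = "/item?id=1"
suspicious_path = "/item?id=1' OR '1'='1"

# Bigrams of an ordinary request path.
print(char_ngrams(normal_path))

# Bigrams that appear only in the suspicious path. Terms like these,
# unseen in normal traffic, are what push a request away from its neighbors.
novel = set(char_ngrams(suspicious_path)) - set(char_ngrams(normal_path))
print(sorted(novel))
```

Because the features are short character runs rather than whole paths, a request that has never been seen before can still be compared with normal traffic.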