Big Data

FOSS Project Spotlight: Sawmill, the Data Processing Project

on March 22, 2018

Introducing Sawmill, an open-source Java library for enriching, transforming and filtering JSON documents.

If you're into centralized logging, you are probably familiar with the ELK Stack: Elasticsearch, Logstash and Kibana. Just in case you're not, ELK (or Elastic Stack, as it's being renamed these days) is a package of three open-source components, each responsible for a different task or stage in a data pipeline.

Logstash is responsible for aggregating the data from your different data sources and processing it before sending it off for indexing and storage in Elasticsearch. This is a key role. How you process your log data directly impacts your analysis work. If your logs are not structured correctly and you have not configured Logstash correctly, your logs will not be parsed in a way that enables you to query and visualize them in Kibana.

Logz.io used to rely heavily on Logstash for ingesting data from our customers, running multiple Logstash instances at any given time. However, we began to experience some pain points that ultimately led us down the path to the project that is the subject of this article: Sawmill.

Explaining the Motivation

As our data pipelines grew heavier and more complex, we began to encounter serious performance issues. Our Logstash configuration files became extremely complicated, which resulted in very long startup times. Processing was also taking too long, especially for long log messages and in cases where the configuration did not match the actual log message.

The above points resulted in serious stability issues, with Logstash coming to a halt or sometimes crashing. The worst thing about it was that troubleshooting was a huge challenge. We lacked visibility and felt a growing need for a way to monitor key performance metrics.

We encountered additional issues as well, such as the lack of dynamic configuration reloading and a limited ability to apply business logic. Suffice it to say, Logstash was simply not cutting it for us.

Introducing Sawmill

Before diving into Sawmill, it's important to point out that Logstash has developed since the time we began working on this project, with new features that help deal with some of the pain points described above.

So, what is Sawmill?

Sawmill is an open-source Java library for enriching, transforming and filtering JSON documents.

For Logstash users, the best way to understand Sawmill is as a replacement for the filter section in the Logstash configuration file. Unlike Logstash, Sawmill does not have any inputs or outputs to read and write data. It is responsible only for data transformation.

With Sawmill pipelines, you can apply grok patterns, GeoIP and user-agent resolution, add or remove fields and tags, and more. Transformations are defined declaratively in a simple DSL, using configuration files or builders, so you can change them dynamically.

Sawmill Key Features

Here's a list of the key features and processing capabilities that Sawmill supports:

Written in Java, Sawmill is thread-safe and efficient, and uses caches where needed.
Sawmill can be configured in HOCON or JSON.
Sawmill lets you set a configurable timeout threshold for long-running processing.
Sawmill generates metrics for successful, failed, expired and dropped executions, and a metric for processing exceeding a defined threshold. All metrics are available per pipeline and processor.
25+ processors, including grok, geoip, user-agent, date, drop, key-value, json, math and more.
Nine logical conditions, including the basics as well as field-exists, has-value, match-regex and math-compare.
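
To make the timeout and metrics features above more concrete, here is a minimal sketch of the general technique in plain Java. It is not Sawmill's implementation and does not use Sawmill's API: the class TimeoutMetricsSketch, the method runWithTimeout and the three counters are hypothetical names chosen for illustration. A task runs on a worker thread, is interrupted if it exceeds the threshold, and every outcome is counted.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

public class TimeoutMetricsSketch {

    // Hypothetical counters, one per outcome (not Sawmill's metric names).
    static final AtomicLong succeeded = new AtomicLong();
    static final AtomicLong failed = new AtomicLong();
    static final AtomicLong expired = new AtomicLong();

    /** Runs a task and counts it as expired if it exceeds the timeout. */
    static String runWithTimeout(ExecutorService pool, Runnable task, long timeoutMs) {
        Future<?> future = pool.submit(task);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            succeeded.incrementAndGet();
            return "succeeded";
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            expired.incrementAndGet();
            return "expired";
        } catch (ExecutionException e) {
            failed.incrementAndGet(); // the task itself threw an exception
            return "failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            failed.incrementAndGet();
            return "failed";
        }
    }

    /** Stand-in for a slow processing step, such as a pathological grok match. */
    static void sleepLong() {
        try {
            Thread.sleep(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            System.out.println(runWithTimeout(pool, () -> { }, 500));
            System.out.println(runWithTimeout(pool,
                    () -> { throw new IllegalStateException("bad input"); }, 500));
            System.out.println(runWithTimeout(pool, TimeoutMetricsSketch::sleepLong, 50));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Keeping a separate counter for each outcome is what makes it possible to tell slow documents apart from broken ones when troubleshooting.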

Using Sawmill

Here is a basic example illustrating how to use Sawmill:


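// myLog is assumed to be a Map holding the parsed JSON log;
// pipeline is a Pipeline built from configuration.
Doc doc = new Doc(myLog);
PipelineExecutor pipelineExecutor = new PipelineExecutor();
pipelineExecutor.execute(pipeline, doc);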

As you can see above, there are a few entities in Sawmill:

Doc: essentially a Map representing a JSON document.
Processor: a single logical transformation of a document, such as grok, key-value or add-field.
Pipeline: specifies a series of processing steps as an ordered list of processors. Each processor transforms the document in some specific way. For example, a pipeline might have one processor that removes a field from the document, followed by another processor that renames a field.
PipelineExecutor: executes the processors defined in the pipeline on a document. The PipelineExecutor is responsible for the execution flow: it handles the onFailure and onSuccess flows, stops on failure, exposes execution metrics and more.
PipelineExecutionTimeWatchdog: warns on long processing times, and interrupts and stops processing on timeout (not shown in the example above).
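
The relationship between these entities can be sketched in a few lines of plain Java. This is a conceptual illustration only, not Sawmill's classes: MiniPipelineSketch, Step, removeField, rename and execute are hypothetical names, and the document is a bare Map. It mirrors the pipeline described above, with one step that removes a field followed by one that renames a field, and an executor loop that stops on the first failure.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniPipelineSketch {

    /** Hypothetical stand-in for a processor: one transformation of one document. */
    interface Step {
        boolean apply(Map<String, Object> doc); // returns false to signal failure
    }

    /** Removes a field; fails if the field is absent. */
    static Step removeField(String name) {
        return doc -> {
            if (!doc.containsKey(name)) {
                return false;
            }
            doc.remove(name);
            return true;
        };
    }

    /** Renames a field; fails if the source field is absent. */
    static Step rename(String from, String to) {
        return doc -> {
            if (!doc.containsKey(from)) {
                return false;
            }
            doc.put(to, doc.remove(from));
            return true;
        };
    }

    /** Runs the steps in order and stops at the first failure. */
    static boolean execute(List<Step> pipeline, Map<String, Object> doc) {
        for (Step step : pipeline) {
            if (!step.apply(doc)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("message", "GET /index.html");
        doc.put("tmp", "scratch");
        doc.put("host", "web-1");

        List<Step> pipeline = new ArrayList<>();
        pipeline.add(removeField("tmp"));
        pipeline.add(rename("host", "hostname"));

        boolean ok = execute(pipeline, doc);
        System.out.println(ok + " " + doc);
    }
}
```

Because the pipeline is just an ordered list, changing a transformation means swapping the list rather than changing the executor, which is the property that makes dynamic reconfiguration straightforward.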

Sawmill Configuration

A Sawmill pipeline can be built from a HOCON (Human-Optimized Config Object Notation) string. Because HOCON is a superset of JSON, plain JSON works as well.

Here is a simple configuration snippet in JSON to give you a feel for it:


{
    "steps": [{
        "grok": {
            "config": {
                "field": "message",
                "overwrite": ["message"],
                "patterns": ["%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
            }
        }
    }]
}

This is equivalent to the following, more concise HOCON:


steps: [{
    grok.config: {
        field: "message"
        overwrite: ["message"]
        patterns: ["%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
    }
}]

To understand how to use Sawmill, here's a simple example showing GeoIP resolution:


package io.logz.sawmill;

import io.logz.sawmill.Doc;
import io.logz.sawmill.ExecutionResult;
import io.logz.sawmill.Pipeline;
import io.logz.sawmill.PipelineExecutor;

import static io.logz.sawmill.utils.DocUtils.createDoc;

public class SawmillTesting {

    public static void main(String[] args) {

        Pipeline pipeline = new Pipeline.Factory().create(
                "{ steps :[{\n" +
                "    geoIp: {\n" +
                "      config: {\n" +
                "        sourceField: \"ip\"\n" +
                "        targetField: \"geoip\"\n" +
                "        tagsOnSuccess: [\"geo-ip\"]\n" +
                "      }\n" +
                "    }\n" +
                "  }]\n" +
                "}");

        // Build a document with a message and the IP address to resolve.
        Doc doc = createDoc("message", "testing geoip resolving",
                "ip", "172.217.11.174");
        ExecutionResult executionResult =
                new PipelineExecutor().execute(pipeline, doc);

        if (executionResult.isSucceeded()) {
            System.out.println("Success! Result is: " + doc.toString());
        }
    }
}

End Results

We've been using Sawmill successfully in our ingest pipelines for more than a year now, processing the huge amounts of log data shipped to us by our users.

We know Sawmill is still missing some key features, and we are looking forward to getting contributions from the community. We also realize that at the end of the day, Sawmill was developed for our specific needs and might not be relevant for your use case. Still, we'd love to get your feedback.

About the Author

Daniel Berman is Product Evangelist at Logz.io. He is passionate about log analytics, big data, cloud and family, and he loves running, Liverpool FC and writing about disruptive tech stuff. Follow him @proudboffin.
