So, you work with data, and I mean a lot of data, the kind you can't fit in memory. What are your options? Most likely you'll need a data streaming solution. One of these is NiFi, which is what this post focuses on. You may also have heard of a technology called Docker, which is very powerful: it lets you run multiple isolated processes (containers) side by side, independently of the host platform. More on Docker in another post.

So what is NiFi? NiFi is an Apache Software Foundation project designed for streaming data sets through a chain of processors. This lets you manipulate the data into a standardised format.

For this tutorial, I will make a few assumptions.

  • You are running a Linux Operating System
  • You know what distribution of Linux you are running
  • You can follow an install guide

Firstly, you will need to follow the Docker installation guide for your Linux distribution.

We now need to initialise a Docker swarm so that we can create internal networks and services. To do this, run the following command:

docker@host$ docker swarm init

This will initialise the swarm and print a join token, which you can use to hook in more boxes. Again, that can be covered in another post.
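
If you lose the output of that command, or just want to check that the swarm really is up, something like the sketch below should do it. The function names swarm_is_active and show_worker_join_command are my own inventions for illustration; the docker subcommands they wrap are the standard ones.

```shell
# Succeeds when this Docker engine is part of an active swarm.
swarm_is_active() {
    [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" = "active" ]
}

# Re-prints the worker join command, which embeds the join token.
show_worker_join_command() {
    docker swarm join-token worker
}

# Example (run on the box where you ran `docker swarm init`):
#   swarm_is_active && show_worker_join_command
```

The second function prints a ready-made docker swarm join command that you can paste onto another box when you get round to adding one.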

Next, we need to define a few files, the first being docker-stack.yml:

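version: '3'
services:
  nifi:  # service name is an assumption; any name works
    image: apache/nifi:1.9.2
    ports:
      - 8080:8080
    volumes:
      - data:/data
      - /path/to/bootstrap.conf:/opt/nifi/nifi-current/conf/bootstrap.conf
volumes: {data: {}}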

Then we need to define bootstrap.conf, which lets us give NiFi more memory. By default, the Docker image only gives Java a 512 MB heap. I'd recommend giving it at least 2 GB of RAM, but it's probably better to set the lower limit to 4 GB if possible.

The settings that we want to update are the initial heap size (-Xms) and the maximum heap size (-Xmx):

# JVM memory settings
java.arg.2=-Xms4g
java.arg.3=-Xmx8g

In context, the change looks like this (lines starting with - are removed, lines starting with + are added):

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
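# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.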
# See the License for the specific language governing permissions and
# limitations under the License.

# Java command to use when running NiFi

# Username to use when running NiFi. This value will be ignored on Windows.

# Configure where NiFi's lib and conf directories live

# How long to wait after telling NiFi to shutdown before explicitly killing the Process

# Disable JSR 199 so that we can use JSP's without running a JDK

# JVM memory settings
- java.arg.2=-Xms512m
- java.arg.3=-Xmx512m
+ java.arg.2=-Xms4g
+ java.arg.3=-Xmx8g

# Enable Remote Debugging

# allowRestrictedHeaders is required for Cluster/Node communications to work properly

# The G1GC is still considered experimental but has proven to be very advantageous in providing great
# performance without significant "stop-the-world" delays.

#Set headless mode by default

# Master key in hexadecimal format for encrypted sensitive configuration values

# Sets the provider of SecureRandom to /dev/urandom to prevent blocking on VMs

# Requires JAAS to use only the provided JAAS configuration to authenticate a Subject, without using any "fallback" methods (such as prompting for username/password)
# Please see, section "EXCEPTIONS TO THE MODEL"

# Notification Services for notifying interested parties when NiFi is stopped, started, dies

# XML File that contains the definitions of the notification services

# In the case that we are unable to send a notification for an event, how many times should we retry?

# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is started?

# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is stopped?

# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi dies?
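
If you'd rather not edit the file by hand, the two heap lines can be patched with sed. This is only a sketch: set_nifi_heap is a hypothetical helper name, and it assumes your bootstrap.conf still contains lines starting with java.arg.2= and java.arg.3=, as in the listing above.

```shell
# Hypothetical helper: rewrite the two JVM heap lines in a bootstrap.conf.
# Usage: set_nifi_heap <path-to-bootstrap.conf> <initial-heap> <max-heap>
set_nifi_heap() {
    conf="$1"
    initial="$2"
    max="$3"
    tmp="$(mktemp)" || return 1

    # Replace whatever values java.arg.2 and java.arg.3 currently hold.
    if sed -e "s/^java\.arg\.2=.*/java.arg.2=-Xms${initial}/" \
           -e "s/^java\.arg\.3=.*/java.arg.3=-Xmx${max}/" \
           "$conf" > "$tmp"; then
        # Write back with cat rather than mv so the file keeps its inode.
        cat "$tmp" > "$conf"
        status=$?
    else
        status=1
    fi

    rm -f "$tmp"
    return "$status"
}

# Example, matching the values used in this post:
#   set_nifi_heap /path/to/bootstrap.conf 4g 8g
```

Keeping the same inode is a deliberate choice here: the stack file bind-mounts this single file into the container, and replacing the file outright can leave a running container looking at the old copy.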

Now that this is all configured, we just need to bring the service up. We can do this by running the following command:

docker@host$ docker stack deploy -c docker-stack.yml NIFI_STACK
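
The deploy command usually returns before NiFi has finished starting inside the container, so the UI may not answer straight away. As a rough sketch, you could poll the URL until it responds. wait_for_url is a made-up helper name, and it assumes curl is installed on the host.

```shell
# Polls a URL until it answers successfully or the attempts run out.
# Usage: wait_for_url <url> [attempts] [delay-seconds]
wait_for_url() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-5}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl fail on HTTP errors; -sS stays quiet unless it fails.
        if curl -fsS -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (run on the swarm manager):
#   docker stack services NIFI_STACK   # lists the stack's services and replicas
#   wait_for_url http://localhost:8080/nifi && echo "NiFi is up"
```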

And that's it. Navigate to http://localhost:8080/nifi and you should be able to see the NiFi frontend. See you in the next post, when we'll be looking at how to set up a CSV pipeline to get the data into Elasticsearch.

Until next time, enjoy NiFi.