So, you work with data, and I mean a lot of data, the kind you can't fit in memory. What are your options? The likely answer is that you'll need a data streaming solution, one of which is NiFi, and that's what this post will focus on. You may also have heard of a technology called Docker, which is a very, very powerful platform: it lets you run multiple concurrent processes in a platform-independent way. But more on Docker in another post.

So what is NiFi? NiFi is an Apache Foundation project designed for streaming datasets through multiple stream processors, which lets you manipulate the data into a standardised format.


For this tutorial, I will make a few assumptions.

  • You are running a Linux Operating System
  • You know what distribution of Linux you are running
  • You can follow an install guide

Firstly, you will need to install Docker on your Linux machine by following the official guide: https://docs.docker.com/install/
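Once that's done, it's worth a quick sanity check that the Docker daemon is running and your user can talk to it. A minimal check (hello-world is just Docker's own test image):

docker@host$ docker --version
docker@host$ docker run --rm hello-world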


We now need to initialise a Docker swarm so that we can create internal networks and services. To do this, run the following command:

docker@host$ docker swarm init

This will initialise the swarm and generate a join-token. You can use this token to hook in more boxes, but again, that can be covered in another post.
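If you want to confirm the swarm is active, you can list its nodes; on a single-node swarm you should see just this machine listed as a manager:

docker@host$ docker node ls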


Next, we need to define a few files, the first being the docker-stack.yml:

version: '3'
services:
  nifi:
    image: apache/nifi:1.9.2
    ports:
      # Expose the NiFi web UI on the host
      - 8080:8080
    volumes:
      # Named volume for NiFi's working data
      - data:/data
      # Mount our customised bootstrap.conf over the image's default
      - /path/to/bootstrap.conf:/opt/nifi/nifi-current/conf/bootstrap.conf

# Named volumes must be declared at the top level for docker stack deploy
volumes:
  data:

Then we need to define the bootstrap.conf. This lets us give NiFi more memory: by default, the Docker image only gives the JVM 512MB. I'd recommend giving it at least 2GB of RAM, but it's probably better to set the lower limit to 4GB if possible.
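If you don't already have a bootstrap.conf to start from, one way to get a baseline is to copy the default out of the image itself. A quick sketch, assuming the apache/nifi:1.9.2 image and the conf path used in the stack file above:

docker@host$ docker run --rm --entrypoint cat apache/nifi:1.9.2 /opt/nifi/nifi-current/conf/bootstrap.conf > bootstrap.conf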

The settings that we want to update are:

# JVM memory settings
java.arg.2=-Xms4g
java.arg.3=-Xmx8g
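You can make the edit by hand, or, assuming your copy still has the stock 512m values shown in the full file below, apply it with a quick sed one-liner (just a sketch; adjust the values to whatever suits your box):

docker@host$ sed -i 's/^java.arg.2=-Xms512m/java.arg.2=-Xms4g/; s/^java.arg.3=-Xmx512m/java.arg.3=-Xmx8g/' bootstrap.conf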

In context, this looks like the following (the lines prefixed with - are removed and the lines prefixed with + are added):

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Java command to use when running NiFi
java=java

# Username to use when running NiFi. This value will be ignored on Windows.
run.as=

# Configure where NiFi's lib and conf directories live
lib.dir=./lib
conf.dir=./conf

# How long to wait after telling NiFi to shutdown before explicitly killing the Process
graceful.shutdown.seconds=20

# Disable JSR 199 so that we can use JSP's without running a JDK
java.arg.1=-Dorg.apache.jasper.compiler.disablejsr199=true

# JVM memory settings
- java.arg.2=-Xms512m
- java.arg.3=-Xmx512m
+ java.arg.2=-Xms4g
+ java.arg.3=-Xmx8g

# Enable Remote Debugging
#java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000

java.arg.4=-Djava.net.preferIPv4Stack=true

# allowRestrictedHeaders is required for Cluster/Node communications to work properly
java.arg.5=-Dsun.net.http.allowRestrictedHeaders=true
java.arg.6=-Djava.protocol.handler.pkgs=sun.net.www.protocol

# The G1GC is still considered experimental but has proven to be very advantageous in providing great
# performance without significant "stop-the-world" delays.
java.arg.13=-XX:+UseG1GC

#Set headless mode by default
java.arg.14=-Djava.awt.headless=true

# Master key in hexadecimal format for encrypted sensitive configuration values
nifi.bootstrap.sensitive.key=

# Sets the provider of SecureRandom to /dev/urandom to prevent blocking on VMs
java.arg.15=-Djava.security.egd=file:/dev/urandom

# Requires JAAS to use only the provided JAAS configuration to authenticate a Subject, without using any "fallback" methods (such as prompting for username/password)
# Please see https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/single-signon.html, section "EXCEPTIONS TO THE MODEL"
java.arg.16=-Djavax.security.auth.useSubjectCredsOnly=true

###
# Notification Services for notifying interested parties when NiFi is stopped, started, dies
###

# XML File that contains the definitions of the notification services
notification.services.file=./conf/bootstrap-notification-services.xml

# In the case that we are unable to send a notification for an event, how many times should we retry?
notification.max.attempts=5

# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is started?
#nifi.start.notification.services=email-notification

# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is stopped?
#nifi.stop.notification.services=email-notification

# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi dies?
#nifi.dead.notification.services=email-notification

Now that this is all configured, we just need to bring the service up. We can do this by running the following command:

docker@host$ docker stack deploy -c docker-stack.yml NIFI_STACK
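The stack can take a minute or two to pull the image and start up. To keep an eye on it, you can check the service status and follow the logs (the NIFI_STACK_nifi service name is just the stack name plus the service name from the yml):

docker@host$ docker stack services NIFI_STACK
docker@host$ docker service logs -f NIFI_STACK_nifi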

And that's it. Navigate to http://localhost:8080/nifi and you should see the NiFi frontend. See you in the next post, when we'll be looking at how to set up a CSV pipeline to get data into Elasticsearch.

Until next time, enjoy NiFi!