So, you work with data, and I mean a lot of data, the kind you can't fit in memory. What are your options? Most likely you'll need a data streaming solution. One of these is NiFi, which is what this post will focus on. You may also have heard of a technology called Docker, which is very, very powerful: it lets you run multiple concurrent processes in a platform-independent way. More on Docker in another post.
So what is NiFi? NiFi is an Apache Software Foundation project designed for streaming data sets through multiple stream processors, which lets you manipulate the data into a standardised format.
For this tutorial, I will make a few assumptions.
- You are running a Linux Operating System
- You know what distribution of Linux you are running
- You can follow an install guide
Firstly, follow this guide to install Docker on your Linux machine: https://docs.docker.com/install/
We now need to initialise a Docker swarm so that we can create internal networks and services. To do this, run the following command:
docker@host$ docker swarm init
This initialises the swarm and generates a join token, which you can use to hook in more boxes, but again, that can be covered in another post.
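If you're not sure whether the box is already part of a swarm (running the init command a second time just gives you an error), a small guard along these lines might help. This is only a sketch: the function name `ensure_swarm` is my own invention rather than a Docker command, and it assumes the `docker` CLI is on your path.

```shell
#!/bin/sh
# Hypothetical helper: only run "docker swarm init" if this node is not already in a swarm.
ensure_swarm() {
    # Ask the daemon for the local swarm state, e.g. "active" or "inactive"
    state=$(docker info --format '{{.Swarm.LocalNodeState}}')
    if [ "$state" = "active" ]; then
        echo "swarm already active, nothing to do"
    else
        docker swarm init
    fi
}
```

Calling `ensure_swarm` after sourcing the snippet either leaves an existing swarm alone or initialises a new one.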
Next, we need to define a few files, the first being docker-stack.yml:
    version: '3'
    services:
      nifi:
        image: apache/nifi:1.9.2
        ports:
          - 8080:8080
        volumes:
          - data:/data
          - /path/to/bootstrap.conf:/opt/nifi/nifi-current/conf/bootstrap.conf
    # Named volumes used by a service must also be declared at the top level
    volumes:
      data:
Then we need to define bootstrap.conf, which lets us give NiFi more memory. By default the Docker image only gives Java a 512 MB heap. I'd recommend giving it at least 2 GB of RAM, but it's probably better to set the lower limit to 4 GB if possible.
The settings that we want to update are:
    # JVM memory settings
    java.arg.2=-Xms4g
    java.arg.3=-Xmx8g
In context, the change looks like this (lines starting with - are removed, lines starting with + are added):
      #
      # Licensed to the Apache Software Foundation (ASF) under one or more
      # contributor license agreements. See the NOTICE file distributed with
      # this work for additional information regarding copyright ownership.
      # The ASF licenses this file to You under the Apache License, Version 2.0
      # (the "License"); you may not use this file except in compliance with
      # the License. You may obtain a copy of the License at
      #
      # http://www.apache.org/licenses/LICENSE-2.0
      #
      # Unless required by applicable law or agreed to in writing, software
      # distributed under the License is distributed on an "AS IS" BASIS,
      # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      # See the License for the specific language governing permissions and
      # limitations under the License.
      #

      # Java command to use when running NiFi
      java=java

      # Username to use when running NiFi. This value will be ignored on Windows.
      run.as=

      # Configure where NiFi's lib and conf directories live
      lib.dir=./lib
      conf.dir=./conf

      # How long to wait after telling NiFi to shutdown before explicitly killing the Process
      graceful.shutdown.seconds=20

      # Disable JSR 199 so that we can use JSP's without running a JDK
      java.arg.1=-Dorg.apache.jasper.compiler.disablejsr199=true

      # JVM memory settings
    - java.arg.2=-Xms512m
    - java.arg.3=-Xmx512m
    + java.arg.2=-Xms4g
    + java.arg.3=-Xmx8g

      # Enable Remote Debugging
      #java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000

      java.arg.4=-Djava.net.preferIPv4Stack=true

      # allowRestrictedHeaders is required for Cluster/Node communications to work properly
      java.arg.5=-Dsun.net.http.allowRestrictedHeaders=true
      java.arg.6=-Djava.protocol.handler.pkgs=sun.net.www.protocol

      # The G1GC is still considered experimental but has proven to be very advantageous in providing great
      # performance without significant "stop-the-world" delays.
      java.arg.13=-XX:+UseG1GC

      #Set headless mode by default
      java.arg.14=-Djava.awt.headless=true

      # Master key in hexadecimal format for encrypted sensitive configuration values
      nifi.bootstrap.sensitive.key=

      # Sets the provider of SecureRandom to /dev/urandom to prevent blocking on VMs
      java.arg.15=-Djava.security.egd=file:/dev/urandom

      # Requires JAAS to use only the provided JAAS configuration to authenticate a Subject, without using any "fallback" methods (such as prompting for username/password)
      # Please see https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/single-signon.html, section "EXCEPTIONS TO THE MODEL"
      java.arg.16=-Djavax.security.auth.useSubjectCredsOnly=true

      ###
      # Notification Services for notifying interested parties when NiFi is stopped, started, dies
      ###

      # XML File that contains the definitions of the notification services
      notification.services.file=./conf/bootstrap-notification-services.xml

      # In the case that we are unable to send a notification for an event, how many times should we retry?
      notification.max.attempts=5

      # Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is started?
      #nifi.start.notification.services=email-notification

      # Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is stopped?
      #nifi.stop.notification.services=email-notification

      # Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi dies?
      #nifi.dead.notification.services=email-notification
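If you'd rather not edit the file by hand, the two heap lines could be patched with sed. The following is a minimal sketch, assuming you already have a local copy of bootstrap.conf; the function name `set_nifi_heap` is hypothetical and not part of NiFi or Docker.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the JVM heap lines in a local bootstrap.conf.
# Usage: set_nifi_heap <file> <min heap> <max heap>
set_nifi_heap() {
    conf="$1"; xms="$2"; xmx="$3"
    # Write to a temp file and move it into place, which works with any POSIX sed
    sed -e "s/^java\.arg\.2=.*/java.arg.2=-Xms${xms}/" \
        -e "s/^java\.arg\.3=.*/java.arg.3=-Xmx${xmx}/" \
        "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
}
```

For the values used in this post, that would be `set_nifi_heap /path/to/bootstrap.conf 4g 8g`. Every other line in the file is left as it was.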
Now that this is all configured, we just need to bring the service up. We can do this by running the following command:
docker@host$ docker stack deploy -c docker-stack.yml NIFI_STACK
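The deploy command returns as soon as the service has been scheduled, and NiFi itself can take a little while to start, so the frontend may not answer straight away. Here is a rough sketch of a polling loop using curl; the function name `wait_for_nifi` and its arguments are my own, and it assumes curl is installed.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers or we run out of attempts.
# Usage: wait_for_nifi <url> <max attempts> <seconds between attempts>
wait_for_nifi() {
    url="$1"; attempts="$2"; delay="$3"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s: quiet, -f: treat HTTP errors (4xx/5xx) as failure, -o /dev/null: discard the body
        if curl -sf -o /dev/null "$url"; then
            echo "NiFi is answering at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "gave up waiting for $url" >&2
    return 1
}
```

With the port mapping from docker-stack.yml, a call such as `wait_for_nifi http://localhost:8080/nifi 30 10` would try for roughly five minutes. If it gives up, `docker stack services NIFI_STACK` shows whether the service has any running replicas.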
And that's it. Navigate to http://localhost:8080/nifi and you should be able to see the NiFi frontend. See you in the next post, where we'll be looking at how to set up a CSV pipeline to get data into Elasticsearch.
Until next time, enjoy NiFi!