What is Storm?
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Setting up Storm
First, download the precompiled 0.9.2 release of Storm:
$ wget http://apache.mesi.com.ar/storm/apache-storm-0.9.2-incubating/apache-storm-0.9.2-incubating.tar.gz
Extract the files:
$ tar -zxvf apache-storm-0.9.2-incubating.tar.gz
Install Maven:
$ sudo apt-get install maven
If you are on a Mac:
$ brew install maven
On a Mac, also set JAVA_HOME by adding the following line to your ~/.bash_profile file:
$ export JAVA_HOME=$(/usr/libexec/java_home)
Check the Maven version to confirm that it installed correctly:
$ mvn -version
If you checked out the source version of Storm, you can build and install the Storm jars locally with Maven (this requires the pom.xml file), using the same mvn clean install command shown below for storm-starter. This doesn’t need to be done if you downloaded the compiled version, as this guide has shown. However, it’s worth noting now because you’ll be using this command to compile your projects if you modify any of the source code.
Instead of building jars for the Storm project (since we’ve downloaded the compiled version), let’s build the jar file for the storm-starter example project. First go into the storm-starter project within the apache-storm-0.9.2-incubating/examples folder:
$ cd apache-storm-0.9.2-incubating/examples/storm-starter
Now compile and build the jar files:
$ mvn clean install -DskipTests=true
It should take a few minutes, and you’ll see a lot of output. At the end of the output, Maven should report that the build succeeded.
You are now ready to run your first Storm job (called a “topology”). We are going to run the topology located in apache-storm-0.9.2-incubating/examples/storm-starter/src/jvm/storm/starter/WordCountTopology.java
Let’s run the topology first, and then go briefly into the details of what is happening.
In the storm-starter directory, issue:
$ mvn compile exec:java -Dstorm.topology=storm.starter.WordCountTopology
The whole process will take about 50 seconds, and you will see a whole bunch of output. The main thing to look for in this output is something like this:
It should occur near the middle of all the output being shown. The end of the output should be a bunch of shutdown messages, along with a success message like this:
Congratulations! You have run your first Storm topology!
The entire source is available at this gist. It has three classes:
Now, for the problems and questions:
SnapshotGet vs TupleCollectionGet in stateQuery
Configuring the parallelism of a topology
Note that in Storm’s terminology “parallelism” is specifically used to describe the so-called parallelism hint, which means the initial number of executors (threads) of a component. In this article though I use the term “parallelism” in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. I will specifically call out when “parallelism” is used in the narrow definition of Storm.
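To make the three terms concrete, here is a minimal plain-Java sketch that tallies executors and tasks for a small topology and shows how the executors would spread over the worker processes. It does not use the Storm API: the class ParallelismModel, the component "green-bolt", and all the numbers are hypothetical, and the even spread over workers is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java model of Storm's parallelism terms; it does not use the Storm API. */
class ParallelismModel {

    /** One spout or bolt: its number of executors (threads) and its number of tasks. */
    static final class Component {
        final int executors;
        final int tasks;

        Component(int executors, int tasks) {
            // A component never runs more executors than it has tasks.
            if (executors < 1 || tasks < executors) {
                throw new IllegalArgumentException("need 1 <= executors <= tasks");
            }
            this.executors = executors;
            this.tasks = tasks;
        }
    }

    /** Sum of executors over all components. */
    static int totalExecutors(Map<String, Component> topology) {
        int sum = 0;
        for (Component c : topology.values()) {
            sum += c.executors;
        }
        return sum;
    }

    /** Sum of tasks over all components. */
    static int totalTasks(Map<String, Component> topology) {
        int sum = 0;
        for (Component c : topology.values()) {
            sum += c.tasks;
        }
        return sum;
    }

    /** Largest number of executors on any one worker, assuming an even spread. */
    static int maxExecutorsPerWorker(int executors, int workers) {
        return (executors + workers - 1) / workers; // ceiling division
    }

    public static void main(String[] args) {
        // Hypothetical topology: name -> (executors, tasks).
        Map<String, Component> topology = new LinkedHashMap<>();
        topology.put("blue-spout", new Component(2, 4));
        topology.put("green-bolt", new Component(2, 4));
        topology.put("yellow-bolt", new Component(6, 12));

        int workers = 2; // hypothetical number of worker processes
        int executors = totalExecutors(topology);
        System.out.println("executors=" + executors
                + " tasks=" + totalTasks(topology)
                + " maxPerWorker=" + maxExecutorsPerWorker(executors, workers));
    }
}
```

With these made-up numbers the topology has 10 executors running 20 tasks, so each of the 2 workers would host 5 executors.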
The following table gives an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the table lists only some of them. Storm currently has the following order of precedence for configuration settings: external component-specific configuration > internal component-specific configuration > topology-specific configuration > storm.yaml > defaults.yaml.
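The precedence order just described can be pictured as a lookup that walks the configuration sources from highest to lowest priority and returns the first value it finds. The sketch below is plain Java, not Storm's own implementation: the class ConfigPrecedence and the sample values are hypothetical, and "topology.workers" is used only as an example key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model of "first source that defines the key wins". */
class ConfigPrecedence {

    /** Returns the value from the highest-priority source that defines the key, or null. */
    static Object resolve(String key, List<Map<String, Object>> sourcesHighToLow) {
        for (Map<String, Object> source : sourcesHighToLow) {
            if (source.containsKey(key)) {
                return source.get(key);
            }
        }
        return null; // no source defines the key
    }

    public static void main(String[] args) {
        // Hypothetical contents of each configuration source.
        Map<String, Object> externalComponent = new HashMap<>();
        Map<String, Object> internalComponent = new HashMap<>();
        Map<String, Object> topologySpecific = new HashMap<>();
        Map<String, Object> stormYaml = new HashMap<>();
        Map<String, Object> defaultsYaml = new HashMap<>();

        defaultsYaml.put("topology.workers", 1);
        topologySpecific.put("topology.workers", 4);

        // Same order as in the text: highest priority first.
        List<Map<String, Object>> sources = Arrays.asList(
                externalComponent, internalComponent, topologySpecific, stormYaml, defaultsYaml);

        // The topology-specific value overrides the one from defaults.yaml.
        System.out.println(resolve("topology.workers", sources)); // prints 4
    }
}
```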
How to change the parallelism of a running topology
A nifty feature of Storm is that you can increase or decrease the number of worker processes and/or executors without being required to restart the cluster or the topology. The act of doing so is called rebalancing.
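One point worth keeping in mind when rebalancing: the number of tasks of a component stays fixed for the lifetime of the topology, so changing the number of executors only changes how many tasks each executor runs. The plain-Java sketch below models that. The class RebalanceModel is hypothetical, and the assumption that tasks are split as evenly as possible across executors is mine, made for illustration.

```java
import java.util.Arrays;

/** Illustrative model: a fixed set of tasks redistributed over a new number of executors. */
class RebalanceModel {

    /** Number of tasks each executor runs when tasks are spread as evenly as possible. */
    static int[] tasksPerExecutor(int tasks, int executors) {
        if (executors < 1 || executors > tasks) {
            throw new IllegalArgumentException("need 1 <= executors <= tasks");
        }
        int[] load = new int[executors];
        for (int i = 0; i < executors; i++) {
            // The first (tasks % executors) executors take one extra task.
            load[i] = tasks / executors + (i < tasks % executors ? 1 : 0);
        }
        return load;
    }

    public static void main(String[] args) {
        int tasks = 12; // hypothetical task count, fixed when the topology is submitted

        // Before rebalancing: 6 executors, 2 tasks each.
        System.out.println(Arrays.toString(tasksPerExecutor(tasks, 6)));

        // After rebalancing to 10 executors: two executors still run 2 tasks.
        System.out.println(Arrays.toString(tasksPerExecutor(tasks, 10)));
    }
}
```

In this model, asking for more executors than there are tasks is rejected, because the extra executors would have nothing to run.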
You have two options to rebalance a topology: use the Storm web UI, or use the CLI tool storm rebalance.
Here is an example of using the CLI tool:
# Reconfigure the topology "mytopology" to use 5 worker processes,
# the spout "blue-spout" to use 3 executors and
# the bolt "yellow-bolt" to use 10 executors.
$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
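If you want to script rebalancing from Java, a small helper can assemble the same command line. This is only a sketch: the class RebalanceCommand is hypothetical, it uses just the -n and -e options shown in the example above, and it only builds the argument list. Actually running it (for instance by passing the list to ProcessBuilder) assumes the storm client is on your PATH.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds the argument list for a "storm rebalance" invocation; does not execute it. */
class RebalanceCommand {

    static List<String> build(String topology, int workers, Map<String, Integer> executors) {
        List<String> cmd = new ArrayList<>();
        cmd.add("storm");
        cmd.add("rebalance");
        cmd.add(topology);
        cmd.add("-n");
        cmd.add(Integer.toString(workers));
        // One "-e component=count" pair per component, in the map's iteration order.
        for (Map.Entry<String, Integer> entry : executors.entrySet()) {
            cmd.add("-e");
            cmd.add(entry.getKey() + "=" + entry.getValue());
        }
        return cmd;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the components in insertion order.
        Map<String, Integer> executors = new LinkedHashMap<>();
        executors.put("blue-spout", 3);
        executors.put("yellow-bolt", 10);

        List<String> cmd = build("mytopology", 5, executors);
        // prints: storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
        System.out.println(String.join(" ", cmd));
    }
}
```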
My personal impression is that Storm is a very promising tool. On the one hand, I like its clean and elegant design; on the other, I was delighted to find that a young open source tool can still have excellent documentation. In this article I tried to summarize my own understanding of the parallelism of topologies, which may or may not be 100% correct. Feel free to let me know if there are any mistakes in the description above!