This article extends the great work by @Ali Bajwa: Sample HDF/NiFi flow to Push Tweets into Solr/Banana, HDFS/Hive.
I have included the complete notebook on my GitHub site, which can be found here.
Step 1 - Follow Ali's tutorial to establish an Apache Solr collection called "tweets"
Step 2 - Verify the version of Apache Spark being used, and visit the Solr-Spark connector site. The key is to match the version of Spark to the version of the Solr-Spark connector. In the example below, the version of Spark is 2.2.0, and the connector version is 3.4.4
%spark2
sc
sc.version

res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@617d134a
res1: String = 2.2.0.2.6.4.0-91
Step 3 - Include the Solr-Spark dependency in Zeppelin. Important note: This needs to be run before the Spark Context has been initialized.
%dep
z.load("com.lucidworks.spark:spark-solr:jar:3.4.4")
// Must be run before the Spark interpreter (%spark2) is initialized
// Hint: put this paragraph before any Spark code and restart the Zeppelin interpreter
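To confirm the connector actually made it onto the interpreter's classpath after the restart, a quick check such as the one below can help. This is a hypothetical sanity check, not one of the original steps; it assumes the spark-solr data source is implemented by the class solr.DefaultSource.

%spark2
// Hypothetical sanity check (not in the original steps): if the spark-solr jar was loaded,
// this should resolve without throwing a ClassNotFoundException.
Class.forName("solr.DefaultSource")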
Step 4 - Run the Solr query and return the results into a Spark DataFrame. Note: the ZooKeeper host string might need to use fully qualified hostnames:
"zkhost" -> "host-1.domain.com:2181,host-2.domain.com:2181,host-3.domain.com:2181/solr",
%spark2
val options = Map(
  "collection" -> "tweets",
  "zkhost" -> "localhost:2181/solr"
  // "query" -> "Keyword, 'More Keywords'"
)
val df = spark.read.format("solr").options(options).load
df.cache()
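Rather than loading the whole collection, spark-solr can also push filtering down to Solr. The sketch below is a hedged example of that: the query and fields options limit what Solr returns, and the field names (text_t, created_at, screen_name_s) are assumptions that must be matched to your actual tweets schema (check df.printSchema()).

%spark2
// Hedged example: push a Solr query down and return only a few fields.
// The field names here are assumptions; adjust them to your schema.
val filteredOptions = Map(
  "collection" -> "tweets",
  "zkhost" -> "localhost:2181/solr",
  "query" -> "text_t:hadoop",
  "fields" -> "id,created_at,screen_name_s,text_t"
)
val hadoopTweets = spark.read.format("solr").options(filteredOptions).load
hadoopTweets.show(10, false)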
Step 5 - Review results of the Solr query
%spark2
df.count()
df.printSchema()
df.take(1)
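Once the schema looks right, the DataFrame behaves like any other Spark table. As a minimal follow-on sketch (not part of the original steps, and assuming a screen_name_s field exists in the collection - verify with df.printSchema()), you can register a temp view and query it with Spark SQL:

%spark2
// Minimal sketch: register the Solr results as a temp view and aggregate with Spark SQL.
// screen_name_s is an assumed field name; replace it with one from your schema.
df.createOrReplaceTempView("tweets_df")
spark.sql("""
  SELECT screen_name_s, COUNT(*) AS tweet_count
  FROM tweets_df
  GROUP BY screen_name_s
  ORDER BY tweet_count DESC
  LIMIT 10
""").show()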