I want to import the data available at this link "http://www.cricbuzz.com/cricket-series/2489/england-tour-of-india-2016-17/matches" into Spark using PySpark. Is there a way to download it directly into Spark?
Answer by tsharma · Jul 28, 2017 at 05:47 AM
If you want static data, you can fetch it with the third-party Python requests library or the standard-library urllib2 module (urllib.request in Python 3), then parse it and convert it to a Spark RDD.
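A minimal sketch of the static approach, assuming Python 3 (so urllib.request rather than urllib2) and only the standard library for fetching and parsing. The layout of the cricbuzz page is not known here, so extracting every link's text is just a placeholder for whatever page-specific parsing you need; LinkTextParser, parse_links, fetch_html and to_rdd are hypothetical helper names, and the User-Agent header is an assumption in case the site rejects the default one.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

URL = "http://www.cricbuzz.com/cricket-series/2489/england-tour-of-india-2016-17/matches"


class LinkTextParser(HTMLParser):
    """Collects (href, text) pairs for <a> tags; a placeholder for real parsing."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are ignored.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and skip links with no visible text.
            text = " ".join("".join(self._text).split())
            if text:
                self.rows.append((self._href, text))
            self._href = None


def parse_links(html):
    """Parse an HTML string into a list of (href, text) tuples."""
    parser = LinkTextParser()
    parser.feed(html)
    return parser.rows


def fetch_html(url=URL, timeout=10):
    """Download the page on the driver and return it as text."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def to_rdd(rows):
    """Turn the parsed rows into a Spark RDD."""
    # Imported lazily so the parsing code can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    return sc.parallelize(rows)
```

With Spark available, something like to_rdd(parse_links(fetch_html())) would give an RDD of (href, text) pairs. Note that the download happens on the driver, not on the executors.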
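But if you want to build a streaming application, then as per the official documentation:
Spark Streaming provides two categories of built-in streaming sources: basic sources (file systems and socket connections), available directly in the StreamingContext API, and advanced sources (Kafka, Flume, Kinesis, etc.), available through extra utility classes. A web page is not one of these, so its data has to be routed through one of them.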
If you're using Kafka:
You need a Producer that reads the data from the URL and writes it to a Kafka topic.
See http://saurzcode.in/2015/02/kafka-producer-using-twitter-stream/ for reference.
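The producer side could look roughly like the sketch below. It assumes Python 3 and the third-party kafka-python package; the topic name, broker address and polling interval are made-up values, and fetch_bytes, make_producer, publish_once and run are hypothetical helper names. It simply re-publishes the whole page body on every poll, which may or may not be what you want.

```python
import time
from urllib.request import Request, urlopen

URL = "http://www.cricbuzz.com/cricket-series/2489/england-tour-of-india-2016-17/matches"
TOPIC = "cricket-matches"  # hypothetical topic name


def fetch_bytes(url=URL, timeout=10):
    """Download the page and return the raw response body as bytes."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()


def make_producer(bootstrap_servers="localhost:9092"):
    """Create a kafka-python producer (the broker address is an assumption)."""
    # Imported lazily so the rest of the module works without kafka-python.
    from kafka import KafkaProducer

    return KafkaProducer(bootstrap_servers=bootstrap_servers)


def publish_once(producer, topic=TOPIC, fetch=fetch_bytes):
    """Fetch the page once, send it to the topic, and return the payload size."""
    payload = fetch()
    producer.send(topic, payload)  # no serializer configured, so value must be bytes
    producer.flush()               # block until the message has been handed off
    return len(payload)


def run(poll_seconds=60):
    """Poll the URL forever, publishing one message per poll."""
    producer = make_producer()
    while True:
        publish_once(producer)
        time.sleep(poll_seconds)
```

Calling run() on a machine that can reach both the site and the broker starts the polling loop. In practice you would probably parse the page first and publish one message per match rather than the raw HTML.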
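You can then use KafkaInputDStream in Spark as an abstraction over the Kafka consumer to create a Spark DStream. From PySpark (Spark 2.x) you do not instantiate it yourself; the stream is created through KafkaUtils in pyspark.streaming.kafka.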
The link below should give you some idea:
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
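For the consuming side in PySpark, a rough sketch under these assumptions: Spark 2.x with the spark-streaming-kafka-0-8 package supplied at submit time (the pyspark.streaming.kafka module was removed in Spark 3.0), a hypothetical topic and broker address, and a 10-second batch interval. It uses the direct (receiver-less) stream rather than the receiver-based one, and payload_size, build_stream and main are illustrative names only.

```python
TOPIC = "cricket-matches"   # hypothetical topic name
BROKERS = "localhost:9092"  # hypothetical broker address


def payload_size(record):
    """Each Kafka record arrives as a (key, value) pair; return the value's length."""
    _key, value = record
    return len(value) if value is not None else 0


def build_stream(ssc, topic=TOPIC, brokers=BROKERS):
    """Create a DStream of (key, value) records read from the Kafka topic."""
    # Spark 2.x only; imported lazily so payload_size is usable without Spark.
    from pyspark.streaming.kafka import KafkaUtils

    return KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers}
    )


def main():
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="UrlToKafkaToSpark")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    stream = build_stream(ssc)
    # Print the size of each received page; replace with real parsing logic.
    stream.map(payload_size).pprint()

    ssc.start()
    ssc.awaitTermination()
```

To try it, call main() from a script launched with spark-submit and the matching Kafka integration package. On Spark 3.x you would need Structured Streaming's Kafka source instead.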
© 2011-2017 Hortonworks Inc. All Rights Reserved.
Hadoop, Falcon, Atlas, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie and the Hadoop elephant logo are trademarks of the Apache Software Foundation.