HCC Hortonworks Community Connection

How to import data from a URL through PySpark?

Question by Bala Vignesh N V · Jul 17, 2017 at 05:12 PM · Tags: Spark, hadoop, pyspark, data-ingestion, data

I want to import the data available at "http://www.cricbuzz.com/cricket-series/2489/england-tour-of-india-2016-17/matches" into Spark using PySpark. Is there a way to download it directly into Spark?


1 Reply


Answer by tsharma · Jul 28, 2017 at 05:47 AM

@Bala Vignesh N V

If you want static data, you can fetch it with Python's requests or urllib2 modules (urllib.request in Python 3), then parse it and convert it to a Spark RDD.
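A minimal sketch of the static approach, assuming Python 3 (where `urllib.request` replaces `urllib2`) and an existing `SparkContext` passed in as `sc`. The helper names `TextExtractor`, `fetch_html`, `extract_text` and `url_to_rdd` are hypothetical, and collecting every visible text chunk is only one possible way to parse the page; a real job would probably target specific elements.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects the visible text chunks of an HTML page, one per list entry."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def fetch_html(url, timeout=30):
    """Download the page on the driver and return it as a string."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_text(html):
    """Parse an HTML string into a list of non-empty text chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def url_to_rdd(sc, url):
    """Fetch and parse on the driver, then distribute the chunks as an RDD.

    sc is assumed to be an existing pyspark SparkContext.
    """
    return sc.parallelize(extract_text(fetch_html(url)))
```

Note that the download happens once, on the driver; Spark only takes over after `sc.parallelize`, so this suits pages small enough to fit in driver memory.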

But if you want to build a streaming application, then as per the official documentation:

Spark Streaming provides two categories of built-in streaming sources.

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.
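As a rough illustration of a basic source, the sketch below reads text lines from a socket connection. It assumes `ssc` is an existing `pyspark.streaming.StreamingContext`; the function names, the default host and port, and the comma-separated line format are all hypothetical choices made for the example.

```python
def parse_line(line):
    """Split one comma-separated text line into stripped fields."""
    return [field.strip() for field in line.split(",")]


def build_socket_stream(ssc, host="localhost", port=9999):
    """Return a DStream of parsed records read from a TCP socket.

    ssc is assumed to be an existing pyspark.streaming.StreamingContext.
    """
    lines = ssc.socketTextStream(host, port)  # basic source from the StreamingContext API
    return lines.map(parse_line)
```

A typical driver would create the context with `StreamingContext(sc, 10)` for 10-second batches, call `build_socket_stream(ssc).pprint()`, and then run `ssc.start()` followed by `ssc.awaitTermination()`.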

If you're using Kafka:

You need a producer that reads the data from the URL and writes it to a topic.

See http://saurzcode.in/2015/02/kafka-producer-using-twitter-stream/ for reference.
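One way such a producer could look in Python is sketched below. It assumes the `producer` argument behaves like a `KafkaProducer` from the third-party kafka-python package (only its `send` and `flush` methods are used); the function names, the polling interval, and the idea of re-fetching the whole page on each poll are assumptions for illustration, not something the linked post prescribes.

```python
import time
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download the raw response body for url."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def publish_url(producer, topic, url, fetch=fetch_bytes):
    """Fetch url once and send the body to topic; returns the payload size."""
    payload = fetch(url)
    producer.send(topic, value=payload)
    producer.flush()  # block until the message has actually been sent
    return len(payload)


def poll_url(producer, topic, url, interval_seconds=60, max_polls=None,
             fetch=fetch_bytes, sleep=time.sleep):
    """Publish the URL body repeatedly; max_polls=None means run forever."""
    polls = 0
    while max_polls is None or polls < max_polls:
        publish_url(producer, topic, url, fetch=fetch)
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
    return polls
```

With kafka-python installed, `poll_url(KafkaProducer(bootstrap_servers="localhost:9092"), "matches", url)` would start the loop; the broker address and the topic name `matches` are placeholders.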

You can then use KafkaInputDStream in Spark, an abstraction over the Kafka consumer, to create a Spark DStream.

The post below should give you some idea:

http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
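On the consuming side, PySpark exposes the Kafka integration through `KafkaUtils` rather than through `KafkaInputDStream` directly. The sketch below assumes Spark 1.x or 2.x with the spark-streaming-kafka-0-8 package on the classpath (the `pyspark.streaming.kafka` module was removed in Spark 3.0), an existing `StreamingContext` passed in as `ssc`, and a placeholder broker address; the function names are hypothetical.

```python
def decode_record(record):
    """Kafka records arrive as (key, value) pairs; keep only the value."""
    _key, value = record
    return value


def build_kafka_stream(ssc, topic, brokers="localhost:9092"):
    """Return a DStream of message values read from a Kafka topic.

    ssc is assumed to be an existing pyspark.streaming.StreamingContext.
    """
    # Imported lazily because this module only exists in Spark 1.x/2.x.
    from pyspark.streaming.kafka import KafkaUtils

    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers})
    return stream.map(decode_record)
```

The resulting DStream can then be parsed and processed like any other, for example with `map`, `foreachRDD` or `pprint`.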




Hortonworks - Develops, Distributes and Supports Open Enterprise Hadoop.

© 2011-2017 Hortonworks Inc. All Rights Reserved.
Hadoop, Falcon, Atlas, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie and the Hadoop elephant logo are trademarks of the Apache Software Foundation.