HCC Hortonworks Community Connection

Guidelines for building Streaming Applications

Created by Ambud Sharma, Nov 19, 2016 at 09:24 PM · edited Nov 28, 2016 at 05:23 PM

Short Description:

Set of critical points to consider when developing a Streaming Application

Article

The objective of this article is to have a discussion and capture critical points to consider when developing a streaming application using Storm, Spark, or any other streaming technology. The intent is to incorporate additional points from discussion and comments so that this article evolves into a comprehensive guideline for stream engineering.

Here are a few points synthesized from my experience developing streaming applications using Storm and Flume:

1. Select the Streaming Application's SLA requirements

Most Streaming Applications can be classified based on their SLA requirements into:

  • At-most once (data drops are acceptable)
  • At-least once (data drops are NOT acceptable)
  • Exactly once (idempotent computations)

These requirements tend to heavily drive design decisions, such as whether or not to ack tuples (in Storm), or whether downsampling is acceptable (e.g., in sensor data processing).
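As a minimal sketch of the at-least-once model above (hypothetical names, not any specific framework's API): a source keeps every emitted message pending until the consumer acks it, so un-acked messages can be replayed. Drops are impossible, but duplicates are not, which is why exactly-once additionally requires idempotent computations.

```python
class AtLeastOnceSource:
    """Toy source illustrating at-least-once delivery via ack-and-replay."""

    def __init__(self, messages):
        self.queue = list(messages)
        self.pending = {}          # msg_id -> message awaiting ack
        self.next_id = 0

    def emit(self):
        """Hand out the next message, remembering it until it is acked."""
        if not self.queue:
            return None
        msg = self.queue.pop(0)
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = msg
        return msg_id, msg

    def ack(self, msg_id):
        """Consumer confirms processing; the message can be forgotten."""
        self.pending.pop(msg_id, None)

    def replay_unacked(self):
        """Re-queue everything that was emitted but never acked."""
        self.queue.extend(self.pending.values())
        self.pending.clear()
```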

2. Minimal and careful data replays

In guaranteed-processing SLA use cases, data replay must be minimized by application logic to avoid situations like replay loops and heavy back pressure. This situation is seen in poorly designed topologies where incorrect exception handling causes a single data point to be replayed indefinitely.
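One common guard against the infinite-replay loop described above is to bound retries per message and divert poison messages to a dead-letter stream instead of re-queuing them forever. A hypothetical sketch:

```python
MAX_RETRIES = 3  # hypothetical cap; tune to the use case

def process_with_bounded_replay(messages, handler):
    """Process messages, replaying failures at most MAX_RETRIES times."""
    retries = {}        # message -> attempts so far
    dead_letter = []    # messages that exhausted their retries
    queue = list(messages)
    processed = []
    while queue:
        msg = queue.pop(0)
        try:
            processed.append(handler(msg))
        except Exception:
            retries[msg] = retries.get(msg, 0) + 1
            if retries[msg] >= MAX_RETRIES:
                dead_letter.append(msg)   # stop replaying; park it
            else:
                queue.append(msg)         # bounded replay
    return processed, dead_letter
```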

3. Minimize processing latencies

The latency of processing each individual event/tuple adds up to the cumulative processing latency. A streaming application must therefore be engineered for low latency, using appropriate technologies for local in-memory caching, and micro-batching where possible, to minimize network latencies and amortize their cost over several calls.

4. Tradeoffs between Throughput and Latency

Performance tradeoffs are usually between throughput and latency, and can be tuned using micro-batching (Storm Trident or Spark Streaming). A micro-batching approach may use time-based or size-based batches, both of which have caveats.

Size-based micro-batching may add latency, since the buffer must fill up before processing is applied or transfers are executed; it is therefore subject to event velocity. This model cannot be used if the application has hard latency limits; however, if a sustained minimum event rate (velocity) is guaranteed, then micro-batching can be applied while preserving acceptable latencies.

Time-based micro-batching is subject to stability and performance issues if spikes in volume and velocity are not accounted for. Unlike the size-based model, time-based micro-batching can satisfy a use case's hard latency constraints.

A hybrid micro-batching model can also be deployed, which uses both size-based and time-based batching with hard limits on each, to guarantee low latency and high throughput while providing stability.
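The hybrid model above can be sketched as a batcher that flushes when either limit is hit: the batch reaches `max_size` (throughput) or `max_age` seconds have passed since its first event (latency). This is an illustrative sketch, not any framework's API; the clock is injected so the behavior is testable without real waiting.

```python
import time

class HybridBatcher:
    """Flush a batch on size limit OR age limit, whichever comes first."""

    def __init__(self, max_size, max_age, clock=time.monotonic):
        self.max_size = max_size    # hard throughput-side limit
        self.max_age = max_age      # hard latency-side limit (seconds)
        self.clock = clock
        self.buffer = []
        self.opened_at = None

    def add(self, event):
        """Buffer an event; return a full batch if the size limit was hit."""
        if not self.buffer:
            self.opened_at = self.clock()   # batch age starts at first event
        self.buffer.append(event)
        if len(self.buffer) >= self.max_size:
            return self.flush()
        return None

    def poll(self):
        """Flush on timeout even if the batch is not full."""
        if self.buffer and self.clock() - self.opened_at >= self.max_age:
            return self.flush()
        return None

    def flush(self):
        batch, self.buffer = self.buffer, []
        return batch
```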

5. Aggregation Bottlenecks

Use cases where streaming aggregations are performed must:

  • account for event volume at the pivot point of aggregation (the aggregation key) to avoid bottlenecks and pipeline back pressure
  • account for the distribution of event volume across aggregation keys
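A simple way to reason about both points is to measure how event volume is distributed across the aggregation keys, so a hot key that would funnel most traffic through a single aggregator can be spotted. A hypothetical sketch (field names are illustrative):

```python
from collections import Counter

def key_distribution(events, key):
    """Fraction of total event volume carried by each aggregation key."""
    counts = Counter(e[key] for e in events)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def hot_keys(events, key, threshold=0.5):
    """Keys receiving more than `threshold` of all traffic (skew candidates)."""
    return [k for k, share in key_distribution(events, key).items()
            if share > threshold]
```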

6. Polling and Event Sourcing

Polling and Event Sourcing are two prominent design patterns for updating configuration and logic in a streaming pipeline. Logic may include, but is not limited to: dynamic filtering, caching for data enrichment, and machine learning models.

In a polling-based design, these updates are polled for at a predetermined frequency in an out-of-band fashion (a separate update thread).

In an event-sourcing-based design, these updates are delivered to the processing component as events, and the processing component has a separate code branch (an if statement) to handle these kinds of events and apply the logic update.

Using the event-sourcing approach allows for a lock-free design, whereas a polling-based design requires locking (at some point) and concurrent data structures to guarantee consistency.
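The lock-free property follows from control events arriving in-band, on the same thread as data events. A hypothetical dynamic-filter processor illustrates the separate code branch; a polling design would instead need a second thread plus synchronization around the `blocked` set.

```python
class FilterProcessor:
    """Event-sourced dynamic filter: rule updates arrive as stream events."""

    def __init__(self):
        # Only ever touched by the processing thread, so no lock is needed.
        self.blocked = set()

    def process(self, event):
        if event.get("type") == "control":
            # In-band logic update: same thread as the data path.
            self.blocked.add(event["block"])
            return None
        if event["source"] in self.blocked:
            return None          # filtered out by the current rules
        return event             # passed through unchanged
```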

7. Error Handling

Error handling must be given thorough design thought, as incorrect error handling will lead to application downtime and performance issues. Recoverable errors, such as network unavailability or a failover, may allow for replays; however, unrecoverable errors must be written to a separate stream without blocking the primary execution pipeline. NiFi handles this very seamlessly using Success and Failure streams.
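The success/failure split can be sketched generically (this is an illustration of the pattern, not NiFi's API): each event flows through a transform, successful results go to the success stream, and failures are routed to a failure stream with the error attached, so one bad record never blocks the pipeline.

```python
def route(events, transform):
    """Split events into success and failure streams around a transform."""
    success, failure = [], []
    for e in events:
        try:
            success.append(transform(e))
        except Exception as err:
            # Park the bad record with its error instead of blocking/replaying.
            failure.append({"event": e, "error": str(err)})
    return success, failure
```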

Tags: Kafka, NiFi, Spark, Storm, faq, streaming

