HCC Hortonworks Community Connection

Creating a Spark program with AWS POM dependencies to load to an S3 bucket

Question by Eric Hanson · Jan 10, 2018 at 03:28 PM · Tags: spark, aws, spark2, s3

I'm using HDP 2.6.3 with Spark 2.2 (not HDP cloud), and I'm trying to write to s3 from an IntelliJ project. I have no problems writing to the s3 bucket from the shell, but when I test my app on my local machine in IntelliJ I get odd errors after adding the hadoop-aws and aws-java-sdk dependency jars, and the errors change depending on where I place those dependencies in the ordering of my POM file.

When I put the Spark dependencies at the top, I get: ERROR MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantiated. If I take the hadoop-aws dependency out and then invalidate the cache, everything runs fine except saving to s3, where I get a class not found error for org/apache/http/message/TokenParser. If I put hadoop-aws above my Spark dependencies, I get class not found errors for Spark classes instead. I have been playing around with placing the hadoop-aws dependency at different positions in my POM, but haven't been able to get it to work.

I configure s3a access by setting the fs.s3a.impl, fs.s3a.access.key, and fs.s3a.secret.key properties through sc.hadoopConfiguration.set. Again, I have no problems saving to s3 from the shell when I set these properties the same way. A sketch of that configuration follows, and my full POM file is pasted below it.

Any help would be greatly appreciated. Are there dependencies that hadoop-aws must come before or after? I wasn't aware that the ordering mattered, but apparently it does. I'm guessing there are conflicting classes between the hadoop-aws jar and one of the other Hadoop or Spark jars.
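For reference, here is roughly how I'm setting those properties in the app (a sketch; the key values and the bucket name are placeholders, not my real ones):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark2_DL_ETL")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Same properties I set in the shell, applied programmatically:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")  // placeholder
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")  // placeholder

// The kind of write that fails from IntelliJ (bucket name is made up):
spark.range(10).write.parquet("s3a://my-bucket/test-output")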

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.lendingtree.data_lake</groupId>
  <artifactId>Spark2_DL_ETL</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>My wonderfull scala app</description>
  <inceptionYear>2015</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>
  <properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.8</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spark.version>2.2.0.2.6.3.0-235</spark.version>
    <kafka.version>0.10.1</kafka.version>
    <hbase.version>1.1.2.2.6.3.0-235</hbase.version>
    <hadoop.version>2.7.3.2.6.3.0-235</hadoop.version>
    <zookeeper.version>3.4.6</zookeeper.version>
    <shc.version>1.1.0.2.6.3.0-235</shc.version>
  </properties>
  <repositories>
    <repository>
      <id>hortonworks</id>
      <name>hortonworks repo</name>
      <url>http://repo.hortonworks.com/content/repositories/releases/</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-nfs</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-auth</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-hbase-handler</artifactId>
      <version>1.2.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-spark</artifactId>
      <version>${hbase.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>${hbase.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>${hbase.version}</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.10.6</version>
    </dependency>
    <!--
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    -->
    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-core_${scala.compat.version}</artifactId>
      <version>2.4.16</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-junit_${scala.compat.version}</artifactId>
      <version>2.4.16</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.compat.version}</artifactId>
      <version>2.2.4</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <!--<arg>-make:transitive</arg>-->
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.18.1</version>
        <configuration>
          <useFile>false</useFile>
          <disableXmlReport>true</disableXmlReport>
          <!-- If you have classpath issue like NoDefClassError,... -->
          <!-- useManifestOnlyJar>false</useManifestOnlyJar -->
          <includes>
            <include>**/*Test.*</include>
            <include>**/*Suite.*</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

1 Reply


Answer by stevel · Jan 12, 2018 at 09:34 AM

There's a risk here that you are being burned by Jackson versions: the AWS SDK needs one set of Jackson jars, while Spark uses another. On a normal spark-submit everything works because Spark ships a shaded copy of its Jackson classes; the IDE doesn't do that (lovely as IntelliJ is), so the classpath conflict surfaces there. FWIW, I hit the same problem.

The workaround I use is to start the job as a normal executable via spark-submit, have it pause for a while, and then attach the IDE to it via "attach to a local process". How do you get it to wait? The simplest way is to put a sleep() in. The most flexible is to have it poll for a file's existence, sleeping for a second and retrying until the file appears. That way, all you have to do is create that file and the job sets off.
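A minimal sketch of that polling trick (the marker path here is arbitrary; pick whatever suits you):

import java.nio.file.{Files, Paths}

// Block until a marker file appears, leaving time to attach the IDE debugger.
// Create the file from a terminal (e.g. touch /tmp/go) to let the job proceed.
def awaitDebugger(marker: String = "/tmp/go"): Unit = {
  while (!Files.exists(Paths.get(marker))) {
    Thread.sleep(1000)  // poll once a second
  }
}

// Call this at the start of main(), before any Spark work:
awaitDebugger()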

