This is a summary of why the shading in this project is done the way it is now.
If anyone has suggestions or hints on how this can be done better, I’m really curious what the ‘right’ way of doing this is.
The basic structure of this project is a module that contains the functionality, plus a set of ‘UDFs’ that wrap this functionality so that it can be used in external processing frameworks (like Pig, Flink, Hive, etc.).
This library and the UDFs should be easy to use for all downstream users that want to use this in their projects.
Some of the dependencies (Antlr4, Spring and SnakeYaml) have proven to be problematic for downstream users who need different versions of these in the same application.
So for these dependencies only, we include the classes we use in the main jar and relocate them.
In the pom.xml:
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <minimizeJar>true</minimizeJar>
        <createDependencyReducedPom>true</createDependencyReducedPom>
        <relocations>
          <relocation>
            <pattern>org.springframework</pattern>
            <shadedPattern>nl.basjes.shaded.org.springframework</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.antlr</pattern>
            <shadedPattern>nl.basjes.shaded.org.antlr</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.yaml.snakeyaml</pattern>
            <shadedPattern>nl.basjes.shaded.org.yaml.snakeyaml</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
      <executions>
        <execution>
          <id>inject-problematic-dependencies</id>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <artifactSet>
              <includes>
                <include>org.antlr:antlr4-runtime</include>
                <include>org.springframework:spring-core</include>
                <include>org.yaml:snakeyaml</include>
              </includes>
            </artifactSet>
          </configuration>
        </execution>
      </executions>
    </plugin>
It turns out that a shaded jar still contains the original pom.xml, which still declares the dependencies that were shaded. As a consequence, downstream users (like the UDFs in this project) still pull in the entire set of original dependencies (not used by the code) in addition to the shaded versions (used by the code).
This is a known problem in the Maven shade plugin: https://issues.apache.org/jira/browse/MSHADE-36
For which I’ve put up a pull request: https://github.com/apache/maven-shade-plugin/pull/25
This way building an external project no longer includes things like Antlr a second time.
With maven-shade-plugin version 3.3.0 (2022-03-24) this is now a built-in feature.
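A minimal sketch of what enabling that built-in feature might look like. The parameter name `useDependencyReducedPomInJar` is given here as I recall it from the plugin documentation (listed as available since 3.3.0) and is not taken from this project's pom.xml, so please verify it against the plugin docs before relying on it.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.3.0</version>
  <configuration>
    <!-- Generate a pom.xml without the dependencies that were shaded into the jar -->
    <createDependencyReducedPom>true</createDependencyReducedPom>
    <!-- Assumption: this parameter puts the reduced pom.xml (instead of the
         original one) under META-INF/maven inside the shaded jar -->
    <useDependencyReducedPomInJar>true</useDependencyReducedPomInJar>
  </configuration>
</plugin>
```

The relocations and the artifactSet from the configuration shown earlier would stay as they are; only the extra parameter is added.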
So after solution 2 everything is fine for external projects that use the created jar file, because they look at the pom.xml inside that jar file.
The remaining problem is that any other module in this (multi-module) project will look at the original pom.xml instead of the cleaned one in the jar file.
As a consequence, the Hive UDF jar file contains both the relocated and the original copy of the same class:
    $ unzip -l udfs/hive/target/yauaa-hive-5.12-SNAPSHOT-udf.jar | fgrep org/springframework/core/io/ResourceLoader.class
          494  2019-08-23 12:26   nl/basjes/shaded/org/springframework/core/io/ResourceLoader.class
          487  2019-02-13 05:32   org/springframework/core/io/ResourceLoader.class
I filed a bug report / missing-feature request for this in the Maven shade plugin: https://issues.apache.org/jira/browse/MSHADE-326
For which I’ve put up a pull request: https://github.com/apache/maven-shade-plugin/pull/26
So we exclude these 4 shaded dependencies in all modules in this project, so that they are no longer included twice in the final jars.
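As an illustration, such an exclusion in one of the UDF modules could look roughly like the sketch below. The coordinates of the main module (`nl.basjes.parse.useragent:yauaa`) are an assumption made for this example and are not stated in this text. Only the three artifacts named in the shade configuration are listed; any further shaded artifact would need its own `<exclusion>` entry.

```xml
<dependency>
  <!-- Assumed coordinates of the main module that contains the shaded classes -->
  <groupId>nl.basjes.parse.useragent</groupId>
  <artifactId>yauaa</artifactId>
  <version>${project.version}</version>
  <exclusions>
    <!-- Already present in relocated form inside the main jar,
         so the original artifacts must not be pulled in again -->
    <exclusion>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-runtime</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.yaml</groupId>
      <artifactId>snakeyaml</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```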
This gives rise to a new problem: when building or developing these modules, the code complains about missing dependencies. The dependencies have been shaded, relocated and excluded, which means that any code looking for the ‘original’ class name will find it missing.
The final step I had to take was to include these 4 dependencies again as ‘provided’ in all modules in this project.
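A rough sketch of what one such ‘provided’ entry might look like, shown for Antlr only. It assumes the version is managed elsewhere (for example in a parent pom's `dependencyManagement`), which is not stated in this text. The idea is that a `provided` dependency is visible while compiling and testing the module, but is expected not to end up in the packaged jar.

```xml
<dependency>
  <groupId>org.antlr</groupId>
  <artifactId>antlr4-runtime</artifactId>
  <!-- Assumption: the version comes from dependencyManagement in a parent pom -->
  <!-- Available on the compile and test classpath, not packaged with the module -->
  <scope>provided</scope>
</dependency>
```

The same pattern would be repeated for each of the other excluded dependencies.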