Basic descriptive statistics in Apache Spark

Update: this blog has migrated to Medium: https://medium.com/spark-experts. To keep receiving new content, please subscribe there.

The Spark core module provides basic descriptive statistics operations for RDDs of numeric data. More advanced statistics operations are available in the MLlib module, which is beyond the scope of this post.

These descriptive statistics operations are only available on a specialised version of the Spark RDD, exposed in the Java API as JavaDoubleRDD. The following example application calculates the mean, max, min, etc. of a sample dataset.

First we’ll generate a sample test dataset. The following code generates the numbers from 1 to 99999 using the Java 8 stream API.
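
A minimal sketch of that step, assuming we collect into a plain List&lt;Double&gt; (the variable name is mine):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Generate 1.0, 2.0, ..., 99999.0 as a List<Double>
List<Double> data = IntStream.rangeClosed(1, 99999)
        .asDoubleStream()
        .boxed()
        .collect(Collectors.toList());
```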

Next, create a JavaDoubleRDD from the sample data. Note the use of the “parallelizeDoubles” method instead of the usual “parallelize” method: it accepts a list of Doubles and returns a JavaDoubleRDD.
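
Something along these lines; the app name and local master URL are placeholder assumptions, not from the original app:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("statistics-app")  // assumed app name
        .setMaster("local[*]");        // assumed local mode for this example
JavaSparkContext sc = new JavaSparkContext(conf);

// parallelizeDoubles accepts a List<Double> and returns a JavaDoubleRDD
JavaDoubleRDD rdd = sc.parallelizeDoubles(data);
```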

Now we’ll calculate the mean of our dataset.
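
For example:

```java
// mean() is an action: it triggers a pass over the entire RDD
double mean = rdd.mean();
System.out.println("Mean: " + mean);
```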

There are similar methods for the other statistics operations such as max, standard deviation, etc.
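
For instance:

```java
double max = rdd.max();            // maximum value
double min = rdd.min();            // minimum value
double sum = rdd.sum();            // sum of all elements
double variance = rdd.variance();  // population variance
double stdev = rdd.stdev();        // population standard deviation
```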

Every time one of these methods is invoked, Spark scans the entire RDD. If more than one statistic is needed, that full pass over the data is repeated each time, which is very inefficient. To solve this, Spark provides the “StatCounter” class, which computes all of the basic statistics in a single pass and returns the results together.
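
A sketch of the single-pass alternative:

```java
import org.apache.spark.util.StatCounter;

// stats() scans the RDD once and gathers all the basic statistics together
StatCounter statCounter = rdd.stats();
```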

The results can then be accessed as follows:
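
For example (the print statements here are mine, for illustration):

```java
System.out.println("Count:    " + statCounter.count());
System.out.println("Mean:     " + statCounter.mean());
System.out.println("Min:      " + statCounter.min());
System.out.println("Max:      " + statCounter.max());
System.out.println("Sum:      " + statCounter.sum());
System.out.println("Variance: " + statCounter.variance());
System.out.println("Stdev:    " + statCounter.stdev());
```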

When this application runs, the log output will look like the following.
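
Roughly, for the 1 to 99999 dataset (these values are reconstructed from the series itself, so the exact double formatting may differ):

```
Count:    99999
Mean:     50000.0
Min:      1.0
Max:      99999.0
Sum:      4.99995E9
Variance: ~8.3332E8
Stdev:    ~28867.2
```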

The complete application is available in my GitHub repo: https://github.com/sujee81/SparkApps.
