Architecting a Scalable Stock Price Prediction Service with Hadoop, Kafka, Spark & FBProphet

Li-Da Gan
Sep 6, 2021 · 5 min read

We are in an era where real-time analytics is becoming a necessity for many businesses. It lets them react without delay, seizing opportunities or heading off problems before they happen.

Foreword

Nanyang Poly's Specialist Diploma in Big Data Management (ITD35X) is the final certificate in my one-and-a-half-year big data and machine learning journey with NYP. This article covers the five-week final big data architecture project that I worked on with my teammates.

Startup Background & Project Objective

My three course mates (JunQiang Shen, TaiSan Lee and S. Vignesh) and I formed a pseudo start-up company, DataVar, which provides stock market research and analytics services to clients. Technical stock analysis is a major service offering of DataVar: we anticipate how stock prices are likely to fluctuate in the near future by applying machine learning techniques with historical data as the benchmark. Our technical stock analysis enables clients to make sound investment decisions based on validated data rather than assumptions.

Huge volumes of data from a variety of sources are generated in the stock market and flood us on a daily basis. Manually performing a naïve stock forecast with time series forecasting in Excel/CSV files and Python will not scale once it has to cover stock data for hundreds of companies and data sources. A manual approach is not acceptable in the stock market research and analysis industry, where tracking, processing and analyzing real-time data is becoming a necessity. Our project objective is to perform stock price forecasts for our clients programmatically and at scale.

In this project, we developed a stock data processing and prediction platform on Microsoft Azure cloud VMs using Apache Hadoop, Kafka, FBProphet and PySpark.

Architecture of Hadoop Ecosystem

We built a multi-node Hadoop/YARN cluster containing an Active and a Standby NameNode, 3 DataNodes and 1 EdgeNode in a distributed Hadoop environment. It stores and processes data in HDFS in a parallel, distributed manner, and HDFS keeps more than one replica of each block so that processing can fail over to another node.

We set up Apache Hive as the data warehouse built on top of Hadoop. Hive provides a SQL-style query language for data analysis. Apache Kafka and Spark streaming were set up as well, for real-time API data streaming and historical CSV data processing.
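As a flavour of that SQL-style access, here is a minimal sketch of querying the Hive warehouse through PySpark with Hive support enabled; the database and table names (stock_dw.daily_prices) are illustrative assumptions, not the project's actual schema.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-query-example")
    .enableHiveSupport()          # lets Spark SQL read and write Hive tables
    .getOrCreate()
)

# SQL-style query against a Hive table, much as an analyst would run it in Zeppelin
df = spark.sql("""
    SELECT symbol, stock_date, close
    FROM stock_dw.daily_prices
    WHERE stock_date >= '2021-01-01'
""")
df.show(10)
```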

Our data scientist uses Apache Zeppelin, which enables interactive data analytics and machine learning.

ZooKeeper is an open source Apache project that provides a centralized service for configuration information, naming, synchronization and group services across large clusters in a distributed system.

Here is the architecture of the Hadoop/YARN ecosystem:

Streaming and Data Pipeline

As I played the "ETL architect" role in the project, the rest of this article focuses on the data pipeline design, which covers data streaming, data warehousing, data transformation and batch prediction.

Here is the architecture of the end-to-end data pipeline and batch prediction integration:

Kafka and Spark Structured Streaming

To build data streams that feed data into analytics tools as soon as it is generated and deliver near-instant results, I set up Kafka as the streaming platform that accepts huge volumes of records from various sources, such as flat files and a REST API. In the project, I pulled every stock's one-minute-frequency data and historical data by calling the APIs provided by Alpha Vantage. I also downloaded stocks' historical data CSV files from Kaggle into dedicated drop folders monitored by Kafka producer applications.
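Here is a minimal sketch of one such producer, assuming the kafka-python client and an Alpha Vantage API key; the topic name stock-intraday, the broker address and the ticker list are illustrative placeholders rather than the project's actual configuration.

```python
import json
import requests
from kafka import KafkaProducer

API_KEY = "YOUR_ALPHA_VANTAGE_KEY"                    # placeholder

producer = KafkaProducer(
    bootstrap_servers="edgenode:9092",                # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_intraday(symbol: str) -> None:
    """Fetch 1-minute bars for one ticker from Alpha Vantage and publish each bar as an event."""
    url = (
        "https://www.alphavantage.co/query"
        f"?function=TIME_SERIES_INTRADAY&symbol={symbol}"
        f"&interval=1min&apikey={API_KEY}"
    )
    bars = requests.get(url, timeout=30).json().get("Time Series (1min)", {})
    for ts, bar in bars.items():
        producer.send("stock-intraday", {
            "symbol": symbol,
            "time": ts,
            "open": bar["1. open"],
            "high": bar["2. high"],
            "low": bar["3. low"],
            "close": bar["4. close"],
            "volume": bar["5. volume"],
        })
    producer.flush()

for sym in ["AAPL", "MSFT"]:                          # example tickers
    publish_intraday(sym)
```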

The streams of data (Kafka calls them events) are organized and stored in the form of topics, which support multiple producers publishing events and multiple consumers subscribing to them.

Then I used Spark Structured Streaming, a stream processing engine built on top of Spark SQL, to build the pipeline that ingests and transforms the data from the Kafka message queues and writes the dataset to the output sink, which in this project is the Hive warehouse.
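Below is a minimal sketch of that Kafka-to-Hive leg, assuming the spark-sql-kafka connector package is available on the classpath; the topic, schema, checkpoint path and staging table name are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType

spark = (
    SparkSession.builder
    .appName("kafka-to-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# All fields arrive as strings; type conversion happens later in the ETL step
schema = StructType()
for field in ["symbol", "time", "open", "high", "low", "close", "volume"]:
    schema = schema.add(field, StringType())

raw = (
    spark.readStream
    .format("kafka")                                   # requires the spark-sql-kafka package
    .option("kafka.bootstrap.servers", "edgenode:9092")
    .option("subscribe", "stock-intraday")
    .option("startingOffsets", "earliest")
    .load()
)

events = (
    raw.select(from_json(col("value").cast("string"), schema).alias("e"))
       .select("e.*")
)

# Append each micro-batch to the Hive staging table
query = (
    events.writeStream
    .foreachBatch(lambda batch_df, _id: batch_df.write.mode("append")
                  .saveAsTable("stock_dw.staging_intraday"))
    .option("checkpointLocation", "/tmp/checkpoints/stock-intraday")
    .start()
)
query.awaitTermination()
```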

Data ETL with PySpark

I used PySpark with Hive to run a series of transformations that turn the staging data from various sources into a single final dataset:

- filter out invalid data, for example test data and abnormal closing prices;
- eliminate data fields that add no value;
- substring values to strip special characters and non-value-added strings;
- aggregate the stock data by stock date, returning the max high, min low, first open, last close and summed volume;
- convert data types from string to date, double and long;
- partition and load the stock data by ticker symbol and year of stock date.
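Here is a sketch of the aggregation and partitioning steps, under the same illustrative column and table names as above; in practice the intraday rows would also need an explicit within-day ordering before taking the first open and last close.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stock-etl").enableHiveSupport().getOrCreate()

staging = spark.table("stock_dw.staging_intraday")

daily = (
    staging
    # drop invalid rows, e.g. test records and abnormal closing prices
    .filter(F.col("close").cast("double") > 0)
    .withColumn("stock_date", F.to_date("time"))
    .groupBy("symbol", "stock_date")
    .agg(
        F.max(F.col("high").cast("double")).alias("high"),
        F.min(F.col("low").cast("double")).alias("low"),
        F.first(F.col("open").cast("double")).alias("open"),   # first bar of the day
        F.last(F.col("close").cast("double")).alias("close"),  # last bar of the day
        F.sum(F.col("volume").cast("long")).alias("volume"),
    )
    .withColumn("year", F.year("stock_date"))
)

# partition the final dataset by ticker symbol and year of stock date
(daily.write
      .mode("overwrite")
      .partitionBy("symbol", "year")
      .saveAsTable("stock_dw.daily_prices"))
```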

Batch Prediction with FBProphet

To overcome the scalability limits and provide forecasts to clients in a timely manner, besides input data streaming and distributed data processing we also leveraged Facebook Prophet, aka FBProphet, a forecasting procedure implemented in R and Python that is fast and produces fully automated forecasts which data scientists can still tune by hand.

Prophet models can only be fit once, and a new model must be re-fit when new data become available. In most settings model fitting is fast enough that re-fitting from scratch is not an issue, so I decided to set up batch prediction: cron jobs refresh the final dataset in Hive, then run predictions that produce a multi-step forecast by fitting the best FBProphet model to the data and output the results to the console and to emails.
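Here is a minimal sketch of one such batch prediction run, assuming the fbprophet package (renamed prophet in later releases) and the illustrative Hive table from the earlier sketches:

```python
import pandas as pd
from fbprophet import Prophet                      # "from prophet import Prophet" in newer releases
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-forecast").enableHiveSupport().getOrCreate()

def forecast_ticker(symbol: str, horizon_days: int = 7) -> pd.DataFrame:
    """Re-fit a Prophet model on the latest Hive data and forecast horizon_days ahead."""
    history = (
        spark.table("stock_dw.daily_prices")
        .filter(F.col("symbol") == symbol)
        .select(F.col("stock_date").alias("ds"), F.col("close").alias("y"))
        .toPandas()
        .sort_values("ds")
    )
    model = Prophet()                              # re-fit from scratch on every scheduled run
    model.fit(history)
    future = model.make_future_dataframe(periods=horizon_days)
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(horizon_days)

print(forecast_ticker("AAPL"))                     # next 7 calendar days for one example ticker
```

A cron entry on the EdgeNode, for example `0 18 * * 1-5 spark-submit batch_forecast.py` (an illustrative schedule and script name, not the project's actual job), would refresh the Hive dataset and then trigger this step on weekday evenings.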

Here are examples of the automated email notifications of the next n days' stock price predictions sent to our clients after the scheduled ETL jobs complete.
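For the delivery step, a sketch along these lines, using Python's standard library and reusing the forecast_ticker helper from the previous sketch, could turn the forecast table into a plain-text email; the SMTP host and addresses are placeholders.

```python
import smtplib
from email.message import EmailMessage

def email_forecast(symbol: str, forecast_df) -> None:
    """Send the n-day forecast table as a plain-text email."""
    msg = EmailMessage()
    msg["Subject"] = f"{symbol}: stock price forecast for the next {len(forecast_df)} days"
    msg["From"] = "etl@datavar.example"            # placeholder sender
    msg["To"] = "client@example.com"               # placeholder recipient
    msg.set_content(forecast_df.to_string(index=False))
    with smtplib.SMTP("localhost") as smtp:        # placeholder SMTP relay
        smtp.send_message(msg)

email_forecast("AAPL", forecast_ticker("AAPL"))
```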

Closing thoughts

Building a scalable and distributed Hadoop-based stock data processing and prediction platform within 5 weeks is nontrivial. I hope this article gives those interested in such an architecture some food for thought, and a baseline against which to iterate and improve. Special thanks to my DataVar teammates, "Operation Manager" JunQiang Shen, "Project Manager" TaiSan Lee and "Data Scientist" S. Vignesh, for their technical expertise, time and effort, and to NYP lecturers Mr. Foo and Dr Ooi for their guidance.


Li-Da Gan

IT Solution (ERP MFG & MES) Architect exploring the potential of Machine Learning to improve manufacturing processes. linkedin.com/in/ganlida