Building Your First Clickstream Analytics Dashboard: An End-to-End Guide
Venturing into the world of data pipelines may feel daunting, especially when juggling several advanced technologies. Follow along as we create an analytics dashboard tailored for clickstream data, leveraging Python's Streamlit, Apache Kafka, and Apache Pinot. Even if you're new to data pipelines, this comprehensive guide is here to help.
Overview - Build your first clickstream dashboard
By the end of this project, we will have built a simple dashboard that shows a funnel and a Sankey chart. There are four components to work on: Kafka, Pinot, a data producer, and the dashboard itself. First, we set up Apache Kafka and Pinot. Then, we create some data and stream it to Kafka. Finally, we build a dashboard that visualizes the data we query from Pinot.
Tools I Used
Apache Kafka. Apache Kafka is a distributed streaming platform that is used for building real-time data pipelines and streaming apps.
Apache Pinot. Apache Pinot is an open-source, distributed data system designed for real-time analytics and suited for time series databases, OLAP queries, and real-time monitoring systems.
Streamlit. Streamlit is an open-source Python library that allows developers to create dashboards featuring visualizations, widgets, and more.
Plotly. Plotly is an open-source graphing library. We will use this to create the funnel and Sankey chart.
Clickstream data is the information collected as a user browses a website. Clickstream analytics is the process of tracking, analyzing, and reporting the pages a user visits and the user's behavior while on a webpage. This data can reveal a lot, including how users found the product or service and what actions they took while using it.
The data for the project contains, for each user, two key columns: the page they visited and when they visited it (a timestamp).
To make the data easier to process, we turn the timestamp (the When They Visited column) into an order: we sort the timestamps and assign each one an integer from smallest to largest.
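As a quick sketch, this ordering step takes only a few lines of standard-library Python (the function name is illustrative, and the sketch assumes unique timestamps; real data would need a tie-breaking rule):

```python
# Minimal sketch: turn a list of timestamps into 1-based ranks,
# smallest timestamp first.
def assign_order(timestamps):
    # Map each timestamp to its rank after sorting.
    ranks = {ts: i + 1 for i, ts in enumerate(sorted(timestamps))}
    return [ranks[ts] for ts in timestamps]

# e.g. one user's visit timestamps (epoch milliseconds)
print(assign_order([1700000300, 1700000100, 1700000200]))  # → [3, 1, 2]
```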
In the real world, the data is sent as the user navigates through the website. In this project, however, we will read the data from a CSV file and send its contents row by row.
You can also use Python's random library or a mock-data tool (for example, Mockaroo) to generate this kind of data.
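For instance, a small generator using only the stdlib `random` module might look like this (the page names, locations, and exact column set are assumptions for illustration, not the article's actual dataset):

```python
import random

# Assumed funnel pages and locations, purely illustrative.
PAGES = ["home", "product", "cart", "checkout"]
LOCATIONS = ["US", "UK", "ID", "DE"]

def mock_rows(n_users, seed=42):
    """Generate mock clickstream rows: each user walks some prefix of the funnel."""
    rng = random.Random(seed)
    rows = []
    for uid in range(n_users):
        base_ts = 1_700_000_000_000 + rng.randrange(10**6)
        depth = rng.randint(1, len(PAGES))  # how far down the funnel the user gets
        for order, page in enumerate(PAGES[:depth], start=1):
            rows.append({
                "user_id": f"user_{uid}",
                "web_page": page,
                "order": order,
                "location": rng.choice(LOCATIONS),
                "visit": 1,
                "ts": base_ts + order * rng.randint(1_000, 60_000),
            })
    return rows
```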
Setting Up Kafka and Pinot
Next, we need to set up Kafka and Pinot. However, manually setting up such complex systems can be daunting, even for experienced developers. You have to deal with complex dependencies, such as ZooKeeper for coordinating the distributed components. Installing and configuring these dependencies is time-consuming, there are many settings to get right, and every component and its dependencies must line up correctly.
Since the focus of this article is to give an end-to-end review of the whole system, let’s not get bogged down with a manual setup. Let’s use Docker-Compose. A side note: Docker-Compose is not meant for a production environment. It is primarily used in development because it is easy to manage multi-container applications like this project.
Docker-Compose offers a simplified approach. It will help streamline the setup process. I will provide two docker-compose files—one for Apple Silicon and the other for Intel-based machines.
Docker-Compose File for Apple Silicon
Docker-Compose File for x86
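A minimal compose sketch along these lines could be used as a starting point. All image tags, ports, and listener names here are assumptions to adapt; the Apple Silicon and x86 variants would differ mainly in choosing arm64- versus amd64-capable image tags (or adding a `platform:` key per service):

```yaml
version: "3.8"
services:
  zookeeper:
    image: zookeeper:3.8          # pick an arm64-capable tag on Apple Silicon
    ports:
      - "2181:2181"

  kafka-clickstream:
    image: bitnami/kafka:3.4
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"               # external listener for the producer script
    environment:
      ALLOW_PLAINTEXT_LISTENER: "yes"
      KAFKA_CFG_ZOOKEEPER_CONNECT: zookeeper:2181
      # 9093 is the internal listener Pinot will use (kafka-clickstream:9093)
      KAFKA_CFG_LISTENERS: INTERNAL://:9093,EXTERNAL://:9092
      KAFKA_CFG_ADVERTISED_LISTENERS: INTERNAL://kafka-clickstream:9093,EXTERNAL://localhost:9092
      KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_CFG_INTER_BROKER_LISTENER_NAME: INTERNAL

  pinot-controller:
    image: apachepinot/pinot:1.0.0
    command: "StartController -zkAddress zookeeper:2181"
    depends_on:
      - zookeeper
    ports:
      - "9000:9000"               # Pinot web UI

  pinot-broker:
    image: apachepinot/pinot:1.0.0
    command: "StartBroker -zkAddress zookeeper:2181"
    depends_on:
      - pinot-controller
    ports:
      - "8099:8099"               # query endpoint the dashboard will hit

  pinot-server:
    image: apachepinot/pinot:1.0.0
    command: "StartServer -zkAddress zookeeper:2181"
    depends_on:
      - pinot-broker
```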
Spin Up Kafka and Pinot
Using the above docker-compose file, we can now spin up both Kafka and Pinot with the following command:
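Assuming the file is saved as `docker-compose.yml` in the current directory, this starts all containers in the background (older Docker installs use `docker-compose up -d` instead):

```shell
docker compose up -d
```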
Before sending a bunch of data to Kafka, we must create a “topic.” A topic in Kafka is a category or name to which records are published. Think of it like a channel on a TV. If you want to watch a particular type of show or get specific news updates, you tune into a specific channel. Similarly, in Kafka, producers send records to topics, and consumers read records from topics.
The following command will create a topic named "clickstream-events" (the same name our Pinot table configuration will reference) in Kafka.
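One way to do this is to run Kafka's topic tool inside the broker container; the container name matches the compose service above, and the script path may vary by image:

```shell
docker exec -it kafka-clickstream kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic clickstream-events \
  --partitions 1 \
  --replication-factor 1
```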
Add Pinot Table
Before Pinot can work, it must know how to save the data and connect to Kafka. To allow that, we must first define a schema.json and a table.json. Then, we need to tell Pinot to create the table following our schema and definition.
The schema will follow the data definition from the previous section. There will be user_id, web_page, order, location, and visit. We also add a timestamp column named ts. This timestamp column is vital in Pinot because it determines the time boundary and manages the life cycle of data stored in Pinot segments.
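A schema.json along these lines would match that column list (the data types are assumptions about the CSV contents):

```json
{
  "schemaName": "clickstream_events",
  "dimensionFieldSpecs": [
    { "name": "user_id", "dataType": "STRING" },
    { "name": "web_page", "dataType": "STRING" },
    { "name": "order", "dataType": "INT" },
    { "name": "location", "dataType": "STRING" },
    { "name": "visit", "dataType": "INT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```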
To connect Pinot with Kafka, we need to create a table and define the connection configuration. This is so Pinot knows where to connect, what topic to read from, and other settings.
The complete table.json content is as follows. Notice the streamConfigs section where we define the stream type (Kafka), the broker (kafka-clickstream:9093), and the topic name (clickstream-events). This is how Pinot gets the data from Kafka.
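A realtime table config of this shape would wire those pieces together (flush thresholds and replica counts here are illustrative defaults, not tuned values):

```json
{
  "tableName": "clickstream_events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "clickstream_events",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "clickstream-events",
      "stream.kafka.broker.list": "kafka-clickstream:9093",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000"
    }
  },
  "metadata": {}
}
```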
Now that we have defined our schema and table configuration, we only need to run the following command.
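One way to register both is Pinot's admin tool inside the controller container, assuming schema.json and table.json have been mounted or copied into it (paths and container name are illustrative):

```shell
docker exec -it pinot-controller bin/pinot-admin.sh AddTable \
  -schemaFile /config/schema.json \
  -tableConfigFile /config/table.json \
  -exec
```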
Once we have set up Kafka and Pinot, it is time to send some data. As mentioned before, we will read the funnel data from a CSV file and send every row to Kafka, adding a random sleep between transmissions.
Our code that sends this data to Kafka is as follows.
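A minimal sketch of such a producer is below, assuming the `kafka-python` package, a broker reachable on localhost:9092, and a CSV (file name illustrative) with the columns described earlier:

```python
import csv
import json
import random
import time

def row_to_event(row):
    """Convert one CSV row (a dict of strings) into a JSON-serializable event."""
    return {
        "user_id": row["user_id"],
        "web_page": row["web_page"],
        "order": int(row["order"]),
        "location": row["location"],
        "visit": int(row["visit"]),
        "ts": int(row["ts"]),
    }

if __name__ == "__main__":
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    with open("clickstream.csv", newline="") as f:
        for row in csv.DictReader(f):
            producer.send("clickstream-events", row_to_event(row))
            time.sleep(random.uniform(0.1, 1.0))  # random pause between events
    producer.flush()
```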
Building the Dashboard
Now that the data is sent to Kafka and Pinot, we can display our dashboard. I use this Streamlit code to show a funnel and user flow data.
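The dashboard logic can be sketched as two pure helpers plus a thin rendering layer. The helpers below are stdlib-only; the rendering section assumes the `streamlit`, `plotly`, and `pinotdb` packages, a Pinot broker on localhost:8099, and particular page names, all of which are assumptions:

```python
from collections import Counter

def funnel_counts(events, stages):
    """Count distinct users who reached each page in `stages`."""
    reached = {stage: set() for stage in stages}
    for e in events:
        if e["web_page"] in reached:
            reached[e["web_page"]].add(e["user_id"])
    return [len(reached[s]) for s in stages]

def sankey_links(events):
    """Count (source page -> target page) transitions per user, in visit order."""
    by_user = {}
    for e in sorted(events, key=lambda e: (e["user_id"], e["order"])):
        by_user.setdefault(e["user_id"], []).append(e["web_page"])
    transitions = Counter()
    for pages in by_user.values():
        for src, dst in zip(pages, pages[1:]):
            transitions[(src, dst)] += 1
    return transitions

if __name__ == "__main__":
    import streamlit as st
    import plotly.graph_objects as go
    from pinotdb import connect

    conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
    cur = conn.cursor()
    cur.execute('SELECT user_id, web_page, "order" FROM clickstream_events')
    events = [dict(zip(["user_id", "web_page", "order"], row)) for row in cur]

    stages = ["home", "product", "cart", "checkout"]  # assumed page names
    st.title("Clickstream Funnel")
    st.plotly_chart(go.Figure(go.Funnel(y=stages, x=funnel_counts(events, stages))))

    links = sankey_links(events)
    labels = sorted({p for pair in links for p in pair})
    idx = {p: i for i, p in enumerate(labels)}
    st.plotly_chart(go.Figure(go.Sankey(
        node=dict(label=labels),
        link=dict(
            source=[idx[s] for s, _ in links],
            target=[idx[t] for _, t in links],
            value=list(links.values()),
        ))))
```

Keeping the aggregation in plain functions makes the chart logic easy to test without a running Pinot cluster.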
Once everything is set up correctly, we can see the result.
Embarking on the data pipeline journey can be intimidating, but with the tools available right now, such as Docker, Kafka, Pinot, and Streamlit, this complex task becomes manageable. Whether you're analyzing user behaviors, optimizing sales funnels, or just experimenting, these tools offer a scalable and efficient way to handle and visualize data.
Now, you might be wondering why we should go through all this trouble. Why not just batch process the data every day or every half a day? Why use streaming when you can use batching?
Modern applications need to act fast, often within milliseconds. With the sheer volume of data being produced at every moment, it is critical to act on data before it becomes stale or obsolete. This continuous data processing offers advantages that can transform the way businesses run.
As businesses and technologies grow, the paradigms of data handling and processing are shifting. Being equipped with the right tools and understanding the significance of real-time processing could be the determining factor in staying ahead in the competitive landscape.