Apache Druid Integration with Apache Superset

Introduction

In our previous blog, Exploring Apache Druid: A High-Performance Real-Time Analytics Database, we discussed Apache Druid in more detail. In case you have missed it, it is recommended that you read it before continuing the Apache Druid blog series. In this blog, we talk about integrating Apache Superset and Apache Druid.

Apache Superset and Apache Druid are both open-source tools that are often used together in data analytics and business intelligence workflows. Here’s an overview of each and their relationship.

Apache Superset

Apache Superset is an open-source data exploration and visualization platform developed by Airbnb, it was later donated to the Apache Software Foundation. It is now a top-level Apache project, widely adopted across industries for data analytics and visualization. 

  1. Type: Data visualization and exploration platform.
  2. Purpose: Enables users to create dashboards, charts, and reports for data analysis.
  3. Features:
    1. User-friendly interface for querying and visualizing data.
    2. Wide variety of visualization types (charts, maps, graphs, etc.).
    3. SQL Lab: A rich SQL IDE for writing and running queries.
    4. Extensible: Custom plugins and charts can be developed.
    5. Role-based access control and authentication options.
    6. Integrates with multiple data sources, including databases, warehouses, and big data engines.
  4. Use Cases:
    1. Business intelligence (BI) dashboards.
    2. Interactive data exploration.
    3. Monitoring and operational analytics.

Apache Druid

Apache Druid is a high-performance, distributed real-time analytics database designed to handle fast, interactive queries on large-scale datasets, especially those with a time component. It is widely used for powering business intelligence (BI) dashboards, real-time monitoring systems, and exploratory data analytics.

  1. Type: Real-time analytics database.
  2. Purpose: Designed for fast aggregation, ingestion, and querying of high-volume event streams.
  3. Features:
    1. Columnar storage format: Optimized for analytic workloads.
    2. Real-time ingestion: Can process and make data available immediately.
    3. Fast queries: High-performance query engine for OLAP (Online Analytical Processing).
    4. Scalability: Handles massive data volumes efficiently.
    5. Support for time-series data: Designed for datasets with a time component.
    6. Flexible data ingestion: Supports batch and streaming ingestion from sources like Kafka, Hadoop, and object stores.
  4. Use Cases:
    1. Real-time dashboards.
    2. Clickstream analytics.
    3. Operational intelligence and monitoring.
    4. Interactive drill-down analytics.

How They Work Together

  1. Integration: Superset can connect to Druid as a data source, enabling users to visualize and interact with the data stored in Druid.
  2. Workflow:
    1. Data Ingestion: Druid ingests large-scale data in real-time or batches from various sources.
    2. Data Querying: Superset uses Druid’s fast query capabilities to retrieve data for visualizations.
    3. Visualization: Superset presents the data in user-friendly dashboards, allowing users to explore and analyze it.

Key Advantages of Using Them Together

  1. Speed: Druid’s real-time and high-speed querying capabilities complement Superset’s visualization features, enabling near real-time analytics.
  2. Scalability: The combination is suitable for high-throughput environments where large-scale data analysis is needed.
  3. Flexibility: Supports a wide range of analytics use cases, from real-time monitoring to historical data analysis.

Use Cases of Apache Druid and Apache Superset

  1. Fraud Detection: Analyzing transactional data for anomalies.
  2. Real-Time Analytics: Monitoring website traffic, application performance, or IoT devices in real-time.
  3. Operational Intelligence: Tracking business metrics like revenue, sales, or customer behaviour.
  4. Clickstream Analysis: Analyzing user interactions and behaviour on websites and apps.
  5. Log and Event Analytics: Parsing and querying log files or event streams.

Importance of Integrating Apache Druid with Apache Superset

Integrating Apache Druid with Apache Superset opens quite a few doors for the organization to really address high-scale data analysis and interactive visualization needs. Here’s why this combination is so powerful:

  1. High-Performance Analytics:

Druid’s Speed: Apache Druid optimizes low-latency queries and fast aggregation and is designed with large datasets in mind. That can be huge because quick response means Superset can deliver good, reusable, production-quality representations.

Real-Time Data: Druid supports real-time data ingestion, allowing Superset dashboards to visualise near-live data streams, making it suitable for monitoring and operational analytics.

  1. Advanced Query Capabilities:

Complex Aggregations: Druid’s ability to handle complex OLAP-style aggregations and filtering ensures that Superset can provide rich and detailed visualizations.

Native Time-Series Support: Druid’s in-built time-series functionalities allow seamless integration for Superset’s time-series charts and analysis.

  1. Scalability:

Designed for Big Data: Druid is highly scalable and can manage billions of rows in an efficient manner. Together with Superset, it helps dashboards scale easily for high volumes of data.

Distributed Architecture: Being by nature distributed, Druid ensures reliable performance of Superset even when it’s under high loads or many concurrent user sessions.

  1. Simplified Data Exploration:

Columnar Storage: This columnar storage format on Druid greatly accelerates the exploratory data analysis of Superset since it supports scans of selected columns efficiently.

Pre-Aggregated Data: Superset can avail use from the roll-up and pre-aggregation mechanisms in Druid for reducing query cost as well as improving the visualization performance.

  1. Flawless Integration:

Incorporated Connector: Superset is natively integrated with Druid. Therefore, there would not be any hassle regarding implementing it. It lets users query Druid directly and curate dashboards using minimal configuration. Moreover, non-technical users can also perform SQL-like queries (Druid SQL) within the interface.

  1. Advanced Features:

Granular Access Control: When using Druid with Superset, it is possible to enforce row-level security and user-based data access, ensuring that sensitive data is only accessible to authorized users.

Segment-Level Optimization: Druid’s segment-based storage enables Superset to optimize queries for specific data slices.

  1. Cost Efficiency:

Efficient Resource Utilization: Druid’s architecture minimizes storage and compute costs while providing high-speed analytics, ensuring that Superset dashboards run efficiently.

  1. Real-Time Monitoring and BI:

Operational Dashboards: In the case of monitoring systems, website traffic, or application logs, Druid’s real-time capabilities make the dashboards built in Superset very effective for real-time operational intelligence.

Streaming Data Support: Druid supports ingesting stream-processing-enabled data sources; for example, Kafka’s ingestion is supported. This enables Superset to visualize live data with minimal delay.

Prerequisites

  1. The Superset Docker Container or instance should be up and running.
  2. Installation of pydruid in Superset Container or instance.

Installing PyDruid

1. Enable Druid-Specific SQL Queries

PyDruid is the Python library that serves as a client for Apache Druid. This allows Superset to: 

  1. Run Druid SQL queries as efficiently as possible. 
  2. Access Druid’s powerful querying capabilities directly from Superset’s SQL Lab and dashboards.
  3. Without this client, Superset will not be able to communicate with the Druid database, which consequently results in partial functionality or functionality at large

2. Native Connection Support

Superset is largely dependent upon PyDruid for:

  1. Seamless connections to Apache Druid.
  2. Translate Superset queries into Druid-native SQL or JSON that the Druid engine understands.

3. Efficient Data Fetching

PyDruid optimizes data fetching from Druid by

  1. Managing the query API of Druid in regards to aggregation, filtering, and time-series analysis
  2. Reducing latency due to better management of connections and responses

    4. Advanced Visualization Features

Many features of advanced visualization available in Superset (like time-series charts and heat maps) depend on Druid’s capabilities, which PyDruid enables.  It also exposes to Superset Druid’s granularity, time aggregation and access to complex metrics pre-aggregated in Druid.

5. Integration with Superset’s Druid Connector

The Druid connector in Superset relies on PyDruid as a dependency to:

  1. Register Druid as a data source.
  2. Allow Superset users the capability to query and visually explore Druid datasets directly.
  3. Without PyDruid the Druid connector in Superset will not work, limiting your ability to integrate Druid as a backend for analytics.

6. Data Discovery PyDruid Enables Superset to:

Fetch metadata from Druid, including schema, tables, and columns. Enable ad hoc exploration of Druid data in SQL Lab.

Apache Druid as a Data Source

  1. Make sure your Druid database application is set up and working fine in the respective ports.

2. Install pydruid by logging into the container.

$ docker exec -it <superset_container_name> user=root /bin/bash 

$ pip install pydruid

3. Once pydruid is installed, navigate to settings and click on database connections

4. Click on Add Database and select Apache Druid as shown in the screenshot below:

5. Here we have to enter the SQLALCHEMY URI. To learn more about the SQLALCHEMY rules please follow the below link. 

https://superset.apache.org/docs/configuration/databases/

6. For Druid the SQLALCHEMY URI will be in the below format:

druid://<User>:<password>@<Host>:<Port-default-9088>/druid/v2/sql

Note: 

druid://<User>:<password>@<Host>:<Port-default-9088>/druid/v2/sql

According to the Alchemy URI “<Port-default-9088>” the default ports will be “8888” or “8081”. 

If the druid you have installed is running on http without any secure connection the Alchemy URI with the above rule will work without any issue.

Ex: druid://localhost:8888/druid/v2/sql/

If you have a secure connection normally druid://localhost:8888/druid/v2/sql/ will not work and provide an error. 

Here we will understand that we have to provide a domain name instead of localhost if we have enabled a secured connection for the druid UI. So we will make the changes and enter the URI as below

druid://domain.name.com:8888/druid/v2/sql/. This will also not work. 

According to other sources we have to make the changes like 

druid+https://localhost:8888/druid/v2/sql/

OR

druid+https://localhost:8888/druid/v2/sql/?ssl_verify=false

Neither of them will work

Normally once we enable a secured connection port HTTPS will be enabled to the site proxying the default ports of any of the other applications.

So if we enter the below SQL ALCHEMY URI as below

druid+https://domain.name.com:443/druid/v2/sql/?ssl=true

Superset will understand that the druid is enabled with a secured connection and running in the HTTPS port.

7. Since we have enabled the SSL to our druid application we have to enter the below SQLALCHEMY URI

druid+https://domain.name.com:443/druid/v2/sql/?ssl=true 

and click on test connection.

8. Once we click on test connection the Apache superset will be integrated with the Apache druid and will be ready to be worked on.

In the upcoming articles, we will talk about the data ingestion and visualization of Apache Druid data in the Apache Superset.

Apache Druid FAQ

1. What is Apache Druid used for?

Apache Druid is used for real-time analytics and fast querying of large-scale datasets. It’s particularly suitable for time-series data, clickstream analytics, operational intelligence, and real-time monitoring.

2. What are Druid’s ingestion methods?

Druid supports:

  1. Streaming ingestion: Data from Kafka, Kinesis, etc., is ingested in real-time.
  2. Batch ingestion: Data from Hadoop, S3, or other static sources is ingested in batches.

3. How does Druid achieve fast queries?

Druid uses:

  1. Columnar storage: Optimizes storage for analytic queries.
  2. Advanced indexing: Bitmap and inverted indexes speed up filtering and aggregation.
  3. Distributed architecture: Spreads query load across multiple nodes.

4. Can Druid handle transactional data (OLTP)?

No. Druid is optimized for OLAP (Online Analytical Processing) and is not suitable for transactional use cases.

5. How does Druid manage data storage?

  1. Data is segmented into immutable chunks and stored in Deep Storage (e.g., S3, HDFS).
  2. Historical nodes cache frequently accessed segments for faster querying.

6. What query languages does Druid support?

  1. SQL: Provides a familiar interface for querying.
  2. Native JSON-based query language: Offers more flexibility and advanced features.

7. What are the core components of Apache Druid?

  1. Broker: Routes queries to appropriate nodes.
  2. Historical nodes: Serve immutable data.
  3. MiddleManager: Handles ingestion tasks.
  4. Overlord: Coordinates ingestion processes.
  5. Coordinator: Manages data availability and balancing.

8. What integrations does Druid support?

Druid integrates with tools like:

  1. Apache Superset
  2. Tableau
  3. Grafana
  4. Stream ingestion tools (Kafka, Kinesis)
  5. Batch processing frameworks (Hadoop, Spark)

9. Is Apache Druid free to use?

Yes, it’s open-source and distributed under the Apache License 2.0.

10. What are some alternatives to Apache Druid?

Alternatives include:

  1. ClickHouse
  2. Elasticsearch
  3. BigQuery
  4. Snowflake

Apache Superset FAQ

1. What is Apache Superset?

Apache Superset is an open-source data visualization and exploration platform. It allows users to create dashboards, charts, and interactive visualizations from various data sources.

2. What types of visualizations does Superset support?

Superset offers:

  1. Line charts, bar charts, pie charts.
  2. Geographic visualizations like maps.
  3. Pivot tables, histograms, and heatmaps.
  4. Custom visualizations via plugins.

3. What data sources can Superset connect to?

Superset supports:

  1. Relational databases (PostgreSQL, MySQL, SQLite).
  2. Big data engines (Hive, Presto, Trino, BigQuery).
  3. Analytics databases (Apache Druid, ClickHouse).

4. Is Apache Superset a replacement for Tableau or Power BI?

Superset is a robust, open-source alternative to commercial BI tools. While it’s feature-rich, it may require more setup and development effort compared to proprietary solutions like Tableau.

5. What query languages does Superset support?

Superset primarily supports SQL for querying data. Its SQL Lab provides an environment for writing, testing, and running queries.

6. How does Superset handle user authentication?

Superset supports:

  1. Database-based login.
  2. OAuth, LDAP, OpenID Connect, and other SSO mechanisms.
  3. Role-based access control (RBAC) for managing permissions.

7. How do I install Superset?

Superset can be installed using:

  1. Docker: For containerized deployments.
  2. Python pip: For a native installation in a Python environment.
  3. Helm: For Kubernetes-based setups.

8. Can I use Superset with real-time data?

Yes. By integrating Superset with real-time analytics platforms like Apache Druid or streaming data sources, you can build dashboards that reflect live data.

9. What are Superset’s limitations?

  1. Requires knowledge of SQL for advanced queries.
  2. May not have the polish or advanced features of proprietary BI tools like Tableau.
  3. Customization and extensibility require some development effort.

10. How does Superset handle performance?

Superset performance depends on:

  1. The efficiency of the connected database or analytics engine.
  2. The complexity of queries and visualizations.
  3. Hardware resources and configuration.

Would you like further details on specific features or help setting up either tool? Reach out to us

Watch the Apache Blog Series

Table of Contents
Table of Contents
Related Posts
Apache Druid Integration with Apache Superset

Introduction In our previous blog, Exploring Apache Druid: A High-Performance Real-Time Analytics Database, we discussed Apache Druid in more detail. In case you have missed

Unlocking Data Insights with Apache Superset

Introduction In today’s data-driven world, having the right tools to analyze and visualize data is crucial for making informed decisions. Organizations rely heavily on actionable

Shopping Basket

Fill Your Requirements


MicroFocus Vertica Analytics Platform delivers speed, scalability, and built-in machine learning that today’s most analytically intensive workloads demand, whether in the Public Clouds, On-Premises, on Hadoop, or any Hybrid combination. Vertica’s SQL Data Warehouse is trusted by the world’s leading data-driven companies, including Cerner, Etsy, Intuit, Uber and more to deliver speed, scale and reliability on mission-critical analytics. Vertica combines the power of a high-performance, massively parallel processing SQL query engine with advanced analytics and machine learning so you can unlock the true potential of your data with no limits and no compromises. We are a certified System Integration and reseller partner of Vertica and have a strategic alliance to develop industry-specific solutions using this Award-winning Columnar Database in the APAC region.

We have extensive experience with the entire product suite having successfully completed over 50 implementations in the USA/Europe/Asia Pacific across different industries and still continue to support a few key customers Globally.

As a Future-ready and complete, enterprise-grade analytics platform, Pyramid is a compelling option for organizations. Pyramid offers an integrated suite for modern Analytics and Business Intelligence requirements. It has a broad range of analytical capabilities, including data wrangling, ad hoc analysis, interactive visualization, analytic dashboards, mobile capabilities and collaboration in a governed infrastructure. It also features an integrated workflow for system-of-record reporting. Its Augmented features such as Smart Discovery, Smart Reporting, Ask Pyramid (NLQ), AI-driven modelling, automatic visualizations and dynamic content offer powerful insights to all users, regardless of skill level and the adaptive augmented analytics platform covers the entire data life cycle out-of-the-box, from ML-based data preparation to automated insights and automated ML model building. Pyramid is especially useful for the customer who is in urgent need to get more value out of their existing SAP BW and SAP HANA investments. Without any data extraction or duplication, Pyramid offers best-in-class functionality and performance that preserves the security and governance inherent in the SAP platform. We are a Strategic System Integration and Reseller partner of Pyramid Analytics.