
Exploring Apache Druid: A High-Performance Real-Time Analytics Database

Introduction

Apache Druid is a distributed, column-oriented, real-time analytics database designed for fast, scalable, and interactive analytics on large datasets. It excels in use cases requiring real-time data ingestion, high-performance queries, and low-latency analytics. 

Druid was originally developed to power interactive data applications at Metamarkets and has since become a widely adopted open-source solution for real-time analytics, particularly in industries such as ad tech, fintech, and IoT.

It supports batch and real-time data ingestion, enabling users to perform fast ad-hoc queries, power dashboards, and interactive data exploration.

In big data and real-time analytics, having the right tools to process and analyze large volumes of data swiftly is essential. Apache Druid, an open-source, high-performance, column-oriented distributed data store, has emerged as a leading solution for real-time analytics and OLAP (online analytical processing) workloads. In this blog post, we’ll delve into what Apache Druid is, its key features, and how it can revolutionize your data analytics capabilities. Refer to the official documentation for more information.

Apache Druid

Apache Druid is a high-performance, real-time analytics database designed for fast slice-and-dice analytics on large datasets. It was created by Metamarkets (now part of Snap Inc.) and is now an Apache Software Foundation project. Druid is built to handle both batch and streaming data, making it ideal for use cases that require real-time insights and low-latency queries.

Key Features of Apache Druid:

Real-Time Data Ingestion

Druid excels at real-time data ingestion, allowing data to be ingested from various sources such as Kafka, Kinesis, and traditional batch files. It supports real-time indexing, enabling immediate query capabilities on incoming data with low latency.
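
As a rough illustration of streaming ingestion, a Kafka ingestion is typically configured by submitting a supervisor spec to the Overlord API (this assumes the Kafka indexing service extension is enabled). The following Python sketch is a minimal, hypothetical example: the broker address, topic name, datasource, and column names are placeholders, and the spec is trimmed to the essentials.

Python Code

import json
import requests  # assumes the requests library is installed

# Minimal, hypothetical Kafka supervisor spec (topic, broker, and columns are placeholders)
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "events",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"}
        },
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "event_type"]}
        },
        "tuningConfig": {"type": "kafka"}
    }
}

# Submit the spec to the Overlord (proxied through the Router on port 8888 in the quickstart)
response = requests.post("http://localhost:8888/druid/indexer/v1/supervisor", json=supervisor_spec)
print(response.json())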

High-Performance Query Engine

Druid’s query engine is optimized for fast, interactive querying. It supports a wide range of query types, including Time-series, TopN, GroupBy, and search queries. Druid’s columnar storage format and advanced indexing techniques, such as bitmap indexes and compressed column stores, ensure that queries are executed efficiently.
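
To give a feel for these query types, here is a hedged sketch of a native TopN query submitted over HTTP from Python. The datasource, dimension, and interval are assumptions carried over from the ingestion example later in this post, and the Router is assumed to be listening on the quickstart port 8888.

Python Code

import requests  # assumes the requests library is installed

# Hypothetical TopN query: top 10 values of column1 by row count over January 2023
topn_query = {
    "queryType": "topN",
    "dataSource": "example_data",
    "dimension": "column1",
    "metric": "count",
    "threshold": 10,
    "intervals": ["2023-01-01/2023-02-01"],
    "granularity": "all",
    "aggregations": [{"type": "count", "name": "count"}],
}

# The native query endpoint is /druid/v2, reachable through the Router in the quickstart
response = requests.post("http://localhost:8888/druid/v2", json=topn_query)
print(response.json())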

Scalable and Distributed Architecture

Druid’s architecture is designed to scale horizontally. It can be deployed on a cluster of commodity hardware, with data distributed across multiple nodes to ensure high availability and fault tolerance. This scalability makes Druid suitable for handling large datasets and high query loads.

Flexible Data Model

Druid’s flexible data model allows for the ingestion of semi-structured and structured data. It supports schema-on-read, enabling dynamic column discovery and flexibility in handling varying data formats. This flexibility simplifies the integration of new data sources and evolving data schemas.

Built-In Data Management

Druid includes built-in features for data management, such as automatic data partitioning, data retention policies, and compaction tasks. These features help maintain optimal query performance and storage efficiency as data volumes grow.
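
As one illustration of retention policies, load rules can be set per datasource through the Coordinator API. The Python sketch below is an assumption-laden example (the datasource name, retention period, replica count, and Coordinator address are placeholders) that keeps the last 30 days loaded and drops older data.

Python Code

import requests  # assumes the requests library is installed

# Hypothetical retention rules: keep 30 days of data on 2 replicas, drop everything older
retention_rules = [
    {"type": "loadByPeriod", "period": "P30D", "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]

# Rules are posted to the Coordinator (default port 8081) for a given datasource
response = requests.post(
    "http://localhost:8081/druid/coordinator/v1/rules/example_data",
    json=retention_rules,
)
print(response.status_code)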

Extensive Integration Capabilities

Druid integrates seamlessly with various data ingestion and processing frameworks, including Apache Kafka, Apache Storm, and Apache Flink. It also supports integration with visualization tools like Apache Superset, Tableau, and Grafana, enabling users to build comprehensive analytics solutions.

Use Cases of Apache Druid

Real-Time Analytics

Druid is used in real-time analytics applications where the ability to ingest and query data in near real-time is critical. This includes monitoring applications, fraud detection, and customer behavior tracking.

Ad-Tech and Marketing Analytics

Druid’s ability to handle high-throughput data ingestion and fast queries makes it a popular choice in the ad tech and marketing industries. It can track user events, clicks, impressions, and conversion rates in real time to optimize campaigns.

IoT Data and Sensor Analytics

IoT applications produce time-series data at high volume. Druid’s architecture is optimized for time-series data analysis, making it ideal for analyzing IoT sensor data, device telemetry, and real-time event tracking.

Operational Dashboards

Druid is often used to power operational dashboards that provide insights into infrastructure, systems, or applications. The low-latency query capabilities ensure that dashboards reflect real-time data without delay.

Clickstream Analysis

Organizations leverage Druid to analyze user clickstream data on websites and applications, allowing for in-depth analysis of user interactions, preferences, and behaviors in real time.

The Architecture of Apache Druid

Apache Druid follows a distributed, microservice-based architecture. The architecture allows for scaling different components based on the system’s needs.

The main components are:

Coordinator and Overlord Nodes

  1. Coordinator Node: Manages data availability, balances the distribution of data across the cluster, and oversees segment management (segments are the basic units of storage in Druid).
  2. Overlord Node: Responsible for managing ingestion tasks. It works with the middle managers to schedule and execute data ingestion tasks, ensuring that data is ingested properly into the system.

Historical Nodes

Historical nodes store immutable segments of historical data. When queries are executed, historical nodes serve data from the disk, which allows for low-latency and high-throughput queries.

MiddleManager Nodes

MiddleManager nodes handle ingestion tasks. They ingest data from real-time streams (such as Kafka), transform it, and publish the resulting segments, which historical nodes then load and serve once the data has been persisted.

Broker Nodes

The broker nodes route incoming queries to the appropriate historical or real-time nodes and aggregate the results. They act as the query routers and perform query federation across the Druid cluster.

Query Nodes

Query nodes are responsible for receiving, routing, and processing queries. In Druid this role is played by the Brokers together with the optional Router, which acts as a unified gateway in front of them: they accept a variety of query types, including Druid SQL, route them to the appropriate data nodes for execution, and merge the results.

Deep Storage

Druid relies on an external deep storage system (such as Amazon S3, Google Cloud Storage, or HDFS) to store segments of data permanently. The historical nodes pull these segments from deep storage when they need to serve data.

Metadata Storage

Druid uses an external relational database (typically PostgreSQL or MySQL) to store metadata about the data, including segment information, task states, and configuration settings.

Advantages of Apache Druid

  1. Sub-Second Query Latency: Optimized for high-speed data queries, making it perfect for real-time dashboards.
  2. Scalability: Easily scales to handle petabytes of data.
  3. Flexible Data Ingestion: Supports both batch and real-time data ingestion from multiple sources like Kafka, HDFS, and Amazon S3.
  4. Column-Oriented Storage: Efficient data storage with high compression ratios and fast retrieval of specific columns.
  5. SQL Support: Familiar SQL-like querying capabilities for easy data analysis.
  6. High Availability: Fault-tolerant and highly available due to data replication across nodes.

Getting Started with Apache Druid

Installation and Setup

Setting up Apache Druid involves configuring a cluster with different node types, each responsible for specific tasks:

  1. Master Nodes: Oversee coordination, metadata management, and data distribution.
  2. Data Nodes: Handle data storage, ingestion, and querying.
  3. Query Nodes: Manage query routing and processing.

You can install Druid using a package manager, Docker, or by downloading and extracting the binary distribution. Here’s a brief overview of setting up Druid using Docker:

  1. Download the Docker Compose File:
    $ curl -O https://raw.githubusercontent.com/apache/druid/master/examples/docker-compose/docker-compose.yml
  2. Start the Druid Cluster: $ docker-compose up
  3. Access the Druid Console: Open your web browser and navigate to http://localhost:8888 to access the Druid console.

Ingesting Data

To ingest data into Druid, you need to define an ingestion spec that outlines the data source, input format, and parsing rules. Here’s an example of a simple ingestion spec for a CSV file:

JSON Code

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/csv",
        "filter": "*.csv"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      }
    },
    "dataSchema": {
      "dataSource": "example_data",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["column1", "column2", "column3"]
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}

Submit the ingestion spec through the Druid console or via the Druid API to start the data ingestion process.
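
If you go the API route, a batch spec like the one above is submitted as a task to the Overlord. Here is a minimal Python sketch, assuming the spec has been saved locally as ingestion_spec.json (a hypothetical file name) and the quickstart Router is listening on port 8888.

Python Code

import json
import requests  # assumes the requests library is installed

# Load the ingestion spec shown above (hypothetical local file name)
with open("ingestion_spec.json") as f:
    ingestion_spec = json.load(f)

# Submit it as a native batch task via the Overlord task endpoint
response = requests.post("http://localhost:8888/druid/indexer/v1/task", json=ingestion_spec)
print(response.json())  # on success, the response contains the task id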

Querying Data

Once your data is ingested, you can query it using Druid’s native query language or SQL. Here’s an example of a simple SQL query to retrieve data from the example_data data source:

SELECT __time, column1, column2, column3
FROM example_data
WHERE __time BETWEEN '2023-01-01' AND '2023-01-31'

Use the Druid console or connect to Druid from your preferred BI tool to execute queries and visualize data.
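
The same SQL can also be run programmatically against Druid's SQL-over-HTTP endpoint. A minimal Python sketch, again assuming the quickstart Router on port 8888:

Python Code

import requests  # assumes the requests library is installed

sql = """
SELECT __time, column1, column2, column3
FROM example_data
WHERE __time BETWEEN '2023-01-01' AND '2023-01-31'
"""

# Druid exposes SQL over HTTP at /druid/v2/sql; results come back as a JSON array of rows
response = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql})
for row in response.json():
    print(row)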

Conclusion

Apache Druid is a powerful, high-performance real-time analytics database that excels at handling large-scale data ingestion and querying. Its robust architecture, flexible data model, and extensive integration capabilities make it a versatile solution for a wide range of analytics use cases. The combination of real-time ingestion, sub-second query speeds, and horizontal scalability makes it a strong choice for businesses that need to analyze vast amounts of time-series and event-driven data, and its adoption continues to grow across industries. Whether you need real-time insights, interactive queries, or scalable OLAP capabilities, Apache Druid provides the tools to unlock the full potential of your data. Explore Apache Druid today and transform your data analytics landscape.

Need help transforming your real-time analytics with high-performance querying? Contact our experts today!

Watch the Apache Druid Blog Series

Stay tuned for the upcoming Apache Druid Blog Series:

  1. Why choose Apache Druid over Vertica
  2. Why choose Apache Druid over Snowflake
  3. Why choose Apache Druid over Google BigQuery
  4. Integrating Apache Druid with Apache Superset for Realtime Analytics

Streamlining Apache HOP Workflow Management with Apache Airflow

Introduction

In our previous blog, we discussed Apache HOP in more detail; in case you missed it, you can read it here: https://analytics.axxonet.com/comparison-of-and-migrating-from-pdi-kettle-to-apache-hop/. As a continuation of the Apache HOP article series, this post covers how to integrate Apache Airflow with Apache HOP. In the fast-paced world of data engineering and data science, efficiently managing complex workflows is crucial. Apache Airflow, an open-source platform for programmatically authoring, scheduling, and monitoring workflows, has become a cornerstone in many data teams’ toolkits. This blog post explores what Apache Airflow is, its key features, and how you can leverage it to streamline and manage your Apache HOP workflows and pipelines.

Apache HOP

Apache HOP is an open-source data integration and orchestration platform. For more details refer to our previous blog here.

Apache Airflow

Apache Airflow is an open-source workflow orchestration tool originally developed by Airbnb. It allows you to define workflows as code, providing a dynamic, extensible platform to manage your data pipelines. Airflow’s rich features enable you to automate and monitor workflows efficiently, ensuring that data moves seamlessly through various processes and systems.

Use Cases:

  1. Data Pipelines: Orchestrating ETL jobs to extract data from sources, transform it, and load it into a data warehouse.
  2. Machine Learning Pipelines: Scheduling ML model training, batch processing, and deployment workflows.
  3. Task Automation: Running repetitive tasks, like backups or sending reports.

DAG (Directed Acyclic Graph):

  1. A DAG represents the workflow in Airflow. It defines a collection of tasks and their dependencies, ensuring that tasks are executed in the correct order.
  2. DAGs are written in Python and allow you to define the tasks and how they depend on each other.

Operators:

  • Operators define a single task in a DAG. There are several built-in operators, such as:
    1. BashOperator: Runs a bash command.
    2. PythonOperator: Runs Python code.
    3. SQL operators (such as PostgresOperator or MySqlOperator): Execute SQL commands.
    4. SimpleHttpOperator: Makes HTTP requests.

Custom operators can also be created to meet specific needs.
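
As a rough sketch of what a custom operator looks like (the operator name and its message parameter are made up for illustration), you subclass BaseOperator and implement execute():

from airflow.models.baseoperator import BaseOperator

class PrintMessageOperator(BaseOperator):
    """Hypothetical operator that simply logs a message."""

    def __init__(self, message: str, **kwargs):
        super().__init__(**kwargs)
        self.message = message

    def execute(self, context):
        # execute() runs when the task is executed; its return value is pushed to XCom
        self.log.info("Message: %s", self.message)
        return self.message

It can then be used in a DAG like any built-in operator, for example PrintMessageOperator(task_id='say_hello', message='hello', dag=dag).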

Tasks:

  1. Tasks are the building blocks of a DAG. Each node in a DAG is a task that does a specific unit of work, such as executing a script or calling an API.
  2. Tasks are defined by operators and their dependencies are controlled by the DAG.

Schedulers:

  1. The scheduler is responsible for triggering tasks at the appropriate time, based on the schedule_interval defined in the DAG.
  2. It continuously monitors all DAGs and determines when to run the next task.

Executors:

  • The executor is the mechanism that runs the tasks. Airflow supports different types of executors:
    1. SequentialExecutor: Executes tasks one by one.
    2. LocalExecutor: Runs tasks in parallel on the local machine.
    3. CeleryExecutor: Distributes tasks across multiple worker machines.
    4. KubernetesExecutor: Runs tasks in a Kubernetes cluster.

Web UI:

  1. Airflow has a web-based UI that lets you monitor the status of DAGs, view logs, and check the status of each task in a DAG.
  2. It also provides tools to trigger, pause, or retry DAGs.

Key Features of Apache Airflow

Workflow as Code

Airflow uses Directed Acyclic Graphs (DAGs) to represent workflows. These DAGs are written in Python, allowing you to leverage the full power of a programming language to define complex workflows. This approach, known as “workflow as code,” promotes reusability, version control, and collaboration.

Dynamic Task Scheduling

Airflow’s scheduling capabilities are highly flexible. You can schedule tasks to run at specific intervals, handle dependencies, and manage task retries in case of failures. The scheduler executes tasks in a defined order, ensuring that dependencies are respected and workflows run smoothly.

Extensible Architecture

Airflow’s architecture is modular and extensible. It supports a wide range of operators (pre-defined tasks), sensors (waiting for external conditions), and hooks (interfacing with external systems). This extensibility allows you to integrate with virtually any system, including databases, cloud services, and APIs.
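
To make the sensor idea concrete, here is a minimal sketch that waits for a file to appear before processing it. The file path, schedule, and poll interval are assumptions; it relies on the built-in FileSensor and PythonOperator.

from datetime import datetime
from airflow import DAG
from airflow.sensors.filesystem import FileSensor
from airflow.operators.python import PythonOperator

def process_file():
    print("File arrived, processing...")

with DAG("wait_and_process", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # Poll every 60 seconds until the (hypothetical) file shows up
    wait_for_file = FileSensor(task_id="wait_for_file",
                               filepath="/data/incoming/report.csv",
                               poke_interval=60)
    process = PythonOperator(task_id="process_file", python_callable=process_file)
    wait_for_file >> process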

Robust Monitoring and Logging

Airflow provides comprehensive monitoring and logging capabilities. The web-based user interface (UI) offers real-time visibility into the status of your workflows, enabling you to monitor task progress, view logs, and troubleshoot issues. Additionally, Airflow can send alerts and notifications based on task outcomes.

Scalability and Reliability

Designed to scale, Airflow can handle workflows of any size. It supports distributed execution, allowing you to run tasks on multiple workers across different nodes. This scalability ensures that Airflow can grow with your organization’s needs, maintaining reliability even as workflows become more complex.

Getting Started with Apache Airflow

Installation using PIP

Setting up Apache Airflow is straightforward. You can install it using pip, Docker, or by deploying it on a cloud service. Here’s a brief overview of the installation process using pip:

1. Create a Virtual Environment (optional but recommended):

           python3 -m venv airflow_env

            source airflow_env/bin/activate

2. Install Apache Airflow: 

           pip install apache-airflow

3. Initialize the Database:

           airflow db init

4. Create a User:

           airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email [email protected]

5. Start the Web Server and Scheduler:

           airflow webserver --port 8080

           airflow scheduler

6. Access the Airflow UI: Open your web browser and go to http://localhost:8080.

Installation using Docker

Alternatively, pull the official Apache Airflow Docker image and run the container to access the Airflow web UI. Refer to the official Airflow Docker installation guide for more details.

Creating Your First DAG

Airflow DAG Structure:

A DAG in Airflow is composed of three main parts:

  1. Imports: Necessary packages and operators.
  2. Default Arguments: Arguments that apply to all tasks within the DAG (such as retries, owner, start date).
  3. Task Definition: Define tasks using operators, and specify dependencies between them.

Scheduling:

Airflow allows you to define the schedule of a DAG using schedule_interval:

  1. @daily: Run once a day at midnight.
  2. @hourly: Run once every hour.
  3. @weekly: Run once a week at midnight on Sunday.
  4. Cron expressions, like “0 12 * * *”, are also supported for more specific scheduling needs.

Steps:

  1. Define the DAG: Create a Python file (e.g., run_lms_transaction.py) in the dags folder of your Airflow installation directory.

    Example

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

dag = DAG('example_dag', default_args=default_args, schedule_interval='@daily')

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

start >> end

2. Deploy the DAG: Save the DAG Python script in the dags folder (~/airflow/dags by default). Airflow will automatically detect and load the DAG.

3. Monitor the DAG: Access the Airflow UI, where you can view and manage the newly created DAG. Trigger the DAG manually or wait for it to run according to the defined schedule.

Calling the Apache HOP Pipelines/Workflows from Apache Airflow

In this example, we walk through how to integrate Apache HOP with Apache Airflow. Here, Apache Airflow and Apache HOP run in two separate, independent Docker containers. The Apache HOP ETL pipelines/workflows are configured with a persistent volume storage strategy so that the DAG code can trigger their execution from Airflow.

Steps

  1. Define the DAG: Create a Python file (e.g., Stg_User_Details.py) in the dags folder of your Airflow installation directory.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.docker_operator import DockerOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from docker.types import Mount

default_args = {
    'owner': 'airflow',
    'description': 'Stg_User_details',
    'depend_on_past': False,
    'start_date': datetime(2022, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('Stg_User_details', default_args=default_args, schedule_interval='0 10 * * *', catchup=False, is_paused_upon_creation=False) as dag:

    start_dag = DummyOperator(
        task_id='start_dag'
    )

    end_dag = DummyOperator(
        task_id='end_dag'
    )

    hop = DockerOperator(
        task_id='Stg_User_details',
        # use the Apache Hop Docker image. Add your tags here in the default apache/hop: syntax
        image='test',
        api_version='auto',
        auto_remove=True,
        environment={
            'HOP_RUN_PARAMETERS': 'INPUT_DIR=',
            'HOP_LOG_LEVEL': 'TRACE',
            'HOP_FILE_PATH': '/opt/hop/config/projects/default/stg_user_details_test.hpl',
            'HOP_PROJECT_DIRECTORY': '/opt/hop/config/projects/',
            'HOP_PROJECT_NAME': 'ISON_Project',
            'HOP_ENVIRONMENT_NAME': 'ISON_Env',
            'HOP_ENVIRONMENT_CONFIG_FILE_NAME_PATHS': '/opt/hop/config/projects/default/project-config.json',
            'HOP_RUN_CONFIG': 'local',
        },
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge",
        force_pull=False,
        mount_tmp_dir=False
    )

    start_dag >> hop >> end_dag

Note: For reference purposes only.

2. Deploy the DAG: Save the file in the dags folder. Airflow will automatically detect and load the DAG.

After successful deployment, the new "Stg_User_Details" DAG should be listed under the Active and All tabs in the Airflow portal.

3. Run the DAG: We can trigger the pipeline or workflow from Airflow by clicking the Trigger DAG option in the Airflow UI.

4. Monitor the DAG: Access the Airflow UI, where you can view and manage the newly created DAG. Trigger the DAG manually or wait for it to run according to the defined schedule.

After successful execution, the Airflow UI shows the run status for the "Stg_User_Details" DAG, along with its execution history and log details.

Managing and Scaling Workflows

  1. Use Operators and Sensors: Leverage Airflow’s extensive library of operators and sensors to create tasks that interact with various systems and handle complex logic.
  2. Implement Task Dependencies: Define task dependencies using the >> and << operators to ensure tasks run in the correct order (see the sketch after this list).
  3. Optimize Performance: Monitor task performance through the Airflow UI and logs. Adjust task configurations and parallelism settings to optimize workflow execution.
  4. Scale Out: Configure Airflow to run in a distributed mode by adding more worker nodes, ensuring that the system can handle increasing workload efficiently.
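
A quick sketch of the dependency operators mentioned above, assuming a dag object like the ones defined in the earlier examples and purely illustrative task names:

from airflow.operators.dummy import DummyOperator

extract = DummyOperator(task_id="extract", dag=dag)
transform_a = DummyOperator(task_id="transform_a", dag=dag)
transform_b = DummyOperator(task_id="transform_b", dag=dag)
load = DummyOperator(task_id="load", dag=dag)

# Fan out from extract to both transforms, then fan back in to load
extract >> [transform_a, transform_b] >> load
# The same edge can also be written with <<, reading right to left, e.g. load << transform_a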

Conclusion

Apache Airflow is a powerful and versatile tool for managing workflows and automating complex data pipelines. Its “workflow as code” approach, coupled with robust scheduling, monitoring, and scalability features, makes it an essential tool for data engineers and data scientists. By adopting Airflow, you can streamline your workflow management, improve collaboration, and ensure that your data processes are efficient and reliable. Explore Apache Airflow today and discover how it can transform your data engineering workflows.

Streamline your Apache HOP workflow management with Apache Airflow through our team of experts.

Upcoming Apache HOP Blog Series

Stay tuned for the upcoming Apache HOP Blog Series:

  1. Migrating from Pentaho ETL to Apache Hop
  2. Integrating Apache Hop with Apache Superset
  3. Comparison of Pentaho ETL and Apache Hop