
DuckDB for Enterprise Analytics: Fast Analytics Without Heavy Data Warehouses

Accelerating Enterprise Analytics with DuckDB

As organizations scale their data platforms, the cost and complexity of maintaining large analytical warehouses, ETL workflows, and pipelines continue to grow. Many teams struggle to deliver fast dashboards and analytical insights while maintaining efficient infrastructure and manageable operational overhead.

Modern analytics platforms must process large datasets, support interactive dashboards, and integrate with multiple data sources. Traditional approaches often rely on heavy data warehouses and building complex workflows and pipelines that require significant infrastructure, maintenance, and cost.

At Axxonet, we leverage DuckDB to build lightweight analytical layers that power high-performance dashboards and reporting systems without having to build complex ETLs and data warehouses. With its in-process architecture and columnar execution engine, DuckDB enables organizations to query large datasets quickly while significantly reducing infrastructure and development complexity. 

Consider this a ‘DuckDB appetizer.’ We’re here to get you acquainted with the core concepts and benefits, saving the heavy architectural lifting for another day.

The Growing Need for Lightweight and High-Performance Analytics Processing

Traditional relational databases such as PostgreSQL and MySQL are primarily designed for transactional workloads (OLTP). Although they can support analytical queries and mixed workloads for small to moderate datasets, their row-oriented storage and transactional optimizations make them less efficient for large-scale analytical processing.

As organizations scale their reporting and analytics capabilities, several challenges begin to emerge.

  1. Delivering Fast Dashboards Without Heavy Data Warehouses

Modern business environments require interactive dashboards and near real-time insights. However, traditional operational databases are not optimized for analytical queries involving large scans and aggregations.

Common challenges include:

  • Analytical queries competing with transactional workloads
  • Large aggregations slowing down operational systems
  • Dashboard queries scanning large volumes of data
  • Performance degradation as reporting usage increases

To address these issues, many organizations deploy separate analytical warehouses, which increases infrastructure complexity and cost.

  2. Complexity of Building and Maintaining ETL Pipelines

Traditional analytics architectures often rely on ETL pipelines to move and transform data before it becomes available for reporting.

This introduces additional operational challenges:

  • Building complex ETL workflows across multiple systems
  • Maintaining scheduled pipelines and data refresh processes
  • Managing intermediate staging tables and transformation layers
  • Difficulty supporting near real-time analytics

As data sources grow, maintaining these ETL pipelines becomes increasingly complex and resource-intensive.

  3. Complexity of Data Warehouse Architecture and Maintenance

To support analytical workloads, organizations often deploy separate data warehouses or data lake architectures. While these systems provide analytical capabilities, they also introduce new layers of infrastructure and management.

Typical challenges include:

  • Provisioning and managing large analytical databases
  • Designing and maintaining warehouse schemas
  • Managing infrastructure costs for always-on analytical systems
  • Handling scaling, backups, and performance tuning

For small-to-medium analytical workloads, this level of infrastructure can become overly complex and costly.

  4. Row-Oriented vs Column-Oriented Processing

Traditional relational databases use row-oriented storage, which is highly efficient for transactional workloads but less optimal for analytical queries.

Key limitations include:

  • Row-based storage slows down large analytical scans
  • Queries must read entire rows even when only a few columns are required
  • Aggregations across large datasets become inefficient
  • Normalized schemas make analytical queries more complex.

In contrast, column-oriented analytical engines are designed to process large datasets efficiently by reading only the required columns and applying optimized vectorized execution.

The Need for Modern Analytical Engines

As reporting requirements grow, these limitations make it difficult to deliver fast dashboards, scalable analytics, and simplified data architectures.

This is where modern analytical engines such as DuckDB become valuable. Designed specifically for analytical workloads, DuckDB provides a lightweight, high-performance engine capable of running complex analytical queries without requiring heavy infrastructure or complex data warehouse environments. Often described as “SQLite for analytics,” DuckDB is an open-source, in-process analytical database with a columnar, vectorized query execution engine built specifically for OLAP workloads. Unlike traditional databases, DuckDB runs directly inside the host application as a lightweight library, without requiring a separate server process.
This design makes it lightweight, portable, and extremely efficient for analytical processing.

Core Capabilities That Make DuckDB Effective

  • In-Process Architecture: Runs directly inside your application process.
  • Lightweight & Portable: Small installation footprint; easily embeddable in Python, R, Java, and other applications.
  • Instant Setup: Install via a package manager (e.g., pip) and start querying immediately without provisioning.
  • High Performance: Columnar storage engine with vectorized execution optimized for OLAP workloads.
  • Serverless Simplicity: No infrastructure management, configuration, or maintenance overhead.
  • Parallel Execution: Automatically uses multiple CPU cores for faster processing of large analytical queries.
  • Ideal for Modern Workflows: Works seamlessly in notebooks; suitable for local analytics, embedded BI, and data lake querying.
  • SQLite for Analytics: Similar simplicity to SQLite, but built specifically for analytical (OLAP) processing.
  • Ease of Deployment:
  1. Local Machine
  2. Docker Container
  3. Cloud (AWS, Azure, GCP)
  4. Enterprise Servers

Why Do Companies Need DuckDB?

  • Reduce Infrastructure Complexity: Eliminates the need for separate database servers for lightweight and embedded analytics workloads.
  • Lower Costs: Avoids always-on cloud warehouses for small-to-medium analytical tasks.
  • Embedded Analytics in Applications: SaaS and enterprise apps can ship with built-in analytics capability.
  • High Performance on Local Hardware: Delivers warehouse-like OLAP performance using columnar storage and parallel execution.
  • Works with Existing Databases: Can query live data from systems like PostgreSQL and MySQL without heavy migration.
  • Supports Modern Data Workflows: Ideal for notebooks, ETL pipelines, edge analytics, and hybrid cloud setups.

Real Industry Use Cases

  • Data Science & ML Prototyping: Data scientists use DuckDB inside notebooks to analyze large datasets without exporting data to external warehouses.
  • Embedded Analytics in Applications: SaaS and enterprise applications embed DuckDB to enable fast, user-level analytics within the application itself.
  • ETL & Data Transformation: DuckDB acts as a high-performance transformation engine for Parquet-based data lakes and batch processing workflows.
  • BI Acceleration: BI tools connect directly to DuckDB to power fast, lightweight dashboards and reporting.
  • Unified Analytics Layer Across Multiple Data Sources: DuckDB queries databases, files, and data lake formats together, acting as a single analytical layer over disparate sources.

How Axxonet Integrates DuckDB for BI Platform

DuckDB stores data inside a single portable .duckdb database file and runs directly inside applications without requiring a dedicated database server.

At Axxonet, we use DuckDB to provide:

  • A lightweight serving layer for BI applications such as Superset and Streamlit
    Apache Superset is an open-source data exploration and visualization platform. In our previous article, “Unlocking Data Insights with Apache Superset”, we elaborated on Superset in detail.
  • An embedded analytical warehouse
  • High-performance query execution
  • A unified analytics data layer across multiple data sources

This architecture significantly improves dashboard performance while reducing development and operational complexity.

DuckDB as an ETL Layer: Querying and Transforming Data from Multiple Sources

DuckDB is increasingly used as a lightweight ETL/ELT engine that can replace or complement traditional ETL processes for data warehouses. In many enterprise environments, analytics requires combining data from: 

  • Operational databases
  • Data lakes
  • Application APIs
  • Log files
  • External datasets

DuckDB enables efficient analysis of a wide range of data sources, including everyday Excel files, large log datasets, and personal data stored on edge devices. Its lightweight, in-process architecture allows users to perform advanced data processing and analytics directly on their local machines without the need for external database infrastructure.

In addition to exploratory data analysis, DuckDB can be used to prepare and transform datasets for machine learning workflows. Because the processing occurs locally, sensitive data remains on the user’s system, helping maintain strong data privacy and security.

Furthermore, DuckDB can serve as the foundation for building lightweight analytical systems, including embedded data warehouses and data processing applications, making it suitable for both individual data analysis and enterprise analytics solutions.

Example: Data Transformation Query

The example below demonstrates how DuckDB can combine data from multiple sources within a single SQL statement.

It provides powerful capabilities for data transformation and integration.
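
A minimal sketch of such a query using the DuckDB Python API. The Parquet files, the PostgreSQL connection string, and the table and column names below are illustrative assumptions, not a fixed recipe:

import duckdb

con = duckdb.connect("analytics.duckdb")

# The postgres extension lets DuckDB query a live PostgreSQL database directly.
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=erp host=localhost user=report' AS pg (TYPE postgres, READ_ONLY)")

# One SQL statement joins Parquet files from a data lake with a live PostgreSQL table.
monthly_sales = con.execute("""
    SELECT c.region,
           date_trunc('month', s.sale_date) AS month,
           SUM(s.amount)                    AS total_sales
    FROM read_parquet('data/sales/*.parquet') AS s
    JOIN pg.public.customers AS c
      ON c.customer_id = s.customer_id
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()

print(monthly_sales.head())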

Key ETL Capabilities:

  • Query data directly from files
  • Combine databases, files, and APIs in a single SQL query
  • Reduce the need for intermediate staging tables
  • Execute transformations using a vectorized analytical engine
  • Support multiple file formats such as CSV, Parquet, and JSON
  • Connect to external databases such as PostgreSQL or MySQL.

DuckDB Integration Approaches Evaluated

We evaluated three DuckDB architectural approaches against PostgreSQL (the source database) to measure Apache Superset dashboard performance.

Approach 1:  DuckDB Views on Live PostgreSQL

1. Create a view (aggregate query) pointing to live Postgres source tables
2. Create a Superset dataset on the DuckDB view (a sketch of this approach follows below)
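
A minimal sketch of Approach 1, assuming DuckDB’s postgres extension and an illustrative sales table; the Superset dataset then points at the daily_sales view:

import duckdb

con = duckdb.connect("serving_layer.duckdb")
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=erp host=pg-host user=report' AS pg (TYPE postgres, READ_ONLY)")

# The view is evaluated against live PostgreSQL data every time Superset queries it.
# Note: ATTACH is session-scoped, so the process that queries the view must attach PostgreSQL too.
con.execute("""
    CREATE OR REPLACE VIEW daily_sales AS
    SELECT sale_date, region, SUM(amount) AS total_amount
    FROM pg.public.sales
    GROUP BY sale_date, region
""")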

Approach 2:  Full Data Import into DuckDB

1. Import source Postgres tables into DuckDB tables
2. Create a Superset dataset (aggregate query) that points to the DuckDB tables

Approach 3:  Import with Incremental Refresh

1. Import Postgres source tables into DuckDB tables
2. Create a view (aggregate query) on the DuckDB tables
3. Create a Superset dataset on the DuckDB view

Incremental refresh can be handled through scheduled scripts, as sketched below. This approach ensures faster dashboards while maintaining near real-time data freshness.
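
A minimal sketch of such a scheduled refresh script, assuming the source table carries an updated_at timestamp column (all names are illustrative); the aggregate view from Approach 1 would then select from the local sales table instead of the live PostgreSQL one:

import duckdb

con = duckdb.connect("serving_layer.duckdb")
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=erp host=pg-host user=report' AS pg (TYPE postgres, READ_ONLY)")

existing = {row[0] for row in con.execute("SHOW TABLES").fetchall()}
if "sales" not in existing:
    # First run: full import into a local DuckDB table.
    con.execute("CREATE TABLE sales AS SELECT * FROM pg.public.sales")
else:
    # Subsequent runs: append only rows changed since the last refresh.
    con.execute("""
        INSERT INTO sales
        SELECT *
        FROM pg.public.sales
        WHERE updated_at > (SELECT MAX(updated_at) FROM sales)
    """)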

Why DuckDB Over Traditional RDBMS for Analytics?

Dashboard performance is critical for delivering real-time insights and a smooth user experience. For small-to-medium analytical datasets ranging from a few gigabytes to approximately 100+ GB, DuckDB often outperforms traditional RDBMS databases. In our projects, DuckDB has supported 100+ concurrent users while delivering significantly faster query performance.

DuckDB (OLAP RDBMS) and PostgreSQL (OLTP RDBMS) are widely used SQL databases for managing structured data in modern analytics environments. Understanding their capabilities helps in choosing the right database for specific use cases.

Performance Benchmark: DuckDB vs Traditional RDBMS

Performance was evaluated by executing the same analytical query multiple times across PostgreSQL and DuckDB approaches. DuckDB showed significantly faster execution.

Accelerating Dashboards Using DuckDB

After processing and loading the data comes the most critical step: making it make sense. Summarizing your results into visuals doesn’t just make them look good—it makes them useful. For those using DuckDB, the Apache Superset integrations provide the fastest path from raw data to a finished dashboard.

Watch out for the next article, “Simplifying Modern Data Analytics,” in the “DuckDB for Enterprise Analytics” series, which focuses primarily on dashboards and reporting using DuckDB.

Deployment Options

1. Local Deployment
	$ pip install duckdb
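
Once installed, a first query takes only a few lines. A minimal sketch, assuming a local events.csv sample file (the file name is illustrative):

import duckdb

# In-process: no server to start, and a single file holds the whole database.
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE OR REPLACE TABLE events AS SELECT * FROM read_csv_auto('events.csv')")
print(con.execute("SELECT COUNT(*) AS row_count FROM events").fetchone())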


2. Docker Deployment
Place the .duckdb file under the databases directory and mount it into the Superset container.

docker-compose.yml:

  superset:
    volumes:
      - ./databases:/app/databases
    command: >
      pip install duckdb-engine &&
      /usr/bin/run-server.sh
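
Inside Superset, the mounted file can then be registered as a database connection using the SQLAlchemy URI syntax provided by the duckdb-engine package, typically of the form duckdb:////app/databases/analytics.duckdb (the file name here is illustrative).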


3. Cloud Deployment

  • MotherDuck Cloud (Managed DuckDB Platform)
  • Cloud VM Deployment (AWS, Azure, GCP)

Why Organisations Partner with Axxonet

Organisations partner with Axxonet because we combine deep expertise in data engineering, analytics architecture, and enterprise automation.

What Sets Axxonet Apart

  • Strong expertise in modern analytical databases and ETL architectures
  • Experience integrating DuckDB with enterprise data ecosystems
  • Scalable architectures for analytics and reporting workloads
  • Optimised pipelines for performance and maintainability
  • Flexible deployments across cloud, hybrid, and on-prem environments
  • Proven ability to accelerate analytics initiatives while reducing infrastructure costs

We focus on building high-performance data platforms that scale with enterprise growth.

Conclusion

DuckDB combines high-performance analytics with powerful data processing capabilities. It not only accelerates analytical queries but also serves as an efficient engine for ETL and data transformation.

  • High-performance analytics on large datasets
  • Efficient ETL and data transformation workflows
  • Flexible integration with databases, files, and data lakes
  • Lightweight architecture with minimal infrastructure requirements

This versatility makes DuckDB an all-round solution for analytics and data processing.

Official Links for DuckDB Integrations

While writing this article on DuckDB for enterprise analytics, we referred to the official documentation and resources listed below:

🔹 DuckDB Official Resources

These links will help you explore DuckDB and its integrations in more depth.

Other Posts in the Blog Series

If you would like to enable this capability in your application, please get in touch with us at [email protected] or update your details in the form

Change Data Capture (CDC) with Debezium, Apache Kafka & Apache Hop

Introduction

Change Data Capture (CDC) is a software design pattern that tracks row-level changes in a database — inserts, updates, and deletes — and streams those changes in near real-time to downstream consumers. Rather than polling tables or running batch jobs, CDC provides a low-latency, event-driven mechanism for propagating data changes across systems.

The Polling Problem: A pipeline polling every 5 minutes introduces up to 5 minutes of latency — and completely misses any row inserted and deleted between polls. CDC eliminates both issues, providing sub-second latency with a complete audit trail.

If you are still building data pipelines around scheduled batch jobs and timestamp-based polling, you are leaving real-time capability — and data fidelity — on the table. The Debezium + Kafka + Apache Hop stack gives you a production-grade CDC pipeline that is battle-tested, open-source, and surprisingly approachable once you understand the moving parts.

Core Concepts

What is Change Data Capture?

Traditional ETL pipelines extract full or incremental snapshots of data from source systems on a scheduled basis. This approach introduces latency (minutes to hours), imposes load on source databases, and misses intermediate state changes between polling intervals.

CDC solves this by reading the database’s write-ahead log (WAL) or transaction log — the same mechanism databases use for internal recovery — and converting every committed transaction into a structured event. This means:

  • Every row-level change (INSERT, UPDATE, DELETE) is captured immediately after commit
  • The source database experiences minimal additional load
  • Intermediate states between poll intervals are preserved
  • Schema changes (DDL events) can also be captured

The Result: Sub-second latency from source commit to downstream consumer. Zero missed deletes. Minimal load on the source database. A complete, ordered, timestamped audit trail of every change — out of the box.

Meet the Stack

This guide covers the end-to-end architecture and implementation of a CDC pipeline using three open-source tools:

Debezium — The CDC Engine

Debezium is an open-source CDC platform built on top of Kafka Connect. It connects to your database, reads the transaction log, and publishes every change as a structured JSON event to a Kafka topic — one topic per table, automatically.

It supports PostgreSQL, MySQL, SQL Server, Oracle, and more. No triggers. No shadow tables. Just direct log reading with minimal overhead.

Apache Kafka — The Event Streaming Backbone

Kafka is the backbone. It receives Debezium’s events and holds them durably — replayable, partitioned by primary key, and retained as long as you need. Downstream consumers can read at their own pace without affecting the source.

Think of it as a distributed, ordered, infinitely replayable changelog for your entire data ecosystem.

Apache Hop — The Transformation Layer

Apache Hop is a visual data orchestration platform. It consumes Kafka topics, parses the Debezium event envelope, routes events by operation type, transforms them, and loads them to any target—data warehouse, data lake, or downstream service.

Its visual pipeline designer makes CDC logic transparent and maintainable — no black-box ETL scripts.

ClickHouse — The Analytical Target

ClickHouse is a column-oriented OLAP database designed for high-throughput ingestion and millisecond analytical queries at scale. It is where the CDC events ultimately land — continuously updated from the source via Debezium and Hop and available for real-time reporting, dashboards, and aggregations. We cover ClickHouse in depth in its own section below.

Architecture Overview

The reference architecture for a Debezium + Kafka + Hop + ClickHouse CDC pipeline consists of five logical layers:

Layer | Component | Role
1 | Source Database | PostgreSQL / MySQL / MongoDB — emits a transaction log on every commit
2 | Debezium (Kafka Connect) | Reads the log, converts changes to JSON events, publishes to Kafka
3 | Apache Kafka | Stores events durably, partitioned by row PK, retained for replay
4 | Apache Hop | Consumes events, parses & routes by operation type, loads to target
5 | ClickHouse | Columnar storage, OLAP queries, real-time analytics & BI tools

Architecture Flow

Source DB [PostgreSQL/MySQL] → Debezium / Kafka Connect → Kafka Topics (per table) → Apache Hop → ClickHouse

The Debezium CDC Engine

Every Debezium event carries a full before/after snapshot of the row and is published as a JSON message to a Kafka topic. Each message contains a structured envelope with the following key fields:

  • before — the row state before the change (null for INSERTs)
  • after — the row state after the change (null for DELETEs)
  • source — metadata: connector name, database, table, transaction ID, LSN/binlog position, timestamp
  • ts_ms — event timestamp in milliseconds
  • op — operation type: “c” = create, “u” = update, “d” = delete, “r” = initial snapshot read

Apache Hop routes each event to the correct action based on the op field.
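
To make the envelope concrete, here is a minimal Python sketch (independent of Hop) of how a consumer could parse a Debezium message and route it on the op field. The payload wrapper assumes the default JSON converter with schemas enabled, and the sample record is purely illustrative:

import json

def route_event(raw_message: bytes):
    """Parse a Debezium envelope and decide what to do with it based on 'op'."""
    envelope = json.loads(raw_message)["payload"]  # JSON converter with schemas enabled wraps the event in 'payload'
    op = envelope["op"]

    if op in ("c", "r"):   # insert, or a row read during the initial snapshot
        return "insert", envelope["after"]
    if op == "u":          # update: old state in 'before', new state in 'after'
        return "upsert", envelope["after"]
    if op == "d":          # delete: only 'before' is populated
        return "soft_delete", envelope["before"]
    return "skip", None

# A heavily trimmed example of what Debezium publishes for an UPDATE:
sample = {
    "payload": {
        "before": {"id": 42, "status": "NEW"},
        "after":  {"id": 42, "status": "SHIPPED"},
        "op": "u",
        "ts_ms": 1700000000000,
    }
}
print(route_event(json.dumps(sample).encode()))  # -> ('upsert', {'id': 42, 'status': 'SHIPPED'})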

Kafka Event Streaming

By default, Debezium publishes each table’s events to a dedicated Kafka topic whose name combines the configured topic prefix with the schema (or database) and table names; for PostgreSQL this is <topic.prefix>.<schemaName>.<tableName>.

Offset Tracking & Exactly-Once Guarantees

Debezium stores its read position (LSN for PostgreSQL, binlog coordinates for MySQL) in a dedicated Kafka topic called the offset storage topic. This provides:

  • Crash recovery — the connector resumes from the last committed offset after restart
  • Exactly-once delivery when combined with Kafka transactions and idempotent producers
  • Reprocessing capability — offsets can be manually reset to replay historical changes

Getting Started: Key Setup Steps

Prerequisites

  • Apache Kafka 3.x with Kafka Connect (distributed or standalone mode)
  • Source database with CDC/replication enabled (see per-database instructions below)
  • Debezium connector plugin JARs on the Kafka Connect plugin path
  • Docker and Docker Compose (for the local setup described below)

Seeing It in Action

The quickest way to get the full stack running locally is Docker Compose—a single command brings up Kafka, Zookeeper, Kafka Connect (with Debezium), Kafka UI, and MySQL/PostgreSQL source databases together. No manual installation needed.

Set up the stack

Define all five services in a docker-compose.yml and start them with a single docker compose up -d.

Once healthy, the services are available at: Kafka on :9092, Kafka Connect REST API on :8083, and Kafka UI on :8080.

Register the Debezium Connector

With the stack running, register a connector via the Kafka Connect REST API. One POST request is all it takes — Debezium takes an initial snapshot of your tables and then switches to real-time streaming automatically.
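
A minimal sketch of that registration call for a PostgreSQL source, using Python’s requests library; the host names, credentials, and table list are illustrative and would come from your own environment:

import json
import requests

connector = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.dbname": "orderdb",
        "topic.prefix": "orderdb",
        "table.include.list": "public.customers,public.orders",
        "plugin.name": "pgoutput",
    },
}

# One POST to the Kafka Connect REST API registers the connector; Debezium then
# snapshots the included tables and switches to streaming the transaction log.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(resp.status_code, resp.json())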

Browse Live Events in Kafka UI

Open localhost:8080 in your browser. Kafka UI gives you a real-time view of everything flowing through the pipeline — topics, messages, consumer group lag, and connector health — without touching the command line.

Under Topics, you’ll see a dedicated topic for each captured table (e.g. orderdb.customers). Click into any topic and open the Messages tab to inspect the full Debezium envelope — before, after, op, and source metadata — for every change in real time.

The Kafka Connect section of Kafka UI also lets you view, pause, restart, and monitor connectors visually — no REST API calls needed for day-to-day management.

Consuming CDC Events with Apache Hop

Apache Hop is a visual data orchestration platform where you build pipelines as graphs — drag steps onto a canvas, connect them, and configure each one.

It has a native Kafka Consumer step, which makes it well-suited for consuming Kafka topics, parsing the event envelope, and routing events to their targets.

The visual approach makes CDC pipelines more maintainable than hand-rolled consumer code, and Hop’s error-handling steps make it straightforward to implement dead-letter queue patterns for malformed events.

Key Hop concepts relevant to CDC pipelines are the following:

Hop Concept | Description
Pipeline | A data flow graph — transforms data row by row using connected steps
Workflow | An orchestration graph — sequences actions, calls pipelines, handles errors
Metadata | Reusable connection definitions (Kafka, JDBC, etc.) stored as XML or in a metadata store

Pipeline Flow

Kafka Consumer (get message) → Parse JSON → Filter by op → Insert / Upsert / Soft-Delete → Target (ClickHouse)

Why ClickHouse as the Target Database

ClickHouse is an open-source column-oriented OLAP database originally developed at Yandex and now maintained by ClickHouse Inc. It is purpose-built for one thing: ingesting large volumes of data continuously and answering complex analytical queries across billions of rows in milliseconds.

Unlike row-oriented databases like PostgreSQL or MySQL — which store each record as a contiguous unit on disk — ClickHouse stores each column independently. This means analytical queries that touch only a few columns scan dramatically less data. Compression ratios are also far higher because similar values sit together. The result is query performance that would be impossible on a transactional database of the same scale.

Why ClickHouse is the right CDC target for analytics: Debezium keeps ClickHouse continuously updated with sub-second latency while ClickHouse handles all analytical load. You get real-time reporting dashboards and aggregations without ever running a single analytical query against your production OLTP database.

CDC vs. Polling: Comparison

Capability | CDC (Debezium) | Query Polling
Latency | Sub-second | Minutes to hours
Captures deletes | ✓ Yes | ✗ No
Intermediate states | ✓ Preserved | ✗ Lost
Source DB load | Very low | Moderate to high
Requires table schema change | ✓ No | ✗ Yes (needs an updated_at column)
Event replay | ✓ Yes (Kafka retention) | ✗ No
Setup complexity | Moderate | Low

When This Stack Is the Right Choice

CDC is the right tool when you need low latency, when your use case depends on capturing every change, including deletes, or when polling is adding unacceptable load to your source system. It’s particularly well-suited for:

  • Real-time analytics — stream OLTP changes into ClickHouse for sub-second analytical query freshness
  • Operational reporting — live dashboards over production data without touching the source DB
  • Audit and compliance trails — every insert, update, and delete is captured and timestamped
  • Cache and search sync — keep Redis, Elasticsearch, or other secondary stores consistent with the source
  • Zero-downtime migrations — stream data continuously from the old system to the new one during cutover

Other Related Blog Posts

Reach out to us at [email protected] or submit your details via our contact form.

Integrating Apache Hop with n8n: Axxonet’s Blueprint for Scalable, Automation‑Driven Data Pipelines

Enterprises today are under pressure to modernise their data operations while reducing cost, improving reliability, and accelerating decision-making. Automation platforms are no longer expected simply to trigger notifications or orchestrate APIs—they must reliably control data-intensive pipelines that feed analytics, reporting, and operational intelligence.

At Axxonet, we combine n8n’s event‑driven automation with Apache Hop’s scalable data transformation engine to deliver a unified architecture that is robust, maintainable, and built for enterprise growth.

Why Enterprises Need More Than Automation Alone

Automation tools like n8n excel at orchestrating business logic, but they are not designed to handle the complexity and scale of modern data engineering. As data volumes grow, organisations face challenges such as:

  • Workflows becoming unmanageable due to embedded transformation logic
  • Performance bottlenecks when processing large datasets
  • Difficulty maintaining JavaScript-heavy transformations
  • Limited reusability and governance across teams
  • Increased debugging and operational overhead

This is where Apache Hop becomes the perfect complement.

The Strategic Value of Apache Hop in Enterprise Data Pipelines

Apache Hop is a modern, open-source ETL and data orchestration platform built for clarity, scalability, and operational efficiency. 

Apache Hop’s architecture is fundamentally different from legacy ETL tools, focusing on metadata-driven design, portability, and cloud-native execution. We explored this in detail in our article on Apache Hop architecture and development, including how its design enables long-term scalability and maintainability.

For data management leaders, Hop offers:

Visual, Reusable, Governed Pipelines

Pipelines are easy to version, audit, and reuse—critical for teams managing dozens of data flows.

Enterprise-Grade Data Processing

Hop supports relational databases, cloud warehouses, files, streaming, and batch workloads—ensuring flexibility across environments.

Parameterised, Automation-Friendly Execution

Pipelines accept runtime parameters, enabling dynamic execution from n8n based on business events.

Clear Separation of Concerns

Automation logic stays in n8n; data transformation logic stays in Hop—reducing risk and improving maintainability.

The Combined Value: n8n + Apache Hop

For CXOs, the combined architecture delivers measurable business outcomes:

Enterprise Priority | How n8n Helps | How Hop Helps | Combined Impact
Operational Efficiency | Automates triggers & workflows | Executes heavy ETL | Faster, leaner operations
Scalability | Event-driven orchestration | High-volume data processing | Scale automation & data pipelines independently
Governance & Compliance | Audit trails & workflow logs | Versioned, visual pipelines | End-to-end traceability
Cost Optimization | Lightweight automation | Open-source ETL | Lower TCO vs. proprietary tools
Time-to-Value | Rapid workflow creation | Reusable pipelines | Faster deployment of new data products

This architecture is not just technically sound—it is strategically aligned with enterprise modernisation goals.

How Axxonet Implements This Architecture for Enterprises

Axxonet brings deep expertise in automation, data engineering, and enterprise integration. Our implementation approach ensures reliability, governance, and long-term scalability.

Architecture Overview: n8n + Apache Hop

1. Event-Driven Orchestration with n8n

We design n8n workflows that handle:

  • Business event triggers
  • Pre-execution validation
  • Parameter construction
  • Execution control (retries, timeouts, fail-fast logic)
  • Post-execution branching and notifications

This keeps automation logic clean, resilient, and easy to maintain.

2. Scalable ETL Pipelines with Apache Hop

Our Hop pipelines handle:

  • Multi-source extraction
  • Data cleansing, enrichment, and validation
  • Business rule application
  • Aggregations and transformations
  • Loading into analytics or operational systems

Pipelines are reusable across teams and environments.

3. Flexible Integration Patterns

Depending on scale and infrastructure, we integrate Hop with n8n using:

  • Command-line execution for lightweight deployments (see the sketch after this list)
  • Containerised Hop pipelines for scalable, isolated execution
  • Dynamic parameter passing for multi-tenant or multi-environment use cases
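
As a rough sketch of the command-line pattern, an automation step (for example an n8n Execute Command node or any scheduler) can invoke Hop’s hop-run launcher with runtime parameters. The install path, project name, pipeline file, and parameter names below are assumptions:

import subprocess

# Launch a Hop pipeline with runtime parameters supplied by the automation layer.
result = subprocess.run(
    [
        "/opt/hop/hop-run.sh",
        "--project=sales_dwh",
        "--file=pipelines/load_daily_sales.hpl",
        "--runconfig=local",
        "--parameters=RUN_DATE=2024-06-01,REGION=EMEA",
    ],
    capture_output=True,
    text=True,
)

# A non-zero exit code signals failure back to the orchestrating workflow.
print(result.returncode)
print(result.stdout[-500:])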

4. Enterprise-Ready Governance

Axxonet ensures:

  • CI/CD integration for pipelines
  • Git-based version control
  • Centralised logging and monitoring
  • Secrets management and environment isolation
  • Auditability across automation and ETL layers

This is where our engineering discipline becomes a differentiator.

Example Enterprise Use Case: Automated Analytics Ingestion

A daily reporting pipeline that consolidates data from multiple operational systems into an analytics warehouse.

n8n handles:

  • Triggering (schedule, webhook, or API)
  • Validating source system readiness
  • Passing runtime parameters
  • Monitoring execution status
  • Sending alerts and triggering downstream workflows

Apache Hop handles:

  • Extracting data from multiple systems
  • Cleansing, enriching, and validating records
  • Applying business rules
  • Loading curated data into the analytics database

Outcome for the enterprise:

  • Reduced manual effort
  • Faster data availability
  • Improved data quality
  • Clear separation of automation and transformation logic
  • Easier maintenance and scaling

Why CXOs Choose Axxonet

Enterprises partner with Axxonet because we deliver more than technology—we deliver outcomes.

Our Differentiators

  • Proven expertise in both automation and data engineering
  • Deep experience integrating n8n, Apache Hop, and enterprise systems
  • Strong architectural discipline for scalability and governance
  • Ability to deploy across cloud, hybrid, and on-prem environments
  • Accelerators for CI/CD, monitoring, and parameterised execution
  • A track record of reducing operational overhead and improving data reliability

We don’t just implement tools—we build future-proof data automation ecosystems.

Conclusion: A Future-Ready Foundation for Enterprise Data Operations

Integrating Apache Hop into n8n workflows enables organisations to scale automation beyond simple tasks into production-grade data pipelines. With Axxonet’s expertise, enterprises gain:

  • Lightweight, maintainable automation
  • Reusable, governed ETL pipelines
  • Faster deployment of analytics and reporting
  • Lower operational cost and higher reliability
  • A scalable architecture aligned with digital transformation goals

Our engineering approach aligns with modern DevOps practices, including Git-based versioning and automated deployment pipelines, which we detailed in our article on integrating Apache Hop with GitLab CI/CD for automated pipeline deployment.

For CXOs looking to modernise their data and automation landscape, this architecture provides a strategic, cost-effective, and future-ready foundation.

If your organisation is struggling with brittle automation or slow data pipelines, Axxonet can help you modernise with a scalable, event-driven architecture. 

References

Reach out to us at [email protected] or submit your details via our contact form.


SQL Server Integration Service vs Apache Hop – Execution, Cloud Strategy, and the Microsoft Fabric Question (Part 2)

In Part 1 of this series, we explored how SQL Server Integration Services (SSIS) and Apache Hop differ in their architectural foundations and development philosophies. We examined how each tool approaches pipeline design, metadata management, portability, and developer experience, highlighting the contrast between a traditional, Microsoft-centric ETL platform and a modern, open, orchestration-driven framework.

Architecture explains how a tool is built.

Adoption, however, depends on how well it runs in the real world.

In this second part, we shift focus from design philosophy to practical execution. We compare SSIS and Apache Hop across performance, scalability, automation, scheduling, cloud readiness, and operational flexibility: the areas that ultimately determine whether an ETL platform can keep up with modern data platforms.

This comparison also reflects a broader transition currently facing many organizations. As Microsoft positions Microsoft Fabric as its strategic, cloud-first analytics platform, existing SSIS customers are increasingly encouraged to migrate or coexist within the Fabric ecosystem when modernizing their data platforms. Understanding how SSIS, Apache Hop, and Microsoft’s evolving cloud strategy differ in execution, cost, and operational flexibility is essential for making informed modernization decisions.

Performance & Scalability

SSIS

SSIS is built around a highly optimized, in-memory data flow engine that performs well for batch-oriented ETL workloads, especially when operating close to SQL Server. Its execution model processes rows in buffers, delivering reliable throughput for structured transformations and predictable data volumes.

Within a single server, SSIS supports parallel execution through multiple data flows and tasks. However, scaling is primarily achieved by increasing CPU, memory, or disk capacity on that server.

Key Characteristics

  • Efficient in-memory batch processing
  • Excellent performance for SQL Server-centric workloads
  • Parallelism within a single machine
  • Scaling achieved through stronger hardware

This model works well for stable, on-premise environments with known workloads. As data volumes grow or workloads become more dynamic, flexibility becomes limited.

Apache Hop

Apache Hop approaches performance differently. Pipeline design is separated from execution, allowing the same pipeline to run on different execution engines depending on scale and performance requirements.

Hop supports local execution for development and testing, as well as distributed execution using engines such as Apache Spark, Apache Flink, and Apache Beam. This enables true horizontal scaling across clusters rather than relying on a single machine.

Key Characteristics

  • Lightweight execution for low-latency workloads
  • Parallelism across nodes, not just threads
  • Native support for distributed compute engines
  • Suitable for batch and streaming-style workloads

Because pipelines do not need to be redesigned to scale, teams can start small and grow naturally as data volumes and complexity increase.

Comparative Summary

Aspect | SSIS | Apache Hop
Execution Model | Single-node | Single-node & distributed
Scaling Type | Vertical (scale-up) | Horizontal (scale-out)
Distributed Engines | Not native | Spark, Flink, Beam
Cloud Elasticity | Limited | Strong
Container/K8s Support | Not native | Native
Workload Flexibility | Predictable batch | Batch + scalable execution

Automation & Scheduling

SSIS: Built-in Scheduling with Tight Coupling

SSIS relies primarily on SQL Server Agent for automation and scheduling. Once packages are deployed to the SSIS Catalog (SSISDB), they are typically executed via SQL Agent jobs configured with fixed schedules, retries, and alerts.

This approach works well in traditional on-premise environments where SQL Server is always available and centrally managed. However, scheduling and orchestration logic are tightly coupled to SQL Server, which limits flexibility in distributed or cloud-native architectures.

Strengths

  • Built-in scheduling via SQL Server Agent
  • Integrated logging and execution history
  • Simple retry and failure handling

Limitations

  • Requires SQL Server to be running
  • Not cloud-agnostic
  • Difficult to integrate with external orchestration tools
  • Limited support for event-driven or dynamic workflows

While reliable, this model assumes a static infrastructure and does not align naturally with modern, elastic execution patterns.

Apache Hop: Decoupled, Orchestration-First Automation

Apache Hop deliberately avoids embedding a fixed scheduler. Instead, it exposes clear execution entry points (CLI and Hop Server), making it easy to integrate with industry-standard orchestration tools. This allows teams to choose the scheduler that best fits their infrastructure rather than being locked into one model.

Scheduling Apache Hop on Google Dataflow

Lean With Data and the Apache Beam team at Google work closely together to provide seamless integration between the Google Cloud platform and Apache Hop. The ability to schedule and run pipelines directly on Google Cloud follows this philosophy. You no longer have to worry about provisioning resources, and you are billed only for the compute time you use. This allows you to focus more on business problems and less on operational overhead.

Apache Hop with Apache Airflow

Apache Airflow is an open-source workflow orchestration tool originally developed by Airbnb. It allows you to define workflows as code, providing a dynamic, extensible platform to manage your data pipelines. Airflow’s rich features enable you to automate and monitor workflows efficiently, ensuring that data moves seamlessly through various processes and systems.

Apache Airflow is a powerful and versatile tool for managing workflows and automating complex data pipelines. Its “workflow as code” approach, coupled with robust scheduling, monitoring, and scalability features, makes it an essential tool for data engineers and data scientists. By adopting Airflow, you can streamline your workflow management, improve collaboration, and ensure that your data processes are efficient and reliable. Explore Apache Airflow today and discover how it can transform your data engineering workflows.

A detailed article by the Axxonet team provides an in-depth overview of how Apache Hop integrates with Apache Airflow; please follow the link:

Streamlining Apache HOP Workflow Management with Apache Airflow

Apache Hop with Kubernetes CronJobs

Apache Hop integrates naturally with Kubernetes CronJobs, making it well suited for cloud-native and container-based ETL architectures. In this setup, Apache Hop is packaged as a Docker image containing the required pipelines, workflows, and runtime configuration. Kubernetes CronJobs are then used to schedule and trigger Hop executions at defined intervals, with each run executed as a separate, ephemeral pod.

This execution model provides strong isolation, as each ETL run operates independently and terminates once completed, eliminating the need for long-running ETL servers. Environment-specific configuration, credentials, and secrets are injected at runtime using Kubernetes ConfigMaps and Secrets, enabling the same Hop image and pipelines to be reused across development, test, and production environments without modification.

Comparative Summary

Aspect | SSIS | Apache Hop
Built-in Scheduler | SQL Server Agent | External (by design)
Orchestration Logic | Limited | Native workflows
Event-driven Execution | Limited | Strong
Cloud-native Scheduling | Azure-specific | Kubernetes, Airflow, Cron
CI/CD Integration | Moderate | Strong
Execution Flexibility | Server-bound | Fully decoupled

Cloud & Container Support

SSIS: Cloud-Enabled but Platform-Bound

SSIS was designed for on-premise Windows environments, and its cloud capabilities were introduced later. In Azure, SSIS packages typically run using Azure SSIS Integration Runtime within Azure Data Factory.

While this enables lift-and-shift migrations, the underlying execution model remains largely unchanged. Packages still depend on Windows-based infrastructure and SQL Server-centric components.

Cloud & Container Characteristics

  • Cloud support primarily through Azure Data Factory
  • Requires managed Windows nodes (SSIS-IR)
  • No native Docker or Kubernetes support
  • Limited portability across cloud providers
  • Scaling and cost tied to Azure runtime configuration

As a result, SSIS fits best in Azure-first or hybrid Microsoft environments, but is less suitable for multi-cloud or container-native strategies.

Microsoft Fabric: Microsoft’s Cloud Direction for SSIS Customers

As organizations move ETL workloads to the cloud, Microsoft increasingly positions Microsoft Fabric as the strategic destination for analytics and data integration workloads, including those historically built using SSIS.

Microsoft Fabric is a unified, SaaS-based analytics platform that brings together data integration, engineering, warehousing, analytics, governance, and AI under a single managed environment. Rather than modernizing SSIS itself into a cloud-native execution engine, Microsoft’s approach has been to absorb SSIS use cases into a broader analytics platform.

For existing SSIS customers, this typically presents three cloud-oriented paths:

  1. Lift-and-Shift SSIS Using Azure SSIS Integration Runtime

Organizations can continue running SSIS packages in the cloud by hosting them on Azure SSIS Integration Runtime (SSIS-IR) within Azure Data Factory. This approach minimizes refactoring but preserves SSIS’s original execution model, including its reliance on Windows-based infrastructure and SQL Server-centric components.

  2. Gradual Transition into Microsoft Fabric

Microsoft Fabric introduces Fabric Data Factory, which shares conceptual similarities with Azure Data Factory but is tightly integrated with the Fabric ecosystem. Customers are encouraged to incrementally move data integration, analytics, and reporting workloads into Fabric, leveraging shared storage (OneLake), unified governance, and native Power BI integration.

  3. Platform Consolidation Around Fabric

At a broader level, Fabric represents Microsoft’s strategy to consolidate ETL, analytics, and AI workloads into a single managed platform. For organizations already heavily invested in Azure and Power BI, this provides a clear modernization path, but one that increasingly ties execution, storage, and analytics to Microsoft-managed services.

Implications for Cloud Adoption

From a cloud and container perspective, Fabric differs fundamentally from traditional SSIS deployments:

  • Execution is platform-managed, not user-controlled
  • Workloads are optimized for always-on analytics capacity, not ephemeral execution
  • Containerization and Kubernetes are abstracted away rather than exposed
  • Portability outside the Microsoft ecosystem is limited

This makes Fabric attractive for organizations seeking a fully managed analytics experience, but it also represents a shift from tool-level ETL execution to platform-level dependency.

Apache Hop: Cloud-Native and Container-First

Apache Hop embraces container-based execution through Docker, allowing pipelines and workflows to run in consistent, isolated environments. ETL logic and runtime dependencies can be packaged together, ensuring the same behavior across development, testing, and production.

Configuration is injected at runtime rather than hardcoded, making Hop naturally environment-agnostic. This approach aligns well with Kubernetes, CI/CD pipelines, and ephemeral execution models.

				
Example Command:

#!/bin/bash

# Run the workflow
docker run -it --rm \
  --env HOP_LOG_LEVEL=Basic \
  --env HOP_FILE_PATH='${PROJECT_HOME}/code/flights-processing.hwf' \
  --env HOP_PROJECT_FOLDER=/files \
  --env HOP_ENVIRONMENT_CONFIG_FILE_NAME_PATHS=${PROJECT_HOME}/dev-env.json \
  --env HOP_RUN_CONFIG=local \
  --name hop-pipeline-container \
  -v /path/to/my-hop-project:/files \
  apache/hop:latest

# Check the exit code
if [ $? -eq 0 ]; then
    echo "Workflow executed successfully!"
else
    echo "Workflow execution failed. Check the logs for details."
fi

This script runs the workflow and checks whether it completed successfully. You could easily integrate this into a larger CI/CD pipeline or set it up to run periodically.
				
			

Docker-based execution makes Apache Hop particularly well suited for CI/CD pipelines, cloud platforms, and Kubernetes-based deployments, where ETL workloads can be triggered on demand, scaled horizontally, and terminated after execution. Overall, this model aligns strongly with modern DevOps and cloud-native data engineering practices.

Cloud & Container Characteristics

  • Native Docker support
  • Kubernetes-ready (Jobs, CronJobs, autoscaling)
  • Cloud-agnostic (AWS, Azure, GCP)
  • Supports object storage, cloud databases, and APIs
  • Stateless, ephemeral execution model

This architecture enables teams to build once and deploy anywhere, without modifying pipeline logic.

Comparative Summary

Aspect | SSIS | Apache Hop
Cloud Strategy | Azure-centric | Cloud-agnostic
Container Support | Not native | Native Docker & K8s
Execution Model | Long-running runtime | Ephemeral, stateless
Multi-Cloud Support | Limited | Strong
CI/CD Integration | Moderate | Strong
Infrastructure Overhead | Higher | Lightweight

The “Microsoft Fabric Trap” (Cost & Strategy Perspective)

Microsoft Fabric presents itself as a unified, future-ready data platform, combining data integration, analytics, governance, and AI under a single umbrella. While this approach can be compelling at scale, it introduces a common risk for small and mid-sized organizations: what can be described as the “Fabric Trap.”

Cost vs Data Size Reality

One of the most overlooked aspects of Microsoft Fabric adoption is the mismatch between platform cost and actual data scale.

For many small and mid-sized organizations, real-world workloads look like this:

  • Total data volume well below 50 TB
  • A few hundred users at most (< 500)
  • Primarily batch ETL, reporting, and operational analytics
  • Limited or no advanced AI/ML workloads

In these scenarios, Fabric’s capacity-based licensing model often becomes difficult to justify.

Key cost-related realities:

  • You pay for capacity, not consumption
    Fabric requires reserving compute capacity regardless of whether workloads run continuously or only a few hours per day. Periodic ETL jobs often leave expensive capacity idle.
  • Costs scale faster than data maturity
    While Fabric is designed for large, multi-team analytics platforms, many organizations adopt it before reaching that scale, resulting in enterprise-level costs for non-enterprise workloads.
  • User count amplifies total spend
    As reporting and analytics adoption grows, licensing and capacity planning become more complex and expensive, even when data volumes remain modest.
  • Cheaper alternatives handle this scale well
    Open-source databases like PostgreSQL comfortably support tens of terabytes for analytics workloads, and orchestration tools like Apache Hop deliver robust ETL and automation without licensing overhead.
  • ROI improves only at higher scale
    Fabric’s unified analytics, governance, and AI features begin to pay off primarily at larger data volumes, higher concurrency, and greater organizational complexity.

For organizations operating below this threshold, a modular open-source stack allows teams to scale incrementally, control costs, and postpone platform consolidation decisions until business and data requirements genuinely demand it.

Which One Should You Choose?

Choose SSIS/Fabric if:

  • Your ecosystem is entirely Microsoft
  • Your datasets live in SQL Server
  • You need a stable on-prem ETL with minimal DevOps complexity
  • Licensing is not a constraint
  • Your workloads justify always-on analytics capacity
  • You are comfortable adopting a platform-managed execution model
  • Vendor lock-in is an acceptable trade-off for consolidation

Choose Apache Hop if:

  • You prefer open-source tools
  • You need cross-platform or containerized ETL
  • Your data sources include cloud DBs, APIs, NoSQL, or diverse systems
  • You want modern DevOps support with Git-based deployments
  • You need scalable execution engines or distributed orchestration
  • You are a small to mid-sized organization modernizing ETL
  • Your data volumes are moderate (≪ 50 TB) with hundreds—not thousands—of users
  • You run periodic batch ETL and reporting, not always-on analytics
  • You want cloud, container, or hybrid execution without platform lock-in
  • You want to modernize without committing early to an expensive unified platform

Conclusion

Microsoft’s current cloud strategy places Fabric at the center of its analytics ecosystem, and for some organizations, that direction makes sense. However, for many small and mid-sized teams, this approach introduces unnecessary complexity, cost, and architectural rigidity: what we described earlier as the Microsoft Fabric Trap.

Apache Hop offers an alternative modernisation path: one that focuses on execution flexibility, incremental scaling, and architectural intent rather than platform consolidation.

Need Help Modernizing or Migrating?

If you’re:

  • Running SSIS today
  • Evaluating Fabric but unsure about cost or lock-in
  • Looking to modernise ETL using Apache Hop and open platforms

We help teams assess, migrate, and modernize SSIS workloads into Apache Hop–based architectures, with minimal disruption and a clear focus on long-term sustainability.

Reach out to us to discuss your migration or modernisation strategy.
We’ll help you choose the path that fits your data, your scale, and your future, not just your vendor roadmap.

Official Links for Apache Hop and SSIS

While writing this article on Apache Hop and SQL Server Integration Services, we referred to the official documentation and resources listed below:

🔹 Apache Hop Official Resources

🔹 SQL Server Integration Service Official Resources

Install SQL Server Integration Services – SQL Server Integration Services (SSIS)

Development and Management Tools – SQL Server Integration Services (SSIS)

Integration Services (SSIS) Projects and Solutions – SQL Server Integration Services (SSIS)

SSIS Toolbox – SQL Server Integration Services (SSIS)

Other posts in the Apache HOP Blog Series

If you would like to enable this capability in your application, please get in touch with us at [email protected] or update your details in the form


SQL Server Integration Service vs Apache Hop – How ETL Tools Have Evolved and Where Modern Tools Fit In (Part 1 of 2)

Introduction

Before 2015, most ETL tools were designed for a world where data lived inside centralized databases, workloads ran on fixed on-premise servers, and development happened inside proprietary IDEs. Tools like SSIS were built for this environment: stable, tightly integrated with SQL Server, and optimized for Windows-based enterprise data warehousing.

After 2015, the data landscape changed dramatically. Cloud platforms, distributed systems, containerization, and DevOps practices reshaped how data pipelines are built, deployed, and maintained. ETL tools had to evolve from server‑bound, vendor‑specific systems into flexible, portable, metadata‑driven platforms that could run anywhere.

This shift led to the rise of a broad ecosystem of open‑source ETL and orchestration tools, including Airflow, Talend Open Studio, Pentaho Kettle, Meltano, and more recently, Apache Hop—a modern, actively developed platform designed for cloud‑native and hybrid environments.

This article is Part 1 of a two-part series. Here, we focus on how SSIS and Apache Hop are built: their architectural foundations, development philosophies, and the historical context that shaped them.

In Part 2, we will examine how these architectural differences translate into performance, scalability, automation, cloud readiness, and real‑world usage scenarios, helping you decide which tool best fits your future data strategy.

The Fundamental Distinctions

At a high level, SSIS and Apache Hop differ in how they are designed, deployed, and evolved.

  • SSIS is a Microsoft‑centric ETL tool built for on‑premise SQL Server environments. It offers a stable, tightly integrated experience for teams operating within the Windows and SQL Server ecosystem.
  • Apache Hop is an open‑source, cross‑platform orchestration framework built with modularity, portability, and cloud‑readiness in mind. It emphasizes metadata‑driven design, environment‑agnostic execution, and seamless movement across local, containerized, and distributed environments.

These foundational differences shape how each tool behaves across development, deployment, scaling, and modernization scenarios.

Overview of the Tools

What is SSIS?

SQL Server Integration Services (SSIS) is a mature ETL and data integration tool packaged with SQL Server. It provides a visual, drag‑and‑drop development experience inside Visual Studio, enabling teams to build batch processes, data pipelines, and complex transformations.

SSIS is optimized for Windows‑based enterprise environments and integrates deeply with SQL Server, SQL Agent, and the broader Microsoft data ecosystem.

Extended Capabilities

  • Built‑in transformations for cleansing, validating, aggregating, and merging data
  • Script Tasks using C# or VB.NET
  • SSIS Catalog for deployment, monitoring, and logging
  • High performance with SQL Server through native connectors

What is Apache Hop?

Apache Hop (Hop Orchestration Platform) is a modern, open‑source data orchestration and ETL platform under the Apache Foundation. It provides a clean, flexible graphical interface (Hop GUI) for designing pipelines and workflows across diverse data ecosystems.

Hop builds on the legacy of Pentaho Kettle but introduces a fully re‑engineered, metadata‑driven framework designed for portability and cloud‑native execution.

Extended Capabilities

  • Large library of transforms and connectors for databases, cloud services, APIs, and file formats
  • First‑class support for Docker, Kubernetes, and remote engines like Spark, Flink, and Beam
  • Pipelines-as-code (plain-text XML and JSON files) enabling DevOps workflows
  • Metadata injection for reusable, environment‑agnostic pipelines

Feature-by-Feature Comparison

1. Installation & Platform Support

SSIS

SSIS is tightly coupled with SQL Server and Windows. Installation typically involves SQL Server setup, enabling Integration Services, and configuring Visual Studio with SSDT.

Key Characteristics

  • Runs only on Windows
  • Requires SQL Server licensing
  • Vertical scaling
  • Cloud usage limited to Azure SSIS IR
  • No native container or Kubernetes support

This monolithic, server‑bound architecture works well in traditional environments but becomes restrictive in hybrid or multi‑cloud scenarios.

Apache Hop

Hop is lightweight and platform‑independent. It runs on Windows, Linux, and macOS, and supports local, remote, and containerized execution.

Typical Deployment Models

  • Local execution
  • Hop Server for remote execution
  • Docker containers
  • Kubernetes clusters
  • Integration with Airflow, Cron, and other schedulers

Key Characteristics

  • Fully cross‑platform
  • No licensing cost
  • Horizontal scaling via containers
  • Cloud‑agnostic
  • Metadata‑driven portability

Hop treats deployment as a first‑class concern, enabling “build once, run anywhere” pipelines.

Comparative Summary

Category   | SSIS                     | Apache Hop
OS Support | Windows only             | Windows, Linux, macOS
Deployment | Local server, SQL Agent  | Desktop, server, Docker, Kubernetes
Licensing  | SQL Server license       | Free, open‑source

Hop aligns naturally with modern infrastructure patterns, while SSIS remains best suited for Microsoft‑centric environments.

Why Apache Hop Has an Advantage Here

Apache Hop aligns naturally with modern infrastructure patterns such as microservices, containers, and GitOps-driven deployments. Its ability to run the same pipelines across environments without modification significantly reduces operational overhead and future migration costs.

SSIS, while stable, is best suited for organizations that remain fully invested in Windows-based, on-premise architectures.

2. Development Environment

SSIS

SSIS development happens inside Visual Studio using SSDT. Pipelines are stored as verbose .dtsx files, which complicates version control and collaboration.

Characteristics

  • Strongly UI‑driven
  • Script Tasks via C#/VB.NET
  • Harder Git diffs
  • Environment‑bound debugging
  • Manual multi‑environment handling

This often leads to developer‑machine dependency and challenges in CI/CD automation.

Apache Hop

Hop provides a standalone GUI with pipelines and workflows stored as human‑readable text files (.hpl/.hwf). It embraces separation of logic and configuration through variables, parameters, and metadata injection.

Characteristics

  • No IDE dependency
  • Clean Git diffs
  • Metadata‑driven environment handling
  • Plugin and script extensibility
  • CI/CD‑friendly design

Metadata Injection in Hop

Metadata injection allows pipeline configuration (connections, file paths, parameters) to be supplied at runtime rather than hardcoded.

This enables:

  • Reusable pipelines
  • Clean environment promotion
  • Consistent DevOps workflows

The same pipeline can run in dev, test, and prod simply by changing metadata—not the pipeline itself.

Git integration in Apache Hop’s GUI

Git allows you to track changes to your project over time, collaborate with others without overwriting each other’s work, and roll back to previous versions if something goes wrong. Whether you’re working solo or in a team, using Git is a best practice that saves time and headaches down the road.

Using Git within Apache Hop’s GUI is a fantastic option if you prefer a visual interface. The integration helps you:

  • Track changes in real-time with color-coded file statuses.
  • Easily stage, commit, push, and pull changes without leaving the Hop environment.
  • Visually compare file revisions to see what’s changed between different versions of pipelines or workflows.

The built-in Git integration in Hop simplifies managing your project’s version history and collaborating with others.

The File Explorer perspective gives you access to all the files associated with your project, such as workflows (.hwf), pipelines (.hpl), JSON, CSV, and more.

Through this, your project is version-controlled, backed up, and ready for collaboration.

Comparative Summary

Aspect               | SSIS                   | Apache Hop
Environment handling | Hardcoded/config files | Metadata injection
Pipeline portability | Limited                | High
CI/CD friendliness   | Moderate               | Strong
Multi‑env support    | Manual                 | Native

3. Transformations & Connectors

SSIS

SSIS provides strong built‑in transformations optimized for SQL Server and structured ETL patterns. However, connectors outside the Microsoft ecosystem are limited or require third‑party components.

Apache Hop

Hop offers a broad, extensible library of transforms and connectors, covering databases, cloud platforms, APIs, and big‑data ecosystems. Its plugin‑based architecture allows rapid adaptation to new technologies.

Hop also supports:

  • Nested workflows
  • Parallel pipeline execution
  • Streaming and batch patterns
  • ELT and ETL

Series and parallel execution

Comparative Summary

Aspect               | SSIS               | Apache Hop
Transformation style | Monolithic         | Modular
Extensibility        | Limited            | Plugin‑based
API/cloud connectors | Limited            | Strong
ELT support          | Partial            | Native
Ecosystem reach      | Microsoft‑focused  | Broad, cloud‑native
Reusability          | Moderate           | High

Conclusion (Part 1)

SSIS remains a strong and reliable option for organizations deeply embedded in the Microsoft ecosystem, offering stability, rich transformations, and tight SQL Server integration. However, its platform dependency and limited portability make it less adaptable to modern, cloud‑native workflows.

Apache Hop, on the other hand, embraces a metadata‑driven, platform‑agnostic approach, enabling greater reuse, cleaner DevOps practices, and seamless movement across environments. Its design aligns closely with today’s demands for flexibility, automation, and scalability.

Part 1 sets the stage by examining how these tools are built and how their architectural foundations differ.

In Part 2, we will explore how these differences translate into performance, scalability, automation, cloud readiness, and real‑world usage scenarios, helping you determine which tool best fits your future data strategy.

If you would like to enable this capability in your application, please get in touch with us at [email protected] or update your details in the form


Interactive What-If Analysis using Streamlit: Empowering Real-Time Decision Making

In today’s fast-moving business landscape, leaders need more than static reports — they need the ability to explore multiple business scenarios instantly. Whether it’s evaluating pricing strategies, forecasting operational costs, or measuring profitability under different assumptions, What-If Analysis has become an essential capability for modern enterprises.

At Axxonet, we leverage advanced Python-based tools like Streamlit to build interactive What-If dashboards that allow businesses to simulate outcomes in real time and make data-driven decisions with confidence.

What Is What-If Analysis?

What-If Analysis is a decision-support technique that helps organizations understand how changes in key inputs — such as revenue, unit cost, volume, or operational parameters — impact key metrics like profitability, efficiency, or ROI.
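To make this concrete, here is a minimal, illustrative Python sketch of the calculation behind a single What-If scenario; the metric (profit) and the input names are assumptions chosen for the example rather than a prescribed model:

from dataclasses import dataclass

@dataclass
class Scenario:
    """Illustrative What-If inputs; the field names are example assumptions."""
    price_per_unit: float
    unit_cost: float
    volume: int
    fixed_costs: float

def profit(s: Scenario) -> float:
    # Profit = revenue - variable costs - fixed costs
    return s.volume * (s.price_per_unit - s.unit_cost) - s.fixed_costs

baseline = Scenario(price_per_unit=120.0, unit_cost=85.0, volume=10_000, fixed_costs=150_000)
what_if = Scenario(price_per_unit=125.0, unit_cost=85.0, volume=9_500, fixed_costs=150_000)

print(f"Baseline profit: {profit(baseline):,.0f}")
print(f"What-If profit:  {profit(what_if):,.0f}")
print(f"Impact of change: {profit(what_if) - profit(baseline):+,.0f}")

Interactive What-If tooling simply wraps this kind of calculation in controls, so decision-makers can change the inputs and see the impact instantly.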

Traditional methods rely heavily on Excel or BI tools. While effective, these approaches are often:

  • Manual and time-consuming
  • Difficult to scale
  • Prone to formula errors
  • Dependent on licensed software
  • Limited in automation and real-time data connectivity

Interactive web-based tools overcome these limitations by making scenario exploration intuitive, visual, and real time.

Why Streamlit?

Streamlit provides a powerful yet simple framework to build secure, interactive, analytical applications without front-end development. We use Streamlit because:

  1. Open-Source & Free
  • No licensing restrictions, unlike Excel or proprietary BI tools.
  2. Extremely Fast Development
  • Widgets, charts, layouts, and logic can be built in minutes using pure Python.
  3. Real-Time Interaction
  • Any parameter change triggers instant recalculation and updates on the screen.
  4. Easy Integration With Databases

Streamlit connects effortlessly with:

  • PostgreSQL
  • MySQL
  • ClickHouse
  • Azure SQL
  • REST APIs
  • Any operational data source

This enables What-If dashboards to pull live, monthly-updating operational averages directly from backend systems (a small connection sketch follows this list).

  5. Automation & Scalability

Streamlit apps can:

  • Auto-refresh values
  • Run simulations at scale
  • Support multiple teams at once
  • Be embedded inside internal systems
  6. Ease of Deployment

Runs on:

  • Local machine
  • Docker container
  • Cloud (AWS, Azure, GCP)
  • Streamlit Community Cloud
  • Enterprise servers

  7. Built-in Support for Charts, KPIs, and PDF Exports

  • Interactive dashboards and exportable insights make decision-making seamless.
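Following up on the database integration mentioned in point 4 above, here is a rough sketch of how a Streamlit app can cache a query against an operational database; the connection string, table, and column names are placeholders for illustration, not a reference implementation:

import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

# Placeholder connection string; replace with your own database details
ENGINE = create_engine("postgresql+psycopg2://user:password@db-host:5432/operations")

@st.cache_data(ttl=3600)  # cache the result and refresh it every hour
def load_operational_averages() -> pd.DataFrame:
    # Hypothetical table holding monthly operational averages
    query = "SELECT metric_name, avg_value FROM monthly_operational_averages"
    return pd.read_sql(query, ENGINE)

averages = load_operational_averages()
st.dataframe(averages)

Because the query result is cached, every widget interaction reuses the most recently loaded data instead of hitting the database on each rerun.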

Why Companies Need Interactive What-If Tools

  1. Instant Scenario Simulation
  • Decision-makers can adjust parameters on the fly and instantly see how results change, based on live data from operational systems.
  2. Better Visibility for Strategic Planning
  • Dynamic dashboards help compare multiple business scenarios, enabling more informed choices.
  3. Reduction of Manual Work
  • Automated recalculation eliminates time-consuming spreadsheet work, where data must be extracted from multiple sources and compiled manually, a process that is both slow and prone to copy errors.
  4. Improved Collaboration
  • Teams across finance, logistics, or operations can access the same interactive tool via a browser, using real-time information.

How We Use Streamlit to Build What-If Dashboards

Streamlit provides an elegant framework for building data apps without front-end development.
At Axxonet, we extend Streamlit to create:

  • Clean and interactive user interfaces
  • Dynamic input controls for cost, revenue, and operational parameters
  • Real-time Key Metrics updates
  • Visual charts and insights for faster comprehension
  • Downloadable PDF summaries for easy sharing

The result is a seamless, responsive experience where any input change automatically updates the output metrics and visualisations, using real-time data from the backend database.

Profitability Simulation Dashboard

Below is a conceptual example of the type of dashboard we build:

  • Adjustable inputs for cost components, pricing, and volume
  • Real-time calculation of revenue, costs, and profit
  • Key Metrics widgets for quick interpretation
  • Charts showing cost breakdown, profit distribution, or sensitivity
  • Exportable reports summarizing key assumptions and outcomes
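As a minimal sketch of such a dashboard (the input names, default values, and cost structure below are illustrative assumptions, not a finished solution):

import pandas as pd
import streamlit as st

st.title("Profitability What-If Simulator")

# Adjustable inputs for pricing, volume, and cost components
price = st.sidebar.slider("Price per unit", 50.0, 200.0, 120.0)
volume = st.sidebar.slider("Monthly volume (units)", 1_000, 50_000, 10_000, step=500)
unit_cost = st.sidebar.slider("Variable cost per unit", 20.0, 150.0, 85.0)
fixed_costs = st.sidebar.number_input("Fixed costs", value=150_000.0, step=5_000.0)

# Real-time calculation of revenue, costs, and profit
revenue = price * volume
variable_costs = unit_cost * volume
total_profit = revenue - variable_costs - fixed_costs

# Key Metrics widgets for quick interpretation
col1, col2, col3 = st.columns(3)
col1.metric("Revenue", f"{revenue:,.0f}")
col2.metric("Total cost", f"{variable_costs + fixed_costs:,.0f}")
col3.metric("Profit", f"{total_profit:,.0f}")

# Simple cost and profit breakdown chart
breakdown = pd.DataFrame(
    {"Amount": [variable_costs, fixed_costs, max(total_profit, 0)]},
    index=["Variable costs", "Fixed costs", "Profit"],
)
st.bar_chart(breakdown)

Running this file with streamlit run immediately gives an interactive page where every slider change recalculates the metrics and the chart.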

This approach helps businesses experiment with ideas before making real-world decisions — all in a secure browser-based environment.

Real Industry Use Cases

What-If analysis solutions support a wide range of industries and scenarios:

Financial Services

  • Profitability modelling
  • Interest rate sensitivity
  • Loan pricing scenarios

Logistics & Supply Chain

  • Trip-based cost modelling
  • Driver/vehicle scenario simulation
  • Fuel and toll forecasting

Retail & Consumer Business

  • Price optimization
  • Discount impact analysis

Operations & Planning

  • Resource allocation
  • Budget forecasting

Whether you’re planning next quarter’s financial forecast or optimizing operations, interactive What-If tools provide clarity and confidence.

Architecture Overview

The following diagram reflects the internal operational design of a Streamlit-based what-if simulator.

Our architecture is designed for:

  • Modularity — separate layers for UI, business logic, and calculations
  • Responsiveness — real-time recalculation and instant visual feedback
  • Scalability — deployable on cloud, server, or container environments
  • Security — access-controlled dashboards and isolated computation layers

We use modern Python frameworks and best practices to deliver a smooth experience without exposing internal complexities.

Deployment Options

Local Deployment

streamlit run app.py

Docker Deployment

FROM python:3.10-slim

WORKDIR /app

COPY . /app

RUN pip install streamlit pandas numpy reportlab plotly

EXPOSE 8501

CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]

Cloud Deployment

  • Streamlit Community Cloud
  • Any container-based cloud platform (Azure, AWS, GCP)

Conclusion

Interactive What-If dashboards transform the way organisations evaluate scenarios, forecast outcomes, and make strategic decisions. By combining Streamlit’s powerful UI capabilities with our expertise in analytics and engineering, Axxonet delivers solutions that are:

  • Simple to use
  • Fully customizable
  • Real-time
  • Insight-driven

Businesses no longer need to rely on static spreadsheets — with dynamic What-If simulation, teams can explore opportunities, mitigate risks, and drive smarter decisions faster.

If you would like to enable this capability in your application, please get in touch with us at [email protected] or update your details in the form

References

The following were the official documentation and resources referred to.

  1. Streamlit Official Documentation — Widgets, Layout, API Reference
    https://docs.streamlit.io

  2. Streamlit Deployment Documentation — Community Cloud, Docker, Configuration
    https://docs.streamlit.io/streamlit-community-cloud
    https://docs.streamlit.io/deploy/tutorials


Apache Hop Meets GitLab: CI/CD Automation with GitLab

Introduction

In our previous blog, we discussed Apache HOP in more detail. In case you have missed it, refer to Comparison of and migrating from Pentaho Data Integration PDI/ Kettle to Apache HOP. As a continuation of the Apache HOP article series, here we touch upon how to integrate Apache HOP with GitLab for version management and CI/CD. 

In the fast-paced world of data engineering and data science, organizations deal with massive amounts of data that need to be processed, transformed, and analyzed in real-time. Extract, Transform, and Load (ETL) workflows are at the heart of this process, ensuring that raw data is ingested, cleaned, and structured for meaningful insights. Apache HOP (Hop Orchestration Platform) has emerged as one of the most powerful open-source tools for designing, orchestrating, and executing ETL pipelines, offering a modular, scalable, and metadata-driven approach to data integration. 

However, as ETL workflows become more complex and business requirements evolve quickly, managing multiple workflows can be difficult. This is where Continuous Integration and Continuous Deployment (CI/CD) come into play. By automating the deployment, testing, and version control of ETL pipelines, CI/CD ensures consistency, reduces human intervention, and accelerates the development lifecycle.

This blog post explores Apache HOP integration with GitLab, its key features, and how to leverage it to streamline and manage your Apache HOP workflows and pipelines.

Apache Hop:

Apache Hop (Hop Orchestration Platform) is a robust, open-source data integration and orchestration tool that empowers developers and data engineers to build, test, and deploy workflows and pipelines efficiently. One of Apache Hop’s standout features is its seamless integration with version control systems like Git, enabling collaborative development and streamlined management of project assets directly from the GUI.

GitLab:

GitLab is a widely adopted DevSecOps platform that provides built-in CI/CD capabilities, version control, and infrastructure automation. Integrating GitLab with Apache HOP for ETL development and deployment offers several benefits:

  1. Version Control for ETL Workflows
    • GitLab allows teams to track changes in Apache HOP pipelines, making it easier to collaborate, review, and revert to previous versions when needed.
    • Each change to an ETL workflow is documented, ensuring transparency and traceability in development.
  2. Automated Testing of ETL Pipelines
    • Data pipelines can break due to schema changes, logic errors, or unexpected data patterns.
    • GitLab CI/CD enables automated testing of HOP pipelines before deployment, reducing the risk of failures in production.
  3. Seamless Deployment to Multiple Environments
    • Using GitLab CI/CD pipelines, teams can deploy ETL workflows across different environments (development, staging, and production) without manual intervention.
    • Environment-specific configurations can be managed efficiently using GitLab variables.
  4. Efficient Collaboration & Code Reviews
    • Multiple data engineers can work on different aspects of ETL development simultaneously using GitLab’s branching and merge request features.
    • Code reviews ensure best practices are followed, improving the quality of ETL pipelines.
  5. Rollback and Disaster Recovery
    • If an ETL workflow fails in production, previous stable versions can be quickly restored using GitLab’s versioning and rollback capabilities.
  6. Security and Compliance
    • GitLab provides access control, audit logging, and security scanning features to ensure that sensitive ETL workflows adhere to compliance standards.

Jenkins:

Jenkins, one of the most widely used automation servers, plays a key role in enabling CI/CD by automating build, test, and deployment processes. Stay tuned for our upcoming articles on how to integrate Jenkins with GitLab for managing and deploying the Apache HOP artifacts.

In this blog post, we’ll explore how Git actions can be utilized in Apache Hop GUI to manage and track changes to workflows and pipelines effectively. We’ll cover the setup process, common Git operations, and best practices for using Git within Apache Hop.

Manual Git Integration for CI/CD Process (Problem Statement)

Earlier, CI/CD for ETL was a manual and tedious process. In ETL tools like Pentaho PDI (or older versions), we had to manage CI/CD for the ETL artefacts (Apache Hop or Pentaho pipelines/transformations and workflows/jobs) with Git manually, following these summary steps:

  1. Create an Empty Repository
    • Log in to your Git account.
    • Create a new repository and leave it empty (do not add a README, .gitignore, or license file).
  2. Clone the Repository
    • Clone the empty repository to your local system.
    • This will create a local folder corresponding to your repository.
  3. Use the Cloned Folder as Project Home
    • Set the cloned folder as your Apache Hop project home folder.
    • Save all your pipelines (.hpl files), workflows (.hwf files), and configuration files in this folder.

Common Challenges in Pentaho Data Integration (PDI)

Pentaho Data Integration (PDI), also known as Kettle, has been widely used for ETL (Extract, Transform, Load) processes. However, as data workflows became more complex and teams required better collaboration, automation, and version control, Pentaho’s limitations in CI/CD (Continuous Integration/Continuous Deployment) and Git integration became apparent.

1. Lack of Native Git Support

  • PDI lacked built-in Git integration, making version control and collaboration difficult for teams working on large-scale data projects.

2. Manual Deployment Processes

  • Without automated CI/CD pipelines, teams had to manually deploy and migrate transformations, leading to inefficiencies and errors.

3. Limited Workflow Orchestration

  • Handling complex workflows required custom scripting and external tools, increasing development overhead.

4. Scalability Issues

  • PDI struggled with modern cloud-native architectures and containerized deployments, requiring additional customization.

Official Link: https://pentaho-public.atlassian.net/jira/software/c/projects/PDI

Birth of Apache Hop (Solution)

To automate the CI/CD process with Apache HOP, we can now connect the Hop project to Git and manage CI/CD for the ETLs, resulting in efficient ETL code integration and management. The need for CI/CD and Git integration in Pentaho Data Integration and other ETL tools led to Apache Hop: to address these limitations, the Apache Hop project was created, evolving from PDI while introducing modern development practices such as:

  • Built-in Git Integration: Enables seamless version control, collaboration, and tracking of changes within the Hop GUI.
  • CI/CD Compatibility: Supports automated testing, validation, and deployment of workflows using tools like Jenkins, GitHub Actions, and GitLab CI/CD.
  • Improved Workflow Orchestration: Provides metadata-driven workflow design with enhanced debugging and visualization.
  • Containerization & Cloud Support: Fully supports Kubernetes, Docker, and cloud-native architectures for scalable deployments.

Official Link: https://hop.apache.org/manual/latest/hop-gui/hop-gui-git.html

Impact of Git and CI/CD in Apache Hop

The integration of Git and Continuous Integration/Continuous Deployment (CI/CD) practices into Apache Hop has significantly transformed the way data engineering teams manage and deploy their ETL workflows.

1. Enhanced Collaboration with Git

Apache Hop’s support for Git allows multiple team members to work on different parts of a data pipeline simultaneously. Each developer can clone the repository, make changes in isolated branches, and submit pull requests for review. Git’s version control enables teams to:

  • Track changes to workflows and metadata over time
  • Review historical modifications and troubleshoot regressions
  • Merge contributions efficiently while minimizing conflicts

This collaborative environment leads to better code quality, transparency, and accountability within the team.

2. Reliable Deployments through CI/CD Pipelines

By integrating Apache Hop with CI/CD tools like Jenkins, GitLab CI, or GitHub Actions, organizations can automate the process of testing, packaging, and deploying ETL pipelines. Benefits include:

  • Automated testing of workflows to ensure stability before production releases
  • Consistent deployment across development, staging, and production environments
  • Rapid iteration cycles, reducing the time from development to delivery

These pipelines reduce human error and enhance the repeatability of deployment processes.

3. Improved Agility and Scalability

The combination of Git and CI/CD fosters a modern DevOps culture within data engineering. Teams can:

  • React quickly to changing business requirements
  • Scale solutions across projects and environments with minimal overhead
  • Maintain a centralized repository for configuration and infrastructure-as-code artifacts

This level of agility makes Apache Hop a powerful and future-ready tool for enterprises aiming to modernize their data integration and transformation processes.

Why Use Git with Apache Hop?

Integrating Git with Apache Hop offers several benefits:

  1. Version Control
    • Track changes in pipelines and workflows with Git’s version history.
    • Revert to previous versions when needed.
  2. Collaboration
    • Multiple users can work on the same repository, ensuring smooth collaboration.
    • Resolve conflicts using Git’s merge and conflict resolution features.
  3. Centralized Management
    • Store pipelines, workflows, and associated metadata in a Git repository for centralized access.
  4. Branch Management
    •  Experiment with new features or workflows in isolated branches.
  5. Rollback
    •  Revert to earlier versions of workflows in case of issues.

By incorporating Git into your Apache Hop workflow, you ensure a smooth and organized development process.

Git Actions in Apache Hop GUI

Apache Hop GUI provides a range of Git-related actions to simplify version control tasks. 

These actions can be accessed from the toolbar or context menus within the application.

  1. Committing Changes
  • After modifying a workflow or pipeline, save the changes.
  • Use the Commit option in the GUI to add a descriptive message for the changes.
  2. Pulling Updates
  • Fetch the latest changes from the remote repository using the Pull option.
  • Resolve any conflicts directly in the GUI or using an external merge tool.
  3. Pushing Changes
  • Once you commit changes locally, use the Push option to sync them with the remote repository.
  4. Branching and Merging
  • Create new branches for feature development or experimentation.
  • Merge branches into the main branch to integrate completed features.
  5. Viewing History
  • View the commit history to understand changes made to workflows or pipelines over time.
  • Use the diff viewer to compare changes between commits.
  6. Reverting Changes
  • If a workflow is not functioning as expected, revert to a previous commit directly from the GUI.

In addition to adding and committing files, Apache Hop’s File Explorer perspective allows you to manage other Git operations:

  • Pull: To retrieve the latest changes from your remote repository, click the Git Pull button in the toolbar. This ensures you’re always working with the most up-to-date version of the project.
  • Revert: If you need to discard changes to a file or folder, select the file and click Git Revert in the Git toolbar.
  • Visual Diff: Apache Hop allows you to visually compare different versions of a file. Click the Git Info button, select a specific revision, and use the Visual Diff option to see the changes between two versions of a pipeline or workflow. This opens two tabs, showing the before and after states of your project.

Setting Up Git in Apache Hop GUI (CI/CD)

Apache Hop (Hop Orchestration Platform) provides Git integration to help users manage their workflows, pipelines, and metadata effectively. This integration allows version control of Hop projects, making it easier to track changes, collaborate, and revert to previous versions.

Apache Hop supports Git integration to track metadata changes such as:

  • Pipelines (ETL workflows)
  • Workflows (job orchestration)
  • Project metadata (variables, environment settings)
  • Database connections (stored securely)

With Git, users can:

  • Commit and push changes to a repository
  • Revert changes
  • Collaborate with other team members
  • Maintain a version history of Hop projects

Prerequisites

  • Install Apache Hop on your system.
  • Set up Git on your machine and configure it with your credentials (username and email).
  • Ensure you have access to a Git repository (local or remote).
  • Create or clone a repository to store Apache Hop files.

Create a GitLab access token to authenticate and push the code artifacts from Apache HOP.

Configure Git in Apache Hop

Here are the steps for configuring the CI/CD process using GitLab in Apache HOP.

Step 1: Launch Apache Hop GUI

Step 2: Navigate to the Preferences Menu

Step 3: Locate the Version Control Settings and Configure the Path to your Git Executable

Step 4: Optionally, Specify Default Repositories and Branch Names for your Projects

Step 5: Initialize a Git Repository

  • Create a new project in Apache Hop.
  • Open and verify the project’s folder in your file system.
  • Use Git to initialize the repository: Run the Init command in the Hop Project folder.

    $ git init
  • Add a .gitignore file to exclude temporary files generated by Hop, for example:
      *.log
      *.bak

Step 6: Git Info

  • After successfully initializing Git in your local Hop project folder, use the Git Info button to view the repository details.

Step 7: Adding Files to Git

  • Select the file(s) you want to add to Git from the File Explorer.
  • In the toolbar at the top, you’ll see the Git Add button. Clicking this will stage the selected files, meaning they’re ready to be committed to your Git repository.
  • Alternatively, right-click the file in the File Explorer and select Git Add.

Once staged, the file will change from red to blue, indicating that it’s ready to be committed.

Step 8: Committing Changes

  • Click the Git Commit button from the toolbar.
  • Select the files you’ve staged from the File Explorer that you want to include in the commit.
  • A dialog will prompt you to enter a commit message; this message should summarize the changes you’ve made. Confirm the commit.

Once committed, the blue files will return to a neutral color.

  • After a successful commit, the commit comment details are shown in the Revision tab.

Step 9: Connect to a Remote Repository

Add a remote repository URL:
git remote add origin <repository_url>

  • Push the local repository to the remote repository:
    git push -u origin main

Step 10: Pushing Changes to a Remote Repository

  • In the Git toolbar, click the Git Push button.
  • Apache Hop will prompt you for your Git username and password. Enter the correct authentication details.
  • A confirmation message will appear, indicating that the push was successful.

GitLab / GitHub Operation

In the previous section of this blog, we saw how to initialize a new Git project, commit the ETL files, and push the ETLs and config files from the Apache HOP GUI tool. This section shows how to manage merges and approve merge requests in GitLab after pushing from Apache HOP.

  1. Go to the GitLab project. 

Once the code is pushed from Hop, you should be able to see the merge request notification as shown below:

2. Create a merge request in GitLab as shown below:

3. After the merge request is sent, the corresponding user should review the code artifacts to approve and merge the request.

4. After the merge request is approved, we should be able to see the merged artifacts in the GitLab project as shown in the screenshot below:

Note: Save Pipelines and Workflows:

  • Store your Hop .hpl (pipelines) and .hwf (workflows) files in the Git repository.
  • Use this folder as your main project path.

In the upcoming blogs, we will see how to streamline the Git operations (approval and merging) and deploy the merged development files to new environments (Dev->QA->Production) using Jenkins to implement more robust CI/CD operations. Stay tuned for the upcoming blog releases.

Best Practices for Git in Apache Hop

Apache Hop (Hop Orchestration Platform) provides seamless integration with Git, enabling version control for pipelines and workflows. This integration allows teams to collaborate effectively, track changes, and manage multiple versions of ETL processes.

  1. Use Descriptive Commit Messages
    • Ensure commit messages clearly describe the changes made.
    • Example: “Added error handling to data ingestion pipeline.”
  2. Commit Frequently
    • Break changes into small, logical units and commit regularly.
  3. Leverage Branching
    • Use branches for new features, bug fixes, or experimentation.
    • Merge branches only after thorough testing.
  4. Collaborate Effectively
    • Use pull requests to review and discuss changes with your team before merging.
  5. Keep the Repository Clean
    • Use a .gitignore file to exclude temporary files and logs generated by Hop.

Conclusion

Using Git with Apache Hop GUI combines the power of modern version control with an intuitive data integration platform. By integrating Git into your ETL workflows, you’ll enhance collaboration and organization and improve your projects’ reliability and maintainability. Integrating GitLab CI with Apache HOP revolutionizes ETL workflow management by automating testing, deployment, and monitoring. This Continuous Integration (CI) ensures that data pipelines remain reliable, scalable, and maintainable in the ever-evolving landscape of data engineering. By embracing CI/CD best practices, organizations can enhance efficiency, reduce downtime, and accelerate the delivery of high-quality data insights. 

Start leveraging Git actions in Apache Hop today to streamline your data orchestration projects. 

Up Next

In the next part of the CI/CD blog series, we will talk more about Jenkins for the Continuous Deployment side of the CI/CD process. Stay tuned.

Would you like us to officially set up Hop with GitLab & Jenkins CI/CD for your new project?

Official Links for Apache Hop and GitLab Integration

When writing this blog about Apache Hop and its integration with GitLab for CI/CD, the following were the official documentation and resources referred to. Below is a list of key official links:

1. Apache Hop Official Resources

2. GitLab CI/CD Official Resources

3. DevOps & CI/CD Best Practices

These links will help you explore Apache Hop, GitLab, and CI/CD automation in more depth. 

Other posts in the Apache HOP Blog Series


Why choose Apache Druid over Snowflake

Introduction

In our previous blog, Apache Druid Integration with Apache Superset, we talked about Apache Druid’s integration with Apache Superset. In case you missed it, we recommend reading it before continuing the Apache Druid blog series. In the Exploring Apache Druid: A High-Performance Real-Time Analytics Database blog post, we discussed Apache Druid in more detail.

In this blog, we talk about the advantages of Druid over Snowflake. Apache Druid and Snowflake are both high-performance databases, but they serve different use cases, with Druid excelling in real-time analytics and Snowflake in traditional data warehousing.

Snowflake

Snowflake is a cloud-based data platform and data warehouse service that has gained significant popularity due to its performance, scalability, and ease of use. It is built from the ground up for the cloud and offers a unified platform for data warehousing, data lakes, data engineering, data science, and data application development.

Apache Druid

Apache Druid is a high-performance, distributed real-time analytics database designed to handle fast, interactive queries on large-scale datasets, especially those with a time component. It is widely used for powering business intelligence (BI) dashboards, real-time monitoring systems, and exploratory data analytics.

Here are the advantages of Apache Druid over Snowflake, particularly in real-time, time-series, and low-latency analytics:

(a) Real-Time Data Ingestion

    • Apache Druid: Druid is designed for real-time data ingestion, making it ideal for applications where data needs to be available for querying as soon as it’s ingested. It can ingest data streams from sources like Apache Kafka or Amazon Kinesis and make the data immediately queryable.
    • Snowflake: Snowflake is primarily a batch-processing data warehouse. While Snowflake can load near real-time data using external tools or connectors, it is not built for real-time streaming data ingestion like Druid.

Advantage: Druid is superior for real-time analytics, especially for streaming data from sources like Kafka and Kinesis.

(b) Sub-Second Query Latency for Interactive Analytics

    • Apache Druid: Druid is optimized for sub-second query latency, which makes it highly suitable for interactive dashboards and ad-hoc queries. Its columnar storage format and advanced indexing techniques (such as bitmap and inverted indexes) enable very fast query performance on large datasets, even in real time.
    • Snowflake: Snowflake is highly performant for large-scale analytical queries, but it is designed for complex OLAP (Online Analytical Processing) queries over historical data, and query latency may be higher for low-latency, real-time analytics, particularly for streaming or fast-changing datasets.

Advantage: Druid offers better performance for low-latency, real-time queries in use cases like interactive dashboards or real-time monitoring.

(c) Time-Series Data Optimization

    • Apache Druid: Druid is natively optimized for time-series data, with its architecture built around time-based partitioning and segmenting. This allows for efficient querying and storage of event-driven data (e.g., log data, clickstreams, IoT sensor data).
    • Snowflake: While Snowflake can store and query time-series data, it does not have the same level of optimization for time-based queries as Druid. Snowflake is better suited for complex, multi-dimensional queries rather than the high-frequency, time-stamped event queries where Druid excels.

Advantage: Druid is more optimized for time-series analytics and use cases where data is primarily indexed and queried based on time.

(d) Streaming Data Support

    • Apache Druid: Druid has native support for streaming data platforms like Kafka and Kinesis, enabling direct ingestion from these sources and offering real-time visibility into data as it streams in.
    • Snowflake: Snowflake supports streaming data through tools like Snowpipe, but it works better with batch data loading or micro-batch processing. Its streaming capabilities are generally less mature and lower-performing compared to Druid’s real-time streaming ingestion.

Advantage: Druid has stronger native streaming data support and is a better fit for real-time analytics from event streams.

(e) Low-Latency Aggregations

    • Apache Druid: Druid excels in performing low-latency aggregations on event data, making it ideal for use cases that require real-time metrics, summaries, and rollups. For instance, it’s widely used in monitoring, fraud detection, ad tech, and IoT, where data must be aggregated and queried in near real-time.
    • Snowflake: While Snowflake can aggregate data, it is more optimized for batch-mode processing, where queries are run over large datasets. Performing continuous, real-time aggregations would require external tools and more complex architectures.

Advantage: Druid is better suited for real-time aggregations and rollups on streaming or event-driven data.

(f) High Cardinality Data Handling

    • Apache Druid: Druid’s advanced indexing techniques, like bitmap indexes and sketches (e.g., HyperLogLog), allow it to efficiently handle high-cardinality data (i.e., data with many unique values). This is important for applications like ad tech where unique user IDs, URLs, or click events are frequent.
    • Snowflake: Snowflake performs well with high-cardinality data in large-scale analytical queries, but its query execution model is generally more suited to aggregated batch processing rather than fast, high-cardinality lookups and filtering in real time.

Advantage: Druid has better indexing for real-time queries on high-cardinality datasets.

(g) Real-Time Analytics for Operational Use Cases

    • Apache Druid: Druid was built to serve operational analytics use cases where real-time visibility into systems, customers, or events is critical. Its ability to handle fast-changing data and return instant insights makes it ideal for powering monitoring dashboards, business intelligence tools, or real-time decision-making systems.
    • Snowflake: Snowflake is an excellent data warehouse for historical and batch-oriented analytics but is not optimized for operational or real-time analytics where immediate insights from freshly ingested data are needed.

Advantage: Druid is better suited for operational analytics and real-time, event-driven systems.

(h) Cost-Effectiveness for Real-Time Workloads

    • Apache Druid: Druid’s open-source nature means there are no licensing fees, and you only pay for the infrastructure it runs on (if you use a managed service or deploy it on the cloud). For organizations with significant real-time analytics workloads, Druid can be more cost-effective than cloud-based data warehouses, which charge based on storage and query execution.
    • Snowflake: Snowflake’s pricing is based on compute and storage usage, with charges for data loading, querying, and storage. For continuous, high-frequency querying (such as real-time dashboards), these costs can add up quickly.

Advantage: Druid can be more cost-effective for real-time analytics, especially in high-query environments with constant data ingestion.

(i) Schema Flexibility and Semi-Structured Data Handling

    • Apache Druid: Druid supports schema-on-read and is highly flexible in terms of handling semi-structured data such as JSON or log formats. This flexibility is particularly useful for use cases where the schema may evolve over time or when working with less structured data types.
    • Snowflake: Snowflake also handles semi-structured data like JSON, but it requires more structured schema management compared to Druid’s flexible schema handling, which makes Druid more adaptable to changes in data format or structure.

Advantage: Druid offers greater schema flexibility for semi-structured data and evolving datasets.

(j) Open-Source and Vendor Independence

    • Apache Druid: Druid is open-source, which gives users full control over deployment, management, and scaling without being locked into a specific vendor. This makes it a good choice for organizations that want to avoid vendor lock-in and have the flexibility to self-manage or choose different cloud providers.
    • Snowflake: Snowflake is a proprietary, cloud-based data warehouse. While Snowflake offers excellent cloud capabilities, users are tied to Snowflake’s platform and pricing model, which may not be ideal for organizations preferring more control or customization in their infrastructure.

Advantage: Druid provides more freedom and control as an open-source platform, allowing for vendor independence.

When to Choose Apache Druid over Snowflake

  • Real-Time Streaming Analytics: If your use case involves high-frequency event data or real-time streaming analytics (e.g., user behavior tracking, IoT sensor data, or monitoring dashboards), Druid is a better fit.
  • Interactive, Low-Latency Queries: For interactive dashboards requiring fast response times, especially with frequently updated data, Druid’s sub-second query performance is a significant advantage.
  • Time-Series and Event-Driven Data: Druid’s architecture is designed for time-series data, making it superior for analyzing log data, time-stamped events, and similar data.
  • Operational Analytics: Druid excels in operational analytics where real-time data ingestion and low-latency insights are needed for decision-making.
  • Cost-Effective Real-Time Workloads: For continuous real-time querying and analysis, Druid’s cost structure may be more affordable compared to Snowflake’s compute-based pricing.

Data Ingestion, Data Processing, and Data Querying

Apache Druid

Data Ingestion and Data Processing

Apache Druid is a high-performance real-time analytics database designed for fast querying of large datasets. One common use case is ingesting a CSV file into Druid to enable interactive analysis. In this guide, we will walk through each step to upload a CSV file to Apache Druid, covering both the Druid Console and the ingestion configuration details.

Note: We have used the default configuration settings that come with the trial edition of Apache Druid and Snowflake for this exercise. Results may vary depending on the configuration of the applications.

Prerequisites:

(a) Ensure your CSV file is:

    • Properly formatted, with headers in the first row.
    • Clean of inconsistencies (e.g., missing data or malformed values).
    • Stored locally or accessible via a URL if uploading from a remote location.

(b) Launch the Apache Druid Console.

(c) Start your Apache Druid cluster if it is not already running.

(d) Open the Druid Console by navigating to its URL (default: http://localhost:8888).

(e) Load Data: Select the Datasources Grid.

Create a New Data Ingestion Task

(a) Navigate to the Ingestion Section:

    • In the Druid Console, click on Data in the top navigation bar.
    • Select Load data to start a new ingestion task.

(b) Choose the Data Source:

    • Select Local disk if the CSV file is on the same server as the Druid cluster.
    • Select HTTP(s) if the file is accessible via a URL.
    • Choose Amazon S3, Google Cloud Storage, or other options if the file is stored in a cloud storage service.

(c) Upload or Specify the File Path:

    • For local ingestion: Provide the absolute path to the CSV file.
    • For HTTP ingestion: Enter the file’s URL.

(d) Start a New Batch Spec: We are using a new file for processing.

(e) Connect and parse raw data: Select Local Disk option.

(f) Specify the Data: Select the base directory of the file and choose the file type you are processing.

  •  File Format:
    • Select CSV as the file format.
    • If your CSV uses a custom delimiter (e.g., ;), specify it here.
  • Parse Timestamp Column:
    • Druid requires a timestamp column to index the data.
    • Select the column containing the timestamp (e.g., timestamp).
    • Specify the timestamp format if it differs from ISO 8601 (e.g., yyyy-MM-dd HH:mm:ss).
  •  Preview Data:
    • The console will show a preview of the parsed data.
    • Ensure that all columns are correctly identified.

(g) Connect: Connect details.

(h) Parse Data: At this stage, you should be able to see the data in Druid to parse it. 

(i) Parse Time: At this stage, you should be able to see the data in Druid to parse the time details. 

(j) Transform: At this stage, Druid allows you to perform some fundamental Transformations. 

(k) Filter: At this stage, Druid allows you to apply the filter conditions to your data.

(l) Configure Schema: At this stage, we configure the schema details for the file that we uploaded to Druid.

  • Define Dimensions and Metrics:
    • Dimensions are fields you want to filter or group by (e.g., category).
    • Metrics are fields you want to aggregate (e.g., value).
  • Primary Timestamp:
    • Confirm the primary timestamp field.
  • Partitioning and Indexing:
    • Select a time-based partitioning scheme, such as day or hour.
    • Choose indexing options like bitmap or auto-compaction if needed.

(m) Partition: At this stage, Druid allows you to configure the date-specific granularity details.

(n) Tune:

    • Max Rows in Memory:
      • Specify the maximum number of rows to store in memory during ingestion.
    • Segment Granularity:
      • Define how data is divided into segments (e.g., daily, hourly).
    • Partitioning and Indexing:
      • Configure the number of parallel tasks for ingestion if working with large datasets.

(o) Publish:

(p) Edit Spec:

(q) Submit Task:

    1. Generate the Ingestion Spec:
      • Review the auto-generated ingestion spec JSON in the console.
      • Edit the JSON manually if you need to add custom configurations (a small illustrative sketch of such a spec follows this list).
    2. Submit the Task:
      • Click Submit to start the ingestion process.
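For readers who prefer working with the spec directly, the following is a minimal, illustrative sketch of submitting a native batch ingestion spec to Druid's task API from Python; the datasource name, base directory, file name, columns, and granularity below are assumptions for this example, and the spec auto-generated by the console will be more complete:

import requests

# Druid task endpoint, here via the router on the default console port; adjust for your deployment
TASK_URL = "http://localhost:8888/druid/indexer/v1/task"

# Minimal native batch ingestion spec; all names and paths below are illustrative assumptions
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "local", "baseDir": "/data", "filter": "customer.csv"},
            "inputFormat": {"type": "csv", "findColumnsFromHeader": True},
        },
        "dataSchema": {
            "dataSource": "customer",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["category"]},
            "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "none"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

response = requests.post(TASK_URL, json=ingestion_spec, timeout=60)
response.raise_for_status()
print(response.json())  # returns the task id, which can then be monitored in the Tasks section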

(r) Monitor the Ingestion Process:

    1. Navigate to the Tasks section in the Druid Console.
    2. Monitor the progress of the ingestion task.
    3. If the task fails, review the logs for errors, such as incorrect schema definitions or file path issues.

Processing a CSV load of 1 million rows took 33 seconds.

Data Querying

Query the Ingested Data 

    1. Navigate to the Query Tab:
      • In the Druid Console, go to Query and select your new data source.
    2. Write and Execute Queries:
      • Use Druid SQL or the native JSON query language to interact with your data.

Example SQL Query:
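As an illustrative stand-in for the query used in this test, a simple Druid SQL query against the ingested datasource (assuming it is named customer) can be submitted through Druid's HTTP SQL endpoint, for example:

import requests

# Druid SQL endpoint exposed via the router (default console URL http://localhost:8888)
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Illustrative query; the datasource name is an assumption for this example
payload = {
    "query": "SELECT __time, COUNT(*) AS events FROM customer GROUP BY __time ORDER BY __time LIMIT 100"
}

response = requests.post(DRUID_SQL_URL, json=payload, timeout=60)
response.raise_for_status()
for row in response.json():
    print(row)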

Query performance for querying and rendering 1 million rows was 15 ms.

    3. Visualize Results:
      • View the query results directly in the console, or connect Druid to a visualization tool like Apache Superset. In our upcoming blog, we will see how to visualize this ingested data using Apache Superset. Meanwhile, our previous blogs, Exploring Apache Druid: A High-Performance Real-Time Analytics Database and Apache Druid Integration with Apache Superset, discuss Apache Druid in more detail; in case you missed them, it is recommended that you read them before continuing the Apache Druid blog series.

Uploading data sets to Apache Druid is a straightforward process, thanks to the intuitive Druid Console. By following these steps, you can ingest your data, configure schemas, and start analyzing it in no time. Whether you’re exploring business metrics, performing real-time analytics, or handling time-series data, Apache Druid provides the tools and performance needed to get insights quickly.

Snowflake

  1. Log in to the Snowflake application using the credentials.

2. Upload the CSV file to the Server.

3. The uploaded “customer.csv” file has 1 million rows and is around 166 MB in size.

4. Specify the Snowflake Database.

5. Specify the Snowflake Table. 

6. The CSV data is being processed into the specified Snowflake Database and table.

7. Snowflake allows us to format or edit the metadata information before loading the data into Snowflake DB. Update the details as per the data and click on the Load button.

8. Snowflake successfully loaded the customer data. Snowflake took 45 seconds to process the CSV data, roughly 12 seconds longer than Apache Druid to load, transform, and process 1 million records.

9. Query performance for rendering 1 million rows was 20 ms, about 5 ms slower than Apache Druid for querying 1 million records.

Conclusion

While Snowflake is a powerful, cloud-native data warehouse with strengths in batch processing, historical data analysis, and complex OLAP queries, Apache Druid stands out in scenarios where real-time, low-latency analytics are needed. Druid’s open-source nature, time-series optimizations, streaming data support, and operational focus make it the better choice for real-time analytics, event-driven applications, and fast, interactive querying on large datasets.

Apache Druid FAQ

What is Apache Druid used for?

Apache Druid is used for real-time analytics and fast querying of large-scale datasets. It's particularly suitable for time-series data, clickstream analytics, operational intelligence, and real-time monitoring.

What ingestion methods does Druid support?

Druid supports:

  • Streaming ingestion: Data from Kafka, Kinesis, etc., is ingested in real time.
  • Batch ingestion: Data from Hadoop, S3, or other static sources is ingested in batches.

How does Druid achieve fast query performance?

Druid uses:

  • Columnar storage: Optimizes storage for analytic queries.
  • Advanced indexing: Bitmap and inverted indexes speed up filtering and aggregation.
  • Distributed architecture: Spreads query load across multiple nodes.

Is Druid suitable for transactional (OLTP) workloads?

No. Druid is optimized for OLAP (Online Analytical Processing) and is not suitable for transactional use cases.

How does Druid store data?

  • Data is segmented into immutable chunks and stored in Deep Storage (e.g., S3, HDFS).
  • Historical nodes cache frequently accessed segments for faster querying.

Which query languages does Druid support?

  • SQL: Provides a familiar interface for querying.
  • Native JSON-based query language: Offers more flexibility and advanced features.

What are the main components of a Druid cluster?

  • Broker: Routes queries to appropriate nodes.
  • Historical nodes: Serve immutable data.
  • MiddleManager: Handles ingestion tasks.
  • Overlord: Coordinates ingestion processes.
  • Coordinator: Manages data availability and balancing.

Which tools does Druid integrate with?

Druid integrates with tools like:

  • Apache Superset
  • Tableau
  • Grafana
  • Stream ingestion tools (Kafka, Kinesis)
  • Batch processing frameworks (Hadoop, Spark)

Is Apache Druid free and open-source?

Yes, it’s open-source and distributed under the Apache License 2.0.

What are the alternatives to Apache Druid?

Alternatives include:

  • ClickHouse
  • Elasticsearch
  • BigQuery
  • Snowflake

Whether you're exploring real-time analytics or need help getting started, feel free to reach out!

Watch the Apache Blog Series

Stay tuned for the upcoming Apache Blog Series:

  1. Exploring Apache Druid: A High-Performance Real-Time Analytics Database
  2. Unlocking Data Insights with Apache Superset
  3. Streamlining Apache HOP Workflow Management with Apache Airflow
  4. Comparison of and migrating from Pentaho Data Integration PDI/ Kettle to Apache HOP
  5. Apache Druid Integration with Apache Superset
  6. Why choose Apache Druid over Vertica
  7. Why choose Apache Druid over Snowflake
  8. Why choose Apache Druid over Google Big Query
  9. Integrating Apache Druid with Apache Superset for Realtime Analytics

Apache Druid Integration with Apache Superset

Introduction

In our previous blog, Exploring Apache Druid: A High-Performance Real-Time Analytics Database, we discussed Apache Druid in more detail. In case you have missed it, it is recommended that you read it before continuing the Apache Druid blog series. In this blog, we talk about integrating Apache Superset and Apache Druid.

Apache Superset and Apache Druid are both open-source tools that are often used together in data analytics and business intelligence workflows. Here’s an overview of each and their relationship.

Apache Superset

Apache Superset is an open-source data exploration and visualization platform originally developed at Airbnb and later donated to the Apache Software Foundation. It is now a top-level Apache project, widely adopted across industries for data analytics and visualization.

  1. Type: Data visualization and exploration platform.
  2. Purpose: Enables users to create dashboards, charts, and reports for data analysis.
  3. Features:
    1. User-friendly interface for querying and visualizing data.
    2. Wide variety of visualization types (charts, maps, graphs, etc.).
    3. SQL Lab: A rich SQL IDE for writing and running queries.
    4. Extensible: Custom plugins and charts can be developed.
    5. Role-based access control and authentication options.
    6. Integrates with multiple data sources, including databases, warehouses, and big data engines.
  4. Use Cases:
    1. Business intelligence (BI) dashboards.
    2. Interactive data exploration.
    3. Monitoring and operational analytics.

Apache Druid

Apache Druid is a high-performance, distributed real-time analytics database designed to handle fast, interactive queries on large-scale datasets, especially those with a time component. It is widely used for powering business intelligence (BI) dashboards, real-time monitoring systems, and exploratory data analytics.

  1. Type: Real-time analytics database.
  2. Purpose: Designed for fast aggregation, ingestion, and querying of high-volume event streams.
  3. Features:
    1. Columnar storage format: Optimized for analytic workloads.
    2. Real-time ingestion: Can process and make data available immediately.
    3. Fast queries: High-performance query engine for OLAP (Online Analytical Processing).
    4. Scalability: Handles massive data volumes efficiently.
    5. Support for time-series data: Designed for datasets with a time component.
    6. Flexible data ingestion: Supports batch and streaming ingestion from sources like Kafka, Hadoop, and object stores.
  4. Use Cases:
    1. Real-time dashboards.
    2. Clickstream analytics.
    3. Operational intelligence and monitoring.
    4. Interactive drill-down analytics.

How They Work Together

  1. Integration: Superset can connect to Druid as a data source, enabling users to visualize and interact with the data stored in Druid.
  2. Workflow:
    1. Data Ingestion: Druid ingests large-scale data in real-time or batches from various sources.
    2. Data Querying: Superset uses Druid’s fast query capabilities to retrieve data for visualizations.
    3. Visualization: Superset presents the data in user-friendly dashboards, allowing users to explore and analyze it.

Key Advantages of Using Them Together

  1. Speed: Druid’s real-time and high-speed querying capabilities complement Superset’s visualization features, enabling near real-time analytics.
  2. Scalability: The combination is suitable for high-throughput environments where large-scale data analysis is needed.
  3. Flexibility: Supports a wide range of analytics use cases, from real-time monitoring to historical data analysis.

Use Cases of Apache Druid and Apache Superset

  1. Fraud Detection: Analyzing transactional data for anomalies.
  2. Real-Time Analytics: Monitoring website traffic, application performance, or IoT devices in real-time.
  3. Operational Intelligence: Tracking business metrics like revenue, sales, or customer behaviour.
  4. Clickstream Analysis: Analyzing user interactions and behaviour on websites and apps.
  5. Log and Event Analytics: Parsing and querying log files or event streams.

Importance of Integrating Apache Druid with Apache Superset

Integrating Apache Druid with Apache Superset gives organizations an effective way to address high-scale data analysis and interactive visualization needs. Here’s why this combination is so powerful:

  1. High-Performance Analytics:

Druid’s Speed: Apache Druid is optimized for low-latency queries and fast aggregation, and is designed with large datasets in mind. This matters because fast responses allow Superset to deliver responsive, production-quality visualizations.

Real-Time Data: Druid supports real-time data ingestion, allowing Superset dashboards to visualise near-live data streams, making it suitable for monitoring and operational analytics.

  2. Advanced Query Capabilities:

Complex Aggregations: Druid’s ability to handle complex OLAP-style aggregations and filtering ensures that Superset can provide rich and detailed visualizations.

Native Time-Series Support: Druid’s in-built time-series functionalities allow seamless integration for Superset’s time-series charts and analysis.

  3. Scalability:

Designed for Big Data: Druid is highly scalable and can manage billions of rows in an efficient manner. Together with Superset, it helps dashboards scale easily for high volumes of data.

Distributed Architecture: Being by nature distributed, Druid ensures reliable performance of Superset even when it’s under high loads or many concurrent user sessions.

  4. Simplified Data Exploration:

Columnar Storage: Druid’s columnar storage format greatly accelerates exploratory data analysis in Superset, since it efficiently scans only the selected columns.

Pre-Aggregated Data: Superset can take advantage of Druid’s roll-up and pre-aggregation mechanisms to reduce query cost and improve visualization performance.

  1. Flawless Integration:

Incorporated Connector: Superset is natively integrated with Druid. Therefore, there would not be any hassle regarding implementing it. It lets users query Druid directly and curate dashboards using minimal configuration. Moreover, non-technical users can also perform SQL-like queries (Druid SQL) within the interface.

  1. Advanced Features:

Granular Access Control: When using Druid with Superset, it is possible to enforce row-level security and user-based data access, ensuring that sensitive data is only accessible to authorized users.

Segment-Level Optimization: Druid’s segment-based storage enables Superset to optimize queries for specific data slices.

  1. Cost Efficiency:

Efficient Resource Utilization: Druid’s architecture minimizes storage and compute costs while providing high-speed analytics, ensuring that Superset dashboards run efficiently.

  1. Real-Time Monitoring and BI:

Operational Dashboards: In the case of monitoring systems, website traffic, or application logs, Druid’s real-time capabilities make the dashboards built in Superset very effective for real-time operational intelligence.

Streaming Data Support: Druid supports ingesting stream-processing-enabled data sources; for example, Kafka’s ingestion is supported. This enables Superset to visualize live data with minimal delay.

Prerequisites

  1. The Superset Docker Container or instance should be up and running.
  2. Installation of pydruid in Superset Container or instance.

Installing PyDruid

1. Enable Druid-Specific SQL Queries

PyDruid is the Python library that serves as a client for Apache Druid. It allows Superset to:

  1. Run Druid SQL queries efficiently.
  2. Access Druid’s powerful querying capabilities directly from Superset’s SQL Lab and dashboards.

Without this client, Superset cannot communicate with Druid, which results in partial or broken functionality.
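
To make this concrete, here is a minimal, hedged sketch of the kind of Druid SQL access PyDruid provides. It assumes Druid’s router is reachable at localhost:8888 over plain HTTP and that a hypothetical example_data datasource exists; adjust both for your environment.

Python Code

# druid_sql_via_pydruid.py: minimal sketch of Druid SQL access through PyDruid
from pydruid.db import connect

conn = connect(host="localhost", port=8888, path="/druid/v2/sql/", scheme="http")
cursor = conn.cursor()

# The same style of SQL that Superset's SQL Lab and charts send through PyDruid
cursor.execute(
    "SELECT TIME_FLOOR(__time, 'P1D') AS day, COUNT(*) AS events "
    "FROM example_data GROUP BY 1 ORDER BY 1"
)
for row in cursor.fetchall():
    print(row)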

2. Native Connection Support

Superset relies on PyDruid to:

  1. Establish seamless connections to Apache Druid.
  2. Translate Superset queries into Druid SQL or the native JSON query format that the Druid engine understands (illustrated in the sketch below).
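
As a hedged illustration of that translation, the same question can be phrased either as Druid SQL or as a native JSON query. The example_data datasource and field names here are assumptions for illustration only.

Python Code

# sql_vs_native.py: the same daily count expressed as Druid SQL and as a native JSON query
druid_sql = (
    "SELECT TIME_FLOOR(__time, 'P1D') AS day, COUNT(*) AS events "
    "FROM example_data "
    "WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-02-01' "
    "GROUP BY 1"
)

native_query = {
    "queryType": "timeseries",
    "dataSource": "example_data",
    "granularity": "day",
    "intervals": ["2023-01-01/2023-02-01"],
    "aggregations": [{"type": "count", "name": "events"}],
}

print(druid_sql)
print(native_query)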

3. Efficient Data Fetching

PyDruid optimizes data fetching from Druid by:

  1. Managing Druid’s query API for aggregation, filtering, and time-series analysis.
  2. Reducing latency through better handling of connections and responses.

4. Advanced Visualization Features

Many of Superset’s advanced visualizations (such as time-series charts and heat maps) depend on Druid capabilities that PyDruid exposes, including query granularity, time-based aggregation, and access to complex metrics pre-aggregated in Druid.

5. Integration with Superset’s Druid Connector

The Druid connector in Superset relies on PyDruid as a dependency to:

  1. Register Druid as a data source.
  2. Let Superset users query and visually explore Druid datasets directly.

Without PyDruid, the Druid connector in Superset will not work, limiting your ability to use Druid as an analytics backend.

6. Data Discovery

PyDruid enables Superset to:

  1. Fetch metadata from Druid, including schemas, tables, and columns.
  2. Support ad hoc exploration of Druid data in SQL Lab.
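
As a hedged sketch of that metadata discovery, the query below asks Druid for the columns of a hypothetical example_data datasource; the host, port, and datasource name are assumptions.

Python Code

# list_druid_columns.py: sketch of the metadata lookup PyDruid makes possible
from pydruid.db import connect

conn = connect(host="localhost", port=8888, path="/druid/v2/sql/", scheme="http")
cursor = conn.cursor()
cursor.execute(
    "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS "
    "WHERE TABLE_NAME = 'example_data'"
)
for column_name, data_type in cursor.fetchall():
    print(column_name, data_type)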

Apache Druid as a Data Source

1. Make sure your Druid instance is set up and reachable on its configured ports.

2. Install pydruid by logging into the container:

$ docker exec -it --user root <superset_container_name> /bin/bash 

$ pip install pydruid

3. Once pydruid is installed, navigate to Settings and click on Database Connections.

4. Click on Add Database and select Apache Druid.

5. Enter the SQLAlchemy URI. To learn more about the SQLAlchemy URI rules, see the Superset documentation:

https://superset.apache.org/docs/configuration/databases/

6. For Druid, the SQLAlchemy URI has the following format:

druid://<User>:<password>@<Host>:<Port-default-9088>/druid/v2/sql

Note: 

Although the placeholder above reads “<Port-default-9088>”, the ports Druid listens on by default are “8888” (router console) or “8081” (coordinator).

If your Druid installation runs over plain HTTP without a secure connection, a URI built with the rule above works without any issue.

Ex: druid://localhost:8888/druid/v2/sql/

If you have enabled a secure (HTTPS) connection, druid://localhost:8888/druid/v2/sql/ will normally fail with an error. 

The first adjustment is to provide the domain name used for the secured Druid UI instead of localhost:

druid://domain.name.com:8888/druid/v2/sql/. This will also not work.

Other sources suggest variations such as

druid+https://localhost:8888/druid/v2/sql/

OR

druid+https://localhost:8888/druid/v2/sql/?ssl_verify=false

Neither of these works either.

The reason is that once a secure connection is enabled, HTTPS is normally served on port 443 by the proxy sitting in front of the application’s default ports. So if we enter the SQLAlchemy URI as

druid+https://domain.name.com:443/druid/v2/sql/?ssl=true

Superset understands that Druid is behind a secured connection and running on the HTTPS port.

7. Since SSL is enabled on our Druid application, we enter the SQLAlchemy URI below

druid+https://domain.name.com:443/druid/v2/sql/?ssl=true 

and click on Test Connection.

8. Once the test connection succeeds, Apache Superset is integrated with Apache Druid and ready to use.
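
The same URI can also be sanity-checked outside Superset. The sketch below is a hedged illustration only: it assumes pydruid is installed with its SQLAlchemy support in the environment where it runs, and domain.name.com is the placeholder domain used above.

Python Code

# test_druid_uri.py: hedged sketch for verifying the SQLAlchemy URI outside Superset
from sqlalchemy import create_engine, text

# The ?ssl=true flag mirrors the Superset configuration above; drop it if your
# pydruid version rejects unrecognized URI parameters.
engine = create_engine("druid+https://domain.name.com:443/druid/v2/sql/?ssl=true")

with engine.connect() as conn:
    # A lightweight metadata query any reachable Druid SQL endpoint can answer
    for (table_name,) in conn.execute(text("SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES LIMIT 5")):
        print(table_name)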

In the upcoming articles, we will talk about the data ingestion and visualization of Apache Druid data in the Apache Superset.

Apache Druid FAQ

What is Apache Druid used for?

Apache Druid is used for real-time analytics and fast querying of large-scale datasets. It's particularly suitable for time-series data, clickstream analytics, operational intelligence, and real-time monitoring.

What ingestion methods does Druid support?

Druid supports:

  1. Streaming ingestion: Data from Kafka, Kinesis, etc., is ingested in real-time.
  2. Batch ingestion: Data from Hadoop, S3, or other static sources is ingested in batches.

How does Druid achieve fast query performance?

Druid uses:

  1. Columnar storage: Optimizes storage for analytic queries.
  2. Advanced indexing: Bitmap and inverted indexes speed up filtering and aggregation.
  3. Distributed architecture: Spreads query load across multiple nodes.

Can Druid be used for transactional (OLTP) workloads?

No. Druid is optimized for OLAP (Online Analytical Processing) and is not suitable for transactional use cases.

How does Druid store data?

  1. Data is segmented into immutable chunks and stored in Deep Storage (e.g., S3, HDFS).
  2. Historical nodes cache frequently accessed segments for faster querying.

What query languages does Druid support?

  1. SQL: Provides a familiar interface for querying.
  2. Native JSON-based query language: Offers more flexibility and advanced features.

What are the main components of a Druid cluster?

  1. Broker: Routes queries to appropriate nodes.
  2. Historical nodes: Serve immutable data.
  3. MiddleManager: Handles ingestion tasks.
  4. Overlord: Coordinates ingestion processes.
  5. Coordinator: Manages data availability and balancing.

Which tools does Druid integrate with?

Druid integrates with tools like:

  1. Apache Superset
  2. Tableau
  3. Grafana
  4. Stream ingestion tools (Kafka, Kinesis)
  5. Batch processing frameworks (Hadoop, Spark)

Is Apache Druid open-source?

Yes, it’s open-source and distributed under the Apache License 2.0.

What are the alternatives to Apache Druid?

Alternatives include:

  1. ClickHouse
  2. Elasticsearch
  3. BigQuery
  4. Snowflake

Apache Superset FAQ

What is Apache Superset?

Apache Superset is an open-source data visualization and exploration platform. It allows users to create dashboards, charts, and interactive visualizations from various data sources.

What visualization types does Superset offer?

Superset offers:

  1. Line charts, bar charts, pie charts.
  2. Geographic visualizations like maps.
  3. Pivot tables, histograms, and heatmaps.
  4. Custom visualizations via plugins.

Which data sources does Superset support?

Superset supports:

  1. Relational databases (PostgreSQL, MySQL, SQLite).
  2. Big data engines (Hive, Presto, Trino, BigQuery).
  3. Analytics databases (Apache Druid, ClickHouse).

How does Superset compare to commercial BI tools?

Superset is a robust, open-source alternative to commercial BI tools. While it’s feature-rich, it may require more setup and development effort compared to proprietary solutions like Tableau.

What query language does Superset use?

Superset primarily supports SQL for querying data. Its SQL Lab provides an environment for writing, testing, and running queries.

What authentication options does Superset support?

Superset supports:

  1. Database-based login.
  2. OAuth, LDAP, OpenID Connect, and other SSO mechanisms.
  3. Role-based access control (RBAC) for managing permissions.

How can Superset be installed?

Superset can be installed using:

  1. Docker: For containerized deployments.
  2. Python pip: For a native installation in a Python environment.
  3. Helm: For Kubernetes-based setups.

Can Superset display real-time data?

Yes. By integrating Superset with real-time analytics platforms like Apache Druid or streaming data sources, you can build dashboards that reflect live data.

What are Superset’s limitations?

  1. Requires knowledge of SQL for advanced queries.
  2. May not have the polish or advanced features of proprietary BI tools like Tableau.
  3. Customization and extensibility require some development effort.

What affects Superset’s performance?

Superset performance depends on:

  1. The efficiency of the connected database or analytics engine.
  2. The complexity of queries and visualizations.
  3. Hardware resources and configuration.

Would you like further details on specific features or help setting up either tool? Reach out to us.

Watch the Apache Blog Series

Exploring Apache Druid: A High-Performance Real-Time Analytics Database

Introduction

Apache Druid is a distributed, column-oriented, real-time analytics database designed for fast, scalable, and interactive analytics on large datasets. It excels in use cases requiring real-time data ingestion, high-performance queries, and low-latency analytics. 

Druid was originally developed to power interactive data applications at Metamarkets and has since become a widely adopted open-source solution for real-time analytics, particularly in industries such as ad tech, fintech, and IoT.

It supports batch and real-time data ingestion, enabling users to run fast ad-hoc queries, power dashboards, and explore data interactively.

In big data and real-time analytics, having the right tools to process and analyze large volumes of data swiftly is essential. Apache Druid, an open-source, high-performance, column-oriented distributed data store, has emerged as a leading solution for real-time analytics and OLAP (online analytical processing) workloads. In this blog post, we’ll delve into what Apache Druid is, its key features, and how it can revolutionize your data analytics capabilities. Refer to the official documentation for more information.

Apache Druid

Apache Druid is a high-performance, real-time analytics database designed for fast slice-and-dice analytics on large datasets. It was created by Metamarkets (now part of Snap Inc.) and is now an Apache Software Foundation project. Druid is built to handle both batch and streaming data, making it ideal for use cases that require real-time insights and low-latency queries.

Key Features of Apache Druid:

Real-Time Data Ingestion

Druid excels at real-time data ingestion, allowing data to be ingested from various sources such as Kafka, Kinesis, and traditional batch files. It supports real-time indexing, enabling immediate query capabilities on incoming data with low latency.

High-Performance Query Engine

Druid’s query engine is optimized for fast, interactive querying. It supports a wide range of query types, including Time-series, TopN, GroupBy, and search queries. Druid’s columnar storage format and advanced indexing techniques, such as bitmap indexes and compressed column stores, ensure that queries are executed efficiently.
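
As a hedged illustration of these query types, the PyDruid client can issue a native Timeseries query directly. The example_data datasource, the "count" metric, and the localhost:8888 router address are assumptions for this sketch.

Python Code

# native_timeseries.py: sketch of a native Druid Timeseries query via the PyDruid client
from pydruid.client import PyDruid
from pydruid.utils.aggregators import doublesum

client = PyDruid("http://localhost:8888", "druid/v2")

query = client.timeseries(
    datasource="example_data",
    granularity="day",
    intervals="2023-01-01/2023-02-01",
    aggregations={"total_count": doublesum("count")},  # assumes a roll-up "count" metric
)
print(query.result)  # list of {"timestamp": ..., "result": {...}} entries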

Scalable and Distributed Architecture

Druid’s architecture is designed to scale horizontally. It can be deployed on a cluster of commodity hardware, with data distributed across multiple nodes to ensure high availability and fault tolerance. This scalability makes Druid suitable for handling large datasets and high query loads.

Flexible Data Model

Druid’s flexible data model allows for the ingestion of semi-structured and structured data. It supports schema-on-read, enabling dynamic column discovery and flexibility in handling varying data formats. This flexibility simplifies the integration of new data sources and evolving data schemas.

Built-In Data Management

Druid includes built-in features for data management, such as automatic data partitioning, data retention policies, and compaction tasks. These features help maintain optimal query performance and storage efficiency as data volumes grow.

Extensive Integration Capabilities

Druid integrates seamlessly with various data ingestion and processing frameworks, including Apache Kafka, Apache Storm, and Apache Flink. It also supports integration with visualization tools like Apache Superset, Tableau, and Grafana, enabling users to build comprehensive analytics solutions.

Use Cases of Apache Druid

Real-Time Analytics

Druid is used in real-time analytics applications where the ability to ingest and query data in near real-time is critical. This includes monitoring applications, fraud detection, and customer behavior tracking.

Ad-Tech and Marketing Analytics

Druid’s ability to handle high-throughput data ingestion and fast queries makes it a popular choice in the ad tech and marketing industries. It can track user events, clicks, impressions, and conversion rates in real time to optimize campaigns.

IoT Data and Sensor Analytics

IoT applications produce time-series data at high volume. Druid’s architecture is optimized for time-series data analysis, making it ideal for analyzing IoT sensor data, device telemetry, and real-time event tracking.

Operational Dashboards

Druid is often used to power operational dashboards that provide insights into infrastructure, systems, or applications. The low-latency query capabilities ensure that dashboards reflect real-time data without delay.

Clickstream Analysis

Organizations leverage Druid to analyze user clickstream data on websites and applications, allowing for in-depth analysis of user interactions, preferences, and behaviors in real time.

The Architecture of Apache Druid

Apache Druid follows a distributed, microservice-based architecture. The architecture allows for scaling different components based on the system’s needs.

The main components are:

Coordinator and Overlord Nodes

  1. Coordinator Node: Manages data availability, balancing the distribution of data across the cluster, and overseeing segment management (segments are the basic units of storage in Druid).
  2. Overlord Node: Responsible for managing ingestion tasks. It works with the middle managers to schedule and execute data ingestion tasks, ensuring that data is ingested properly into the system.

Historical Nodes

Historical nodes store immutable segments of historical data. When queries are executed, historical nodes serve data from the disk, which allows for low-latency and high-throughput queries.

MiddleManager Nodes

MiddleManager nodes handle real-time ingestion tasks. They ingest data from real-time streams (such as Kafka), transform it, and publish the resulting segments to deep storage, after which historical nodes load and serve them.

Broker Nodes

The broker nodes route incoming queries to the appropriate historical or real-time nodes and aggregate the results. They act as the query routers and perform query federation across the Druid cluster.

Query Nodes

Query nodes are responsible for receiving, routing, and processing queries. They can handle a variety of query types, including SQL, and route these queries to other nodes for execution.

Deep Storage

Druid relies on an external deep storage system (such as Amazon S3, Google Cloud Storage, or HDFS) to store segments of data permanently. The historical nodes pull these segments from deep storage when they need to serve data.

Metadata Storage

Druid uses an external relational database (typically PostgreSQL or MySQL) to store metadata about the data, including segment information, task states, and configuration settings.

Advantages of Apache Druid

  1. Sub-Second Query Latency: Optimized for high-speed data queries, making it perfect for real-time dashboards.
  2. Scalability: Easily scales to handle petabytes of data.
  3. Flexible Data Ingestion: Supports both batch and real-time data ingestion from multiple sources like Kafka, HDFS, and Amazon S3.
  4. Column-Oriented Storage: Efficient data storage with high compression ratios and fast retrieval of specific columns.
  5. SQL Support: Familiar SQL-like querying capabilities for easy data analysis.
  6. High Availability: Fault-tolerant and highly available due to data replication across nodes.

Getting Started with Apache Druid

Installation and Setup

Setting up Apache Druid involves configuring a cluster with different node types, each responsible for specific tasks:

  1. Master Nodes: Oversee coordination, metadata management, and data distribution.
  2. Data Nodes: Handle data storage, ingestion, and querying.
  3. Query Nodes: Manage query routing and processing.

You can install Druid using a package manager, Docker, or by downloading and extracting the binary distribution. Here’s a brief overview of setting up Druid using Docker:

  1. Download the Docker Compose File:
    $ curl -O https://raw.githubusercontent.com/apache/druid/master/examples/docker-compose/docker-compose.yml
  2. Start the Druid Cluster: $ docker-compose up
  3. Access the Druid Console: Open your web browser and navigate to http://localhost:8888 to access the Druid console.
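
Once the containers are up, a quick way to confirm the cluster is responding is shown in the hedged sketch below; it assumes the router console is on localhost:8888 as in the setup above.

Python Code

# check_druid.py: confirm the local Druid router is responding
import requests

resp = requests.get("http://localhost:8888/status/health")
resp.raise_for_status()
print("Druid router healthy:", resp.json())  # prints True when the service is up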

Ingesting Data

To ingest data into Druid, you need to define an ingestion spec that outlines the data source, input format, and parsing rules. Here’s an example of a simple ingestion spec for a CSV file:

JSON Code

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/csv",
        "filter": "*.csv"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      }
    },
    "dataSchema": {
      "dataSource": "example_data",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["column1", "column2", "column3"]
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}

Submit the ingestion spec through the Druid console or via the Druid API to start the data ingestion process.
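
For reference, here is a hedged sketch of submitting the spec programmatically through the router; the ingestion_spec.json file name and localhost:8888 address are assumptions for this example.

Python Code

# submit_ingestion.py: submit an ingestion spec to Druid's task endpoint
import json
import requests

with open("ingestion_spec.json") as f:
    spec = json.load(f)

# POST the task to the Overlord, proxied through the router
resp = requests.post("http://localhost:8888/druid/indexer/v1/task", json=spec)
resp.raise_for_status()
print("Submitted task:", resp.json().get("task"))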

Querying Data

Once your data is ingested, you can query it using Druid’s native query language or SQL. Here’s an example of a simple SQL query to retrieve data from the example_data data source:

SELECT __time, column1, column2, column3
FROM example_data
WHERE __time BETWEEN '2023-01-01' AND '2023-01-31'

Use the Druid console or connect to Druid from your preferred BI tool to execute queries and visualize data.
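
As a hedged sketch, the same query can also be sent straight to Druid’s SQL HTTP API, which is handy for quick checks outside a BI tool; the host and datasource match the examples above.

Python Code

# query_druid_sql.py: run the example SQL query over Druid's SQL HTTP API
import requests

sql = (
    "SELECT __time, column1, column2, column3 "
    "FROM example_data "
    "WHERE __time BETWEEN '2023-01-01' AND '2023-01-31'"
)

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql})
resp.raise_for_status()
for row in resp.json():  # each row is returned as a JSON object
    print(row)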

Conclusion

Apache Druid is a powerful, high-performance real-time analytics database that excels at handling large-scale data ingestion and querying. Its robust architecture, flexible data model, and extensive integration capabilities make it a versatile solution for a wide range of analytics use cases. Whether you need real-time insights, interactive queries, or scalable OLAP capabilities, Apache Druid provides the tools to unlock the full potential of your data. Druid has firmly established itself as a leading database for real-time, high-performance analytics: its combination of real-time ingestion, sub-second query speeds, and scalability makes it a strong choice for businesses that need to analyze vast amounts of time-series and event-driven data, and its adoption continues to grow across industries. Explore Apache Druid today and transform your data analytics landscape.

Need help transforming your real-time analytics with high-performance querying? Contact our experts today!

Watch the Apache Druid Blog Series

Stay tuned for the upcoming Apache Druid Blog Series:

  1. Why choose Apache Druid over Vertica
  2. Why choose Apache Druid over Snowflake
  3. Why choose Apache Druid over Google BigQuery
  4. Integrating Apache Druid with Apache Superset for Real-Time Analytics