
Proactive Alerting for Cloud Applications using Grafana

When to Use Grafana and How to Set Up Alerting in Grafana

Alerting has become critical. While monitoring gives you an overview of the system, alerting is a near-real-time notification mechanism that informs the team the moment an issue occurs, in time to take quick action before things go bad. For example, suppose a server exceeds its expected CPU usage; an alert then prompts the team to address the matter before it leads to downtime or performance degradation. In short, alerting allows you to head off problems before they have a big impact on your system or business.

In this article, we will discuss the basic role of alerting in a monitoring system and how alerting works inside Grafana, one of the most powerful open-source tools for monitoring and visualization. After briefly discussing the importance of monitoring and alerting, we'll guide you through the steps to set up alerting in Grafana.

Importance of Alerting in Monitoring Systems

Monitoring is the process of continuously collecting data from various parts of the system and analyzing it over time to trace patterns or anomalies. It helps with capacity planning, exposes performance bottlenecks, and guides optimization efforts by showing a whole picture of system health without initiating any action. Alerting, by contrast, is an active response mechanism that informs teams when certain conditions or thresholds have been met; the objective is to keep teams aware of problems as they occur.

Main Differences

Objectives: Monitoring is concerned with long-term data collection and analysis, while alerting is directed at immediate issue detection and response.

Timing: Monitoring is always on, capturing data continuously, while alerts are event-driven and fire only when certain conditions are met.

Key Benefits of Alerts

Continuous Monitoring Without Human Intervention: Alerts automate the watch-keeping, ensuring that issues are flagged without constant human oversight.

Real-Time Notifications: Alerts fire on predefined conditions and send instant notifications, ensuring rapid responses to critical changes. The right people get notified, so escalations are managed properly.

Types of Alerts

Threshold-Based Alerts: Triggered when a metric crosses a defined threshold, for example raising an alert when CPU usage exceeds 90%.

Anomaly Detection Alerts: Designed to spot unusual patterns or behaviours that typical thresholds might not catch.

Event-Based Alerts: These alerts react to critical events, such as the failure of an application process or missing critical data, so teams are informed of important occurrences as they happen.

Setting Up Alerting in Grafana (Step-by-Step Guide)

Prerequisites to Set Up Alerts

Before you can have alerts working in Grafana, you need the environment set up as outlined below:

Data Source Integration: You will need a data source integrated with Grafana, such as Prometheus. Alerts are evaluated against the time-series data retrieved from such sources.

Understanding Alert Rules: An alert rule is a query that checks the state of a defined metric and determines whether an alert should be triggered given certain predefined conditions.
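Before wiring a rule up in Grafana, it can help to preview the rule's query directly against the data source. The following is a minimal sketch, not part of the original setup, assuming a Prometheus server on localhost:9090 and node_exporter CPU metrics (adjust both to your environment):

import requests

# Hypothetical Prometheus address; replace with your server.
PROMETHEUS_URL = "http://localhost:9090/api/v1/query"

# The kind of expression an alert rule would evaluate:
# average CPU usage per instance over the last 5 minutes.
query = '100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'

response = requests.get(PROMETHEUS_URL, params={"query": query}, timeout=10)
response.raise_for_status()

for result in response.json()["data"]["result"]:
    instance = result["metric"].get("instance", "unknown")
    cpu_used = float(result["value"][1])
    print(f"{instance}: {cpu_used:.1f}% CPU used")

If the query returns the values you expect here, the same expression can be used as the alert rule's query, with the threshold applied on top of it.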

Step 1: Log in to Grafana with the required credentials.

Step 2: Create a new dashboard or open an existing dashboard where the notification alert needs to be set up.

Steps to Create Alerts

Step 1: Create a Panel for Visualization

Add New Panel: First, add a new panel to your Grafana dashboard where you will visualize the metric that you are going to monitor.

Select Visualization Type: From the list, pick the visualization type, such as Graph or Singlestat, that best fits the sort of data you wish to monitor.

Step 2: Configure Alert

Alerting Menu Access: Navigate to the Alerting section from the menu.

New Alert Rule: Under Alerting, click New Alert Rule to start setting up an alert.

Data Source: From the list of available data sources, select the one to query, such as Prometheus.

Write the Query: Type the query that fetches the metric you need to monitor. Be sure the query accurately reflects the condition you need to monitor.

Set the Threshold: Define how the query result should be evaluated, i.e. whether the value is above a certain level or similar. For example, choose the condition "is above" with a threshold value (such as 80 for CPU usage).

Enter Values for Alerting Rule Options

Name: Give the rule a descriptive name for the alert, like “High CPU Usage Alert”.

Alert Conditions: Define a query that specifies the conditions under which the alert should be triggered.

Alert Evaluation Behavior: Select how frequently the alert should be evaluated (for example, every 5 minutes).

Labels and Notifications: Add relevant labels to help categorize your alerts, such as environment or service. Write the message that will go out once the alert is triggered, including action instructions and some background on the issue so it can be easily recognized.

Include Contact Information: Choose where the alert notifications should be delivered, such as email, Slack, Google Chat, PagerDuty, or webhooks. Remember, you will have to set up these notification channels (contact points) in Grafana beforehand. In the URL field, attach the webhook of the channel where you want to be notified.
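Before relying on a webhook contact point in a live alert, it is worth sending a test message to it directly. Below is a minimal sketch assuming a Slack-style incoming webhook that accepts a JSON body with a text field; the URL is a placeholder for the one configured in your contact point:

import requests

# Placeholder incoming-webhook URL; use the one configured in your Grafana contact point.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

payload = {"text": "Test message: the High CPU Usage Alert would be delivered to this channel."}

response = requests.post(WEBHOOK_URL, json=payload, timeout=10)
response.raise_for_status()
print("Webhook accepted the test message with status", response.status_code)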

Step 3: Testing your Alerts

Test the Alert: Use the testing feature in Grafana to check that your alert configuration is correct, so you can be confident the alert fires under the defined conditions.

Step 4: Finalize the Alert

Save Alert: Once all the settings are configured, save the alert rule by clicking Save.

Enable Alert: Finally, make sure the alert is enabled so it starts monitoring for the defined conditions.

Conclusion

Alerting is one of the most important features of a modern monitoring system, enabling teams to respond to issues at the earliest sign rather than letting them spin out of control. With proper alert definitions integrated with monitoring, organizations can avoid downtime, increase reliability, and keep complex systems running smoothly.

Alerts in Grafana must be actionable, not vague. Avoid over-complicating alert rules, review them regularly since infrastructure and environments are always changing, group and prioritize them properly, and take advantage of advanced notification options like webhooks or third-party tools.

In this post, we focused on how Grafana excels at detailed alert configuration and is well suited to monitoring system metrics, complementing tools like Uptime Kuma, which is good for simple service uptime tracking. In the next post, we dig deeper into Uptime Kuma, examining it in much more depth and showing its setup from the ground up. Stay tuned to find out how these two tools can work together to create a seamless, holistic monitoring and alerting strategy.

Have questions about Grafana, alerting, or optimizing your monitoring setup? Our team is here to assist

Frequently Asked Questions (FAQs)

The purpose of configuring notification alerts is to ensure timely awareness of issues in your systems by monitoring specific metrics. Alerts allow you to proactively respond to potential problems, reducing downtime and enhancing system performance.

You can access Grafana by logging in with the required credentials. If you don't have an account, you'll need to create one or request access from your administrator.

You can set up alerts on both existing dashboards and new ones. Simply open the dashboard where you want to configure the alert or create a new dashboard if needed.

You can use various visualization types, such as Graph or Singlestat, depending on how you want to display the metric you're monitoring.

In the alerting section under "Rules," select "New Alert Rule" and choose your data source (e.g., Prometheus, InfluxDB) when writing the query to retrieve the metric you want to monitor.

You can define alert conditions by specifying when the alert should trigger based on your chosen metric. This could be when the metric crosses a certain threshold or remains above or below a specific value for a defined duration.

Setting a threshold value determines the specific point at which an alert will be triggered, allowing you to control when you are notified of potential issues based on the behaviour of the monitored metric.

Yes, you can customize the alert messages by setting annotations in the alerting rule. This allows you to tailor the content of the notification that will be sent when the alert is triggered.

You can set contact points for notifications, such as Email, Hangouts, Slack, PagerDuty, or Webhooks. Attach the webhook URL for the channel where you want to receive alerts.

Testing the alert with the "Test Rule" button allows you to simulate the alert and see how it would behave under current conditions, ensuring the configuration works as expected before saving.

Server monitoring involves tracking the performance and health of servers to ensure they are running efficiently and to quickly identify and resolve any issues. It is important because it helps prevent downtime, ensures optimal performance, and maintains the reliability of services.

Grafana provides real-time insights into server metrics such as CPU usage, memory utilization, network traffic, and disk activity. It offers customizable dashboards and visualization options to help interpret data and spot anomalies quickly.

 

Alerts are configured in Grafana with custom rules and thresholds. Integrating with Google Chat, the system sends immediate notifications to the relevant team members when any anomalies or performance issues arise.

Node-exporter and Prometheus are used for data collection. Node-exporter gathers system-level metrics, while Prometheus stores these metrics and provides querying capabilities.

Grafana can monitor a wide range of metrics, including CPU usage, memory utilization, disk I/O, network traffic, application response times, and custom application metrics defined through various data sources.

Yes, Grafana supports integration with numerous third-party applications and services, including notification channels like Slack, Microsoft Teams, PagerDuty, and more, enhancing its alerting capabilities.

The data collection frequency can vary based on the configuration of the data source (like Prometheus) and the specific queries you set up. You can typically configure scrape intervals in your Prometheus setup.

Yes, Grafana allows you to share dashboards with team members via direct links, snapshots, or by exporting them. You can also set permissions to control who can view or edit the dashboards.

If you encounter issues, check the Grafana logs for error messages, review your alert configurations, and ensure that your data sources are properly connected. The Grafana community and documentation are also valuable resources for troubleshooting.

Yes, Grafana allows you to create complex alert conditions based on multiple metrics using advanced queries. You can combine metrics in a single alert rule to monitor related conditions.

If a data source goes down, Grafana will typically show an error or a warning on the dashboard. Alerts configured with that data source may also fail to trigger until the connection is restored.

Yes, Grafana allows you to visualize historical data by querying data sources that store time-series data, such as Prometheus. You can create dashboards that analyze trends over time.

Annotations are markers added to graphs in Grafana to indicate significant events or changes. They can provide context for data trends and help identify when specific incidents occurred.

Alerts are conditions set to monitor specific metrics and trigger under certain circumstances, while notifications are the messages sent out when those alerts are triggered, informing users of the situation.

Yes, Grafana offers some customization options for its UI, including themes and layout adjustments. You can also configure dashboard variables to create dynamic and user-friendly interfaces.

Yes, you can use Grafana's API to programmatically create and manage dashboards, allowing for automation in scenarios such as CI/CD pipelines or large-scale deployments.

Grafana offers extensive documentation, tutorials, and community forums. Additionally, there are many online courses and video tutorials available to help users learn the platform.


Apache Druid Integration with Apache Superset

Introduction

In our previous blog, Exploring Apache Druid: A High-Performance Real-Time Analytics Database, we discussed Apache Druid in detail. In case you have missed it, we recommend reading it before continuing with the Apache Druid blog series. In this blog, we talk about integrating Apache Superset with Apache Druid.

Apache Superset and Apache Druid are both open-source tools that are often used together in data analytics and business intelligence workflows. Here’s an overview of each and their relationship.

Apache Superset

Apache Superset is an open-source data exploration and visualization platform originally developed at Airbnb and later donated to the Apache Software Foundation. It is now a top-level Apache project, widely adopted across industries for data analytics and visualization.

  1. Type: Data visualization and exploration platform.
  2. Purpose: Enables users to create dashboards, charts, and reports for data analysis.
  3. Features:
    1. User-friendly interface for querying and visualizing data.
    2. Wide variety of visualization types (charts, maps, graphs, etc.).
    3. SQL Lab: A rich SQL IDE for writing and running queries.
    4. Extensible: Custom plugins and charts can be developed.
    5. Role-based access control and authentication options.
    6. Integrates with multiple data sources, including databases, warehouses, and big data engines.
  4. Use Cases:
    1. Business intelligence (BI) dashboards.
    2. Interactive data exploration.
    3. Monitoring and operational analytics.

Apache Druid

Apache Druid is a high-performance, distributed real-time analytics database designed to handle fast, interactive queries on large-scale datasets, especially those with a time component. It is widely used for powering business intelligence (BI) dashboards, real-time monitoring systems, and exploratory data analytics.

  1. Type: Real-time analytics database.
  2. Purpose: Designed for fast aggregation, ingestion, and querying of high-volume event streams.
  3. Features:
    1. Columnar storage format: Optimized for analytic workloads.
    2. Real-time ingestion: Can process and make data available immediately.
    3. Fast queries: High-performance query engine for OLAP (Online Analytical Processing).
    4. Scalability: Handles massive data volumes efficiently.
    5. Support for time-series data: Designed for datasets with a time component.
    6. Flexible data ingestion: Supports batch and streaming ingestion from sources like Kafka, Hadoop, and object stores.
  4. Use Cases:
    1. Real-time dashboards.
    2. Clickstream analytics.
    3. Operational intelligence and monitoring.
    4. Interactive drill-down analytics.

How They Work Together

  1. Integration: Superset can connect to Druid as a data source, enabling users to visualize and interact with the data stored in Druid.
  2. Workflow:
    1. Data Ingestion: Druid ingests large-scale data in real-time or batches from various sources.
    2. Data Querying: Superset uses Druid’s fast query capabilities to retrieve data for visualizations.
    3. Visualization: Superset presents the data in user-friendly dashboards, allowing users to explore and analyze it.

Key Advantages of Using Them Together

  1. Speed: Druid’s real-time and high-speed querying capabilities complement Superset’s visualization features, enabling near real-time analytics.
  2. Scalability: The combination is suitable for high-throughput environments where large-scale data analysis is needed.
  3. Flexibility: Supports a wide range of analytics use cases, from real-time monitoring to historical data analysis.

Use Cases of Apache Druid and Apache Superset

  1. Fraud Detection: Analyzing transactional data for anomalies.
  2. Real-Time Analytics: Monitoring website traffic, application performance, or IoT devices in real-time.
  3. Operational Intelligence: Tracking business metrics like revenue, sales, or customer behaviour.
  4. Clickstream Analysis: Analyzing user interactions and behaviour on websites and apps.
  5. Log and Event Analytics: Parsing and querying log files or event streams.

Importance of Integrating Apache Druid with Apache Superset

Integrating Apache Druid with Apache Superset opens up a lot of possibilities for organizations that need high-scale data analysis and interactive visualization. Here’s why this combination is so powerful:

  1. High-Performance Analytics:

Druid’s Speed: Apache Druid is designed for low-latency queries and fast aggregations over large datasets. These quick responses mean Superset can deliver responsive, production-quality dashboards and visualizations.

Real-Time Data: Druid supports real-time data ingestion, allowing Superset dashboards to visualise near-live data streams, making it suitable for monitoring and operational analytics.

  2. Advanced Query Capabilities:

Complex Aggregations: Druid’s ability to handle complex OLAP-style aggregations and filtering ensures that Superset can provide rich and detailed visualizations.

Native Time-Series Support: Druid’s in-built time-series functionalities allow seamless integration for Superset’s time-series charts and analysis.

  3. Scalability:

Designed for Big Data: Druid is highly scalable and can manage billions of rows in an efficient manner. Together with Superset, it helps dashboards scale easily for high volumes of data.

Distributed Architecture: Being by nature distributed, Druid ensures reliable performance of Superset even when it’s under high loads or many concurrent user sessions.

  4. Simplified Data Exploration:

Columnar Storage: Druid’s columnar storage format greatly accelerates exploratory data analysis in Superset, since only the columns a query needs are scanned.

Pre-Aggregated Data: Superset can take advantage of Druid’s roll-up and pre-aggregation mechanisms to reduce query cost and improve visualization performance.

  5. Flawless Integration:

Built-In Connector: Superset integrates natively with Druid, so there is little hassle in setting it up. Users can query Druid directly and curate dashboards with minimal configuration, and non-technical users can also run SQL-like queries (Druid SQL) within the interface.

  6. Advanced Features:

Granular Access Control: When using Druid with Superset, it is possible to enforce row-level security and user-based data access, ensuring that sensitive data is only accessible to authorized users.

Segment-Level Optimization: Druid’s segment-based storage enables Superset to optimize queries for specific data slices.

  7. Cost Efficiency:

Efficient Resource Utilization: Druid’s architecture minimizes storage and compute costs while providing high-speed analytics, ensuring that Superset dashboards run efficiently.

  8. Real-Time Monitoring and BI:

Operational Dashboards: In the case of monitoring systems, website traffic, or application logs, Druid’s real-time capabilities make the dashboards built in Superset very effective for real-time operational intelligence.

Streaming Data Support: Druid can ingest from stream-processing data sources; for example, Kafka ingestion is supported. This enables Superset to visualize live data with minimal delay.

Prerequisites

  1. The Superset Docker Container or instance should be up and running.
  2. Installation of pydruid in Superset Container or instance.

Installing PyDruid

1. Enable Druid-Specific SQL Queries

PyDruid is the Python library that serves as a client for Apache Druid. It allows Superset to:

  1. Run Druid SQL queries as efficiently as possible.
  2. Access Druid’s powerful querying capabilities directly from Superset’s SQL Lab and dashboards.

Without this client, Superset cannot communicate with the Druid database, which results in partial or missing functionality.

2. Native Connection Support

Superset is largely dependent upon PyDruid for:

  1. Seamless connections to Apache Druid.
  2. Translating Superset queries into Druid-native SQL or JSON that the Druid engine understands.

3. Efficient Data Fetching

PyDruid optimizes data fetching from Druid by:

  1. Managing Druid’s query API for aggregation, filtering, and time-series analysis.
  2. Reducing latency through better handling of connections and responses.

4. Advanced Visualization Features

Many of Superset’s advanced visualization features (like time-series charts and heat maps) depend on Druid capabilities that PyDruid exposes, including Druid’s granularity settings, time aggregations, and access to complex metrics pre-aggregated in Druid.

5. Integration with Superset’s Druid Connector

The Druid connector in Superset relies on PyDruid as a dependency to:

  1. Register Druid as a data source.
  2. Allow Superset users to query and visually explore Druid datasets directly.

Without PyDruid, the Druid connector in Superset will not work, limiting your ability to integrate Druid as a backend for analytics.

6. Data Discovery

PyDruid enables Superset to fetch metadata from Druid, including schemas, tables, and columns, and to explore Druid data ad hoc in SQL Lab.
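As a quick way to see what PyDruid provides outside Superset, the sketch below uses PyDruid's DB-API interface to list a few Druid tables. This is illustrative only; the host, port, and scheme are assumptions for a deployment behind HTTPS and should be adjusted to your environment:

from pydruid.db import connect

# Hypothetical connection details; adjust host, port, and scheme to your Druid deployment.
conn = connect(host="domain.name.com", port=443, path="/druid/v2/sql/", scheme="https")
cursor = conn.cursor()

# Listing tables from INFORMATION_SCHEMA confirms the SQL endpoint is reachable.
cursor.execute("SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES LIMIT 5")
for row in cursor.fetchall():
    print(row)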

Apache Druid as a Data Source

  1. Make sure your Druid database application is set up and reachable on its configured ports.

2. Install pydruid by logging into the container.

$ docker exec -it --user root <superset_container_name> /bin/bash

$ pip install pydruid

3. Once pydruid is installed, navigate to Settings and click on Database Connections.

4. Click on Add Database and select Apache Druid as shown in the screenshot below:

5. Here we have to enter the SQLAlchemy URI. To learn more about the SQLAlchemy URI rules, please follow the link below.

https://superset.apache.org/docs/configuration/databases/

6. For Druid, the SQLAlchemy URI will be in the format below:

druid://<User>:<password>@<Host>:<Port-default-9088>/druid/v2/sql

Note:

In the URI template druid://<User>:<password>@<Host>:<Port-default-9088>/druid/v2/sql, the port placeholder shows 9088, but in most installations the default Druid ports are 8888 or 8081.

If your Druid installation runs over plain HTTP without a secure connection, the plain druid:// form works without any issue.

Ex: druid://localhost:8888/druid/v2/sql/

If you have a secure connection, druid://localhost:8888/druid/v2/sql/ will normally fail with an error. The next step is to use the domain name of the secured Druid UI instead of localhost, for example druid://domain.name.com:8888/druid/v2/sql/, but this will also not work.

According to other sources, variants such as

druid+https://localhost:8888/druid/v2/sql/

OR

druid+https://localhost:8888/druid/v2/sql/?ssl_verify=false

are sometimes suggested, but neither of them works either.

The reason is that once a secure connection is enabled, HTTPS is usually served on port 443 by a proxy in front of the application’s default ports. So if we enter the SQLAlchemy URI as

druid+https://domain.name.com:443/druid/v2/sql/?ssl=true

Superset understands that Druid is behind a secured connection and running on the HTTPS port.

7. Since we have enabled SSL for our Druid application, we enter the SQLAlchemy URI below and click on Test Connection:

druid+https://domain.name.com:443/druid/v2/sql/?ssl=true

8. Once the test connection succeeds, Apache Superset is integrated with Apache Druid and ready to work with.
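If Test Connection reports an error, it can help to verify the same URI outside Superset. Below is a minimal sketch, assuming pydruid and a compatible SQLAlchemy version are installed in the same environment and reusing the example domain above (replace it with your own):

from sqlalchemy import create_engine, text

# The same SQLAlchemy URI entered in Superset; domain and SSL settings are environment-specific.
engine = create_engine("druid+https://domain.name.com:443/druid/v2/sql/?ssl=true")

with engine.connect() as conn:
    # INFORMATION_SCHEMA is exposed through Druid SQL, so this confirms end-to-end connectivity.
    result = conn.execute(text("SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES LIMIT 5"))
    for row in result:
        print(row)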

In the upcoming articles, we will talk about data ingestion and the visualization of Apache Druid data in Apache Superset.

Apache Druid FAQ

Apache Druid is used for real-time analytics and fast querying of large-scale datasets. It's particularly suitable for time-series data, clickstream analytics, operational intelligence, and real-time monitoring.

Druid supports:

  1. Streaming ingestion: Data from Kafka, Kinesis, etc., is ingested in real-time.
  2. Batch ingestion: Data from Hadoop, S3, or other static sources is ingested in batches.

Druid uses:

  1. Columnar storage: Optimizes storage for analytic queries.
  2. Advanced indexing: Bitmap and inverted indexes speed up filtering and aggregation.
  3. Distributed architecture: Spreads query load across multiple nodes.

No. Druid is optimized for OLAP (Online Analytical Processing) and is not suitable for transactional use cases.

  1. Data is segmented into immutable chunks and stored in Deep Storage (e.g., S3, HDFS).
  2. Historical nodes cache frequently accessed segments for faster querying.
  1. SQL: Provides a familiar interface for querying.
  2. Native JSON-based query language: Offers more flexibility and advanced features.
  1. Broker: Routes queries to appropriate nodes.
  2. Historical nodes: Serve immutable data.
  3. MiddleManager: Handles ingestion tasks.
  4. Overlord: Coordinates ingestion processes.
  5. Coordinator: Manages data availability and balancing.

Druid integrates with tools like:

  1. Apache Superset
  2. Tableau
  3. Grafana
  4. Stream ingestion tools (Kafka, Kinesis)
  5. Batch processing frameworks (Hadoop, Spark)

Yes, it’s open-source and distributed under the Apache License 2.0.

Alternatives include:

  1. ClickHouse
  2. Elasticsearch
  3. BigQuery
  4. Snowflake

Apache Superset FAQ

Apache Superset is an open-source data visualization and exploration platform. It allows users to create dashboards, charts, and interactive visualizations from various data sources.

Superset offers:

  1. Line charts, bar charts, pie charts.
  2. Geographic visualizations like maps.
  3. Pivot tables, histograms, and heatmaps.
  4. Custom visualizations via plugins.

Superset supports:

  1. Relational databases (PostgreSQL, MySQL, SQLite).
  2. Big data engines (Hive, Presto, Trino, BigQuery).
  3. Analytics databases (Apache Druid, ClickHouse).

Superset is a robust, open-source alternative to commercial BI tools. While it’s feature-rich, it may require more setup and development effort compared to proprietary solutions like Tableau.

Superset primarily supports SQL for querying data. Its SQL Lab provides an environment for writing, testing, and running queries.

Superset supports:

  1. Database-based login.
  2. OAuth, LDAP, OpenID Connect, and other SSO mechanisms.
  3. Role-based access control (RBAC) for managing permissions.

Superset can be installed using:

  1. Docker: For containerized deployments.
  2. Python pip: For a native installation in a Python environment.
  3. Helm: For Kubernetes-based setups.

Yes. By integrating Superset with real-time analytics platforms like Apache Druid or streaming data sources, you can build dashboards that reflect live data.

  1. Requires knowledge of SQL for advanced queries.
  2. May not have the polish or advanced features of proprietary BI tools like Tableau.
  3. Customization and extensibility require some development effort.

Superset performance depends on:

  1. The efficiency of the connected database or analytics engine.
  2. The complexity of queries and visualizations.
  3. Hardware resources and configuration.

Would you like further details on specific features or help setting up either tool? Reach out to us

Watch the Apache Blog Series


Exploring Apache Druid: A High-Performance Real-Time Analytics Database

Introduction

Apache Druid is a distributed, column-oriented, real-time analytics database designed for fast, scalable, and interactive analytics on large datasets. It excels in use cases requiring real-time data ingestion, high-performance queries, and low-latency analytics. 

Druid was originally developed to power interactive data applications at Metamarkets and has since become a widely adopted open-source solution for real-time analytics, particularly in industries such as ad tech, fintech, and IoT.

It supports batch and real-time data ingestion, enabling users to perform fast ad-hoc queries, power dashboards, and interactive data exploration.

In big data and real-time analytics, having the right tools to process and analyze large volumes of data swiftly is essential. Apache Druid, an open-source, high-performance, column-oriented distributed data store, has emerged as a leading solution for real-time analytics and OLAP (online analytical processing) workloads. In this blog post, we’ll delve into what Apache Druid is, its key features, and how it can revolutionize your data analytics capabilities. Refer to the official documentation for more information.

Apache Druid

Apache Druid is a high-performance, real-time analytics database designed for fast slice-and-dice analytics on large datasets. It was created by Metamarkets (now part of Snap Inc.) and is now an Apache Software Foundation project. Druid is built to handle both batch and streaming data, making it ideal for use cases that require real-time insights and low-latency queries.

Key Features of Apache Druid:

Real-Time Data Ingestion

Druid excels at real-time data ingestion, allowing data to be ingested from various sources such as Kafka, Kinesis, and traditional batch files. It supports real-time indexing, enabling immediate query capabilities on incoming data with low latency.

High-Performance Query Engine

Druid’s query engine is optimized for fast, interactive querying. It supports a wide range of query types, including Time-series, TopN, GroupBy, and search queries. Druid’s columnar storage format and advanced indexing techniques, such as bitmap indexes and compressed column stores, ensure that queries are executed efficiently.

Scalable and Distributed Architecture

Druid’s architecture is designed to scale horizontally. It can be deployed on a cluster of commodity hardware, with data distributed across multiple nodes to ensure high availability and fault tolerance. This scalability makes Druid suitable for handling large datasets and high query loads.

Flexible Data Model

Druid’s flexible data model allows for the ingestion of semi-structured and structured data. It supports schema-on-read, enabling dynamic column discovery and flexibility in handling varying data formats. This flexibility simplifies the integration of new data sources and evolving data schemas.

Built-In Data Management

Druid includes built-in features for data management, such as automatic data partitioning, data retention policies, and compaction tasks. These features help maintain optimal query performance and storage efficiency as data volumes grow.

Extensive Integration Capabilities

Druid integrates seamlessly with various data ingestion and processing frameworks, including Apache Kafka, Apache Storm, and Apache Flink. It also supports integration with visualization tools like Apache Superset, Tableau, and Grafana, enabling users to build comprehensive analytics solutions.

Use Cases of Apache Druid

Real-Time Analytics

Druid is used in real-time analytics applications where the ability to ingest and query data in near real-time is critical. This includes monitoring applications, fraud detection, and customer behavior tracking.

Ad-Tech and Marketing Analytics

Druid’s ability to handle high-throughput data ingestion and fast queries makes it a popular choice in the ad tech and marketing industries. It can track user events, clicks, impressions, and conversion rates in real time to optimize campaigns.

IoT Data and Sensor Analytics

IoT applications produce time-series data at high volume. Druid’s architecture is optimized for time-series data analysis, making it ideal for analyzing IoT sensor data, device telemetry, and real-time event tracking.

Operational Dashboards

Druid is often used to power operational dashboards that provide insights into infrastructure, systems, or applications. The low-latency query capabilities ensure that dashboards reflect real-time data without delay.

Clickstream Analysis

Organizations leverage Druid to analyze user clickstream data on websites and applications, allowing for in-depth analysis of user interactions, preferences, and behaviors in real time.

The Architecture of Apache Druid

Apache Druid follows a distributed, microservice-based architecture. The architecture allows for scaling different components based on the system’s needs.

The main components are:

Coordinator and Overlord Nodes

  1. Coordinator Node: Manages data availability, balancing the distribution of data across the cluster, and overseeing segment management (segments are the basic units of storage in Druid).
  2. Overlord Node: Responsible for managing ingestion tasks. It works with the middle managers to schedule and execute data ingestion tasks, ensuring that data is ingested properly into the system.

Historical Nodes

Historical nodes store immutable segments of historical data. When queries are executed, historical nodes serve data from the disk, which allows for low-latency and high-throughput queries.

MiddleManager Nodes

MiddleManager nodes handle real-time ingestion tasks. They manage tasks such as ingesting data from real-time streams (like Kafka), transforming it, and pushing the processed data to historical nodes after it has persisted.

Broker Nodes

The broker nodes route incoming queries to the appropriate historical or real-time nodes and aggregate the results. They act as the query routers and perform query federation across the Druid cluster.

Query Nodes

Query nodes are responsible for receiving, routing, and processing queries. They can handle a variety of query types, including SQL, and route these queries to other nodes for execution.

Deep Storage

Druid relies on an external deep storage system (such as Amazon S3, Google Cloud Storage, or HDFS) to store segments of data permanently. The historical nodes pull these segments from deep storage when they need to serve data.

Metadata Storage

Druid uses an external relational database (typically PostgreSQL or MySQL) to store metadata about the data, including segment information, task states, and configuration settings.

Advantages of Apache Druid

  1. Sub-Second Query Latency: Optimized for high-speed data queries, making it perfect for real-time dashboards.
  2. Scalability: Easily scales to handle petabytes of data.
  3. Flexible Data Ingestion: Supports both batch and real-time data ingestion from multiple sources like Kafka, HDFS, and Amazon S3.
  4. Column-Oriented Storage: Efficient data storage with high compression ratios and fast retrieval of specific columns.
  5. SQL Support: Familiar SQL-like querying capabilities for easy data analysis.
  6. High Availability: Fault-tolerant and highly available due to data replication across nodes.

Getting Started with Apache Druid

Installation and Setup

Setting up Apache Druid involves configuring a cluster with different node types, each responsible for specific tasks:

  1. Master Nodes: Oversee coordination, metadata management, and data distribution.
  2. Data Nodes: Handle data storage, ingestion, and querying.
  3. Query Nodes: Manage query routing and processing.

You can install Druid using a package manager, Docker, or by downloading and extracting the binary distribution. Here’s a brief overview of setting up Druid using Docker:

  1. Download the Docker Compose File:
    $curl -O https://raw.githubusercontent.com/apache/druid/master/examples/docker-compose/docker-compose.yml
  2. Start the Druid Cluster: $ docker-compose up
  3. Access the Druid Console: Open your web browser and navigate to http://localhost:8888 to access the Druid console.

Ingesting Data

To ingest data into Druid, you need to define an ingestion spec that outlines the data source, input format, and parsing rules. Here’s an example of a simple ingestion spec for a CSV file:

JSON Code

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/csv",
        "filter": "*.csv"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      }
    },
    "dataSchema": {
      "dataSource": "example_data",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["column1", "column2", "column3"]
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}

Submit the ingestion spec through the Druid console or via the Druid API to start the data ingestion process.
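For the API route, the sketch below shows one way to POST the spec to Druid's task endpoint. It assumes the spec above is saved as ingestion_spec.json (a hypothetical filename) and that the Overlord task API is reachable through the router at localhost:8888; adjust both to your deployment:

import json
import requests

# Hypothetical endpoint; adjust host and port to your Druid deployment.
DRUID_TASK_ENDPOINT = "http://localhost:8888/druid/indexer/v1/task"

with open("ingestion_spec.json") as spec_file:
    spec = json.load(spec_file)

response = requests.post(DRUID_TASK_ENDPOINT, json=spec, timeout=30)
response.raise_for_status()
print("Submitted ingestion task:", response.json())  # typically returns the new task ID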

Querying Data

Once your data is ingested, you can query it using Druid’s native query language or SQL. Here’s an example of a simple SQL query to retrieve data from the example_data data source:

SELECT __time, column1, column2, column3
FROM example_data
WHERE __time BETWEEN '2023-01-01' AND '2023-01-31'

Use the Druid console or connect to Druid from your preferred BI tool to execute queries and visualize data.
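The same SQL can also be sent to Druid's SQL HTTP endpoint from a script. A minimal sketch, assuming the router is reachable at localhost:8888 (adjust to your deployment):

import requests

# Hypothetical router address; adjust to your deployment.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

sql = """
SELECT __time, column1, column2, column3
FROM example_data
WHERE __time BETWEEN '2023-01-01' AND '2023-01-31'
"""

# resultFormat "object" returns one JSON object per row.
response = requests.post(DRUID_SQL_URL, json={"query": sql, "resultFormat": "object"}, timeout=30)
response.raise_for_status()
for row in response.json():
    print(row)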

Conclusion

Apache Druid is a powerful, high-performance real-time analytics database that excels at handling large-scale data ingestion and querying. Its robust architecture, flexible data model, and extensive integration capabilities make it a versatile solution for a wide range of analytics use cases. Whether you need real-time insights, interactive queries, or scalable OLAP capabilities, Apache Druid provides the tools to unlock the full potential of your data. Druid has firmly established itself as a leading database for real-time, high-performance analytics: its unique combination of real-time data ingestion, sub-second query speeds, and scalability makes it a perfect choice for businesses that need to analyze vast amounts of time-series and event-driven data, and its adoption continues to grow across industries. Explore Apache Druid today and transform your data analytics landscape.

Need help transforming your real-time analytics with high-performance querying? Contact our experts today!

Watch the Apache Druid Blog Series

Stay tuned for the upcoming Apache Druid Blog Series:

  1. Why choose Apache Druid over Vertica
  2. Why choose Apache Druid over Snowflake
  3. Why choose Apache Druid over Google Big Query
  4. Integrating Apache Druid with Apache Superset for Realtime Analytics

Streamlining Apache HOP Workflow Management with Apache Airflow

Introduction

In our previous blog, we talked about Apache HOP in more detail. In case you missed it, refer to it here: https://analytics.axxonet.com/comparison-of-and-migrating-from-pdi-kettle-to-apache-hop/. As a continuation of the Apache HOP article series, here we touch upon how to integrate Apache Airflow and Apache HOP. In the fast-paced world of data engineering and data science, efficiently managing complex workflows is crucial. Apache Airflow, an open-source platform for programmatically authoring, scheduling, and monitoring workflows, has become a cornerstone in many data teams’ toolkits. This blog post explores what Apache Airflow is, its key features, and how you can leverage it to streamline and manage your Apache HOP workflows and pipelines.

Apache HOP

Apache HOP is an open-source data integration and orchestration platform. For more details refer to our previous blog here.

Apache Airflow

Apache Airflow is an open-source workflow orchestration tool originally developed by Airbnb. It allows you to define workflows as code, providing a dynamic, extensible platform to manage your data pipelines. Airflow’s rich features enable you to automate and monitor workflows efficiently, ensuring that data moves seamlessly through various processes and systems.

Use Cases:

  1. Data Pipelines: Orchestrating ETL jobs to extract data from sources, transform it, and load it into a data warehouse.
  2. Machine Learning Pipelines: Scheduling ML model training, batch processing, and deployment workflows.
  3. Task Automation: Running repetitive tasks, like backups or sending reports.

DAG (Directed Acyclic Graph):

  1. A DAG represents the workflow in Airflow. It defines a collection of tasks and their dependencies, ensuring that tasks are executed in the correct order.
  2. DAGs are written in Python and allow you to define the tasks and how they depend on each other.

Operators:

  • Operators define a single task in a DAG. There are several built-in operators, such as:
    1. BashOperator: Runs a bash command.
    2. PythonOperator: Runs Python code.
    3. SqlOperator: Executes SQL commands.
    4. HttpOperator: Makes HTTP requests.

Custom operators can also be created to meet specific needs.

Tasks:

  1. Tasks are the building blocks of a DAG. Each node in a DAG is a task that does a specific unit of work, such as executing a script or calling an API.
  2. Tasks are defined by operators and their dependencies are controlled by the DAG.

Schedulers:

  1. The scheduler is responsible for triggering tasks at the appropriate time, based on the schedule_interval defined in the DAG.
  2. It continuously monitors all DAGs and determines when to run the next task.

Executors:

  • The executor is the mechanism that runs the tasks. Airflow supports different types of executors:
    1. SequentialExecutor: Executes tasks one by one.
    2. LocalExecutor: Runs tasks in parallel on the local machine.
    3. CeleryExecutor: Distributes tasks across multiple worker machines.
    4. KubernetesExecutor: Runs tasks in a Kubernetes cluster.

Web UI:

  1. Airflow has a web-based UI that lets you monitor the status of DAGs, view logs, and check the status of each task in a DAG.
  2. It also provides tools to trigger, pause, or retry DAGs.

Key Features of Apache Airflow

Workflow as Code

Airflow uses Directed Acyclic Graphs (DAGs) to represent workflows. These DAGs are written in Python, allowing you to leverage the full power of a programming language to define complex workflows. This approach, known as “workflow as code,” promotes reusability, version control, and collaboration.

Dynamic Task Scheduling

Airflow’s scheduling capabilities are highly flexible. You can schedule tasks to run at specific intervals, handle dependencies, and manage task retries in case of failures. The scheduler executes tasks in a defined order, ensuring that dependencies are respected and workflows run smoothly.

Extensible Architecture

Airflow’s architecture is modular and extensible. It supports a wide range of operators (pre-defined tasks), sensors (waiting for external conditions), and hooks (interfacing with external systems). This extensibility allows you to integrate with virtually any system, including databases, cloud services, and APIs.

Robust Monitoring and Logging

Airflow provides comprehensive monitoring and logging capabilities. The web-based user interface (UI) offers real-time visibility into the status of your workflows, enabling you to monitor task progress, view logs, and troubleshoot issues. Additionally, Airflow can send alerts and notifications based on task outcomes.

Scalability and Reliability

Designed to scale, Airflow can handle workflows of any size. It supports distributed execution, allowing you to run tasks on multiple workers across different nodes. This scalability ensures that Airflow can grow with your organization’s needs, maintaining reliability even as workflows become more complex.

Getting Started with Apache Airflow

Installation using PIP

Setting up Apache Airflow is straightforward. You can install it using pip, Docker, or by deploying it on a cloud service. Here’s a brief overview of the installation process using pip:

1. Create a Virtual Environment (optional but recommended):

           python3 -m venv airflow_env

            source airflow_env/bin/activate

2. Install Apache Airflow: 

           pip install apache-airflow

3. Initialize the Database:

           airflow db init

4. Create a User:

           airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email [email protected]

5. Start the Web Server and Scheduler:

           airflow webserver --port 8080

           airflow scheduler

6. Access the Airflow UI: Open your web browser and go to http://localhost:8080.

Installation using Docker

Pull the docker image and run the container to access the Airflow web UI. Refer to the link for more details.

Creating Your First DAG

Airflow DAG Structure:

A DAG in Airflow is composed of three main parts:

  1. Imports: Necessary packages and operators.
  2. Default Arguments: Arguments that apply to all tasks within the DAG (such as retries, owner, start date).
  3. Task Definition: Define tasks using operators, and specify dependencies between them.

Scheduling:

Airflow allows you to define the schedule of a DAG using schedule_interval:

  1. @daily: Run once a day at midnight.
  2. @hourly: Run once every hour.
  3. @weekly: Run once a week at midnight on Sunday.
  4. Cron expressions, like "0 12 * * *", are also supported for more specific scheduling needs.
  5. Define the DAG: Create a Python file (e.g., run_lms_transaction.py) in the dags folder of your Airflow installation directory.

    Example

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

dag = DAG('example_dag', default_args=default_args, schedule_interval='@daily')

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

start >> end

2. Deploy the DAG: Save the DAG Python script in the DAGs folder (~/airflow/dags by default). Airflow will automatically detect and load the DAG.

3. Monitor the DAG: Access the Airflow UI, where you can view and manage the newly created DAG. Trigger the DAG manually or wait for it to run according to the defined schedule.

Calling the Apache HOP Pipelines/Workflows from Apache Airflow

In this example, we walk through how to integrate the Apache HOP with Apache Airflow. Here both the Apache Airflow and Apache HOP are running on two different independent docker containers. Apache HOP ETL Pipelines / Workflows are configured with a persistent volume storage strategy so that the DAG code can request execution from Airflow.  

Steps

  1. Define the DAG: Create a Python file (e.g., Stg_User_Details.py) in the dags folder of your Airflow installation directory.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.docker_operator import DockerOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from docker.types import Mount

default_args = {
    'owner': 'airflow',
    'description': 'Stg_User_details',
    'depend_on_past': False,
    'start_date': datetime(2022, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('Stg_User_details', default_args=default_args, schedule_interval='0 10 * * *',
         catchup=False, is_paused_upon_creation=False) as dag:

    start_dag = DummyOperator(
        task_id='start_dag'
    )

    end_dag = DummyOperator(
        task_id='end_dag'
    )

    hop = DockerOperator(
        task_id='Stg_User_details',
        # use the Apache Hop Docker image. Add your tags here in the default apache/hop: syntax
        image='test',
        api_version='auto',
        auto_remove=True,
        environment={
            'HOP_RUN_PARAMETERS': 'INPUT_DIR=',
            'HOP_LOG_LEVEL': 'TRACE',
            'HOP_FILE_PATH': '/opt/hop/config/projects/default/stg_user_details_test.hpl',
            'HOP_PROJECT_DIRECTORY': '/opt/hop/config/projects/',
            'HOP_PROJECT_NAME': 'ISON_Project',
            'HOP_ENVIRONMENT_NAME': 'ISON_Env',
            'HOP_ENVIRONMENT_CONFIG_FILE_NAME_PATHS': '/opt/hop/config/projects/default/project-config.json',
            'HOP_RUN_CONFIG': 'local',
        },
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge",
        force_pull=False,
        mount_tmp_dir=False
    )

    start_dag >> hop >> end_dag

Note: For reference purposes only.

2. Deploy the DAG: Save the file in the dags folder. Airflow will automatically detect and load the DAG.

After successful deployment, we should see the new "Stg_User_Details" DAG listed in the Active tab and the All tab of the Airflow portal, as shown in the screenshot above.

3. Run the DAG: We can trigger pipelines or workflows from Airflow by clicking the Trigger DAG option in the Airflow application, as shown below.

4. Monitor the DAG: Access the Airflow UI, where you can view and manage the newly created DAG. Trigger the DAG manually or wait for it to run according to the defined schedule.

After successful execution, we should see the success status along with the execution history and log details for the "Stg_User_Details" DAG in the Airflow portal, as shown in the screenshot above.

Managing and Scaling Workflows

  1. Use Operators and Sensors: Leverage Airflow’s extensive library of operators and sensors to create tasks that interact with various systems and handle complex logic.
  2. Implement Task Dependencies: Define task dependencies using the >> and << operators to ensure tasks run in the correct order (see the sketch after this list).
  3. Optimize Performance: Monitor task performance through the Airflow UI and logs. Adjust task configurations and parallelism settings to optimize workflow execution.
  4. Scale Out: Configure Airflow to run in a distributed mode by adding more worker nodes, ensuring that the system can handle increasing workload efficiently.
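As a minimal sketch of the dependency operators from item 2, assuming Airflow 2.x import paths and purely illustrative task names, a fan-out/fan-in pattern looks like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("dependency_example", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform_a = BashOperator(task_id="transform_a", bash_command="echo transform A")
    transform_b = BashOperator(task_id="transform_b", bash_command="echo transform B")
    load = BashOperator(task_id="load", bash_command="echo load")

    # extract fans out to both transforms
    extract >> [transform_a, transform_b]
    # load runs only after both transforms finish (equivalent to [transform_a, transform_b] >> load)
    load << [transform_a, transform_b]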

Conclusion

Apache Airflow is a powerful and versatile tool for managing workflows and automating complex data pipelines. Its “workflow as code” approach, coupled with robust scheduling, monitoring, and scalability features, makes it an essential tool for data engineers and data scientists. By adopting Airflow, you can streamline your workflow management, improve collaboration, and ensure that your data processes are efficient and reliable. Explore Apache Airflow today and discover how it can transform your data engineering workflows.

Streamline your Apache HOP Workflow Management With Apache Airflow through our team of experts.

Upcoming Apache HOP Blog Series

Stay tuned for the upcoming Apache HOP Blog Series:

  1. Migrating from Pentaho ETL to Apache Hop
  2. Integrating Apache Hop with an Apache Superset
  3. Comparison of Pentaho ETL and Apache Hop

Optimizing DevSecOps with SonarQube and DefectDojo Integration

In the evolving landscape of software development, the importance of integrating security practices into every phase of the software development lifecycle (SDLC) cannot be overstated.

This is where DevSecOps comes into play: a philosophy that emphasizes the need to incorporate security as a shared responsibility throughout the entire development process, rather than treating it as an afterthought.

By shifting security left, organizations can detect vulnerabilities earlier, reduce risks, and improve compliance with industry standards. DevSecOps isn’t just about tools; it’s about creating a culture where developers, security teams, and operations work together seamlessly. However, having the right tools is essential to implementing an effective DevSecOps strategy.

In this article, we’ll delve into the setup of DefectDojo, a powerful open-source vulnerability management tool, and demonstrate how to integrate it with SonarQube to automate code analysis as part of the CI/CD process and capture the vulnerabilities identified during testing so they can be tracked to closure.

Why DevSecOps and What Tools to Use?

DevSecOps aims to automate security processes, ensuring that security is built into the CI/CD pipeline without slowing down the development process. Here are some key tools that can be used in a DevSecOps environment:

SonarQube: Static Code Analysis

SonarQube inspects code for bugs, code smells, and vulnerabilities, providing developers with real-time feedback on code quality. It helps identify and fix issues early in the development process, ensuring cleaner and more maintainable code.

Dependency Check: Dependency Vulnerability Scanning

Dependency Check scans project dependencies for known vulnerabilities, alerting teams to potential risks from third-party libraries. This tool is crucial for managing and mitigating security risks associated with external code components.

Trivy: Container Image Scanning

Trivy examines Docker images for vulnerabilities in both OS packages and application dependencies. By integrating Trivy into the CI/CD pipeline, teams ensure that the container images deployed are secure, reducing the risk of introducing vulnerabilities into production.

ZAP: Dynamic Application Security Testing

ZAP performs dynamic security testing on running applications to uncover vulnerabilities that may not be visible through static analysis alone. It helps identify security issues in a live environment, ensuring that the application is secure from real-world attacks.

Defect Dojo: Centralized Security Management

Defect Dojo collects and centralizes security findings from SonarQube, Dependency Check, Trivy, and ZAP, providing a unified platform for managing and tracking vulnerabilities. It streamlines the process of addressing security issues, offering insights and oversight to help prioritize and resolve them efficiently.

DefectDojo ties all these tools together by acting as a central hub for vulnerability management. It allows security teams to track, manage, and remediate vulnerabilities efficiently, making it an essential tool in the DevSecOps toolkit.

Installation and Setup of DefectDojo

Before integrating DefectDojo with your security tools, you need to install and configure it. Here’s how you can do that:

Prerequisites: 

1. Docker
2. Jenkins

Step 1: Download the official DefectDojo repository from GitHub using the link.

Step 2: Once downloaded, navigate to the path where you downloaded the repository and run the command below.

docker-compose up -d --build

Step 3: DefectDojo will start running and you will be able to access it at http://localhost:8080/. Enter the default credentials that you have set and log in to the application.

Step 4: Create a product by navigating as shown below.

Step 5: Click on the created product name and create an engagement by navigating as shown below.

Step 6: Once the engagement is created, click on its name and create a test from it.

Step 7: Copy the test number and enter it in the post stage of the pipeline.

Step 8: Build a pipeline in Jenkins and integrate it with DefectDojo in order to generate the vulnerability report.

pipeline {
    agent {
        label 'Agent-1'
    }
    environment {
        SONAR_TOKEN = credentials('sonarqube')
    }
    stages {
        stage('Checkout') {
            steps {
                checkout([$class: 'GitSCM', branches: [[name: 'dev']],
                    userRemoteConfigs: [[credentialsId: 'parameterized credentials', url: 'repositoryURL']]
                ])
            }
        }
        stage('SonarQube Analysis') {
            steps {
                script {
                    withSonarQubeEnv('Sonarqube') {
                        def scannerHome = tool name: 'sonar-scanner', type: 'hudson.plugins.sonar.SonarRunnerInstallation'
                        def scannerCmd = "${scannerHome}/bin/sonar-scanner"
                        sh "${scannerCmd} -Dsonar.projectKey=ProjectName -Dsonar.sources=./ -Dsonar.host.url=Sonarqubelink -Dsonar.login=${SONAR_TOKEN} -Dsonar.java.binaries=./"
                    }
                }
            }
        }
        // Other stages...
        stage('Generate SonarQube JSON Report') {
            steps {
                script {
                    def SONAR_REPORT_FILE = "/path/for/.jsonfile"
                    sh """
                    curl -u ${SONAR_TOKEN}: \
                        "Sonarqube URL" \
                        -o ${SONAR_REPORT_FILE}
                    """
                }
            }
        }
    }
    post {
        always {
            withCredentials([string(credentialsId: 'Defect_Dojo_Api_Key', variable: 'API_KEY')]) {
                script {
                    def defectDojoUrl = 'Enter the defect dojo URL'  // Replace with your DefectDojo URL
                    def testId = '14'  // Replace with the correct test ID
                    def scanType = 'SonarQube Scan'
                    def tags = 'Enter the tag name'
                    def SONAR_REPORT_FILE = "/path/where/.json/file is present"
                    sh """
                    curl -X POST \\
                      '${defectDojoUrl}' \\
                      -H 'accept: application/json' \\
                      -H 'Authorization: Token ${API_KEY}' \\
                      -H 'Content-Type: multipart/form-data' \\
                      -F 'test=${testId}' \\
                      -F 'file=@${SONAR_REPORT_FILE};type=application/json' \\
                      -F 'scan_type=${scanType}' \\
                      -F 'tags=${tags}'
                    """
                }
            }
        }
    }
}

The entire script is wrapped in a pipeline block, defining the CI/CD pipeline in Jenkins.

agent: Specifies the Jenkins agent (node) where the pipeline will execute. The label Agent-1 indicates that the pipeline should run on a node with that label.

environment: Defines environment variables for the pipeline.

SONAR_TOKEN: Retrieves the SonarQube authentication token from Jenkins credentials using the ID ‘sonarqube’. This token is needed to authenticate with the SonarQube server during analysis.

The pipeline includes several stages, each performing a specific task.

stage(‘Checkout’): The first stage checks out the code from the Git repository.

checkout: This is a Jenkins step that uses the GitSCM plugin to pull code from the specified branch.

branches: Indicates which branch (dev) to checkout.

userRemoteConfigs: Specifies the Git repository’s URL and the credentials ID (parameterized credentials) used to access the repository.

stage(‘SonarQube Analysis’): This stage runs a SonarQube analysis on the checked-out code.

withSonarQubeEnv(‘Sonarqube’): Sets up the SonarQube environment using the SonarQube server named Sonarqube, which is configured in Jenkins.

tool name: Locates the SonarQube scanner tool installed on the Jenkins agent.

sh: Executes the SonarQube scanner command with the following parameters:

-Dsonar.projectKey=ProjectName: Specifies the project key in SonarQube.

-Dsonar.sources=./: Specifies the source directory for the analysis (in this case, the root directory).

-Dsonar.host.url=Sonarqubelink: Specifies the SonarQube server URL.

-Dsonar.login=${SONAR_TOKEN}: Uses the SonarQube token for authentication.

-Dsonar.java.binaries=./: Points to the location of Java binaries (if applicable) for analysis.

stage(‘Generate SonarQube JSON Report’): This stage generates a JSON report of the SonarQube analysis.

SONAR_REPORT_FILE: Defines the path where the JSON report will be saved.

sh: Runs a curl command to retrieve the SonarQube issues data in JSON format:

  1. -u ${SONAR_TOKEN}:: Uses the SonarQube token for authentication.
  2. “Sonarqube URL”: Specifies the API endpoint to fetch the issues from SonarQube.
  3. -o ${SONAR_REPORT_FILE}: Saves the JSON response to the specified file path.
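The API endpoint itself is left as a placeholder in the pipeline ("Sonarqube URL"). As a rough illustration, it would typically be SonarQube's issues search endpoint; the host name, project key, page size, and output path below are assumptions, so adjust them to your own setup:

# Illustrative only: export a project's issues as JSON (all values are placeholders)
SONAR_HOST="https://sonarqube.example.com"
SONAR_REPORT_FILE="reports/sonarqube-issues.json"

curl -u "${SONAR_TOKEN}:" \
  "${SONAR_HOST}/api/issues/search?componentKeys=ProjectName&ps=500" \
  -o "${SONAR_REPORT_FILE}"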

post: Defines actions to perform after the pipeline stages are complete. The always block ensures that the following steps are executed regardless of the pipeline’s success or failure.

withCredentials: Securely retrieves the DefectDojo API key from Jenkins credentials using the ID ‘Defect_Dojo_Api_Key’.

script: Contains the script block where the DefectDojo integration happens.

defectDojoUrl: The URL of the DefectDojo API endpoint for reimporting scans.

testId: The specific test ID in DefectDojo where the SonarQube report will be uploaded.

scanType: Indicates the type of scan (in this case, SonarQube Scan).

tags: Tags the scan in DefectDojo.

SONAR_REPORT_FILE: Points to the JSON file generated earlier.

sh: Runs a curl command to POST the SonarQube JSON report to DefectDojo:

-X POST: Specifies that this is a POST request.

-H ‘accept: application/json’: Indicates that the response should be in JSON format.

-H ‘Authorization: Token ${API_KEY}’: Authenticates with DefectDojo using the API key.

-F ‘test=${testId}’: Specifies the test ID in DefectDojo.

-F ‘file=@${SONAR_REPORT_FILE};type=application/json’: Uploads the JSON file as part of the request.

-F ‘scan_type=${scanType}’: Indicates the type of scan being uploaded.

-F ‘tags=${tags}’: Adds any specified tags to the scan in DefectDojo.
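As a standalone illustration of this upload, the snippet below assumes DefectDojo's v2 reimport-scan endpoint and uses placeholder values for the host, test ID, report path, and tag; substitute your own values and verify the endpoint against your DefectDojo version:

# Illustrative only: push the SonarQube JSON report into an existing DefectDojo test
DEFECTDOJO_URL="https://defectdojo.example.com/api/v2/reimport-scan/"   # assumed endpoint
curl -X POST "${DEFECTDOJO_URL}" \
  -H "accept: application/json" \
  -H "Authorization: Token ${API_KEY}" \
  -F "test=14" \
  -F "file=@reports/sonarqube-issues.json;type=application/json" \
  -F "scan_type=SonarQube Scan" \
  -F "tags=sonarqube"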

Step 9: Verify the Integration

After the pipeline executes, the vulnerability report generated by SonarQube will be available in DefectDojo under the corresponding engagement. You can now track, manage, and remediate these vulnerabilities using DefectDojo’s robust features.

Conclusion

Integrating security tools like SonarQube with DefectDojo is a critical step in building a secure DevSecOps pipeline. By automating vulnerability management and integrating it directly into your CI/CD pipeline, you can ensure that security remains a top priority throughout the development process.

In this article, we focused on setting up DefectDojo and integrating it with SonarQube. In future articles, we will cover the integration of additional tools like OWASP ZAP, Trivy, and Dependency-Check. Stay tuned to further enhance your DevSecOps practices.

Frequently Asked Questions (FAQs)

DefectDojo is an open-source application security vulnerability management tool. It helps organizations in managing, tracking, and remediating security vulnerabilities efficiently. It integrates with a variety of security tools for automating workflows and provides continuous security monitoring throughout the SDLC.

The prerequisites for setting up DefectDojo are Docker, to run DefectDojo inside a container, and Jenkins, for integrating pipelines in CI/CD.

To install DefectDojo:

  1. Clone or download the official DefectDojo repository from GitHub.
  2. Navigate to the path where you downloaded the repository.
  3. Run the docker-compose up -d --build command to build and start the DefectDojo container and its dependencies.

Once DefectDojo is running, you can access it via a web browser at http://localhost:8080/. Log in with the credentials set during installation, and you can start managing security vulnerabilities by creating products, engagements, and tests.

Yes, DefectDojo integrates seamlessly with other security tools like SonarQube, OWASP ZAP, Trivy, and Dependency-Check. It allows the centralization of vulnerability management across multiple tools, making it an indispensable part of the DevSecOps pipeline.

Get in touch with us and elevate your software security and performance


Unlocking Data Insights with Apache Superset

Introduction

In today’s data-driven world, having the right tools to analyze and visualize data is crucial for making informed decisions. With vast amounts of data generated daily, visualizing it becomes essential for uncovering patterns, trends, and insights. One of the standout solutions in the open-source landscape is Apache Superset, an open-source data exploration and visualization platform that has emerged as a powerful tool for modern data analytics. This user-friendly platform enables users to create, explore, and share interactive data visualizations and dashboards. Whether you’re a data scientist, analyst, or business intelligence professional, Apache Superset can significantly enhance your data analysis capabilities. In this blog post, we’ll dive into what Apache Superset is, its key features, architecture, installation process, and use cases, and how you can leverage it to unlock valuable insights from your data.

Apache Superset

Apache Superset is an open-source data exploration and visualization platform originally developed at Airbnb and later donated to the Apache Software Foundation. It is now a top-level Apache project, widely adopted across industries for data analytics and visualization. Superset is designed to be a modern, enterprise-ready business intelligence web application that allows users to explore, analyze, and visualize large datasets. Its intuitive interface lets users quickly create beautiful, interactive visualizations and dashboards from various data sources without needing extensive programming knowledge.

Superset is designed to be lightweight yet feature-rich, offering powerful SQL-based querying, interactive dashboards, and a wide variety of data visualization options—all through an intuitive web-based interface.

Key Features

Rich Data Visualizations

Superset offers a clean and intuitive interface that makes it easy for users to navigate and create visualizations. The drag-and-drop functionality simplifies the process of building charts and dashboards, making it accessible even to non-technical users. Superset provides a wide range of customizable visualizations, from simple bar charts, line charts, pie charts, and scatter plots to more complex visuals like geospatial maps and heatmaps. This flexibility allows users to choose the best way to represent their data, facilitating better analysis and understanding.

  1. Bar Charts: Perfect for comparing different categories of data.
  2. Line Charts: Excellent for time-series analysis.
  3. Heatmaps: Useful for showing data density or intensity.
  4. Geospatial Maps: Visualize location-based data on geographical maps.
  5. Pie Charts, Treemaps, Sankey Diagrams, and More: Additional options for exploring relationships and proportions in the data.

SQL-Based Querying

One of Superset’s most powerful features is its support for SQL-based querying through SQL Lab, a built-in SQL editor where users can write and execute queries directly against connected databases, explore schemas, and preview data before creating visualizations. SQL Lab supports syntax highlighting, autocompletion, and query history, enhancing the SQL writing experience.

Interactive Dashboards

Superset allows users to create interactive dashboards with multiple charts, filters, and data points. These dashboards can be customized and shared across teams to deliver insights interactively. Real-time data updates ensure that the latest metrics are always displayed.

Extensible and Scalable

Apache Superset is highly extensible and can connect to a variety of data sources such as:

  1. SQL-based databases (PostgreSQL, MySQL, Oracle, etc.)
  2. Big Data platforms (Presto, Druid, Hive, and more)
  3. Cloud-native databases (Google BigQuery, Snowflake, Amazon Redshift)

This versatility ensures that users can easily access and analyze their data, regardless of where it is stored. Its architecture supports horizontal scaling, making it suitable for enterprises handling large-scale datasets.

Security and Authentication

As an enterprise-ready platform, Superset offers robust security features, including role-based access control (RBAC), authentication, and authorization mechanisms. Additionally, Superset is designed to scale with your organization, capable of handling large volumes of data and concurrent users. Superset integrates with common authentication protocols (OAuth, OpenID, LDAP) to ensure secure access. It also provides fine-grained access control through role-based security, enabling administrators to control access to specific dashboards, charts, and databases.

Low-Code and No-Code Data Exploration

Superset is ideal for both technical and non-technical users. While advanced users can write SQL queries to explore data, non-technical users can use the point-and-click interface to create visualizations without requiring code. This makes it accessible to everyone, from data scientists to business analysts.

Customizable Visualizations

Superset’s visualization framework allows users to modify the look and feel of their charts using custom JavaScript, CSS, and the powerful ECharts and D3.js libraries. This gives users the flexibility to create branded and unique visual representations.

Advanced Analytics

Superset includes features for advanced analytics, such as time-series analysis, trend lines, and complex aggregations. These capabilities enable users to perform in-depth analysis and uncover deeper insights from their data.

Architecture of Apache Superset

Superset’s architecture is modular and designed to be scalable, making it suitable for both small teams and large enterprises.

Here’s a breakdown of its core components:

Frontend (React-based):

Superset’s frontend is built using React, offering a smooth and responsive user interface for creating visualizations and interacting with data. The UI also leverages Bootstrap and other modern JavaScript libraries to enhance the user experience.

Backend (Python/Flask-based):

  1. The backend is powered by Python and Flask, a lightweight web framework. Superset uses SQLAlchemy as the SQL toolkit and Alembic for database migrations.
  2. Superset communicates with databases using SQLAlchemy to execute queries and fetch results.
  3. Celery and Redis can be used for background tasks and asynchronous queries, allowing for scalable query processing.

Metadata Database:

  1. Superset stores information about visualizations, dashboards, and user access in a metadata database. Common choices include PostgreSQL or MySQL.
  2. This database does not store the actual data being analyzed but rather metadata about the analysis (queries, charts, filters, and dashboards).

Caching Layer:

  1. Superset supports caching using Redis or Memcached. Caching improves the performance of frequently queried datasets and dashboards, ensuring faster load times.

Asynchronous Query Execution:

  1. For large datasets, Superset can run queries asynchronously using Celery workers. This prevents the UI from being blocked during long-running queries.

Worker and Beat​

This consists of one or more workers that execute tasks such as running async queries, taking snapshots of reports, and sending emails, plus a “beat” process that acts as the scheduler and tells the workers when to perform their tasks. Most installations use Celery for these components.

Getting Started with Apache Superset

Installation and Setup

Setting up Apache Superset is straightforward. It can be installed using Docker, pip, or by deploying it on a cloud platform. Here’s a brief overview of the installation process using Docker:

1. Install Docker: Ensure Docker is installed on your machine.

2. Clone the Superset Repository:

git clone https://github.com/apache/superset.git

cd superset

3. Run the Docker Compose Command:

docker-compose -f docker-compose-non-dev.yml up

4. Initialize the Database:

docker exec -it superset_superset-worker_1 superset db upgrade

docker exec -it superset_superset-worker_1 superset init

5. Access Superset: Open your web browser and go to http://localhost:8088 to access the Superset login page.

Configuring the Metadata Storage

The metadata database is where chart and dashboard definitions, user information, logs, and similar data are stored. Superset is tested to work with PostgreSQL and MySQL. In a Docker Compose installation, the data is stored in a PostgreSQL container volume, while the PyPI installation method uses an on-disk SQLite database. Neither of these is recommended for production instances of Superset; for production, a properly configured, managed, standalone database is recommended. Whatever database you use, plan to back it up regularly. In an upcoming Superset blog, we will go through how to configure Apache Superset with external metadata storage.
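As a minimal sketch of what that configuration can look like (the PostgreSQL host, credentials, database name, and file path below are assumptions, not values from this article), Superset picks up a superset_config.py file whose SQLALCHEMY_DATABASE_URI points at the external metadata database:

# Assumed example: point Superset's metadata store at an external PostgreSQL instance
cat > /app/pythonpath/superset_config.py <<'EOF'
# Metadata database connection string (placeholder credentials and host)
SQLALCHEMY_DATABASE_URI = "postgresql+psycopg2://superset:superset_pw@db.example.com:5432/superset_meta"
EOF

# Tell Superset where to find the config file (path is an assumption)
export SUPERSET_CONFIG_PATH=/app/pythonpath/superset_config.py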

Creating Your First Dashboard

1. Connect to a Data Source: Navigate to the Sources tab and add a new database or table.

2. Explore Data: Use SQL Lab to run queries and explore your data.

3. Create Charts: Go to the Charts tab, choose a dataset, and select a visualization type. Customize your chart using the various configuration options.

4. Build a Dashboard: Combine multiple charts into a cohesive dashboard. Drag and drop charts, add filters, and arrange them to create an interactive dashboard.

More Dashboards:

Use Cases of Apache Superset

  1. Business Intelligence & Reporting Superset is widely used in organizations for creating BI dashboards that track KPIs, sales, revenue, and other critical metrics. It’s a great alternative to commercial BI tools like Tableau or Power BI, particularly for organizations that prefer open-source solutions.
  2. Data Exploration for Data Science Data scientists can leverage Superset to explore datasets, run queries, and visualize complex relationships in the data before moving to more complex machine learning tasks.
  3. Operational Dashboards Superset can be used to create operational dashboards that track system health, service uptimes, or transaction statuses in real-time. Its ability to connect to various databases and run SQL queries in real time makes it a suitable choice for this use case.
  4. Geospatial Analytics With built-in support for geospatial visualizations, Superset is ideal for businesses that need to analyze location-based data. For example, a retail business can use it to analyze customer distribution or store performance across regions.
  5. E-commerce Data Analysis Superset is frequently used by e-commerce companies to analyze sales data, customer behavior, product performance, and marketing campaign effectiveness.

Advantages of Apache Superset

  1. Open-source and Cost-effective: Being an open-source tool, Superset is free to use and can be customized to meet specific needs, making it a cost-effective alternative to proprietary BI tools.
  2. Rich Customizations: Superset supports extensive visual customizations and can integrate with JavaScript libraries for more advanced use cases.
  3. Easy to Deploy: It’s relatively straightforward to set up on both local and cloud environments.
  4. SQL-based and Powerful: Ideal for organizations with a strong SQL-based querying culture.
  5. Extensible: Can be integrated with other data processing or visualization tools as needed.

Sharing and Collaboration

Superset makes it easy to share your visualizations and dashboards with others. You can export and import dashboards, share links, and embed visualizations in other applications. Additionally, Superset’s role-based access control ensures that users only have access to the data and visualizations they are authorized to view.

Conclusion

Apache Superset is a versatile and powerful tool for data exploration and visualization. Its user-friendly interface, a wide range of visualizations, and robust integration capabilities make it an excellent choice for businesses and data professionals looking to unlock insights from their data. Whether you’re just getting started with data visualization or you’re an experienced analyst, Superset provides the tools you need to create compelling and informative visualizations. Give it a try and see how it can transform your data analysis workflow.

You can also get in touch with us and we will be Happy to help with your custom implementations.


Comparison of and migrating from Pentaho Data Integration PDI/ Kettle to Apache HOP

Introduction

As data engineering evolves, so do the tools we use to manage and streamline our data workflows. The commercial open-source Pentaho Data Integration (PDI), commonly known as Kettle or Spoon, has been a popular choice among data professionals for over a decade. After acquiring Pentaho, Hitachi Vantara continued to support the Community Edition alongside the commercial offering, not just for the PDI / Data Integration platform but for the complete Business Intelligence suite, which included a comprehensive, flexible, and extensible set of tools and was regularly featured in analyst reports such as the Gartner BI Magic Quadrant, Forrester, and Dresner’s Wisdom of Crowds.

Over the last few years, however, the industry has shifted and several niche Pentaho alternatives have appeared. An alternative is also needed for Pentaho Community Edition users, since Hitachi Vantara / Pentaho has not released or supported the Community Edition (CE) of the Pentaho Business Intelligence and Data Integration platforms since November 2022. With the emergence of Apache Hop (Hop Orchestration Platform), a top-level Apache open-source project, organizations now have a modern, flexible alternative that builds on the foundations laid by PDI, making it one of the top Pentaho Data Integration alternatives.

This is the first part of a series of articles in which we highlight why Apache Hop can be considered a replacement for the Pentaho Data Integration platform, exploring its benefits and listing a few of its current limitations. In the next part, we provide a step-by-step guide to make the transition as smooth as possible.

Current Pentaho Enterprise and Community edition Releases:

A summary of the Release dates for the recent versions of Pentaho Platform along with their support commitment is captured in this table. You will notice that the last CE version was released in Nov 2022 while 3 newer EE versions have been released since.

| Enterprise Version | Release Date | Community Version | Release Date | Support |
| --- | --- | --- | --- | --- |
| Pentaho 10.2 | Expected in Q3 2024 | NA | NA | Long Term |
| Pentaho 10.1 GA | March 5, 2024 | NA | NA | Normal |
| Pentaho 10.0 | December 01, 2023 | NA | NA | Limited |
| Pentaho 9.5 | May 31, 2023 | NA | NA | Limited |
| Pentaho 9.4 | November 01, 2022 | 9.4CE | Same as EE | Limited |
| Pentaho 9.3 | May 04, 2022 | 9.3CE | Same as EE | Long Term |
| Pentaho 9.2 | August 03, 2021 | 9.2CE | Same as EE | Unsupported |
| Pentaho 9.1 | October 06, 2020 | NA | | Unsupported |
| Pentaho 9.0 | February 04, 2020 | NA | | Unsupported |
| Pentaho 8.3 | July 01, 2019 | 8.3CE | Same as EE | Unsupported |

Additionally, Pentaho EE 8.2, 8.1, 8.0 and Pentaho 7.X are all unsupported versions as of this writing.

Apache HOP - An Overview

Apache HOP is an open-source data integration and orchestration platform.

It allows users to design, manage, and execute data workflows (pipelines) and integration tasks (workflows) with ease. HOP’s visual interface, combined with its powerful backend, simplifies complex data processes, making it accessible for both technical and non-technical users.

Evolution from Kettle to HOP

As the visionary behind both Pentaho Data Integration (Kettle) and Apache HOP (Hop Orchestration Platform), Matt Casters has played a pivotal role in shaping the tools that power modern data workflows.

The Early Days: Creating Kettle

Matt Casters began his journey into the world of data integration in the early 2000s. Frustrated by the lack of flexible and user-friendly ETL (Extract, Transform, Load) tools available at the time, he set out to create a solution that would simplify the complex processes of data integration. This led to the birth of Kettle, an acronym for “Kettle ETTL Environment” (where ETTL stands for Extraction, Transformation, Transportation, and Loading).

Key Features of Kettle:

  1. Visual Interface: Kettle introduced a visual drag-and-drop interface, making it accessible to users without extensive programming knowledge.
  2. Extensibility: It was designed to be highly extensible, allowing users to create custom plugins and transformations.
  3. Open Source: Recognizing the power of community collaboration, Matt released Kettle as an open-source project, inviting developers worldwide to contribute and improve the tool.

Kettle quickly gained popularity for its ease of use, flexibility, and robust capabilities. It became a cornerstone for data integration tasks, helping organizations manage and transform their data with unprecedented ease.

The Pentaho Era

In 2006, Matt Casters joined Pentaho, a company dedicated to providing open-source business intelligence (BI) solutions. Kettle was rebranded as Pentaho Data Integration (PDI) and integrated into the broader Pentaho suite. This move brought several advantages:

  1. Resource Support: Being part of Pentaho provided Kettle with added resources, including development support, marketing, and a broader user base.
  2. Enhanced Features: Under Pentaho, PDI saw many enhancements, including improved scalability, performance, and integration with other BI tools.
  3. Community Growth: The backing of Pentaho helped grow the community of users and contributors, driving further innovation and adoption.

Despite these advancements, Matt Casters never lost sight of his commitment to open-source principles and community-driven development, ensuring that PDI stayed a flexible and powerful tool for users worldwide.

The Birth of Apache HOP

While PDI continued to evolve, Matt Casters recognized the need for a modern, flexible, and cloud-ready data orchestration platform. The landscape of data integration had changed significantly, with new challenges and opportunities emerging in the era of big data and cloud computing. This realization led to the creation of Apache HOP (Hop Orchestration Platform).

In 2020, Apache HOP was accepted as an incubator project by the Apache Software Foundation, marking a new chapter in its development and community support. This move underscored the project’s commitment to open-source principles and ensured that HOP would benefit from the robust governance and community-driven innovation that the Apache Foundation is known for.

Advantage of Apache HOP compared to Pentaho Data Integration

Apache HOP (Hop Orchestration Platform) and Pentaho Data Integration (PDI)/Kettle are both powerful data integration and orchestration tools. However, Apache HOP has several advantages over PDI, because of its evolution from PDI and adaptation to modern data needs. Below, we explore the key advantages of Apache HOP over Pentaho Data Integration Kettle:

Modern Architecture and Design

Modular and Extensible Framework

  • Apache HOP: Being more modern, it is built on a modular and extensible architecture, allowing for easier customization and addition of new features. Users can add or remove plugins without affecting the core functionality.
  • PDI (Kettle): While PDI is also extensible, its older architecture can make customization and plugin integration more cumbersome compared to HOP’s more streamlined approach.

Lightweight and Performance Optimized

  • Apache HOP: Designed to be lightweight and efficient, improving performance, particularly for large-scale and complex workflows.
  • PDI (Kettle): The older codebase may not be as optimized for performance in modern, resource-intensive data environments.

Hop’s metadata-driven design and extensive plugin library offer greater flexibility for building complex data workflows. Users can also develop custom plugins to extend Hop’s capabilities to meet specific needs.

Enhanced User Interface and Usability

Modern UI

  • Apache HOP: Features a modern and intuitive user interface, making it easier for users to design, manage, and monitor data workflows.
  • PDI (Kettle): Although functional, the user interface is dated and may not offer the same level of user experience and ease of use as HOP.

Improved Workflow Visualization

  • Apache HOP: Provides better visualization tools for workflows and pipelines, helping users understand and debug complex data processes more effectively.
  • PDI (Kettle): Visualization capabilities are good but can be less intuitive and harder to navigate compared to HOP.

The drag-and-drop functionality, combined with a cleaner and more organized layout, helps users create and manage workflows and pipelines more efficiently.

Apache HOP Web

Apache Hop also supports a web interface for developing and maintaining HOP files, unlike Pentaho Data Integration, where this feature is still in beta and available only in the Enterprise Edition. The web interface can be accessed at http://localhost:8080/hop/ui

Accessing HOP Status Page: http://localhost:8080/hop/status/

https://hop.apache.org/dev-manual/latest/hopweb/index.html

Advanced Development and Collaboration Features

Project-Based Approach

  • Apache HOP: Uses a project-based approach, allowing users to organize workflows, configurations, and resources into cohesive projects. This facilitates better version control, collaboration, and project management.
  • PDI (Kettle): Lacks a project-based organization, which can make managing complex data integration tasks more challenging.

Integration with Modern DevOps Practices

  • Apache HOP: Designed to integrate seamlessly with modern DevOps tools and practices, including CI/CD pipelines and containerization.
  • PDI (Kettle): Integration with DevOps tools is possible but not as seamless or integrated as with HOP, especially with the Community edition.

Apache HOP for CI/CD Integration with GitHub / Gitlab

Apache HOP (Hop Orchestration Platform) is a powerful and flexible data integration and orchestration tool. One of its standout features is its compatibility with modern development practices, including Continuous Integration and Continuous Deployment (CI/CD) pipelines. By integrating Apache HOP with GitHub, development teams can streamline their workflows, automate testing and deployment, and ensure consistent quality and performance. In this blog, we’ll explore the advanced features of Apache HOP that support CI/CD integration and provide a guide on setting it up with GitHub.

Why Integrate Apache HOP with CI/CD?

  1. Automation: Automate repetitive tasks such as testing, building, and deploying HOP projects.
  2. Consistency: Ensure that all environments (development, testing, production) are consistent by using automated pipelines.
  3. Faster Delivery: Speed up the delivery of updates and new features by automating the deployment process.
  4. Quality Assurance: Integrate testing into the pipeline to catch errors and bugs early in the development cycle.
  5. Collaboration: Improve team collaboration by using version control and automated workflows.

Advanced Features of Apache HOP for CI/CD

  1. Project-Based Approach: Apache HOP’s project-based architecture allows for easy organization and management of workflows, making it ideal for CI/CD pipelines.
  2. Command-Line Interface (CLI): HOP provides a robust CLI that enables automation of workflows and pipelines, easing integration into CI/CD pipelines (see the sketch after this list).
  3. Integration with Version Control Systems: Apache HOP supports integration with Git, allowing users to version control their workflows and configurations directly in GitHub.
  4. Parameterization and Environment Configurations: HOP allows parameterization of workflows and environment-specific configurations, enabling seamless transitions between development, testing, and production environments.
  5. Test Framework Integration: Apache HOP supports integration with various testing frameworks, allowing for automated testing of data workflows as part of the CI/CD pipeline.
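As a rough sketch of how a pipeline can be run headlessly from a CI/CD job (the project name, run configuration, and pipeline file below are placeholders, and the option names should be verified against the Hop documentation for your version):

# Assumed example: execute a pipeline from the Hop installation directory
./hop-run.sh --project=my-hop-project --runconfig=local --file=pipelines/load_sales.hpl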

Cloud-Native Capabilities

As the world moves towards cloud-first strategies, understanding how Apache HOP integrates with cloud environments is crucial for maximizing its potential. The cloud support for Apache HOP, exploring its benefits, features, and practical applications opens a world of possibilities for organizations looking to perfect their data workflows in the cloud. As cloud adoption continues to grow, using Apache HOP can help organizations stay ahead in the data-driven world

Cloud Integration

  • Apache HOP: Built with cloud integration in mind, providing robust support for deploying on various cloud platforms and integrating with cloud storage, databases, and services.
  • PDI (Kettle): While PDI can be used in cloud environments, it lacks the inherent cloud-native design and seamless integration capabilities of HOP, especially for the Community edition.

Integration with Cloud Storage

Data workflows often involve large data sets stored in cloud storage solutions. Apache HOP provides out-of-the-box connectors for major cloud storage services:

  • Amazon S3: Seamlessly read from and write to Amazon S3 buckets.
  • Google Cloud Storage: Integrate with GCS for scalable and secure data storage.
  • Azure Blob Storage: Use Azure Blob Storage for efficient data handling.

Cloud-native Databases and Data Warehouses: 

Modern data architectures often leverage cloud-native databases and data warehouses. Apache HOP supports integration with:

  • Amazon RDS and Redshift: Connect to relational databases and data warehouses on AWS.
  • Google Big Query: Integrate with Big Query for fast, SQL-based analytics.
  • Azure SQL Database and Synapse Analytics: Use Microsoft’s cloud databases for scalable data solutions.

Cloud-native Data Processing

Apache HOP’s integration capabilities extend to cloud-native data processing services, allowing for powerful and scalable data transformations:

  • AWS Glue: Use AWS Glue for serverless ETL jobs.
  • Google Dataflow: Integrate with Dataflow for stream and batch data processing.
  • Azure Data Factory: Leverage ADF for hybrid data integration.

Security and Compliance

Security is paramount in cloud environments. Apache HOP supports various security protocols and practices to ensure data integrity and compliance:

  • Encryption: Support for encrypted data transfers and storage.
  • Authentication and Authorization: Integrate with cloud identity services for secure access control.

  • Compliance: Ensure workflows comply with industry standards and regulations.

Features Summary and Comparison

| Feature | Kettle | Hop |
| --- | --- | --- |
| Projects and Lifecycle Configuration | No | Yes |
| Search Information in projects and configurations | No | Yes |
| Configuration management through UI and command line | No | Yes |
| Standardized shared metadata | No | Yes |
| Pluggable runtime engines | No | Yes |
| Advanced GUI features: memory, native zoom | No | Yes |
| Metadata Injection | Yes | Yes (most transforms) |
| Mapping (sub-transformation/pipeline) | Yes | Yes (simplified) |
| Web Interface | Web Spoon | Hop Web |
| APL 2.0 license compliance | LGPL doubts regarding pentaho-metastore library | Yes |
| Pluggable metadata objects | No | Yes |
| GUI plugin architecture | XUL based (XML) | Java annotations |

External Link:

https://hop.apache.org/tech-manual/latest/hop-vs-kettle/hop-vs-kettle.html

 

Community and Ecosystem

Open-Source Advantages

  • Apache HOP: Fully open-source under the Apache License, offering transparency, flexibility, and community-driven enhancements.
  • PDI (Kettle): While also open-source, with a large user base and extensive documentation, PDI’s development has slowed and it has not received as many updates or new features as HOP. Its development has also been more tightly controlled by Hitachi Vantara in recent years, potentially limiting community contributions and innovation compared to HOP.

Active Development and Community Support

Apache Hop is actively developed and maintained under the Apache Software Foundation, ensuring regular updates, bug fixes, and new features. The community support for Apache HOP is a cornerstone of its success. The Apache Software Foundation (ASF) has always championed the concept of community over code, and Apache HOP is a shining example of this ethos in action.

Why Community Support Matters

  1. Accelerated Development and Innovation: The community continuously contributes to the development and enhancement of Apache HOP. From submitting bug reports to developing new features, the community’s input is invaluable. This collaborative effort accelerates the innovation cycle, ensuring that Apache HOP stays innovative and highly functional.
  2. Resource Sharing: The Apache HOP community is a treasure trove of resources. From comprehensive documentation and how-to guides to video tutorials and webinars, community members create and share a wealth of knowledge. This collective pool of information helps both beginners and experienced users navigate the platform with ease.
  3. Peer Support and Troubleshooting: One of the standout benefits of community support is the peer-to-peer assistance available through forums, mailing lists, and chat channels. Users can seek help, share solutions, and discuss best practices. This collaborative troubleshooting often leads to quicker resolutions and deeper understanding of the platform.
  4. Networking and Collaboration: Being part of the Apache HOP community opens doors to networking opportunities. Engaging with other professionals in the field can lead to collaborative projects, job opportunities, and professional growth. It’s a platform for like-minded individuals to connect and create meaningful professional relationships.

All this can be seen from the frequent, consistent releases with key features released in each release captured in the table below.

Apache Hop 3.0 (planned Q4 2024): Future release items.

Apache Hop 2.10 (August 31, 2024): Introduced several new features and improvements, including enhanced plugin management, bug fixes and performance enhancements, and new tools and utilities.

Apache Hop 2.9 (May 24, 2024): Includes various new features such as a static schema metadata type, a CrateDB database dialect and bulk loader, and several improvements in transforms.

Apache Hop 2.8 (March 13, 2024): Brought new AWS transforms (SNS Notify and SQS Reader), many bug fixes, and performance improvements.

Apache Hop 2.7 (December 1, 2023): Featured the Redshift bulk loader, JDBC driver refactoring, and other enhancements.

Apache Hop 2.6 (September 19, 2023): Included new Google transforms (Google Analytics 4 and Google Sheets Input/Output), an Apache Beam upgrade, and various bug fixes.

Apache Hop 2.5 (July 18, 2023): Focused on various bug fixes and new features, including an upgrade to Apache Beam 2.48.0 with support for Apache Spark 3.4, Apache Flink 1.16, and Google Cloud Dataflow. Additional updates included a new InterSystems IRIS database type, JSON input and output improvements, Salesforce input enhancements, an upgrade to DuckDB 0.8, and the addition of Polish language support.

Apache Hop 2.4 (March 31, 2023): Introduced new features like DuckDB support, a new script transform, and various improvements in existing transforms and documentation.

Apache Hop 2.3 (February 1, 2023): Focused mainly on bug fixes and included a few new features. One significant update was the integration of Weblate, a new translation tool that simplifies the contribution of translations. Another key addition was the integration of the Vertica Bulk Loader into the main code base, enhancing data loading speeds to the Vertica analytical database.

Apache Hop 2.2 (December 6, 2022): Involved significant improvements and fixes, addressing over 160 tickets. Key updates included enhancements to the Hop GUI, such as a new welcome dialog, navigation viewport, data grid toolbars, and a configuration perspective, as well as upgrades to various components, including Apache Beam and Google Dataflow.

Apache Hop 2.1 (October 14, 2022): Included various new features such as MongoDB integration, Apache Beam execution, and new plugins for data profiling and documentation improvements.

Apache Hop 2.0 (June 17, 2022): Introduced various bug fixes and improvements, including enhancements to the metadata injection functionality and documentation updates. The update also included new transform plugins such as Apache Avro File Output, Apache Doris Bulk Loader, Drools Rules Accumulator, and Drools Rules Executor, as well as a new Formula transform. Additionally, the user interface for the Dimension Lookup/Update transform was cleaned up and improved.

Apache Hop 1.2 (March 7, 2022): Included several improvements to Hop GUI, Docker support, Neo4j integration, and Kafka and Avro transforms. It also introduced the Hop Translator tool for easier localization efforts, starting with Chinese translations.

Apache Hop 1.1 (January 24, 2022): Key updates included improvements in metadata injection, enhancements to the graphical user interface, support for more data formats, and various performance optimizations.

Apache Hop 1.0 (January 17, 2022): Marked Hop’s transition from incubation, featuring a clean architecture, support for over 20 plugin types, and a revamped Hop GUI for designing workflows and pipelines.

Additional Links:

https://hop.apache.org/categories/Release/ , https://hop.apache.org/docs/roadmap/ 

Few Limitations with HOP

While Apache HOP has several advantages compared to Pentaho ETL, by the nature of it being a comparatively newer platform, there are a few limitations we have encountered when using it. Some of these are already recorded as issues in the HOP Github and are scheduled to be fixed in upcoming releases.

| Type | Details |
| --- | --- |
| HOP GUI | The HOP GUI does not allow changing the “Project Home path” to a valid path after an invalid project path has been set. |
| HOP GUI | Repeatedly prompts to enter GitHub credentials. |
| HOP GUI | While saving a new pipeline, the HOP GUI appends the previously opened pipeline names. |
| HOP Server | Multiple Hop Server object IDs for a single HOP pipeline on the HOP Server. |
| HOP Server | Hop Server objects (Pipeline/Workflow) status is null and the metrics information is not shown. |
| HOP Web | Unable to copy a transform from one pipeline to another pipeline. |
| HOP GUI | Log table options in the Workflow Properties tab. |
| HOP GUI | Shows a folder icon for HPL files. |
| HOP GUI | Dimension Lookup & Update transform SQL button NullPointerException. |

A few of these issues can act as an impediment to using Apache HOP, depending on the specific use case. We will talk more about this in the next blog article of this series.

Conclusion

Apache HOP brings a host of advantages over Pentaho Data Integration Kettle, driven by its modern architecture, enhanced usability, advanced development features, cloud-native capabilities, and active community support. These advantages make Apache HOP a compelling choice for organizations looking to streamline their data integration and orchestration processes, especially in today’s cloud-centric and agile development environments. By using Apache HOP, businesses can achieve more efficient, scalable, and manageable data workflows, positioning themselves for success in the data-driven future.

Most importantly, Hitachi Vantara / Pentaho has not released Community versions of PDI or security patches for nearly two years and has also removed the links to download older versions of the software from SourceForge. This makes it risky to continue running Pentaho Community Edition in production because of unresolved vulnerabilities.

Need help to migrate your Pentaho Artifacts to Apache HOP? Our experts can help.


Simpler alternative to Kubernetes – Docker Swarm with Swarmpit

Introduction:

As more and more applications move to private or public cloud environments and are modernized around a microservices architecture, they are typically containerized with Docker and orchestrated with Kubernetes, a popular platform choice. Kubernetes offers many advanced features and great flexibility, making it suitable for complex, large-scale applications. However, it has a steeper learning curve and requires more time, effort, and resources to set up, monitor, and manage, which can be overkill for simpler applications that do not need those advanced features.

Docker Swarm is another open-source orchestration platform with a much simpler architecture and can be used to do most of the activities that one does with Kubernetes including the deployment and management of containerized applications across a cluster of Docker hosts with built-in clustering capabilities and load balancing enabling you to manage multiple Docker hosts as a single virtual entity.

Swarmpit is a little-known open-source web-based interface which offers a simple and intuitive interface for easy monitoring and management of the Docker Swarm cluster.

In this article, we walk through the process of setting up a Docker Swarm cluster with one master node and two worker nodes and configure Swarmpit to easily manage and monitor this Swarm Cluster.

Problem Statement

Managing containerized applications at scale can be challenging due to the complexity of handling multiple nodes, ensuring high availability, and supporting scalability. Single-node deployments limit redundancy and scalability, while manually managing multiple nodes is time-consuming and error-prone.

Docker Swarm addresses these challenges with its clustering and orchestration capabilities, but monitoring and managing the cluster can still be complex. This is addressed using Swarmpit.

Use Cases

  1. High Availability: Ensure your application remains available even if individual nodes fail.
  2. Scalability: Easily scale services up or down to meet demand.
  3. Simplified Management: Use a single interface to manage multiple Docker hosts.
  4. Efficient Resource Utilization: Distribute container workloads across multiple nodes.

Why Choose Docker Swarm Over Kubernetes?

Docker Swarm is an excellent choice for smaller or less complex deployments because:

  1. Simplicity and Ease of Use: Docker Swarm is easier to set up and manage compared to Kubernetes. It integrates seamlessly with Docker’s command line tools, making it simpler for those already familiar with Docker.
  2. Faster Deployment: Swarm allows you to get your cluster up and running quickly without the intricate setup required by Kubernetes.
  3. High Availability and Scaling: Docker Swarm effectively manages high availability by redistributing tasks if nodes fail and supports multiple manager nodes. Scaling is straightforward with the docker service scale command and resource constraints (see the sketch after this list). Kubernetes offers similar capabilities but with more advanced configurations, like horizontal pod autoscaling based on custom metrics for finer control.
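For illustration, the commands below scale a service and attach CPU and memory constraints to it; the service name and limit values are placeholders, and the scale command itself reappears later in this guide:

# Placeholder service name and limits; standard docker service options
docker service scale my_service=5

docker service update --limit-cpu 0.5 --limit-memory 256M --reserve-cpu 0.25 --reserve-memory 128M my_service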

Limitations of Docker Swarm

  1. Advanced Features and Extensibility: Docker Swarm lacks some of the advanced features and customization options found in Kubernetes, such as detailed resource management and extensive extensions.
  2. Ecosystem: Kubernetes has a larger community and more integrations, offering a broader range of tools and support.

While Kubernetes might be better for complex needs and extensive customization, Docker Swarm offers effective high availability and scaling in a more straightforward and manageable way for simpler use cases.

Comparison:

Prerequisites

Before you start, ensure you have the following:

  1. Three Ubuntu machines with Docker installed.
  2. Access to the machines via SSH.
  3. A basic understanding of Docker and Docker Compose.

Setting Up the Docker Swarm Cluster

To start, you need to prepare your machines. Begin by updating the system packages on all three Ubuntu machines. This ensures that you are working with the latest software and security updates. Use the following commands:

sudo apt update

sudo apt upgrade -y

Next, install Docker on each machine using:

sudo apt install -y docker.io

Enable and start Docker to ensure it runs on system boot:

sudo systemctl enable docker

sudo systemctl start docker 

It is essential that all nodes are running the same version of Docker to avoid compatibility issues. Check the Docker version on each node with:

docker --version

With the machines prepared, go to the instance where you want the manager node to be present and run the command below.

docker swarm init

This will initialize the swarm process and provide a token using which you can add the worker node.

Copy the join command provided by the system and paste it into the instance or machine that you want to act as a worker node. Now both the manager and worker nodes are ready.
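The join command printed by docker swarm init typically has the following shape (the token and manager IP below are placeholders):

docker swarm join --token SWMTKN-1-<generated-token> <manager-ip>:2377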

Once all nodes are joined, verify that the Swarm is properly configured. On the master node, list all nodes using: 

docker node ls

This command should display the master node and the two worker nodes, confirming that they are part of the Swarm. Additionally, you can inspect the Swarm’s status with:

docker info

Check the Swarm section to ensure it is active and reflects the correct number of nodes.

To ensure that your cluster is functioning as expected, deploy a sample service.

Create a Docker service named my_service with three replicas of the nginx image: 

docker service create --name my_service --replicas 3 nginx

Verify the deployment by listing the services and checking their status:

docker service ls

docker service ps my_service

Managing and scaling your Docker Swarm cluster is straightforward. To scale the number of replicas for your service, use:

docker service scale my_service=5 

If you need to update the service to a new image, you can do so with: 

docker service update --image nginx:latest my_service

Troubleshooting Common Issues

Node Not Joining the Swarm

  1. Check Docker Version: Ensure all nodes are running compatible Docker versions.
  2. Firewall Settings: Make sure port 2377 is open on the master node.
  3. Network Connectivity: Verify network connectivity between nodes.
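Beyond port 2377 (cluster management), Swarm also relies on port 7946 (TCP/UDP, node-to-node communication) and port 4789 (UDP, overlay network traffic). If you use ufw, the rules might look like this sketch (assuming ufw is your firewall of choice):

# Assumed firewall setup with ufw; adapt to your firewall tooling
sudo ufw allow 2377/tcp

sudo ufw allow 7946/tcp

sudo ufw allow 7946/udp

sudo ufw allow 4789/udp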

Deploying Swarmpit for Monitoring and Management

Swarmpit is a user-friendly web interface for monitoring and managing Docker Swarm clusters. It simplifies cluster administration, providing an intuitive dashboard to monitor and control your Swarm services and nodes. Here’s how you can set up Swarmpit and use it to manage your Docker Swarm cluster.

Deploy Swarmpit Container

On the manager node, deploy Swarmpit using the following Docker command:

docker run -d -p 8888:8888 --name swarmpit -v /var/run/docker.sock:/var/run/docker.sock swarmpit/swarmpit:latest

Access the Swarmpit Interface

Open your web browser and navigate to http://<manager-node-ip>:8888. You will see the Swarmpit login page. Use the default credentials (admin/admin) to log in. It is recommended to change the default password after the first login.

Using Swarmpit for Monitoring and Management

Once you have logged in, you can perform a variety of tasks through Swarmpit’s web interface:


Dashboard Overview

  1. Cluster Health: Monitor the overall health and status of your Docker Swarm cluster.
  2. Node Information: View detailed information about each node, including its status, resources, and running services.


Service Management

  1. Deploy Services: Easily deploy new services by filling out a form with the necessary parameters (image, replicas, ports, etc.).
  2. Scale Services: Adjust the number of replicas for your services directly from the interface.
  3. Update Services: Change service configurations, such as updating the Docker image or environment variables.
  4. Service Logs: Access logs for each service to troubleshoot and watch their behavior.

Container Management

  1. View Containers: List all running containers across the cluster, including their status and resource usage.
  2. Start/Stop Containers: Manually start or stop containers as needed.
  3. Container Logs: Access logs for individual containers for troubleshooting purposes.

Network and Volume Management

  • Create Networks: Define and manage custom networks for your services.
  • Create Volumes: Create and manage Docker volumes for persistent storage.

User Management

  1. Add Users: Create additional user accounts with varying levels of access and permissions.
  2. Manage Roles: Assign roles to users to control what actions they can perform within the Swarmpit interface.

Benefits of Using Swarmpit

  1. User-Friendly Interface: Simplifies the complex task of managing a Docker Swarm cluster with a graphical user interface.
  2. Centralized Management: Provides a single point of control over all aspects of your Swarm cluster, from node management to service deployment.
  3. Real-Time Monitoring: Offers real-time insights into the health and performance of your cluster and its services.
  4. Enhanced Troubleshooting: Facilitates easy access to logs and service status for quick issue resolution.

Conclusion

By integrating Swarmpit into the Docker Swarm setup, we get a powerful tool that streamlines cluster management and monitoring. Its comprehensive features and intuitive interface make it easier to maintain a healthy and efficient Docker Swarm environment, enhancing the ability to deploy and manage containerized applications effectively.

Frequently Asked Questions (FAQs)

Docker Swarm is a clustering tool that manages multiple Docker hosts as a single entity, providing high availability, load balancing, and easy scaling of containerized applications.

Install Docker on all machines, initialize the Swarm on the master with docker swarm init, and join worker nodes using the provided token. Verify the setup with docker node ls.

Docker Swarm offers high availability, scalability, simplified management, load balancing, and automated failover for services across a cluster.

Swarmpit is a web-based interface for managing and monitoring Docker Swarm clusters, providing a visual dashboard for overseeing services, nodes, and logs.

Access Swarmpit at http://<manager-node-ip>:8888, log in, and use the dashboard to monitor the health of nodes, view service status, and manage configurations.

Blogbanner_2

Keycloak deployment on Kubernetes with Helm charts using an external PostgreSQL database

Prerequisites:

  1. Kubernetes cluster set up and configured.
  2. Helm installed on your Kubernetes cluster.
  3. Basic understanding of Kubernetes concepts like Pods, Deployments, and Services.
  4. Familiarity with Helm charts and templating.

Introduction:

Deploying Keycloak on Kubernetes with an external PostgreSQL database can be challenging, especially when using Helm charts. One common issue is that Keycloak deploys with a default database service when the Helm chart is used, making it difficult to integrate with an external database.

In this article, we’ll explore the problem we encountered while deploying Keycloak on Kubernetes using Helm charts and describe the solution we implemented to seamlessly use an external PostgreSQL database.

Problem:

The primary issue we faced during the deployment of Keycloak on Kubernetes using Helm was the automatic deployment of a default database service. This default service conflicted with our requirement to use an external PostgreSQL database for Keycloak. The Helm chart, by default, would deploy an internal database, making it challenging to configure Keycloak to connect to an external database.

Problem Analysis

  1. Default Database Deployment: The Helm chart for Keycloak automatically deploys an internal PostgreSQL database. This default setup is convenient for simple deployments but problematic when an external database is required.
  2. Configuration Complexity: Customizing the Helm chart to disable the internal database and correctly configure Keycloak to use an external PostgreSQL database requires careful adjustments to the values.yaml file.
  3. Integration Challenges: Ensuring seamless integration with an external PostgreSQL database involves specifying the correct database connection parameters and making sure that these settings are correctly propagated to the Keycloak deployment.
  4. Persistence and Storage: The internal database deployed by default may not meet the persistence and storage requirements for production environments, where an external managed PostgreSQL service is preferred for reliability and scalability.

To address these issues, the following step-by-step guide provides detailed instructions on customizing the Keycloak Helm chart to disable the default database and configure it to use an external PostgreSQL database.

Overview Diagram:

Step-by-Step Guide

Step 1: Setting Up Helm Repository

If you haven’t already added the official Helm charts repository for Keycloak, you can add it using the following commands:

helm repo add codecentric https://codecentric.github.io/helm-charts

helm repo update

By adding the official Helm charts repository for Keycloak, you ensure that you have access to the latest charts maintained by the developers. Updating the repository ensures you have the most recent versions of the charts.
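If you want to confirm that the Keycloak charts are now visible locally, you can, for example, search the repository you just added:

# List the Keycloak charts published in the codecentric repository
helm search repo codecentric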

Step 2: Customizing Helm Values

Objective: Customize the Keycloak Helm chart to avoid deploying the default database and configure it to use an external PostgreSQL database.

Configure Keycloak for development mode

Create a values.yaml File

  1. Create a new file named values.yaml.
  2. Add the following content to the file:

image:
  # The Keycloak image repository
  repository: quay.io/keycloak/keycloak
  # Overrides the Keycloak image tag whose default is the chart appVersion
  tag: 24.0.3
  # The Keycloak image pull policy
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: "500m"
    memory: "1024Mi"
  limits:
    cpu: "500m"
    memory: "1024Mi"

args:
  - start-dev
  - --hostname=<url>
  - --hostname-url=<url>
  - --verbose

autoscaling:
  # If `true`, an autoscaling/v2beta2 HorizontalPodAutoscaler resource is created (requires Kubernetes 1.18 or above)
  # Autoscaling seems to be most reliable when using KUBE_PING service discovery (see README for details)
  # This disables the `replicas` field in the StatefulSet
  enabled: false
  # The minimum and maximum number of replicas for the Keycloak StatefulSet
  minReplicas: 1
  maxReplicas: 2

ingress:
  enabled: true
  #hosts:
  #  - <url>
  ssl:
    letsencrypt: true
    cert_secret: <url>
  annotations:
    kubernetes.io/ingress.class: "nginx"
    kubernetes.io/tls-acme: "true"
    cert-manager.io/cluster-issuer: letsencrypt
    cert-manager.io/acme-challenge-type: dns01
    cert-manager.io/acme-dns01-provider: route53
    nginx.ingress.kubernetes.io/proxy-buffer-size: "128k"
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "sticky-cookie"
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
    nginx.ingress.kubernetes.io/affinity-mode: persistent
    nginx.ingress.kubernetes.io/session-cookie-hash: sha1
  labels: {}
  rules:
    -
      # Ingress host
      host: '<url>'
      # Paths for the host
      paths:
        - path: /
          pathType: Prefix
  tls:
    - hosts:
        - <url>
      secretName: "<url>"

extraEnv: |
  - name: PROXY_ADDRESS_FORWARDING
    value: "true"
  - name: QUARKUS_THREAD_POOL_MAX_THREADS
    value: "500"
  - name: QUARKUS_THREAD_POOL_QUEUE_SIZE
    value: "500"

 

This configuration file customizes the Keycloak Helm chart to set specific resource requests and limits, ingress settings, and additional environment variables. By setting the args to start Keycloak in development mode, you allow for easier initial setup and testing.
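Before installing anything, it can be useful to render the chart locally with these values and review the generated manifests. The commands below are a minimal sketch; they assume the chart is codecentric/keycloakx (the Quarkus-based chart; substitute codecentric/keycloak if you are using the legacy chart) and that the release is named keycloak:

# Render the templates locally with the custom values; nothing is applied to the cluster
helm template keycloak codecentric/keycloakx -f values.yaml > rendered.yaml

# Or perform a server-side dry run against the target namespace
helm install keycloak codecentric/keycloakx -f values.yaml --namespace keycloak --dry-run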

Configuring for Production Mode

  1. Add or replace the following content in values.yaml for production mode:

args:
  - start
  - --hostname=<url>
  - --hostname-url=<url>
  - --verbose
  - --optimized
  - -Dquarkus.http.host=0.0.0.0
  - -Dquarkus.http.port=8080

Note: The production configuration includes optimizations and ensures that Keycloak runs in a stable environment suitable for production workloads. The --optimized flag is added for performance improvements.

Configuring for External Database

  1. Add the following content to values.yaml to use an external PostgreSQL database:

args:
  - start
  - --hostname-url=<url>
  - --verbose
  - --db=postgres
  - --db-url=<jdbc-url>
  - --db-password=${DB_PASSWORD}
  - --db-username=${DB_USER}

postgresql:
  enabled: false

This configuration disables the default PostgreSQL deployment by setting postgresql.enabled to false. The database connection arguments are provided to connect Keycloak to an external PostgreSQL database.
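The ${DB_USER} and ${DB_PASSWORD} placeholders have to be supplied to the Keycloak pod as environment variables. One common pattern, shown here only as an assumption rather than something the chart enforces, is to keep the credentials in a Kubernetes Secret and reference it from the chart's extraEnv block (similar to the PROXY_ADDRESS_FORWARDING entries above) using valueFrom/secretKeyRef:

# Create a secret holding the external database credentials (names and values are examples)
kubectl create secret generic keycloak-db-credentials \
  --namespace keycloak \
  --from-literal=DB_USER=keycloak \
  --from-literal=DB_PASSWORD='change-me'

# Verify the secret exists before deploying the chart
kubectl get secret keycloak-db-credentials -n keycloak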

Step 3: Deploying Keycloak with PostgreSQL and Custom Themes Using Helm

Objective: Add custom themes to Keycloak and deploy it using Helm.

  1. Add the following to values.yaml to include custom themes:

extraInitContainers: |
  - name: theme-provider
    image: <docker-hub-registry-url>
    imagePullPolicy: IfNotPresent
    command:
      - sh
    args:
      - -c
      - |
        echo "Copying custom theme..."
        cp -R /custom-themes/* /eha-clinic
    volumeMounts:
      - name: custom-theme
        mountPath: /eha-clinic

extraVolumeMounts: |
  - name: custom-theme
    mountPath: /opt/jboss/keycloak/themes/

extraVolumes: |
  - name: custom-theme
    emptyDir: {}

This configuration uses an init container to copy custom themes into the Keycloak container. The themes are mounted at the appropriate location within the Keycloak container, ensuring they are available when Keycloak starts.
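With values.yaml prepared, the release can be installed (or upgraded) with Helm. The commands below are a sketch that again assumes the codecentric/keycloakx chart, a namespace called keycloak, and a release named keycloak; adjust the names to match your environment:

# Install or upgrade the release, creating the namespace if it does not exist yet
helm upgrade --install keycloak codecentric/keycloakx -f values.yaml \
  --namespace keycloak --create-namespace

# Watch the pods come up and confirm the ingress was created
kubectl get pods -n keycloak -w
kubectl get ingress -n keycloak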

Step 4: Configuring Keycloak

Objective: Log in to the Keycloak admin console and configure realms, users, roles, client applications, and other settings.

Access the Keycloak admin console using the URL provided by your ingress configuration.

  1. Log in with the default credentials (admin/admin).
  2. Configure the following according to your application requirements:
  • Realms
  • Users
  • Roles
  • Client applications

The Keycloak admin console allows for comprehensive configuration of all aspects of authentication and authorization, tailored to the needs of your applications.

Step 5: Configuring Custom Themes

Objective: Apply and configure custom themes within the Keycloak admin console.

  1. Log in to the Keycloak admin console using the default credentials (admin/admin).
  2. Navigate to the realm settings and select the “Themes” tab.
  3. Select and configure your custom themes for:
  • Login pages
  • Account pages
  • Email pages

Custom themes enhance the user experience by providing a personalized and branded interface. This step ensures that the authentication experience aligns with your organization’s branding and user interface guidelines.

Conclusion

By following the steps outlined in this article, you can deploy Keycloak with PostgreSQL on Kubernetes using Helm, while also incorporating custom themes to personalize the authentication experience. Leveraging Helm charts simplifies the deployment process, while Keycloak and PostgreSQL offer robust features for authentication and data storage. Integrating custom themes allows you to tailor the authentication pages according to your branding and user interface requirements, ultimately enhancing the user experience and security of your applications.

Frequently Asked Questions (FAQs)

Keycloak is an open-source identity and access management solution for modern applications and services. It offers features such as single sign-on, identity brokering, and social login, which make user identity management and application security easier.

Deploying Keycloak on Kubernetes gives you an elastic deployment that scales with the number of users and is resilient to failures such as external service unavailability or internal server crashes; it is also easy to manage, supports numerous authentication protocols, and connects to different types of external databases.

Helm charts are pre-configured Kubernetes resource packages that simplify the efficient management and deployment of applications on Kubernetes.

To disable the default PostgreSQL database, set postgresql.enabled to false in the values.yaml file.

Provide the necessary database connection parameters in the values.yaml file, including --db-url, --db-password, and --db-username.

You can add custom themes by configuring init containers in the values.yaml file to copy the themes into the Keycloak container and mounting them at the appropriate location.


Integrating Apache Jmeter with Jenkins

In the world of software development, ensuring the performance and reliability of applications is paramount. One of the most popular tools for performance testing is Apache JMeter, known for its flexibility and scalability. Meanwhile, Jenkins has become the go-to choice for continuous integration and continuous delivery (CI/CD). Combining the power of JMeter with the automation capabilities of Jenkins can significantly enhance the efficiency of performance testing within the development pipeline. In this article, we’ll explore the integration of JMeter with Jenkins and how it can streamline the performance testing process.

Apache Jmeter

Apache JMeter is a powerful open-source tool designed for load testing, performance testing, and functional testing of applications. It provides a user-friendly GUI that allows testers to create and execute several types of tests, including HTTP, FTP, JDBC, LDAP, and more. JMeter supports simulating heavy loads on servers, analyzing overall performance metrics, and finding performance bottlenecks.

With its scripting and parameterization capabilities, JMeter offers flexibility and scalability for testing web applications, APIs, databases, and other software systems. Its extensive reporting features help teams assess application performance under different conditions, making it an essential tool for ensuring the reliability and scalability of software applications. More information is available here: Apache JMeter.
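For CI/CD use, the feature that matters most is JMeter's non-GUI mode, which is what the Jenkins pipeline later in this article invokes. As a small illustration (the test plan and output paths are placeholders):

# Run a test plan in non-GUI mode (-n), write raw results to a .jtl file (-l),
# and generate the HTML dashboard report (-e) into an empty output folder (-o)
jmeter -n -t sample-test-plan.jmx -l results.jtl -e -o html-report/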

Jenkins

Jenkins is one of the most popular open-source automation servers and has been widely used for continuous integration (CI) and continuous delivery (CD) in software development for several years. It allows developers to automate the building, testing, and deployment of applications, thereby streamlining the development lifecycle. Jenkins supports integration with various version control systems like Git, SVN, and Mercurial, enabling automatic triggers for builds whenever code changes are pushed.

Its extensive plugin ecosystem provides flexibility to integrate with a wide range of tools and technologies, making it a versatile solution for managing complex CI/CD pipelines. Jenkins’ intuitive web interface, extensive plugin library, and robust scalability make it a popular choice for teams aiming to achieve efficient and automated software delivery processes. The Jenkins docs have a page to help with the Jenkins installation process.

Why Integrate JMeter with Jenkins?

Traditionally, performance testing has been a manual and time-consuming process, often conducted by test teams as a separate part of the development lifecycle. The results then had to be shared with the rest of the team, as there was no automated execution or capturing of test results as part of the CI/CD pipeline. However, in today’s fast-paced software development environment, there is a growing need to automate the complete testing process, including the execution of performance tests as part of the CI/CD pipeline. By integrating JMeter with Jenkins, organizations can achieve the following benefits:

Automation: Jenkins allows you to automate the execution of JMeter tests as part of your CI/CD pipeline, enabling frequent and consistent performance testing with minimal manual intervention.

Continuous Feedback: Incorporating performance tests into Jenkins pipelines provides immediate feedback on the impact of code changes on application performance, allowing developers to find and address performance issues early in the development cycle.

Reporting: Jenkins provides robust reporting and visualization capabilities, allowing teams to analyze test results and track performance trends over time, helping data-driven decision-making.

Our Proposed Approach & Its Advantages

We’ve adopted a new approach, in addition to using the existing JMeter plugin for Jenkins, wherein we enhance the Jenkins pipeline to include detailed notifications and better organization of results.

The key steps of our approach are as follows.

  1. We install JMeter directly on the agent’s base OS. This ensures we have access to the latest features and updates.
  2. We use the powerful BlazeMeter Chrome extension to generate our JMeter scripts.
  3. We have written a dedicated Jenkins pipeline to automate the execution of these JMeter scripts.
  4. We have also defined steps in the Jenkins script to distribute the execution status and log by email to chosen users.
  5. We also store the results in a configurable path for future reference.

All of this ensures better automation, greater flexibility and control over execution and notification, and more efficient performance testing as part of the CI/CD pipeline.

Setting Up BlazeMeter & Capturing the Test Scripts

To automate the process of writing scripts, we use the BlazeMeter tool. Navigate to Chrome extensions, search for BlazeMeter, and click "Add to Chrome". Also access the official BlazeMeter website and create an account there, as it is needed for the next steps.

Open Chrome; you will now find the BlazeMeter extension in the top right corner.

Click on the BlazeMeter Chrome extension and a toolbox will appear. Open the application where you want to record the scripts for JMeter and click the record button to start.

Navigate through the application and perform the operations that end users of the application would. Click the stop button to stop the recording.

BlazeMeter has now recorded the scripts. To save them in .jmx format, click Save, check the "JMeter only" box, and click Save as shown in the screenshot below.

For more information on how to record a JMeter script using BlazeMeter, follow the link

Modifying the Test Scripts in JMeter

The recorded script can then be opened in JMeter, and necessary changes can be made as per the different Load and Performance Scenarios to be assessed for the application.

Select the generated .jmx file and click Open.

In addition, you can add listeners to the thread groups for better visibility into each sample's results.

Setting up a Jenkins pipeline to execute JMeter tests

Install Jenkins: If you haven’t already, install Jenkins on your server following the official documentation.

Create a New Pipeline Job: Create a new pipeline job to orchestrate the performance testing process. Click on "New Item", select the "Pipeline" option, and give the job a name.

After creating the pipeline, navigate to Configure and specify the scheduled time in the format below.

('H 0 */3 * *')

Define Pipeline Script: Configure the pipeline script to execute the JMeter test at regular intervals using a cron expression.

pipeline {
    agent {
        label '{agent name}'
    }

This part of the Jenkins pipeline script specifies the agent (or node) where the pipeline should run. The label ‘{agent name}’ should be replaced with the label of the specific agent you want to use. This ensures that the pipeline will execute on a machine that matches the provided label.

    stages {
        stage('Running JMeter Scripts') {
            steps {
                script {
                    sh '''
                    output_directory="{path}/$(date +'%Y-%m-%d')"
                    mkdir -p "$output_directory"
                    cd {Jmeter Path}/bin
                    sh jmeter -n -t {Jmeter Script Path} -l "$output_directory/{Result file name}" -e -o "$output_directory"
                    cp "$output_directory"/{Result file name} $WORKSPACE
                    cp -r $output_directory $WORKSPACE
                    '''
                }
            }
        }
    }

This stage, named ‘Running JMeter Scripts’, has steps to execute a shell script. The script does the following:
1. Creates an output directory with the current date.
2. Navigates to the JMeter binary directory.
3. Runs the JMeter test script specified by {Jmeter Script Path}, storing the results in the created directory.
4. Copies the result file and output directory to the Jenkins workspace for archiving.

    post {
        always {
            script {
                // Default the build result if it has not been set explicitly
                currentBuild.result = currentBuild.result ?: 'SUCCESS'
                def date = sh(script: "date +'%Y-%m-%d'", returnStdout: true).trim()
                def subject = "${currentBuild.result}: Job '${env.JOB_NAME}'"
                def buildLog = currentBuild.rawBuild.getLog(1000)
                emailext(
                    subject: subject,
                    body: """Hi Team, the JMeter build has completed with status ${currentBuild.result}. Please contact the team for the results.""",
                    mimeType: 'text/html',
                    to: '{Receiver Email}',
                    from: '{Sender Email}',
                    attachLog: true
                )
            }
        }
    }

This post block runs after the pipeline completes. It retrieves the last 1000 lines of the build log and sends an email notification with the build result, a message, and the build log attached to specified recipients.

Viewing the generated reports

On the Linux instance, navigate to the path where the .html files, i.e. the output reports of the JMeter scripts, are stored.

Before you open the HTML file, move the complete folder to your local device. Once the folder is moved, open the .html file and you will be able to analyze the reports.
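If the Jenkins agent is a remote Linux machine, one simple way to pull the report down (the host and paths are placeholders) is with scp:

# Copy the dated report folder from the agent to the local machine
scp -r jenkins@<agent-host>:/path/to/output/<YYYY-MM-DD> ./jmeter-report

# Then open ./jmeter-report/index.html in a browser to analyze the results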

Conclusion

By following the steps described and the approach suggested, we have shown how integrating JMeter with Jenkins enables teams to automate performance testing and incorporate it seamlessly into the CI/CD pipeline. By scheduling periodic tests, storing results, and sending out email notifications, organizations can ensure the reliability and scalability of their applications with minimal manual intervention. Embrace the power of automation and elevate your performance testing efforts with Jenkins and JMeter integration. For any assistance in automating your performance tests, please get in touch with us at [email protected] or leave us a note via our contact form and we will get in touch with you.

Frequently Asked Questions (FAQs)

You can download the latest version from the Apache JMeter homepage. After downloading, extract the files to a directory on the agent machine and make sure to add the JMeter bin directory to the system's PATH variable so that JMeter commands can be run from the command line.
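For example, on a Linux agent the setup could look like the following; the version number is only an example, so use whichever release you have validated:

# Download and extract a JMeter release into /opt (example version)
wget https://downloads.apache.org/jmeter/binaries/apache-jmeter-5.6.3.tgz
tar -xzf apache-jmeter-5.6.3.tgz -C /opt

# Add the JMeter bin directory to the PATH of the Jenkins agent user
echo 'export PATH=$PATH:/opt/apache-jmeter-5.6.3/bin' >> ~/.bashrc
source ~/.bashrc
jmeter --version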

It's a Chrome extension that helps users record and create JMeter scripts easily. To use it, install the BlazeMeter extension from the Chrome Web Store, record the desired scenarios on your web application, and export the recorded script in .jmx format. This script can then be modified in JMeter and used in your Jenkins pipeline for automated performance testing.

Create a new pipeline job in Jenkins and define the pipeline script to include stages for running JMeter tests. The script should include steps to execute JMeter commands, store the results, and send notifications. Here's an example script:

pipeline {
    agent { label 'your-agent-label' }
    stages {
        stage('Run JMeter Tests') {
            steps {
                script {
                    sh '''
                    output_directory="/path/to/output/$(date +'%Y-%m-%d')"
                    mkdir -p "$output_directory"
                    cd /path/to/jmeter/bin
                    sh jmeter -n -t /path/to/test/script.jmx -l "$output_directory/results.jtl" -e -o "$output_directory"
                    cp "$output_directory/results.jtl" $WORKSPACE
                    cp -r "$output_directory" $WORKSPACE
                    '''
                }
            }
        }
    }
    post {
        always {
            emailext (
                subject: "JMeter Test Results",
                body: "The JMeter tests have completed. Please find the results attached.",
                recipientProviders: [[$class: 'DevelopersRecipientProvider']],
                attachLog: true
            )
        }
    }
}

You can schedule the JMeter test execution using cron syntax in the Jenkins pipeline configuration. For example, to run the tests every three hours, you can use:

H */3 * * *

The H token lets Jenkins pick a minute within the hour to spread load across jobs, so the pipeline will be triggered once every three hours.

After the JMeter tests are executed, the results are typically stored in a specified directory on the Jenkins agent. You can navigate to this directory, download the results to your local machine, and open the HTML reports generated by JMeter for detailed analysis.