
Simplified Log Monitoring using Swarmpit for a Docker Swarm Cluster

Introduction

In our earlier blog post A Simpler Alternative to Kubernetes – Docker Swarm with Swarmpit, we discussed how Docker Swarm, a popular open-source orchestration platform, combined with Swarmpit, a simple and intuitive open-source web UI for monitoring and managing Swarm, covers most of what one does with Kubernetes for simple applications, including deploying and managing containerized applications across a cluster of Docker hosts with clustering and load balancing.

Monitoring container logs efficiently is crucial for troubleshooting and maintaining application health in a Docker environment. In this article, we explore how Swarmpit makes this easy with real-time monitoring of service logs, strong filtering capabilities, and best practices for log management.

Challenges in Log Monitoring

Managing logs in a distributed environment like Docker Swarm presents several challenges:

    • Logs are distributed across multiple containers and nodes.
    • Accessing logs requires logging into individual nodes or using CLI tools.
    • Troubleshooting can be time-consuming without a centralized interface.

Swarmpit's Logging Features

Swarmpit simplifies log monitoring by providing a centralized UI for viewing container logs without accessing nodes manually.

Key Capabilities:

    • Real-Time Log Viewing: Monitor logs of running containers in real-time.
    • Filtering and Searching: Narrow down logs using service names, container IDs, or keywords.
    • Replica-Specific Logs: View logs for each replica in a multi-replica service.
    • Log Retention: Swarmpit retains logs for a configurable period based on system resources.

Accessing Logs in Swarmpit

To access logs in Swarmpit:

    1. Log in to the Swarmpit dashboard.
    2. Navigate to the Services or Containers tab.
    3. Select the service or container whose logs you need.
    4. Switch to the Logs tab to view real-time output.
    5. Use the search bar to filter logs by keywords or service names.
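For comparison, this is the same information you would otherwise have to collect from a manager node with the Docker CLI. A minimal sketch (the service name and time window are placeholders) of what Swarmpit's UI saves you from typing:

# follow the last 100 log lines of a Swarm service (run on a manager node)
docker service logs --follow --tail 100 <service-name>

# rough keyword filtering, similar to Swarmpit's search bar
docker service logs --since 30m <service-name> 2>&1 | grep -i "error"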

Filtering and Searching Logs

Swarmpit provides advanced filtering options to help you locate relevant log entries quickly.

    • Filter by Service or Container: Choose a specific service or container from the UI.
    • Filter by Replica: If a service has multiple replicas, select a specific replica’s logs.
    • Keyword Search: Enter keywords to find specific log messages.

Best Practices for Log Monitoring

To make the most of Swarmpit’s logging features:

    • Use Filters Effectively: Apply filters to isolate logs related to specific issues.
    • Monitor in Real-Time: Keep an eye on live logs to detect errors early.
    • Export Logs for Advanced Analysis: If deeper log analysis is required, consider integrating with external tools like Loki or ELK.
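For the last point, one hedged example of shipping Swarm service logs to Loki is Grafana's Docker log driver plugin (the plugin name and push endpoint below are the commonly documented ones; the Loki host is a placeholder for your environment):

# install the Loki logging driver plugin on each Swarm node
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

# create a service whose logs are shipped to Loki as well
docker service create --name web \
  --log-driver loki \
  --log-opt loki-url="http://<loki-host>:3100/loki/api/v1/push" \
  nginx:latest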

Conclusion

Swarmpit’s built-in logging functionality provides a simple and effective way to monitor logs in a Docker Swarm cluster. By leveraging its real-time log viewing and filtering capabilities, administrators can troubleshoot issues efficiently without relying on additional logging setups. While Swarmpit does not provide node-level logs, its service- and replica-specific log views offer a practical solution for centralized log monitoring.

This guide demonstrates how to utilize Swarmpit’s logging features for streamlined log management in a Swarm cluster.

Optimize your log management with the right tools. Get in touch to learn more!

Frequently Asked Questions (FAQs)

No, Swarmpit provides logs per service and per replica but does not offer node-level log aggregation.

Log retention depends on system resources and configuration settings within Swarmpit.

Yes, for advanced log analysis, you can integrate Swarmpit with tools like Loki or the ELK stack.

No, Swarmpit does not provide built-in log alerting. For alerts, consider integrating with third-party monitoring tools.

Yes, Swarmpit allows keyword-based filtering, enabling you to search for specific error messages.


Guide to enable Single Sign-On (SSO) between Grafana, GitLab and Jenkins using Keycloak

Introduction

As businesses scale their operations, centralised identity management becomes crucial to ensure security and operational efficiency. Keycloak, a powerful open-source Identity and Access Management (IAM) solution, simplifies this by offering Single Sign-On (SSO) and centralized authentication. This guide will walk you through integrating Keycloak with Grafana, GitLab, and Jenkins, streamlining user management and authentication.

In this guide, we will:

    • Configure Keycloak as the Identity Provider (IdP).
    • Set up OAuth2/OpenID Connect (OIDC) for Grafana, GitLab, and Jenkins.
    • Provide detailed troubleshooting tips and insights to avoid common pitfalls.

By following this guide, you’ll have a unified authentication system, ensuring that users can access Grafana, GitLab, and Jenkins with one set of credentials.

Keycloak Overview

Keycloak is an open-source Identity and Access Management (IAM) solution that provides SSO with modern security standards such as OpenID Connect, OAuth2, and SAML. It supports various identity providers (e.g., LDAP, Active Directory) and provides a rich user interface for managing realms, users, roles and permissions.

Architecture Diagram

This diagram illustrates the integration of Keycloak as a central identity and access management system. It facilitates Single Sign-On (SSO) for various external identity providers (IDPs) and internal applications.

Components:

    1. Keycloak
      • Acts as the core Identity Provider (IdP) for the architecture.
      • Manages authentication, authorization, and user sessions across connected services.
      • Facilitates Single Sign-On (SSO), allowing users to log in once and access multiple services seamlessly.
    2. External Identity Providers (IDPs)
      • Keycloak integrates with external IDPs for user authentication.
      • OpenID Connect (OIDC) and SAML protocols for federated identity management.
      • Google and Microsoft for social and enterprise logins.
      • Keycloak redirects user login requests to these external providers as configured, enabling users to authenticate with their existing credentials.
    3. Internal Applications
      • Keycloak provides SSO support for internal services.
      • GitLab for source code management.
      • Jenkins for continuous integration/continuous delivery (CI/CD) pipelines.
      • SonarQube for static code analysis and quality monitoring.
      • These applications delegate authentication to Keycloak. When users access these services, they are redirected to Keycloak for login.

Workflow:

    1. Authentication with External IDPs:
      • A user trying to log in to any service is redirected to Keycloak.
      • If configured, Keycloak can redirect the user to an external provider (e.g., Google or Microsoft) to authenticate.
      • Once authenticated, the user is granted access to the requested service.
    2. Single Sign-On (SSO):
      • Keycloak generates and manages an SSO session after successful authentication.
      • Users only need to log in once. All other integrated services (e.g., GitLab, Jenkins, SonarQube) can reuse the same session for access.
    3. Authorization:
      • Keycloak ensures proper access control by assigning roles and permissions.
      • Applications fetch user roles and permissions from Keycloak to enforce fine-grained authorization.

Keycloak's Key Features

    • Single Sign-On (SSO): Users log in once to access multiple services.
    • Centralized Authentication: Manage authentication for various services from a central location.
    • Federated Identity: Keycloak can authenticate users from LDAP, Active Directory or other identity providers.
    • Role-based Access Control (RBAC): Assign users specific roles and permissions across services.

Prerequisites

Before beginning, ensure you have the following:

    • Keycloak: Installed and running. You can deploy Keycloak using Docker, Kubernetes, or install it manually.
    • Grafana: Installed and running.
    • GitLab: Installed and running (either self-hosted or GitLab.com).
    • Jenkins: Installed and running.
    • SSL Certificates: It’s highly recommended to secure your services with SSL to ensure secure communication.

Ensure you have administrative access to each service for configuration purposes.

Part 1: Keycloak Setup

Step 1: Install and Access Keycloak

Keycloak can be deployed via Docker, Kubernetes, or manually on a server.

For Docker:

docker run -d \
  -p 8080:8080 \
  -e KEYCLOAK_ADMIN=admin \
  -e KEYCLOAK_ADMIN_PASSWORD=admin \
  quay.io/keycloak/keycloak:latest start-dev

# Note: recent quay.io/keycloak/keycloak images expect KEYCLOAK_ADMIN /
# KEYCLOAK_ADMIN_PASSWORD (the newest releases use KC_BOOTSTRAP_ADMIN_USERNAME /
# KC_BOOTSTRAP_ADMIN_PASSWORD) and the start-dev command; older images used
# KEYCLOAK_USER / KEYCLOAK_PASSWORD.

For Kubernetes, you can deploy Keycloak with Helm or a custom Kubernetes manifest. Make sure Keycloak is accessible to Grafana, GitLab, and Jenkins, either through DNS or IP.
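As one hedged sketch of a Helm-based install (using the Bitnami chart as an example; the chart, release name, and credentials below are assumptions, not the only option):

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install keycloak bitnami/keycloak \
  --set auth.adminUser=admin \
  --set auth.adminPassword=admin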

Step 2: Create a New Realm

A Realm in Keycloak is a space where you manage users, roles, clients, and groups. Each realm is isolated, meaning users and configurations don’t overlap between realms.

    • From the Keycloak admin console, click on the top-left dropdown and select Add Realm.
    • Name the realm (e.g., DevOpsRealm) and click Create.
    • Ensure you select an appropriate Display Name and Internationalization settings if your user base spans different regions.

Step 3: Add Users and Assign Roles

You can now create users and assign them roles:

    • Navigate to Users -> Add User.
    • Fill in the user details and click Save.
    • After the user is created, go to the Credentials tab and set the password.
    • You can also manage roles under the Role Mappings tab to define custom permissions.
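The same realm and user setup can also be scripted with Keycloak's admin CLI (kcadm.sh, shipped with the Keycloak distribution; run it inside the Keycloak container, for example via docker exec). A minimal sketch, with names and passwords as placeholders:

# authenticate the CLI against the master realm
/opt/keycloak/bin/kcadm.sh config credentials \
  --server http://localhost:8080 --realm master \
  --user admin --password admin

# create the realm and a user, then set the user's password
/opt/keycloak/bin/kcadm.sh create realms -s realm=DevOpsRealm -s enabled=true
/opt/keycloak/bin/kcadm.sh create users -r DevOpsRealm \
  -s username=jdoe -s enabled=true -s email=jdoe@example.com
/opt/keycloak/bin/kcadm.sh set-password -r DevOpsRealm \
  --username jdoe --new-password 'ChangeMe123!'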

Step 4: Configure Clients (Grafana, GitLab, Jenkins)

Clients in Keycloak represent the applications that will use Keycloak for authentication. For each service (Grafana, GitLab, Jenkins), we’ll create a client that uses the OIDC protocol.

    • Grafana Client Setup:

      1. Go to Clients -> Create.
      2. Set the Client ID to grafana, and choose OpenID Connect as the protocol.
      3. Set the Root URL to http://<grafana-host>, and the Redirect URIs to http://<grafana-host>/login/generic_oauth.
      4. Enable the following flows: Standard Flow, Implicit Flow (if needed), and Direct Access Grants.
      5. Save the client and navigate to the Credentials tab to copy the generated Client Secret.
    • GitLab Client Setup:
      1. Go to Clients -> Create.
      2. Set the Client ID to gitlab and choose OpenID Connect.
      3. Set the Redirect URI to http://<gitlab-host>/users/auth/openid_connect/callback.
      4. Enable Standard Flow and Implicit Flow if GitLab requires it.
      5. Save the client and copy the Client Secret from the Credentials tab.
    • Jenkins Client Setup:
      1. Add a new client for Jenkins, following the same steps as above.
      2. Set the Redirect URI to http://<jenkins-host>/securityRealm/finishLogin.
      3. From the client's Action menu, export the client JSON (adapter config) file; it will be used to configure the Jenkins plugin in the next steps.
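If you prefer to script client creation instead of clicking through the console, a hedged kcadm.sh sketch for the Grafana client (the GitLab and Jenkins clients follow the same pattern; URLs are placeholders) looks like this:

/opt/keycloak/bin/kcadm.sh create clients -r DevOpsRealm \
  -s clientId=grafana \
  -s protocol=openid-connect \
  -s publicClient=false \
  -s standardFlowEnabled=true \
  -s directAccessGrantsEnabled=true \
  -s rootUrl='http://<grafana-host>' \
  -s 'redirectUris=["http://<grafana-host>/login/generic_oauth"]'

# look up the client id, then read the generated secret from its credentials
/opt/keycloak/bin/kcadm.sh get clients -r DevOpsRealm -q clientId=grafana --fields id,clientId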

Part 2: Integrating Keycloak with Grafana

Step 1: Configure Grafana for OAuth2

Grafana supports OAuth2 natively, and Keycloak uses OpenID Connect as the protocol for OAuth2.

    • Grafana’s Configuration: Locate the grafana.ini file (or environment variables for a Docker container) and configure OAuth2 settings: 

[auth.generic_oauth]
enabled = true
client_id = grafana
client_secret = <Grafana_Client_Secret>
auth_url = http://<keycloak-host>/realms/<Your_Realm>/protocol/openid-connect/auth
token_url = http://<keycloak-host>/realms/<Your_Realm>/protocol/openid-connect/token
api_url = http://<keycloak-host>/realms/<Your_Realm>/protocol/openid-connect/userinfo
scopes = openid profile email

    • Restart Grafana: After updating the configuration, restart Grafana:

For Manual:

systemctl restart grafana-server

For Docker:

docker restart <grafana-container-name>
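If Grafana runs as a Docker container, the same [auth.generic_oauth] settings can be supplied as environment variables instead of editing grafana.ini, using Grafana's GF_<SECTION>_<KEY> naming convention. A minimal sketch with placeholder values:

docker run -d -p 3000:3000 \
  -e GF_AUTH_GENERIC_OAUTH_ENABLED=true \
  -e GF_AUTH_GENERIC_OAUTH_CLIENT_ID=grafana \
  -e GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET=<Grafana_Client_Secret> \
  -e GF_AUTH_GENERIC_OAUTH_AUTH_URL=http://<keycloak-host>/realms/<Your_Realm>/protocol/openid-connect/auth \
  -e GF_AUTH_GENERIC_OAUTH_TOKEN_URL=http://<keycloak-host>/realms/<Your_Realm>/protocol/openid-connect/token \
  -e GF_AUTH_GENERIC_OAUTH_API_URL=http://<keycloak-host>/realms/<Your_Realm>/protocol/openid-connect/userinfo \
  -e "GF_AUTH_GENERIC_OAUTH_SCOPES=openid profile email" \
  grafana/grafana:latest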

Part 3: Integrating Keycloak with GitLab

GitLab supports OIDC natively for SSO, making it straightforward to integrate with Keycloak.

Step 1: Configure GitLab for OpenID Connect

    • GitLab Configuration: Open GitLab’s gitlab.rb configuration file and enable OIDC:

gitlab_rails['omniauth_enabled'] = true
gitlab_rails['omniauth_providers'] = [
  {
    name: "openid_connect",
    label: "Keycloak",
    args: {
      name: "openid_connect",
      scope: ["openid", "profile", "email"],
      response_type: "code",
      issuer: "http://<keycloak-host>/realms/<Your_Realm>",
      discovery: true,
      uid_field: "preferred_username",
      client_auth_method: "basic",
      client_options: {
        identifier: "<GitLab_Client_ID>",
        secret: "<GitLab_Client_Secret>",
        redirect_uri: "http://<gitlab-host>/users/auth/openid_connect/callback"
      }
    }
  }
]

    • Reconfigure GitLab: After editing gitlab.rb, run:

For Manual:

sudo gitlab-ctl reconfigure

For Docker:

docker restart <gitlab-container-name>

Step 2: Test the Integration

Log out of GitLab, then visit the login page. You should see a “Sign in with Keycloak” option. Clicking it will redirect you to Keycloak for authentication, and once logged in, you’ll be redirected back to GitLab.

Troubleshooting GitLab Integration:

    • Invalid UID Field: If authentication succeeds but user data is incorrect, check that uid_field in the gitlab.rb configuration is set to preferred_username.
    • Client Secret Mismatch: Ensure the client_secret in GitLab matches the secret in Keycloak’s client settings.
    • Invalid Token Response: If you get errors related to invalid tokens, ensure your GitLab instance is reachable from Keycloak and that your OAuth2 flow is configured correctly in Keycloak.
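A quick way to verify that Keycloak is reachable and the realm name is correct (useful for several of the issues above) is to fetch the realm's OIDC discovery document; host and realm are placeholders:

curl -s http://<keycloak-host>/realms/<Your_Realm>/.well-known/openid-configuration
# returns a JSON document listing issuer, authorization_endpoint, token_endpoint, etc.;
# the issuer value should match what you configured in gitlab.rb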

Part 4: Integrating Keycloak with Jenkins

Step 1: Install and Configure Keycloak Plugin

    • Install the Plugin: In Jenkins, go to Manage Jenkins -> Manage Plugins, search for the Keycloak OpenID Connect Plugin, and install it.
    • Configure Keycloak in Jenkins:
      1. Go to Manage Jenkins -> Configure Security.
      2. Under Security Realm, choose Keycloak Authentication.
      3. In the Keycloak JSON field, provide the client JSON file exported earlier from the Keycloak client configuration step.
    • Save and Test the Configuration: Log out of Jenkins and test the login with Keycloak.

Troubleshooting Jenkins Integration:

    • Redirect URI Mismatch: Double-check that the redirect URI set in Keycloak matches Jenkins’ redirect path (/securityRealm/finishLogin).
    • Access Denied Errors: Ensure users have been assigned roles in Keycloak that map to appropriate Jenkins permissions.
    • Plugin Issues: If the plugin behaves unexpectedly, verify you have the latest version installed and that Jenkins’ logs provide more detailed errors.

Note: All three applications are now integrated with Keycloak, so once you log in to any one of them through Keycloak, the same session lets you access the other two without logging in again.

Conclusion

Integrating Keycloak with Grafana, GitLab, and Jenkins provides a unified and secure authentication system. This setup reduces complexity, enhances security, and improves user experience by enabling SSO across your DevOps tools.

Simplify authentication and enhance security across your systems. Get started today!

Frequently Asked Questions (FAQs)

Keycloak is an open-source Identity and Access Management (IAM) solution that supports Single Sign-On (SSO) for applications. It provides centralized user management, secure authentication, and protocol support like OAuth2 and OpenID Connect. By using Keycloak, you can streamline access control and enhance security across multiple applications, reducing the need for managing credentials separately in each tool.

Yes, Keycloak supports integration with multiple applications using different authentication protocols like OpenID Connect, OAuth2, and SAML. You can create separate clients in Keycloak for each application (Grafana, GitLab, Jenkins), all managed under a single realm.

Yes, Grafana, GitLab, and Jenkins can all be integrated with Keycloak using the OpenID Connect (OIDC) protocol. This ensures consistent and secure authentication across all services using Keycloak as the Identity Provider.

Permissions and roles can be managed both within Keycloak and each application. In Keycloak, you can assign roles to users, which can be passed to the application during authentication. Each application (Grafana, GitLab, Jenkins) can then map these roles to their internal permission models. For example, in Grafana, you can map Keycloak roles to Grafana’s organization roles.

A "redirect URI mismatch" error occurs when the redirect URI specified in Keycloak doesn’t match the one configured in the client application. Ensure that the redirect URI in the client configuration in Keycloak matches exactly with what’s defined in Grafana, GitLab, or Jenkins, including protocols (http vs https), port numbers, and path names.

If you receive an "invalid client ID or secret" error, verify the following:

  • Ensure that the client ID and client secret entered in the application’s configuration (Grafana, GitLab, Jenkins) match exactly what is defined in Keycloak for that particular client.
  • Regenerate the client secret in Keycloak if necessary and update it in the application configuration.
  • Check that there are no typos in the configuration files, and ensure the Keycloak client is enabled.

Yes, Keycloak allows customization of its login page. You can modify themes, change the logo, colors, and layout, or create a completely custom login page by uploading a custom theme to Keycloak. This allows you to offer a consistent branding experience for your users across different applications.

If user information is not being retrieved correctly (e.g., email, profile data), ensure that:

  • The scopes (openid, profile, email) are configured correctly in the client settings for each application.
  • The User Info endpoint is accessible, and there are no network restrictions blocking it.
  • The appropriate claims are being released from Keycloak. You may need to check the Mappers in Keycloak’s client configuration to ensure that the correct user attributes are being sent.

Keycloak handles user sessions centrally. For Grafana, GitLab, and Jenkins, ensure that:

  • Session timeouts and token expiration settings in Keycloak are configured appropriately to match the expected behavior in your applications.
  • Token introspection is working, allowing the applications to verify tokens issued by Keycloak.
  • If users are logged out from Keycloak, they are also logged out from the integrated applications (SSO logout).

Why Choose Grafana? A Comprehensive Alerting System Compared to Nagios, Datadog, Elastic Stack, and Splunk

With the emerging world of data, monitoring and alerting systems have become vital to maintaining the reliability of applications and infrastructure. With tons of data produced every second, businesses require powerful tools to present this data and react quickly to incidents. Grafana, the leading open-source technology for monitoring and observability, is already among the most popular tools teams use to convert raw data into actionable insights.

In this article, we discuss how Grafana serves as a complete alerting system and how it compares to other monitoring tools, including Nagios, Datadog, Elastic Stack, and Splunk. By weighing the pros and cons of each, readers can form a better sense of what Grafana is capable of and whether it is the right choice for their organization's monitoring needs.

What is Grafana?

Grafana is a free, open-source platform for monitoring, visualization, and analytics. It allows users to turn large and complex datasets into informative visualizations through interactive and customizable dashboards. With Grafana, applications, services, and infrastructure can be monitored in real time through a user-friendly interface, which makes it work exceedingly well for DevOps and IT operations teams.

It can work with an extensive range of data sources, such as Prometheus, InfluxDB, MySQL, Elasticsearch, etc. This means that adding Grafana to your stack allows you to see data from many systems on a single dashboard.

Additionally, dashboards in Grafana are also customizable, allowing users to present their data according to their requirements and preferences. Bar graphs, Heatmaps, and tables are some of the forms that can be used for data presentation depending on the metrics.

Also, Grafana's development is aided by an active user community that contributes plugins and writes extensive documentation, so users are never left without installation guides and help.

Overview of Competing Tools

Nagios

Nagios is an infrastructure monitoring solution that has been in operation for many years. One of its strongest points is alerting, which lets users keep track of application health, network performance, and server uptime. While its alerting framework is robust, the Nagios GUI is tedious to navigate, and configuring alerts often involves a large manual workload.

Datadog

Datadog is relatively new on the market, offering a fully cloud-based application performance management and monitoring solution that provides visibility across your applications and infrastructure in their entirety. It integrates well with many cloud platforms, letting users gather metrics, logs, and traces within seconds, and its powerful alerting system keeps an eye on application performance and infrastructure at any given moment. Unfortunately, its pricing model can be a deterrent for small organizations, as costs rise quickly once usage of the service increases.

Elastic Stack (ELK)

Elastic Stack, colloquially known as ELK, encompasses three tools – Elasticsearch, Logstash, and Kibana – enabling both analysis and visualization of logs. Its capabilities particularly shine when processing large amounts of log data, which makes ELK a preferred choice for organizations with a log-centric focus. You can also set up pattern-based alerts via the ElastAlert tool or Kibana's native alerting features. It is worth noting, though, that keeping the ELK stack up and running requires significant expertise and resources.

Splunk

Turning our attention to Splunk, it currently reigns in the field of operational intelligence, providing real-time analysis of and insights into machine-generated data. What puts Splunk at the forefront is that it combines effective analytics with an advanced search feature, allowing organizations to reduce their incident response time significantly. Splunk's alerting system is intricate: setting up alerts based on complex queries is doable, but customization is not always straightforward. The downside is a notoriously expensive licensing model, which makes scaling a challenge when pushing the platform to its limits.

 

| Feature | Grafana | Nagios | Datadog | Elastic Stack (ELK) | Splunk |
| --- | --- | --- | --- | --- | --- |
| Primary Focus | Visualization, monitoring, and alerting across multiple data sources | Infrastructure monitoring and alerting | Cloud-native monitoring and observability | Log analysis, search, and visualization | Data analysis, real-time insights, and log management |
| Alerting System | Unified alerting, customizable alerts across multiple data sources | Static alerts, tied to host/service monitoring | Extensive, integrated alerting for metrics and logs | Basic alerting, with additional configuration required | Advanced alerting with machine learning (ML) models |
| Ease of Use | User-friendly, intuitive UI, low learning curve | Complex setup, higher learning curve | Easy-to-use, modern cloud-native interface | Moderate complexity; requires an Elasticsearch setup | Complex; requires a deep understanding of configuration |
| Customization | Highly customizable dashboards; supports multiple data sources | Less flexible, more manual configuration | Highly customizable, SaaS-based with many options | High level of customization through Elasticsearch and Kibana | Highly customizable, but complex for large deployments |
| Notification Channels | Supports multiple channels: Slack, PagerDuty, email, Teams | Limited notification options, typically email | Broad range: Slack, email, PagerDuty, webhooks | Limited out-of-the-box, requires integration | Advanced notification options, but complex to set up |
| Integration | Integrates with Prometheus, Loki, Elasticsearch, MySQL, and PostgreSQL | Integrates with third-party plugins and services | Extensive cloud integration (AWS, Azure, GCP) | Deep integration with Elasticsearch for log analysis | Supports numerous services, but complex integration process |
| Visualization | Best-in-class, customizable, interactive dashboards | Limited visualization capabilities | Advanced visualizations for metrics, logs, traces | Good visualization through Kibana, customizable | Strong visualization, but more for advanced users |
| Scalability | Scales well with time-series databases like Prometheus | Limited scalability without external tools | Excellent scalability for cloud and large deployments | Scales with Elasticsearch clusters | Scales well, but resource-intensive and costly |
| Community and Support | Strong open-source community, rich plugin ecosystem | Large community, but more focused on plugins | Strong support via SaaS, extensive documentation | Active open-source community, with paid support options | Strong enterprise support, but costs are high |
| Cost | Open-source, with paid enterprise options | Open-source, but requires third-party integrations for advanced features | Paid SaaS, usage-based pricing | Open-source (free), paid enterprise options available | Expensive enterprise-level pricing |

Grafana’s Alerting Features

Grafana's alerting system is robust and flexible, and it significantly strengthens the work of monitoring teams. Central to this functionality are alert rules, which let users impose conditions on metrics, logs, and traces. With the Unified Alerting system introduced in Grafana 8, users can manage all of their alerts from a single point while fetching data from sources such as Prometheus, Loki, and Elasticsearch. This single point of management is far more effective than setups where alerting is tightly coupled to individual tools, as with Nagios, which tends to result in a fragmented monitoring setup.

Grafana can alert a wide array of users, and its alerting setup is another factor that sets it apart. Through the Prometheus integration, users can write advanced PromQL queries to build rich conditions on their metrics, which can then be used for alerts. Grafana also integrates with many notification channels such as PagerDuty, Slack, Microsoft Teams, and webhooks, among others. This diversity helps teams fold alerts into their communication workflows however it suits them best.

With Grafana’s contextual alerting, usability improves because alerts are bound to the relevant graphs. A graph can be displayed whenever an alert is triggered, which makes diagnosing the issue easier, and Grafana automatically annotates events on graphs so users know when and where the problem occurred.

The platform also reduces alert fatigue through alert grouping and deduplication. Several related alerts can be sent as one notification instead of a barrage of messages, which greatly improves communication and cuts out noise. With Prometheus, alert deduplication means users are not burdened with constant notifications of the same type.

Overall, Grafana's simple interface makes setting up alerts easy for new users without compromising the experience of professionals. The platform's API is also very helpful for automation, as it allows integration with CI/CD pipelines and modern DevOps processes. This ease of use, combined with powerful alerting features, is what makes Grafana a good fit for any company in need of a deep monitoring system.
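As one hedged illustration of that automation angle, recent Grafana versions expose the rules managed by Unified Alerting through an HTTP provisioning API, so a pipeline can inspect or export them (the host and API token below are placeholders):

# list alert rules through Grafana's alerting provisioning API
curl -s -H "Authorization: Bearer <grafana-api-token>" \
  http://<grafana-host>:3000/api/v1/provisioning/alert-rules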

Comparison of Alerting Systems

User Interface and Experience

Grafana offers a better experience in terms of user interface because its design is friendlier. The platform is laid out so that navigating its features is easy, which makes setup and management straightforward even for those unfamiliar with monitoring tools. This stands in stark contrast to systems such as Nagios and Splunk, which are powerful but come with a complexity that requires considerable time and effort to master. For example, setting up alerts in Nagios frequently requires sifting through configuration files and using a command-line interface, which can be intimidating for users with a limited technical background. Splunk, on the other hand, is a great tool, but it requires teams to learn its query language and configuration, which slows down initial deployment.

Alert Configuration

Grafana improves alert management via a single unified system that lets users define alert notifications based on any visualization or metric from multiple data sources. Thanks to Prometheus and PromQL, complex alert conditions can be expressed easily and adjusted as conditions change. Nagios, by contrast, requires every alert to be configured manually, which is slow and invites errors. Splunk is more advanced, but the way search queries need to be configured can be more than most users want to deal with.

Notification Flexibility

Grafana shines in its flexibility in notification channels, where it supports an enormous number of notification channels: email, Slack, PagerDuty, and Microsoft Teams, among others. This means alerts can be pushed to the correct teams in real-time, enabling prompt responses to issues. Datadog and Elastic Stack also offer several options for notifications; however, being open-source, Grafana offers far more customization and integration opportunities. For organizations looking to customize their alerting processes in order to fit with particular workflows, Grafana is a more flexible solution.

Scalability

Scalability is an important factor for an organization, particularly as monitoring needs grow very large. In this respect, Grafana's architecture scales effectively with data sources such as Prometheus that are designed for high availability and scalability, which makes it well suited to large environments. Datadog is also scalable but becomes expensive as data volumes grow, and the Elastic Stack, while built with robust scalability options, can be complex to manage. What Grafana does well is deliver powerful monitoring features without incurring the high costs of some cloud-native products.

Integration with Existing Systems

Grafana's biggest advantage is its seamless integration with systems an organization may already have in place. It supports multiple data sources such as Prometheus, Loki, Elasticsearch, and many more, so existing tools can be used with little disruption, and teams can view and alert on data from different sources in one place. Whereas Nagios and Splunk, for example, tend to be siloed products that only work with other systems after additional configuration or separate instances, Grafana's APIs and webhooks increase its flexibility for deployment into varied workflows. This makes it appealing to teams that want to enhance their monitoring and alerting capabilities without overhauling their existing processes.

Conclusion

Grafana’s alerting system provides a great balance of visual appeal, flexibility, and ease of use, which makes it a great contender for organizations looking to enhance their monitoring capabilities. The unified alerting framework, support for multiple data sources, and rich community support help teams manage alerts effectively while reducing noise and improving response times.

Grafana can be a perfect choice for those organizations that are searching for an all-around monitoring stack. It complements other tools that are very strong in areas like log analysis or deep metrics monitoring. This way, the organization can get an even more complete solution for varied operational needs.

Empower Your Business with Data-Driven Insights. Reach out to us today!

Frequently Asked Questions (FAQs)

Grafana focuses on data visualization and dashboarding, whereas Nagios is more traditional in the system and network monitoring. Grafana supports multi-source data integration and is highly customizable for visual representations, while Nagios excels in alerting and plugin-based monitoring.

You can customize Grafana dashboards by adding panels for different data sources, applying custom queries, and using built-in or third-party plugins to tailor the visualization. The layout and metrics displayed can be configured to suit your monitoring preferences.

Prometheus can be easily integrated with Grafana as a data source. Grafana pulls data from Prometheus using its API and allows you to create dynamic dashboards based on the time-series data collected by Prometheus.

Some popular plugins include the Zabbix plugin for Zabbix integration, Loki for log monitoring, and the Pie Chart plugin for specific visualizations. Plugins for cloud services like AWS and Google Cloud also extend Grafana's monitoring capabilities.

Splunk is a powerful tool for log analysis and management, with advanced search capabilities. Grafana, on the other hand, excels at real-time data visualization from multiple sources and is often used alongside Prometheus for metrics monitoring. Splunk is more enterprise-focused, while Grafana offers flexibility in open-source and cloud-native environments.

Unified Alerting is Grafana's system for managing alerts across multiple data sources. You can configure it by setting up conditions for metrics or logs that trigger alerts, and by defining notification channels like email, Slack, or webhook for sending alerts.

Grafana allows you to connect and visualize data from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, and more, all in one dashboard. You can configure queries for each data source and display the results in different panels on the same dashboard.

The ELK stack (Elasticsearch, Logstash, Kibana) is a comprehensive solution for log aggregation, search, and analysis, whereas Grafana is primarily focused on metrics and visualization. Grafana, when used with Loki, can handle logs but lacks the deep search and indexing capabilities of ELK.

Grafana is often preferred for open-source, highly customizable setups, while Datadog offers an all-in-one solution with built-in metrics, logs, and traces for cloud environments. Datadog provides a more managed experience but at a higher cost.


Why choose Apache Druid over Snowflake

Introduction

In our previous blog, Apache Druid Integration with Apache Superset we talked about Apache Druid’s integration with Apache Superset. In case you have missed it, it is recommended you read it before continuing the Apache Druid blog series. In the Exploring Apache Druid: A High-Performance Real-Time Analytics Database blog post, we have talked about Apache Druid in more detail.

In this blog, we talk about the advantages of Druid over Snowflake. Apache Druid and Snowflake are both high-performance databases, but they serve different use cases, with Druid excelling in real-time analytics and Snowflake in traditional data warehousing.

Snowflake

Snowflake is a cloud-based data platform and data warehouse service that has gained significant popularity due to its performance, scalability, and ease of use. It is built from the ground up for the cloud and offers a unified platform for data warehousing, data lakes, data engineering, data science, and data application development.

Apache Druid

Apache Druid is a high-performance, distributed real-time analytics database designed to handle fast, interactive queries on large-scale datasets, especially those with a time component. It is widely used for powering business intelligence (BI) dashboards, real-time monitoring systems, and exploratory data analytics.

Here are the advantages of Apache Druid over Snowflake, particularly in real-time, time-series, and low-latency analytics:

(a) Real-Time Data Ingestion

    • Apache Druid: Druid is designed for real-time data ingestion, making it ideal for applications where data needs to be available for querying as soon as it’s ingested. It can ingest data streams from sources like Apache Kafka or Amazon Kinesis and make the data immediately queryable.
    • Snowflake: Snowflake is primarily a batch-processing data warehouse. While Snowflake can load near real-time data using external tools or connectors, it is not built for real-time streaming data ingestion like Druid.

Advantage: Druid is superior for real-time analytics, especially for streaming data from sources like Kafka and Kinesis.
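As a rough sketch of what that looks like in practice, a trimmed-down Kafka ingestion supervisor spec can be submitted to Druid's indexer API (the router address, topic, broker, datasource, and column names here are assumptions):

curl -X POST -H 'Content-Type: application/json' \
  http://<druid-router>:8888/druid/indexer/v1/supervisor \
  -d '{
    "type": "kafka",
    "spec": {
      "ioConfig": {
        "type": "kafka",
        "topic": "clickstream-events",
        "consumerProperties": { "bootstrap.servers": "<kafka-broker>:9092" }
      },
      "dataSchema": {
        "dataSource": "clickstream",
        "timestampSpec": { "column": "timestamp", "format": "iso" },
        "dimensionsSpec": { "useSchemaDiscovery": true },
        "granularitySpec": { "segmentGranularity": "hour", "queryGranularity": "none" }
      },
      "tuningConfig": { "type": "kafka" }
    }
  }'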

(b) Sub-Second Query Latency for Interactive Analytics

    • Apache Druid: Druid is optimized for sub-second query latency, which makes it highly suitable for interactive dashboards and ad-hoc queries. Its columnar storage format and advanced indexing techniques (such as bitmap and inverted indexes) enable very fast query performance on large datasets, even in real time.
    • Snowflake: Snowflake is highly performant for large-scale analytical queries, but it is designed for complex OLAP (Online Analytical Processing) queries over historical data, and query latency may be higher for low-latency, real-time analytics, particularly for streaming or fast-changing datasets.

Advantage: Druid offers better performance for low-latency, real-time queries in use cases like interactive dashboards or real-time monitoring.

(c) Time-Series Data Optimization

    • Apache Druid: Druid is natively optimized for time-series data, with its architecture built around time-based partitioning and segmenting. This allows for efficient querying and storage of event-driven data (e.g., log data, clickstreams, IoT sensor data).
    • Snowflake: While Snowflake can store and query time-series data, it does not have the same level of optimization for time-based queries as Druid. Snowflake is better suited for complex, multi-dimensional queries rather than the high-frequency, time-stamped event queries where Druid excels.

Advantage: Druid is more optimized for time-series analytics and use cases where data is primarily indexed and queried based on time.

(d) Streaming Data Support

    • Apache Druid: Druid has native support for streaming data platforms like Kafka and Kinesis, enabling direct ingestion from these sources and offering real-time visibility into data as it streams in.
    • Snowflake: Snowflake supports streaming data through tools like Snowpipe, but it works better with batch data loading or micro-batch processing. Its streaming capabilities are generally less mature and lower-performing compared to Druid’s real-time streaming ingestion.

Advantage: Druid has stronger native streaming data support and is a better fit for real-time analytics from event streams.

(e) Low-Latency Aggregations

    • Apache Druid: Druid excels in performing low-latency aggregations on event data, making it ideal for use cases that require real-time metrics, summaries, and rollups. For instance, it’s widely used in monitoring, fraud detection, ad tech, and IoT, where data must be aggregated and queried in near real-time.
    • Snowflake: While Snowflake can aggregate data, it is more optimized for batch-mode processing, where queries are run over large datasets. Performing continuous, real-time aggregations would require external tools and more complex architectures.

Advantage: Druid is better suited for real-time aggregations and rollups on streaming or event-driven data.

(f) High Cardinality Data Handling

    • Apache Druid: Druid’s advanced indexing techniques, like bitmap indexes and sketches (e.g., HyperLogLog), allow it to efficiently handle high-cardinality data (i.e., data with many unique values). This is important for applications like ad tech where unique user IDs, URLs, or click events are frequent.
    • Snowflake: Snowflake performs well with high-cardinality data in large-scale analytical queries, but its query execution model is generally more suited to aggregated batch processing rather than fast, high-cardinality lookups and filtering in real time.

Advantage: Druid has better indexing for real-time queries on high-cardinality datasets.

(g) Real-Time Analytics for Operational Use Cases

    • Apache Druid: Druid was built to serve operational analytics use cases where real-time visibility into systems, customers, or events is critical. Its ability to handle fast-changing data and return instant insights makes it ideal for powering monitoring dashboards, business intelligence tools, or real-time decision-making systems.
    • Snowflake: Snowflake is an excellent data warehouse for historical and batch-oriented analytics but is not optimized for operational or real-time analytics where immediate insights from freshly ingested data are needed.

Advantage: Druid is better suited for operational analytics and real-time, event-driven systems.

(h) Cost-Effectiveness for Real-Time Workloads

    • Apache Druid: Druid’s open-source nature means there are no licensing fees, and you only pay for the infrastructure it runs on (if you use a managed service or deploy it on the cloud). For organizations with significant real-time analytics workloads, Druid can be more cost-effective than cloud-based data warehouses, which charge based on storage and query execution.
    • Snowflake: Snowflake’s pricing is based on compute and storage usage, with charges for data loading, querying, and storage. For continuous, high-frequency querying (such as real-time dashboards), these costs can add up quickly.

Advantage: Druid can be more cost-effective for real-time analytics, especially in high-query environments with constant data ingestion.

(i) Schema Flexibility and Semi-Structured Data Handling

    • Apache Druid: Druid supports schema-on-read and is highly flexible in terms of handling semi-structured data such as JSON or log formats. This flexibility is particularly useful for use cases where the schema may evolve over time or when working with less structured data types.
    • Snowflake: Snowflake also handles semi-structured data like JSON, but it requires more structured schema management compared to Druid’s flexible schema handling, which makes Druid more adaptable to changes in data format or structure.

Advantage: Druid offers greater schema flexibility for semi-structured data and evolving datasets.

(j) Open-Source and Vendor Independence

    • Apache Druid: Druid is open-source, which gives users full control over deployment, management, and scaling without being locked into a specific vendor. This makes it a good choice for organizations that want to avoid vendor lock-in and have the flexibility to self-manage or choose different cloud providers.
    • Snowflake: Snowflake is a proprietary, cloud-based data warehouse. While Snowflake offers excellent cloud capabilities, users are tied to Snowflake’s platform and pricing model, which may not be ideal for organizations preferring more control or customization in their infrastructure.

Advantage: Druid provides more freedom and control as an open-source platform, allowing for vendor independence.

When to Choose Apache Druid over Snowflake

  • Real-Time Streaming Analytics: If your use case involves high-frequency event data or real-time streaming analytics (e.g., user behavior tracking, IoT sensor data, or monitoring dashboards), Druid is a better fit.
  • Interactive, Low-Latency Queries: For interactive dashboards requiring fast response times, especially with frequently updated data, Druid’s sub-second query performance is a significant advantage.
  • Time-Series and Event-Driven Data: Druid’s architecture is designed for time-series data, making it superior for analyzing log data, time-stamped events, and similar data.
  • Operational Analytics: Druid excels in operational analytics where real-time data ingestion and low-latency insights are needed for decision-making.
  • Cost-Effective Real-Time Workloads: For continuous real-time querying and analysis, Druid’s cost structure may be more affordable compared to Snowflake’s compute-based pricing.

Data Ingestion, Data Processing, and Data Querying

Apache Druid

Data Ingestion and Data Processing

Apache Druid is a high-performance real-time analytics database designed for fast querying of large datasets. One common use case is ingesting a CSV file into Druid to enable interactive analysis. In this guide, we will walk through each step to upload a CSV file to Apache Druid, covering both the Druid Console and the ingestion configuration details.

Note: We have used the default configuration settings that come with the trial edition of Apache Druid and Snowflake for this exercise. Results may vary depending on the configuration of the applications.

Prerequisites:

(a) Ensure your CSV file is:

    • Properly formatted, with headers in the first row.
    • Clean of inconsistencies (e.g., missing data or malformed values).
    • Stored locally or accessible via a URL if uploading from a remote location.

(b) Launch the Apache Druid Console.

(c) Start your Apache Druid cluster if it is not already running.

(d) Open the Druid Console by navigating to its URL (default: http://localhost:8888).

(e) Load Data: Select the Datasources Grid.

Create a New Data Ingestion Task

(a) Navigate to the Ingestion Section:

    • In the Druid Console, click on Data in the top navigation bar.
    • Select Load data to start a new ingestion task.

(b) Choose the Data Source:

    • Select Local disk if the CSV file is on the same server as the Druid cluster.
    • Select HTTP(s) if the file is accessible via a URL.
    • Choose Amazon S3, Google Cloud Storage, or other options if the file is stored in a cloud storage service.

(c) Upload or Specify the File Path:

    • For local ingestion: Provide the absolute path to the CSV file.
    • For HTTP ingestion: Enter the file’s URL.

(d) Start a New Batch Spec: We are using a new file for processing.

(e) Connect and parse raw data: Select Local Disk option.

(f) Specify the Data: Select the base directory of the file and choose the file type you are using to process. 

  •  File Format:
    • Select CSV as the file format.
    • If your CSV uses a custom delimiter (e.g., ;), specify it here.
  • Parse Timestamp Column:
    • Druid requires a timestamp column to index the data.
    • Select the column containing the timestamp (e.g., timestamp).
    • Specify the timestamp format if it differs from ISO 8601 (e.g., yyyy-MM-dd HH:mm:ss).
  •  Preview Data:
    • The console will show a preview of the parsed data.
    • Ensure that all columns are correctly identified.

(g) Connect: Connect details.

(h) Parse Data: At this stage, you should be able to see the data in Druid to parse it. 

(i) Parse Time: At this stage, you should be able to see the data in Druid to parse the time details. 

(j) Transform: At this stage, Druid allows you to perform some fundamental Transformations. 

(k) Filter: At this stage, Druid allows you to apply the filter conditions to your data.

(l) Configure Schema: At this stage, we are configuring the Schema details about the file that we had uploaded to the Druid.

  • Define Dimensions and Metrics:
    • Dimensions are fields you want to filter or group by (e.g., category).
    • Metrics are fields you want to aggregate (e.g., value).
  • Primary Timestamp:
    • Confirm the primary timestamp field.
  • Partitioning and Indexing:
    • Select a time-based partitioning scheme, such as day or hour.
    • Choose indexing options like bitmap or auto-compaction if needed.

(m) Partition: At this stage, Druid allows you to configure the Date specific granularity details. 

(n) Tune:

    • Max Rows in Memory:
      • Specify the maximum number of rows to store in memory during ingestion.
    • Segment Granularity:
      • Define how data is divided into segments (e.g., daily, hourly).
    • Partitioning and Indexing:
      • Configure the number of parallel tasks for ingestion if working with large datasets.

(o) Publish:

(p) Edit Spec:

(q) Submit Task:

    1. Generate the Ingestion Spec:
      • Review the auto-generated ingestion spec JSON in the console.
      • Edit the JSON manually if you need to add custom configurations.
    2. Submit the Task:
      • Click Submit to start the ingestion process.
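For reference, a trimmed version of the kind of spec the wizard generates for a local CSV can also be posted directly to the task API; the paths, datasource, and column names below are assumptions for illustration:

curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8888/druid/indexer/v1/task \
  -d '{
    "type": "index_parallel",
    "spec": {
      "ioConfig": {
        "type": "index_parallel",
        "inputSource": { "type": "local", "baseDir": "/data", "filter": "customer.csv" },
        "inputFormat": { "type": "csv", "findColumnsFromHeader": true }
      },
      "dataSchema": {
        "dataSource": "customer",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "useSchemaDiscovery": true },
        "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "none" }
      },
      "tuningConfig": { "type": "index_parallel", "maxNumConcurrentSubTasks": 2 }
    }
  }'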

(r) Monitor the Ingestion Process:

    1. Navigate to the Tasks section in the Druid Console.
    2. Monitor the progress of the ingestion task.
    3. If the task fails, review the logs for errors, such as incorrect schema definitions or file path issues.

Druid processed the CSV load of 1 million rows in 33 seconds.

Data Querying

Query the Ingested Data 

    1. Navigate to the Query Tab:
      • In the Druid Console, go to Query and select your new data source.
    2. Write and Execute Queries:
      • Use Druid SQL or the native JSON query language to interact with your data.

Example SQL Query:
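As a representative example (the table and column names are assumptions based on the customer.csv dataset used above), a Druid SQL query can be typed in the Query tab or posted to Druid's SQL API:

curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8888/druid/v2/sql \
  -d '{"query": "SELECT FLOOR(__time TO DAY) AS day, COUNT(*) AS row_count FROM customer GROUP BY 1 ORDER BY 1"}'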

Query performance for querying and rendering 1 million rows was 15 ms.

    1. Visualize Results:
      • View the query results directly in the console, or connect Druid to a visualization tool like Apache Superset. In an upcoming blog we will show how to visualize this ingested data using Apache Superset. For more background, see our earlier posts Exploring Apache Druid: A High-Performance Real-Time Analytics Database and Apache Druid Integration with Apache Superset.

Uploading data sets to Apache Druid is a straightforward process, thanks to the intuitive Druid Console. By following these steps, you can ingest your data, configure schemas, and start analyzing it in no time. Whether you’re exploring business metrics, performing real-time analytics, or handling time-series data, Apache Druid provides the tools and performance needed to get insights quickly.

Snowflake

  1. Log in to the Snowflake application using the credentials.

2. Upload the CSV file to the Server.

3. The uploaded “customer.csv” file has 1 Million data rows and the size of the file is around 166 MB. 

4. Specify the Snowflake Database.

5. Specify the Snowflake Table. 

6. The CSV data is being processed into the specified Snowflake Database and table.

7. Snowflake allows us to format or edit the metadata information before loading the data into Snowflake DB. Update the details as per the data and click on the Load button.

8. Snowflake successfully loaded the customer data, taking 45 seconds to process the CSV file, about 12 seconds longer than Apache Druid took to load, transform, and process the same 1 million records.

9. Query performance for rendering 1 million rows was 20 ms, about 5 ms slower than Apache Druid for querying the same 1 million records.
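The same load can also be scripted instead of using the web UI. A hedged sketch with the SnowSQL CLI (account, database, schema, table, stage, and file path are placeholders):

# upload the local file to the table's internal stage, then copy it into the table
snowsql -a <account_identifier> -u <user> -d <database> -s <schema> -q "
  PUT file:///path/to/customer.csv @%customer;
  COPY INTO customer
    FROM @%customer
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
"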

Conclusion

While Snowflake is a powerful, cloud-native data warehouse with strengths in batch processing, historical data analysis, and complex OLAP queries, Apache Druid stands out in scenarios where real-time, low-latency analytics are needed. Druid’s open-source nature, time-series optimizations, streaming data support, and operational focus make it the better choice for real-time analytics, event-driven applications, and fast, interactive querying on large datasets.

Apache Druid FAQ

Apache Druid is used for real-time analytics and fast querying of large-scale datasets. It's particularly suitable for time-series data, clickstream analytics, operational intelligence, and real-time monitoring.

Druid supports:

  • Streaming ingestion: Data from Kafka, Kinesis, etc., is ingested in real time.
  • Batch ingestion: Data from Hadoop, S3, or other static sources is ingested in batches.

Druid uses:

  • Columnar storage: Optimizes storage for analytic queries.
  • Advanced indexing: Bitmap and inverted indexes speed up filtering and aggregation.
  • Distributed architecture: Spreads query load across multiple nodes.

No. Druid is optimized for OLAP (Online Analytical Processing) and is not suitable for transactional use cases.

How Druid stores data:

  • Data is segmented into immutable chunks and stored in Deep Storage (e.g., S3, HDFS).
  • Historical nodes cache frequently accessed segments for faster querying.

How Druid can be queried:

  • SQL: Provides a familiar interface for querying.
  • Native JSON-based query language: Offers more flexibility and advanced features.

Druid's core components:

  • Broker: Routes queries to appropriate nodes.
  • Historical nodes: Serve immutable data.
  • MiddleManager: Handles ingestion tasks.
  • Overlord: Coordinates ingestion processes.
  • Coordinator: Manages data availability and balancing.

Druid integrates with tools like:

  • Apache Superset
  • Tableau
  • Grafana
  • Stream ingestion tools (Kafka, Kinesis)
  • Batch processing frameworks (Hadoop, Spark)

Yes, it’s open-source and distributed under the Apache License 2.0.

Alternatives include:

  • ClickHouse
  • Elasticsearch
  • BigQuery
  • Snowflake

Whether you're exploring real-time analytics or need help getting started, feel free to reach out!

Watch the Apache Blog Series

Stay tuned for the upcoming Apache Blog Series:

  1. Exploring Apache Druid: A High-Performance Real-Time Analytics Database
  2. Unlocking Data Insights with Apache Superset
  3. Streamlining Apache HOP Workflow Management with Apache Airflow
  4. Comparison of and migrating from Pentaho Data Integration PDI/ Kettle to Apache HOP
  5. Apache Druid Integration with Apache Superset
  6. Why choose Apache Druid over Vertica
  7. Why choose Apache Druid over Snowflake
  8. Why choose Apache Druid over Google Big Query
  9. Integrating Apache Druid with Apache Superset for Realtime Analytics

Proactive Alerting for Cloud Applications using Grafana

When to Use Grafana and How to Set Up Alerting in Grafana

Alerting has become critical. While monitoring gives you an overview of the system, alerting is a near-real-time notification mechanism that immediately informs the team when an issue occurs, in time to take quick action before things go bad. For example, if a server exceeds its expected CPU usage, an alert notifies the team so the matter can be addressed before it leads to downtime or performance degradation. In short, alerting lets you head off problems that would otherwise have a big impact on your system or business.

In this article, we will discuss the basic role of alerting in a monitoring system and exactly how alerting works inside Grafana, one of the powerful open-source tools for monitoring and visualization. After briefly discussing the importance of monitoring and alerting, we’ll guide you through the steps to set up alerting in Grafana.

Importance of Alerting in Monitoring Systems

Monitoring is the process of continuously collecting data from various parts of the system and analyzing it over time to trace patterns or anomalies. It helps with capacity planning, exposes performance bottlenecks, and guides optimization efforts by showing the whole picture of system health without initiating any action. Alerting, in contrast, is an active response mechanism that informs teams when certain conditions or thresholds have been met, with the objective of keeping teams aware of problems as they occur.

Main Differences

Objectives: Monitoring is concerned with long-term data collection and analysis, while alerting addresses the immediate need for issue detection and response.

Timing: Monitoring is always on, capturing data at all times, while alerts are event-driven: they fire only when certain conditions are met.

Key Benefits of Alerts

Continuous Monitoring Without Human Intervention: Alerts automate the process, ensuring that issues are flagged without constant human oversight.

Real-Time Notifications: Alerts send instant notifications based on predefined conditions, ensuring rapid responses to critical changes. The right people get notified, so escalations are managed properly.

Types of Alerts

Threshold-Based Alerts: Triggered when a metric crosses a defined threshold, for example raising an alert when CPU usage exceeds 90%.

Anomaly Detection Alerts: Intended to detect unusual patterns or behaviours that typical thresholds might miss.

Event-Based Alerts: These alerts react to critical events, such as the failure of an application process or missing critical data, so teams are alerted to important occurrences.

Setting Up Alerting in Grafana (Step-by-Step Guide)

Prerequisites to Setup Alerts

Before you can set up alerts in Grafana, you need the environment prepared as outlined below:

Data Source Integration: You will need a data source integrated with Grafana, such as Prometheus. Alerts are evaluated against the time-series data retrieved from such sources.

Understanding Alert Rules: An alert rule is a query that checks the state of a defined metric and determines whether an alert should be triggered given certain predefined conditions.

Step 1: Log in to Grafana with the required credentials.

Step 2: Create a new dashboard or open an existing dashboard where the notification alert needs to be set up.

Steps to Create Alerts

Step 1: Create a Panel for Visualization

Add New Panel: First, add a new panel to your Grafana dashboard where you will visualize the metric that you are going to monitor.

Select Visualization Type: From the list, pick the visualization type that best fits your data, such as Graph or Singlestat.

Step 2: Configure Alert

Alerting Menu Access: Navigate to the Alerting section from the menu.

New Alert Rule: In the Rules subsection under Alerting, click New Alert Rule to start setting up an alert.

Data Source: From the list of available data sources, select one such as Prometheus.

Write the Query: Type the query that fetches the metric you need to monitor. Be sure the query accurately reflects the condition you want to evaluate.

Set the Threshold: Define how the queried value should be checked, for example whether it is above a certain value. You could choose the condition “is above” with a threshold value (for example, 80 for CPU usage).
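Before moving on, it can help to confirm that the underlying expression really returns the values you expect. The short Python sketch below queries Prometheus’s HTTP API for a CPU-usage expression of the kind used above; the node-exporter metric name, the localhost:9090 address, and the 80% threshold are assumptions to adapt to your own setup.

import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"  # assumed local Prometheus
# Average CPU usage (%) per instance over the last 5 minutes (assumes node-exporter metrics).
QUERY = '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
THRESHOLD = 80.0  # the same value you would enter as the "is above" threshold

url = PROMETHEUS_URL + "?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as response:
    result = json.load(response)["data"]["result"]

for series in result:
    instance = series["metric"].get("instance", "unknown")
    value = float(series["value"][1])
    status = "would alert" if value > THRESHOLD else "ok"
    print(f"{instance}: {value:.1f}% CPU -> {status}")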

Enter Values for Alerting Rule Options

Name: Give the rule a descriptive name for the alert, like “High CPU Usage Alert”.

Alert Conditions: Define a query that specifies the conditions under which the alert should be triggered.

Alert Evaluation Behavior: Select how frequently the alert rule should be evaluated (for example, every 5 minutes).

Labels and Notifications: Add relevant labels to help categorize your alerts, such as environment or service. Write the message that will go out when the alert is triggered, including action instructions and some background information so the issue can be recognized easily.

Include Contact Points: Choose where alert notifications should be delivered, such as email, Slack, Google Chat, PagerDuty, or webhooks. Remember, you’ll have to set up the contact points in Grafana beforehand. In the URL field, attach the webhook of the channel where you want to be notified.
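If you pick a webhook contact point and want to see exactly what Grafana sends before wiring up a real channel, a throwaway receiver is enough. The sketch below uses only the Python standard library; the port 5000 and the /grafana-alert path are arbitrary choices that must match the webhook URL you configure in Grafana.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class GrafanaWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Grafana posts a JSON body describing the rule, its state, and the alert instances.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print(json.dumps(payload, indent=2))
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"received")

if __name__ == "__main__":
    # Point the Grafana webhook contact point at http://<this-host>:5000/grafana-alert
    HTTPServer(("0.0.0.0", 5000), GrafanaWebhookHandler).serve_forever()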

Step 3: Testing your Alerts

Test the Alert: Use the testing feature in Grafana to confirm that your alert configuration is set up properly and that the rule fires under the defined conditions.

Step 4: Finalize the Alert

Save Alert: Once all settings are configured, save the alert rule by clicking Save.

Enable Alert: Finally, make sure the alert is enabled so it can start monitoring for the defined conditions.

Conclusion

Alerting is one of the most important features of a modern monitoring system, enabling teams to respond to issues at their earliest sign rather than letting them spin out of control. With well-defined alerts integrated into monitoring, organizations can reduce downtime, increase reliability, and keep complex systems running smoothly.

Alerts in Grafana must be actionable, not vague. Avoid over-complicating alert rules, update them regularly as infrastructure and environments evolve, group and prioritize them properly, and take advantage of advanced notification options like webhooks and third-party tools.

In this post, we focused on how Grafana excels at detailed alert configuration and system metric monitoring, complementing tools like Uptime Kuma, which is well suited to simple service uptime tracking. In a follow-up post, we will dig deeper into Uptime Kuma and show its setup from the ground up. Stay tuned to find out how these two tools can work together to create a seamless, holistic monitoring and alerting strategy.

Have questions about Grafana, alerting, or optimizing your monitoring setup? Our team is here to assist!

Frequently Asked Questions (FAQs)

The purpose of configuring notification alerts is to ensure timely awareness of issues in your systems by monitoring specific metrics. Alerts allow you to proactively respond to potential problems, reducing downtime and enhancing system performance.

You can access Grafana by logging in with the required credentials. If you don't have an account, you'll need to create one or request access from your administrator.

You can set up alerts on both existing dashboards and new ones. Simply open the dashboard where you want to configure the alert or create a new dashboard if needed.

You can use various visualization types, such as Graph or Singlestat, depending on how you want to display the metric you're monitoring.

In the alerting section under "Rules," select "New Alert Rule" and choose your data source (e.g., Prometheus, InfluxDB) when writing the query to retrieve the metric you want to monitor.

You can define alert conditions by specifying when the alert should trigger based on your chosen metric. This could be when the metric crosses a certain threshold or remains above or below a specific value for a defined duration.

Setting a threshold value determines the specific point at which an alert will be triggered, allowing you to control when you are notified of potential issues based on the behaviour of the monitored metric.

Yes, you can customize the alert messages by setting annotations in the alerting rule. This allows you to tailor the content of the notification that will be sent when the alert is triggered.

You can set contact points for notifications, such as Email, Hangouts, Slack, PagerDuty, or Webhooks. Attach the webhook URL for the channel where you want to receive alerts.

Testing the alert with the "Test Rule" button allows you to simulate the alert and see how it would behave under current conditions, ensuring the configuration works as expected before saving.

Server monitoring involves tracking the performance and health of servers to ensure they are running efficiently and to quickly identify and resolve any issues. It is important because it helps prevent downtime, ensures optimal performance, and maintains the reliability of services.

Grafana provides real-time insights into server metrics such as CPU usage, memory utilization, network traffic, and disk activity. It offers customizable dashboards and visualization options to help interpret data and spot anomalies quickly.

 

Alerts are configured in Grafana with custom rules and thresholds. Integrating with Google Chat, the system sends immediate notifications to the relevant team members when any anomalies or performance issues arise.

Node-exporter and Prometheus are used for data collection. Node-exporter gathers system-level metrics, while Prometheus stores these metrics and provides querying capabilities.

Grafana can monitor a wide range of metrics, including CPU usage, memory utilization, disk I/O, network traffic, application response times, and custom application metrics defined through various data sources.

Yes, Grafana supports integration with numerous third-party applications and services, including notification channels like Slack, Microsoft Teams, PagerDuty, and more, enhancing its alerting capabilities.

The data collection frequency can vary based on the configuration of the data source (like Prometheus) and the specific queries you set up. You can typically configure scrape intervals in your Prometheus setup.

Yes, Grafana allows you to share dashboards with team members via direct links, snapshots, or by exporting them. You can also set permissions to control who can view or edit the dashboards.

If you encounter issues, check the Grafana logs for error messages, review your alert configurations, and ensure that your data sources are properly connected. The Grafana community and documentation are also valuable resources for troubleshooting.

Yes, Grafana allows you to create complex alert conditions based on multiple metrics using advanced queries. You can combine metrics in a single alert rule to monitor related conditions.

If a data source goes down, Grafana will typically show an error or a warning on the dashboard. Alerts configured with that data source may also fail to trigger until the connection is restored.

Yes, Grafana allows you to visualize historical data by querying data sources that store time-series data, such as Prometheus. You can create dashboards that analyze trends over time.

Annotations are markers added to graphs in Grafana to indicate significant events or changes. They can provide context for data trends and help identify when specific incidents occurred.

Alerts are conditions set to monitor specific metrics and trigger under certain circumstances, while notifications are the messages sent out when those alerts are triggered, informing users of the situation.

Yes, Grafana offers some customization options for its UI, including themes and layout adjustments. You can also configure dashboard variables to create dynamic and user-friendly interfaces.

Yes, you can use Grafana's API to programmatically create and manage dashboards, allowing for automation in scenarios such as CI/CD pipelines or large-scale deployments.
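As a rough illustration of that API, the sketch below creates a minimal, empty dashboard over HTTP. The Grafana address, the API token, and the dashboard title are placeholders, and the payload follows the general shape accepted by the /api/dashboards/db endpoint; check the API documentation of your Grafana version before relying on it.

import json
import urllib.request

GRAFANA_URL = "http://localhost:3000/api/dashboards/db"  # assumed local Grafana
API_TOKEN = "REPLACE_WITH_YOUR_TOKEN"                    # placeholder service-account/API token

payload = {
    "dashboard": {
        "id": None,        # None asks Grafana to create a new dashboard
        "uid": None,
        "title": "Automated CPU Dashboard",
        "panels": [],      # panels can also be defined programmatically
        "schemaVersion": 16,
    },
    "folderId": 0,
    "overwrite": True,
}

request = urllib.request.Request(
    GRAFANA_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}",
    },
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))  # returns the new dashboard's uid and url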

Grafana offers extensive documentation, tutorials, and community forums. Additionally, there are many online courses and video tutorials available to help users learn the platform.


Apache Druid Integration with Apache Superset

Introduction

In our previous blog, Exploring Apache Druid: A High-Performance Real-Time Analytics Database, we discussed Apache Druid in detail. If you missed it, we recommend reading it before continuing with the Apache Druid blog series. In this blog, we talk about integrating Apache Superset with Apache Druid.

Apache Superset and Apache Druid are both open-source tools that are often used together in data analytics and business intelligence workflows. Here’s an overview of each and their relationship.

Apache Superset

Apache Superset is an open-source data exploration and visualization platform originally developed at Airbnb and later donated to the Apache Software Foundation. It is now a top-level Apache project, widely adopted across industries for data analytics and visualization.

  1. Type: Data visualization and exploration platform.
  2. Purpose: Enables users to create dashboards, charts, and reports for data analysis.
  3. Features:
    1. User-friendly interface for querying and visualizing data.
    2. Wide variety of visualization types (charts, maps, graphs, etc.).
    3. SQL Lab: A rich SQL IDE for writing and running queries.
    4. Extensible: Custom plugins and charts can be developed.
    5. Role-based access control and authentication options.
    6. Integrates with multiple data sources, including databases, warehouses, and big data engines.
  4. Use Cases:
    1. Business intelligence (BI) dashboards.
    2. Interactive data exploration.
    3. Monitoring and operational analytics.

Apache Druid

Apache Druid is a high-performance, distributed real-time analytics database designed to handle fast, interactive queries on large-scale datasets, especially those with a time component. It is widely used for powering business intelligence (BI) dashboards, real-time monitoring systems, and exploratory data analytics.

  1. Type: Real-time analytics database.
  2. Purpose: Designed for fast aggregation, ingestion, and querying of high-volume event streams.
  3. Features:
    1. Columnar storage format: Optimized for analytic workloads.
    2. Real-time ingestion: Can process and make data available immediately.
    3. Fast queries: High-performance query engine for OLAP (Online Analytical Processing).
    4. Scalability: Handles massive data volumes efficiently.
    5. Support for time-series data: Designed for datasets with a time component.
    6. Flexible data ingestion: Supports batch and streaming ingestion from sources like Kafka, Hadoop, and object stores.
  4. Use Cases:
    1. Real-time dashboards.
    2. Clickstream analytics.
    3. Operational intelligence and monitoring.
    4. Interactive drill-down analytics.

How They Work Together

  1. Integration: Superset can connect to Druid as a data source, enabling users to visualize and interact with the data stored in Druid.
  2. Workflow:
    1. Data Ingestion: Druid ingests large-scale data in real-time or batches from various sources.
    2. Data Querying: Superset uses Druid’s fast query capabilities to retrieve data for visualizations.
    3. Visualization: Superset presents the data in user-friendly dashboards, allowing users to explore and analyze it.

Key Advantages of Using Them Together

  1. Speed: Druid’s real-time and high-speed querying capabilities complement Superset’s visualization features, enabling near real-time analytics.
  2. Scalability: The combination is suitable for high-throughput environments where large-scale data analysis is needed.
  3. Flexibility: Supports a wide range of analytics use cases, from real-time monitoring to historical data analysis.

Use Cases of Apache Druid and Apache Superset

  1. Fraud Detection: Analyzing transactional data for anomalies.
  2. Real-Time Analytics: Monitoring website traffic, application performance, or IoT devices in real-time.
  3. Operational Intelligence: Tracking business metrics like revenue, sales, or customer behaviour.
  4. Clickstream Analysis: Analyzing user interactions and behaviour on websites and apps.
  5. Log and Event Analytics: Parsing and querying log files or event streams.

Importance of Integrating Apache Druid with Apache Superset

Integrating Apache Druid with Apache Superset gives organizations a strong foundation for high-scale data analysis and interactive visualization. Here’s why this combination is so powerful:

  1. High-Performance Analytics:

Druid’s Speed: Apache Druid is optimized for low-latency queries and fast aggregations, and is designed with large datasets in mind. Those quick responses let Superset deliver responsive, production-quality visualizations.

Real-Time Data: Druid supports real-time data ingestion, allowing Superset dashboards to visualise near-live data streams, making it suitable for monitoring and operational analytics.

  2. Advanced Query Capabilities:

Complex Aggregations: Druid’s ability to handle complex OLAP-style aggregations and filtering ensures that Superset can provide rich and detailed visualizations.

Native Time-Series Support: Druid’s in-built time-series functionalities allow seamless integration for Superset’s time-series charts and analysis.

  3. Scalability:

Designed for Big Data: Druid is highly scalable and can manage billions of rows in an efficient manner. Together with Superset, it helps dashboards scale easily for high volumes of data.

Distributed Architecture: Because Druid is distributed by nature, it keeps Superset performing reliably even under high load or many concurrent user sessions.

  4. Simplified Data Exploration:

Columnar Storage: Druid’s columnar storage format greatly accelerates exploratory data analysis in Superset, since only the selected columns need to be scanned.

Pre-Aggregated Data: Superset can take advantage of Druid’s roll-up and pre-aggregation mechanisms to reduce query cost and improve visualization performance.

  5. Seamless Integration:

Built-In Connector: Superset integrates natively with Druid, so there is little implementation hassle. Users can query Druid directly and curate dashboards with minimal configuration, and non-technical users can also run SQL-like queries (Druid SQL) within the interface.

  6. Advanced Features:

Granular Access Control: When using Druid with Superset, it is possible to enforce row-level security and user-based data access, ensuring that sensitive data is only accessible to authorized users.

Segment-Level Optimization: Druid’s segment-based storage enables Superset to optimize queries for specific data slices.

  7. Cost Efficiency:

Efficient Resource Utilization: Druid’s architecture minimizes storage and compute costs while providing high-speed analytics, ensuring that Superset dashboards run efficiently.

  8. Real-Time Monitoring and BI:

Operational Dashboards: In the case of monitoring systems, website traffic, or application logs, Druid’s real-time capabilities make the dashboards built in Superset very effective for real-time operational intelligence.

Streaming Data Support: Druid can ingest from streaming sources such as Kafka, enabling Superset to visualize live data with minimal delay.

Prerequisites

  1. The Superset Docker Container or instance should be up and running.
  2. Installation of pydruid in Superset Container or instance.

Installing PyDruid

1. Enable Druid-Specific SQL Queries

PyDruid is the Python library that serves as a client for Apache Druid. This allows Superset to: 

  1. Run Druid SQL queries efficiently.
  2. Access Druid’s powerful querying capabilities directly from Superset’s SQL Lab and dashboards.

Without this client, Superset cannot communicate with the Druid database, which results in partial or broken functionality.

2. Native Connection Support

Superset is largely dependent upon PyDruid for:

  1. Establishing seamless connections to Apache Druid.
  2. Translating Superset queries into Druid-native SQL or JSON that the Druid engine understands.

3. Efficient Data Fetching

PyDruid optimizes data fetching from Druid by:

  1. Managing Druid’s query API for aggregation, filtering, and time-series analysis.
  2. Reducing latency through better management of connections and responses.

4. Advanced Visualization Features

Many of Superset’s advanced visualization features (like time-series charts and heat maps) depend on Druid capabilities that PyDruid enables. It also exposes Druid’s granularity and time-aggregation options to Superset, along with access to complex metrics pre-aggregated in Druid.

5. Integration with Superset’s Druid Connector

The Druid connector in Superset relies on PyDruid as a dependency to:

  1. Register Druid as a data source.
  2. Allow Superset users to query and visually explore Druid datasets directly.

Without PyDruid, the Druid connector in Superset will not work, limiting your ability to integrate Druid as an analytics backend.

6. Data Discovery

PyDruid enables Superset to fetch metadata from Druid, including schemas, tables, and columns, and to explore Druid data ad hoc in SQL Lab.
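Before wiring the connection up in the Superset UI, you can confirm from plain Python that PyDruid reaches Druid’s SQL endpoint. This is only a sketch: the localhost:8888 router address is an assumption, and the query simply lists the available datasources through Druid’s INFORMATION_SCHEMA.

from pydruid.db import connect

# DB-API style connection to Druid's SQL endpoint (adjust host/port/scheme to your setup).
conn = connect(host="localhost", port=8888, path="/druid/v2/sql/", scheme="http")
cursor = conn.cursor()
cursor.execute("SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'")
for row in cursor:
    print(row)  # each row is one Druid datasource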

Apache Druid as a Data Source

  1. Make sure your Druid database application is set up and running on the expected ports.

2. Install pydruid by logging into the container.

$ docker exec -it --user root <superset_container_name> /bin/bash

$ pip install pydruid

3. Once pydruid is installed, navigate to Settings and click on Database Connections.

4. Click on Add Database and select Apache Druid as shown in the screenshot below:

5. Here we have to enter the SQLALCHEMY URI. To learn more about the SQLAlchemy URI rules, please follow the link below:

https://superset.apache.org/docs/configuration/databases/

6. For Druid the SQLALCHEMY URI will be in the below format:

druid://<User>:<password>@<Host>:<Port-default-9088>/druid/v2/sql

Note: 

druid://<User>:<password>@<Host>:<Port-default-9088>/druid/v2/sql

Although the placeholder above reads “<Port-default-9088>”, the commonly used default ports are “8888” or “8081”.

If your Druid installation is running over plain HTTP without a secure connection, a URI following the above rule will work without any issue.

Ex: druid://localhost:8888/druid/v2/sql/

If you have a secure connection, druid://localhost:8888/druid/v2/sql/ will normally not work and will return an error.

The natural next step is to provide a domain name instead of localhost, since a secured connection has been enabled for the Druid UI, and enter the URI as below:

druid://domain.name.com:8888/druid/v2/sql/. This will also not work.

According to other sources, the URI should be changed to:

druid+https://localhost:8888/druid/v2/sql/

OR

druid+https://localhost:8888/druid/v2/sql/?ssl_verify=false

Neither of these will work either.

Normally, once a secured connection is enabled, HTTPS is served by a proxy sitting in front of the default ports of the underlying applications.

So if we enter the SQLALCHEMY URI as below:

druid+https://domain.name.com:443/druid/v2/sql/?ssl=true

Superset will understand that Druid is behind a secured connection and is reachable on the HTTPS port.

7. Since we have enabled SSL for our Druid application, we have to enter the SQLALCHEMY URI below

druid+https://domain.name.com:443/druid/v2/sql/?ssl=true

and click on Test Connection.

8. Once the test connection succeeds, Apache Superset is integrated with Apache Druid and ready to use.
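The same URI can also be verified outside Superset with a short SQLAlchemy snippet, which is handy when Test Connection fails and you want to rule out the URI itself. This sketch assumes pydruid is installed in the local Python environment and reuses the example domain shown above.

from sqlalchemy import create_engine, text

# The druid+https dialect is registered by pydruid.
engine = create_engine("druid+https://domain.name.com:443/druid/v2/sql/?ssl=true")

with engine.connect() as connection:
    # A trivial query is enough to prove that the URI, TLS setup, and port are correct.
    result = connection.execute(text("SELECT 1"))
    print(result.fetchall())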

In upcoming articles, we will cover data ingestion and the visualization of Apache Druid data in Apache Superset.

Apache Druid FAQ

Apache Druid is used for real-time analytics and fast querying of large-scale datasets. It's particularly suitable for time-series data, clickstream analytics, operational intelligence, and real-time monitoring.

Druid supports:

  1. Streaming ingestion: Data from Kafka, Kinesis, etc., is ingested in real-time.
  2. Batch ingestion: Data from Hadoop, S3, or other static sources is ingested in batches.

Druid uses:

  1. Columnar storage: Optimizes storage for analytic queries.
  2. Advanced indexing: Bitmap and inverted indexes speed up filtering and aggregation.
  3. Distributed architecture: Spreads query load across multiple nodes.

No. Druid is optimized for OLAP (Online Analytical Processing) and is not suitable for transactional use cases.

  1. Data is segmented into immutable chunks and stored in Deep Storage (e.g., S3, HDFS).
  2. Historical nodes cache frequently accessed segments for faster querying.
  1. SQL: Provides a familiar interface for querying.
  2. Native JSON-based query language: Offers more flexibility and advanced features.
  1. Broker: Routes queries to appropriate nodes.
  2. Historical nodes: Serve immutable data.
  3. MiddleManager: Handles ingestion tasks.
  4. Overlord: Coordinates ingestion processes.
  5. Coordinator: Manages data availability and balancing.

Druid integrates with tools like:

  1. Apache Superset
  2. Tableau
  3. Grafana
  4. Stream ingestion tools (Kafka, Kinesis)
  5. Batch processing frameworks (Hadoop, Spark)

Yes, it’s open-source and distributed under the Apache License 2.0.

Alternatives include:

  1. ClickHouse
  2. Elasticsearch
  3. BigQuery
  4. Snowflake

Apache Superset FAQ

Apache Superset is an open-source data visualization and exploration platform. It allows users to create dashboards, charts, and interactive visualizations from various data sources.

Superset offers:

  1. Line charts, bar charts, pie charts.
  2. Geographic visualizations like maps.
  3. Pivot tables, histograms, and heatmaps.
  4. Custom visualizations via plugins.

Superset supports:

  1. Relational databases (PostgreSQL, MySQL, SQLite).
  2. Big data engines (Hive, Presto, Trino, BigQuery).
  3. Analytics databases (Apache Druid, ClickHouse).

Superset is a robust, open-source alternative to commercial BI tools. While it’s feature-rich, it may require more setup and development effort compared to proprietary solutions like Tableau.

Superset primarily supports SQL for querying data. Its SQL Lab provides an environment for writing, testing, and running queries.

Superset supports:

  1. Database-based login.
  2. OAuth, LDAP, OpenID Connect, and other SSO mechanisms.
  3. Role-based access control (RBAC) for managing permissions.

Superset can be installed using:

  1. Docker: For containerized deployments.
  2. Python pip: For a native installation in a Python environment.
  3. Helm: For Kubernetes-based setups.

Yes. By integrating Superset with real-time analytics platforms like Apache Druid or streaming data sources, you can build dashboards that reflect live data.

  1. Requires knowledge of SQL for advanced queries.
  2. May not have the polish or advanced features of proprietary BI tools like Tableau.
  3. Customization and extensibility require some development effort.

Superset performance depends on:

  1. The efficiency of the connected database or analytics engine.
  2. The complexity of queries and visualizations.
  3. Hardware resources and configuration.

Would you like further details on specific features or help setting up either tool? Reach out to us!

Watch the Apache Blog Series


Exploring Apache Druid: A High-Performance Real-Time Analytics Database

Introduction

Apache Druid is a distributed, column-oriented, real-time analytics database designed for fast, scalable, and interactive analytics on large datasets. It excels in use cases requiring real-time data ingestion, high-performance queries, and low-latency analytics. 

Druid was originally developed to power interactive data applications at Metamarkets and has since become a widely adopted open-source solution for real-time analytics, particularly in industries such as ad tech, fintech, and IoT.

It supports batch and real-time data ingestion, enabling users to perform fast ad-hoc queries, power dashboards, and interactive data exploration.

In big data and real-time analytics, having the right tools to process and analyze large volumes of data swiftly is essential. Apache Druid, an open-source, high-performance, column-oriented distributed data store, has emerged as a leading solution for real-time analytics and OLAP (online analytical processing) workloads. In this blog post, we’ll delve into what Apache Druid is, its key features, and how it can revolutionize your data analytics capabilities. Refer to the official documentation for more information.

Apache Druid

Apache Druid is a high-performance, real-time analytics database designed for fast slice-and-dice analytics on large datasets. It was created by Metamarkets (now part of Snap Inc.) and is now an Apache Software Foundation project. Druid is built to handle both batch and streaming data, making it ideal for use cases that require real-time insights and low-latency queries.

Key Features of Apache Druid:

Real-Time Data Ingestion

Druid excels at real-time data ingestion, allowing data to be ingested from various sources such as Kafka, Kinesis, and traditional batch files. It supports real-time indexing, enabling immediate query capabilities on incoming data with low latency.

High-Performance Query Engine

Druid’s query engine is optimized for fast, interactive querying. It supports a wide range of query types, including Time-series, TopN, GroupBy, and search queries. Druid’s columnar storage format and advanced indexing techniques, such as bitmap indexes and compressed column stores, ensure that queries are executed efficiently.

Scalable and Distributed Architecture

Druid’s architecture is designed to scale horizontally. It can be deployed on a cluster of commodity hardware, with data distributed across multiple nodes to ensure high availability and fault tolerance. This scalability makes Druid suitable for handling large datasets and high query loads.

Flexible Data Model

Druid’s flexible data model allows for the ingestion of semi-structured and structured data. It supports schema-on-read, enabling dynamic column discovery and flexibility in handling varying data formats. This flexibility simplifies the integration of new data sources and evolving data schemas.

Built-In Data Management

Druid includes built-in features for data management, such as automatic data partitioning, data retention policies, and compaction tasks. These features help maintain optimal query performance and storage efficiency as data volumes grow.

Extensive Integration Capabilities

Druid integrates seamlessly with various data ingestion and processing frameworks, including Apache Kafka, Apache Storm, and Apache Flink. It also supports integration with visualization tools like Apache Superset, Tableau, and Grafana, enabling users to build comprehensive analytics solutions.

Use Cases of Apache Druid

Real-Time Analytics

Druid is used in real-time analytics applications where the ability to ingest and query data in near real-time is critical. This includes monitoring applications, fraud detection, and customer behavior tracking.

Ad-Tech and Marketing Analytics

Druid’s ability to handle high-throughput data ingestion and fast queries makes it a popular choice in the ad tech and marketing industries. It can track user events, clicks, impressions, and conversion rates in real time to optimize campaigns.

IoT Data and Sensor Analytics

IoT applications produce time-series data at high volume. Druid’s architecture is optimized for time-series data analysis, making it ideal for analyzing IoT sensor data, device telemetry, and real-time event tracking.

Operational Dashboards

Druid is often used to power operational dashboards that provide insights into infrastructure, systems, or applications. The low-latency query capabilities ensure that dashboards reflect real-time data without delay.

Clickstream Analysis

Organizations leverage Druid to analyze user clickstream data on websites and applications, allowing for in-depth analysis of user interactions, preferences, and behaviors in real time.

The Architecture of Apache Druid

Apache Druid follows a distributed, microservice-based architecture. The architecture allows for scaling different components based on the system’s needs.

The main components are:

Coordinator and Overlord Nodes

  1. Coordinator Node: Manages data availability, balancing the distribution of data across the cluster, and overseeing segment management (segments are the basic units of storage in Druid).
  2. Overlord Node: Responsible for managing ingestion tasks. It works with the middle managers to schedule and execute data ingestion tasks, ensuring that data is ingested properly into the system.

Historical Nodes

Historical nodes store immutable segments of historical data. When queries are executed, historical nodes serve data from the disk, which allows for low-latency and high-throughput queries.

MiddleManager Nodes

MiddleManager nodes handle real-time ingestion tasks. They manage tasks such as ingesting data from real-time streams (like Kafka), transforming it, and pushing the processed data to historical nodes after it has persisted.

Broker Nodes

The broker nodes route incoming queries to the appropriate historical or real-time nodes and aggregate the results. They act as the query routers and perform query federation across the Druid cluster.

Query Nodes

Query nodes are responsible for receiving, routing, and processing queries. They can handle a variety of query types, including SQL, and route these queries to other nodes for execution.

Deep Storage

Druid relies on an external deep storage system (such as Amazon S3, Google Cloud Storage, or HDFS) to store segments of data permanently. The historical nodes pull these segments from deep storage when they need to serve data.

Metadata Storage

Druid uses an external relational database (typically PostgreSQL or MySQL) to store metadata about the data, including segment information, task states, and configuration settings.

Advantages of Apache Druid

  1. Sub-Second Query Latency: Optimized for high-speed data queries, making it perfect for real-time dashboards.
  2. Scalability: Easily scales to handle petabytes of data.
  3. Flexible Data Ingestion: Supports both batch and real-time data ingestion from multiple sources like Kafka, HDFS, and Amazon S3.
  4. Column-Oriented Storage: Efficient data storage with high compression ratios and fast retrieval of specific columns.
  5. SQL Support: Familiar SQL-like querying capabilities for easy data analysis.
  6. High Availability: Fault-tolerant and highly available due to data replication across nodes.

Getting Started with Apache Druid

Installation and Setup

Setting up Apache Druid involves configuring a cluster with different node types, each responsible for specific tasks:

  1. Master Nodes: Oversee coordination, metadata management, and data distribution.
  2. Data Nodes: Handle data storage, ingestion, and querying.
  3. Query Nodes: Manage query routing and processing.

You can install Druid using a package manager, Docker, or by downloading and extracting the binary distribution. Here’s a brief overview of setting up Druid using Docker:

  1. Download the Docker Compose File:
    $ curl -O https://raw.githubusercontent.com/apache/druid/master/examples/docker-compose/docker-compose.yml
  2. Start the Druid Cluster: $ docker-compose up
  3. Access the Druid Console: Open your web browser and navigate to http://localhost:8888 to access the Druid console.

Ingesting Data

To ingest data into Druid, you need to define an ingestion spec that outlines the data source, input format, and parsing rules. Here’s an example of a simple ingestion spec for a CSV file:

JSON Code

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/csv",
        "filter": "*.csv"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      }
    },
    "dataSchema": {
      "dataSource": "example_data",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["column1", "column2", "column3"]
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}

Submit the ingestion spec through the Druid console or via the Druid API to start the data ingestion process.
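If you prefer to script the submission, the sketch below posts the spec to Druid’s task endpoint instead of pasting it into the console. The localhost:8888 router address and the ingestion_spec.json file name are assumptions for illustration.

import json
import urllib.request

# Load the ingestion spec shown above from a local file.
with open("ingestion_spec.json") as spec_file:
    spec = json.load(spec_file)

request = urllib.request.Request(
    "http://localhost:8888/druid/indexer/v1/task",   # the router proxies this to the Overlord
    data=json.dumps(spec).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))  # e.g. {"task": "index_parallel_example_data_..."}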

Querying Data

Once your data is ingested, you can query it using Druid’s native query language or SQL. Here’s an example of a simple SQL query to retrieve data from the example_data data source:

SELECT __time, column1, column2, column3 FROM example_data WHERE __time BETWEEN '2023-01-01' AND '2023-01-31'

Use the Druid console or connect to Druid from your preferred BI tool to execute queries and visualize data.
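For programmatic access, the same kind of query can be sent to Druid’s SQL HTTP API. The sketch below assumes the router is reachable on localhost:8888 and uses explicit TIMESTAMP literals when filtering on __time.

import json
import urllib.request

sql = {
    "query": (
        "SELECT __time, column1, column2, column3 "
        "FROM example_data "
        "WHERE __time BETWEEN TIMESTAMP '2023-01-01' AND TIMESTAMP '2023-01-31'"
    )
}
request = urllib.request.Request(
    "http://localhost:8888/druid/v2/sql/",
    data=json.dumps(sql).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    for row in json.load(response):  # the API returns a JSON array of result rows
        print(row)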

Conclusion

Apache Druid is a powerful, high-performance real-time analytics database that excels at handling large-scale data ingestion and querying. Its robust architecture, flexible data model, and extensive integration capabilities make it a versatile solution for a wide range of analytics use cases. Whether you need real-time insights, interactive queries, or scalable OLAP capabilities, Apache Druid provides the tools to unlock the full potential of your data. Druid has firmly established itself as a leading database for real-time, high-performance analytics: its unique combination of real-time data ingestion, sub-second query speeds, and scalability makes it a strong choice for businesses that need to analyze vast amounts of time-series and event-driven data, and its adoption continues to grow across industries. Explore Apache Druid today and transform your data analytics landscape.

Need help transforming your real-time analytics with high-performance querying? Contact our experts today!

Watch the Apache Druid Blog Series

Stay tuned for the upcoming Apache Druid Blog Series:

  1. Why choose Apache Druid over Vertica
  2. Why choose Apache Druid over Snowflake
  3. Why choose Apache Druid over Google Big Query
  4. Integrating Apache Druid with Apache Superset for Realtime Analytics

Streamlining Apache HOP Workflow Management with Apache Airflow

Introduction

In our previous blog, we talked about Apache HOP in more detail. In case you missed it, refer to it here: https://analytics.axxonet.com/comparison-of-and-migrating-from-pdi-kettle-to-apache-hop/. As a continuation of the Apache HOP article series, here we touch upon how to integrate Apache Airflow with Apache HOP. In the fast-paced world of data engineering and data science, efficiently managing complex workflows is crucial. Apache Airflow, an open-source platform for programmatically authoring, scheduling, and monitoring workflows, has become a cornerstone of many data teams’ toolkits. This blog post explores what Apache Airflow is, its key features, and how you can leverage it to streamline and manage your Apache HOP workflows and pipelines.

Apache HOP

Apache HOP is an open-source data integration and orchestration platform. For more details refer to our previous blog here.

Apache Airflow

Apache Airflow is an open-source workflow orchestration tool originally developed by Airbnb. It allows you to define workflows as code, providing a dynamic, extensible platform to manage your data pipelines. Airflow’s rich features enable you to automate and monitor workflows efficiently, ensuring that data moves seamlessly through various processes and systems.

Use Cases:

  1. Data Pipelines: Orchestrating ETL jobs to extract data from sources, transform it, and load it into a data warehouse.
  2. Machine Learning Pipelines: Scheduling ML model training, batch processing, and deployment workflows.
  3. Task Automation: Running repetitive tasks, like backups or sending reports.

DAG (Directed Acyclic Graph):

  1. A DAG represents the workflow in Airflow. It defines a collection of tasks and their dependencies, ensuring that tasks are executed in the correct order.
  2. DAGs are written in Python and allow you to define the tasks and how they depend on each other.

Operators:

  • Operators define a single task in a DAG. There are several built-in operators, such as:
    1. BashOperator: Runs a bash command.
    2. PythonOperator: Runs Python code.
    3. SqlOperator: Executes SQL commands.
    4. HttpOperator: Makes HTTP requests.

Custom operators can also be created to meet specific needs.

Tasks:

  1. Tasks are the building blocks of a DAG. Each node in a DAG is a task that does a specific unit of work, such as executing a script or calling an API.
  2. Tasks are defined by operators and their dependencies are controlled by the DAG.

Schedulers:

  1. The scheduler is responsible for triggering tasks at the appropriate time, based on the schedule_interval defined in the DAG.
  2. It continuously monitors all DAGs and determines when to run the next task.

Executors:

  • The executor is the mechanism that runs the tasks. Airflow supports different types of executors:
    1. SequentialExecutor: Executes tasks one by one.
    2. LocalExecutor: Runs tasks in parallel on the local machine.
    3. CeleryExecutor: Distributes tasks across multiple worker machines.
    4. KubernetesExecutor: Runs tasks in a Kubernetes cluster.

Web UI:

  1. Airflow has a web-based UI that lets you monitor the status of DAGs, view logs, and check the status of each task in a DAG.
  2. It also provides tools to trigger, pause, or retry DAGs.

Key Features of Apache Airflow

Workflow as Code

Airflow uses Directed Acyclic Graphs (DAGs) to represent workflows. These DAGs are written in Python, allowing you to leverage the full power of a programming language to define complex workflows. This approach, known as “workflow as code,” promotes reusability, version control, and collaboration.

Dynamic Task Scheduling

Airflow’s scheduling capabilities are highly flexible. You can schedule tasks to run at specific intervals, handle dependencies, and manage task retries in case of failures. The scheduler executes tasks in a defined order, ensuring that dependencies are respected and workflows run smoothly.

Extensible Architecture

Airflow’s architecture is modular and extensible. It supports a wide range of operators (pre-defined tasks), sensors (waiting for external conditions), and hooks (interfacing with external systems). This extensibility allows you to integrate with virtually any system, including databases, cloud services, and APIs.

Robust Monitoring and Logging

Airflow provides comprehensive monitoring and logging capabilities. The web-based user interface (UI) offers real-time visibility into the status of your workflows, enabling you to monitor task progress, view logs, and troubleshoot issues. Additionally, Airflow can send alerts and notifications based on task outcomes.

Scalability and Reliability

Designed to scale, Airflow can handle workflows of any size. It supports distributed execution, allowing you to run tasks on multiple workers across different nodes. This scalability ensures that Airflow can grow with your organization’s needs, maintaining reliability even as workflows become more complex.

Getting Started with Apache Airflow

Installation using PIP

Setting up Apache Airflow is straightforward. You can install it using pip, Docker, or by deploying it on a cloud service. Here’s a brief overview of the installation process using pip:

1. Create a Virtual Environment (optional but recommended):

           python3 -m venv airflow_env

            source airflow_env/bin/activate

2. Install Apache Airflow: 

           pip install apache-airflow

3. Initialize the Database:

           airflow db init

4. Create a User:

           airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email [email protected]

5. Start the Web Server and Scheduler:

           airflow webserver --port 8080

           airflow scheduler

6. Access the Airflow UI: Open your web browser and go to http://localhost:8080.

Installation using Docker

Pull the docker image and run the container to access the Airflow web UI. Refer to the link for more details.

Creating Your First DAG

Airflow DAG Structure:

A DAG in Airflow is composed of three main parts:

  1. Imports: Necessary packages and operators.
  2. Default Arguments: Arguments that apply to all tasks within the DAG (such as retries, owner, start date).
  3. Task Definition: Define tasks using operators, and specify dependencies between them.

Scheduling:

Airflow allows you to define the schedule of a DAG using schedule_interval:

  1. @daily: Run once a day at midnight.
  2. @hourly: Run once every hour.
  3. @weekly: Run once a week at midnight on Sunday.
  4. Cron expressions, like “0 12 * * *”, are also supported for more specific scheduling needs.
Steps to create and deploy a DAG:

1. Define the DAG: Create a Python file (e.g., run_lms_transaction.py) in the dags folder of your Airflow installation directory.

Example:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

dag = DAG('example_dag', default_args=default_args, schedule_interval='@daily')

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

start >> end

2. Deploy the DAG: Place the DAG Python script in the dags folder (~/airflow/dags by default). Airflow will automatically detect and load the DAG.

3. Monitor the DAG: Access the Airflow UI, where you can view and manage the newly created DAG. Trigger the DAG manually or wait for it to run according to the defined schedule.

Calling the Apache HOP Pipelines/Workflows from Apache Airflow

In this example, we walk through how to integrate the Apache HOP with Apache Airflow. Here both the Apache Airflow and Apache HOP are running on two different independent docker containers. Apache HOP ETL Pipelines / Workflows are configured with a persistent volume storage strategy so that the DAG code can request execution from Airflow.  

Steps

  1. Define the DAG: Create a Python file (e.g., Stg_User_Details.py) in the dags folder of your Airflow installation directory.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.docker_operator import DockerOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from docker.types import Mount

default_args = {
    'owner': 'airflow',
    'description': 'Stg_User_details',
    'depend_on_past': False,
    'start_date': datetime(2022, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('Stg_User_details', default_args=default_args, schedule_interval='0 10 * * *',
         catchup=False, is_paused_upon_creation=False) as dag:

    start_dag = DummyOperator(
        task_id='start_dag'
    )

    end_dag = DummyOperator(
        task_id='end_dag'
    )

    hop = DockerOperator(
        task_id='Stg_User_details',
        # Use the Apache Hop Docker image. Add your tags here in the default apache/hop: syntax
        image='test',
        api_version='auto',
        auto_remove=True,
        environment={
            'HOP_RUN_PARAMETERS': 'INPUT_DIR=',
            'HOP_LOG_LEVEL': 'TRACE',
            'HOP_FILE_PATH': '/opt/hop/config/projects/default/stg_user_details_test.hpl',
            'HOP_PROJECT_DIRECTORY': '/opt/hop/config/projects/',
            'HOP_PROJECT_NAME': 'ISON_Project',
            'HOP_ENVIRONMENT_NAME': 'ISON_Env',
            'HOP_ENVIRONMENT_CONFIG_FILE_NAME_PATHS': '/opt/hop/config/projects/default/project-config.json',
            'HOP_RUN_CONFIG': 'local',
        },
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge",
        force_pull=False,
        mount_tmp_dir=False
    )

    start_dag >> hop >> end_dag

Note: For reference purposes only.

2. Deploy the DAG: Save the file in the dags folder. Airflow will automatically detect and load the DAG.

After successful deployment, we should see the new “Stg_User_Details” DAG listed in the Active and All tabs of the Airflow portal, as shown in the screenshot above.

3. Run the DAG: We can trigger the pipeline or workflow from Airflow by clicking the Trigger DAG option, as shown below.

4. Monitor the DAG: Access the Airflow UI, where you can view and manage the newly created DAG. Trigger the DAG manually or wait for it to run according to the defined schedule.

After successful execution, we should see the status along with the execution history and log details for the “Stg_User_Details” DAG in the Airflow portal, as shown in the screenshot above.

Managing and Scaling Workflows

  1. Use Operators and Sensors: Leverage Airflow’s extensive library of operators and sensors to create tasks that interact with various systems and handle complex logic.
  2. Implement Task Dependencies: Define task dependencies using the >> and << operators to ensure tasks run in the correct order (see the sketch after this list).
  3. Optimize Performance: Monitor task performance through the Airflow UI and logs. Adjust task configurations and parallelism settings to optimize workflow execution.
  4. Scale Out: Configure Airflow to run in a distributed mode by adding more worker nodes, ensuring that the system can handle increasing workload efficiently.
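As referenced in point 2 above, here is a small sketch of fan-out/fan-in dependencies built with the >> operator; the DAG and task names are purely illustrative.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG("dependency_example", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract_users = DummyOperator(task_id="extract_users")
    extract_orders = DummyOperator(task_id="extract_orders")
    transform = DummyOperator(task_id="transform")
    load = DummyOperator(task_id="load")

    # Both extract tasks must finish before transform runs; load runs last.
    [extract_users, extract_orders] >> transform >> load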

Conclusion

Apache Airflow is a powerful and versatile tool for managing workflows and automating complex data pipelines. Its “workflow as code” approach, coupled with robust scheduling, monitoring, and scalability features, makes it an essential tool for data engineers and data scientists. By adopting Airflow, you can streamline your workflow management, improve collaboration, and ensure that your data processes are efficient and reliable. Explore Apache Airflow today and discover how it can transform your data engineering workflows.

Streamline your Apache HOP Workflow Management With Apache Airflow through our team of experts.

Upcoming Apache HOP Blog Series

Stay tuned for the upcoming Apache HOP Blog Series:

  1. Migrating from Pentaho ETL to Apache Hop
  2. Integrating Apache Hop with an Apache Superset
  3. Comparison of Pentaho ETL and Apache Hop

Optimizing DevSecOps with SonarQube and DefectDojo Integration

In the evolving landscape of software development, the importance of integrating security practices into every phase of the software development lifecycle (SDLC) cannot be overstated.

This is where DevSecOps comes into play: a philosophy that emphasizes the need to incorporate security as a shared responsibility throughout the entire development process, rather than treating it as an afterthought.

By shifting security left, organizations can detect vulnerabilities earlier, reduce risks, and improve compliance with industry standards. DevSecOps isn’t just about tools; it’s about creating a culture where developers, security teams, and operations work together seamlessly. However, having the right tools is essential to implementing an effective DevSecOps strategy.

In this article, we’ll delve into the setup of DefectDojo, a powerful open-source vulnerability management tool, and demonstrate how to integrate it with SonarQube to automate code analysis as part of the CI/CD process and to capture the vulnerabilities identified during testing so they can be tracked to closure.

Why DevSecOps and What Tools to Use?

DevSecOps aims to automate security processes, ensuring that security is built into the CI/CD pipeline without slowing down the development process. Here are some key tools that can be used in a DevSecOps environment:

SonarQube: Static Code Analysis

SonarQube inspects code for bugs, code smells, and vulnerabilities, providing developers with real-time feedback on code quality. It helps identify and fix issues early in the development process, ensuring cleaner and more maintainable code.

Dependency Check: Dependency Vulnerability Scanning

Dependency Check scans project dependencies for known vulnerabilities, alerting teams to potential risks from third-party libraries. This tool is crucial for managing and mitigating security risks associated with external code components.

Trivy: Container Image Scanning

Trivy examines Docker images for vulnerabilities in both OS packages and application dependencies. By integrating Trivy into the CI/CD pipeline, teams ensure that the container images deployed are secure, reducing the risk of introducing vulnerabilities into production.

ZAP: Dynamic Application Security Testing

ZAP performs dynamic security testing on running applications to uncover vulnerabilities that may not be visible through static analysis alone. It helps identify security issues in a live environment, ensuring that the application is secure from real-world attacks.

Defect Dojo: Centralized Security Management

Defect Dojo collects and centralizes security findings from SonarQube, Dependency Check, Trivy, and ZAP, providing a unified platform for managing and tracking vulnerabilities. It streamlines the process of addressing security issues, offering insights and oversight to help prioritize and resolve them efficiently.

DefectDojo ties all these tools together by acting as a central hub for vulnerability management. It allows security teams to track, manage, and remediate vulnerabilities efficiently, making it an essential tool in the DevSecOps toolkit.

Installation and Setup of DefectDojo

Before integrating DefectDojo with your security tools, you need to install and configure it. Here’s how you can do that:

Prerequisites: 

1. Docker
2. Jenkins

Step 1: Download the official DefectDojo repository from GitHub using the link.

Step 2: Once downloaded, navigate to the path where you have downloaded the repository and run the command below.

docker-compose up -d --build

Step 3: DefectDojo will start running and you will be able to access it at http://localhost:8080/. Enter the credentials that you have set and log in to the application.

Step 4: Create a product by navigating as shown below.

Step 5: Click on the created product name and create an engagement by navigating as shown below.

Step 6: Once an engagement is created, click on the created engagement name and create a test from it.

Step 7: Copy the test number and enter it in the post stage of the pipeline.

Step 8: Build a pipeline in Jenkins and integrate it with DefectDojo in order to generate the vulnerability report.

pipeline {
    agent {
        label 'Agent-1'
    }
    environment {
        SONAR_TOKEN = credentials('sonarqube')
    }
    stages {
        stage('Checkout') {
            steps {
                checkout([$class: 'GitSCM', branches: [[name: 'dev']],
                    userRemoteConfigs: [[credentialsId: 'parameterized credentials', url: 'repositoryURL']]
                ])
            }
        }
        stage('SonarQube Analysis') {
            steps {
                script {
                    withSonarQubeEnv('Sonarqube') {
                        def scannerHome = tool name: 'sonar-scanner', type: 'hudson.plugins.sonar.SonarRunnerInstallation'
                        def scannerCmd = "${scannerHome}/bin/sonar-scanner"
                        sh "${scannerCmd} -Dsonar.projectKey=ProjectName -Dsonar.sources=./ -Dsonar.host.url=Sonarqubelink -Dsonar.login=${SONAR_TOKEN} -Dsonar.java.binaries=./"
                    }
                }
            }
        }
        // Other stages...
        stage('Generate SonarQube JSON Report') {
            steps {
                script {
                    def SONAR_REPORT_FILE = "/path/for/.jsonfile"
                    sh """
                    curl -u ${SONAR_TOKEN}: \
                        "Sonarqube URL" \
                        -o ${SONAR_REPORT_FILE}
                    """
                }
            }
        }
    }
    post {
        always {
            withCredentials([string(credentialsId: 'Defect_Dojo_Api_Key', variable: 'API_KEY')]) {
                script {
                    def defectDojoUrl = 'Enter the defect dojo URL'  // Replace with your DefectDojo URL
                    def testId = '14'  // Replace with the correct test ID
                    def scanType = 'SonarQube Scan'
                    def tags = 'Enter the tag name'
                    def SONAR_REPORT_FILE = "/path/where/.json/file is present"
                    sh """
                    curl -X POST \\
                      '${defectDojoUrl}' \\
                      -H 'accept: application/json' \\
                      -H 'Authorization: Token ${API_KEY}' \\
                      -H 'Content-Type: multipart/form-data' \\
                      -F 'test=${testId}' \\
                      -F 'file=@${SONAR_REPORT_FILE};type=application/json' \\
                      -F 'scan_type=${scanType}' \\
                      -F 'tags=${tags}'
                    """
                }
            }
        }
    }
}

The entire script is wrapped in a pipeline block, defining the CI/CD pipeline in Jenkins.

agent: Specifies the Jenkins agent (node) where the pipeline will execute. The label Agent-1 indicates that the pipeline should run on a node with that label.

environment: Defines environment variables for the pipeline.

SONAR_TOKEN: Retrieves the SonarQube authentication token from Jenkins credentials using the ID ‘sonarqube’. This token is needed to authenticate with the SonarQube server during analysis.

The pipeline includes several stages, each performing a specific task.

stage(‘Checkout’): The first stage checks out the code from the Git repository.

checkout: This is a Jenkins step that uses the GitSCM plugin to pull code from the specified branch.

branches: Indicates which branch (dev) to checkout.

userRemoteConfigs: Specifies the Git repository’s URL and the credentials ID (parameterized credentials) used to access the repository.

stage(‘SonarQube Analysis’): This stage runs a SonarQube analysis on the checked-out code.

withSonarQubeEnv(‘Sonarqube’): Sets up the SonarQube environment using the SonarQube server named Sonarqube, which is configured in Jenkins.

tool name: Locates the SonarQube scanner tool installed on the Jenkins agent.

sh: Executes the SonarQube scanner command with the following parameters:

-Dsonar.projectKey=ProjectName: Specifies the project key in SonarQube.

-Dsonar.sources=./: Specifies the source directory for the analysis (in this case, the root directory).

-Dsonar.host.url=Sonarqubelink: Specifies the SonarQube server URL.

-Dsonar.login=${SONAR_TOKEN}: Uses the SonarQube token for authentication.

-Dsonar.java.binaries=./: Points to the location of Java binaries (if applicable) for analysis.

stage(‘Generate SonarQube JSON Report’): This stage generates a JSON report of the SonarQube analysis.

SONAR_REPORT_FILE: Defines the path where the JSON report will be saved.

sh: Runs a curl command to retrieve the SonarQube issues data in JSON format (a standalone sketch of the same call follows the list below):

  1. -u ${SONAR_TOKEN}:: Uses the SonarQube token for authentication.
  2. “Sonarqube URL”: Specifies the API endpoint to fetch the issues from SonarQube.
  3. -o ${SONAR_REPORT_FILE}: Saves the JSON response to the specified file path.
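For reference, the same export can be reproduced outside Jenkins before wiring it into the pipeline. The following is a minimal sketch using Python’s requests library; the server URL, token, and output path are placeholders, and the api/issues/search endpoint with the parameters shown is one common way to pull issue data from the SonarQube Web API, so adjust it to whatever endpoint your pipeline actually calls.

# fetch_sonar_issues.py: illustrative sketch only; URL, token, and project key are placeholders.
import json
import requests

SONAR_URL = "https://sonarqube.example.com"   # hypothetical SonarQube server URL
SONAR_TOKEN = "squ_xxxxxxxx"                  # the same token Jenkins stores as SONAR_TOKEN
PROJECT_KEY = "ProjectName"

# SonarQube accepts the token as the basic-auth username with an empty password,
# mirroring the "curl -u ${SONAR_TOKEN}:" usage in the pipeline stage above.
response = requests.get(
    f"{SONAR_URL}/api/issues/search",
    params={"componentKeys": PROJECT_KEY, "resolved": "false", "ps": 500},
    auth=(SONAR_TOKEN, ""),
    timeout=30,
)
response.raise_for_status()
payload = response.json()

# Save the JSON payload to the path that the post stage later uploads to DefectDojo.
with open("sonar-report.json", "w") as fh:
    json.dump(payload, fh, indent=2)

print(f"Saved {payload.get('total', 0)} issues to sonar-report.json")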

post: Defines actions to perform after the pipeline stages are complete. The always block ensures that the following steps are executed regardless of the pipeline’s success or failure.

withCredentials: Securely retrieves the DefectDojo API key from Jenkins credentials using the ID ‘Defect_Dojo_Api_Key’.

script: Contains the script block where the DefectDojo integration happens.

defectDojoUrl: The URL of the DefectDojo API endpoint for reimporting scans.

testId: The specific test ID in DefectDojo where the SonarQube report will be uploaded.

scanType: Indicates the type of scan (in this case, SonarQube Scan).

tags: Tags the scan in DefectDojo.

SONAR_REPORT_FILE: Points to the JSON file generated earlier.

sh: Runs a curl command to POST the SonarQube JSON report to DefectDojo:

-X POST: Specifies that this is a POST request.

-H ‘accept: application/json’: Indicates that the response should be in JSON format.

-H ‘Authorization: Token ${API_KEY}’: Authenticates with DefectDojo using the API key.

-F ‘test=${testId}’: Specifies the test ID in DefectDojo.

-F ‘file=@${SONAR_REPORT_FILE};type=application/json’: Uploads the JSON file as part of the request.

-F ‘scan_type=${scanType}’: Indicates the type of scan being uploaded.

-F ‘tags=${tags}’: Adds any specified tags to the scan in DefectDojo.
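If you want to validate the DefectDojo side before running the full pipeline, for example to confirm the test ID and API key are correct, the same upload can be exercised with a short standalone script. The sketch below assumes DefectDojo’s v2 reimport-scan endpoint, which is the kind of URL defectDojoUrl is expected to hold; the host name, API key, and report path are placeholders.

# upload_to_defectdojo.py: illustrative sketch only; host, API key, and paths are placeholders.
import requests

DOJO_URL = "https://defectdojo.example.com/api/v2/reimport-scan/"  # hypothetical DefectDojo endpoint
API_KEY = "your-defectdojo-api-key"
TEST_ID = "14"                       # the test number copied in Step 7
REPORT_FILE = "sonar-report.json"    # JSON report produced by the earlier stage

with open(REPORT_FILE, "rb") as fh:
    response = requests.post(
        DOJO_URL,
        headers={"Authorization": f"Token {API_KEY}"},
        data={"test": TEST_ID, "scan_type": "SonarQube Scan", "tags": "jenkins"},
        # requests builds the multipart Content-Type (including the boundary) automatically.
        files={"file": (REPORT_FILE, fh, "application/json")},
        timeout=60,
    )

response.raise_for_status()
print("Upload accepted with status", response.status_code)

A successful response confirms that the test ID and API key are valid, so the same values can be dropped into the pipeline’s post stage.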

Step 9: Verify the Integration

After the pipeline executes, the vulnerability report generated by SonarQube will be available in DefectDojo under the corresponding engagement. You can now track, manage, and remediate these vulnerabilities using DefectDojo’s robust features.

Conclusion

Integrating security tools like SonarQube with DefectDojo is a critical step in building a secure DevSecOps pipeline. By automating vulnerability management and integrating it directly into your CI/CD pipeline, you can ensure that security remains a top priority throughout the development process.

In this article, we focused on setting up DefectDojo and integrating it with SonarQube. In future articles, we will cover the integration of additional tools like OWASP ZAP, Trivy, and Dependency-Check. Stay tuned to further enhance your DevSecOps practices.

Frequently Asked Questions (FAQs)

What is DefectDojo?

DefectDojo is an open-source application security vulnerability management tool. It helps organizations manage, track, and remediate security vulnerabilities efficiently. It integrates with a variety of security tools to automate workflows and provides continuous security monitoring throughout the SDLC.

What are the prerequisites for setting up DefectDojo?

The prerequisites for setting up DefectDojo are Docker, to run DefectDojo in containers, and Jenkins, to integrate it into your CI/CD pipelines.

How do I install DefectDojo?

To install DefectDojo:

  1. Download the official DefectDojo repository from GitHub.
  2. Navigate to the path where you downloaded the repository.
  3. Run the docker-compose up -d --build command to build and start the DefectDojo containers and their dependencies.

How do I access DefectDojo after installation?

Once DefectDojo is running, you can access it via a web browser at http://localhost:8080/. Log in with the credentials set during installation, and you can start managing security vulnerabilities by creating products, engagements, and tests.

Can DefectDojo integrate with other security tools?

Yes, DefectDojo integrates seamlessly with other security tools like SonarQube, OWASP ZAP, Trivy, and Dependency-Check. It centralizes vulnerability management across multiple tools, making it an indispensable part of the DevSecOps pipeline.

Get in touch with us and elevate your software security and performance

Unlocking Data Insights with Apache Superset

Introduction

In today’s data-driven world, having the right tools to analyze and visualize data is crucial for making informed decisions. With vast amounts of data generated daily, visualizing it becomes essential for uncovering patterns, trends, and insights. One of the standout solutions in the open-source landscape is Apache Superset, a powerful, user-friendly data exploration and visualization platform that enables users to create, explore, and share interactive visualizations and dashboards. Whether you’re a data scientist, analyst, or business intelligence professional, Apache Superset can significantly enhance your data analysis capabilities. In this blog post, we’ll dive into what Apache Superset is, its key features, architecture, installation process, and use cases, and how you can leverage it to unlock valuable insights from your data.

Apache Superset

Apache Superset is an open-source data exploration and visualization platform originally developed at Airbnb and later donated to the Apache Software Foundation, where it is now a top-level project widely adopted across industries for data analytics and visualization. Superset is designed to be a modern, enterprise-ready business intelligence web application that allows users to explore, analyze, and visualize large datasets. Its intuitive interface lets users quickly create beautiful, interactive visualizations and dashboards from various data sources without needing extensive programming knowledge.

Superset is designed to be lightweight yet feature-rich, offering powerful SQL-based querying, interactive dashboards, and a wide variety of data visualization options—all through an intuitive web-based interface.

Key Features

Rich Data Visualizations

Superset offers a clean and intuitive interface that makes it easy to navigate and create visualizations. The drag-and-drop functionality simplifies building charts and dashboards, making it accessible even to non-technical users. Superset also provides an extensive library of customizable visualizations, from simple bar charts, line charts, pie charts, and scatter plots to more complex visuals like geospatial maps and heatmaps. This flexibility allows users to choose the best way to represent their data, facilitating better analysis and understanding.

  1. Bar Charts: Perfect for comparing different categories of data.
  2. Line Charts: Excellent for time-series analysis.
  3. Heatmaps: Useful for showing data density or intensity.
  4. Geospatial Maps: Visualize location-based data on geographical maps.
  5. Pie Charts, Treemaps, Sankey Diagrams, and More: Additional options for exploring relationships and proportions in the data.

SQL-Based Querying

One of Superset’s most powerful features is its support for SQL-based querying through SQL Lab, a built-in SQL editor. SQL Lab lets users write and execute SQL queries directly against connected databases, explore schemas, and preview data before creating visualizations. It also supports syntax highlighting, autocompletion, and query history, enhancing the SQL writing experience.

Interactive Dashboards

Superset allows users to create interactive dashboards with multiple charts, filters, and data points. These dashboards can be customized and shared across teams to deliver insights interactively. Real-time data updates ensure that the latest metrics are always displayed.

Extensible and Scalable

Apache Superset is highly extensible and can connect to a variety of data sources such as:

  1. SQL-based databases (PostgreSQL, MySQL, Oracle, etc.)
  2. Big Data platforms (Presto, Druid, Hive, and more)
  3. Cloud-native databases (Google BigQuery, Snowflake, Amazon Redshift)

This versatility ensures that users can easily access and analyze their data, regardless of where it is stored. Its architecture supports horizontal scaling, making it suitable for enterprises handling large-scale datasets.
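When registering any of these databases in Superset, you provide a SQLAlchemy connection URI. The examples below are illustrative only; host names, credentials, and database names are placeholders, and each one requires the corresponding driver (psycopg2, PyMySQL, sqlalchemy-bigquery, and so on) to be installed in the Superset environment.

# Illustrative SQLAlchemy URIs for Superset database connections (all values are placeholders).
POSTGRES_URI = "postgresql+psycopg2://analyst:secret@pg-host:5432/salesdb"
MYSQL_URI = "mysql+pymysql://analyst:secret@mysql-host:3306/salesdb"
BIGQUERY_URI = "bigquery://my-gcp-project"   # requires the sqlalchemy-bigquery driver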

Security and Authentication

As an enterprise-ready platform, Superset offers robust security features. It integrates with common authentication protocols (OAuth, OpenID, LDAP) to ensure secure access and provides fine-grained, role-based access control (RBAC), enabling administrators to control access to specific dashboards, charts, and databases. Superset is also designed to scale with your organization, handling large volumes of data and concurrent users.

Low-Code and No-Code Data Exploration

Superset is ideal for both technical and non-technical users. While advanced users can write SQL queries to explore data, non-technical users can use the point-and-click interface to create visualizations without requiring code. This makes it accessible to everyone, from data scientists to business analysts.

Customizable Visualizations

Superset’s visualization framework allows users to modify the look and feel of their charts using custom JavaScript, CSS, and the powerful ECharts and D3.js libraries. This gives users the flexibility to create branded and unique visual representations.

Advanced Analytics

Superset includes features for advanced analytics, such as time-series analysis, trend lines, and complex aggregations. These capabilities enable users to perform in-depth analysis and uncover deeper insights from their data.

Architecture of Apache Superset

Superset’s architecture is modular and designed to be scalable, making it suitable for both small teams and large enterprises.

Here’s a breakdown of its core components:

Frontend (React-based):

Superset’s frontend is built using React, offering a smooth and responsive user interface for creating visualizations and interacting with data. The UI also leverages Bootstrap and other modern JavaScript libraries to enhance the user experience.

Backend (Python/Flask-based):

  1. The backend is powered by Python and Flask, a lightweight web framework. Superset uses SQLAlchemy as the SQL toolkit and Alembic for database migrations.
  2. Superset communicates with databases using SQLAlchemy to execute queries and fetch results.
  3. Celery and Redis can be used for background tasks and asynchronous queries, allowing for scalable query processing.

Metadata Database:

  1. Superset stores information about visualizations, dashboards, and user access in a metadata database. Common choices include PostgreSQL or MySQL.
  2. This database does not store the actual data being analyzed but rather metadata about the analysis (queries, charts, filters, and dashboards).

Caching Layer:

  1. Superset supports caching using Redis or Memcached. Caching improves the performance of frequently queried datasets and dashboards, ensuring faster load times.

Asynchronous Query Execution:

  1. For large datasets, Superset can run queries asynchronously using Celery workers. This prevents the UI from being blocked during long-running queries.

Worker and Beat:

One or more Celery workers execute background tasks such as running async queries, taking snapshots of reports, and sending emails, while a “beat” process acts as the scheduler that tells the workers when to perform their tasks. Most installations use Celery for these components, as sketched below.
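The sketch below shows how these pieces are typically wired together in superset_config.py, assuming a Redis instance reachable at redis:6379. Option names can differ slightly between Superset and Flask-Caching versions, so treat it as a starting point rather than a definitive configuration.

# superset_config.py: minimal caching/async sketch (assumes Redis at redis:6379).
from cachelib.redis import RedisCache

# Chart and dashboard cache, handled through Flask-Caching.
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,
    "CACHE_REDIS_URL": "redis://redis:6379/1",
}

# Celery broker and result backend used by the workers and the beat scheduler.
class CeleryConfig:
    broker_url = "redis://redis:6379/0"
    result_backend = "redis://redis:6379/0"

CELERY_CONFIG = CeleryConfig

# Where SQL Lab stores the results of asynchronous queries.
RESULTS_BACKEND = RedisCache(host="redis", port=6379, key_prefix="superset_results")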

Getting Started with Apache Superset

Installation and Setup

Setting up Apache Superset is straightforward. It can be installed using Docker, pip, or by deploying it on a cloud platform. Here’s a brief overview of the installation process using Docker:

1. Install Docker: Ensure Docker is installed on your machine.

2. Clone the Superset Repository:

git clone https://github.com/apache/superset.git

cd superset

3. Run the Docker Compose Command:

docker-compose -f docker-compose-non-dev.yml up

4. Initialize the Database:

docker exec -it superset_superset-worker_1 superset db upgrade

docker exec -it superset_superset-worker_1 superset init

5. Access Superset: Open your web browser and go to http://localhost:8088 to access the Superset login page.

Configuring the Metadata Storage

The metadata database is where chart and dashboard definitions, user information, logs, and so on are stored. Superset is tested to work with PostgreSQL and MySQL databases. In a Docker Compose installation, this data is stored in a PostgreSQL container volume, while the PyPI installation method uses an on-disk SQLite database. Neither of these is recommended for production instances of Superset; for production, a properly configured, managed, standalone database is recommended. Whatever database you use, plan to back it up regularly. In an upcoming Superset blog, we will go through how to configure Apache Superset with metadata storage.
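As a preview of that upcoming post, pointing Superset at a standalone metadata database usually comes down to setting SQLALCHEMY_DATABASE_URI in superset_config.py. The sketch below assumes a managed PostgreSQL instance; the host, credentials, and database name are placeholders.

# superset_config.py: metadata database sketch (all values are placeholders).
# This database stores dashboards, charts, users, and logs, not the data being analyzed.
SQLALCHEMY_DATABASE_URI = (
    "postgresql+psycopg2://superset:superset-pass@metadata-db.internal:5432/superset"
)

After switching the metadata database, rerun the superset db upgrade and superset init steps from the installation section so the schema and default roles are created in the new database.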

Creating Your First Dashboard

1. Connect to a Data Source: Navigate to the Sources tab and add a new database or table.

2. Explore Data: Use SQL Lab to run queries and explore your data.

3. Create Charts: Go to the Charts tab, choose a dataset, and select a visualization type. Customize your chart using the various configuration options.

4. Build a Dashboard: Combine multiple charts into a cohesive dashboard. Drag and drop charts, add filters, and arrange them to create an interactive dashboard.


Use Cases of Apache Superset

  1. Business Intelligence & Reporting Superset is widely used in organizations for creating BI dashboards that track KPIs, sales, revenue, and other critical metrics. It’s a great alternative to commercial BI tools like Tableau or Power BI, particularly for organizations that prefer open-source solutions.
  2. Data Exploration for Data Science Data scientists can leverage Superset to explore datasets, run queries, and visualize complex relationships in the data before moving to more complex machine learning tasks.
  3. Operational Dashboards Superset can be used to create operational dashboards that track system health, service uptimes, or transaction statuses in real-time. Its ability to connect to various databases and run SQL queries in real time makes it a suitable choice for this use case.
  4. Geospatial Analytics With built-in support for geospatial visualizations, Superset is ideal for businesses that need to analyze location-based data. For example, a retail business can use it to analyze customer distribution or store performance across regions.
  5. E-commerce Data Analysis Superset is frequently used by e-commerce companies to analyze sales data, customer behavior, product performance, and marketing campaign effectiveness.

Advantages of Apache Superset

  1. Open-source and Cost-effective: Being an open-source tool, Superset is free to use and can be customized to meet specific needs, making it a cost-effective alternative to proprietary BI tools.
  2. Rich Customizations: Superset supports extensive visual customizations and can integrate with JavaScript libraries for more advanced use cases.
  3. Easy to Deploy: It’s relatively straightforward to set up on both local and cloud environments.
  4. SQL-based and Powerful: Ideal for organizations with a strong SQL-based querying culture.
  5. Extensible: Can be integrated with other data processing or visualization tools as needed.

Sharing and Collaboration

Superset makes it easy to share your visualizations and dashboards with others. You can export and import dashboards, share links, and embed visualizations in other applications. Additionally, Superset’s role-based access control ensures that users only have access to the data and visualizations they are authorized to view.

Conclusion

Apache Superset is a versatile and powerful tool for data exploration and visualization. Its user-friendly interface, wide range of visualizations, and robust integration capabilities make it an excellent choice for businesses and data professionals looking to unlock insights from their data. Whether you’re just getting started with data visualization or you’re an experienced analyst, Superset provides the tools you need to create compelling and informative visualizations. Give it a try and see how it can transform your data analysis workflow.

You can also get in touch with us, and we will be happy to help with your custom implementations.