Log Strata — Redefining Observability with Infra Monitor

Introduction

In an era dominated by cloud-native applications, observability has emerged as the backbone of modern software systems. It goes beyond traditional monitoring, focusing on understanding and improving application performance in real-time. The Log Strata Infra Monitor Challenge by STGI empowers developers to embrace this transformation by designing their own observability platform — a blend of creativity, technical ingenuity, and problem-solving.

This isn’t just about coding; it’s about crafting a solution that addresses the ever-evolving complexities of cloud infrastructure. Let’s break down the journey of building an observability platform from scratch.

The Mission

The challenge is deceptively simple yet profound: Build a Cloud Monitoring Tool that collects, processes, and visualizes logs and metrics from a Kubernetes cluster. This requires blending multiple disciplines — DevOps, backend engineering, and UI/UX design.

Core Requirements

  1. Kubernetes Cluster: Deployed locally using Minikube.

  2. Applications: Two apps powered by PostgreSQL and MySQL databases.

  3. Log Collection: Tools like Fluent Bit, FluentD, or Logstash.

  4. Backend Service: Process logs and metrics, storing them in a time series database (InfluxDB).

  5. Custom Dashboard: A web-based interface to visualize data with search and filtering capabilities.

Why This Challenge Matters

Modern cloud systems generate an overwhelming amount of data, making it crucial to design systems that don’t just monitor but also make sense of this data. Observability platforms serve as the eyes and ears of organizations, helping teams:

  • Detect anomalies in real-time.

  • Optimize performance through actionable insights.

  • Enhance reliability by identifying root causes faster.

STGI’s challenge enables participants to tackle these real-world problems in a simulated environment, laying the foundation for impactful solutions in production-scale systems.

How to Build Your Infra Monitor

Step 1: Laying the Foundation

Docker, Kubectl and Minikube Setup

Start with Docker, the building block of containerized environments, and Minikube, which brings Kubernetes to your local machine. Together, they form the staging ground for your platform.

Install Docker:

You can refer to the installation guide here: medium.com/@sugam.arora23/docker-unboxed-yo..

Install Kubectl:

You can refer to the installation guide here: medium.com/@sugam.arora23/effortlessly-inst..

Install Minikube:

You can refer to the installation guide here: medium.com/@sugam.arora23/minikube-setup-gu..
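Once all three are installed, a quick sanity check (assuming default installations) confirms the staging ground is ready:

# Verify the toolchain
docker --version
kubectl version --client
minikube version

# Start a local cluster and confirm the node is Ready
minikube start
kubectl get nodes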

Step 2: Deploying the Applications

Deploy two sample applications with PostgreSQL and MySQL backends. Each app represents a microservice within the ecosystem. Ensure proper database connections and verify application logs post-deployment.

Setting Up Kubernetes Deployments for app-1 and app-2

Prerequisites

Ensure Minikube is set up and running on your system (Windows or Ubuntu).

Stage 1: Clone the Sample Applications

  1. Create Project Directories

mkdir ~/postgresql
mkdir ~/mysql

2. Clone the Applications. We used the two repositories below; substitute the URLs of your own applications if you prefer.

cd ~/postgresql
git clone https://github.com/tanishsummit/PyEditorial
cd ~/mysql
git clone https://github.com/manjillama/docker-node-crud-mysql

Stage 2: Create Kubernetes Deployments and Services

1. Create Deployment and Service for MySQL (app-1)

  1. Create YAML File for MySQL Deployment

touch mysql-deployment.yaml

2. Add the Following YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - image: mysql:5.6
        name: mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: password
        ports:
        - containerPort: 3306
          name: mysql
---
apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  ports:
  - port: 3306
  selector:
    app: mysql

3. Apply the YAML File

kubectl apply -f mysql-deployment.yaml

2. Create Deployment and Service for PostgreSQL (app-2)

  1. Create YAML File for PostgreSQL Deployment

touch postgresql-deployment.yaml

2. Add the Following YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - image: postgres:9.6
        name: postgres
        env:
        - name: POSTGRES_PASSWORD
          value: password
        ports:
        - containerPort: 5432
          name: postgres
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  ports:
  - port: 5432
  selector:
    app: postgres

3. Apply the YAML File

kubectl apply -f postgresql-deployment.yaml

3. Create Deployments and Services for Applications

a. For App-1 (Connected to MySQL)

  1. Create YAML File for App-1 Deployment

touch app-1-deployment.yaml

2. Add the Following YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-1
spec:
  selector:
    matchLabels:
      app: app-1
  template:
    metadata:
      labels:
        app: app-1
    spec:
      containers:
      - name: app-1
        image: <app-1-docker-image>
        ports:
        - containerPort: 8080
        env:
        - name: MYSQL_HOST
          value: mysql
        - name: MYSQL_USER
          value: root
        - name: MYSQL_PASSWORD
          value: password
---
apiVersion: v1
kind: Service
metadata:
  name: app-1
spec:
  type: NodePort
  selector:
    app: app-1
  ports:
  - port: 8080
    targetPort: 8080
    nodePort: 30001

3. Apply the YAML File

kubectl apply -f app-1-deployment.yaml

b. For App-2 (Connected to PostgreSQL)

  1. Create YAML File for App-2 Deployment

touch app-2-deployment.yaml

2. Add the Following YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-2
spec:
  selector:
    matchLabels:
      app: app-2
  template:
    metadata:
      labels:
        app: app-2
    spec:
      containers:
      - name: app-2
        image: <app-2-docker-image>
        ports:
        - containerPort: 8081
        env:
        - name: POSTGRES_HOST
          value: postgres
        - name: POSTGRES_USER
          value: postgres
        - name: POSTGRES_PASSWORD
          value: password
---
apiVersion: v1
kind: Service
metadata:
  name: app-2
spec:
  type: NodePort
  selector:
    app: app-2
  ports:
  - port: 8081
    targetPort: 8081
    nodePort: 30002

3. Apply the YAML File

kubectl apply -f app-2-deployment.yaml
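Before moving on, confirm that the database and application pods are all up:

kubectl get deployments
kubectl get pods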

Stage 3: Deploy the Logger Hackathon App

  1. Pull the Docker Image (Optional)

docker pull tjain598/logger-hackathon:latest

2. Verify the image:

docker images

3. Create YAML File for Logger Hackathon Deployment

touch logger-hackathon-deployment.yaml

4. Add the Following YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-hackathon
spec:
  selector:
    matchLabels:
      app: logger-hackathon
  template:
    metadata:
      labels:
        app: logger-hackathon
    spec:
      containers:
      - name: logger-hackathon
        image: tjain598/logger-hackathon:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: logger-hackathon
spec:
  type: NodePort
  selector:
    app: logger-hackathon
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30003

5. Apply the YAML File

kubectl apply -f logger-hackathon-deployment.yaml

Stage 4: Accessing the Applications

Use the following commands to access the applications via Minikube:

minikube service app-1
minikube service app-2
minikube service logger-hackathon

You should now be able to access app-1, app-2, and the Logger Hackathon application on your local Kubernetes cluster, connected to their respective services.
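If you only need the URL (for example, to curl an endpoint or feed it to another tool), Minikube can print it without opening a browser:

minikube service app-1 --url
kubectl get svc   # lists the assigned NodePorts (30001-30003)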

Stage 5: Verify Logs or Output for the Logger App

  1. Check Pod Logs

kubectl logs <logger-pod-name>

Replace <logger-pod-name> with the actual pod name obtained from:

kubectl get pods

2. Describe Deployment (Optional)

kubectl describe deployment logger-hackathon

Caution: You will almost certainly hit deployment issues along the way, so it is important to know how to troubleshoot them. You can read about some of the issues we faced here:

medium.com/@sugam.arora23/revolutionizing-d..
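As a starting point, these kubectl commands surface most of the common failure modes (image pull errors, crash loops, misconfigured environment variables):

kubectl get pods                          # watch for CrashLoopBackOff / ImagePullBackOff
kubectl describe pod <pod-name>           # the Events section shows scheduling and pull errors
kubectl logs <pod-name> --previous        # logs from the last crashed container
kubectl get events --sort-by=.metadata.creationTimestamp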

Step 3: Mastering Log Collection

Logs are the pulse of any system, providing invaluable insights into its health and behavior. Use Fluent Bit, FluentD, or Logstash to collect logs from Kubernetes pods and forward them to your backend service. Since we had a monolithic service, we went with Fluent Bit: it is an efficient, resource-friendly, and scalable solution for log collection and processing.

Key Components of Fluent Bit

1. Input Plugins

Input plugins define where Fluent Bit will collect logs from.

  • Common Sources: files (via the tail plugin), TCP, Syslog, Kubernetes container logs, and Docker.

  • Example: To collect container logs:

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Tag  kube.*

2. Filters

Filters transform or enrich log data before it is forwarded.

  • Example: Adding Kubernetes metadata to logs.

[FILTER]
    Name      kubernetes
    Match     kube.*
    Merge_Log On

3. Output Plugins

Output plugins define where the logs are sent.

  • Common Destinations: Elasticsearch, stdout, AWS S3, and Kafka.

  • Example: Forward logs to stdout.

[OUTPUT]
    Name  stdout
    Match *

Deploying Fluent Bit in Kubernetes

To efficiently collect logs in a Kubernetes cluster, Fluent Bit is typically deployed as a DaemonSet, ensuring a logging agent runs on each node. Here’s a detailed breakdown of the required configuration.

1. ConfigMap

A ConfigMap stores Fluent Bit’s configuration. Below is an example configuration to collect container logs, system logs, and enrich logs with Kubernetes metadata:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: default
  labels:
    app: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        1
        Log_Level    info
        Daemon       off
        Parsers_File parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/pods/*/*/*.log
        Exclude_Path      /var/log/pods/*/*fluent-bit*.log, /var/log/pods/*/*coredns*.log, /var/log/pods/*/*kube-controller-manager*.log, /var/log/pods/*/*storage-provisioner*.log, /var/log/pods/*/*kube-scheduler*.log, /var/log/pods/*/*kube-proxy*.log, /var/log/pods/*/*kube-apiserver*.log, /var/log/pods/*/*etcd*.log  
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        DB                /var/log/flb_kube.db

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Merge_Log           On
        Keep_Log            On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [FILTER]
        Name                geoip
        Match               kube.*
        Key                 ip_address               
        Add_Field           true                     
        Include             city_name,country_code

    [FILTER]
        Name                record_transformer
        Match               kube.*
        Record              new_time ${time}
        Time_Format         "%Y-%m-%d %H:%M:%S"

    [FILTER]
        Name                grep
        Match               kube.*
        Regex               log  "ERROR|WARN|INFO"

    [FILTER]
        Name                lua
        Match               kube.*
        script              extract_fields.lua  
        call                extract_fields

    [FILTER]
        Name              grep
        Match             kube.*
        Regex             kubernetes.pod_name nginx.*

    [FILTER]
        Name              grep
        Match             kube.*
        Regex             kubernetes.namespace_name production.*

    [FILTER]
        Name              grep
        Match             kube.*
        Regex             kubernetes.labels.app webapp.*

    [OUTPUT]
        Name          influxdb
        Match         *
        Host          influxdb
        Port          8086
        bucket        SegFault
        org           STGI
        sequence_tag  _seq
        http_token    <INFLUXDB_TOKEN>

  parsers.conf: |
    [PARSER]
        Name              docker
        Format            json
        Time_Key          time
        Time_Format       %Y-%m-%dT%H:%M:%S.%L
        Time_Keep         On

  extract_fields.lua: |
    function extract_fields(tag, timestamp, record)
        -- Custom logic to extract additional fields from the log.
        -- Here we assume the record may carry active_time/total_time counters.
        local active_time = tonumber(record["active_time"]) or 0
        local total_time  = tonumber(record["total_time"]) or 0

        -- Calculate CPU utilization as a percentage
        local cpu_utilization = 0
        if total_time > 0 then
            cpu_utilization = (active_time / total_time) * 100
        end

        -- Add the CPU utilization to the log record
        record["cpu_utilization"] = cpu_utilization
        return 1, timestamp, record
    end

2. DaemonSet

The DaemonSet ensures Fluent Bit runs on all nodes in the cluster.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: default
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      name: fluent-bit
  template:
    metadata:
      labels:
        name: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:latest
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config-volume
          mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config-volume
        configMap:
          name: fluent-bit-config

3. RBAC

RBAC (Role-Based Access Control) in Fluent Bit is used to grant the necessary permissions for Fluent Bit to access Kubernetes resources securely. Fluent Bit uses these permissions to enrich log data with metadata from Kubernetes, such as pod names, namespaces, labels, and annotations.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: default

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources:
    - namespaces
    - pods
  verbs:
    - get
    - list
    - watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: default
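With the ConfigMap, DaemonSet, and RBAC resources defined, apply them and check that Fluent Bit is running on every node (the file names below are assumptions; use whatever you saved the manifests as):

kubectl apply -f fluent-bit-rbac.yaml
kubectl apply -f fluent-bit-configmap.yaml
kubectl apply -f fluent-bit-daemonset.yaml
kubectl get daemonset fluent-bit
kubectl logs -l name=fluent-bit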

Accessing Logs

Fluent Bit uses the following paths to access logs in a Kubernetes environment:

  • Container Logs: /var/log/containers/*.log

  • Docker Logs: /var/lib/docker/containers/*/*.log

  • System Logs: /var/log/*.log

Example Configuration:

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Tag kube.containers.*

Advanced Use Cases

1. Forwarding Logs to Elasticsearch

To forward logs to Elasticsearch:

[OUTPUT]
    Name  es
    Match *
    Host  elasticsearch-host
    Port  9200
    Index fluent-bit

2. Using Custom Parsers

Fluent Bit supports parsers for custom log formats. Example parser for JSON logs:

[PARSER]
    Name json
    Format json

3. Security

  • Use RBAC roles to restrict access to Kubernetes resources.

  • Mount sensitive directories as read-only.

Best Practices

  • Specify Versions: Avoid using latest in Docker images; use a specific version.

  • Use Dedicated Namespace: Deploy Fluent Bit in a separate namespace like kube-logging.

  • Monitor Performance: Regularly check resource usage and optimize configurations for high throughput.

  • Test Configurations: Test Fluent Bit in a staging environment before deploying to production.

Step 4: Building the Brains — The Backend

The problem statement called for a time-series database, and there is no better fit than InfluxDB as the backend for Fluent Bit: it stores the log data, metrics, and other telemetry that Fluent Bit collects. Configuring Fluent Bit to use InfluxDB means getting the setup right, enabling seamless integration, and following best practices for storing and querying log data effectively. We chose InfluxDB for its capacity to handle high-velocity data like logs and metrics while keeping that data easy to query and analyze.

1. Understanding InfluxDB

InfluxDB is a high-performance time-series database designed for storing and querying time-stamped data. Fluent Bit can forward log and metric data to InfluxDB, which can then be used for analysis, visualization, or further processing.

Key Features:

  • Stores data as time-series points with associated fields and tags.

  • Optimized for high ingest rates.

  • Integrates with visualization tools like Grafana.

2. Setting Up InfluxDB for Fluent Bit

Stage 1: Install InfluxDB

  • Docker-based Installation:

docker run -d --name influxdb -p 8086:8086 influxdb:2.0

  • Binary Installation: Download and install from the InfluxDB downloads page.

Stage 2: Create a Database and Token

  • InfluxDB v1.x: Create a database for Fluent Bit logs from the influx shell:

CREATE DATABASE fluentbit_logs

  • InfluxDB v2.x: Create a bucket using the InfluxDB UI or CLI:

influx bucket create -n fluentbit_logs -o your-org
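The v2.x output plugin also needs an API token with write access to the bucket. For a local setup, an all-access token is the quickest option (scope it down for anything shared):

influx auth create -o your-org --all-access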

3. Configuring Fluent Bit to Use InfluxDB

To send data to InfluxDB, Fluent Bit uses the InfluxDB output plugin.

Stage 1: Add the InfluxDB Output Plugin

Update the Fluent Bit configuration file (fluent-bit.conf):

[OUTPUT]
    Name           influxdb
    Match          *
    Host           <INFLUXDB_HOST>
    Port           <INFLUXDB_PORT>
    Database       fluentbit_logs
    Org            your-org          # For InfluxDB v2.x
    Bucket         fluentbit_logs    # For InfluxDB v2.x
    HTTP_Token     <INFLUXDB_TOKEN>  # For InfluxDB v2.x
    Sequence_Tag   sequence_id       # Optional, used for series ordering
    tls            On                # Enable TLS for secure connections
    tls.verify     Off               # Skip TLS certificate verification
    Time_Precision ms                # Use millisecond precision for timestamps

Replace <INFLUXDB_HOST>, <INFLUXDB_PORT>, and <INFLUXDB_TOKEN> with your InfluxDB details.

Stage 2: Set the Input Plugin

Ensure you have an input plugin to collect logs or metrics. For example, if collecting logs from Kubernetes:

[INPUT]
    Name       tail
    Path       /var/log/containers/*.log
    Tag        kube.*
    DB         /var/log/fluentbit.db
    Mem_Buf_Limit 5MB

4. Storing and Querying Data in InfluxDB

InfluxDB organizes data using:

  • Measurements: Analogous to tables in a relational database.

  • Tags: Indexed metadata for efficient querying.

  • Fields: Actual data points (not indexed).

  • Timestamps: Every record must have a time.
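Once data is flowing, a quick Flux query (run in the InfluxDB v2.x UI or CLI) confirms what Fluent Bit wrote; the bucket name below matches the one in our ConfigMap:

from(bucket: "SegFault")
  |> range(start: -1h)
  |> limit(n: 10)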

5. Best Practices

Optimize Fluent Bit Output

  • Use Tag in Fluent Bit to organize logs by namespace, pod, or other attributes.

  • Use compression to reduce the size of logs sent to InfluxDB.

InfluxDB Performance Tips

  • Limit the number of fields per measurement to avoid performance degradation.

  • Use tags for frequently queried data.

  • Aggregate data before querying or visualization to reduce load.

Integrate with Visualization Tools

Visualize InfluxDB data with tools like Grafana:

  • Add InfluxDB as a data source in Grafana.

  • Create dashboards to display logs and metrics using graphs and tables.

Secure the Integration

  • Use API tokens for authentication in InfluxDB v2.x.

  • Enable TLS for secure data transfer between Fluent Bit and InfluxDB.

6. Example End-to-End Workflow

  1. Collect Logs: Fluent Bit collects logs from Kubernetes pods.

  2. Enrich Logs: Metadata from Kubernetes is added (e.g., namespace, pod labels).

  3. Send to InfluxDB: Logs are forwarded to InfluxDB with tags for efficient querying.

  4. Store and Query: Logs are stored in InfluxDB and queried for debugging or monitoring.

  5. Visualize: Use Grafana to visualize logs for insights.

7. Troubleshooting Tips

  • Connection Issues:

Ensure Fluent Bit’s Host and Port match InfluxDB's settings.

Check network access between Fluent Bit and InfluxDB.

  • Data Not Showing:

Verify the Match directive in Fluent Bit matches the input tags.

Check Fluent Bit logs for errors.

  • InfluxDB High Load:

Use batching in Fluent Bit to reduce write operations:

Batch_Size     500
Flush_Interval 5

For the backend we used Spring Boot. Spring Boot uses SLF4J with Logback as the default logging framework. By default, logs are written to the console, but you can configure Spring Boot to log to files or remote systems. Fluent Bit can collect logs from these files or from the standard output (stdout) and send them to a backend like InfluxDB, Elasticsearch, or a log management system.
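For example, to write the Spring Boot logs to a file that Fluent Bit can tail (the path below is just an illustration; any path mounted into the pod works):

# application.properties
logging.file.name=/var/log/app/application.log
logging.pattern.file=%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %logger{36} - %msg%n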

Step 5: Crafting the Visual Interface

Your dashboard is where the magic happens — a place where raw data transforms into meaningful insights. We built a custom web-based dashboard using ReactJS. The interface should allow users to:

  • Search logs with keywords or time filters.

  • Visualize data through graphs, charts, and tables.

  • Track trends with real-time metric displays.

Key Design Principles:

  • Keep it minimalistic yet informative.

  • Ensure it’s responsive for mobile and desktop users.

  • Prioritize interactivity with dynamic filters and widgets.

1. Key Metrics Visualized on the Dashboard

1.1 Error Rates

  • Purpose: Error rate monitoring allows you to track the number of errors occurring within a specified period, such as HTTP 5xx responses, application errors, or failed requests.

Visualization:

  • Line Chart: To track error rates over time.

  • Bar Chart: Comparing error rates per endpoint or service.

  • Error Heatmap: To show periods of high error occurrence, highlighting potential issues.

Data Sources:

  • Logs or metrics from the backend, possibly integrated via Fluent Bit or a logging tool like Sentry.

  • Metrics from API responses (status codes).

1.2 Database Query Performance

  • Purpose: Database performance metrics are crucial for understanding how efficiently queries are being executed, the time taken, and any bottlenecks.

Visualization:

  • Bar Chart or Line Graph: Displaying the average query execution time over time.

  • Pie Chart: Visualizing query distribution by type (SELECT, INSERT, UPDATE, DELETE).

  • Heatmap: Indicating peak database load periods.

Data Sources:

  • Collected from the backend via database logs or query performance tools like New Relic or Prometheus.

  • Database monitoring tools can expose this data through APIs.

1.3 Traffic and Resource Usage

  • Purpose: Traffic and resource usage help you monitor how much load the application is handling and how efficiently resources (CPU, memory, bandwidth) are utilized.

Visualization:

  • Line Charts: Displaying real-time traffic and resource usage (CPU, memory, bandwidth).

  • Gauge Charts: For real-time indicators of resource usage such as CPU load or memory consumption.

  • Area Chart: Showing the correlation between traffic spikes and resource consumption.

Data Sources:

  • Prometheus, Grafana, or AWS CloudWatch can be used for real-time monitoring of infrastructure metrics.

  • Web server logs (nginx, apache) and performance monitoring tools.

2. Building the Dashboard in ReactJS

To build a sophisticated and user-friendly dashboard in ReactJS, you’ll need to break it down into components that handle the rendering of the different visualizations. Here’s how you can structure it:

2.1 Project Structure

/src
  /components
    /Dashboard
      Dashboard.js
    /ErrorRateChart
      ErrorRateChart.js
    /DbPerformanceChart
      DbPerformanceChart.js
    /TrafficUsageChart
      TrafficUsageChart.js
  /services
    metricsService.js  // Handles API calls to fetch data
  /utils
    chartUtils.js      // Contains utility functions for data processing and chart formatting
  /App.js
  /index.js

2.2 Example Code for Dashboard (ReactJS)

Here’s a simplified example of how the dashboard structure might look:

Dashboard.js

import React, { useState, useEffect } from 'react';
import ErrorRateChart from './ErrorRateChart';
import DbPerformanceChart from './DbPerformanceChart';
import TrafficUsageChart from './TrafficUsageChart';
import { getMetrics } from '../../services/metricsService';

const Dashboard = () => {
  const [metricsData, setMetricsData] = useState({
    errorRates: [],
    dbPerformance: [],
    trafficUsage: [],
  });

  useEffect(() => {
    const fetchData = async () => {
      const data = await getMetrics();
      setMetricsData(data);
    };
    fetchData();
  }, []);

  return (
    <div className="dashboard">
      <h1>Project Metrics Dashboard</h1>

      <div className="chart-container">
        <ErrorRateChart data={metricsData.errorRates} />
        <DbPerformanceChart data={metricsData.dbPerformance} />
        <TrafficUsageChart data={metricsData.trafficUsage} />
      </div>
    </div>
  );
};

export default Dashboard;

2.3 Example Chart Component (ErrorRateChart.js)

import React from 'react';
import {
  Chart as ChartJS,
  CategoryScale,
  LinearScale,
  PointElement,
  LineElement,
  Tooltip,
  Legend,
} from 'chart.js';
import { Line } from 'react-chartjs-2';

// chart.js v3+ requires registering the components a chart uses
ChartJS.register(CategoryScale, LinearScale, PointElement, LineElement, Tooltip, Legend);

const ErrorRateChart = ({ data }) => {
  const chartData = {
    labels: data.map(item => item.timestamp),  // e.g., time intervals
    datasets: [
      {
        label: 'Error Rate',
        data: data.map(item => item.errorCount),  // Number of errors at each time point
        fill: false,
        borderColor: 'rgba(255, 99, 132, 1)',
        tension: 0.1,
      },
    ],
  };

  return (
    <div className="chart">
      <h3>Error Rates</h3>
      <Line data={chartData} />
    </div>
  );
};

export default ErrorRateChart;

2.4 Example Metrics Service (metricsService.js)

import axios from 'axios';

export const getMetrics = async () => {
  try {
    const response = await axios.get('http://<BACKEND_URL>/api/metrics');
    return response.data;
  } catch (error) {
    console.error('Error fetching metrics:', error);
    return {
      errorRates: [],
      dbPerformance: [],
      trafficUsage: [],
    };
  }
};

3. Frontend Metrics Visualization Libraries

You can use charting libraries in ReactJS for visualizing metrics, such as:

Chart.js (used in the example above)

  • Provides flexible and beautiful visualizations, such as bar, line, and pie charts.

Recharts:

  • React-specific charting library with a focus on simplicity and responsiveness.

Victory:

  • Another React-specific charting library for building interactive visualizations.

D3.js:

  • For complex and custom visualizations, though it might require more setup and understanding of SVGs.

Chart.js Example for Line Graph (for error rates or traffic)

npm install react-chartjs-2 chart.js

In your React component, you can import react-chartjs-2 and use its Line chart component as demonstrated in the ErrorRateChart.js example.

4. Backend API for Metrics

Your backend API (e.g., built with Spring Boot) should expose endpoints that provide the required metrics data. Here’s an example Spring Boot controller:

MetricsController.java

@RestController
@RequestMapping("/api/metrics")
public class MetricsController {
    @Autowired
    private MetricsService metricsService;

    @GetMapping
    public ResponseEntity<Map<String, Object>> getMetrics() {
        Map<String, Object> metricsData = metricsService.getMetricsData();
        return ResponseEntity.ok(metricsData);
    }
}

MetricsService.java

@Service
public class MetricsService {
    public Map<String, Object> getMetricsData() {
        Map<String, Object> metricsData = new HashMap<>();
        metricsData.put("errorRates", getErrorRates());
        metricsData.put("dbPerformance", getDbPerformance());
        metricsData.put("trafficUsage", getTrafficUsage());
        return metricsData;
    }

    private List<ErrorRate> getErrorRates() {
        // Logic to fetch error rates from logs or databases
        return new ArrayList<>();
    }

    private List<DbPerformance> getDbPerformance() {
        // Logic to fetch database performance metrics
        return new ArrayList<>();
    }

    private List<TrafficUsage> getTrafficUsage() {
        // Logic to fetch traffic/resource usage
        return new ArrayList<>();
    }
}

This data could be fetched from a database, logs, or monitoring tools like Prometheus or Grafana, and can be serialized into JSON and sent to the frontend.

5. Deployment and Scalability Considerations

  • Real-time Data: If you need real-time updates on the dashboard, consider using WebSockets or Server-Sent Events (SSE) to push updates to the React frontend.

  • Data Caching: Use caching mechanisms (e.g., Redis) to avoid fetching large datasets repeatedly.

  • Scalability: Ensure your backend APIs are optimized to handle large amounts of traffic, and use load balancing if necessary.
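For the caching bullet above, Spring's cache abstraction keeps the change small. A minimal sketch, assuming spring-boot-starter-cache is on the classpath (backed by Redis or the default in-memory provider) and @EnableCaching is declared on a configuration class; CachedMetricsService and its helper are illustrative names:

import java.util.HashMap;
import java.util.Map;

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class CachedMetricsService {

    // Results land in the "metrics" cache; repeated dashboard requests are
    // served from the cache instead of re-querying InfluxDB or log storage.
    @Cacheable("metrics")
    public Map<String, Object> getMetricsData() {
        return expensiveAggregation();
    }

    // Placeholder for the real aggregation over logs/metrics
    private Map<String, Object> expensiveAggregation() {
        return new HashMap<>();
    }
}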

Step 6: Going Beyond Basics

1. AWS Resource Logs Integration

Purpose: AWS Resource Logs provide real-time insights into the health and performance of your cloud infrastructure. By integrating AWS CloudWatch logs and metrics, you can enrich your platform with vital cloud-based data.

Steps to Integrate AWS Resource Logs:

Set Up AWS CloudWatch:

  • Enable CloudWatch Logs for your AWS services (EC2, Lambda, S3, etc.) to collect logs and metrics.

  • Set up CloudWatch Alarms to notify you of any anomalies or errors in the logs.

Fetching CloudWatch Logs:

  • Use the AWS SDK for Java (or the appropriate SDK for your backend) to retrieve CloudWatch logs.

  • API call example for fetching logs:

import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.*;

public List<FilteredLogEvent> fetchLogs(String logGroupName) {
    AWSLogs client = AWSLogsClientBuilder.defaultClient();
    FilterLogEventsRequest request = new FilterLogEventsRequest()
            .withLogGroupName(logGroupName)
            .withLimit(50);
    FilterLogEventsResult result = client.filterLogEvents(request);
    return result.getEvents();
}

Display Logs in the Dashboard:

  • On the frontend, create components to display the logs in real time. A plain React table works well to start; a data-grid library like React Table can be layered on later.

Example Component for Displaying Logs:

import React, { useState, useEffect } from 'react';
import { fetchCloudWatchLogs } from './services/awsLogsService';

const AwsLogsDashboard = () => {
  const [logs, setLogs] = useState([]);

  useEffect(() => {
    const fetchLogs = async () => {
      const data = await fetchCloudWatchLogs();
      setLogs(data);
    };
    fetchLogs();
  }, []);

  // Render the CloudWatch events in a simple table;
  // timestamp is epoch milliseconds, message is the raw log line
  return (
    <div>
      <h2>AWS Resource Logs</h2>
      <table>
        <thead>
          <tr>
            <th>Timestamp</th>
            <th>Log Message</th>
          </tr>
        </thead>
        <tbody>
          {logs.map((log, index) => (
            <tr key={index}>
              <td>{log.timestamp}</td>
              <td>{log.message}</td>
            </tr>
          ))}
        </tbody>
      </table>
    </div>
  );
};

export default AwsLogsDashboard;

2. Alerting System

Purpose: Alerting notifies users about critical events (such as errors, performance degradation, or resource limits being reached) in real-time.

Steps to Implement an Alerting System:

Backend:

  • Integrate AWS CloudWatch Alarms to set thresholds for your resources (e.g., high CPU usage, low memory, errors in logs).

  • Trigger these alarms to notify your backend, which can then communicate with the frontend.

AWS CloudWatch Alarm Configuration:

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.*;

public void createAlarm() {
    AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

    PutMetricAlarmRequest alarmRequest = new PutMetricAlarmRequest()
        .withAlarmName("HighCPUUsage")
        .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
        .withEvaluationPeriods(1)
        .withMetricName("CPUUtilization")
        .withNamespace("AWS/EC2")
        .withPeriod(60)
        .withStatistic(Statistic.Average)
        .withThreshold(80.0)
        .withActionsEnabled(true)
        .withAlarmActions("arn:aws:sns:region:account-id:my-topic");

    cloudWatch.putMetricAlarm(alarmRequest);
}

Frontend (React):

  • Set up a WebSocket connection or Server-Sent Events (SSE) to listen for real-time alerts.

  • Display an alert in the UI when a critical event is triggered (e.g., high CPU usage, error spikes).

  • Example Code for Alert Notification:

import React, { useState, useEffect } from 'react';

const AlertComponent = () => {
  const [alerts, setAlerts] = useState([]);

  useEffect(() => {
    const eventSource = new EventSource('http://<BACKEND_URL>/api/alerts');

    eventSource.onmessage = (event) => {
      const newAlert = JSON.parse(event.data);
      setAlerts((prevAlerts) => [...prevAlerts, newAlert]);
    };

    return () => {
      eventSource.close();
    };
  }, []);

  return (
    <div>
      <h2>System Alerts</h2>
      {alerts.length > 0 ? (
        <ul>
          {alerts.map((alert, index) => (
            <li key={index} className={`alert ${alert.type}`}>
              {alert.message}
            </li>
          ))}
        </ul>
      ) : (
        <p>No alerts at the moment</p>
      )}
    </div>
  );
};

export default AlertComponent;

3. Predictive Analysis Using Machine Learning

Purpose: Predictive analysis helps forecast system performance, resource usage, and potential failures, enabling proactive actions before problems arise.

Steps to Implement Predictive Analysis:

Data Collection:

  • Collect historical metrics such as CPU usage, error rates, traffic volumes, etc., over a significant period.

  • Store these in a database or time-series database like InfluxDB.

Training a Machine Learning Model:

  • Use a machine learning framework like TensorFlow, Scikit-learn, or PyTorch to create models that can forecast metrics.

  • Example: Linear Regression or Time Series Forecasting (e.g., ARIMA or LSTM for sequential data).

  • Python Example (using Scikit-learn):

from sklearn.linear_model import LinearRegression
import numpy as np

# Example of training a simple model to predict CPU usage
data = np.array([[1, 50], [2, 55], [3, 60], [4, 65]])  # [hour, CPU Usage]
X = data[:, 0].reshape(-1, 1)  # Hour
y = data[:, 1]  # CPU usage

model = LinearRegression()
model.fit(X, y)

# Predict CPU usage for hour 5
prediction = model.predict([[5]])
print(f"Predicted CPU usage at hour 5: {prediction}")

Backend Integration:

  • Expose the machine learning model through an API endpoint.

  • Use Flask or FastAPI in Python to serve the model predictions.

  • Example FastAPI Endpoint for Predictions:

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()

# Load the regression model trained in the previous step
# (persisted with joblib.dump; the filename here is just an example)
model = joblib.load("cpu_usage_model.joblib")

class PredictionRequest(BaseModel):
    hour: int

@app.post("/predict")
def predict(request: PredictionRequest):
    prediction = model.predict([[request.hour]])
    return {"predicted_cpu_usage": float(prediction[0])}

Frontend Integration:

  • On the React frontend, request predictions from the backend and visualize the forecasted data.

  • Example of Fetching and Displaying Predictions:

import React, { useState } from 'react';

const PredictiveDashboard = () => {
  const [prediction, setPrediction] = useState(null);

  const fetchPrediction = async (hour) => {
    const response = await fetch('http://<BACKEND_URL>/predict', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ hour }),
    });
    const data = await response.json();
    setPrediction(data.predicted_cpu_usage);
  };

  return (
    <div>
      <h2>Predictive Performance Analysis</h2>
      <button onClick={() => fetchPrediction(5)}>Predict for Hour 5</button>
      {prediction && <p>Predicted CPU Usage: {prediction}%</p>}
    </div>
  );
};

export default PredictiveDashboard;

The Impact of Innovation

At the heart of every successful hackathon is the drive for innovation, and our team’s experience in designing a state-of-the-art observability platform is a perfect example of this. A well-crafted observability solution is not just a technical fix — it serves as a force multiplier, helping teams manage intricate systems with ease. During the hackathon, we took on the challenge of building this innovative platform, and through this journey, we not only pushed the boundaries of technology but also enhanced our skills and problem-solving abilities. Here’s a closer look at the impact this innovation has had on us as participants:

Hands-On Expertise with Kubernetes, Log Management, and Full-Stack Development

In the hackathon, we dove deep into the world of Kubernetes, log management, and full-stack development, each of which played a pivotal role in the success of our observability platform. Kubernetes, a powerful tool for orchestrating containers, enabled us to streamline the deployment and scaling of our system. We learned to harness its capabilities to ensure our platform was both efficient and resilient, crucial for dealing with the complex environments we aimed to monitor.

Log management was another critical aspect of our solution. We tackled the challenge of collecting, storing, and analyzing logs in real-time to ensure we could quickly identify and address issues within the system. This aspect of the project not only improved our technical acumen but also made us realize the importance of clear, structured logging in production environments, especially when dealing with large-scale applications.

Additionally, we took on both frontend and backend development, ensuring that our observability platform was not just functional but also user-friendly. We gained invaluable experience in building intuitive user interfaces that allow teams to visualize logs, monitor metrics, and trace transactions effectively. This full-stack approach helped us understand the end-to-end process of building robust systems, preparing us for future challenges in web and software development.

Problem-Solving Skills for Tackling Real-World Challenges

The hackathon environment itself provided a fast-paced, high-pressure setting that pushed our problem-solving skills to the limit. We encountered several production-level challenges that tested our creativity, adaptability, and resourcefulness. From debugging complex issues to optimizing system performance under tight deadlines, we learned to think on our feet and find solutions that not only met but exceeded the expectations of the challenge. These problem-solving skills are now a core part of our toolkit, allowing us to approach any future project with a sense of confidence and capability.

Working under the constraints of a hackathon also made us realize the importance of time management and prioritizing tasks, ensuring that we could focus on the most crucial aspects of the platform while delivering a high-quality solution within the given time frame.

A Portfolio Project Showcasing Depth and Breadth of Knowledge

By the end of the hackathon, we were left with a complete observability platform that not only demonstrated our technical skills but also stood as a testament to our ability to innovate and solve real-world problems. This project has become a significant addition to our portfolios, highlighting both our in-depth understanding of observability tools and technologies and our ability to integrate them into a cohesive, functional system.

This portfolio piece is more than just a project — it’s a reflection of our growth as developers, problem-solvers, and innovators. It showcases our ability to work collaboratively, tackle challenges head-on, and deliver a working solution that meets real-world needs.

Conclusion: Empowering Future Innovators

The Log Strata Infra Monitor Challenge is far more than just a hackathon — it’s a chance to redefine observability on your own terms. Throughout the journey, we didn’t just focus on solving technical problems; we worked to shape the future of cloud monitoring. By collecting, processing, and visualizing data in innovative ways, we were able to build a platform that transforms raw information into actionable insights.

This hackathon has been an incredible opportunity for us to push the boundaries of what’s possible in the realm of observability. It’s not just about mastering tools and technologies; it’s about unleashing creativity, exploring new approaches, and contributing to the evolution of cloud systems. We’re not just participants in a challenge — we’re active creators of the next generation of observability platforms.

So, as we wrap up this experience, we’re inspired to continue rolling up our sleeves and pushing forward. The work we’ve done here lays the foundation for future innovations, and we’re excited to see how this journey unfolds as we continue building platforms that drive meaningful insights and transformation in the world of cloud monitoring.

Thank you for taking the time to read my blog. Your feedback is immensely valuable to me. Please feel free to share your thoughts and suggestions.