Complete Guide: Setting Up Grafana & Prometheus for Monitoring and Observability

Introduction

Monitoring your infrastructure and applications is crucial for maintaining system reliability and performance. In this comprehensive guide, we’ll walk through everything you need to know about Grafana and Prometheus – from understanding subscription options to hands-on configuration with practical labs.

What Are Grafana and Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects and stores metrics as time series data, identified by metric name and key-value pairs called labels.
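
For example, a single time series in Prometheus' text exposition format consists of a metric name, a set of labels, and a sample value (the metric and label names here are purely illustrative):

# one sample of a counter, identified by its name and its labels
http_requests_total{method="GET", handler="/api/orders", status="200"} 1027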

Grafana is an open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts when connected to supported data sources, with Prometheus being one of the most popular integrations.

Subscription and Licensing Options

Grafana Pricing Models

Grafana Cloud (Managed Service)

  • Free Tier: 10,000 metric series, 50GB of logs, 50GB of traces, 3 users
  • Pro Plan: Starting at $8/month per user with increased limits
  • Advanced Plan: Enterprise features with custom pricing

Grafana Enterprise (Self-Hosted)

  • Commercial license with enterprise features
  • Professional support and SLA
  • Advanced authentication, reporting, and security features

Grafana OSS (Open Source)

  • Completely free
  • Community support
  • Core visualization and dashboarding features

Prometheus Licensing

Prometheus is completely open-source under the Apache 2.0 license. However, you might consider:

  • Managed Prometheus services (AWS Managed Prometheus, Google Cloud Managed Prometheus)
  • Commercial support from companies like Grafana Labs or Robusta
  • Enterprise distributions with additional features and support

Installation Methods

Method 1: Docker Installation (Recommended for Labs)

This is the fastest way to get both services running for testing and development.

Prerequisites:

  • Docker and Docker Compose installed
  • Basic understanding of containerization
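
You can confirm both prerequisites from a terminal before continuing:

docker --version
docker-compose --version   # or `docker compose version` with the Compose v2 plugin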

Method 2: Native Installation

Installing directly on your operating system for production environments.

Method 3: Kubernetes Deployment

Using Helm charts or operators for scalable, production-ready deployments.

Method 4: Cloud Managed Services

Leveraging cloud provider managed services for reduced operational overhead.

Lab 1: Quick Start with Docker Compose

Let’s start with a simple setup to get both services running quickly.

Step 1: Create Project Structure

mkdir grafana-prometheus-lab
cd grafana-prometheus-lab
mkdir prometheus grafana

Step 2: Create Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']

Step 3: Create Docker Compose File

Create docker-compose.yml:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Step 4: Launch the Stack

docker-compose up -d

Verification:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000 (admin/admin123)
  • Node Exporter: http://localhost:9100
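
If you prefer to verify from the command line (assuming curl is available on the host), a quick sanity check could look like this:

docker-compose ps                               # all three containers should show "Up"
curl -s http://localhost:9090/-/healthy         # Prometheus health endpoint
curl -s http://localhost:9100/metrics | head    # raw metrics from node-exporter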

Lab 2: Configuring Data Sources and Dashboards

Step 1: Add Prometheus as Data Source in Grafana

  1. Login to Grafana (http://localhost:3000)
  2. Go to Configuration → Data Sources
  3. Click Add data source
  4. Select Prometheus
  5. Set URL to http://prometheus:9090
  6. Click Save & Test
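
If you would rather script this step, the same data source can be created through Grafana's HTTP API. This is a sketch that assumes the admin credentials from the Docker Compose file in Lab 1:

curl -X POST http://admin:admin123@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy","isDefault":true}'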

Step 2: Import a Dashboard

  1. Go to Dashboards → Browse
  2. Click Import
  3. Enter dashboard ID 1860 (Node Exporter Full)
  4. Click Load
  5. Select Prometheus data source
  6. Click Import

Step 3: Create a Custom Dashboard

Let’s create a simple dashboard to monitor our services:

  1. Click + → Dashboard
  2. Click Add visualization
  3. Select Prometheus data source
  4. Enter query: up
  5. Set visualization type to Stat
  6. Title: “Service Status”
  7. Click Apply
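
The up query returns 1 for every target Prometheus scraped successfully and 0 for targets it could not reach, which is why it works well as a simple status panel. A few variations worth trying in the same panel (using the job names defined in Lab 1):

up{job="node-exporter"}    # status of a single job
sum(up) by (job)           # number of healthy targets per job
count(up == 0)             # number of targets currently down (empty if none)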

Lab 3: Advanced Configuration and Alerting

Step 1: Configure Alerting Rules in Prometheus

Create prometheus/alert_rules.yml:

groups:
  - name: example
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
    
    - alert: HighCPUUsage
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

Step 2: Update Prometheus Configuration

Update prometheus/prometheus.yml to include the rules:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']

Step 3: Add Alertmanager to Docker Compose

Add this service to your docker-compose.yml:

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    restart: unless-stopped

Step 4: Configure Alertmanager

Create alertmanager/alertmanager.yml:

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
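
After creating the file, start the new service and validate the configuration with amtool, which is bundled in the Alertmanager image:

docker-compose up -d alertmanager
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml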

Lab 4: Production-Ready Configuration

Step 1: Security Hardening

Update your docker-compose.yml with security best practices:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    user: "65534:65534"
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    user: "472:472"
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD}
      - GF_SECURITY_SECRET_KEY=${GF_SECRET_KEY}
      - GF_DATABASE_TYPE=postgres
      - GF_DATABASE_HOST=postgres:5432
      - GF_DATABASE_NAME=grafana
      - GF_DATABASE_USER=grafana
      - GF_DATABASE_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - monitoring
    depends_on:
      - postgres
    restart: unless-stopped

  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      - POSTGRES_DB=grafana
      - POSTGRES_USER=grafana
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - monitoring
    restart: unless-stopped

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  postgres_data:

Step 2: Environment Variables

Create .env file:

GF_ADMIN_USER=admin
GF_ADMIN_PASSWORD=your_secure_password_here
GF_SECRET_KEY=your_secret_key_here
POSTGRES_PASSWORD=your_postgres_password_here
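
Use strong, randomly generated values instead of the placeholders above; for example (assuming OpenSSL is installed on the host):

openssl rand -base64 32   # generate a value for GF_SECRET_KEY or a database password

Keep the .env file out of version control (add it to .gitignore) so the secrets are never committed alongside the compose file.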

Step 3: Grafana Provisioning

Create grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Create grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/provisioning/dashboards

Monitoring Best Practices

1. Metric Collection Strategy

Choose the Right Metrics:

  • RED Method: Rate, Errors, Duration for services
  • USE Method: Utilization, Saturation, Errors for resources
  • Golden Signals: Latency, traffic, errors, saturation
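
As a concrete illustration of the RED method, the queries below assume an application that exposes the conventional http_requests_total counter and http_request_duration_seconds histogram; substitute your own metric names as needed:

# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration: 95th-percentile latency derived from the histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))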

2. Dashboard Design Principles

  • Start with Overview: High-level metrics first
  • Drill-down Capability: Link dashboards for detailed views
  • Consistent Time Ranges: Use template variables
  • Meaningful Alerts: Avoid alert fatigue

3. Data Retention and Storage

# prometheus.yml: scrape and rule-evaluation cadence
global:
  scrape_interval: 15s       # How often to scrape targets
  evaluation_interval: 15s   # How often to evaluate rules

# Command-line flags: storage retention
--storage.tsdb.retention.time=15d    # Keep data for 15 days
--storage.tsdb.retention.size=10GB   # Limit total on-disk size

4. High Availability Setup

For production environments, consider:

  • Prometheus Federation: Multiple Prometheus instances
  • Grafana Clustering: Load balancing and shared database
  • Remote Storage: Long-term storage solutions like Thanos or Cortex

Troubleshooting Common Issues

Issue 1: Prometheus Targets Down

Symptoms: Targets showing as “DOWN” in Prometheus
Solutions:

  • Check network connectivity between services
  • Verify firewall rules and port accessibility
  • Examine container logs: docker logs prometheus
  • Validate service discovery configuration
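
The targets API is often quicker than the UI for spotting the failing scrape and its last error (assuming curl and jq are available on the host):

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'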

Issue 2: Grafana Data Source Connection Failed

Symptoms: “HTTP Error Bad Gateway” or connection timeouts
Solutions:

  • Verify Prometheus URL is accessible from Grafana container
  • Check docker network configuration
  • Test connection: docker exec grafana curl http://prometheus:9090

Issue 3: Missing Metrics in Dashboards

Symptoms: Empty graphs or “No data” messages
Solutions:

  • Verify metric names in Prometheus: /metrics endpoint
  • Check time range selection in Grafana
  • Examine PromQL query syntax
  • Ensure adequate data retention period

Issue 4: High Resource Usage

Symptoms: High CPU/memory usage by Prometheus
Solutions:

  • Reduce scrape frequency for less critical metrics
  • Implement metric filtering and dropping
  • Configure proper retention policies
  • Consider using recording rules for complex queries
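
As an example of the last point, the CPU expression from the alerting lab can be precomputed as a recording rule so dashboards and alerts reuse the cheaper, pre-aggregated series:

groups:
  - name: cpu_recording
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)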

Next Steps and Advanced Topics

1. Service Discovery

Move beyond static configurations:

  • Kubernetes Service Discovery: Automatic pod discovery
  • Consul Integration: Dynamic service registration
  • File-based Discovery: External configuration management
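
For instance, a minimal Kubernetes pod-discovery job (a sketch using the common prometheus.io/scrape annotation convention) looks like this:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods that opt in via the prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"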

2. Custom Exporters

Create application-specific metrics:

  • Client Libraries: Instrument your applications
  • Custom Exporters: Specialized metric collection
  • Pushgateway: For short-lived jobs
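
For example, a short-lived batch job can push its result to a Pushgateway with nothing more than curl (this assumes a Pushgateway running on its default port 9091, which is not part of the compose files above):

echo "backup_last_success_timestamp_seconds $(date +%s)" | \
  curl --data-binary @- http://localhost:9091/metrics/job/nightly_backup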

3. Advanced Alerting

Enhance your alerting strategy:

  • Multi-dimensional Alerting: Complex alert conditions
  • Alert Routing: Different notifications for different teams
  • Integration: Slack, PagerDuty, OpsGenie webhooks
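
A Slack receiver in Alertmanager, for example, needs only a few lines (the webhook URL below is a placeholder you would replace with one issued by your Slack workspace):

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'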

4. Scaling and Federation

Prepare for growth:

  • Horizontal Scaling: Multiple Prometheus instances
  • Long-term Storage: Thanos, Cortex, or VictoriaMetrics
  • Cross-cluster Monitoring: Federation and remote read/write
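
Federation itself is configured as an ordinary scrape job on the aggregating Prometheus; a minimal sketch (the prometheus-secondary hostname is a placeholder for one of your downstream instances) might look like this:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'
    static_configs:
      - targets: ['prometheus-secondary:9090']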

Conclusion

This comprehensive guide has taken you from basic setup to advanced configuration of Grafana and Prometheus. You’ve learned about subscription options, completed hands-on labs, and explored production-ready configurations.

Key Takeaways:

  • Start with Docker for quick prototyping and learning
  • Focus on meaningful metrics aligned with your business objectives
  • Implement proper security practices from the beginning
  • Plan for scalability and high availability in production environments
  • Continuously iterate on your monitoring strategy based on operational needs

Remember that monitoring is not a one-time setup but an ongoing process that evolves with your infrastructure and applications. Start simple, measure what matters, and continuously improve your observability practices.
