Complete Guide: Setting Up Grafana & Prometheus for Monitoring and Observability

Introduction

Monitoring your infrastructure and applications is crucial for maintaining system reliability and performance. In this comprehensive guide, we’ll walk through everything you need to know about Grafana and Prometheus – from understanding subscription options to hands-on configuration with practical labs.

What Are Grafana and Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects and stores metrics as time series data, identified by metric name and key-value pairs called labels.
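
For example, a single time series in Prometheus' text exposition format consists of a metric name, a set of labels, and a sample value (the metric and label names here are purely illustrative):

# one sample of a counter, identified by its name and its labels
http_requests_total{method="GET", handler="/api/orders", status="200"} 1027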

Grafana is an open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts when connected to supported data sources, with Prometheus being one of the most popular integrations.

Subscription and Licensing Options

Grafana Pricing Models

Grafana Cloud (Managed Service)

  • Free Tier: 10,000 metric series, 50GB of logs, 50GB of traces, 3 users
  • Pro Plan: Starting at $8/month per user with increased limits
  • Advanced Plan: Enterprise features with custom pricing

Grafana Enterprise (Self-Hosted)

  • Commercial license with enterprise features
  • Professional support and SLA
  • Advanced authentication, reporting, and security features

Grafana OSS (Open Source)

  • Completely free
  • Community support
  • Core visualization and dashboarding features

Prometheus Licensing

Prometheus is completely open-source under the Apache 2.0 license. However, you might consider:

  • Managed Prometheus services (AWS Managed Prometheus, Google Cloud Managed Prometheus)
  • Commercial support from companies like Grafana Labs or Robusta
  • Enterprise distributions with additional features and support

Installation Methods

Method 1: Docker Installation (Recommended for Labs)

This is the fastest way to get both services running for testing and development.

Prerequisites:

  • Docker and Docker Compose installed
  • Basic understanding of containerization
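
You can confirm both prerequisites from a terminal before continuing:

docker --version
docker-compose --version   # or `docker compose version` with the Compose v2 plugin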

Method 2: Native Installation

Installing directly on your operating system for production environments.

Method 3: Kubernetes Deployment

Using Helm charts or operators for scalable, production-ready deployments.

Method 4: Cloud Managed Services

Leveraging cloud provider managed services for reduced operational overhead.

Lab 1: Quick Start with Docker Compose

Let’s start with a simple setup to get both services running quickly.

Step 1: Create Project Structure

mkdir grafana-prometheus-lab
cd grafana-prometheus-lab
mkdir prometheus grafana

Step 2: Create Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']

Step 3: Create Docker Compose File

Create docker-compose.yml:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Step 4: Launch the Stack

docker-compose up -d

Verification:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000 (admin/admin123)
  • Node Exporter: http://localhost:9100
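
If you prefer to verify from the command line (assuming curl is available on the host), a quick sanity check could look like this:

docker-compose ps                               # all three containers should show "Up"
curl -s http://localhost:9090/-/healthy         # Prometheus health endpoint
curl -s http://localhost:9100/metrics | head    # raw metrics from node-exporter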

Lab 2: Configuring Data Sources and Dashboards

Step 1: Add Prometheus as Data Source in Grafana

  1. Login to Grafana (http://localhost:3000)
  2. Go to Configuration → Data Sources
  3. Click Add data source
  4. Select Prometheus
  5. Set URL to http://prometheus:9090
  6. Click Save & Test
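
If you would rather script this step, the same data source can be created through Grafana's HTTP API. This is a sketch that assumes the admin credentials from the Docker Compose file in Lab 1:

curl -X POST http://admin:admin123@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy","isDefault":true}'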

Step 2: Import a Dashboard

  1. Go to Dashboards → Browse
  2. Click Import
  3. Enter dashboard ID 1860 (Node Exporter Full)
  4. Click Load
  5. Select Prometheus data source
  6. Click Import

Step 3: Create a Custom Dashboard

Let’s create a simple dashboard to monitor our services:

  1. Click + → Dashboard
  2. Click Add visualization
  3. Select Prometheus data source
  4. Enter query: up
  5. Set visualization type to Stat
  6. Title: “Service Status”
  7. Click Apply
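
The up query returns 1 for every target Prometheus scraped successfully and 0 for targets it could not reach, which is why it works well as a simple status panel. A few variations worth trying in the same panel (using the job names defined in Lab 1):

up{job="node-exporter"}    # status of a single job
sum(up) by (job)           # number of healthy targets per job
count(up == 0)             # number of targets currently down (empty if none)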

Lab 3: Advanced Configuration and Alerting

Step 1: Configure Alerting Rules in Prometheus

Create prometheus/alert_rules.yml:

groups:
  - name: example
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
    
    - alert: HighCPUUsage
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

Step 2: Update Prometheus Configuration

Update prometheus/prometheus.yml to include the rules:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']

Step 3: Add Alertmanager to Docker Compose

Add this service to your docker-compose.yml:

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    restart: unless-stopped

Step 4: Configure Alertmanager

Create alertmanager/alertmanager.yml:

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
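
After creating the file, start the new service and validate the configuration with amtool, which is bundled in the Alertmanager image:

docker-compose up -d alertmanager
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml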

Lab 4: Production-Ready Configuration

Step 1: Security Hardening

Update your docker-compose.yml with security best practices:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    user: "65534:65534"
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    user: "472:472"
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD}
      - GF_SECURITY_SECRET_KEY=${GF_SECRET_KEY}
      - GF_DATABASE_TYPE=postgres
      - GF_DATABASE_HOST=postgres:5432
      - GF_DATABASE_NAME=grafana
      - GF_DATABASE_USER=grafana
      - GF_DATABASE_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - monitoring
    depends_on:
      - postgres
    restart: unless-stopped

  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      - POSTGRES_DB=grafana
      - POSTGRES_USER=grafana
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - monitoring
    restart: unless-stopped

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  postgres_data:

Step 2: Environment Variables

Create .env file:

GF_ADMIN_USER=admin
GF_ADMIN_PASSWORD=your_secure_password_here
GF_SECRET_KEY=your_secret_key_here
POSTGRES_PASSWORD=your_postgres_password_here
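
Use strong, randomly generated values instead of the placeholders above; for example (assuming OpenSSL is installed on the host):

openssl rand -base64 32   # generate a value for GF_SECRET_KEY or a database password

Keep the .env file out of version control (add it to .gitignore) so the secrets are never committed alongside the compose file.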

Step 3: Grafana Provisioning

Create grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Create grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/provisioning/dashboards

Monitoring Best Practices

1. Metric Collection Strategy

Choose the Right Metrics:

  • RED Method: Rate, Errors, Duration for services
  • USE Method: Utilization, Saturation, Errors for resources
  • Golden Signals: Latency, traffic, errors, saturation
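
As a concrete illustration of the RED method, the queries below assume an application that exposes the conventional http_requests_total counter and http_request_duration_seconds histogram; substitute your own metric names as needed:

# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration: 95th-percentile latency derived from the histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))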

2. Dashboard Design Principles

  • Start with Overview: High-level metrics first
  • Drill-down Capability: Link dashboards for detailed views
  • Consistent Time Ranges: Use template variables
  • Meaningful Alerts: Avoid alert fatigue

3. Data Retention and Storage

# prometheus.yml: scrape and rule-evaluation cadence
global:
  scrape_interval: 15s       # How often to scrape targets
  evaluation_interval: 15s   # How often to evaluate rules

# Command-line flags: storage retention
--storage.tsdb.retention.time=15d    # Keep data for 15 days
--storage.tsdb.retention.size=10GB   # Limit total on-disk size

4. High Availability Setup

For production environments, consider:

  • Prometheus Federation: Multiple Prometheus instances
  • Grafana Clustering: Load balancing and shared database
  • Remote Storage: Long-term storage solutions like Thanos or Cortex

Troubleshooting Common Issues

Issue 1: Prometheus Targets Down

Symptoms: Targets showing as “DOWN” in Prometheus
Solutions:

  • Check network connectivity between services
  • Verify firewall rules and port accessibility
  • Examine container logs: docker logs prometheus
  • Validate service discovery configuration
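
The targets API is often quicker than the UI for spotting the failing scrape and its last error (assuming curl and jq are available on the host):

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'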

Issue 2: Grafana Data Source Connection Failed

Symptoms: “HTTP Error Bad Gateway” or connection timeouts
Solutions:

  • Verify Prometheus URL is accessible from Grafana container
  • Check docker network configuration
  • Test connection: docker exec grafana curl http://prometheus:9090

Issue 3: Missing Metrics in Dashboards

Symptoms: Empty graphs or “No data” messages
Solutions:

  • Verify metric names in Prometheus: /metrics endpoint
  • Check time range selection in Grafana
  • Examine PromQL query syntax
  • Ensure adequate data retention period

Issue 4: High Resource Usage

Symptoms: High CPU/memory usage by Prometheus
Solutions:

  • Reduce scrape frequency for less critical metrics
  • Implement metric filtering and dropping
  • Configure proper retention policies
  • Consider using recording rules for complex queries
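
As an example of the last point, the CPU expression from the alerting lab can be precomputed as a recording rule so dashboards and alerts reuse the cheaper, pre-aggregated series:

groups:
  - name: cpu_recording
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)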

Next Steps and Advanced Topics

1. Service Discovery

Move beyond static configurations:

  • Kubernetes Service Discovery: Automatic pod discovery
  • Consul Integration: Dynamic service registration
  • File-based Discovery: External configuration management
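
For instance, a minimal Kubernetes pod-discovery job (a sketch using the common prometheus.io/scrape annotation convention) looks like this:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods that opt in via the prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"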

2. Custom Exporters

Create application-specific metrics:

  • Client Libraries: Instrument your applications
  • Custom Exporters: Specialized metric collection
  • Pushgateway: For short-lived jobs
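
For example, a short-lived batch job can push its result to a Pushgateway with nothing more than curl (this assumes a Pushgateway running on its default port 9091, which is not part of the compose files above):

echo "backup_last_success_timestamp_seconds $(date +%s)" | \
  curl --data-binary @- http://localhost:9091/metrics/job/nightly_backup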

3. Advanced Alerting

Enhance your alerting strategy:

  • Multi-dimensional Alerting: Complex alert conditions
  • Alert Routing: Different notifications for different teams
  • Integration: Slack, PagerDuty, OpsGenie webhooks
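
A Slack receiver in Alertmanager, for example, needs only a few lines (the webhook URL below is a placeholder you would replace with one issued by your Slack workspace):

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'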

4. Scaling and Federation

Prepare for growth:

  • Horizontal Scaling: Multiple Prometheus instances
  • Long-term Storage: Thanos, Cortex, or VictoriaMetrics
  • Cross-cluster Monitoring: Federation and remote read/write
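
Federation itself is configured as an ordinary scrape job on the aggregating Prometheus; a minimal sketch (the prometheus-secondary hostname is a placeholder for one of your downstream instances) might look like this:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'
    static_configs:
      - targets: ['prometheus-secondary:9090']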

Conclusion

This comprehensive guide has taken you from basic setup to advanced configuration of Grafana and Prometheus. You’ve learned about subscription options, completed hands-on labs, and explored production-ready configurations.

Key Takeaways:

  • Start with Docker for quick prototyping and learning
  • Focus on meaningful metrics aligned with your business objectives
  • Implement proper security practices from the beginning
  • Plan for scalability and high availability in production environments
  • Continuously iterate on your monitoring strategy based on operational needs

Remember that monitoring is not a one-time setup but an ongoing process that evolves with your infrastructure and applications. Start simple, measure what matters, and continuously improve your observability practices.
