Introduction
Monitoring your infrastructure and applications is crucial for maintaining system reliability and performance. In this comprehensive guide, we’ll walk through everything you need to know about Grafana and Prometheus – from understanding subscription options to hands-on configuration with practical labs.
What Are Grafana and Prometheus?
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects and stores metrics as time series data, identified by metric name and key-value pairs called labels.
Grafana is an open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts when connected to supported data sources, with Prometheus being one of the most popular integrations.
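To make the data model above concrete, here is a minimal Python sketch that splits one sample line of the Prometheus text format into its metric name, labels, and value. The regular expressions are a simplification written for illustration (they do not unescape label values and they ignore optional timestamps), not the parser Prometheus itself uses.

```python
import re

# One sample line looks like: metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


name, labels, value = parse_sample(
    'up{instance="node-exporter:9100",job="node-exporter"} 1'
)
print(name, labels, value)
```

Each distinct combination of metric name and label values is a separate time series, which is why label choices drive storage and query cost.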
Subscription and Licensing Options
Grafana Pricing Models
Grafana Cloud (Managed Service)
- Free Tier: 10,000 series metrics, 50GB logs, 50GB traces, 3 users
- Pro Plan: Starting at $8/month per user with increased limits
- Advanced Plan: Enterprise features with custom pricing
Grafana Enterprise (Self-Hosted)
- Commercial license with enterprise features
- Professional support and SLA
- Advanced authentication, reporting, and security features
Grafana OSS (Open Source)
- Completely free
- Community support
- Core visualization and dashboarding features
Prometheus Licensing
Prometheus is completely open-source under the Apache 2.0 license. However, you might consider:
- Managed Prometheus services (Amazon Managed Service for Prometheus, Google Cloud Managed Service for Prometheus)
- Commercial support from companies like Grafana Labs or Robusta
- Enterprise distributions with additional features and support
Installation Methods
Method 1: Docker Installation (Recommended for Labs)
This is the fastest way to get both services running for testing and development.
Prerequisites:
- Docker and Docker Compose installed
- Basic understanding of containerization
Method 2: Native Installation
Installing directly on your operating system for production environments.
Method 3: Kubernetes Deployment
Using Helm charts or operators for scalable, production-ready deployments.
Method 4: Cloud Managed Services
Leveraging cloud provider managed services for reduced operational overhead.
Lab 1: Quick Start with Docker Compose
Let’s start with a simple setup to get both services running quickly.
Step 1: Create Project Structure
mkdir grafana-prometheus-lab
cd grafana-prometheus-lab
mkdir prometheus grafana
Step 2: Create Prometheus Configuration
Create prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']
Step 3: Create Docker Compose File
Create docker-compose.yml:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
Step 4: Launch the Stack
docker-compose up -d
Verification:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin123)
- Node Exporter: http://localhost:9100
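Beyond opening the URLs above in a browser, target health can be checked from a script. The sketch below reads the response shape of Prometheus's /api/v1/targets endpoint; the helper names and the sample payload are illustrative assumptions, and fetch_targets only works once the lab stack is running.

```python
import json
import urllib.request


def summarize_targets(payload):
    """Map each scrape URL to its health string from a /api/v1/targets response."""
    targets = payload["data"]["activeTargets"]
    return {t["scrapeUrl"]: t["health"] for t in targets}


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running Prometheus (requires the lab stack)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample in the same shape as a real response, for illustration.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://localhost:9090/metrics", "health": "up"},
            {"scrapeUrl": "http://node-exporter:9100/metrics", "health": "down"},
        ]
    },
}
for url, health in summarize_targets(sample).items():
    print(f"{health:5} {url}")
```

Against the live stack, replace the sample with summarize_targets(fetch_targets()).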
Lab 2: Configuring Data Sources and Dashboards
Step 1: Add Prometheus as Data Source in Grafana
- Log in to Grafana (http://localhost:3000)
- Go to Configuration → Data Sources
- Click Add data source
- Select Prometheus
- Set URL to http://prometheus:9090
- Click Save & Test
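The same data source can also be created through Grafana's HTTP API (POST /api/datasources) instead of the UI. The following sketch only builds the request so it can be inspected before sending; the function name is hypothetical, and the credentials are the lab defaults from the Compose file.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password):
    """Build (but do not send) a POST /api/datasources request for Prometheus."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus:9090",  # Compose service name, as in the UI step
        "access": "proxy",                # Grafana's backend performs the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_datasource_request("http://localhost:3000", "admin", "admin123")
print(request.get_method(), request.full_url)
# To send it against the running lab stack: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the payload easy to review and to reuse in automation.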
Step 2: Import a Dashboard
- Go to Dashboards → Browse
- Click Import
- Enter dashboard ID 1860 (Node Exporter Full)
- Click Load
- Select Prometheus data source
- Click Import
Step 3: Create a Custom Dashboard
Let’s create a simple dashboard to monitor our services:
- Click + → Dashboard
- Click Add visualization
- Select Prometheus data source
- Enter query: up
- Set visualization type to Stat
- Title: “Service Status”
- Click Apply
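The up query behind this panel can be run directly against Prometheus's instant-query endpoint (/api/v1/query), which is a quick way to confirm what the Stat panel should show. This is a small sketch with hypothetical helper names and a hand-written sample response in the documented shape.

```python
import urllib.parse


def instant_query_url(base_url, promql):
    """Build the URL for Prometheus's instant-query endpoint."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_instances(payload):
    """Return instances whose `up` sample is 0 in an instant-query response."""
    results = payload["data"]["result"]
    return [
        r["metric"].get("instance", "?")
        for r in results
        if float(r["value"][1]) == 0
    ]


# Sample response: each result carries its labels and a [timestamp, "value"] pair.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "prometheus", "instance": "localhost:9090"},
             "value": [1700000000.0, "1"]},
            {"metric": {"job": "grafana", "instance": "grafana:3000"},
             "value": [1700000000.0, "0"]},
        ],
    },
}
print(instant_query_url("http://localhost:9090", "up"))
print(down_instances(sample))
```

A value of 1 means the last scrape of that target succeeded, and 0 means it failed.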
Lab 3: Advanced Configuration and Alerting
Step 1: Configure Alerting Rules in Prometheus
Create prometheus/alert_rules.yml:
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
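The HighCPUUsage expression is easier to follow with numbers. The sketch below mirrors its arithmetic in plain Python for a hypothetical 4-core host: rate() yields, per core, the fraction of each second spent idle, the average is taken across cores, and the result is subtracted from 100. The counter increases are made-up values.

```python
def cpu_usage_percent(idle_rates):
    """Mirror 100 - (avg(rate(idle)) * 100) for a single instance.

    idle_rates holds, per CPU core, the per-second increase of the idle
    counter: a value between 0 (fully busy) and 1 (fully idle).
    """
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical increase of each core's idle-seconds counter over a 120 s window.
idle_increase = [12.0, 18.0, 30.0, 24.0]
rates = [delta / 120 for delta in idle_increase]
usage = cpu_usage_percent(rates)
print(f"CPU usage: {usage:.1f}%, above threshold: {usage > 80}")
```

With these values the host averages 17.5% idle, so usage is 82.5%. The alert only fires if the condition stays true for the full 5 minutes set by the for clause.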
Step 2: Update Prometheus Configuration
Update prometheus/prometheus.yml to include the rules:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']
Step 3: Add Alertmanager to Docker Compose
Add this service to your docker-compose.yml, under services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    restart: unless-stopped
Step 4: Configure Alertmanager
Create alertmanager/alertmanager.yml:
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: '[email protected]'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    email_configs:
      - to: '[email protected]'
        # email_configs has no subject or body keys: the subject is set under headers, the body under text
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
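To check routing and notifications without waiting for a real outage, a synthetic alert can be posted to Alertmanager's v2 API (POST /api/v2/alerts). This is a sketch under the assumption that Alertmanager is reachable on localhost:9093 as in the Compose service; the helper names and the alert contents are illustrative.

```python
import json
import urllib.request


def build_test_alert(instance):
    """Build a one-alert payload for Alertmanager's POST /api/v2/alerts."""
    return [{
        "labels": {
            "alertname": "InstanceDown",
            "severity": "critical",
            "instance": instance,
        },
        "annotations": {
            "summary": f"Instance {instance} down",
            "description": "Synthetic alert for testing notification routing.",
        },
    }]


def send_alerts(alerts, base_url="http://localhost:9093"):
    """POST the alerts to a running Alertmanager and return the HTTP status."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status


alerts = build_test_alert("node-exporter:9100")
print(json.dumps(alerts, indent=2))
# With the stack running: send_alerts(alerts)
```

Because the route groups by alertname and waits 10s (group_wait), the notification should follow shortly after the alert is accepted.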
Lab 4: Production-Ready Configuration
Step 1: Security Hardening
Update your docker-compose.yml with security best practices:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    user: "65534:65534"
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
      # '--web.enable-admin-api' is left out: it exposes unauthenticated delete and snapshot endpoints
    networks:
      - monitoring
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    user: "472:472"
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD}
      - GF_SECURITY_SECRET_KEY=${GF_SECRET_KEY}
      - GF_DATABASE_TYPE=postgres
      - GF_DATABASE_HOST=postgres:5432
      - GF_DATABASE_NAME=grafana
      - GF_DATABASE_USER=grafana
      - GF_DATABASE_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - monitoring
    depends_on:
      - postgres
    restart: unless-stopped
  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      - POSTGRES_DB=grafana
      - POSTGRES_USER=grafana
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - monitoring
    restart: unless-stopped
networks:
  monitoring:
    driver: bridge
volumes:
  prometheus_data:
  grafana_data:
  postgres_data:
Step 2: Environment Variables
Create a .env file:
GF_ADMIN_USER=admin
GF_ADMIN_PASSWORD=your_secure_password_here
GF_SECRET_KEY=your_secret_key_here
POSTGRES_PASSWORD=your_postgres_password_here
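Rather than typing the placeholder values by hand, the secrets can be generated. The sketch below uses Python's standard secrets module to produce random values for the same four variables; the token lengths are an arbitrary choice for illustration, and the admin user name is kept as in the example above.

```python
import secrets


def generate_env():
    """Return .env content with randomly generated secrets."""
    values = {
        "GF_ADMIN_USER": "admin",
        "GF_ADMIN_PASSWORD": secrets.token_urlsafe(24),
        "GF_SECRET_KEY": secrets.token_urlsafe(32),
        "POSTGRES_PASSWORD": secrets.token_urlsafe(24),
    }
    return "".join(f"{key}={value}\n" for key, value in values.items())


print(generate_env())
```

token_urlsafe only emits letters, digits, hyphens, and underscores, so the values need no quoting in the .env file. Redirect the output to .env and keep that file out of version control.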
Step 3: Grafana Provisioning
Create grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
Create grafana/provisioning/dashboards/dashboard.yml:
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/provisioning/dashboards
Monitoring Best Practices
1. Metric Collection Strategy
Choose the Right Metrics:
- RED Method: Rate, Errors, Duration for services
- USE Method: Utilization, Saturation, Errors for resources
- Golden Signals: Latency, traffic, errors, saturation
2. Dashboard Design Principles
- Start with Overview: High-level metrics first
- Drill-down Capability: Link dashboards for detailed views
- Consistent Time Ranges: Use template variables
- Meaningful Alerts: Avoid alert fatigue
3. Data Retention and Storage
# prometheus.yml: scrape and rule-evaluation intervals
global:
  scrape_interval: 15s       # How often to scrape
  evaluation_interval: 15s   # How often to evaluate rules
# Retention is set with command-line flags, not in prometheus.yml
--storage.tsdb.retention.time=15d    # Keep data for 15 days
--storage.tsdb.retention.size=10GB   # Limit storage size
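To choose sensible retention values, it helps to estimate disk usage first. The Prometheus documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies it; the series count of 10,000 is a hypothetical figure, not a measurement from this lab.

```python
def estimate_disk_bytes(retention_days, active_series, scrape_interval_s,
                        bytes_per_sample=2.0):
    """Estimate TSDB disk usage with the rule of thumb from the Prometheus docs."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical load: 10,000 active series scraped every 15 s, kept for 15 days.
estimate = estimate_disk_bytes(retention_days=15, active_series=10_000,
                               scrape_interval_s=15)
print(f"Estimated disk usage: {estimate / 1e9:.2f} GB")
```

For this load the estimate is about 1.73 GB, well under the 10GB size limit above. Halving the scrape interval doubles the estimate.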
4. High Availability Setup
For production environments, consider:
- Prometheus Federation: Multiple Prometheus instances
- Grafana Clustering: Load balancing and shared database
- Remote Storage: Long-term storage solutions like Thanos or Cortex
Troubleshooting Common Issues
Issue 1: Prometheus Targets Down
Symptoms: Targets showing as “DOWN” in Prometheus
Solutions:
- Check network connectivity between services
- Verify firewall rules and port accessibility
- Examine container logs: docker logs prometheus
- Validate service discovery configuration
Issue 2: Grafana Data Source Connection Failed
Symptoms: “HTTP Error Bad Gateway” or connection timeouts
Solutions:
- Verify Prometheus URL is accessible from Grafana container
- Check docker network configuration
- Test connection: docker exec grafana curl http://prometheus:9090
Issue 3: Missing Metrics in Dashboards
Symptoms: Empty graphs or “No data” messages
Solutions:
- Verify metric names in Prometheus: /metrics endpoint
- Check time range selection in Grafana
- Examine PromQL query syntax
- Ensure adequate data retention period
Issue 4: High Resource Usage
Symptoms: High CPU/memory usage by Prometheus
Solutions:
- Reduce scrape frequency for less critical metrics
- Implement metric filtering and dropping
- Configure proper retention policies
- Consider using recording rules for complex queries
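Before filtering or dropping metrics, it helps to know which ones produce the most series. As a rough sketch, the function below counts sample lines per metric name in the text a target serves on /metrics; the sample text is made up, and the simple string splitting is an assumption that holds for ordinary exposition lines.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count sample lines per metric name, most frequent first."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts.most_common()


sample_text = """\
# TYPE http_requests_total counter
http_requests_total{code="200",path="/"} 10
http_requests_total{code="404",path="/"} 2
http_requests_total{code="200",path="/login"} 7
# TYPE up gauge
up 1
"""
for name, count in series_per_metric(sample_text):
    print(f"{count:4} {name}")
```

Metrics at the top of this list are the first candidates for dropping or for lower scrape frequency.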
Next Steps and Advanced Topics
1. Service Discovery
Move beyond static configurations:
- Kubernetes Service Discovery: Automatic pod discovery
- Consul Integration: Dynamic service registration
- File-based Discovery: External configuration management
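As an illustration of the file-based option, Prometheus can read target groups from JSON files listed under file_sd_configs, where each group is an object with a targets list and optional labels. The sketch below generates such a file body from a small inventory; the inventory and the function name are hypothetical.

```python
import json


def file_sd_groups(targets_by_job):
    """Convert {job: [host:port, ...]} into file_sd target groups."""
    return [
        {"targets": sorted(targets), "labels": {"job": job}}
        for job, targets in sorted(targets_by_job.items())
    ]


# Hypothetical inventory, reusing the lab's service names.
inventory = {
    "node-exporter": ["node-exporter:9100"],
    "grafana": ["grafana:3000"],
}
print(json.dumps(file_sd_groups(inventory), indent=2))
```

Prometheus watches these files for changes, so an external tool can rewrite them to add or remove targets without a restart.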
2. Custom Exporters
Create application-specific metrics:
- Client Libraries: Instrument your applications
- Custom Exporters: Specialized metric collection
- Pushgateway: For short-lived jobs
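To show what an exporter does at its core, here is a minimal sketch using only the Python standard library: it keeps counters in memory and renders them in the Prometheus text format on /metrics. In practice the official client libraries handle this for you; the metric name, label, and port below are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counter state; metric name and label are made up for illustration.
REQUESTS = {"/checkout": 0, "/login": 0}


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Requests handled, by path.",
        "# TYPE app_requests_total counter",
    ]
    for path, count in sorted(REQUESTS.items()):
        lines.append(f'app_requests_total{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted (blocking call)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


REQUESTS["/checkout"] += 3
print(render_metrics())
```

Calling serve() starts the endpoint, after which a scrape job pointing at port 8000 would collect app_requests_total.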
3. Advanced Alerting
Enhance your alerting strategy:
- Multi-dimensional Alerting: Complex alert conditions
- Alert Routing: Different notifications for different teams
- Integration: Slack, PagerDuty, OpsGenie webhooks
4. Scaling and Federation
Prepare for growth:
- Horizontal Scaling: Multiple Prometheus instances
- Long-term Storage: Thanos, Cortex, or VictoriaMetrics
- Cross-cluster Monitoring: Federation and remote read/write
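Federation works by having one Prometheus scrape another's /federate endpoint, selecting series with one or more match[] parameters. The sketch below builds such a URL; the function name is hypothetical and the selectors are examples based on the lab's job names.

```python
import urllib.parse


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urllib.parse.urlencode([("match[]", s) for s in selectors])
    return f"{base_url}/federate?{query}"


url = federate_url("http://prometheus:9090", ['{job="node-exporter"}', "up"])
print(url)
```

In a real setup the same selectors would go under params in the scrape job of the higher-level Prometheus, with metrics_path set to /federate.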
Conclusion
This comprehensive guide has taken you from basic setup to advanced configuration of Grafana and Prometheus. You’ve learned about subscription options, completed hands-on labs, and explored production-ready configurations.
Key Takeaways:
- Start with Docker for quick prototyping and learning
- Focus on meaningful metrics aligned with your business objectives
- Implement proper security practices from the beginning
- Plan for scalability and high availability in production environments
- Continuously iterate on your monitoring strategy based on operational needs
Resources for Continued Learning:
- Prometheus Official Documentation
- Grafana Documentation
- PromLabs Training
- Monitoring and Observability Community
Remember that monitoring is not a one-time setup but an ongoing process that evolves with your infrastructure and applications. Start simple, measure what matters, and continuously improve your observability practices.