
Cost Analysis with fadvisor

Fadvisor (FinOps Advisor) collects cost data from a cluster and feeds Crane's cost visualization and optimization. It currently implements cost collection for Tencent Cloud TKE.

https://github.com/gocrane/fadvisor

This walkthrough deploys onto a multi-node cluster set up with KubeKey. Since KubeSphere ships with its own Prometheus, fadvisor can be deployed in the "integrate with existing monitoring components" mode (https://github.com/gocrane/fadvisor#integrated-with-existing-monitoring-components); Grafana still needs to be deployed separately.

Deploying Grafana

Grafana is installed with Helm; edit the default datasource configuration in override_values.yaml first:

wget https://raw.githubusercontent.com/gocrane/helm-charts/main/integration/grafana/override_values.yaml
## Configure grafana datasources
## ref: http://docs.grafana.org/administration/provisioning/#datasources
##
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      # replace the chart default with the Prometheus bundled with KubeSphere monitoring
      #url: http://prometheus-server.crane-system.svc.cluster.local:8080
      url: http://prometheus-k8s.kubesphere-monitoring-system.svc.cluster.local:9090
      access: proxy
      isDefault: true
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install grafana -f ./override_values.yaml -n crane-system --create-namespace grafana/grafana

Look up the Grafana admin password (the username is admin; the password often defaults to admin as well):

kubectl get secret --namespace crane-system grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Forward the Grafana port to your local machine:

export POD_NAME=$(kubectl get pods --namespace crane-system -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace crane-system port-forward $POD_NAME 3000

Then visit http://127.0.0.1:3000 to access Grafana.

Deploying fadvisor

Install fadvisor with Helm:

helm repo add crane https://gocrane.github.io/helm-charts
helm install fadvisor -n crane-system --create-namespace crane/fadvisor

Create the configuration locally:

config.ini
[credentials]
clusterId={your cluster id}
appId=app1
secretId={your cloud provider credential secret id}
secretKey={your cloud provider credential secret key}
[clientProfile]
defaultLimit=100
defaultLanguage=zh-CN
defaultTimeoutSeconds=10
region={your cluster region, such as ap-beijing, ap-shanghai, ap-guangzhou, ap-shenzhen and so on, you can find region name in your cloud provider console}
domainSuffix=internal.tencentcloudapi.com
scheme=
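The file above is plain INI. As a sanity check of the layout before encoding it, the format can be parsed with Python's configparser (the values below are placeholders, not real credentials, and this only validates the INI syntax, not what fadvisor's own parser accepts):

```python
import configparser

# Placeholder config mirroring the structure of config.ini above.
sample = """\
[credentials]
secretId=AKIDxxxx
secretKey=xxxx
[clientProfile]
defaultLimit=100
defaultLanguage=zh-CN
defaultTimeoutSeconds=10
region=ap-guangzhou
domainSuffix=internal.tencentcloudapi.com
scheme=
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample)          # raises on malformed INI
print(cfg["credentials"]["secretId"])    # → AKIDxxxx
print(cfg["clientProfile"]["region"])    # → ap-guangzhou
```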

Reading the source shows that clusterId and appId are not actually used and can be left empty; only secretId and secretKey are needed.

After filling in the parameters, base64-encode the file:

cat config.ini | base64

Edit the fadvisor secret and replace the config value with the encoded content:

k edit secrets fadvisor -n crane-system

After the edit, the content mounted at /etc/cloud/config inside fadvisor changes, but the pod needs to be deleted (and recreated) for the change to take effect:

k delete pod fadvisor-5d69776545-gpw65 -n crane-system
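Instead of the interactive edit, the secret patch payload can be built non-interactively. This is a sketch: the data key "config" is an assumption inferred from the /etc/cloud/config mount path, so verify it with `kubectl get secret fadvisor -n crane-system -o yaml` first:

```python
import base64
import json

# Placeholder config.ini content -- substitute your real credentials.
config_ini = "[credentials]\nsecretId=xxx\nsecretKey=yyy\n"

# Kubernetes Secret .data values are base64-encoded strings.
encoded = base64.b64encode(config_ini.encode()).decode()
patch = json.dumps({"data": {"config": encoded}})
print(patch)

# Apply the patch and restart, e.g.:
#   kubectl patch secret fadvisor -n crane-system --type merge -p "$PATCH"
#   kubectl delete pod -n crane-system -l app.kubernetes.io/instance=fadvisor
```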

Prometheus Scraping

Check the Prometheus setup:

# k get pod -n kubesphere-monitoring-system
NAME                                               READY   STATUS    RESTARTS   AGE
alertmanager-main-0                                2/2     Running   0          146d
alertmanager-main-1                                2/2     Running   0          146d
alertmanager-main-2                                2/2     Running   0          134d
kube-state-metrics-59758cbc95-5sffx                3/3     Running   0          146d
node-exporter-78zp8                                2/2     Running   0          146d
node-exporter-kwk59                                2/2     Running   4          146d
node-exporter-l7x6d                                2/2     Running   4          146d
node-exporter-phd4j                                2/2     Running   0          146d
notification-manager-deployment-54c787dd6d-dv2pb   2/2     Running   0          146d
notification-manager-deployment-54c787dd6d-zm46c   2/2     Running   0          146d
notification-manager-operator-58fb9784c8-g54w9     2/2     Running   0          146d
prometheus-k8s-0                                   2/2     Running   1          146d
prometheus-k8s-1                                   2/2     Running   1          134d
prometheus-k8s-2                                   2/2     Running   1          146d
prometheus-operator-5c5db79546-pkr77               2/2     Running   0          146d

KubeSphere deploys Prometheus via prometheus-operator, so the scrape target can be added directly with a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fadvisor
  namespace: crane-system
# selects the target pods and configures the scrape job
spec:
  endpoints:
  - interval: 5m
    port: http
    path: /metrics
    honorLabels: true
  namespaceSelector:
    matchNames:
    - crane-system
  selector:
    matchLabels:
      app.kubernetes.io/instance: fadvisor

Create recording rules to aggregate the data every hour:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-rules
spec:
  groups:
  - name: costs.rules
    interval: 3600s
    rules:
    - expr: |
        sum(label_replace(irate(container_cpu_usage_seconds_total{container!="POD", container!="",image!=""}[1h]), "node", "$1", "instance", "(.*)")) by (container, pod, node, namespace) * on (node) group_left() avg(avg_over_time(node_cpu_hourly_cost[1h])) by (node)
      record: namespace:container_cpu_usage_costs_hourly:sum_rate
    - expr: |
        sum(label_replace(avg_over_time(container_memory_working_set_bytes{container!="POD",container!="",image!=""}[1h]), "node", "$1", "instance", "(.*)")) by (container, pod, node, namespace) / 1024.0 / 1024.0 / 1024.0 * on (node) group_left() avg(avg_over_time(node_ram_hourly_cost[1h])) by (node)
      record: namespace:container_memory_usage_costs_hourly:sum_rate
    - expr: |
        avg(avg_over_time(node_cpu_hourly_cost[1h])) by (node)
      record: node:node_cpu_hourly_cost:avg
    - expr: |
        avg(avg_over_time(node_ram_hourly_cost[1h])) by (node)
      record: node:node_ram_hourly_cost:avg
    - expr: |
        avg(avg_over_time(node_total_hourly_cost[1h])) by (node)
      record: node:node_total_hourly_cost:avg
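The first two rules boil down to "usage × unit price": average CPU cores used over the hour times the node's per-core-hour cost, and working-set memory in GiB times the per-GiB-hour cost. A toy recalculation with assumed sample values (not real cluster data):

```python
# Assumed sample inputs mirroring the metrics used in the rules above.
cpu_usage_cores = 0.25                 # avg cores used by a container over 1h
mem_working_set_bytes = 2 * 1024**3    # 2 GiB working set
node_cpu_hourly_cost = 0.04            # node_cpu_hourly_cost: price per core-hour
node_ram_hourly_cost = 0.01            # node_ram_hourly_cost: price per GiB-hour

# namespace:container_cpu_usage_costs_hourly:sum_rate (per container)
cpu_cost = cpu_usage_cores * node_cpu_hourly_cost
# namespace:container_memory_usage_costs_hourly:sum_rate (bytes -> GiB first)
mem_cost = mem_working_set_bytes / 1024**3 * node_ram_hourly_cost

print(round(cpu_cost, 4), round(mem_cost, 4))  # → 0.01 0.02
```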