Visibility of Backend Systems

Preface

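I have been doing backend development at Tencent for almost a year now. I have shipped many releases and chased down many problems: I have been woken in the middle of the night by a success rate dropping to 0, I have been bitten by services running under high load, and I have stepped into a fair number of pitfalls. If I had to sum up the things that matter most in backend development, system visibility would certainly be on the list. It really is that important. The same goes for any system that runs 7 * 24: without monitoring, there is no 7 * 24.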

This article draws on 《后台系统可见性衡量》 ("Measuring Backend System Visibility") by my Tencent colleague unlikezhang, and on Peter Bourgon's "Metrics, tracing, and logging".

First, here are the ways I know of to increase visibility.

log

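method	desc	pros	cons
printf-style	Common in one-off programs written as a student	Very simple	Output is not persisted and floods the screen. Almost unusable in production.
local log	Write logs to the local machine	Simple, persisted	Production environments typically keep only error logs, and the logs are scattered across every machine in the cluster
dedicated log service	Send logs to a dedicated log-agent, which forwards them to a storage cluster	Can keep full logs; supports full logs for dyed (tagged) requests	Needs a dedicated service to do this (hardly a drawback)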
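As a rough illustration of "dyeing", the sketch below uses Python's standard `logging` module to keep only error logs for ordinary users while keeping every log line for dyed users. The `DYED_USERS` set, the `user_id` field, and the logger name are hypothetical and stand in for whatever a real log-agent would use; the in-memory stream stands in for the storage cluster.

```python
import io
import logging

# Hypothetical set of "dyed" user IDs whose requests are logged in full.
DYED_USERS = {"user-42"}


class DyeFilter(logging.Filter):
    """Pass ERROR and above for everyone; pass every level for dyed users."""

    def filter(self, record):
        user_id = getattr(record, "user_id", None)
        return record.levelno >= logging.ERROR or user_id in DYED_USERS


def build_logger(stream):
    """Build a logger that writes to `stream` through the dye filter."""
    logger = logging.getLogger("dye_demo")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s user=%(user_id)s %(message)s")
    )
    handler.addFilter(DyeFilter())
    logger.addHandler(handler)
    return logger


def demo_dyed_logging():
    """Emit three log lines and return the ones that survive the filter."""
    stream = io.StringIO()
    logger = build_logger(stream)
    logger.debug("cache miss", extra={"user_id": "user-1"})       # dropped
    logger.debug("cache miss", extra={"user_id": "user-42"})      # kept: dyed user
    logger.error("backend timeout", extra={"user_id": "user-1"})  # kept: error level
    return stream.getvalue().splitlines()
```

Filtering at the handler keeps the calling code unchanged: the service logs at whatever level it likes, and the dye list decides what is actually kept.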

statistic

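method	desc
attribute counting	Records the value of a single attribute, such as CPU load, disk usage, current process count, request volume, or inbound traffic. This is key-value level monitoring: you add monitoring for the attributes you care about in the program and increment and report them yourself
call statistics	Records the callerId, calleeId, ip, delay, retcode, and isSuccess of every external call, and analyzes how the service is running on that basis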
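The two rows above can be sketched in a few lines of Python. This is only a minimal in-memory illustration: the class names `AttributeCounter`, `CallRecord`, and `CallStats` are made up for the example, and a real system would report to a monitoring backend rather than keep everything in process memory.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


class AttributeCounter:
    """Key-value attribute counting: the program increments and reports itself."""

    def __init__(self):
        self._values = Counter()

    def add(self, name, amount=1):
        self._values[name] += amount

    def get(self, name):
        return self._values[name]


@dataclass
class CallRecord:
    """One external call, with the fields listed in the table above."""
    caller_id: str
    callee_id: str
    ip: str
    delay_ms: float
    retcode: int
    is_success: bool


class CallStats:
    """Collects call records and summarizes them per (caller, callee) pair."""

    def __init__(self):
        self._records = defaultdict(list)

    def report(self, record):
        self._records[(record.caller_id, record.callee_id)].append(record)

    def summary(self, caller_id, callee_id):
        records = self._records.get((caller_id, callee_id), [])
        if not records:
            return {"total": 0, "success_rate": None, "avg_delay_ms": None}
        total = len(records)
        successes = sum(1 for r in records if r.is_success)
        return {
            "total": total,
            "success_rate": successes / total,
            "avg_delay_ms": sum(r.delay_ms for r in records) / total,
        }
```

A caller would invoke `report` once per outbound call and read `summary` periodically to see the success rate and average delay between two services.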

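The above lists 5 common ways to observe how a system is running. The log methods are fine-grained, essentially at the code level. The statistic methods are coarser, mostly at the system level, which is why common alerts are built by matching rules against statistic-level data.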
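Rule matching against statistic-level data can be as simple as comparing each reported value with a threshold. The sketch below is one possible shape, not a description of any particular alerting system; the `AlertRule` type, the metric names, and the thresholds are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass

# Supported comparison operators for a rule.
_OPS = {"<": operator.lt, ">": operator.gt}


@dataclass(frozen=True)
class AlertRule:
    """A hypothetical threshold rule over one named statistic."""
    metric: str
    op: str          # "<" or ">"
    threshold: float
    message: str


def evaluate(rules, snapshot):
    """Return one alert string for every rule that the snapshot violates."""
    fired = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is None:
            continue  # no data for this metric in the current snapshot
        if _OPS[rule.op](value, rule.threshold):
            fired.append(f"{rule.message}: {rule.metric}={value}")
    return fired


RULES = [
    AlertRule("success_rate", "<", 0.99, "success rate dropped"),
    AlertRule("cpu_load", ">", 0.9, "cpu load too high"),
]
```

Calling `evaluate(RULES, {"success_rate": 0.5, "cpu_load": 0.3})` would fire only the success-rate rule, since the CPU load is under its threshold.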

What high visibility is good for

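The usual way to investigate a problem is:
user complaint / alert --> developer --> check the relevant monitoring data and look at the metrics --> pull logs from the log system --> log in to the machine and grep or tail the logs --> analyze the problem --> locate the problem --> fix the problem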

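Solving a problem is a lot like solving a crime: locating the problem is the most important step, and locating it requires the system to be visible enough. Without that, there is nothing to go on.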

Three aspects of backend visibility

In his article, Peter Bourgon describes system visibility with three terms: [Metrics, tracing, logging].

The original text follows.

metrics :

I think that the defining characteristic of metrics is that they are 
aggregatable: they are the atoms that compose into a single logical gauge, 
counter, or histogram over a span of time. As examples: the current depth of a 
queue could be modeled as a gauge, whose updates aggregate with last-writer-win semantics; 
the number of incoming HTTP requests could be modeled as a counter, whose updates aggregate 
by simple addition; and the observed duration of a request could be modeled into a histogram, 
whose updates aggregate into time-buckets and yield statistical summaries.

logging :

I think that the defining characteristic of logging is that it deals with discrete events. 
As examples: application debug or error messages emitted via a rotated file descriptor through 
syslog to Elasticsearch (or OK Log, nudge nudge); audit-trail events pushed through Kafka to a 
data lake like BigTable; or request-specific metadata pulled from a service call and sent to 
an error tracking service like NewRelic.

tracing :

I think that the single defining characteristic of tracing, then, is that it deals with information 
that is request-scoped. Any bit of data or metadata that can be bound to lifecycle of a single 
transactional object in the system. As examples: the duration of an outbound RPC to a remote 
service; the text of an actual SQL query sent to a database; or the correlation ID of an inbound HTTP request.

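What Peter Bourgon calls metrics is roughly the same as what I call system-level statistic. Visibility at this level means recording counters or other statistical measures, and the monitoring system should ideally present them as real-time charts. This level gives you statistical summaries.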
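The three aggregation behaviors named in the metrics quote (last-writer-wins for a gauge, addition for a counter, bucketing for a histogram) can be shown with a small sketch. The classes below are illustrative only and do not mirror any specific metrics library. For simplicity the histogram buckets observations by value (for example, latency in milliseconds) rather than by time, which is an assumption of this example.

```python
import bisect


class Gauge:
    """Updates aggregate with last-writer-wins semantics."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


class Counter:
    """Updates aggregate by simple addition."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class Histogram:
    """Updates aggregate into buckets with inclusive upper bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One bucket per bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # bisect_left places a value equal to a bound into that bound's bucket.
        index = bisect.bisect_left(self.bounds, value)
        self.counts[index] += 1
        self.total += 1
        self.sum += value

    def mean(self):
        return self.sum / self.total if self.total else None
```

With bounds `[10, 50, 100]`, the four buckets hold values up to 10, up to 50, up to 100, and everything larger, which is enough to chart a latency distribution and its mean.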

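Logging is mainly the runtime log that the program emits itself. As I understand it, tracing works at the request level: it covers every log entry related to handling one request, across all upstream and downstream links. Ideally this is a full log. It is usually provided by a dedicated log server, and usually only for dyed users.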

## Summary

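Monitoring that is detailed, easy to use, and still cheap in performance is genuinely hard to build. Going forward, design should consider all three aspects: metrics, logging, and tracing. Each needs its own design and none should be neglected in favor of another. Only monitoring built this way is multi-dimensional and effective.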