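Preface
SkyWalking is a very good APM product, but one problem is really painful in practice: with Elasticsearch-based storage, as soon as something goes wrong with the ES data, the whole SkyWalking web UI becomes unavailable. You then have to stop the agents service by service and redeploy each service before things recover, going through every single one. The same problem shows up when upgrading SkyWalking versions. APM process data like this can safely be discarded anyway; by default SkyWalking keeps trace records for only 90 minutes. So I decided to containerize the SkyWalking deployment so that it can be deployed and upgraded in one step. What follows is the whole process of containerizing SkyWalking.
Goal: run the SkyWalking Docker image in a k8s cluster and serve from there
Building the Docker image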
FROM registry.cn-xx.xx.com/keking/jdk:1.8
ADD apache-skywalking-apm-incubating/ /opt/apache-skywalking-apm-incubating/
# Set the time zone, make the scripts executable, and append a tail command
# to startup.sh so that the container keeps a foreground process.
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
    && echo 'Asia/Shanghai' >/etc/timezone \
    && chmod +x /opt/apache-skywalking-apm-incubating/config/setApplicationEnv.sh \
    && chmod +x /opt/apache-skywalking-apm-incubating/webapp/setWebAppEnv.sh \
    && chmod +x /opt/apache-skywalking-apm-incubating/bin/startup.sh \
    && echo "tail -fn 100 /opt/apache-skywalking-apm-incubating/logs/webapp.log" >> /opt/apache-skywalking-apm-incubating/bin/startup.sh
EXPOSE 8080 10800 11800 12800
# Fill in the run-time configuration first, then start SkyWalking.
CMD /opt/apache-skywalking-apm-incubating/config/setApplicationEnv.sh \
    && sh /opt/apache-skywalking-apm-incubating/webapp/setWebAppEnv.sh \
    && /opt/apache-skywalking-apm-incubating/bin/startup.sh
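A few questions need answering when writing the Dockerfile: which SkyWalking settings have to be configured dynamically (set at run time)? And how do we keep the container's process running, given that SkyWalking's startup.sh, like Tomcat's startup.sh, launches the services in the background and returns?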
application.yml
#cluster:
#  zookeeper:
#    hostPort: localhost:2181
#    sessionTimeout: 100000
naming:
  jetty:
    #OS real network IP(binding required), for agent to find collector cluster
    host: 0.0.0.0
    port: 10800
    contextPath: /
cache:
#  guava:
  caffeine:
remote:
  gRPC:
    # OS real network IP(binding required), for collector nodes communicate with each other in cluster. collectorN --(gRPC) --> collectorM
    host: #real_host
    port: 11800
agent_gRPC:
  gRPC:
    #os real network ip(binding required), for agent to uplink data(trace/metrics) to collector. agent--(grpc)--> collector
    host: #real_host
    port: 11800
    # Set these two setting to open ssl
    #sslCertChainFile: $path
    #sslPrivateKeyFile: $path
    # Set your own token to active auth
    #authentication: xxxxxx
agent_jetty:
  jetty:
    # OS real network IP(binding required), for agent to uplink data(trace/metrics) to collector through HTTP. agent--(HTTP)--> collector
    # SkyWalking native Java/.Net/node.js agents don't use this.
    # Open this for other implementor.
    host: 0.0.0.0
    port: 12800
    contextPath: /
analysis_register:
  default:
analysis_jvm:
  default:
analysis_segment_parser:
  default:
    bufferFilePath: ../buffer/
    bufferOffsetMaxFileSize: 10M
    bufferSegmentMaxFileSize: 500M
    bufferFileCleanWhenRestart: true
ui:
  jetty:
    # Stay in `localhost` if UI starts up in default mode.
    # Change it to OS real network IP(binding required), if deploy collector in different machine.
    host: 0.0.0.0
    port: 12800
    contextPath: /
storage:
  elasticsearch:
    clusterName: #elasticsearch_clusterName
    clusterTransportSniffer: true
    clusterNodes: #elasticsearch_clusterNodes
    indexShardsNumber: 2
    indexReplicasNumber: 0
    highPerformanceMode: true
    # Batch process setting, refer to https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.5/java-docs-bulk-processor.html
    bulkActions: 2000 # Execute the bulk every 2000 requests
    bulkSize: 20 # flush the bulk every 20mb
    flushInterval: 10 # flush the bulk every 10 seconds whatever the number of requests
    concurrentRequests: 2 # the number of concurrent requests
    # Set a timeout on metric data. After the timeout has expired, the metric data will automatically be deleted.
    traceDataTTL: 2880 # Unit is minute
    minuteMetricDataTTL: 90 # Unit is minute
    hourMetricDataTTL: 36 # Unit is hour
    dayMetricDataTTL: 45 # Unit is day
    monthMetricDataTTL: 18 # Unit is month
#storage:
#  h2:
#    url: jdbc:h2:~/memorydb
#    userName: sa
configuration:
  default:
    #namespace: xxxxx
    # alarm threshold
    applicationApdexThreshold: 2000
    serviceErrorRateThreshold: 10.00
    serviceAverageResponseTimeThreshold: 2000
    instanceErrorRateThreshold: 10.00
    instanceAverageResponseTimeThreshold: 2000
    applicationErrorRateThreshold: 10.00
    applicationAverageResponseTimeThreshold: 2000
    # thermodynamic
    thermodynamicResponseTimeStep: 50
    thermodynamicCountOfResponseTimeSteps: 40
    # max collection's size of worker cache collection, setting it smaller when collector OutOfMemory crashed.
    workerCacheMaxSize: 10000
#receiver_zipkin:
#  default:
#    host: localhost
#    port: 9411
#    contextPath: /
webapp.yml
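Dynamic configuration: the password, and the host IPs that gRPC and the other services have to bind to, must all be set at run time. Here, before SkyWalking's startup.sh runs, two configuration scripts are executed first; they replace the placeholders that need dynamic values with environment variables that k8s sets at run time.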
setApplicationEnv.sh
#!/usr/bin/env sh
# Replace the "#..." placeholders in application.yml with values from the environment.
sed -i "s/#elasticsearch_clusterNodes/${elasticsearch_clusterNodes}/g" /opt/apache-skywalking-apm-incubating/config/application.yml
sed -i "s/#elasticsearch_clusterName/${elasticsearch_clusterName}/g" /opt/apache-skywalking-apm-incubating/config/application.yml
sed -i "s/#real_host/${real_host}/g" /opt/apache-skywalking-apm-incubating/config/application.yml
setWebAppEnv.sh
#!/usr/bin/env sh
# Replace the "#..." placeholders in webapp.yml with values from the environment.
sed -i "s/#skywalking_password/${skywalking_password}/g" /opt/apache-skywalking-apm-incubating/webapp/webapp.yml
sed -i "s/#real_host/${real_host}/g" /opt/apache-skywalking-apm-incubating/webapp/webapp.yml
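The placeholder substitution done by the two scripts above can also be sketched as one reusable function. This is only an illustration under assumptions: `replace_placeholder` is a hypothetical helper, not part of SkyWalking or of the scripts above. It relies on the same `#name` placeholder convention, refuses to run when the variable is unset (so a missing env entry does not silently produce an empty value), and escapes characters that would otherwise break the sed expression, such as a `&` in the password.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of SkyWalking): replace the "#<name>" placeholder
# in a file with the value of the environment variable <name>.
replace_placeholder() {
    name="$1"
    file="$2"
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "environment variable $name is not set" >&2
        return 1
    fi
    # Escape backslash, '&' and the '|' delimiter for the replacement side of s|...|...|.
    escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
    # Write to a temporary file and move it into place instead of relying on sed -i.
    tmp="${file}.tmp.$$"
    sed "s|#${name}|${escaped}|g" "$file" > "$tmp" && mv "$tmp" "$file"
}
```

With this helper, setWebAppEnv.sh would reduce to two calls such as `replace_placeholder skywalking_password /opt/apache-skywalking-apm-incubating/webapp/webapp.yml`, and the CMD chain in the Dockerfile would stop at the first missing variable because the function returns a non-zero status.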
Keeping the process alive: appending "tail -fn 100 /opt/apache-skywalking-apm-incubating/logs/webapp.log" to the end of SkyWalking's startup script startup.sh keeps a process running in the foreground and continuously streams the contents of webapp.log.
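One thing to watch with the tail approach: GNU `tail -f` exits with an error if the file does not exist yet, and the container would stop with it. The sketch below is an assumption layered on top of the setup described here, not something the image above does: a hypothetical `wait_for_file` helper polls for the log file for a bounded number of seconds before tail takes over.

```shell
#!/usr/bin/env sh
# Hypothetical helper: wait until a file exists, giving up after <timeout> seconds.
# Returns 0 when the file appeared, 1 when the timeout was reached.
wait_for_file() {
    file="$1"
    timeout="${2:-30}"
    waited=0
    while [ ! -f "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Intended use at the end of startup.sh (shown as a comment, not executed here):
#   wait_for_file /opt/apache-skywalking-apm-incubating/logs/webapp.log 60 \
#       && exec tail -n 100 -F /opt/apache-skywalking-apm-incubating/logs/webapp.log
```

In GNU coreutils, `tail -F` follows the file by name and retries, so it also survives log rotation. Note that either variant only keeps the container alive; it does not notice if the collector or web UI process itself dies.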
Deploying to Kubernetes
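# Note: Deployments under extensions/v1beta1 were removed in Kubernetes 1.16; use apps/v1 on newer clusters.
apiVersion: extensions/v1beta1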
kind: Deployment
metadata:
  name: skywalking
  namespace: uat
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking
  template:
    metadata:
      labels:
        app: skywalking
    spec:
      imagePullSecrets:
        - name: registry-pull-secret
      nodeSelector:
        apm: skywalking
      containers:
        - name: skywalking
          image: registry.cn-xx.xx.com/keking/kk-skywalking:5.2
          imagePullPolicy: Always
          env:
            - name: elasticsearch_clusterName
              value: elasticsearch
            - name: elasticsearch_clusterNodes
              value: 172.16.16.129:31300
            - name: skywalking_password
              value: xxx
            - name: real_host
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          resources:
            limits:
              cpu: 1000m
              memory: 4Gi
            requests:
              cpu: 700m
              memory: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking
  namespace: uat
  labels:
    app: skywalking
spec:
  selector:
    app: skywalking
  ports:
    - name: web-a
      port: 8080
      targetPort: 8080
      nodePort: 31180
    - name: web-b
      port: 10800
      targetPort: 10800
      nodePort: 31181
    - name: web-c
      port: 11800
      targetPort: 11800
      nodePort: 31182
    - name: web-d
      port: 12800
      targetPort: 12800
      nodePort: 31183
  type: NodePort
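The only thing that needs attention in the Kubernetes manifests is how the pod IP is obtained in env. Several addresses in SkyWalking must be bound to the container's real IP, and that IP can be passed into the container as an environment variable (here real_host, taken from status.podIP).
Conclusion
The whole containerized deployment took about a day from first test to usable. A bit over an hour of that went into trying Tan's skywalking-docker image (https://hub.docker.com/r/wutang/skywalking-docker/). I found a permission problem in one of its scripts (Tan says it has been fixed; I have not had time to test that yet), and a few things were hard to control the way I wanted, so I built my own Docker image. The biggest problem turned out to be network communication inside the cluster. At first I set every service IP in SkyWalking to 0.0.0.0 and exposed the ports through the cluster's nodePort mapping. The agent could then reach the naming service at cluster IP + 31181, but the collector gRPC address it got back from the naming service was 0.0.0.0:11800, which the agent obviously cannot use to reach the collector. Binding to the pod IP solved this.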
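The 0.0.0.0 symptom described above can be checked without attaching an agent. The sketch below is only an illustration under assumptions: it assumes the SkyWalking 5.x naming service answers `GET /agent/gRPC` with a JSON array of `host:port` strings (verify that endpoint and response shape against your version), and `has_unroutable_address` is a hypothetical helper that flags advertised addresses no agent could connect to.

```shell
#!/usr/bin/env sh
# Hypothetical helper: succeed (exit status 0) when a naming-service response
# advertises an address that agents cannot reach, such as 0.0.0.0 or localhost.
# Assumed response shape: ["10.244.1.7:11800"]
has_unroutable_address() {
    printf '%s' "$1" | grep -Eq '"(0\.0\.0\.0|127\.0\.0\.1|localhost):[0-9]+"'
}

# Example against a live cluster (placeholder node IP, shown as a comment):
#   response=$(curl -s "http://<node-ip>:31181/agent/gRPC")
#   if has_unroutable_address "$response"; then
#       echo "collector advertises an address agents cannot reach: $response" >&2
#   fi
```

Running a check like this right after a deployment would surface the bind-address mistake immediately, instead of through agents that register with the naming service and then fail to upload data.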