Analyzing an Apache Failure

[Uncategorized 2013/08/17 21:04 | by ipaddr ]

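The symptom: the number of Apache processes kept climbing until it reached the configured maximum, at which point Apache could no longer serve requests and access.log stopped receiving new entries. After changing the log format to record request processing time, I found that some requests were taking 30, 60, or 90+ seconds. My preliminary diagnosis was that a subset of requests ran far too long and tied up Apache processes, so new requests could only be handled by spawning new processes; once every process was occupied by a long-running request, Apache could serve nothing further.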

Tracing with strace -p $ApachePID showed that the Apache processes were all stuck at:

recvfrom(29,  <unfinished ...>

Looking in /proc/PID/fd/, fd 29 turned out to be a socket:

29 -> socket:[968385]
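The number in brackets is the socket's inode. As a minimal sketch, the same lookup can be scripted; the helper names socket_inode and fd_socket_inode are my own for illustration, and the code assumes a Linux /proc filesystem:

```python
import os
import re


def socket_inode(link_target):
    """Return the inode from a link target such as 'socket:[968385]'.

    Returns None when the target is not a socket (e.g. a regular file).
    """
    match = re.fullmatch(r"socket:\[(\d+)\]", link_target)
    return int(match.group(1)) if match else None


def fd_socket_inode(pid, fd):
    """Resolve /proc/<pid>/fd/<fd> and return its socket inode, if any.

    Hypothetical helper: needs Linux and permission to inspect the process.
    """
    return socket_inode(os.readlink(f"/proc/{pid}/fd/{fd}"))
```

For example, fd_socket_inode(pid, 29) would return the inode to search for in the next step, where pid is the Apache process being traced.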

Then find the socket with inode 968385 in /proc/net/tcp:

grep "968385" /proc/net/tcp


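After converting the IP address and port from hexadecimal to decimal, I identified the IP and port of the backend service that had gone down. Restarting that service restored normal operation.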
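The hex-to-decimal conversion is easy to get wrong by hand, so here is a rough sketch of how it could be done in Python. It assumes IPv4 and a little-endian host such as x86, where /proc/net/tcp prints the address bytes in reverse order. The function names and the sample line in the comments are made up for illustration; they are not output from this incident.

```python
import socket
import struct


def decode_addr(hex_addr):
    """Decode 'HEXIP:HEXPORT' from /proc/net/tcp into (ip, port).

    Assumes IPv4 on a little-endian host: the 32-bit address is packed
    little-endian so that the bytes come out in network order.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_tcp_line(line):
    """Return (local, remote, inode) for one data line of /proc/net/tcp.

    After whitespace splitting, field 1 is the local address, field 2 the
    remote address, and field 9 the socket inode.
    """
    fields = line.split()
    return decode_addr(fields[1]), decode_addr(fields[2]), int(fields[9])


def find_by_inode(inode, path="/proc/net/tcp"):
    """Scan the table for a socket inode; return (local, remote) or None."""
    with open(path) as table:
        next(table)  # skip the column header line
        for line in table:
            local, remote, line_inode = parse_tcp_line(line)
            if line_inode == inode:
                return local, remote
    return None
```

With a made-up address, decode_addr("0100007F:1F90") gives ("127.0.0.1", 8080). Calling find_by_inode(968385) on the affected host would return the local and remote endpoints, and the remote one is the stuck backend.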

https://github.com/bitly/nsq

Introduction

NSQ is a realtime distributed messaging platform designed to operate at bitly’s scale, handling billions of messages per day (current peak of 90k+ messages per second).

It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message delivery guarantee. See features & guarantees.

Operationally, NSQ is easy to configure and deploy (all parameters are specified on the command line and compiled binaries have no runtime dependencies). For maximum flexibility, it is agnostic to data format (messages can be JSON, MsgPack, go-protobuf, or anything else). Official Go and Python libraries are available out of the box and, if you’re interested in building your own client, there’s a protocol spec (see client libraries).
