I've done everything you guys asked: I set the keepalives like you suggested and replaced the instances with brand new ones like Matt suggested. Yet again today, one server lost its network connection:
i-96c84fdb
Logs i-ac21a5e1
Thu May 30 11:15:07.089
rsSyncNotifier Socket recv() timeout 10.208.62.23:27017
Thu May 30 11:15:07.089
rsSyncNotifier SocketException: remote: 10.208.62.23:27017 error: 9001 socket exception [3] server 10.208.62.23:27017
Thu May 30 11:15:07.090
rsSyncNotifier replset tracking exception: exception: 10278 dbclient error communicating with server: ip-10-208-62-23.eu-west-1.compute.internal:27017
Thu May 30 11:15:08.090
rsSyncNotifier replset setting oplog notifier to ip-10-208-62-23.eu-west-1.compute.internal:27017
Thu May 30 11:15:15.346
conn10328 end connection 10.208.62.23:58790 (4 connections now open)
Thu May 30 11:15:15.347
initandlisten connection accepted from 10.208.62.23:58793 #10330 (5 connections now open)
Logs i-3237b37f
Thu May 30 11:15:09.192
initandlisten connection accepted from 10.208.62.23:35697 #10015 (4 connections now open)
Thu May 30 11:15:12.913
rsSyncNotifier Socket recv() timeout 10.208.62.23:27017
Thu May 30 11:15:12.913
rsSyncNotifier SocketException: remote: 10.208.62.23:27017 error: 9001 socket exception [3] server 10.208.62.23:27017
Thu May 30 11:15:12.913
rsSyncNotifier replset tracking exception: exception: 10278 dbclient error communicating with server: ip-10-208-62-23.eu-west-1.compute.internal:27017
Thu May 30 11:15:13.913
rsSyncNotifier replset setting oplog notifier to ip-10-208-62-23.eu-west-1.compute.internal:27017
Thu May 30 11:15:32.360
conn10014 end connection 10.34.131.93:44070 (3 connections now open)
Thu May 30 11:15:32.362
initandlisten connection accepted from 10.34.131.93:44074 #10016 (4 connections now open)
From i-96c84fdb
Thu May 30 11:14:56.036
conn17522 end connection 10.227.47.122:39068 (8 connections now open)
Thu May 30 11:14:56.038
initandlisten connection accepted from 10.227.47.122:39070 #17528 (9 connections now open)
Thu May 30 11:14:59.951
initandlisten connection accepted from 10.36.184.26:54231 #17529 (10 connections now open)
Thu May 30 11:15:00.383
initandlisten connection accepted from 10.34.168.121:49233 #17530 (11 connections now open)
Thu May 30 11:15:03.644
conn9514 end connection 10.34.168.121:44332 (10 connections now open)
Thu May 30 11:15:03.845
initandlisten connection accepted from 10.36.184.26:54233 #17531 (11 connections now open)
Thu May 30 11:15:05.036
initandlisten connection accepted from 10.36.184.26:54234 #17532 (12 connections now open)
Thu May 30 11:15:05.295
initandlisten connection accepted from 10.36.184.26:54235 #17533 (13 connections now open)
Thu May 30 11:15:08.765
conn9019 killcursors: found 0 of 1
Thu May 30 11:15:08.765
conn9019 end connection 10.227.47.122:58812 (12 connections now open)
Thu May 30 11:15:08.766
initandlisten connection accepted from 10.227.47.122:39073 #17534 (13 connections now open)
Thu May 30 11:15:08.769
conn9020 killcursors: found 0 of 1
Thu May 30 11:15:08.769
conn9020 end connection 10.34.131.93:37546 (12 connections now open)
Thu May 30 11:15:08.771
initandlisten connection accepted from 10.34.131.93:52193 #17535 (13 connections now open)
Thu May 30 11:15:09.176
initandlisten connection accepted from 10.34.168.121:49235 #17536 (14 connections now open)
Thu May 30 11:15:11.501
conn17533 end connection 10.36.184.26:54235 (13 connections now open)
Thu May 30 11:15:11.501
conn17532 end connection 10.36.184.26:54234 (12 connections now open)
The logs from the web servers connecting to Mongo
Logs from i-e61190ab
[#|2013-05-30T10:58:55.518+0000|WARNING|glassfish3.1.2|com.mongodb.ReplicaSetStatus|_ThreadID=54;_ThreadName=Thread-2;|Server seen down: ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017 - java.io.IOException - message: Read timed out|#]
[#|2013-05-30T11:14:24.334+0000|WARNING|glassfish3.1.2|com.mongodb|_ThreadID=42;_ThreadName=Thread-2;|emptying DBPortPool to ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017 b/c of error
java.net.SocketException: Connection timed out
[#|2013-05-30T11:14:24.430+0000|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=42;_ThreadName=Thread-2;|11:14:24.429 http-thread-pool-8080(2) ERROR c.p.r.util.UncaughtExceptionMapper - Error dYDSdA from http://10.224.85.209 URL: /api/homepage.json/contents
com.mongodb.MongoException$Network: can't call something : ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017/parleys
Caused by: java.net.SocketException: Connection timed out
at java.net.SocketInputStream.socketRead0(Native Method) na:1.7.0_09
... and this goes on and on ...
and from Nginx
2013/05/30 11:16:02
error 22769#0: *846 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-34-168-121, request: "POST /channelsindex/_search?search_type=dfs_query_then_fetch HTTP/1.1", upstream: "http://10.208.62.23:9200/channelsindex/_search?search_type=dfs_query_then_fetch", host: "localhost:9200"
2013/05/30 11:16:24
error 22769#0: *852 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-34-168-121, request: "POST /channelsindex/_search?search_type=dfs_query_then_fetch HTTP/1.1", upstream: "http://10.208.62.23:9200/channelsindex/_search?search_type=dfs_query_then_fetch", host: "localhost:9200"
Logs from i-e4cd4da9
[#|2013-05-30T11:14:25.967+0000|WARNING|glassfish3.1.2|com.mongodb|_ThreadID=37;_ThreadName=Thread-2;|emptying DBPortPool to ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017 b/c of error
java.net.SocketException: Connection timed out
[#|2013-05-30T11:14:26.056+0000|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=37;_ThreadName=Thread-2;|11:14:26.055 http-thread-pool-8080(2) ERROR c.p.r.util.UncaughtExceptionMapper - Error TAg3sd from http://10.48.18.112 URL: /api/homepage.json/contents
com.mongodb.MongoException$Network: can't call something : ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017/parleys
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:284) mongo-java-driver-2.9.0.jar:na
... and this goes on and on as well ...
and Nginx
2013/05/30 11:14:42
error 22790#0: *767 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/514892080364bc17fc566614 HTTP/1.1", upstream: "http://10.34.131.93:9200/usersindex/document/514892080364bc17fc566614", host: "localhost:9200"
2013/05/30 11:14:44
error 22790#0: *767 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/514892080364bc17fc566614 HTTP/1.1", upstream: "http://10.227.47.122:9200/usersindex/document/514892080364bc17fc566614", host: "localhost:9200"
2013/05/30 11:14:46
error 22790#0: *767 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/514892080364bc17fc566614 HTTP/1.1", upstream: "http://10.208.62.23:9200/usersindex/document/514892080364bc17fc566614", host: "localhost:9200"
2013/05/30 11:14:50
error 22790#0: *767 no live upstreams while connecting to upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /channelsindex/_search?search_type=dfs_query_then_fetch HTTP/1.1", upstream: "http://elasticsearch/channelsindex/_search?search_type=dfs_query_then_fetch", host: "localhost:9200"
2013/05/30 11:15:26
error 22790#0: *771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.208.62.23:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:28
error 22790#0: *773 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.227.47.122:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:28
error 22790#0: *771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.34.131.93:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:28
error 22790#0: *771 no live upstreams while connecting to upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://elasticsearch/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:30
error 22790#0: *773 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.34.131.93:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:30
error 22790#0: *771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.208.62.23:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:32
error 22790#0: *771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.227.47.122:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:36
error 22790#0: *771 no live upstreams while connecting to upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /_msearch?search_type=dfs_query_then_fetch HTTP/1.1", upstream: "http://elasticsearch/_msearch?search_type=dfs_query_then_fetch", host: "localhost:9200"
This is NOT acceptable, really! If this madness is not resolved soon, we will switch to another cloud provider.