Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 23, 2013 4:11 AM
We recently moved from dedicated servers to AWS EC2 for scalability and, above all, high availability. However, the result has been higher cost and frequent downtime, so we're a bit disappointed with AWS so far.

Our web servers seem to randomly experience brief network disconnects from all of our MongoDB and ElasticSearch instances at the same time. Each one lasts only a couple of seconds to a couple of minutes, but during this time our site is unavailable (the ELB health check kicks all servers offline), and this is unacceptable.

It happens a couple of times every month, and it happened again last night. The architecture and details of the instances concerned are in the attached diagram, or here:

http://s1336.photobucket.com/user/jayvbe/media/Webserverinfrastructure1_zps802ee5be.png.html

We run 2 web instances, and it seems that both had issues connecting to one or more of the DB+Search instances at the same time. Nothing in our logs can confirm whether it's one or all, but I assume it's all, since Nginx is supposed to fail over immediately between the 3 search instances (tested by manipulating firewall rules). The Java Mongo client is not configured to read from secondaries yet, so there's no way of knowing whether it could access the secondaries at this point.

I hope someone from AWS can look into this and clarify what's been happening to these instances and how to resolve this.

Some relevant logs from around 2013-05-23 04:18:08,780 GMT (all instances are in the same zone):

Java Exception:

com.mongodb.MongoException$Network: can't call something : ip-10-33-xxx-xxx.eu-west-1.compute.internal

REST call through Nginx on localhost:

<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.1.19</center>
</body>
</html>

and

<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.1.19</center>
</body>
</html>
Replies
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 23, 2013 6:18 PM
in response to: JayVPar
Later this evening the problem reappeared while I was monitoring the system, so I managed to at least get some more insight into why none of our ElasticSearch nodes responded.

It seems that the brief network disruption confused ElasticSearch enough for it to take about 30-45min before resuming normal operation.
This time 1 node experienced a corrupt index due to the outage, which had to be resolved by a service restart, so I believe there are probably some issues with ES.

I can confirm that none of the nodes could reach each other during the disconnection. Shortly after that, the network connections are restored and all nodes can talk to each other again. So if I'm correct in my findings, does it mean that we can never have a fault-tolerant system on EC2 when services are spread over different nodes?

What does it take for someone from Amazon to look into this?
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: MattJ@AWS
Posted on: May 23, 2013 6:42 PM
in response to: JayVPar
Hello,

Would you say that most of the failures occur with communications crossing i-a4c185ee? If so, would you be willing to stop/start this instance and let me know if it continues?

Regards,

Matt J
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 23, 2013 6:54 PM
in response to: MattJ@AWS
It seems that network connectivity is lost between ALL instances (within the same SG) at the same time, so all the red lines in the diagram are cut.

Also, the instance you mention is 1 of the 2 webservers. That one remains reachable from the internet; however, the webservers lose access to the DB+Search servers.

I'm not keen on restarting, unless you can assure me that it will keep the same IP addresses?

In our staging setup we've never seen this, the only difference is that those are different nodes and all nodes run in the default SG, while for production I've created a dedicated SG for each task and specialized rules depending on those SGs. Don't know if that has an impact?
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 26, 2013 6:31 AM
in response to: MattJ@AWS
From reading the forums it seems that sometimes the host hardware can cause these short network blackouts.

Could someone from Amazon confirm if the following machines run on the same host?

i-3aafeb70
i-e6b5f1ac
i-46b3f70c

From analyzing the logs I have a strong suspicion that at least the first 2 exhibit the problem at the exact same time.

Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: May 26, 2013 11:49 AM
in response to: JayVPar
Hi,

Generally host hardware should not result in "brief disconnects from the network", but just to confirm: none of those instances share the same hardware. As a guess, it sounds more like there may be a mismatch in the keepalive settings (TCP or application-level protocol). In any case, the underlying hardware is passing all our automated health checks, which do include network infrastructure.

What does concern me is that it looks like you are running all your instances in a single Availability Zone. AZs are one or more physical data center locations; if your goal is High Availability you need to use more than one AZ. Our "AWS Cloud Architecture Best Practices" and "Building Fault-Tolerant Applications on AWS" whitepapers have further details and recommendations:
http://aws.amazon.com/architecture/

Your architecture diagram looks like a great fit for VPC by the way; it reminds me of our reference diagram for ELB inside VPC:
http://aws.typepad.com/aws/2011/11/new-aws-elastic-load-balancing-inside-of-a-virtual-private-cloud.html

Instances can use a static Private IP across Stop/Start in VPC:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-instance-addressing.html

Regards,
Joshua F.
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 28, 2013 4:03 AM
in response to: JoshuaF@AWS
Hi Joshua,

If the host hardware is generally not to blame, why do many people including your coworker Matt suggest the stop-start procedure in case of network issues?

The log files of our web and app servers (Nginx and Glassfish) and of our DB and search servers (MongoDB and ElasticSearch) show network connectivity issues on all 5 instances around the same timestamps. For me this is enough to rule out application misconfiguration.

We have the exact same architectural setup as a testing environment, configured from the exact same CHEF recipes, running the exact same code, backed by the exact same data, exposed to the same health checks. Yet we've never experienced a single instance of this problem there, and it's been running at least twice as long as our production systems. For me this rules out OS-level misconfiguration, application misconfiguration, or anything that is remotely under our control.

I'll spin up new instances for the ones I believe to experience the non-existing network problems, and we'll see if that magically solves our misconfigurations ;-)
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: May 28, 2013 7:13 AM
in response to: JayVPar
Hi,

Apologies for any confusion. I am suggesting that you may be running into infrastructure differences from your testing environment, which I'm guessing might be on a single rack or two. Feel free to clarify if this is not accurate. Due to the scale of EC2, there could be small differences in the way you need to handle TCP connections across larger infrastructure. It sounds like you've collected some debugging logs, which is a good first step, but as you've mentioned this only occurs a couple of times a month, it will be challenging to collect good data on what's going on.

Creating a duplicate environment with new instances is also a great idea. I'd suggest using a different AZ and VPC so that you can reuse your private IP address if needed.

Hope it helps,
Joshua F.
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 29, 2013 4:46 AM
in response to: JoshuaF@AWS
Like I said, we already have an identical testing setup on EC2 with the exact same topology, same instance types, configured exactly the same (via the same CHEF recipes), that doesn't have this problem.

The last couple of days it occurred 2 to 4 times a day, not once every month!! So far there have been no problems at all on the new instances: same OS, same code, same config ;-)

I've now spent an entire week debugging this, which cost my company a multiple of our monthly AWS bill. I would love to get compensation from Amazon for this expense and for the loss of productivity and customer credibility.

Edited by: JayVPar on May 29, 2013 1:47 PM
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: May 29, 2013 6:39 AM
in response to: JayVPar
Hi,

Thanks for the update, sounds like my guess about the testing setup was incorrect if you're running it on EC2 as well, and this was occurring a lot more often than in your first post. I sent you a PM in case you wanted to provide more detail than you're comfortable sharing on the public forum.

Regards,
Joshua F.
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 29, 2013 7:34 AM
in response to: JoshuaF@AWS
Arrgh... Cried victory too soon, 15 minutes ago it went down again.

Timestamp around GMT 2013-05-29 14:15:55,348

i-a4c185ee and i-34ce8a7e lost connection to i-ac21a5e1 i-3237b37f i-96c84fdb

I'll replace i-a4c185ee and i-34ce8a7e as well, the latter 3 are new instances from last night.

Involved SGs: sg-e3f14594 sg-bdd269ca sg-a3d269d4

Edited by: JayVPar on May 29, 2013 4:40 PM
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: May 29, 2013 8:00 AM
in response to: JayVPar
Hi,

Thanks for the quick response, I was lucky enough to gather a bit more detail in realtime. It looks like your backend and web client instances are attempting to use some high-number port connections (like TCP 38960, TCP 52654, etc) that have timed out. To allow this behavior, you can open your security groups on the ephemeral ports, typically 32768 to 61000 in Linux:
http://en.wikipedia.org/wiki/Ephemeral_port

However, I again may not understand the whole story here, as you've mentioned this only affects this set of instances for some reason. I would consider opening the high-number ports a workaround, while best practice would be to use keepalives on the server side, for example:

/sbin/sysctl -w net.ipv4.tcp_keepalive_time=200 net.ipv4.tcp_keepalive_intvl=200 net.ipv4.tcp_keepalive_probes=5

Regards,
Joshua F.
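
For a sense of scale, the keepalive values in the sysctl command above can be turned into a rough worst-case detection time. This is a minimal sketch and not something from the thread: the function name is made up for illustration, and the comparison values (7200, 75, 9) are the usual Linux defaults, which may differ on a tuned system.

```python
def dead_peer_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Approximate seconds of silence before the kernel drops an idle connection.

    The first probe is sent after `keepalive_time` seconds of idle time, then
    one probe every `keepalive_intvl` seconds; after `keepalive_probes`
    unanswered probes the connection is considered dead.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


# Values from the sysctl command suggested in the thread.
suggested = dead_peer_detection_seconds(200, 200, 5)

# Usual Linux defaults, shown only for comparison (an assumption about the host).
linux_default = dead_peer_detection_seconds(7200, 75, 9)

print(suggested)      # 1200 seconds, i.e. 20 minutes
print(linux_default)  # 7875 seconds, a little over 2 hours
```

With the suggested values the first probe goes out after 200 seconds of idle time, which is the number that matters if the goal is to keep an otherwise silent connection from looking abandoned to anything in between.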

Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 29, 2013 8:33 AM
in response to: JoshuaF@AWS
JoshuaF@AWS wrote:
like your backend and web client instances are attempting to use some high-number port connections (like TCP 38960, TCP 52654, etc) that have timed out. To allow this behavior, you can
AFAIK these ports are not used by any of the services that I run, I expect traffic to 80, 8080, 9200, 9300-9400 and 27017, depending on which service and as configured in the SGs.
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: May 29, 2013 8:43 AM
in response to: JayVPar
Hi,

Those were the client end source ports. Basically:

1. application opens a connection from the web instance to the database (auto-assigned client source port like tcp 38960 to destination db tcp 27017). This connection is allowed because the database EC2 Security Group is open on tcp 27017.

2. This tcp:38960 > tcp:27017 connection times out for an unknown reason, perhaps due to lack of application keepalives or a timeout that's too high. This article may be useful background info:
http://www.codeproject.com/Articles/37490/Detection-of-Half-Open-Dropped-TCP-IP-Socket-Conne

3. Since the connection is timed out, when some traffic leaves the db it looks backward like this: tcp:27017 > tcp:38960
Since the web EC2 Security Group is NOT open on tcp:38960 this traffic is blocked.

Again allowing the ephemeral ports is generally a safe workaround as nothing should be listening on these ports, but the best thing to do would be to keep the connection alive.

Regards,
Joshua F.
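
The three steps above can be sketched as a toy model. Everything in it is an assumption made for illustration: `ToyFilter`, the 300-second `idle_timeout`, and the `"db"` peer name are hypothetical, and this is not a description of how EC2 Security Groups are actually implemented. It only shows why, under the explanation given, either keepalive traffic or an open ephemeral range would let the database's reply through.

```python
from dataclasses import dataclass, field

# Typical Linux ephemeral range quoted earlier in the thread (32768 to 61000).
EPHEMERAL = range(32768, 61001)


@dataclass
class ToyFilter:
    """Toy stateful inbound filter for one host (hypothetical, for illustration)."""
    open_ports: set       # inbound destination ports allowed by static rules
    idle_timeout: float   # made-up value: seconds before an idle flow is forgotten
    tracked: dict = field(default_factory=dict)  # flow key -> time last seen

    def outbound(self, now, local_port, peer, peer_port):
        # Outgoing traffic creates or refreshes a tracked flow.
        self.tracked[(local_port, peer, peer_port)] = now

    def inbound_allowed(self, now, local_port, peer, peer_port):
        key = (local_port, peer, peer_port)
        last = self.tracked.get(key)
        if last is not None and now - last <= self.idle_timeout:
            self.tracked[key] = now           # reply on a flow that is still tracked
            return True
        self.tracked.pop(key, None)           # flow expired or never existed
        return local_port in self.open_ports  # fall back to the static rules


def demo(open_ports, keepalive_every=None):
    """Open a flow web:38960 -> db:27017 at t=0, then test a reply at t=1000."""
    web = ToyFilter(open_ports=open_ports, idle_timeout=300)
    web.outbound(0, 38960, "db", 27017)
    if keepalive_every:
        for t in range(keepalive_every, 1001, keepalive_every):
            web.outbound(t, 38960, "db", 27017)
    return web.inbound_allowed(1000, 38960, "db", 27017)


print(demo({80}))                       # False: flow idle for 1000 s, port 38960 closed
print(demo({80}, keepalive_every=200))  # True: keepalives kept the flow tracked
print(demo({80} | set(EPHEMERAL)))      # True: a static rule admits port 38960
```

In this model the two remedies are independent: keepalives stop the flow from ever expiring, while opening the ephemeral range lets late replies in even after it has expired.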
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 29, 2013 10:19 AM
in response to: JoshuaF@AWS
JoshuaF@AWS wrote:
Again allowing the ephemeral ports is generally a safe workaround as nothing should be listening on these ports, but the best thing to do would be to keep the connection alive.

I've set the keepalive values on all machines as you suggested, we'll see if that changes anything.
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 30, 2013 4:48 AM
in response to: JoshuaF@AWS
I've done everything you guys asked: set the keepalives like you suggested, and replaced the instances with brand new ones like Matt suggested. Yet again today one server lost its network connection: i-96c84fdb

Logs i-ac21a5e1

Thu May 30 11:15:07.089 rsSyncNotifier Socket recv() timeout 10.208.62.23:27017
Thu May 30 11:15:07.089 rsSyncNotifier SocketException: remote: 10.208.62.23:27017 error: 9001 socket exception [3] server 10.208.62.23:27017
Thu May 30 11:15:07.090 rsSyncNotifier replset tracking exception: exception: 10278 dbclient error communicating with server: ip-10-208-62-23.eu-west-1.compute.internal:27017
Thu May 30 11:15:08.090 rsSyncNotifier replset setting oplog notifier to ip-10-208-62-23.eu-west-1.compute.internal:27017
Thu May 30 11:15:15.346 conn10328 end connection 10.208.62.23:58790 (4 connections now open)
Thu May 30 11:15:15.347 initandlisten connection accepted from 10.208.62.23:58793 #10330 (5 connections now open)

Logs i-3237b37f

Thu May 30 11:15:09.192 initandlisten connection accepted from 10.208.62.23:35697 #10015 (4 connections now open)
Thu May 30 11:15:12.913 rsSyncNotifier Socket recv() timeout 10.208.62.23:27017
Thu May 30 11:15:12.913 rsSyncNotifier SocketException: remote: 10.208.62.23:27017 error: 9001 socket exception [3] server 10.208.62.23:27017
Thu May 30 11:15:12.913 rsSyncNotifier replset tracking exception: exception: 10278 dbclient error communicating with server: ip-10-208-62-23.eu-west-1.compute.internal:27017
Thu May 30 11:15:13.913 rsSyncNotifier replset setting oplog notifier to ip-10-208-62-23.eu-west-1.compute.internal:27017
Thu May 30 11:15:32.360 conn10014 end connection 10.34.131.93:44070 (3 connections now open)
Thu May 30 11:15:32.362 initandlisten connection accepted from 10.34.131.93:44074 #10016 (4 connections now open)

From i-96c84fdb

Thu May 30 11:14:56.036 conn17522 end connection 10.227.47.122:39068 (8 connections now open)
Thu May 30 11:14:56.038 initandlisten connection accepted from 10.227.47.122:39070 #17528 (9 connections now open)
Thu May 30 11:14:59.951 initandlisten connection accepted from 10.36.184.26:54231 #17529 (10 connections now open)
Thu May 30 11:15:00.383 initandlisten connection accepted from 10.34.168.121:49233 #17530 (11 connections now open)
Thu May 30 11:15:03.644 conn9514 end connection 10.34.168.121:44332 (10 connections now open)
Thu May 30 11:15:03.845 initandlisten connection accepted from 10.36.184.26:54233 #17531 (11 connections now open)
Thu May 30 11:15:05.036 initandlisten connection accepted from 10.36.184.26:54234 #17532 (12 connections now open)
Thu May 30 11:15:05.295 initandlisten connection accepted from 10.36.184.26:54235 #17533 (13 connections now open)
Thu May 30 11:15:08.765 conn9019 killcursors: found 0 of 1
Thu May 30 11:15:08.765 conn9019 end connection 10.227.47.122:58812 (12 connections now open)
Thu May 30 11:15:08.766 initandlisten connection accepted from 10.227.47.122:39073 #17534 (13 connections now open)
Thu May 30 11:15:08.769 conn9020 killcursors: found 0 of 1
Thu May 30 11:15:08.769 conn9020 end connection 10.34.131.93:37546 (12 connections now open)
Thu May 30 11:15:08.771 initandlisten connection accepted from 10.34.131.93:52193 #17535 (13 connections now open)
Thu May 30 11:15:09.176 initandlisten connection accepted from 10.34.168.121:49235 #17536 (14 connections now open)
Thu May 30 11:15:11.501 conn17533 end connection 10.36.184.26:54235 (13 connections now open)
Thu May 30 11:15:11.501 conn17532 end connection 10.36.184.26:54234 (12 connections now open)


The logs from the web servers connecting to Mongo

Logs from i-e61190ab

[#|2013-05-30T10:58:55.518+0000|WARNING|glassfish3.1.2|com.mongodb.ReplicaSetStatus|_ThreadID=54;_ThreadName=Thread-2;|Server seen down: ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017 - java.io.IOException - message: Read timed out|#]

[#|2013-05-30T11:14:24.334+0000|WARNING|glassfish3.1.2|com.mongodb|_ThreadID=42;_ThreadName=Thread-2;|emptying DBPortPool to ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017 b/c of error
java.net.SocketException: Connection timed out

[#|2013-05-30T11:14:24.430+0000|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=42;_ThreadName=Thread-2;|11:14:24.429 http-thread-pool-8080(2) ERROR c.p.r.util.UncaughtExceptionMapper - Error dYDSdA from http://10.224.85.209 URL: /api/homepage.json/contents
com.mongodb.MongoException$Network: can't call something : ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017/parleys

Caused by: java.net.SocketException: Connection timed out
at java.net.SocketInputStream.socketRead0(Native Method) na:1.7.0_09

... and this goes on and on ...

and from Nginx

2013/05/30 11:16:02 error 22769#0: *846 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-34-168-121, request: "POST /channelsindex/_search?search_type=dfs_query_then_fetch HTTP/1.1", upstream: "http://10.208.62.23:9200/channelsindex/_search?search_type=dfs_query_then_fetch", host: "localhost:9200"
2013/05/30 11:16:24 error 22769#0: *852 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-34-168-121, request: "POST /channelsindex/_search?search_type=dfs_query_then_fetch HTTP/1.1", upstream: "http://10.208.62.23:9200/channelsindex/_search?search_type=dfs_query_then_fetch", host: "localhost:9200"


Logs from i-e4cd4da9

[#|2013-05-30T11:14:25.967+0000|WARNING|glassfish3.1.2|com.mongodb|_ThreadID=37;_ThreadName=Thread-2;|emptying DBPortPool to ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017 b/c of error
java.net.SocketException: Connection timed out

[#|2013-05-30T11:14:26.056+0000|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=37;_ThreadName=Thread-2;|11:14:26.055 http-thread-pool-8080(2) ERROR c.p.r.util.UncaughtExceptionMapper - Error TAg3sd from http://10.48.18.112 URL: /api/homepage.json/contents
com.mongodb.MongoException$Network: can't call something : ip-10-208-62-23.eu-west-1.compute.internal/10.208.62.23:27017/parleys
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:284) mongo-java-driver-2.9.0.jar:na


... and this goes on and on as well ...

and Nginx

2013/05/30 11:14:42 error 22790#0: *767 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/514892080364bc17fc566614 HTTP/1.1", upstream: "http://10.34.131.93:9200/usersindex/document/514892080364bc17fc566614", host: "localhost:9200"
2013/05/30 11:14:44 error 22790#0: *767 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/514892080364bc17fc566614 HTTP/1.1", upstream: "http://10.227.47.122:9200/usersindex/document/514892080364bc17fc566614", host: "localhost:9200"
2013/05/30 11:14:46 error 22790#0: *767 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/514892080364bc17fc566614 HTTP/1.1", upstream: "http://10.208.62.23:9200/usersindex/document/514892080364bc17fc566614", host: "localhost:9200"
2013/05/30 11:14:50 error 22790#0: *767 no live upstreams while connecting to upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /channelsindex/_search?search_type=dfs_query_then_fetch HTTP/1.1", upstream: "http://elasticsearch/channelsindex/_search?search_type=dfs_query_then_fetch", host: "localhost:9200"
2013/05/30 11:15:26 error 22790#0: *771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.208.62.23:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:28 error 22790#0: *773 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.227.47.122:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:28 error 22790#0: *771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.34.131.93:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:28 error 22790#0: *771 no live upstreams while connecting to upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://elasticsearch/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:30 error 22790#0: *773 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.34.131.93:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:30 error 22790#0: *771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.208.62.23:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:32 error 22790#0: *771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /usersindex/document/5148920d0364bc17fc56a361 HTTP/1.1", upstream: "http://10.227.47.122:9200/usersindex/document/5148920d0364bc17fc56a361", host: "localhost:9200"
2013/05/30 11:15:36 error 22790#0: *771 no live upstreams while connecting to upstream, client: 127.0.0.1, server: ip-10-36-184-26, request: "POST /_msearch?search_type=dfs_query_then_fetch HTTP/1.1", upstream: "http://elasticsearch/_msearch?search_type=dfs_query_then_fetch", host: "localhost:9200"


This is NOT acceptable, really! If this madness is not resolved soon, we will switch to another cloud provider.

Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: May 30, 2013 7:21 AM
in response to: JayVPar
Hi,

Sorry to hear of the continued trouble. Would you be willing to try the workaround of opening the high number ports on your security groups? Unfortunately I'm not familiar enough with the applications involved to make any specific configuration recommendations.

Regards,
Joshua F.
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 30, 2013 8:40 AM
in response to: JoshuaF@AWS
JoshuaF@AWS wrote:
Sorry to hear of the continued trouble. Would you be willing to try the workaround of opening the high number ports on your security groups?

This would be the first time in my career that I have to set up firewall rules as liberal as this; opening half the port range, where no service is even listening, doesn't make sense to me. You've given your explanation, and to my understanding this is not how it works, but I've changed the security groups as you suggested... time will tell.

I've given you all these logs, can you tell me something about that, did you notice anything different this time?

You said: opening the ports is a workaround, the keepalive is the preferred solution. The MongoDB documentation agrees, so I've implemented that. In fact they have a page dedicated to EC2 Security Groups, which describes exactly how my SGs were set up prior to your suggestions:

http://docs.mongodb.org/ecosystem/platforms/amazon-ec2/#secure-instances

JoshuaF@AWS wrote:
Unfortunately I'm not familiar enough with the applications involved to make any specific configuration recommendations.

So what does it take to get an Amazon network expert to look into our situation??

Don't get me wrong, I'm glad that you're trying to help us, but at this stage it seems that more specialized advice is appropriate, we can't go on trying every option for weeks.
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: May 30, 2013 8:56 AM
in response to: JayVPar
Hi,

Thanks for the reply, for clarity Matt J. did reach out to our EC2 support team and I've also asked a colleague to check the situation to verify.

Regarding the MongoDB recommendations, I see they also recommend a similar TCP keepalive setting though for clusters:
http://docs.mongodb.org/manual/faq/diagnostics/
"If you experience socket errors between members of a sharded cluster or replica set, that do not have other reasonable causes, check the TCP keep alive value"

The example I provided earlier is runtime and not persistent across reboot, is that still live?

cat /proc/sys/net/ipv4/tcp_keepalive_time

Regards,
Joshua F.
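
The runtime check above can also be scripted. The following is a minimal sketch, assuming a Linux host that exposes the settings under `/proc/sys/net/ipv4`; the helper names `read_keepalive` and `sysctl_conf_lines` are hypothetical. It reads the three keepalive values and renders them in the `key = value` form used by `/etc/sysctl.conf`, which is one common way to make such settings survive a reboot.

```python
from pathlib import Path

KEEPALIVE_KEYS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive(base="/proc/sys/net/ipv4"):
    """Return the current kernel keepalive settings as a dict of ints.

    `base` is a parameter only so the function can be pointed at a test
    directory; on a real Linux host the default path applies.
    """
    return {key: int((Path(base) / key).read_text().strip()) for key in KEEPALIVE_KEYS}


def sysctl_conf_lines(settings):
    """Render settings as lines in the format used by /etc/sysctl.conf."""
    return [f"net.ipv4.{key} = {value}" for key, value in settings.items()]
```

Calling `read_keepalive()` on each machine and comparing the result with the intended values gives the same information as the `cat` command, for all three settings at once.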
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: May 30, 2013 9:06 AM
in response to: JoshuaF@AWS
JoshuaF@AWS wrote:
The example I provided earlier is runtime and not persistent across reboot, is that still live?

I've made a CHEF recipe that configures this on all our machines, so yes:

jayv:~ jo$ knife ssh -C 1 "chef_environment:prod" "sudo cat /proc/sys/net/ipv4/tcp_keepalive_time"
ec2-54-xxx-xxx-0.eu-west-1.compute.amazonaws.com 200
ec2-54-xxx-xxx-109.eu-west-1.compute.amazonaws.com 200
ec2-54-xxx-xxx-38.eu-west-1.compute.amazonaws.com 200
ec2-54-xxx-xxx-112.eu-west-1.compute.amazonaws.com 200
ec2-54-xxx-xxx-128.eu-west-1.compute.amazonaws.com 200
ec2-54-xxx-xxx-17.eu-west-1.compute.amazonaws.com 200
ec2-54-xxx-xxx-204.eu-west-1.compute.amazonaws.com 200
jayv:~ jo$
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: May 31, 2013 4:20 PM
in response to: JayVPar
Hi,

A colleague verified that keepalives were the best approach as noted here and from testing MongoDB configuration recommendations it looks like they follow general recommendations such as here:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

Is the port 9300 Elastic Search traffic also using the application setting "network.tcp.keep_alive" ? We noticed the docs mention "By default not explicitly set."
http://www.elasticsearch.org/guide/reference/modules/network/

Regards,
Joshua F.
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JayVPar
Posted on: Jun 1, 2013 10:56 AM
in response to: JoshuaF@AWS
JoshuaF@AWS wrote:
Hi,

A colleague verified that keepalives were the best approach as noted here and from testing MongoDB configuration recommendations it looks like they follow general recommendations such as here:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html


It probably helps, but we still went down the other day with the keepalive parameters like you suggested.

JoshuaF@AWS wrote:
Is the port 9300 Elastic Search traffic also using the application setting "network.tcp.keep_alive" ? We noticed the docs mention "By default not explicitly set."
http://www.elasticsearch.org/guide/reference/modules/network/

We don't have this option enabled right now.

So far, since opening the ephemeral ports in the SG, we've not had any issues. It looks promising compared to the past week of daily issues.

What strikes me the most in this story, is that all traffic suddenly drops at the exact same time, between different instances, different services and protocols. I can understand a timeout of some sort, but then with all these services starting at different times, I would not expect connections to drop all at once. Anyway there must be something particular about the way the SecurityGroups work that causes this, I've never seen this happen in regular networks.

Thanks for your continued support!
Re: Random network interruptions with my EC2 instances, failover impossible
Posted by: JoshuaF@AWS
Posted on: Jun 1, 2013 3:22 PM
in response to: JayVPar
Hi,

Thanks for the confirmation. It does seem very strange that all the traffic stops at the same time, it's not what I would expect from the EC2 Security Group infrastructure either. I would advise enabling the TCP keepalive on the other ports too in case it is a timeout issue that triggers the behavior.

Regards,
Joshua F.
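
One detail worth keeping in mind with this advice: the kernel-wide keepalive values only take effect on sockets that have keepalive switched on, which is what an application setting such as "network.tcp.keep_alive" controls. As a minimal sketch of what that looks like at the socket level (the helper name `enable_keepalive` is hypothetical, the default values are the ones suggested earlier in the thread, and the per-socket tuning options are only applied where the platform exposes them):

```python
import socket


def enable_keepalive(sock, idle=200, interval=200, probes=5):
    """Turn on TCP keepalive for one socket and, where supported, tune it."""
    # Without SO_KEEPALIVE the kernel sends no probes on this socket,
    # whatever the system-wide tcp_keepalive_* values are.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides; these constants exist on Linux but not everywhere,
    # so they are looked up defensively.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

A client would call `enable_keepalive(sock)` on a freshly created TCP socket before connecting. Most servers and drivers expose an equivalent configuration switch, so code like this is usually only needed when no such switch exists.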