Occasional extremely slow /_changes request #2612
Comments
I've just tried deleting the streaming replication jobs, and this appears to solve the problem. However, we need replication! When I recreate them, the problem reappears. There is only a small amount of replication traffic, though: every few minutes a monitoring process writes a test doc and checks that it successfully replicates to each remote cluster. We're only talking about a handful of documents within a minute or two.
I've found further info by digging into the logs. The problematic _changes requests all come from one of the cluster nodes, which is hitting the other node's _changes endpoint. The situation is as shown below:
Note also that couch01 has bad _changes response times too, with requests from its own IP. What I'm wondering is:
Thx
Exactly. There is a timeout. What the timeout means is that as long as there are changes on the source, it will stream them. There could be, say, 100M documents and they should all stream out. Then, once they are all done and the stream is fully caught up, it will sleep / wait for 10 more seconds to see if more changes are coming. If there are none, the feed will time out and stop, and the replicator will then start a new request from the last update sequence. That timeout is part of the normal replication protocol. When I measure request latency, for example, I always exclude _changes feeds, as these long-running connections skew the results. That timeout parameter is also derived from a setting you can configure: https://docs.couchdb.org/en/stable/config/replicator.html#replicator/connection_timeout
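To make the effect concrete, here is a minimal sketch (not from the issue) of what such a long-held _changes request looks like from the client side; the host, database name (mydb), and credentials are placeholders. On an idle database the request simply hangs until the feed timeout fires, so its wall-clock duration reflects the timeout rather than server slowness.

```python
# Minimal sketch; http://localhost:5984, "mydb" and the credentials are
# placeholders, not details from this deployment.
import time
import requests

url = "http://localhost:5984/mydb/_changes"
params = {"feed": "longpoll", "timeout": 10000}  # ms; feed closes after 10s with no changes

start = time.monotonic()
resp = requests.get(url, params=params, auth=("admin", "password"))
elapsed_ms = (time.monotonic() - start) * 1000

# With no writes during the window, elapsed_ms lands just above 10000 ms,
# which matches response times like the 10,003-10,005 ms seen in the logs.
print(f"status={resp.status_code} elapsed={elapsed_ms:.0f} ms")
```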
Ah ok, well all is good if this is to be expected. However, is there any way I can remove the effect of _changes from the average response-time stats returned by _stats? This is throwing things off and it's why I noticed this in the first place; I'd really like to have response-time stats for typical requests only.
I was just doing it when monitoring the external (3rd-party proxy) logs, not anything in the core CouchDB code. So when building latency histograms I would add an extra step along the lines of "exclude _changes API calls". I think the original question was answered. Should we close the issue, what do you think?
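As an illustration of that extra filtering step, here is a rough sketch of dropping _changes calls from proxy log lines before computing latency stats. The log format and field positions are assumptions for the example, not the actual format of the proxy used here.

```python
# Sketch only: assumes each log line has the request path and the response
# time in ms as whitespace-separated fields.
import statistics

def latencies_excluding_changes(log_lines):
    """Yield response times (ms) for all requests except _changes calls."""
    for line in log_lines:
        fields = line.split()
        path, ms = fields[0], float(fields[1])   # assumed field positions
        if "/_changes" in path:
            continue                              # skip long-running feeds
        yield ms

sample = [
    "/db/_changes?feed=longpoll&timeout=10000 10004.2",
    "/db/doc1 3.1",
    "/db/_all_docs 6.8",
]
values = list(latencies_excluding_changes(sample))
print(f"avg={statistics.mean(values):.1f} ms over {len(values)} requests")
```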
Every few minutes a GET on /_changes is taking in excess of 40,000ms, when typically it takes 2-10ms.
Description
We're running a 2-node CouchDB (v2.3.1) cluster on CentOS 7. We have a daemon which polls the changes feed once per second, specifying since as the most recent sequence number returned previously. Our write rate is extremely low, so in practice this usually returns zero records, and no more than a handful at most. Overall read activity in the cluster is also very low, at sub-10 rps per node. Typically the /_changes response time is 2-10ms, but every few minutes it rockets up to 10,000-40,000ms. There is no associated uptick in CPU/memory usage.
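For reference, a rough sketch of that polling pattern (the host, database name, and credentials are placeholders, not our actual setup):

```python
# Sketch of a once-per-second _changes poller that resumes from the last
# seen update sequence. All connection details below are placeholders.
import time
import requests

url = "http://localhost:5984/mydb/_changes"
since = "0"

while True:
    start = time.monotonic()
    resp = requests.get(url, params={"since": since}, auth=("admin", "password"))
    elapsed_ms = (time.monotonic() - start) * 1000
    body = resp.json()
    since = body.get("last_seq", since)          # resume from here next time
    print(f"{len(body.get('results', []))} changes in {elapsed_ms:.0f} ms")
    time.sleep(1)                                 # poll once per second
```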
Behaviour seems very similar to #1319.
Steps to Reproduce
None besides running a cluster in the scenario described above.
Expected Behaviour
Fast and stable /_changes response times
Your Environment
Additional Context
The distribution of long response times looks a little odd. For example, I see a lot of response times of 10,003-10,005ms, almost as if there's been some sort of timeout at 10,000ms. One node will also tend to hit bad response times more often than the other.
The daemon that hits the /_changes endpoint randomly hits either of the two nodes, as they're behind a reverse proxy. However, traffic is balanced: one node is hit as often as the other.
Sample log of long response time: