Occasional extremely slow /_changes request #2612

Closed
nerdvegas opened this issue Feb 28, 2020 · 5 comments

nerdvegas commented Feb 28, 2020

Every few minutes a GET on /_changes is taking in excess of 40,000ms, when typically it takes 2-10ms.

Description

We're running a 2-node CouchDB (v2.3.1) cluster on CentOS-7. We have a daemon which polls the changes feed once per second, passing since as the most recent sequence number returned previously. Our write rate is extremely low, so in practice this usually returns zero records, and at most a handful. Overall read activity in the cluster is also very low, at under 10 requests per second on each node.
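
For reference, here's a minimal sketch of what the polling daemon does, using the Python requests lib (host, database URL, and credentials below are illustrative placeholders):

```python
import time

import requests

COUCH = "http://couch01:5984"  # illustrative host
DB = "configs"

last_seq = "0"
while True:
    # Poll the changes feed, resuming from the last sequence number we
    # saw. Our write rate is very low, so "results" is usually empty.
    resp = requests.get(
        f"{COUCH}/{DB}/_changes",
        params={"since": last_seq, "style": "all_docs"},
        auth=("admin", "password"),  # illustrative credentials
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()
    for change in body["results"]:
        print(change["id"], change["seq"])
    last_seq = body["last_seq"]
    time.sleep(1)
```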

Typically the /_changes response time is 2-10ms, but every few minutes it rockets up to 10,000-40,000ms. There is no associated uptick in CPU/mem resources.

Behaviour seems very similar to #1319.

Steps to Reproduce

None besides running the cluster in the scenario described above.

Expected Behaviour

Fast and stable /_changes response times

Your Environment

  • CouchDB version used: 2.3.1
  • Browser name and version: N/A (python requests lib)
  • Operating system and version: CentOS-7.6.1810

Additional Context

  • Two databases, but we read the changes feed on just one of them ('configs').
  • Two streaming replications are running on the 'configs' db (inbound and outbound with one other remote cluster).

The distribution of long response times looks a little odd. For example, I see a lot of response times of 10,003-10,005ms, almost as if there's been some sort of timeout at 10,000ms. One node will also tend to hit bad response times more often than the other.

The daemon that hits the /_changes endpoint randomly hits either of the two nodes, as they're behind a reverse proxy. However, traffic is balanced; one node is hit as often as the other.

Sample log of long response time:

[notice] 2020-02-28T05:42:34.120799Z [email protected] <0.15451.0> 76934fa50d ******:5984 10.7.65.121 admin GET /configs/_changes?feed=continuous&style=all_docs&since=%221065727-g1AAAABWeJzLYWBgYMxgTmEQTs4vTc5ISXIwNNAz1zMz1TM0MswBSjIlMuSxMPwHgqwM5iQGAaf_uUBRdvMkQ0vDRDPs-rIAuoQWpw%22&timeout=10000 200 ok 16978
nerdvegas commented Feb 28, 2020

I've just tried deleting the streaming replication jobs, and this appears to solve the problem. However, we need replication! When I recreate them, the problem reappears.

There is only a small amount of replication traffic though. Every few minutes there's a monitoring process that writes a test doc and checks to see if it successfully replicates to each remote cluster. We're only talking a handful of documents within a minute or two.
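
Roughly, the monitor does something like this (hosts, credentials, and doc contents are illustrative):

```python
import time
import uuid

import requests

# Illustrative replication check: write a test doc on the local
# cluster, then poll the remote cluster until the doc shows up
# (or we give up).
doc_id = f"repl-test-{uuid.uuid4()}"
local = "http://couch01:5984/configs"
remote = "http://remote01:5984/configs"
auth = ("admin", "password")

requests.put(
    f"{local}/{doc_id}", json={"ts": time.time()}, auth=auth
).raise_for_status()

deadline = time.time() + 120  # give replication a minute or two
while time.time() < deadline:
    if requests.get(f"{remote}/{doc_id}", auth=auth).status_code == 200:
        print("replicated ok")
        break
    time.sleep(5)
else:
    print("replication check timed out")
```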

nerdvegas commented

I've found further info by digging into the logs some more.

The problematic _changes requests all come from one of the cluster nodes, which is hitting the other node's _changes endpoint. The situation is as shown below:

 /----\
 |    |
 |    (bad-resp-times)
 |    |
 |    v
couch01-----(bad-resp-times)------>couch02
^                                    ^
|                                    |
|                                    |
(good-resp-times)                    (good-resp-times)
|                                    |
|                                    |
webservice---------------------------/

Note also that couch01 has bad _changes response times too, with requests from its own IP.

What I'm wondering is:

  • Why the bad response times only with requests from couch01?
  • Why does couch01 hit couch02 at all (is this an expected part of cluster operation? Is it related to replication?)
  • Why doesn't couch02 hit couch01?
  • Why is couch01 requesting changes feed from itself?

Thx
Allan

nickva commented Feb 28, 2020

[notice] 2020-02-28T05:42:34.120799Z [email protected] <0.15451.0> 76934fa50d ******:5984 10.7.65.121 admin GET /configs/_changes?feed=continuous&style=all_docs&since=%221065727-g1AAAABWeJzLYWBgYMxgTmEQTs4vTc5ISXIwNNAz1zMz1TM0MswBSjIlMuSxMPwHgqwM5iQGAaf_uUBRdvMkQ0vDRDPs-rIAuoQWpw%22&timeout=10000 200 ok 16978

The distribution of long response times looks a little odd. For example, I see a lot of response times of 10,003-10,005ms, almost as if there's been some sort of timeout at 10,000ms

Exactly. There is a timeout. What the timeout means: as long as there are changes on the source, the feed will stream them. There could be, say, 100M documents, and they would all stream out. Then, once they are all done and the stream is caught up, it will wait for 10 more seconds to see if more changes arrive. If there are no more changes, the feed will time out and stop, and the replicator will then start a new request from the last update sequence.
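
You can see this directly by issuing the same kind of request the replicator makes. A sketch (placeholder host; db and parameters taken from the log line above):

```python
import requests

# Mimic the replicator's continuous _changes request. With no writes
# arriving, the server holds the connection open for timeout=10000 ms
# (a query parameter to _changes, not an HTTP client timeout) and
# then ends the feed with a last_seq line.
resp = requests.get(
    "http://couch01:5984/configs/_changes",
    params={"feed": "continuous", "style": "all_docs",
            "since": "now", "timeout": 10000},
    stream=True,
)
for line in resp.iter_lines():
    if line:  # skip heartbeat/blank lines
        print(line.decode())  # change rows, then {"last_seq": ...}
```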

That timeout is part of the normal replication protocol. When I measure request latency, for example, I always exclude _changes feeds, as these long-running connections skew the results.

That timeout parameter is also derived from a setting you can configure:

https://docs.couchdb.org/en/stable/config/replicator.html#replicator/connection_timeout

And it is connection_timeout / 3. If you know you only replicate locally, you can set that value to something lower. For example, if it is set to 6000 (msec), then the timeout in the _changes requests will be 2000 msec, i.e. 2 seconds. However, that also means idle replications will restart their requests every 2 seconds, which doesn't seem that great, but it's something you can configure if you want it.
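
For example, lowering it through the node config API might look like this (node name, host, and credentials are placeholders):

```python
import requests

# Placeholder node name, host, and credentials. Config values are
# JSON-encoded strings; 6000 ms here would give a _changes timeout
# of 6000 / 3 = 2000 ms.
node = "couchdb@couch01"
url = (f"http://couch01:5984/_node/{node}"
       "/_config/replicator/connection_timeout")

resp = requests.put(url, json="6000", auth=("admin", "password"))
resp.raise_for_status()
print("previous value:", resp.json())
```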

nerdvegas commented

Ah OK, well, all is good if this is to be expected.

However, is there any way I can remove the effect of _changes from the average response time stats returned by _stats? This is throwing things off, and it's why I noticed this in the first place; I'd really like to have response time stats for typical requests only.

nickva commented Mar 4, 2020

However, is there any way I can remove the effect of _changes from the average response time stats

I was just doing it when monitoring the external (3rd-party proxy) logs, not anything in the core CouchDB code. So when building latency histograms I would add an extra step like "exclude _changes API calls".
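
Roughly like this, as a sketch against access logs in the format shown above (the log path and regex are illustrative):

```python
import re

# Compute latency percentiles from access-log lines like the [notice]
# ones above, where the response time in ms is the last field, while
# skipping _changes calls so long-lived feeds don't skew the stats.
LINE_RE = re.compile(r"(GET|PUT|POST|DELETE) (\S+) .* (\d+)$")

def latencies(lines):
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        path, ms = m.group(2), int(m.group(3))
        if "_changes" in path:
            continue  # the extra step: exclude _changes API calls
        yield ms

with open("couch.log") as f:  # illustrative log path
    samples = sorted(latencies(f))

if samples:
    p50 = samples[len(samples) // 2]
    p99 = samples[min(len(samples) - 1, int(len(samples) * 0.99))]
    print(f"n={len(samples)} p50={p50}ms p99={p99}ms")
```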

I think the original question was answered. Should we close the issue? What do you think?

wohali closed this as completed Mar 13, 2020