We have made 2 changes in the past that caused an unexpected edge case:
1. when faced with QUIC "no network activity", give up re-attempts and fall-back
2. when a protocol is chosen explicitly, rather than using auto (the default), do not fallback
The reasoning for 1. was to fallback quickly in situations where the user may not
have chosen QUIC, and simply got it because we auto-chose it (with the TXT DNS record),
but the users' environment does not allow egress via UDP.
The reasoning for 2. was to avoid falling back if the user explicitly chooses a
protocol. E.g., if the user chooses QUIC, she may want to do UDP proxying, so if
we fallback to HTTP2 protocol that will be unexpected since it does not support
UDP (and same applies for HTTP2 falling back to h2mux and TCP proxying).
This PR fixes the edge case that happens when both those changes 1. and 2. are
put together: when faced with a QUIC "no network activity", we should only try
to fallback if there is a possible fallback. Otherwise, we should exhaust the
retries as normal.
This adds various bug fixes when investigating why QUIC transports were
not being unregistered when they should (and only when the graceful shutdown
started).
Most of these bug fixes are making the QUIC transport implementation closer
to its HTTP2 counterpart:
- ServeControlStream is now a blocking function (it's up to the transport to handle that)
- QUIC transport then handles the control plane as part of its Serve, thus waiting for it on shutdown
- QUIC transport now returns "non recoverable" for connections with similar semantics to HTTP2 and H2mux
- QUIC transport no longer has a loop around its Serve logic that retries connections on its own (that logic is upstream)
This does a few fixes to make sure that the QUICConnection returns from
Serve when the context is cancelled.
QUIC transport now behaves like other transports: closes as soon as there
is no traffic, or at most by grace-period. Note that we do not wait for
UDP traffic since that's connectionless by design.
Connections from cloudflared to Cloudflare edge are long lived and may
break over time. That is expected for many reasons (ranging from network
conditions to operations within Cloudflare edge). Hence, logging that as
Error feels too strong and leads to users being concerned that something
is failing when it is actually expected.
With this change, we wrap logging about connection issues to be aware
of the tunnel state:
- if the tunnel has no connections active, we log as error
- otherwise we log as warning
The default max streams value of 100 is rather small when subject to
high load in terms of connecting QUIC with streams faster than it can
create new ones. This high value allows for more throughput.
All header transformation code from h2mux has been consolidated in the connection package since it's used by both h2mux and http2 logic.
Exported headers used by proxying between edge and cloudflared so then can be shared by tunnel service on the edge.
Moved access-related headers to corresponding packages that have the code that sets/uses these headers.
Removed tunnel hostname tracking from h2mux since it wasn't used by anything. We will continue to set the tunnel hostname header from the edge for backward compatibilty, but it's no longer used by cloudflared.
Move bastion-related logic into carrier package, untangled dependencies between carrier, origin, and websocket packages.
- Move packages the provide generic functionality (such as config) from `cmd` subtree to top level.
- Remove all dependencies on `cmd` subtree from top level packages.
- Consolidate all code dealing with token generation and transfer to a single cohesive package.
Jitter is important to avoid every cloudflared in the world trying to
reconnect at t=1, 2, 4, etc. That could overwhelm the backend. But
if each cloudflared randomly waits for up to 2, then up to 4, then up
to 8 etc, then the retries get spread out evenly across time.
On average, wait times should be the same (e.g. instead of waiting for
exactly 1 second, cloudflared will wait betweeen 0 and 2 seconds).
This is the "Full Jitter" algorithm from https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- Don't rely on edge to close connection on graceful shutdown in h2mux, start muxer shutdown from cloudflared.
- Don't retry failed connections after graceful shutdown has started.
- After graceful shutdown channel is closed we stop waiting for retry timer and don't try to restart tunnel loop.
- Use readonly channel for graceful shutdown in functions that only consume the signal