Patch to address #40673. An additional check has been added to
onion_pending_add() in order to ensure that we avoid counting create
cells from clients.
In the cpuworker.c assign_onionskin_to_cpuworker
method if total_pending_tasks >= max_pending_tasks
and channel_is_client(circ->p_chan) returns false then
rep_hist_note_circuit_handshake_dropped() will be called and
rep_hist_note_circuit_handshake_assigned() will not be called. This
causes relays to run into errors due to the fact that the number of
dropped packets exceeds the total number of assigned packets.
To avoid this situation a check has been added to
onion_pending_add() to ensure that these erroneous calls to
rep_hist_note_circuit_handshake_dropped() are not made.
See the #40673 ticket for the conversation with armadev about this issue.
mike is concerned that we would get too much exposure to adversaries,
if we enforce that none of our L2 guards can be in the same family.
this change set now essentially finishes the feature that commit a77727cdc
was attempting to add, but strips the "_and_family" part of that plan.
We had omitted some checks for whether our vanguards (second layer
guards from proposal 333) overlapped or came from the same family.
Now make sure to pick each of them to be independent.
Fixes bug 40639; bugfix on 0.4.7.1-alpha.
Remove UPTIME_TO_GUARANTEE_STABLE, MTBF_TO_GUARANTEE_STABLE,
TIME_KNOWN_TO_GUARANTEE_FAMILIAR WFU_TO_GUARANTEE_GUARD and replace each
of them with a tunnable torrc option.
Related to #40652
Signed-off-by: David Goulet <dgoulet@torproject.org>
Previously, `channelpadding_get_netflow_inactive_timeout_ms` would
crash with an assertion failure if `low_timeout` was greater than
`high_timeout`. That wasn't possible in practice because of checks
in `channelpadding_update_padding_for_channel`, but it's better not
to have a function whose correctness is this tricky to prove.
Fixes#40645. Bugfix on 0.3.1.1-alpha.
Note that with this commit, TRUNCATED cells won't be used anymore that
is client and relays won't emit them.
Fixes#40623
Signed-off-by: David Goulet <dgoulet@torproject.org>
This also incidently removes a use of uninitialized stack data from the
connection_or_set_ext_or_identifier() function.
Fixes#40648
Signed-off-by: David Goulet <dgoulet@torproject.org>
The escaped() function and its kin already wrap their output in
quotes: there's no reason to do so twice.
I am _NOT_ making a corresponding change in calls that make the same
mistake in controller-related functions, however, due to the risk of
a compatibility break. :(
Closes#22723.
Once the cpath is finalized, e2e encryption setup, transfer the ccontrol
from the rendezvous circuit to the cpath.
This allows the congestion control subsystem to properly function for
both upload and download side of onion services.
Closes#40586
Signed-off-by: David Goulet <dgoulet@torproject.org>
Introduces two new consensus parameter:
exit_dns_timeout: Number of seconds before libevent should consider
the DNS request a timeout.
exit_dns_num_attempts: Number of attempts that libeven should retry a
previously failing query before calling it a timeout.
Closes#40312
Signed-off-by: David Goulet <dgoulet@torproject.org>
Due to a possible Guard subsystem recursion, when the HS client gets
notified that the directory information has changed, it must run it in a
seperate mainloop event to avoid such issue.
See the ticket for more information on the recursion. This also fixes a
fatal assert.
Fixes#40579
Signed-off-by: David Goulet <dgoulet@torproject.org>
It is possible to not have the descriptor anymore by the time the
rendezvous circuit opens. Don't BUG() on that.
Instead, when sending the INTRODUCE1 cell, make sure the descriptor we
have (or have just fetched) matches what we setup in the rendezvous
circuit.
If not, the circuit is closed and another one is opened for a retry.
Fixes#40576
Signed-off-by: David Goulet <dgoulet@torproject.org>
Prometheus needs unique labels and so this bug was causing an onion
service with multiple ports to have multiple "port=" label for the
metrics requiring a port label.
Fixes#40581
Signed-off-by: David Goulet <dgoulet@torproject.org>
Prometheus needs unique labels and so this bug was causing an onion
service with multiple ports to have multiple "port=" label for the
metrics requiring a port label.
Fixes#40581
Signed-off-by: David Goulet <dgoulet@torproject.org>
It is possible that a scheduled channel ended up with 0 bytes in its
outbuf after the scheduling loop and having an outbuf table entry
indicating that we need to flush bytes on the wire after the loop.
This lead to attempt to write 0 bytes up to the TLS layer that would
prevent such action.
All in all, this fixes wasted CPU cycles on attempting to flush nothing.
Fixes#40548
Signed-off-by: David Goulet <dgoulet@torproject.org>
We removed HSIntro=3 and HSDir=1 that are v2 specific. Since 0.3.5.17,
we do not support introducing or being a directory for onion service v2.
Closes#40509
Signed-off-by: David Goulet <dgoulet@torproject.org>
Change it from "timeout" to "tor_timeout" in order to indicate that the
DNS timeout is one from tor's DNS threshold and not the DNS server
itself.
Fixes#40527
Signed-off-by: David Goulet <dgoulet@torproject.org>
Tor has configure libevent to attempt up to 3 times a DNS query for a
maximum of 5 seconds each. Once that 5 seconds has elapsed, it consider
the query "Timed Out" but tor only gets a timeout if all 3 attempts have
failed.
For example, using Unbound, it has a much higher threshold of timeout.
It is well defined in
https://www.nlnetlabs.nl/documentation/unbound/info-timeout/ and has
some complexity to it. But the gist is that if it times out, it will be
much more than 5 seconds.
And so the Tor DNS timeouts are more of a "UX issue" rather than a
"network issue". For this reason, we are removing this metric from the
overload general signal.
See https://gitlab.torproject.org/tpo/network-health/team/-/issues/139
for more information.
Fixes#40527
Signed-off-by: David Goulet <dgoulet@torproject.org>
This avoids performing and then freeing a lot of small mallocs() if
the hash line has too many elements.
Fixes one case of bug 40472; resolves OSS-Fuzz 38363. Bugfix on
0.3.1.1-alpha when the consdiff parsing code was introduced.
When a new consensus method is negotiated, these values will all get
replaced with "2038-01-01 00:00:00".
This change should be safe because:
* As of 0.2.9.11 / 0.3.0.7 / 0.3.1.1-alpha, Tor takes no action
about published_on times in the future.
* The only remaining parties relying on published_on values are (we
believe) relays running 0.3.5.x, which rely on the values in a NS
consensus to see whether their descriptors are out of date. But
this patch only changes microdesc consensuses.
* The latest Tor no longer looks at this field in consensuses.
Why make this change? In experiments, replacing these values with a
fixed value made the size of compressed consensus diffs much much
smaller. (Like, by over 50%!)
Implements proposal 275; Implements #40130.
The connect failure cache had a bad interaction with retrying connections
to our guards or bridges when we go offline and then come back online --
while offline we would fail to connect and cache this result, and then
when we return we would decline to even attempt to connect, because our
failure cache said it wouldn't work.
Now only cache connect failures for relays when we connected to them
because of somebody else's EXTEND request.
Fixes bug 40499; bugfix on 0.3.3.4-alpha.
From LibreSSL versions 3.2.1 through 3.4.0, our configure script
would conclude that TLSv1.3 as supported, but it actually wasn't.
This led to annoying breakage like #40128 and #40445.
Now we give an error message if we try to build with one of those
versions.
Closes#40511.
Previously the logic was reversed, and always gave the wrong answer.
This has no other effect than to change whether we suppress
deprecated API warnings.
Fixes#40429; bugfix on 0.3.5.13.
Mingw headers sometimes like to define alternative scanf/printf
format attributes depending on whether they're using clang, UCRT,
MINGW_ANSI_STDIO, or the microsoft version of printf/scanf. This
change attempts to use the right one on the given platform.
This is an attempt to fix part of #40355.
Our code doesn't allow it and so this prevents an assert() crash if the
DirPort is for instance IPv6 only.
Fixes#40494
Signed-off-by: David Goulet <dgoulet@torproject.org>
When we don't yet have a descriptor for one of our bridges, disable
the entry guard retry schedule on that bridge. The entry guard retry
schedule and the bridge descriptor retry schedule can conflict,
e.g. where we mark a bridge as "maybe up" yet we don't try to fetch
its descriptor yet, leading Tor to wait (refusing to do anything)
until it becomes time to fetch the descriptor.
Fixes bug 40497; bugfix on 0.3.0.3-alpha.
we do a notice-level log when we decide we *don't* have enough dir
info, but in 0.3.5.1-alpha (see commit eee62e13d9, #14950) we lost our
corresponding notice-level log when things come back.
bugfix on 0.3.5.1-alpha; fixes bug 40496.
When we try to fetch a bridge descriptor and we fail, we mark
the guard as failed, but we never scheduled a re-compute for
router_have_minimum_dir_info().
So if we had already decided we needed to wait for this new descriptor,
we would just wait forever -- even if, counterintuitively, *losing* the
bridge is just what we need to *resume* using the network, if we had it
in state GUARD_REACHABLE_MAYBE and we were stalling to learn this outcome.
See bug 40396 for more details.
When we looked, this was the third most frequent message at
PROTOCOL_WARN, and doesn't actually tell us what to do about it.
Now:
* we just log it at info
* we log it only once per circuit
* we report, in the heartbeat, how many times it happens, how many
cells it happens with per circuit, and how long these circuits
have been alive (on average).
Fixes the final part of #40400.
This is due to the libevent bug
https://github.com/libevent/libevent/issues/1219 that fails to return
back the DNS record type on error.
And so, the MetricsPort now only reports the errors as a global counter
and not a per record type.
Closes#40490
Signed-off-by: David Goulet <dgoulet@torproject.org>
With this commit, we will only report a general overload state if we've
seen more than X% of DNS timeout errors over Y seconds. Previous
behavior was to report when a single timeout occured which is really too
small of a threshold.
The value X is a consensus parameters called
"overload_dns_timeout_scale_percent" which is a scaled percentage
(factor of 1000) so we can represent decimal points for X like 0.5% for
instance. Its default is 1000 which ends up being 1%.
The value Y is a consensus parameters called
"overload_dns_timeout_period_secs" which is the time period for which
will gather DNS errors and once over, we assess if that X% has been
reached ultimately triggering a general overload signal.
Closes#40491
Signed-off-by: David Goulet <dgoulet@torproject.org>
This means that at this commit, tor will stop logging that v2 is
deprecated and treat a v2 address as a bad hostname that we can't use.
Part of #40476
Signed-off-by: David Goulet <dgoulet@torproject.org>
Values greater than 100 would have had the same effect as 100, so
this doesn't actually change Tor's behavior; it just makes the
intent clearer. Fixes#40486; see also torspec#66.
This is the loudest of our LOG_PROTOCOL_WARN messages, it can occur
naturally, and there doesn't seem to be a great response to it.
Partial fix for 40400; bugfix on 0.1.1.13-alpha.
This one happens every time we get a failure from
circuit_receive_relay_cell -- but for all the relevant failing cases
in that function, we already log in that function.
This resolves one case of #40400. Two cases remain.
Series 0.4.2.x, 0.4.3.x and 0.4.4.x will all be rejected at the
authority level at this commit.
Futhermore, the 0.4.5.x alphas and rc will also be rejected.
Closes#40480
Signed-off-by: David Goulet <dgoulet@torproject.org>
The connection_ap_attach_pending() function processes all pending
streams in the pending_entry_connections list. It first copy the pointer
and then allocates a brand new empty list.
It then iterates over that copy pointer to try to attach entry
connections onto any fitting circuits using
connection_ap_handshake_attach_circuit().
That very function, for onion service, can lead to flagging _all_
streams of the same onion service to be put in state RENDDESC_WAIT from
CIRCUIT_WAIT. By doing so, it also tries to remove them from the
pending_entry_connections but at that point it is already empty.
Problem is that the we are iterating over the previous
pending_entry_connections which contains the streams that have just
changed state and are no longer in CIRCUIT_WAIT.
This lead to this bug warning occuring a lot on busy services:
May 01 08:55:43.000 [warn] connection_ap_attach_pending(): Bug:
0x55d8764ae550 is no longer in circuit_wait. Its current state is
waiting for rendezvous desc. Why is it on pending_entry_connections?
(on Tor 0.4.4.0-alpha-dev )
This fix is minimal and basically allow a state to be not CIRCUIT_WAIT
and move on to the next one without logging a warning. Because the
pending_entry_connections is emptied before processing, there is no
chance for a streams to be stuck there forever thus it is OK to ignore
streams not in the right state.
Fixes#34083
Signed-off-by: David Goulet <dgoulet@torproject.org>
When building with --enable-fragile-hardening, add or relax Linux
seccomp rules to allow AddressSanitizer to execute normally if the
process terminates with the sandbox active.
Further resolves issue 11477.