We cap our number of CPU worker threads to at least 2 even if we have a
single core. But also, before we used to always add one extra thread
regardless of the number of core.
This meant that we were off when re-using the get_num_cpus() function
when calculating our onionskin work overhead because we were always off
by one.
This commit makes it that we always use the number of thread our actual
thread pool was configured with.
Fixes#40719
Signed-off-by: David Goulet <dgoulet@torproject.org>
Created and Rejected connections are ever going up counters. While
Opened connections are gauges going up and down.
Fixes#40712
Signed-off-by: David Goulet <dgoulet@torproject.org>
This change mitigates DNS-based website oracles by making the time that
a domain name is cached uncertain (+- 4 minutes of what's measurable).
Resolves TROVE-2021-009.
Fixes#40674
This is part of the fast path so we need to cache consensus parameters
instead of querying it everytime we need to learn a value.
Part of #40704
Signed-off-by: David Goulet <dgoulet@torproject.org>
Until now, there was this magic number (64) used as the maximum number
of tasks a CPU worker can take at once.
This commit makes it a consensus parameter so our future selves can
think of a better value depending on network conditions.
Part of #40704
Signed-off-by: David Goulet <dgoulet@torproject.org>
Transform the hardcoded value ONIONQUEUE_WAIT_CUTOFF into a consensus
parameter so we can control it network wide.
Closes#40704
Signed-off-by: David Goulet <dgoulet@torproject.org>
This also incidently removes a use of uninitialized stack data from the
connection_or_set_ext_or_identifier() function.
Fixes#40648
Signed-off-by: David Goulet <dgoulet@torproject.org>
Remove a harmless "Bug" log message that can happen in
relay_addr_learn_from_dirauth() on relays during startup:
tor_bug_occurred_(): Bug: ../src/feature/relay/relay_find_addr.c:225: relay_addr_learn_from_dirauth: Non-fatal assertion !(!ei) failed. (on Tor 0.4.7.10 )
Bug: Tor 0.4.7.10: Non-fatal assertion !(!ei) failed in relay_addr_learn_from_dirauth at ../src/feature/relay/relay_find_addr.c:225. Stack trace: (on Tor 0.4.7.10 )
Finishes fixing bug 40231.
Fixes bug 40523; bugfix on 0.4.5.4-rc.
Move the retry from circuit_expire_building() to when the offending
circuit is being closed.
Fixes#40695
Signed-off-by: David Goulet <dgoulet@torproject.org>
Logic is too convoluted and we can't efficiently apply a specific
timeout depending on the purpose.
Remove it and instead rely on the right circuit cutoff instead of
keeping this flagged circuit open forever.
Part of #40694
Signed-off-by: David Goulet <dgoulet@torproject.org>
Change it to an "unreachable" error so the intro point can be retried
and not flagged as a failure and never retried again.
Closes#40692
Signed-off-by: David Goulet <dgoulet@torproject.org>
Directory authorities and relays now interact properly with directory
authorities if they change addresses. In the past, they would continue
to upload votes, signatures, descriptors, etc to the hard-coded address
in the configuration. Now, if the directory authority is listed in
the consensus at a different address, they will direct queries to this
new address.
Specifically, these three activities have changed:
* Posting a vote, a signature, or a relay descriptor to all the dir auths.
* Dir auths fetching missing votes or signatures from all the dir auths.
* Dir auths fetching new descriptors from a specific dir auth when they
just learned about them from that dir auth's vote.
We already do this desired behavior (prefer the address in the consensus,
but fall back to the hard-coded dirservers info if needed) when fetching
missing certs.
There is a fifth case, in router_pick_trusteddirserver(), where clients
and relays are trying to reach a random dir auth to fetch something. I
left that case alone for now because the interaction with fallbackdirs
is complicated.
Implements ticket 40705.
Directory authorities stop voting a consensus "Measured" weight
for relays with the Authority flag. Now these relays will be
considered unmeasured, which should reserve their bandwidth
for their dir auth role and minimize distractions from other roles.
In place of the "Measured" weight, they now include a
"MeasuredButAuthority" weight (not used by anything) so the bandwidth
authority's opinion on this relay can be recorded for posterity.
Resolves ticket 40698.
The AuthDirDontVoteOnDirAuthBandwidth torrc option never worked, and it
was implemented in a way that could have produced consensus conflicts
if it had.
Resolves bug 40700.
Move the retry from circuit_expire_building() to when the offending
circuit is being closed.
Fixes#40695
Signed-off-by: David Goulet <dgoulet@torproject.org>
Logic is too convoluted and we can't efficiently apply a specific
timeout depending on the purpose.
Remove it and instead rely on the right circuit cutoff instead of
keeping this flagged circuit open forever.
Part of #40694
Signed-off-by: David Goulet <dgoulet@torproject.org>
Change it to an "unreachable" error so the intro point can be retried
and not flagged as a failure and never retried again.
Closes#40692
Signed-off-by: David Goulet <dgoulet@torproject.org>
Patch to address #40673. An additional check has been added to
onion_pending_add() in order to ensure that we avoid counting create
cells from clients.
In the cpuworker.c assign_onionskin_to_cpuworker
method if total_pending_tasks >= max_pending_tasks
and channel_is_client(circ->p_chan) returns false then
rep_hist_note_circuit_handshake_dropped() will be called and
rep_hist_note_circuit_handshake_assigned() will not be called. This
causes relays to run into errors due to the fact that the number of
dropped packets exceeds the total number of assigned packets.
To avoid this situation a check has been added to
onion_pending_add() to ensure that these erroneous calls to
rep_hist_note_circuit_handshake_dropped() are not made.
See the #40673 ticket for the conversation with armadev about this issue.
mike is concerned that we would get too much exposure to adversaries,
if we enforce that none of our L2 guards can be in the same family.
this change set now essentially finishes the feature that commit a77727cdc
was attempting to add, but strips the "_and_family" part of that plan.
We had omitted some checks for whether our vanguards (second layer
guards from proposal 333) overlapped or came from the same family.
Now make sure to pick each of them to be independent.
Fixes bug 40639; bugfix on 0.4.7.1-alpha.
Remove UPTIME_TO_GUARANTEE_STABLE, MTBF_TO_GUARANTEE_STABLE,
TIME_KNOWN_TO_GUARANTEE_FAMILIAR WFU_TO_GUARANTEE_GUARD and replace each
of them with a tunnable torrc option.
Related to #40652
Signed-off-by: David Goulet <dgoulet@torproject.org>
This also incidently removes a use of uninitialized stack data from the
connection_or_set_ext_or_identifier() function.
Fixes#40648
Signed-off-by: David Goulet <dgoulet@torproject.org>
The escaped() function and its kin already wrap their output in
quotes: there's no reason to do so twice.
I am _NOT_ making a corresponding change in calls that make the same
mistake in controller-related functions, however, due to the risk of
a compatibility break. :(
Closes#22723.
It came as a surprise that Serge, the bridge authority, omits the Running
flag for all bridges in its first 30 minutes after a restart:
https://bugs.torproject.org/tpo/anti-censorship/rdsys/102
The fix we're doing for now is to accept it as correct behavior in
Tor, and change all the supporting tools to be able to handle bridge
networkstatus docs that have no Running bridges.
I'm documenting it here inside Tor too so the next person might not
be so surprised.
We had 3 callsites setting up the circuit congestion control and so this
commit consolidates all 3 calls into 1 function.
Related to #40586
Signed-off-by: David Goulet <dgoulet@torproject.org>
Once the cpath is finalized, e2e encryption setup, transfer the ccontrol
from the rendezvous circuit to the cpath.
This allows the congestion control subsystem to properly function for
both upload and download side of onion services.
Closes#40586
Signed-off-by: David Goulet <dgoulet@torproject.org>
This applies only for relays. Previous commit adds two new consensus
parameters that dictate how libevent is configured with DNS resolution.
And so, with a new consensus, we now look at those values in case they
ever change.
Without this, Exit relay would have to HUP or restart to apply any new
Exit DNS consensus parameters.
Related to #40312
Signed-off-by: David Goulet <dgoulet@torproject.org>
Introduces two new consensus parameter:
exit_dns_timeout: Number of seconds before libevent should consider
the DNS request a timeout.
exit_dns_num_attempts: Number of attempts that libeven should retry a
previously failing query before calling it a timeout.
Closes#40312
Signed-off-by: David Goulet <dgoulet@torproject.org>
This code was heavily reused from the previous DNS timeout work done in
ticket #40491 that was removed afterall from our code.
Closes#40560
Signed-off-by: David Goulet <dgoulet@torproject.org>
Due to a possible Guard subsystem recursion, when the HS client gets
notified that the directory information has changed, it must run it in a
seperate mainloop event to avoid such issue.
See the ticket for more information on the recursion. This also fixes a
fatal assert.
Fixes#40579
Signed-off-by: David Goulet <dgoulet@torproject.org>
It is possible to not have the descriptor anymore by the time the
rendezvous circuit opens. Don't BUG() on that.
Instead, when sending the INTRODUCE1 cell, make sure the descriptor we
have (or have just fetched) matches what we setup in the rendezvous
circuit.
If not, the circuit is closed and another one is opened for a retry.
Fixes#40576
Signed-off-by: David Goulet <dgoulet@torproject.org>
Prometheus needs unique labels and so this bug was causing an onion
service with multiple ports to have multiple "port=" label for the
metrics requiring a port label.
Fixes#40581
Signed-off-by: David Goulet <dgoulet@torproject.org>
Prometheus needs unique labels and so this bug was causing an onion
service with multiple ports to have multiple "port=" label for the
metrics requiring a port label.
Fixes#40581
Signed-off-by: David Goulet <dgoulet@torproject.org>
Republishing is necessary to ensure that clients connect using the correct
sendme_inc upon any change. Additionally, introduction points must be
re-chosen, so that cached descriptors with old values are not usable.
We do not expect to change sendme_inc, unless cell size or TLS record size
changes, so this should be rare.
Signed-off-by: David Goulet <dgoulet@torproject.org>
Move it to extension.trunnel instead so that extension ABI construction
can be used in other parts of tor than just HS cells.
Specifically, we'll use it in the ntorv3 data payload and make a
congestion control parameter extension using that binary structure.
Only rename. No code behavior changes.
Signed-off-by: David Goulet <dgoulet@torproject.org>
Introduced in bf10206e9e which is not
released yet thus no changes file.
Found by Coverity with cid #1495786.
Fixes#40532
Signed-off-by: David Goulet <dgoulet@torproject.org>
Change it from "timeout" to "tor_timeout" in order to indicate that the
DNS timeout is one from tor's DNS threshold and not the DNS server
itself.
Fixes#40527
Signed-off-by: David Goulet <dgoulet@torproject.org>
Tor has configure libevent to attempt up to 3 times a DNS query for a
maximum of 5 seconds each. Once that 5 seconds has elapsed, it consider
the query "Timed Out" but tor only gets a timeout if all 3 attempts have
failed.
For example, using Unbound, it has a much higher threshold of timeout.
It is well defined in
https://www.nlnetlabs.nl/documentation/unbound/info-timeout/ and has
some complexity to it. But the gist is that if it times out, it will be
much more than 5 seconds.
And so the Tor DNS timeouts are more of a "UX issue" rather than a
"network issue". For this reason, we are removing this metric from the
overload general signal.
See https://gitlab.torproject.org/tpo/network-health/team/-/issues/139
for more information.
Fixes#40527
Signed-off-by: David Goulet <dgoulet@torproject.org>
This avoids performing and then freeing a lot of small mallocs() if
the hash line has too many elements.
Fixes one case of bug 40472; resolves OSS-Fuzz 38363. Bugfix on
0.3.1.1-alpha when the consdiff parsing code was introduced.
Some PT applications support more than one transport. For example,
obfs4proxy supports obfs4, obfs3, and meek. If one or more transports
specified in the torrc file are supported, we shouldn't kill the managed
proxy on a {C,S}METHOD-ERROR. Instead, we should log a warning.
We were already logging warnings on method errors. This change just
makes sure that the managed proxy isn't killed, and then if no
transports are configured for the managed proxy, bumps the log level up
from a notice to a warning.
Closes#7362
When a new consensus method is negotiated, these values will all get
replaced with "2038-01-01 00:00:00".
This change should be safe because:
* As of 0.2.9.11 / 0.3.0.7 / 0.3.1.1-alpha, Tor takes no action
about published_on times in the future.
* The only remaining parties relying on published_on values are (we
believe) relays running 0.3.5.x, which rely on the values in a NS
consensus to see whether their descriptors are out of date. But
this patch only changes microdesc consensuses.
* The latest Tor no longer looks at this field in consensuses.
Why make this change? In experiments, replacing these values with a
fixed value made the size of compressed consensus diffs much much
smaller. (Like, by over 50%!)
Implements proposal 275; Implements #40130.
Nothing breaks here, since all non-voting users of
routerstatus_t.published_on have been adjusted or removed in
previous commits.
We have to expand the API of routerstatus_format_entry() a bit,
though, so that it can always get a published time as argument,
since it can't get it from the routerstatus any more.
This should have no effect on voter behavior.
It used to describe when the old and new routerinfos were published
when we'd decide to download a routerinfo. Now it describes what
their descriptor digests are.
The consensus voters shouldn't actually include such old routers in
the consensus anyway, so this logic shouldn't come up...
but if a client _does_ download something it wouldn't use, it won't
retry infinitely: see checks for WRA_NEVER_DOWNLOADABLE.
Previously we'd look at the routerstatus published_on field when
deciding what to dump, which really has no point. If something's in
the consensus with an ancient published date, then we do want to
keep it.
Mingw headers sometimes like to define alternative scanf/printf
format attributes depending on whether they're using clang, UCRT,
MINGW_ANSI_STDIO, or the microsoft version of printf/scanf. This
change attempts to use the right one on the given platform.
This is an attempt to fix part of #40355.
This also moves the warnings and add some theatrical effect around the
code so anyone modifying those list should notice the warnings signs and
read the comment accordingly.
Signed-off-by: David Goulet <dgoulet@torproject.org>
Our code doesn't allow it and so this prevents an assert() crash if the
DirPort is for instance IPv6 only.
Fixes#40494
Signed-off-by: David Goulet <dgoulet@torproject.org>
When we don't yet have a descriptor for one of our bridges, disable
the entry guard retry schedule on that bridge. The entry guard retry
schedule and the bridge descriptor retry schedule can conflict,
e.g. where we mark a bridge as "maybe up" yet we don't try to fetch
its descriptor yet, leading Tor to wait (refusing to do anything)
until it becomes time to fetch the descriptor.
Fixes bug 40497; bugfix on 0.3.0.3-alpha.
we do a notice-level log when we decide we *don't* have enough dir
info, but in 0.3.5.1-alpha (see commit eee62e13d9, #14950) we lost our
corresponding notice-level log when things come back.
bugfix on 0.3.5.1-alpha; fixes bug 40496.
Specifically, every time a guard moves into or out of state
GUARD_REACHABLE_MAYBE, it is an opportunity for the guard reachability
state to get out of sync with the have-minimum-dir-info state.
Fixes even more of #40396.
When we try to fetch a bridge descriptor and we fail, we mark
the guard as failed, but we never scheduled a re-compute for
router_have_minimum_dir_info().
So if we had already decided we needed to wait for this new descriptor,
we would just wait forever -- even if, counterintuitively, *losing* the
bridge is just what we need to *resume* using the network, if we had it
in state GUARD_REACHABLE_MAYBE and we were stalling to learn this outcome.
See bug 40396 for more details.
The bridge descriptor fetching codes ends up fetching a lot of duplicate
bridge descriptors, because this is how we learn when the descriptor
changes.
This commit only changes comments plus whether we log that one line.
It moves us back to the old behavior, before the previous commit for
30496, where we would only log that line when the bridge descriptor
we're talking about is better than the one we already had (if any).
Without this change, if we have a working bridge, and we add a new bridge,
we will schedule the fetch attempt for that new bridge descriptor for
three hours(!) in the future.
This change is especially needed because of bug #40396, where if you have
one working bridge and one bridge whose descriptor you haven't fetched
yet, your Tor will stall until you have successfully fetched that new
descriptor -- in this case for hours.
In the old design, we would put off all further bridge descriptor fetches
once we had any working bridge descriptor. In this new design, we make the
decision per bridge based on whether we successfully got *its* descriptor.
To make this work, we need to also call learned_bridge_descriptor() every
time we get a bridge descriptor, not just when it's a novel descriptor.
Fixes bug 40396.
Also happens to fix bug 40495 (redundant descriptor fetches for every
bridge) since now we delay fetches once we succeed.
A side effect of this change is that if we have any configured bridges
that *aren't* working, we will keep trying to fetch their descriptors
on the modern directory retry schedule -- every couple of seconds for
the first half minute, then backing off after that -- which is a lot
faster than before.
When this method is in place, then any relay which is assigned
MiddleOnly has Exit, V2Dir, Guard, and HSDir cleared
(and has BadExit set if appropriate).
This proposal implements part of Prop335; it's based on a patch
from Neel Chauhan.
When configured to do so, authorities will assign a MiddleOnly flag
to certain relays. Any relay which an authority gives this flag
will not get Exit, V2Dir, Guard, or HSDir, and might get BadExit if
the authority votes for that one.