Troubleshooting
Different problems that can occur on a new instance, and how to solve them.
Many Lemmy features depend on a correct reverse proxy configuration. Make sure yours is equivalent to our nginx config.
General
Logs
For frontend issues, check the browser console for any error messages.
For server logs, run docker compose logs -f lemmy in your installation folder. You can also run docker compose logs -f lemmy lemmy-ui pictrs to get logs from several services at once.
If that doesn't give enough information, try changing the line RUST_LOG=error in docker-compose.yml to RUST_LOG=info or RUST_LOG=trace, then run docker compose up -d so the container is recreated with the new setting (docker compose restart alone does not pick up changes to docker-compose.yml).
Creating admin user doesn't work
Make sure that websocket is working correctly, by checking the browser console for errors. In nginx, the following headers are important for this:
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
Rate limit error when many users access the site
Check that the headers X-Real-IP and X-Forwarded-For are sent to Lemmy by the reverse proxy. Otherwise, Lemmy will count all actions towards the rate limit of the reverse proxy's IP. In nginx it should look like this:
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
Federation
Other instances can't fetch local objects (community, post, etc)
Your reverse proxy (e.g. nginx) needs to forward requests with the header Accept: application/activity+json to the backend. This is handled by the following lines:
set $proxpass "http://0.0.0.0:{{ lemmy_ui_port }}";
if ($http_accept = "application/activity+json") {
  set $proxpass "http://0.0.0.0:{{ lemmy_port }}";
}
if ($http_accept = "application/ld+json; profile=\"https://www.w3.org/ns/activitystreams\"") {
  set $proxpass "http://0.0.0.0:{{ lemmy_port }}";
}
proxy_pass $proxpass;
You can test that it works correctly by running the following commands. All of them should return valid JSON:
curl -H "Accept: application/activity+json" https://your-instance.com/u/some-local-user
curl -H "Accept: application/activity+json" https://your-instance.com/c/some-local-community
curl -H "Accept: application/activity+json" https://your-instance.com/post/123 # the id of a local post
curl -H "Accept: application/activity+json" https://your-instance.com/comment/123 # the id of a local comment
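The checks above can be wrapped in a small helper so they are easy to re-run after each proxy change. This is only a sketch: the function names is_valid_json and check_apub are made up for illustration, and it assumes curl and python3 are installed on the machine you test from.

```shell
# Hypothetical helper names; assumes curl and python3 are available.

# Succeeds only if stdin contains a parseable JSON document.
is_valid_json() {
  python3 -c 'import json, sys; json.load(sys.stdin)' >/dev/null 2>&1
}

# Requests a URL with the ActivityPub Accept header and reports
# whether the response body is valid JSON.
check_apub() {
  if curl -fsS -H "Accept: application/activity+json" "$1" | is_valid_json; then
    echo "ok    $1"
  else
    echo "FAIL  $1"
    return 1
  fi
}

# Usage (needs network access):
#   check_apub https://your-instance.com/u/some-local-user
#   check_apub https://your-instance.com/c/some-local-community
```

A FAIL line often means the request was routed to lemmy-ui and came back as HTML instead of JSON, which points at the Accept header handling in the reverse proxy.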
Fetching remote objects works, but posting/commenting in remote communities fails
Check that federation is allowed on both instances.
Also ensure that the time is correct on your server. Activities are signed with a timestamp, and will be discarded if it is off by more than one hour.
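A rough way to check for clock drift is to compare your server's clock with the Date header returned by another instance. The sketch below assumes GNU date and curl; clock_skew_seconds is a hypothetical helper name and other-instance.example is a placeholder domain.

```shell
# Hypothetical helper: absolute difference between two Unix timestamps.
clock_skew_seconds() {
  local diff=$(( $1 - $2 ))
  echo "${diff#-}"   # strip a leading minus sign to get the absolute value
}

# Usage (needs network access and GNU date):
#   remote=$(curl -sI https://other-instance.example | tr -d '\r' | sed -n 's/^[Dd]ate: //p')
#   clock_skew_seconds "$(date +%s)" "$(date -d "$remote" +%s)"
```

A result far below 3600 seconds suggests that the one-hour limit is not what is causing activities to be discarded.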
It is possible that federation requests to /inbox are blocked by tools such as Cloudflare. The sending instance can find HTTP errors with the following steps:
- If you use a separate container for outgoing federation, apply the following steps to that container only.
- Set RUST_LOG=lemmy_federate=trace for Lemmy.
- Reload the new configuration:
docker compose up -d
- Search for messages containing the target instance domain:
docker compose logs -f --tail=100 lemmy | grep -F lemm.ee -C 10
- You may also have to reset the fail count for the target instance (see below).
Reset federation fail count for instance
If federation sending to a specific instance has been failing consistently, Lemmy will slow down sending using exponential backoff. For testing it can be useful to reset this and make Lemmy send activities immediately. To do this use the following steps:
- Stop Lemmy, or specifically the container for outgoing federation:
docker compose stop lemmy
- Enter the SQL command line:
sudo docker compose exec postgres psql -U lemmy
- Reset the failure count via SQL:
update federation_queue_state
set fail_count = 0
from instance
where instance.id = federation_queue_state.instance_id
and instance.domain = 'lemm.ee';
- Exit the SQL command line with \q, then restart Lemmy:
docker compose start lemmy
Other instances don't receive actions reliably
Lemmy uses one queue per federated instance to send out activities. Search the logs for "Federation state" for summaries. Errors will also be logged.
For details, execute this SQL query:
select domain, (select last_value from sent_activity_id_seq) as latest_id, last_successful_id, fail_count, last_retry
from federation_queue_state
join instance on instance_id = instance.id
order by last_successful_id asc;
You will see a table like the following:
domain | latest_id | last_successful_id | fail_count | last_retry |
---|---|---|---|---|
toad.work | 6837196 | 6832351 | 14 | 2023-07-12 21:42:22.642379+00 |
lemmy.deltaa.xyz | 6837196 | 6837196 | 0 | 1970-01-01 00:00:00+00 |
battleangels.net | 6837196 | 6837196 | 0 | 1970-01-01 00:00:00+00 |
social.fbxl.net | 6837196 | 6837196 | 0 | 1970-01-01 00:00:00+00 |
mastodon.coloradocrest.net | 6837196 | 6837196 | 0 | 1970-01-01 00:00:00+00 |
This will show you exactly which instances are up to date or not.
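To pick out the instances that are behind, the difference between latest_id and last_successful_id can be computed per row. The following is a minimal sketch, assuming pipe-separated rows like the ones in the table above (psql produces this format with the -A and -t flags); lagging_instances is a hypothetical helper name.

```shell
# Hypothetical helper: reads "domain|latest_id|last_successful_id|..."
# rows on stdin and prints each domain that is behind, with its lag.
lagging_instances() {
  awk -F'|' '{
    gsub(/ /, "", $1)            # tolerate padded columns
    lag = $2 - $3                # activities not yet delivered
    if (lag > 0) print $1, lag
  }'
}

# Sample rows taken from the table above.
printf '%s\n' \
  'toad.work | 6837196 | 6832351 | 14' \
  'lemmy.deltaa.xyz | 6837196 | 6837196 | 0' | lagging_instances
# prints: toad.work 4845
```

Header and separator rows are ignored because their lag evaluates to zero.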
You can also use the following website, which lets you monitor this without admin rights and lets others see it as well:
https://phiresky.github.io/lemmy-federation-state/site
You don't receive actions reliably
Because Lemmy uses one queue per instance, remote Lemmy instances send apub sync actions to you serially. If your server processes them more slowly than the origin server sends them, your instance will appear in the "lagging behind" section when you look up the remote server on lemmy-federation-state.
This can be avoided by setting the config value federation.concurrent_sends_per_instance to a value greater than 1 on the sending instance.
Typically, processing an incoming action should take less than 100ms. If it takes longer, this might indicate problems with your database performance or your networking setup.
Note that an apub ingestion speed which is sufficient for most other instances can still be insufficient if one origin server receives actions from its userbase faster than you can process them. For example, if the origin server receives 10 actions per second but you can only process 8 actions per second, you will inevitably fall behind that one server.
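To make the arithmetic concrete, the sketch below computes how large the backlog becomes over time for the rates in this example. backlog_after is a hypothetical helper, and it assumes constant rates and an initially empty queue.

```shell
# Hypothetical helper: backlog (in actions) accumulated after a number of
# seconds, given incoming and processed rates in actions per second.
backlog_after() {
  local incoming=$1 processed=$2 seconds=$3
  local growth=$(( incoming - processed ))
  if [ "$growth" -lt 0 ]; then
    growth=0   # a consumer that keeps up never builds a backlog
  fi
  echo $(( growth * seconds ))
}

backlog_after 10 8 3600    # one hour: prints 7200
backlog_after 10 8 86400   # one day:  prints 172800
```

Even a shortfall of 2 actions per second therefore leaves you well over a hundred thousand actions behind after a single day.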
These steps might help you diagnose this.
Check processing time on the loadbalancer
Check how long a request takes to process on the backend. In haproxy, for example, the following command shows the time it takes for apub actions to complete:
tail -f /var/log/haproxy.log | grep "POST /inbox"
If these actions take more than 100ms, you might want to investigate deeper.
Check your Database performance
Ensure that the database server's CPU and RAM utilization is not very high.
Afterwards, check for slow queries. If you regularly see common queries with high max and mean exec times, your database might be struggling. The SQL query below lists frequently called queries with a high maximum execution time (you will need pg_stat_statements enabled):
\x auto
SELECT userid::regrole AS role_name, query, max_exec_time, mean_exec_time, calls
FROM pg_stat_statements
WHERE max_exec_time > 10 AND calls > 100
ORDER BY max_exec_time DESC;
If you see very high times on inserts, you might want to consider disabling synchronous_commit to see if this helps.
Check your backend performance
As with the DB, if the server running your Lemmy Rust backend is overloaded, you might see a similar impact.
Check your Network Layout
If your backend and database appear to be in good condition, it might be that your issue is network based.
One problem that can occur is that your backend and your database are not on the same server and are too far from each other geographically. Due to the number of DB queries performed for each apub sync request, even a small amount of latency can quickly add up.
Check the latency between your Rust backend and your DB using ping:
ping your_database_ip
If the time you see is above 1-2ms, this can start causing such delays. In that case, you might want to consider moving your backend closer to your DB geographically, so that your latency is below 2ms.
Note that your external loadbalancer(s) (if any) do not necessarily need to be closer to the DB, as they do not do multiple small DB requests.
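The following sketch shows why a couple of milliseconds matter. It extracts the average round-trip time from ping's summary line and multiplies it by a number of sequential DB queries per action. Both helper names are hypothetical, the parsing assumes the summary format printed by Linux iputils ping, and the figure of 50 queries per action is made up purely for illustration.

```shell
# Hypothetical helper: extracts the average round-trip time (ms) from the
# last line of Linux iputils ping output, which looks like
#   rtt min/avg/max/mdev = 0.045/0.052/0.060/0.006 ms
avg_rtt_ms() {
  tail -n 1 | awk -F'/' '{ print $5 }'
}

# Hypothetical helper: time (ms) spent purely on network round trips when
# an action needs $1 sequential queries at $2 ms each.
added_latency_ms() {
  awk -v queries="$1" -v rtt="$2" 'BEGIN { print queries * rtt }'
}

added_latency_ms 50 2   # prints 100

# Usage against a real database host:
#   added_latency_ms 50 "$(ping -c 10 your_database_ip | avg_rtt_ms)"
```

Under that assumption, a 2ms round trip already uses up the entire 100ms budget mentioned earlier before the database has done any actual work.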
Downgrading
If you upgraded your instance to a newer version (by mistake or as planned) and need to downgrade it, you often need to reverse database changes as well.
First you need to figure out what SQL changes happened between your upgraded version and the one you're downgrading to. Then, in that diff, check which files were added in the migrations dir.
Let's say that for the migration you're doing, the following were added:
2023-10-24-131607_proxy_links
2023-10-27-142514_post_url_content_type
2023-12-19-210053_tolerable-batch-insert-speed
2023-12-22-040137_make-mixed-sorting-directions-work-with-tuple-comparison
2024-01-05-213000_community_aggregates_add_local_subscribers
2024-01-15-100133_local-only-community
2024-01-22-105746_lemmynsfw-changes
2024-01-25-151400_remove_auto_resolve_report_trigger
2024-02-15-171358_default_instance_sort_type
2024-02-27-204628_add_post_alt_text
2024-02-28-144211_hide_posts
Each of these folders contains a down.sql file. We need to run that against our postgresql DB to roll back those DB changes.
- Stop your lemmy backend, and take a backup of your DB.
- Copy the migrations folder to your DB container or server.
- Acquire a shell in your postgresql container or server and switch to the postgres user.
- Run each relevant script with the commands below, starting with the newest migration and working backwards. Alternatively, copy the content of the file and paste it into a psql session.
downfolder=2024-02-28-144211_hide_posts
psql -d lemmy -a -f /path/to/migrations/${downfolder}/down.sql
- You now need to remove the migration records from the __diesel_schema_migrations table, so that the migrations will be correctly applied the next time you upgrade. You can use this command to list them in the order they were applied:
select * from __diesel_schema_migrations ORDER BY run_on ASC;
Delete the entries whose run_on timestamp matches the time you applied them (this should typically be any time in the past few minutes), for example:
delete from __diesel_schema_migrations where version='20240228144211';
- You should now be able to start your lemmy in the previous version.
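With many migrations to revert, it can help to generate the commands and review them before running anything. The sketch below is one way to do that, assuming the folder names follow the timestamp_name pattern shown above. Both function names are hypothetical, and /path/to/migrations is the same placeholder used in the steps.

```shell
# Hypothetical helpers for reviewing a downgrade before running it.

# Reads migration folder names on stdin and prints one psql command per
# folder, newest migration first. $1 is the path to the migrations folder.
revert_commands() {
  sort -r | while IFS= read -r folder; do
    printf 'psql -d lemmy -a -f %s/%s/down.sql\n' "$1" "$folder"
  done
}

# Converts a folder name such as 2024-02-28-144211_hide_posts into the
# version string stored in __diesel_schema_migrations (20240228144211).
migration_version() {
  printf '%s\n' "${1%%_*}" | tr -d '-'
}

# Usage: print the commands for two of the folders listed above.
printf '%s\n' \
  2024-02-27-204628_add_post_alt_text \
  2024-02-28-144211_hide_posts | revert_commands /path/to/migrations
```

The functions only print text, so nothing touches the database until you copy the reviewed commands into a shell yourself.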
UI randomly slow or offline
If you notice that your lemmy-ui sometimes becomes sluggish or unresponsive for minutes or hours and then recovers, you might be getting targeted by scraping bots.
There are many scraping bots online, and they can easily overwhelm your site when they behave too "greedily". Unfortunately, the existing lemmy-ui has a habit of falling over when polled too eagerly, while the backend continues to work.
One solution is to cache responses in nginx, as done in lemmy-ansible. This way lemmy-ui doesn't have to generate all responses from scratch, which reduces CPU load. However, it won't help if a single crawler goes through thousands of unique URLs in a short time.
Another option is to block the scrapers' user agents. To do so, you can modify your nginx_internal.conf to block some of the usual suspects, with this line under server:
if ($http_user_agent ~* " Bytedance|Bytespider|Amazonbot|ClaudeBot") { return 444; }
This example blocks some of the well-known misbehaving bots, but there are many more. To discover the ones affecting you, you can run the following bash script on your lemmy backend (where your docker compose is) to list the agents that are hitting you the most.
docker compose logs --tail=10000 proxy |
grep -o '"[^"]*"$' | # Extract the last quoted string (user agent)
grep -v '^"$' | # Remove empty quotes
tr -d '"' | # Remove the quotes
sort | # Sort the user agents
uniq -c | # Count unique occurrences
sort -rn | # Sort numerically in reverse order
head -n 10 # Show top 10 results
This will parse the last 10K log entries in your nginx internal proxy and show the agents which cause the most hits. This should give a good indicator of which agents are potentially misbehaving and you can proceed to block those as well by adding their names to the list above.