Troubleshooting
System monitoring
The back-end can be monitored in Azure Insights. It logs system performance, failures, and crashes, as well as logged events. Crash reports are also collected in Sentry; see Crash reports below.
Insights gathers data from all regions in production. There’s a separate insights service for the staging site.
In addition to Insights, application logs are written to the /logs/ directory on a separate shared Azure Files instance. It can be accessed by shelling into one of the Kubernetes pods.
Crash reports
Errors from the production sites are logged to Sentry. Login using your MEG Google account to view the crash reports.
See also
Crash reports in Logging documentation
Database
Each region has its own database in Azure.
CPU usage
High CPU usage (around 100%) often means long-running queries. To troubleshoot this, use Query Performance Insight. It lists the queries that took the longest in the last 24 hours. Adjust the Time period and Number of Queries as needed to see all queries.
Configuring Query Performance Insight on an Azure PostgreSQL Flexible Server
See also
Set PostgreSQL Server Parameters:
Set pg_qs.query_capture_mode and pgms_wait_sampling.query_capture_mode to top (recommended) or all:

SERVER_NAME=    # name of the flexible server instance
RESOURCE_GROUP=meg
CAPTURE_MODE=top
az postgres flexible-server parameter set -g ${RESOURCE_GROUP:?} -s ${SERVER_NAME:?} -n pg_qs.query_capture_mode -v ${CAPTURE_MODE:?}
# Optionally, enable wait sampling
az postgres flexible-server parameter set -g ${RESOURCE_GROUP:?} -s ${SERVER_NAME:?} -n pgms_wait_sampling.query_capture_mode -v all
Enable Diagnostic Settings:
To send logs to a Log Analytics workspace, you need to create a diagnostic setting. This involves specifying the PostgreSQL server as the resource, the Log Analytics workspace as the destination, and selecting the PostgreSQLLogs category. You will need the Resource IDs of both the server and the workspace.

SERVER_NAME=    # name of the flexible server instance
RESOURCE_GROUP=meg
WORKSPACE_NAME=    # name of the workspace to be created for the specific PostgreSQL server
az monitor log-analytics workspace create -g ${RESOURCE_GROUP:?} --name ${WORKSPACE_NAME:?}
az monitor diagnostic-settings create --name QueryPerformanceInsightLogs \
    --resource "${SERVER_NAME:?}" --resource-namespace Microsoft.DBforPostgreSQL \
    -g ${RESOURCE_GROUP:?} --resource-type flexibleServers --workspace ${WORKSPACE_NAME:?} \
    --logs '[{"category":"PostgreSQLLogs","enabled":true},{"category":"PostgreSQLFlexSessions","enabled":true},{"category":"PostgreSQLFlexQueryStoreWaitStats","enabled":true},{"category":"PostgreSQLFlexTableStats","enabled":true},{"category":"PostgreSQLFlexDatabaseXacts","enabled":true}]'
To address the issue of long-running queries, you can try the following:
Simplify the query
The query can be broken into separate, smaller, non-nested queries.
For instance, replace a sub-query with a list of ids fetched in a separate query.
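A minimal sketch of this technique, using the stdlib sqlite3 module as a stand-in for the production database; the table names and data are illustrative, not the real schema:

```python
import sqlite3

# Hypothetical schema standing in for the production tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE form (id INTEGER PRIMARY KEY, active INTEGER);
    CREATE TABLE submission (id INTEGER PRIMARY KEY, form_id INTEGER);
    INSERT INTO form VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO submission VALUES (10, 1), (11, 2), (12, 3);
""")

# Nested version: the sub-query runs as part of one larger statement.
nested = conn.execute(
    "SELECT id FROM submission"
    " WHERE form_id IN (SELECT id FROM form WHERE active = 1)"
).fetchall()

# Split version: fetch the ids first, then pass them in as a plain list.
active_ids = [row[0] for row in conn.execute(
    "SELECT id FROM form WHERE active = 1")]
placeholders = ",".join("?" * len(active_ids))
split = conn.execute(
    f"SELECT id FROM submission WHERE form_id IN ({placeholders})",
    active_ids,
).fetchall()

assert nested == split  # same result, obtained as two smaller queries
```

In Django the same split is a `values_list("id", flat=True)` fetch followed by a `filter(...__in=ids)` on the second queryset.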
Move complex filtering logic to Python
Python can be better suited for parallel processing, as it runs in Kubernetes and scales up on demand; the database does not have this benefit.
Only select the columns you need, using the only() or values() QuerySet methods.
This benefits models with a lot of fields, or with fields containing a lot of data, such as large JSON objects or text fields.
Remove calls to select_related()
select_related() is great for reducing the number of queries, but in some rare instances it can result in one large query that takes a long time to execute.
Reduce the frequency of running the query
If the query cannot be optimized, try caching its results, or part of them:
- use the Django cache mechanism backed by Redis
- use the @cached_property or @lru_cache decorators to cache data locally
- other caching logic, such as storing objects locally in a dictionary
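The local-caching options above can be sketched with the stdlib decorators alone (the Redis-backed Django cache needs a configured project, so it is omitted here); the function names and "query" are illustrative:

```python
from functools import cached_property, lru_cache

CALLS = {"count": 0}  # track how many times the "query" actually runs

@lru_cache(maxsize=128)
def region_stats(region_id):
    """Stand-in for an expensive database query (illustrative)."""
    CALLS["count"] += 1
    return {"region": region_id, "total": region_id * 100}

region_stats(1)
region_stats(1)  # second call is served from the cache
assert CALLS["count"] == 1

class Dashboard:
    # cached_property computes the value once per instance
    @cached_property
    def slow_summary(self):
        CALLS["count"] += 1
        return "summary"

d = Dashboard()
d.slow_summary
d.slow_summary  # still computed only once
assert CALLS["count"] == 2

# The same idea with an explicit local dictionary:
_cache = {}

def cached_region_stats(region_id):
    if region_id not in _cache:
        _cache[region_id] = region_stats(region_id)
    return _cache[region_id]
```

Note that @lru_cache on methods keeps instances alive for the cache's lifetime, which is why @cached_property is usually preferred for per-instance values.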
Other database optimization techniques can be found in the Django documentation. However, do note that performance in testing may differ a lot from performance with production data.
Listing slow queries in Admin Dashboard
Once the Query Store is enabled, you can view a list of slow queries in /meg-admin/dashboard/database/.
The threshold for what is considered a slow query can be set with SLOW_QUERY_THRESHOLD_MS.
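SLOW_QUERY_THRESHOLD_MS is the setting named above; the value and the settings-module placement in this fragment are illustrative assumptions, not the project's actual defaults:

```python
# settings.py (fragment)
# Queries slower than this are listed in the admin dashboard.
# 500 ms is an illustrative value, not the real default.
SLOW_QUERY_THRESHOLD_MS = 500
```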
See also
Connecting to the database
To connect to the database, you need to enable “Allow access to Azure services” in “Connection Security”. Remember to disable it afterwards.
The following connects to the azure_sys database in the EU. Run this command in the Cloud Shell in Azure, replacing ${PSQL_PASSWORD} with the postgres password.

# Connect to postgres interactively
psql "host=megforms-eu-restored.postgres.database.azure.com port=5432 dbname=azure_sys user=megforms@megforms-eu-restored password=${PSQL_PASSWORD} sslmode=require"
# Run an SQL command and write its output to a file (replace ${QUERY} with the query you want to run):
# (The file can be downloaded using the "download" button in the Cloud Shell)
psql "host=megforms-eu-restored.postgres.database.azure.com port=5432 dbname=azure_sys user=megforms@megforms-eu-restored password=${PSQL_PASSWORD} sslmode=require" --csv -c "${QUERY}" > top.csv
Useful database queries
Below you will find SQL queries commonly used for troubleshooting.
-- Currently running database queries, including how long each has been running
SELECT pid, age(clock_timestamp(), query_start), usename, query, state FROM pg_stat_activity WHERE state != 'idle' AND query NOT ILIKE '%pg_stat_activity%' ORDER BY query_start desc;
-- Cancel a running database query - replace PID with the pid from pg_stat_activity
SELECT pg_cancel_backend(PID);
-- show past queries (pg_qs.query_capture_mode needs to be enabled) and their execution times
-- More: https://learn.microsoft.com/en-us/azure/postgresql/single-server/concepts-query-store#query_storeqs_view
SELECT query_id, mean_time, max_time, calls, query_sql_text FROM query_store.qs_view where query_id IN (QUERY_IDS);
-- Select query text for given query id
SELECT query_sql_text FROM query_store.qs_view where query_id = QUERY_ID LIMIT 1;
-- Analyze query; to run this, you need to connect to "megforms" database (replace azure_sys in the connection string)
-- Replace (QUERY) with the SQL query being analyzed
-- More: https://www.postgresql.org/docs/current/sql-explain.html
EXPLAIN ANALYZE (QUERY);
Note
query_sql_text is limited to 1000 characters, so it will not always contain the full query.
Scaling database
If required by workloads, the database should be scaled up (CPU & RAM). After scaling up, some database settings may need to be adjusted to ensure the database can fully utilize its resources.
To edit these settings, in the Azure Portal go to the selected database, then select "Server parameters".
Note
Take note of units, as some settings are expressed in pages (8 kB); in that case, when calculating the value in kB, it needs to be divided by 8.
| Parameter | Recommended Value |
|---|---|
| maintenance_work_mem | ~1 GB - 2 GB (or 5% of RAM) |
| work_mem | (RAM * 0.25) / max_connections |
| max_parallel_workers | Equal to vCore count |
| effective_cache_size | 75% of total RAM |
| shared_buffers | 25% of total RAM |
| random_page_cost | 1.1 (if SSD/Premium Storage) |
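As a worked example of the page-unit conversion from the note above, for shared_buffers at 25% of RAM on a hypothetical 16 GB server (the server size is an assumption for illustration):

```python
RAM_GB = 16  # hypothetical server size, not a real instance

# 25% of total RAM, as recommended for shared_buffers, in kB
shared_buffers_kb = RAM_GB * 1024 * 1024 * 25 // 100  # 4194304 kB

# The parameter is expressed in 8 kB pages, so divide the kB value by 8
shared_buffers_pages = shared_buffers_kb // 8  # 524288 pages
print(shared_buffers_pages)
```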