Troubleshooting

System monitoring

The back-end can be monitored in Azure Application Insights. It tracks system performance, failures, and crashes, as well as logged events. Crashes are also reported separately; see Crash reports below.

Insights gathers data from all regions in production. There’s a separate insights service for the staging site.

In addition to Insights, application logs are written to the /logs/ directory on a separate shared Azure Files instance. It can be accessed by shelling into one of the Kubernetes pods.

Crash reports

Errors from the production sites are logged to Sentry. Login using your MEG Google account to view the crash reports.

See also

Crash reports in Logging documentation

Database

Each region has its own database in Azure.

CPU usage

High CPU usage (around 100%) often means long-running queries. To troubleshoot this, use Query Performance Insight. It lists the longest-running queries over the last 24 hours. Adjust the Time period and Number of Queries settings as needed to see all queries.

Configuring Query Performance Insight on an Azure PostgreSQL Flexible Server

  1. Set PostgreSQL Server Parameters:

    Set pg_qs.query_capture_mode and pgms_wait_sampling.query_capture_mode to top (recommended) or all:

    SERVER_NAME=  # name of the flexible server instance
    RESOURCE_GROUP=meg
    CAPTURE_MODE=top
    az postgres flexible-server parameter set -g ${RESOURCE_GROUP:?} -s ${SERVER_NAME:?} -n pg_qs.query_capture_mode -v ${CAPTURE_MODE:?}
    # Optionally, enable wait sampling
    az postgres flexible-server parameter set -g ${RESOURCE_GROUP:?} -s ${SERVER_NAME:?} -n pgms_wait_sampling.query_capture_mode -v all
    
  2. Enable Diagnostic Settings:

    To send logs to a Log Analytics workspace, create a diagnostic setting: specify the PostgreSQL server as the resource, the Log Analytics workspace as the destination, and select the PostgreSQLLogs category. You will need the resource IDs of both the server and the workspace.

    SERVER_NAME=  # name of the flexible server instance
    RESOURCE_GROUP=meg
    WORKSPACE_NAME=  # name of the workspace to be created for the specific PostgreSQL Server
    az monitor log-analytics workspace create -g ${RESOURCE_GROUP:?} --name ${WORKSPACE_NAME:?}
    az monitor diagnostic-settings create --name QueryPerformanceInsightLogs --resource "${SERVER_NAME:?}" --resource-namespace Microsoft.DBforPostgreSQL -g ${RESOURCE_GROUP:?} --resource-type flexibleServers --workspace ${WORKSPACE_NAME:?} \
    --logs '[{"category":"PostgreSQLLogs","enabled":true},{"category":"PostgreSQLFlexSessions","enabled":true},{"category":"PostgreSQLFlexQueryStoreWaitStats","enabled":true},{"category":"PostgreSQLFlexTableStats","enabled":true},{"category":"PostgreSQLFlexDatabaseXacts","enabled":true}]'
    

To address the issue of long-running queries, you can try the following:

  • Simplify the query

    • The query can be broken into separate, smaller, non-nested queries.

      For instance, replace sub-query with a list of ids fetched in a separate query.

    • Move complex filtering logic to Python

      Python can be better suited for parallel processing, as it runs in Kubernetes and scales up on demand; the database does not have this benefit.

    • Only select the columns you need, using the only() or values() QuerySet methods.

      This benefits models with many fields, or with fields containing a lot of data, such as large JSON objects or text fields.

    • Remove calls to select_related()

      select_related() is great for reducing the number of queries, but in rare cases it can produce one large query that takes a long time to execute.

  • Reduce the frequency of running the query

    If the query cannot be optimized, try caching its results, or part of them:

    • use the Django cache mechanism to cache in Redis

    • use the @cached_property or @lru_cache decorators to cache data locally

    • use other caching logic, such as storing objects locally in a dictionary

  • Other database optimization techniques can be found in the Django documentation. However, note that performance in testing may differ substantially from performance with production data.
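The local caching options above can be sketched with the standard library alone (the Django cache API behaves analogously for the Redis case; the function and class names here are illustrative):

```python
from functools import cached_property, lru_cache

@lru_cache(maxsize=128)
def expensive_lookup(region_id):
    # Stands in for a slow database query; the result is cached per argument,
    # so repeated calls with the same region_id skip the query entirely.
    return {"region": region_id, "rows": []}

class ReportView:
    @cached_property
    def summary(self):
        # Computed on first access, then stored on the instance;
        # later accesses on the same instance reuse the value.
        return expensive_lookup(1)

view = ReportView()
assert view.summary is view.summary                # cached per instance
assert expensive_lookup(7) is expensive_lookup(7)  # cached per argument
```

Note that @lru_cache keeps entries for the lifetime of the process; for data that changes, prefer the Django cache with a timeout.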

Listing slow queries in Admin Dashboard

Once the Query Store is enabled, you can view a list of slow queries at /meg-admin/dashboard/database/. The threshold for what counts as a slow query can be set with SLOW_QUERY_THRESHOLD_MS.
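As a sketch of the dashboard's filtering step (hypothetical helper; the row shape and the 500 ms default are assumptions for illustration), filtering rows fetched from query_store.qs_view by the threshold might look like:

```python
# Hypothetical stand-in for the SLOW_QUERY_THRESHOLD_MS setting
SLOW_QUERY_THRESHOLD_MS = 500

def slow_queries(rows, threshold_ms=SLOW_QUERY_THRESHOLD_MS):
    # Each row: (query_id, mean_time_ms, calls, query_sql_text)
    # Keep only queries slower than the threshold, slowest first.
    return sorted(
        (row for row in rows if row[1] > threshold_ms),
        key=lambda row: row[1],
        reverse=True,
    )

rows = [(1, 1200.0, 4, "SELECT ..."), (2, 30.0, 900, "SELECT ...")]
assert [row[0] for row in slow_queries(rows)] == [1]
```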

See also

Task #27934

Connecting to the database

To connect to the database, you need to enable “Allow access to Azure services” in “Connection Security”. Remember to disable it afterwards.

The command below connects to the azure_sys database in the EU. Run it in the Cloud Shell in Azure, replacing ${PSQL_PASSWORD} with the postgres password.
 # Connect to postgres interactively
 psql "host=megforms-eu-restored.postgres.database.azure.com port=5432 dbname=azure_sys user=megforms@megforms-eu-restored password=${PSQL_PASSWORD} sslmode=require"

 # Run an SQL command and write its output to a file (replace ${QUERY} with the query you want to run)
 # (the file can be downloaded using the "download" button in the Cloud Shell)
 psql "host=megforms-eu-restored.postgres.database.azure.com port=5432 dbname=azure_sys user=megforms@megforms-eu-restored password=${PSQL_PASSWORD} sslmode=require" --csv -c "${QUERY}" > top.csv

Useful database queries

Below are SQL queries commonly used for troubleshooting:

-- Currently running database queries, including how long each has been running
SELECT pid, age(clock_timestamp(), query_start), usename, query, state FROM pg_stat_activity WHERE state != 'idle' AND query NOT ILIKE '%pg_stat_activity%' ORDER BY query_start desc;
-- Cancel a running database query - replace PID with the pid from pg_stat_activity
SELECT pg_cancel_backend(PID);
-- Show past queries and their execution times (pg_qs.query_capture_mode needs to be enabled)
-- More: https://learn.microsoft.com/en-us/azure/postgresql/single-server/concepts-query-store#query_storeqs_view
SELECT query_id, mean_time, max_time, calls, query_sql_text FROM query_store.qs_view WHERE query_id IN (QUERY_IDS);
-- Select the query text for a given query id
SELECT query_sql_text FROM query_store.qs_view WHERE query_id = QUERY_ID LIMIT 1;

-- Analyze query; to run this, you need to connect to "megforms" database (replace azure_sys in the connection string)
-- Replace (QUERY) with the SQL query being analyzed
-- More: https://www.postgresql.org/docs/current/sql-explain.html
EXPLAIN ANALYZE (QUERY);

Note

query_sql_text is limited to 1000 characters, so it will not always contain the full query.

Scaling database

If required by workloads, the database should be scaled up (CPU & RAM). After scaling up, some database settings may need to be adjusted to ensure the database can fully utilize its resources.

To edit these settings, go to the selected database in the Azure Portal, then select “Server parameters”.

Note

Take note of the units: some settings are expressed in 8 kB pages, in which case the calculated value (in kB) needs to be divided by 8.

PostgreSQL Configuration Guidelines

Parameter             | Recommended Value
--------------------- | ------------------------------
maintenance_work_mem  | ~1GB - 2GB (or 5% of RAM)
work_mem              | (RAM * 0.25) / max_connections
max_parallel_workers  | Equal to vCore count
effective_cache_size  | 75% of Total RAM
shared_buffers        | 25% of Total RAM
random_page_cost      | 1.1 (if SSD/Premium Storage)
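As a worked example of the guidelines and the units note above, the following sketch computes suggested starting values for a given server size (the 32 GB / 8 vCore / 200-connection figures are illustrative, not a recommendation for any specific instance):

```python
def suggested_parameters(ram_gb, vcores, max_connections):
    """Suggested starting values per the guidelines table above."""
    def kb(gb):
        return gb * 1024 * 1024   # GB -> kB

    def pages(gb):
        return int(kb(gb) // 8)   # GB -> number of 8 kB pages

    return {
        "maintenance_work_mem": f"{int(kb(ram_gb * 0.05))}kB",        # ~5% of RAM
        "work_mem": f"{int(kb(ram_gb * 0.25) / max_connections)}kB",  # (RAM * 0.25) / max_connections
        "max_parallel_workers": vcores,                               # equal to vCore count
        "effective_cache_size": pages(ram_gb * 0.75),                 # 75% of RAM, in 8 kB pages
        "shared_buffers": pages(ram_gb * 0.25),                       # 25% of RAM, in 8 kB pages
        "random_page_cost": 1.1,                                      # SSD / Premium storage
    }

# Example: a 32 GB / 8 vCore server with max_connections = 200
print(suggested_parameters(32, 8, 200))
```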