======================================
Troubleshooting
======================================

System monitoring
===================

The back-end can be monitored in `Azure Insights `_. It logs system
performance, failures, and crashes, as well as logged events. Crash reports
are also logged to `Crash reports`_. Insights gathers data from all regions in
production. There is a separate Insights service for the `staging site `_.

In addition to Insights, application logs are written to the ``/logs/``
directory on a separate shared Azure Files instance. It can be accessed by
shelling into one of the Kubernetes pods.

Crash reports
=================

Errors from the production sites are logged to `Sentry `_. Log in using your
MEG Google account to view the crash reports.

.. seealso:: :ref:`crash reports` in Logging documentation

Database
===========

Each region has its own database in `Azure `_.

CPU usage
-----------

High CPU usage (around 100%) often means long-running queries. To troubleshoot
this, use Query Performance Insight. It lists the queries that took the longest
time in the last 24 hours. Adjust the Time period and Number of Queries as
needed to see all queries.

.. _enable-query-store:

Configuring Query Performance Insight on an Azure PostgreSQL Flexible Server
--------------------------------------------------------------------------------

.. seealso:: `Query Store documentation `_

1. **Set PostgreSQL Server Parameters:**

   Set ``pg_qs.query_capture_mode`` and ``pgms_wait_sampling.query_capture_mode``
   to ``top`` (recommended) or ``all``:

   .. code-block:: bash

      SERVER_NAME=#name of the flexible server instance
      RESOURCE_GROUP=meg
      CAPTURE_MODE=top

      az postgres flexible-server parameter set -g ${RESOURCE_GROUP:?} -s ${SERVER_NAME:?} -n pg_qs.query_capture_mode -v ${CAPTURE_MODE:?}

      # Optionally, enable wait sampling
      az postgres flexible-server parameter set -g ${RESOURCE_GROUP:?} -s ${SERVER_NAME:?} -n pgms_wait_sampling.query_capture_mode -v all

2. **Enable Diagnostic Settings:**

   To send logs to a Log Analytics workspace, you need to create a diagnostic
   setting. This involves specifying the PostgreSQL server as the resource, the
   Log Analytics workspace as the destination, and selecting the
   ``PostgreSQLLogs`` category. You will need the Resource IDs of both the
   server and the workspace.

   .. code-block:: bash

      SERVER_NAME=#name of the flexible server instance
      RESOURCE_GROUP=meg
      WORKSPACE_NAME=#name of the workspace to be created for the specific PostgreSQL Server

      az monitor log-analytics workspace create -g ${RESOURCE_GROUP:?} --name ${WORKSPACE_NAME:?}

      az monitor diagnostic-settings create --name QueryPerformanceInsightLogs \
          --resource "${SERVER_NAME:?}" --resource-namespace Microsoft.DBforPostgreSQL \
          -g ${RESOURCE_GROUP:?} --resource-type flexibleServers --workspace ${WORKSPACE_NAME:?} \
          --logs '[{"category":"PostgreSQLLogs","enabled":true},{"category":"PostgreSQLFlexSessions","enabled":true},{"category":"PostgreSQLFlexQueryStoreWaitStats","enabled":true},{"category":"PostgreSQLFlexTableStats","enabled":true},{"category":"PostgreSQLFlexDatabaseXacts","enabled":true}]'

To address long-running queries, you can try the following:

* Simplify the query

  The query can often be broken into separate, smaller, non-nested queries.
  For instance, replace a sub-query with a list of ids fetched in a separate
  query.

* Move complex filtering logic to Python

  Python can be better suited for parallel processing, as it runs in
  Kubernetes and scales up on demand. The database does not have this benefit.

* Only select the columns you need, using the :meth:`only` or :meth:`values`
  :class:`QuerySet` methods

  This benefits models with a lot of fields, or fields containing a lot of
  data such as large :term:`json` objects or text fields.

* Remove calls to :meth:`select_related`

  :meth:`select_related` is great for reducing the number of queries, but in
  some rare instances it can result in one large query that takes long to
  execute.
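As a minimal sketch of the "move complex filtering logic to Python" approach above (the data shape and function name are hypothetical, for illustration only): instead of a nested ``WHERE ... IN (sub-query)``, fetch the id list in one simple query and filter the rows in Python.

```python
def filter_by_group(rows, active_group_ids):
    """Keep only rows whose group id is in a list fetched by a separate query.

    A set gives O(1) membership tests, mimicking ``WHERE group_id IN (...)``
    without nesting the sub-query inside the main query.
    """
    wanted = set(active_group_ids)
    return [row for row in rows if row["group_id"] in wanted]


# Hypothetical rows as returned by a plain, un-nested query
rows = [
    {"id": 1, "group_id": 10},
    {"id": 2, "group_id": 20},
    {"id": 3, "group_id": 10},
]

# ids fetched by a separate, simpler query
print(filter_by_group(rows, [10]))
```

The same idea applies to any filter that makes the SQL plan expensive: two cheap queries plus an in-memory pass can beat one complex query, at the cost of transferring more rows.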
* Reduce the frequency of running the query

  If the query cannot be optimized, try caching its results, or part of them:

  * use the Django cache mechanism backed by the Redis cache
  * use the ``@cached_property`` or ``@lru_cache`` decorators to cache data
    locally
  * use other caching logic, such as storing objects locally in a dictionary

* Other database optimization techniques can be found in the
  `Django documentation `_. However, do note that performance in testing may
  differ a lot from performance with production data.

.. _troubleshooting-slow-queries:

Listing slow queries in Admin Dashboard
-----------------------------------------

Once Query Store is enabled, you can view a list of slow queries at
:url:`/meg-admin/dashboard/database/`. The threshold for what is considered a
slow query can be set with :envvar:`SLOW_QUERY_THRESHOLD_MS`.

.. seealso:: :task:`27934`

Connecting to the database
---------------------------

To connect to the database, you need to enable "Allow access to Azure
services" in "Connection Security". Remember to disable it afterwards.

.. code-block:: bash
   :caption: Connects to the ``azure_sys`` database in the EU. Run this command in the Cloud Shell in Azure. Replace ``${PSQL_PASSWORD}`` with the postgres password.

   # Connect to postgres interactively
   psql "host=megforms-eu-restored.postgres.database.azure.com port=5432 dbname=azure_sys user=megforms@megforms-eu-restored password=${PSQL_PASSWORD} sslmode=require"

   # Run an SQL command and write its output to a file (replace ${QUERY} with the query you want to run):
   # (The file can be downloaded using the "download" button in the Cloud Console)
   psql "host=megforms-eu-restored.postgres.database.azure.com port=5432 dbname=azure_sys user=megforms@megforms-eu-restored password=${PSQL_PASSWORD} sslmode=require" --csv -c ${QUERY} > top.csv

Useful database queries
------------------------

Below you will find SQL queries commonly used for troubleshooting.

.. code-block:: sql

   -- Currently running database queries, including how long each has been running
   SELECT pid, age(clock_timestamp(), query_start), usename, query, state
   FROM pg_stat_activity
   WHERE state != 'idle' AND query NOT ILIKE '%pg_stat_activity%'
   ORDER BY query_start desc;

   -- Cancel a running database query - replace PID with the query's backend PID
   SELECT pg_cancel_backend(PID);

   -- Show past queries (pg_qs.query_capture_mode needs to be enabled) and their execution times
   -- More: https://learn.microsoft.com/en-us/azure/postgresql/single-server/concepts-query-store#query_storeqs_view
   SELECT query_id, mean_time, max_time, calls, query_sql_text
   FROM query_store.qs_view
   WHERE query_id IN (QUERY_IDS);

   -- Select the query text for a given query id
   SELECT query_sql_text FROM query_store.qs_view WHERE query_id = QUERY_ID LIMIT 1;

   -- Analyze a query; to run this, connect to the "megforms" database (replace azure_sys in the connection string)
   -- Replace (QUERY) with the SQL query being analyzed
   -- More: https://www.postgresql.org/docs/current/sql-explain.html
   EXPLAIN ANALYZE (QUERY);

.. note:: ``query_sql_text`` is limited to 1000 characters, so it will not
          always contain the full query.

.. _scaling-database:

Scaling database
-----------------

If required by workloads, the database should be scaled up (CPU & RAM). After
scaling up, some database settings may need to be adjusted to ensure it can
fully utilize its resources. To edit these settings, in the `Azure Portal `_
go to the selected database, then select "Server parameters".

.. note:: Take note of the units, as some settings are expressed in pages
          (8 kB), in which case, when calculating the value, it needs to be
          divided by 8.

.. list-table:: PostgreSQL Configuration Guidelines
   :widths: 25 45
   :header-rows: 1

   * - Parameter
     - Recommended Value
   * - **maintenance_work_mem**
     - ~1 GB - 2 GB (or 5% of RAM)
   * - **work_mem**
     - (RAM * 0.25) / max_connections
   * - **max_parallel_workers**
     - Equal to vCore count
   * - **effective_cache_size**
     - 75% of total RAM
   * - **shared_buffers**
     - 25% of total RAM
   * - **random_page_cost**
     - 1.1 (if SSD/Premium Storage)
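The guideline formulas in the table above can be turned into concrete numbers for a given instance size. The sketch below is a rough aid only (the helper name is illustrative, and the 1-2 GB cap on ``maintenance_work_mem`` is one interpretation of the guideline); values are in megabytes, so remember the note above about converting to 8 kB pages where Azure expects them.

```python
def recommended_settings(ram_gb, vcores, max_connections, ssd=True):
    """Compute rough PostgreSQL parameter targets from the table above.

    All memory values are returned in MB. Some server parameters are set in
    8 kB pages, so divide by 8 (per kB) before applying where required.
    """
    ram_mb = ram_gb * 1024
    return {
        # ~1-2 GB, or 5% of RAM, whichever is smaller (one reading of the guideline)
        "maintenance_work_mem_mb": min(2048, int(ram_mb * 0.05)),
        # (RAM * 0.25) / max_connections
        "work_mem_mb": int(ram_mb * 0.25 / max_connections),
        # equal to vCore count
        "max_parallel_workers": vcores,
        # 75% of total RAM
        "effective_cache_size_mb": int(ram_mb * 0.75),
        # 25% of total RAM
        "shared_buffers_mb": int(ram_mb * 0.25),
        # 1.1 for SSD/Premium Storage, PostgreSQL's default 4.0 otherwise
        "random_page_cost": 1.1 if ssd else 4.0,
    }


# Example: a 4 vCore / 16 GB instance allowing 100 connections
print(recommended_settings(16, 4, 100))
```

Treat the output as a starting point for the "Server parameters" blade, not a substitute for measuring actual memory pressure after scaling.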