مختصر البحث:
The possibility of solving the problem of searching for inefficient SQL queries in large
software systems (OLTPIDW.BI, etc.) based on an extended set of parameters stored in
the long-term memory of an industrial DBMS. Neural network clustering bas…
The possibility of solving the problem of searching for inefficient SQL queries in large
software systems (OLTPIDW.BI, etc.) based on an extended set of parameters stored in
the long-term memory of an industrial DBMS. Neural network clustering based on the self-
organizing Kohonen network allows you to divide significant volumes of executable requests into groups with
a different profile of the load on the system, providing an easy-to-understand visualization of
query performance parameters. The subsequent use of the Bayesian classifier,
based on a limited set of queries with low performance, automates the
processes of evaluating the effectiveness of SQL queries.
Keywords: SQL query, self-organizing neural network, clustering,
statistical performance parameters, Bayesian classifier, Gaussian mixture
of distributions.
The growth in the volume of stored data in large distributed client systems exacerbates
the problem of searching for long and resource-intensive SQL queries. The number of requests that need
to be analyzed in a short time by software system support specialists and/or
DBMS, exceed human capabilities. The complexity of the task increases in the case of a gradual
drop in productivity without the presence of obvious signs. Searching for really inefficient requests
among hundreds, thousands, and even tens of thousands of requests in the analyzed time period becomes
far from a trivial task. Traditional data sources in identifying resource - intensive-
The main characteristics of the query execution plan and statistical
performance parameters that characterize the consumption of time resources of the processor and
the I/O system are the pros. Utility tool software or self-written queries to
the DBMS provide quite extensive sets of SQL queries with limiting parameters and characteristics
, but not every one of these queries requires replacement or transformation. This is due to the fact that
the execution of SQL queries is performed in a highly dynamic environment in which peak
load, insufficient or, conversely, frequent updating of statistical parameters
of data distribution, absence of the required data in the cache, etc. All this has a difficult-to-diagnose effect
on query performance parameters and can be regarded as a noise component with
unknown distribution parameters.
One of the ways to improve the procedure for searching for inefficient SQL queries in such
conditions is their preliminary clustering based on a symbiosis of self-organizing
Kohonen neural networks (SOM) [1] and the restoration of the distribution density by constructing
a Gaussian mixture of distributions (Gaussian Mixture Model, GMM), with subsequent evaluation of
query performance based on the Bayesian classifier.
Selection of clustering object parameters. In [1,2]
, the characteristics of the Oracle DBMS SQL query execution plan are described and analyzed for their use as parameters
of clustered objects. It is noted that the characteristics of the plan can only be a starting point
to assess the true effectiveness, since they largely depend on: 1) the relevance of statistical
information on the distribution of values in the indexes and database tables at the time of generating the plan;
2) settings of the DBMS optimizer component, which generates a set of execution plans and
ultimately selects the plan with the lowest complexity estimate. It is shown that a more accurate
vector for evaluating SQL queries is statistical data on the progress of query execution -
the so-called performance statistics that the DBMS collects during query execution and
stores, along with a variety of system information, in various internal structures and
repositories. Among the various sources of service information storage, the best for the
purpose - searching for inefficient SQL queries in the absence of obvious reasons - according to the authors,
is a specialized diagnostic long-term data warehouse (Advanced Workload
Repository, AWR). Performance statistics for a single query in AWR are stored in two representations
: in total since the start of the DB instance (DataBase Instance) and in the form of changes for
a certain time interval. Analysis of the first type of data showed that they are not suitable for
solving the problem of preliminary grouping of SQL queries due to resetting when restarting the instance
DB and in a number of other cases. The second type of data represents the difference between the aggregate statistics
at the beginning and end of a given period and is more suitable for the analysis of
productivity.