OPTIMAL GENETIC ALGORITHM SELECTION FOR DEEP NEURAL NETWORK SETTINGS

Abstract: The problem of constructing deep neural networks with the use of genetic algorithms is considered. The problem of structural-parametric synthesis of neural networks is defined. The main purpose of the study is to find a deep neural network that is optimal for solving a given applied problem; classification is chosen as that problem. A classification of the genetic algorithms used as a basis for setting the parameters of deep neural networks is also given. A system for the optimal tuning of the parameters of deep neural networks is proposed, which includes a two-stage algorithm. At the first stage, the multicriteria genetic algorithm that best fits the given training sample is selected from a set of candidates: the vector evaluated genetic algorithm (VEGA), the genetic algorithm of Fonseca and Fleming (FFGA), the niched Pareto genetic algorithm (NPGA), the non-dominated sorting genetic algorithm (NSGA), the strength Pareto evolutionary algorithm (SPEA), and SPEA-2. At the second stage, the problem of structural-parametric synthesis of a neural network is solved according to the criteria of accuracy and complexity. As a result of training, the values of the neural network parameters are found: the number of layers, the number of neurons in each layer, and the values of the weight coefficients. The proposed system is modeled, and the modeling results are presented and compared with those of similar software packages. The obtained results show that the system can be widely used.


I. INTRODUCTION
Modern systems that must solve problems of high complexity continue to use artificial neural networks (ANN) [1] - [12]. Such problems include classification, forecasting, clustering, decision making, approximation, data analysis, and optimization. ANNs are widely used in many areas: construction (for example, finding optimal design parameters), medicine (disease recognition), economics (forecasting the exchange rate), and others. When an ANN is used, however, the type of neural network and its parameters must be chosen, and this choice depends on many criteria. This article presents an algorithm for determining the best genetic algorithm for training a neural network together with its structure (number of layers, number of neurons in each layer) and the values of its weights. It differs from known approaches in that, in order to improve the quality of learning, the best of several algorithms is selected.
As a result, we obtain a system that increases the efficiency of solving problems with neural networks by means of multiobjective optimization: optimal parameters reduce the complexity of training the neural network while achieving optimal accuracy.
In practice, the system can be used to improve neural-network problem solving and to reduce the resources it requires.

II. PROBLEM STATEMENT
Structural-parametric synthesis [2] is a process in which the structure of an object is determined and the parameter values of its constituent elements are found so that the conditions of the synthesis task are fulfilled. If the synthesized object is optimal with respect to some criteria, then the synthesis itself is optimal.
The mathematical and computer models used to automate structural-parametric synthesis differ significantly from those used to automate parametric synthesis. During parametric synthesis the structure of the object does not change, whereas during structural-parametric synthesis both the parameters and the structure of the object change.

COMPUTER SCIENCES AND INFORMATION TECHNOLOGIES
In our case, solving the problem of structural-parametric synthesis means determining the model structure and its parameters according to two criteria: the complexity of the deep neural network (DNN) model and its accuracy.
Let the maximum number of neurons $A$ be given for a neural network constructed to approximate a dependence from the sample of source data $\langle X, Y \rangle$, where $X = \{X_i\}$ is the set of values of the characteristics (features) that describe the object or process; $Y = \{y_p\}$ is the array of output parameter values in this sample; $X_i = \{X_{ip}\}$ is the $i$-th feature in the sample, $i = 1, 2, \ldots, L$; $X_{ip}$ is the value of the $i$-th feature for the $p$-th instance, $p = 1, 2, \ldots, M$; $y_p$ is the value of the predicted parameter for the $p$-th instance; $L$ is the total number of features in the original set; $M$ is the number of instances.
Then the task of structural-parametric synthesis is to find a model of the form $HC = HC(S, W, B)$ for which $\xi(HC, X, Y) \to \min$, where $S = S(L, A)$ is a matrix that determines the presence of synaptic connections between network elements (input features, neurons); $W = W(S)$ is the matrix of weights corresponding to the connections present in the network; $B = B(S)$ is the bias vector of the network neurons; and $\xi(HC, X, Y)$ is the criterion that determines how effectively the DNN model approximates the relationship between the set of input parameters $X$ and the corresponding vector of output parameter values $Y$.
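The two criteria can be illustrated with a small sketch. It assumes, purely for illustration, that complexity is measured as the number of trainable parameters of a fully connected network and that accuracy is measured through the misclassification rate; the source does not fix either definition, and the function names are hypothetical.

```python
def complexity(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, inputs first.
    Using the parameter count as the complexity criterion is an assumption.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def error_rate(predicted, expected):
    """Share of misclassified instances (lower is better)."""
    wrong = sum(p != y for p, y in zip(predicted, expected))
    return wrong / len(expected)


def criteria(layer_sizes, predicted, expected):
    """Return the pair (error, complexity); both are to be minimised."""
    return error_rate(predicted, expected), complexity(layer_sizes)
```

A multiobjective genetic algorithm would compare candidate networks by such pairs rather than by a single scalar.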

III. OVERVIEW OF GENETIC ALGORITHMS
To date, there are many genetic algorithms [1], but the most popular are VEGA (vector evaluated genetic algorithm), FFGA (the multiobjective genetic algorithm of Fonseca and Fleming, also called MOGA), NPGA (niched Pareto genetic algorithm), NSGA (non-dominated sorting genetic algorithm), and SPEA and SPEA2 (strength Pareto evolutionary algorithms).
1) VEGA. David Schaffer (1984) [3] extended Grefenstette's program GENESIS to include multicriteria functions. Schaffer's approach was an extension of the Simple Genetic Algorithm (SGA), which he called the Vector Evaluated Genetic Algorithm (VEGA) and which differs from SGA only in selection. The selection operator was modified so that a number of subpopulations are generated in each generation by performing proportional selection according to each criterion in turn. Thus, for a problem with $k$ criteria, $k$ subpopulations of size $N / k$ are generated (where $N$ is the total population size). These subpopulations are shuffled together to obtain a new population of size $N$, to which the crossover and mutation operators of the GA are applied in the usual way. Figure 1 shows a structural representation of this process.
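The selection step described above can be sketched as follows. This is a minimal illustration, not Schaffer's implementation: it assumes that every criterion is maximised and returns positive values (so that it can serve directly as a selection weight), and the function name is hypothetical.

```python
import random


def vega_selection(population, objectives, rng):
    """Build a mating pool from k subpopulations of size N // k.

    Each subpopulation is filled by fitness-proportional selection
    according to one objective only; the pool is then shuffled.
    If N is not divisible by k, the pool is slightly smaller than N.
    """
    k = len(objectives)
    sub_size = len(population) // k
    mating_pool = []
    for objective in objectives:
        weights = [objective(individual) for individual in population]
        mating_pool.extend(rng.choices(population, weights=weights, k=sub_size))
    rng.shuffle(mating_pool)
    return mating_pool
```

Crossover and mutation would then be applied to the returned pool exactly as in a single-objective GA.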
The main advantage of this algorithm is its simplicity, i.e. the approach is quite easy to implement. Richardson et al. (1989) [4] note that shuffling and merging all subpopulations corresponds to averaging the fitness components associated with each of the criteria. Because Schaffer used proportional fitness assignment, these fitness components were in turn proportional to the criteria themselves. Thus, the resulting expected fitness corresponds to a linear combination of the criteria, where the weights depend on the distribution of the population in each generation, as shown by Richardson et al.
The main disadvantage is that, when the trade-off surface is concave, certain points in the concave regions cannot be found by this optimization procedure, because it effectively uses only a linear combination of the criteria; this has been proven to hold regardless of the set of weights used. Thus, the main weakness of this approach is its inability to produce Pareto-optimal solutions in the presence of non-convex search spaces.
Fig. 1. VEGA algorithm
2) FFGA. Fonseca and Fleming (1993) [4] implemented Goldberg's proposal differently. First, let us discuss what Goldberg proposed in terms of Pareto ranking.
Goldberg proposed a Pareto ranking scheme in 1989. Let a solution $x$ at generation $t$ have a corresponding objective vector $x_u$, and let $n$ be the size of the population; the rank of each solution is then determined by the following algorithm.
The ranking algorithm is described below:
...
8. end do
9. Increase the current rank by 1.
10. end while
This means that the entire population is checked for Pareto dominance, and all non-dominated individuals are assigned rank 1. These rank-1 individuals are then removed from the population. The non-dominated individuals in the rest of the population are identified again and assigned rank 2. The procedure continues until all solutions have received a rank.
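The peeling procedure just described can be sketched as follows. The sketch assumes that every criterion is minimised and that individuals are represented directly by their objective vectors; the function names are illustrative.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all criteria minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def goldberg_ranks(points):
    """Assign rank 1 to the non-dominated points, remove them, and repeat."""
    ranks = [0] * len(points)
    remaining = set(range(len(points)))
    current_rank = 1
    while remaining:
        # Points not dominated by any other point still in the population.
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = current_rank
        remaining -= front
        current_rank += 1
    return ranks
```

For the points (1, 5), (2, 2), (5, 1), (3, 3), (4, 4) the first three form the first front, (3, 3) the second, and (4, 4) the third.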
In the multiobjective genetic algorithm of Fonseca and Fleming, by contrast, the entire population is checked once, and all non-dominated individuals are assigned rank 1. The other individuals are ranked according to how many individuals of the population dominate them, as follows.
For example, consider an individual $x_i$ at generation $t$ that is dominated by $p_i^{(t)}$ individuals in the current generation. Its current rank is given by
$$\mathrm{rank}(x_i, t) = 1 + p_i^{(t)}.$$
Once the ranking procedure is complete, fitness is assigned to each individual. Fonseca and Fleming proposed two methods for determining fitness:
- rank-based fitness assignment;
- niche formation methods.
Rank-based fitness assignment proceeds as follows:
- sort the population by rank;
- assign fitness by interpolating from the best (rank 1) to the worst (rank $n \le N$) in the usual way, according to some function, usually linear but not necessarily;
- average the fitness of individuals with the same rank, so that all of them are sampled at the same rate. This procedure keeps the global population fitness constant while maintaining appropriate selection pressure, as determined by the function used.
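A minimal sketch of this ranking and of rank-based fitness assignment is given below. It assumes minimised criteria and a linear interpolation in which the best individual receives fitness N and the worst receives 1; the interpolation function and the names are illustrative choices, not taken from the source.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all criteria minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def moga_ranks(points):
    """Rank of a point = 1 + number of points that dominate it."""
    return [1 + sum(dominates(q, p) for q in points) for p in points]


def rank_based_fitness(ranks):
    """Interpolate linearly from best to worst, then average within each rank."""
    n = len(ranks)
    order = sorted(range(n), key=lambda i: ranks[i])
    raw = [0.0] * n
    for position, i in enumerate(order):
        raw[i] = float(n - position)  # best gets n, worst gets 1
    fitness = [0.0] * n
    for rank in set(ranks):
        members = [i for i in range(n) if ranks[i] == rank]
        mean = sum(raw[i] for i in members) / len(members)
        for i in members:
            fitness[i] = mean
    return fitness
```

For the points (1, 3), (3, 1), (4, 4) the ranks are 1, 1, 3: the last point is dominated by two individuals, so it receives rank 3 rather than the rank 2 that front-by-front peeling would give.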
As Goldberg and Deb pointed out, this type of blocked fitness assignment is likely to produce high selection pressure, which can lead to premature convergence. To avoid this, Fonseca and Fleming used the second method (niche formation) to distribute the population over the Pareto-optimal region, but performed sharing on the objective function values instead of on the parameter values.
The main advantage of this approach is that it is effective and relatively easy to implement. Its effectiveness, however, depends strongly on the sharing factor $\sigma_{share}$; Fonseca and Fleming developed a good methodology for calculating this value for their approach.
3) NPGA. The niched Pareto genetic algorithm combines two techniques: Pareto-domination tournaments and sharing on equivalence classes.
Pareto-domination tournament. This is a tournament selection scheme based on Pareto dominance. A comparison set consisting of a certain number ($t_{dom}$) of individuals is randomly selected from the population at the beginning of each selection process. Two candidate individuals are then selected at random from the population, and each candidate is compared with every individual in the comparison set. If one candidate is dominated by the comparison set and the other is not, the latter is selected for reproduction. If neither or both are dominated by the comparison set, we move on to the second technique.
Sharing on equivalence classes. Since both candidates have the same status, i.e. both are dominated or both are non-dominated, it is likely that they belong to the same equivalence class. In this case we choose the "best fit" according to the following procedure.
We choose the niche radius $\sigma_{share}$, and the candidate that has the smallest number of individuals of the population within this radius is considered the best fit. Fig. 2 shows how this procedure works: here we maximize along the x-axis and minimize along the y-axis.
In this case, neither candidate is dominated by the comparison set. In terms of niche count, candidate 1 is therefore the best fit. Here $t_{dom}$ is chosen only once for a given generation $t$. After the new population is created, genetic operators are applied as in the other methods.
3. Randomly select two chromosomes from the current population.
4. Select $n$ individuals by the following procedure: compare the two chromosomes with the comparison set of size $t_{dom}$ for non-dominance as defined above. If one is dominated and the other is not, choose the non-dominated one. If both are dominated or both are non-dominated, choose the better chromosome (individual) by the niche formation method.
5. Apply crossover and mutation to obtain a new population.
6. Check whether the performance criteria are met; if not, go to step 2, otherwise go to step 7.
7. Stop.
Because this approach applies Pareto selection not to the entire population but only to a segment of it at each run, its main advantages are that it is very fast and that it produces good non-dominated fronts that can be maintained for many generations.
The effectiveness of this method depends strongly on the sharing factor ($\sigma_{share}$) and on a good choice of the tournament size ($t_{dom}$), and the method is difficult to implement.
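The two techniques can be combined in a short sketch of a single tournament. It assumes minimised criteria, represents individuals by their objective vectors, and counts niche members as the individuals lying within the radius $\sigma_{share}$ of a candidate; the names are illustrative.

```python
import math


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all criteria minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def niche_count(candidate, population, sigma_share):
    """Number of individuals within the niche radius of the candidate."""
    return sum(1 for p in population if math.dist(candidate, p) <= sigma_share)


def npga_tournament(cand1, cand2, comparison_set, population, sigma_share):
    """Return the winner of one Pareto-domination tournament."""
    dominated1 = any(dominates(c, cand1) for c in comparison_set)
    dominated2 = any(dominates(c, cand2) for c in comparison_set)
    if dominated1 != dominated2:
        # Exactly one candidate is dominated: the other one wins.
        return cand2 if dominated1 else cand1
    # Tie: prefer the candidate located in the less crowded niche.
    count1 = niche_count(cand1, population, sigma_share)
    count2 = niche_count(cand2, population, sigma_share)
    return cand1 if count1 <= count2 else cand2
```

In a full algorithm the comparison set would be drawn at random with $t_{dom}$ members, and the tournament would be repeated until the mating pool is filled.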
4) NSGA. NSGA [6] differs from a simple GA only in the way the selection operator works. The crossover operator and the non-dominated solutions are also important for obtaining a good distribution of solutions along the Pareto-optimal front. Fitness assignment is performed in two stages.
1. Assign the same dummy fitness to all solutions of a given non-domination level.
2. Apply the sharing strategy.
We now discuss these two steps in detail. First, all solutions in the first non-dominated front are assigned a fitness equal to the population size. This is the maximum fitness that any solution can have in any population. Under the sharing strategy, if a solution has many neighboring solutions in the same front, its dummy fitness is reduced by a factor, and the shared fitness is calculated. The factor depends on the number and proximity of the neighboring solutions. After all solutions in the first front have been assigned fitness values, the lowest shared fitness value is determined.
After that, the individuals at the second level of non-domination are assigned a dummy fitness slightly smaller than the lowest shared fitness of the previous front. This ensures that no solution in the second front has a shared fitness higher than that of any solution in the first front, which maintains selection pressure toward the Pareto-optimal front. The sharing method is again applied among the individuals of the second front, and the shared fitness of each individual is found. This procedure continues until all individuals have been assigned a shared fitness.
After fitness assignment [7], [8], roulette wheel selection (RWS) is used to select $N$ individuals. Crossover and mutation are then applied. The shared fitness is calculated as follows.
Given a set of $n_l$ solutions in the $l$-th non-dominated front, each of which has the dummy fitness value $f_l$, the sharing procedure is performed as follows for each solution $i = 1, 2, 3, \ldots, n_l$.
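1. Calculate the normalized Euclidean distance to every other solution $j$ in the $l$-th non-dominated front:
$$d_{ij} = \sqrt{\sum_{p=1}^{P} \left( \frac{x_p^{(i)} - x_p^{(j)}}{x_p^{\max} - x_p^{\min}} \right)^2},$$
where $P$ is the number of variables (parameters) in the problem, and $x_p^{\max}$ and $x_p^{\min}$ are the upper and lower bounds of the variable $x_p$.
These distances determine the niche count of solution $i$, by which its dummy fitness $f_l$ is divided to give the shared fitness $f'_i$. This procedure is repeated for all $i = 1, 2, 3, \ldots, n_l$, and the corresponding values $f'_i$ are found. After that, the smallest value $f_l^{\min} = \min_i f'_i$ over the $l$-th non-dominated front is found for further processing. The dummy fitness of the next non-dominated front is set to
$$f_{l+1} = f_l^{\min} - \epsilon_l,$$
where $\epsilon_l$ is a small positive number. The sharing procedure above requires a predefined parameter $\sigma_{share}$, which can be calculated as follows:
$$\sigma_{share} \approx \frac{0.5}{\sqrt[P]{q}},$$
where $q$ is the desired number of distinct Pareto-optimal solutions. Although the calculated $\sigma_{share}$ depends on this parameter, $q = 10$ has been shown to work well for many test problems.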
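The sharing procedure within one front can be sketched as follows. The form of the sharing function, $Sh(d) = 1 - (d / \sigma_{share})^2$ for $d \le \sigma_{share}$ and 0 otherwise, is an assumption based on the commonly published description of NSGA, since the corresponding equations did not survive in this text; the function names are illustrative.

```python
import math


def normalized_distance(x_i, x_j, lower, upper):
    """Euclidean distance in parameter space, each variable scaled by its range."""
    return math.sqrt(sum(((a - b) / (u - l)) ** 2
                         for a, b, l, u in zip(x_i, x_j, lower, upper)))


def sharing(distance, sigma_share):
    """Assumed sharing function: 1 at distance 0, 0 beyond sigma_share."""
    if distance <= sigma_share:
        return 1.0 - (distance / sigma_share) ** 2
    return 0.0


def shared_fitness(front, dummy_fitness, lower, upper, sigma_share):
    """Divide the dummy fitness of each solution by its niche count."""
    result = []
    for x_i in front:
        # The solution itself contributes sharing(0) = 1, so the count is >= 1.
        niche = sum(sharing(normalized_distance(x_i, x_j, lower, upper), sigma_share)
                    for x_j in front)
        result.append(dummy_fitness / niche)
    return result


def sigma_share_estimate(q, n_vars):
    """sigma_share ~ 0.5 / q ** (1 / P) for q desired solutions and P variables."""
    return 0.5 / q ** (1.0 / n_vars)
```

The dummy fitness of the next front would then be set slightly below the minimum of the returned values.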
The main advantage of this method is that it can handle any number of criteria and that sharing is performed in the space of parameter values rather than in the space of objective values, which provides a better distribution of individuals and makes it possible to obtain several equivalent solutions.
Some researchers note that this method is inefficient in terms of computational cost and of the quality of the Pareto fronts produced. Another disadvantage is that it is more sensitive to $\sigma_{share}$.
5) SPEA, SPEA-2. The general scheme of SPEA (Strength Pareto Evolutionary Algorithm) [6], [8], [9] is as follows:
1. Initialization: generate the initial population and create an empty external Pareto-optimal set (archive).
2. Update of the Pareto-optimal set. If the size of the Pareto-optimal set exceeds the specified limit, the excess Pareto solutions are removed by clustering. To reduce the Pareto set to a manageable size, the average linkage hierarchical clustering algorithm is used; it iteratively merges adjacent clusters until the required number of groups is reached.
3. Calculate the value of the fitness function for the external Pareto-optimal set and for the population.
4. Selection by tournament: the population and the external set are combined, and two individuals are chosen at random. The one with the better fitness moves to the mating pool. The mating pool is the set of individuals that undergo mutation and crossover to create a new population.
5. Produce the new population by mutation and crossover.
6. If the stopping criterion is not met, go to step 2; otherwise the members of the archive are presented as the Pareto-optimal set.
SPEA differs significantly from its predecessors in that:
- the concept of Pareto dominance is used to assign a scalar fitness to individuals;
- individuals that are not dominated by other members of the population are stored separately in a special external set (archive);
- to reduce the number of individuals stored in the archive, clustering is carried out, which does not affect the characteristics that the individuals acquired during the search.
The distinctive features and advantages of the SPEA method are that:
- it combines the above approaches in one algorithm;
- the fitness of each individual of the population is determined only in relation to the individuals of the external archive, regardless of whether the individuals of the population dominate each other;
- although the "best" individuals obtained in previous generations are stored in the external archive, they all participate in selection;
- to prevent premature convergence, a special niching mechanism is used, in which fitness sharing is based not on the distance between individuals but on Pareto dominance.
One of the disadvantages of SPEA is that most of the resources and time are spent on the clustering procedure, which maintains population diversity.
When SPEA-2 [1], [10] was developed, the main goal was to eliminate the potential shortcomings of its predecessor (SPEA) and to incorporate the latest results in order to create a powerful, modern multicriteria evolutionary algorithm. The main differences between SPEA-2 and SPEA are:
- an improved fitness assignment scheme that takes into account, for each individual, how many individuals dominate it and how many individuals it dominates;
- a nearest-neighbor density estimation technique, which guides the search process more precisely;
- a new archive truncation method, which guarantees that boundary solutions are preserved.
In general, one of the most important steps in SPEA-2 is the determination of the fitness function.
Determination of fitness [11] is based on the concept of Pareto dominance. The algorithm for calculating the fitness $F$ of each individual from the population $P_t$ and the archive $A_t$ has the following form.
1) Suppose we have a set of individuals that make up the population $P_t$ and the archive $A_t$, where each individual $i$ is assigned a value $S(i) \in [0, 1)$ called its "strength" (which shows how many solutions it dominates) and which, in the case of multicriteria optimization, is proportional to the number of population members it dominates. The proportion is as follows:
$$S(i) = \frac{n}{N + 1},$$
where $N$ is the population size and $n$ is the number of individuals $j$ that are dominated by $i$, i.e. for which $f(i) \succ f(j)$. However, the strength of each individual over the set of individuals in the archive $A_t$ and the population $P_t$ is defined as the number of individuals $j$ that are dominated by $i$:
$$S(i) = \left| \{\, j \mid j \in P_t + A_t \wedge i \succ j \,\} \right|,$$
where $|\cdot|$ denotes the cardinality of a set, $+$ stands for multiset union, the symbol $\wedge$ stands for the conjunction operator, and the symbol $\succ$ corresponds to the Pareto dominance relation extended to individuals ($i \succ j$ if the decision vector encoded by $i$ dominates the decision vector encoded by $j$).
2) Based on the values $S(i)$, the "raw" fitness $R(i)$ of individual $i$ is calculated by summing the strengths of all individuals $j$ that dominate $i$:
$$R(i) = \sum_{j \in P_t + A_t,\; j \succ i} S(j),$$
where $P_t$ is the set of individuals of the population and $A_t$ is the set of individuals of the archive.
3) The density estimation method is an adaptation of the $k$-th nearest neighbor method, in which the density at any point is a decreasing function of the distance to the $k$-th nearest data point. (A function is called decreasing on an interval if, for any values of the argument from this interval, a larger value of the argument corresponds to a smaller value of the function.) The inverse of the distance to the $k$-th nearest neighbor is taken as the density estimate.
To calculate the fitness, the density of the location of individuals is used: for each individual $i$, the distances (in the criteria space) to all other individuals $j$ in the archive and the population are calculated and stored in a list.
After the list is sorted in ascending order, its $k$-th element gives the desired distance, denoted $\sigma_i^k$, i.e. the distance from individual $i$ to its $k$-th nearest neighbor. We use $k$ equal to the square root of the sample size, $k = \sqrt{N + N_a}$, although quite often it is enough to use $k = 1$, which leads to an efficient implementation. The density value $D(i)$ for individual $i$ is then calculated as
$$D(i) = \frac{1}{\sigma_i^k + 2},$$
where $N$ is the size of the population, $N_a$ is the size of the archive, and $k$ is rounded to the nearest integer.
Two is added to the denominator to ensure that its value is greater than zero and that $D(i) < 1$. The fitness of individual $i$ is then $F(i) = R(i) + D(i)$.
The SPEA-2 algorithm is as follows.
Output: $A$, a non-dominated set.
Step 1: Initialization: create an initial population $P_0$ and an empty archive (external set) $A_0 = \emptyset$; set $t = 0$.
...
Step 4: Termination: if $t \ge T$ or another stopping criterion is satisfied, then $A$ is the set of decision vectors represented by the non-dominated individuals in $A_{t+1}$. Stop.
Step 5: Selection: perform binary tournament selection with replacement on $A_{t+1}$ to fill the mating pool.
Step 6: Variation: apply the crossover and mutation operators to the mating pool and set $P_{t+1}$ to the resulting population. Increase the generation counter ($t = t + 1$) and go to step 2.
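The fitness assignment described above (strength, raw fitness, density) can be sketched as follows. The sketch assumes minimised criteria, treats the union of population and archive as one list of objective vectors, and takes the final fitness as the sum of raw fitness and density, as in the published description of SPEA-2; the function names are illustrative.

```python
import math


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all criteria minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def spea2_fitness(points, k=1):
    """Return F(i) = R(i) + D(i) for every point; lower values are better.

    points is the union of population and archive and must hold more
    than k objective vectors.
    """
    n = len(points)
    # S(i): number of individuals that i dominates.
    strength = [sum(dominates(points[i], points[j]) for j in range(n))
                for i in range(n)]
    # R(i): sum of the strengths of all individuals that dominate i.
    raw = [sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
           for i in range(n)]
    fitness = []
    for i in range(n):
        distances = sorted(math.dist(points[i], points[j])
                           for j in range(n) if j != i)
        sigma_k = distances[k - 1]      # distance to the k-th nearest neighbour
        density = 1.0 / (sigma_k + 2.0)  # always below 1
        fitness.append(raw[i] + density)
    return fitness
```

Non-dominated individuals have a raw fitness of zero, so their total fitness stays below 1, which is how the archive can be filled with them first.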
Let's give some comparison of the multiobjective genetic algorithms [1].

IV. PROBLEM SOLUTION
To solve our problem, a system for the optimal selection of deep network parameters is proposed, which contains the following components:
- chromosome formation component;
- component of multiobjective genetic algorithms;
- neural network learning component.
An abbreviated scheme of the system is shown in Figure 3. The genetic algorithms $A_1, \ldots, A_n$ are the multiobjective genetic algorithms described in the previous section: VEGA, FFGA, NPGA, NSGA, SPEA, SPEA-2.
1) Chromosome formation component. Here we form the initial chromosome, whose genes store the following parameters: the number of layers, the number of neurons, and the weights.
2) Component of multiobjective genetic algorithms. Here we perform the first stage of our system: genetic algorithms (GAs) are used to obtain chromosomes for the next stage. The algorithm is given below.
3) Neural network learning component. This component of the system performs the stages of training the neural network. The scheme of the component is shown in Fig. 4.
4) The algorithm for choosing the optimal multiobjective GA. The whole process in the calculation subsystem depends on the genetic algorithms included in it. Each of these methods has its own specific way of determining the optimal parameters (i.e. the values of the genes of our chromosome). It is therefore necessary to determine which multiobjective GA is optimal for the selected type of problem. The algorithm begins with filling the chromosome, whose structure we have already formed. Let $R_k$ be the number of bits allocated to the number of DNN layers, $R_m$ the number of bits allocated to the number of neurons in each layer, and $R_w$ the number of bits allocated to each of the corresponding weights of the neurons in the layers.
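One possible way of decoding such a chromosome is sketched below. The gene layout (the layer count in the first $R_k$ bits, followed by one $R_m$-bit gene per layer), the offset of one that avoids empty layers, and the mapping of weight genes onto the interval [-1, 1] are all assumptions made for illustration; the source does not specify them.

```python
def bits_to_int(bits):
    """Interpret a list of 0/1 values as an unsigned binary number."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def decode_structure(chromosome, r_k, r_m, max_layers):
    """Decode the number of layers and the number of neurons per layer."""
    n_layers = 1 + bits_to_int(chromosome[:r_k]) % max_layers
    neurons = []
    position = r_k
    for _ in range(n_layers):
        # Adding 1 guarantees at least one neuron per layer.
        neurons.append(1 + bits_to_int(chromosome[position:position + r_m]))
        position += r_m
    return n_layers, neurons


def decode_weight(bits, low=-1.0, high=1.0):
    """Map an R_w-bit gene onto a weight in the interval [low, high]."""
    return low + (high - low) * bits_to_int(bits) / (2 ** len(bits) - 1)
```

With $R_k = 2$ and $R_m = 3$, the bit string 01 011 100 000 decodes to two layers with 4 and 5 neurons.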
Algorithm steps:
1. Initialize the chromosome, which stores the number of DNN layers, the number of neurons, and the weights, and allocate bits to these genes.
2. Calculate the population size and the number of generations (these parameters depend on the chromosome size).
3. Create as many chromosomes as the population size and fill their genes with random bits using a random number generator.
4. Copy the population into each of the $N$ multiobjective genetic algorithms.
5. For every genetic algorithm: calculate the fitness of the chromosomes. If the evolution is not complete, go to the next step; otherwise go to step 8.
6. For every genetic algorithm: evolve the population with the GA operators (selection, crossover, mutation) according to the specific algorithm.
7. For every genetic algorithm: obtain a new generation of the size of the population, then return to step 5.
8. End the evolution and obtain the optimal parameters (number of layers, number of neurons in each layer, corresponding weights) in the form of chromosomes from all algorithms. These chromosomes are passed to model training, and the best of the trained models is selected. The learning process is shown in Fig. 4.
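The outer loop of these steps can be sketched as follows. The sketch treats each multiobjective GA as an opaque callable that receives a copy of the common initial population and returns its best chromosome, and it picks the winner lexicographically by error and then by complexity; both conventions, and all names, are assumptions made for illustration.

```python
import random


def random_population(size, chromosome_length, rng):
    """Step 3: fill every chromosome with random bits."""
    return [[rng.randint(0, 1) for _ in range(chromosome_length)]
            for _ in range(size)]


def select_best_algorithm(algorithms, population, train_and_score):
    """Run every GA on a copy of the same population and compare the results.

    algorithms:      dict mapping a name to a callable(population) -> chromosome
    train_and_score: callable(chromosome) -> (error, complexity)
    """
    results = {}
    for name, run in algorithms.items():
        # Step 4: every algorithm starts from its own copy of the population.
        chromosome = run([list(c) for c in population])
        # Step 8: the resulting chromosome is passed to model training.
        results[name] = train_and_score(chromosome)
    best = min(results, key=lambda name: results[name])
    return best, results
```

In the proposed system the callables would be VEGA, FFGA, NPGA, NSGA, SPEA and SPEA-2, and the scoring function would train the decoded network on the training sample.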

V. RESULTS
To test our system, the basic MNIST dataset [13] was chosen. The optimal parameters obtained for our DNN model are shown in Fig. 5. The results of the system to reach

VI. CONCLUSIONS
The optimal choice of parameters for a deep neural network was considered.
An overview of genetic algorithms was given.
A system for solving the problem of choosing optimal parameters was designed.
The effectiveness of the proposed system is confirmed by results.