Volume 1(58)

CONTENTS

  1. Lyakhov O. A. Renewable resources accounting in integer models of project scheduling
  2. Skopin I. N. Time Model for Studying Evolving Systems
  3. Akhatov A. R., Renavikar A., Rashidov A. E., Nazarov F. M. Optimization of the number of databases in the big data processing
  4. Kossov G. A., Seleznev I. A. Influence of neural network parameters for the quality of prediction for the tasks of automatic lithotype description
  5. Snytnikova T. V. Associative computing implementation library cuSTAR: data representation for bioinformatics problems
  6. Kharyutkina S. A., Gavrilov A. V., Yakimenko A. A. Choosing operator emotions as feedback for training neural networks

O. A. Lyakhov

Institute of Computational Mathematics and Mathematical Geophysics SB RAS, 630090, Novosibirsk, Russia

RENEWABLE RESOURCES ACCOUNTING IN INTEGER MODELS OF PROJECT SCHEDULING

DOI: 10.24412/2073-0667-2023-1-5-11

EDN: PWEWCU

In scheduling complex sets of operations, renewable resources are usually assumed to be constant, which does not always agree with management practice. Defining renewable resources as non-storable (“power”-type) resources whose non-use leads to their loss does not fully reflect their specific nature. Formalizing the redistribution of renewable resources depends on how the conditions of their use are represented in models. Redistribution of resources is considered using the example of a network model that minimizes resource imbalance when directive completion times are set for the schedule.

Key words: project, network models, scheduling, renewable resources.

Bibliographic reference: Lyakhov O. A. Renewable resources accounting in integer models of project scheduling //journal “Problems of informatics”. 2023, № 1. P.5-11. DOI:10.24412/2073-0667-2023-1-5-11


I. N. Skopin

Institute of Computational Mathematics and Mathematical Geophysics SB RAS, 630090, Novosibirsk, Russia
Novosibirsk State University, 630090, Novosibirsk, Russia

TIME MODEL FOR STUDYING EVOLVING SYSTEMS

DOI: 10.24412/2073-0667-2023-1-12-32

EDN: PXACQI

Approaches to defining model time in studies of developing systems are discussed. It is shown that the global time of a system can be set using the local times of its elements, understood as protocols of the events in which they participate. Combining all such protocols leads to a partial order of events. It is proposed to use this order as the global time of the system. It is shown that this definition of time is correct and that it combines well with an event control mechanism in simulation models.
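
The idea of combining local protocols into a partial order can be illustrated with a minimal Python sketch. It is not taken from the article: it assumes that each element's protocol is a plain ordered list of event labels, that an event shared by several elements carries the same label in each protocol, and that the global order is the transitive closure of the local orders. The names `global_order` and `concurrent` are hypothetical.

```python
from itertools import combinations


def global_order(protocols):
    """Return all pairs (a, b) such that event a precedes event b.

    `protocols` maps an element name to the ordered list of events
    that this element took part in (its local time).
    """
    # Direct precedence: consecutive events within one protocol.
    succ = {}
    for events in protocols.values():
        for e in events:
            succ.setdefault(e, set())
        for a, b in zip(events, events[1:]):
            succ[a].add(b)

    # Transitive closure: depth-first search from every event.
    order = set()
    for start in succ:
        stack, seen = list(succ[start]), set()
        while stack:
            e = stack.pop()
            if e in seen:
                continue
            seen.add(e)
            stack.extend(succ[e])
        if start in seen:
            # An event that precedes itself means the protocols disagree.
            raise ValueError(f"inconsistent protocols: cycle through {start!r}")
        order.update((start, e) for e in seen)
    return order


def concurrent(order, events):
    """Pairs of events that the partial order leaves unordered."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(events), 2)
        if (a, b) not in order and (b, a) not in order
    )


# Two elements that meet in the shared event "e2".
protocols = {
    "p": ["e1", "e2", "e4"],
    "q": ["e3", "e2", "e5"],
}
order = global_order(protocols)
unordered = concurrent(order, {"e1", "e2", "e3", "e4", "e5"})
```

In this example "e1" and "e3" stay unordered, as do "e4" and "e5", because no protocol relates them; every other pair is ordered through the shared event "e2". This is what makes the resulting global time partial rather than total.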

Key words: local and global time; partial order relation on the set of events; events, reaction of elements to events; event protocols.

Bibliographic reference: Skopin I. N. Time Model for Studying Evolving Systems //journal “Problems of informatics”. 2023, № 1. P.12-32. DOI:10.24412/2073-0667-2023-1-12-32



A. R. Akhatov, A. Renavikar*, A. E. Rashidov, F. M. Nazarov

Samarkand State University, 140101, Samarkand, Uzbekistan
*NeARTech Solution, 411033, Pune, India

OPTIMIZATION OF THE NUMBER OF DATABASES IN THE BIG DATA PROCESSING

DOI: 10.24412/2073-0667-2023-1-33-47

EDN: QBRKTM

Today, many organizations and companies increasingly need Big Data to increase their income, strengthen their competitiveness, and study the interests of their customers. However, most approaches to real-time processing and analysis of Big Data rely on the cooperation of several servers. In turn, the use of multiple servers limits the possibilities of many organizations and companies because of cost, management, and other factors. This paper presents an approach to real-time processing and analysis of Big Data on a single server based on a distributed computing engine, and the research shows that the approach is efficient in terms of cost, reliability, integrity, network independence, and manageability. To further improve the efficiency of the approach, a methodology for optimizing the number of databases on a single server was developed. This methodology uses the MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler, QuantileTransformer, and PowerTransformer scaling functions together with the machine learning algorithms Linear Regression, Random Forest Regression, Multiple Linear Regression, Polynomial Regression, and Lasso Regression. The obtained results were analyzed, and the effectiveness of each regression algorithm and scaling function was determined for the experimental data.

Key words: Big Data, Real Time Processing, Single Server Distributed Computing Engine, Architecture, Machine Learning, Regression Algorithms, Scaling.

Bibliographic reference: Akhatov A. R., Renavikar A., Rashidov A. E., Nazarov F. M. Optimization of the number of databases in the big data processing //journal “Problems of informatics”. 2023, № 1. P.33-47. DOI:10.24412/2073-0667-2023-1-33-47



G. A. Kossov, I. A. Seleznev

LLC “TCS”, 125171, Moscow, Russia

INFLUENCE OF NEURAL NETWORK PARAMETERS FOR THE QUALITY OF PREDICTION FOR THE TASKS OF AUTOMATIC LITHOTYPE DESCRIPTION

DOI: 10.24412/2073-0667-2023-1-48-59

EDN: QQFRGC

Machine learning methods are widely used for interpreting and describing geological and geophysical data. One such task is automatic lithology extraction from whole-core photographs. In this paper we propose to analyze the parameters that represent the textural and color features of the images. The advantage of this approach is that it allows online training and retraining of the classification model. Among existing classification methods such as boosting, random forests, and support vector machines, neural networks are preferred for their universality and their availability in various programming toolkits. Applying neural networks requires the user to have a clear understanding of the modelling goals, because the choice of model architecture is an important factor.

There are many parameters that are set by the user, and all of them affect the quality of the prediction. Therefore, the purpose of this research is to study the behavior of networks with various configurations and to find common regularities. The paper considers the problem of classifying lithotypes using fully connected neural networks. The input data are color and textural features obtained by processing whole-core images. Thus, we consider the task of classifying examples with 48 features into 20 classes corresponding to certain lithotypes. The test sample consisted of 2998 elements. We trained the model on samples consisting of 10,000 and 1,000 elements. The hyperparameters of the model include the loss function, optimization method, activation function, batch size, number of epochs, number of hidden layers, and number of neurons in a layer. For a given problem, the choice of some parameters or functions can be justified in advance. For the classification problem, ReLU and LogSoftMax are the preferred activation functions. CrossEntropyLoss was used as the loss function. This loss function combines LogSoftMax and NLLLoss, so the use of LogSoftMax is also justified by the simpler calculation of CrossEntropyLoss. We use the Adam algorithm as the optimization method. The quality of the model was evaluated using the F1-score metric. Training a model with a fixed number of layers and nodes but different batch sizes showed that the optimal batch size is 256 elements. Based on this assumption, we determined that 30 epochs are enough to train the model. Overall, among the large set of network hyperparameters, it is difficult to determine the exact number of network elements, i.e. the number of layers and neurons.

Therefore, in the current research we study the dependence of the F1-score and the value of the loss function on the number of nodes in a layer. The paper shows that an increase in the number of neurons leads to a gain in quality: the F1-score equals 1 in all cases beyond 10 neurons in a layer. Moreover, a model with an unsuitable number of layers can be improved by increasing the number of neurons in each layer. Increasing the number of layers allows the model to construct a more complex approximation, which can improve the quality of the prediction. However, as the number of layers increases, there is a risk of overfitting and of local minima of the error function appearing, which leads to training problems. Thus, the number of nodes in a layer is the defining parameter and should be set first. An important factor in model training is the time spent. In this research, we propose an estimate of the algorithm complexity as a function of the number of layers (m) and nodes (n). The estimate is given in O-notation. It is shown that the number of operations performed grows linearly, O(m), in the number of layers and cubically, O(n^3), in the number of neurons. Consequently, with respect to the number of operations, it is preferable to increase the number of network layers. However, a larger number of elements does not guarantee a rise in the F1-score. The predictions of some classification algorithms (for example, boosting or random forest) are highly dependent on the initial values of the parameters. In our case, the dependence of the loss value on the random initialization of the neural network weights was investigated. We use the Epps-Pulley test to check the normality of the loss value distribution. The tests have shown that the distribution of the loss value is not Gaussian. This fact should be taken into account when setting requirements for the reproducibility of experimental results. The starting model weights should be initialized accordingly.
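
The statement that CrossEntropyLoss combines LogSoftMax and NLLLoss can be checked numerically. The following is a minimal pure-Python sketch for a single example, not the authors' code. The function names mirror the terms used in the abstract but are written from scratch here, and the logits are made-up values chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class, given log-probabilities."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross-entropy computed directly from raw logits: log-sum-exp minus the target logit."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[target]


# Hypothetical network outputs for one example with three classes.
logits = [2.0, 0.5, -1.0]
target = 0

composed = nll_loss(log_softmax(logits), target)  # LogSoftMax followed by NLLLoss
direct = cross_entropy(logits, target)            # the combined loss
```

Both routes give the same value up to rounding, which is why a network that ends in LogSoftMax pairs naturally with NLLLoss, while a network that outputs raw logits pairs with the combined loss.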

Key words: neural network, lithotype description, core analysis, hyperparameters, supervised learning.

Bibliographic reference: Kossov G. A., Seleznev I. A. Influence of neural network parameters for the quality of prediction for the tasks of automatic lithotype description //journal “Problems of informatics”. 2023, № 1. P.48-59. DOI:10.24412/2073-0667-2023-1-48-59


T. V. Snytnikova

Institute of Computational Mathematics and Mathematical Geophysics SB RAS, 630090, Novosibirsk, Russia

ASSOCIATIVE COMPUTING IMPLEMENTATION LIBRARY CUSTAR: DATA REPRESENTATION FOR BIOINFORMATICS PROBLEMS

DOI: 10.24412/2073-0667-2023-1-60-68

EDN: QWFFMA

Over the past few years, genome processing has become a widely sought-after task. Both medical laboratories (from PCR tests to genetic passports) and research teams are engaged in various kinds of processing. Both groups process large amounts of data, either because of the number of samples or because of the length of these samples: from tens of thousands to several billion nucleotides. Note that a large part of the calculations involves searching for individual nucleotides or their sequences in a larger sequence or in a large number of sequences. It is therefore advisable to use associative parallel computing. However, associative architectures are not available on the computer hardware market, unlike widely available graphics accelerators. The cuSTAR library was designed to implement the STAR-machine associative computing model on graphics accelerators. In this paper, a method of organizing data for processing genomes by associative algorithms is proposed.

In this paper, we propose several methods of data organization. Such an organization allows associative algorithms to be used for various tasks related to genome processing. Let us briefly recall the associative model of the STAR machine and its cuSTAR implementation. Both the cuSTAR library and the STAR-machine model use three data types for associative processing. The Table type stores data as a binary table. The Slice type is used to access a bit column, and the Word type is used to access a bit string. Note that data processing is performed mainly on bit columns. Therefore, the representation of data in the cuSTAR system is fundamentally different from the usual one. Usually, a sequence of nucleotides is represented by an array of characters. It can be considered as a binary table in which each row specifies one character; that is, the data is stored row by row. In cuSTAR, a variable of type Table is stored by columns.

The alphabet of nucleotides consists of the symbols A (adenine), C (cytosine), G (guanine) and T (thymine). In addition, the “-” symbol is often used in the data to indicate possible gaps in reading, insertions or deletions in the nucleotide sequence. Thus, four or five characters are used, depending on the task. We propose two ways to encode a sequence of nucleotides. The first method is optimized for memory usage. The second method is optimized for the time needed to search for a nucleotide in the sequence. The memory-optimized method uses the following encoding: “000” for the “-” symbol, “001” for adenine, “011” for cytosine, “101” for guanine, “111” for thymine. The time-optimized method uses the following encoding: “1000” for adenine, “0100” for cytosine, “0010” for guanine, “0001” for thymine. It uses 4 bits instead of 3, but replaces the task of searching for a word in the table with a less time-consuming one: to find all occurrences of a nucleotide in the sequence, one only needs to determine the position of the “1” in the code of this nucleotide. The proposed data encoding methods are more compact than the standard representation as an array of characters. The time-optimized method makes it possible to search for nucleotides in a sequence an order of magnitude faster than the procedure for the memory-optimized method. However, the memory-optimized method is preferable if the nucleotide sequence is represented as a graph. In this case, the de Bruijn graph is constructed from the original sequence of nucleotides in a trivial way, whereas with symbolic encoding of nucleotides this is a time-consuming and memory-consuming task.
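
The difference between the two encodings can be sketched in plain Python. This is only an illustration of the idea and not the cuSTAR API: Python lists stand in for bit columns that a GPU would process in parallel, and the helper names `to_columns`, `find_time_optimized` and `find_memory_optimized` are hypothetical. The code tables are the ones given in the abstract.

```python
# Code tables from the abstract.
MEMORY_CODE = {"-": "000", "A": "001", "C": "011", "G": "101", "T": "111"}
TIME_CODE = {"A": "1000", "C": "0100", "G": "0010", "T": "0001"}


def to_columns(sequence, code):
    """Store the encoded sequence by columns: columns[j][i] is bit j of symbol i."""
    width = len(next(iter(code.values())))
    return [[int(code[ch][j]) for ch in sequence] for j in range(width)]


def find_time_optimized(columns, nucleotide):
    """One-hot code: the column holding the '1' of the code is already the answer."""
    j = TIME_CODE[nucleotide].index("1")
    return columns[j]


def find_memory_optimized(columns, nucleotide):
    """General word match: combine all columns, inverting those where the code bit is 0."""
    pattern = MEMORY_CODE[nucleotide]
    match = [1] * len(columns[0])
    for j, bit in enumerate(pattern):
        match = [m & (c if bit == "1" else 1 - c) for m, c in zip(match, columns[j])]
    return match


seq = "GATTACA"
time_cols = to_columns(seq, TIME_CODE)
memory_cols = to_columns(seq, MEMORY_CODE)

hits_time = find_time_optimized(time_cols, "A")
hits_memory = find_memory_optimized(memory_cols, "A")
```

Both calls mark the same positions of "A" in the sequence, but the time-optimized search reads a single column, while the memory-optimized search has to combine three columns.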

When using cuSTAR, it is easy to construct a de Bruijn graph from a sequence of nucleotides for any parameter k. The graph is given by a list of edges, which is one of the standard representations for associative processing. Note that by defining the graph as a list of edges, we avoid problems associated with repeated arcs.

When reading the sequence, a table GEN of size 3l is formed, where l is the length of the input sequence. For a graph given by a list of arcs, we form tables LEFT and RIGHT of size 3k(l - k). The table LEFT is obtained by copying the columns of GEN k times into the corresponding columns with an upward shift. In turn, the table RIGHT is obtained by copying the table LEFT with an upward shift of one row. All tables are copied in parallel.
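
The construction of LEFT and RIGHT can be followed on a small sequence with the Python sketch below. It works sequentially on strings of bits instead of parallel table copies, so it only illustrates the resulting layout. The function names are hypothetical, and RIGHT is built directly from GEN with one extra shift so that its last row is defined, which is an assumption about how the shifted copy of LEFT is completed.

```python
# Memory-optimized 3-bit code from the abstract.
CODE = {"-": "000", "A": "001", "C": "011", "G": "101", "T": "111"}
DECODE = {bits: ch for ch, bits in CODE.items()}


def build_tables(sequence, k):
    """Return tables LEFT and RIGHT, each with l - k rows of 3k bits."""
    l = len(sequence)
    if not 0 < k < l:
        raise ValueError("k must satisfy 0 < k < len(sequence)")
    gen = [CODE[ch] for ch in sequence]  # table GEN: l rows of 3 bits
    rows = l - k
    # Row i of LEFT holds GEN[i], GEN[i+1], ..., GEN[i+k-1] side by side,
    # i.e. the j-th block of 3 columns is GEN shifted up by j rows.
    left = ["".join(gen[i + j] for j in range(k)) for i in range(rows)]
    # RIGHT is the same layout shifted up by one more row.
    right = ["".join(gen[i + 1 + j] for j in range(k)) for i in range(rows)]
    return left, right


def decode(row):
    """Turn a row of 3-bit codes back into nucleotide symbols."""
    return "".join(DECODE[row[p:p + 3]] for p in range(0, len(row), 3))


left, right = build_tables("ACGTAC", 2)
# Each pair of rows (LEFT[i], RIGHT[i]) is one arc between consecutive k-mers.
edges = [(decode(a), decode(b)) for a, b in zip(left, right)]
```

For the sequence "ACGTAC" and k = 2, the arcs are AC→CG, CG→GT, GT→TA and TA→AC. A repeated arc would simply appear as another row in both tables.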

Since genome processing involves multiple searches over a large amount of data, the development of associative algorithms for this area is relevant. The applied value of the work lies in the possibility of executing these algorithms on graphics accelerators, hardware that is widespread from personal computers to cluster systems.

Key words: associative parallel algorithms, bioinformatics, GPU, CUDA.

Bibliographic reference: Snytnikova T. V. Associative computing implementation library cuSTAR: data representation for bioinformatics problems //journal “Problems of informatics”. 2023, № 1. P.60-68. DOI:10.24412/2073-0667-2023-1-60-68


S. A. Kharyutkina, A. V. Gavrilov, A. A. Yakimenko

Novosibirsk State Technical University, 630073, Novosibirsk, Russia

CHOOSING OPERATOR EMOTIONS AS FEEDBACK FOR TRAINING NEURAL NETWORKS

DOI: 10.24412/2073-0667-2023-1-69-76

EDN: QWXYBT

The work is devoted to studying and selecting the human emotions with the highest probability of recognition, for training neural networks using operator emotions as feedback. Experiments to study emotions were set up and conducted using the presented program. The following emotions were studied: “anger”, “disgust”, “fright”, “happiness”, “sadness”, “surprise” and “neutral emotion”. The experiments identified the human emotions that the program recognizes with the greatest probability. The average probabilities of successful and unsuccessful recognition were calculated, and the similarity of emotions was analyzed. Assumptions are made about the use of operator emotions as feedback for training neural networks. The work addresses the problem of reducing the time needed to train a neural network aimed at solving socially significant economic problems. It is assumed that the approach will expand the use of neural networks in non-core industries by reducing the requirements on the operator/programmer and on computing resources.

Key words: artificial intelligence, neural network, emotions.

Bibliographic reference: Kharyutkina S. A., Gavrilov A. V., Yakimenko A. A. Choosing operator emotions as feedback for training neural networks //journal “Problems of informatics”. 2023, № 1. P.69-76. DOI:10.24412/2073-0667-2023-1-69-76
