2018 № 2(39)

Contents

  1. Rubtsova Y.V.  NEURAL NETWORK MODEL FOR OVERCOMING TIME GAP OF SENTIMENT CLASSIFICATION
  2. Razakova M.G. RADAR REMOTE SENSING METHODS OF FOREST COVER
  3. Bredikhin S.V., Lyapunov V.M., Shcherbakova N.G. SPECTRAL ANALYSIS OF THE JOURNAL CITATION NETWORK
  4. Blagodarniy A.I. SOFTWARE TOOLS FOR BUILDING AUTOMATED CONTROL SYSTEMS IN THE ENVIRONMENT OF THE DOMESTIC OPERATING SYSTEM
  5. Kulikov I.M., Chernykh I.G. gooPhi: A NEW CODE FOR NUMERICAL MODELING OF ASTROPHYSICAL FLOWS ON INTEL XEON PHI SUPERCOMPUTERS

Rubtsova  Y. V. 

A. P. Ershov Institute of Informatics Systems, Novosibirsk State University, 630090, Novosibirsk, Russia

NEURAL NETWORK MODEL FOR OVERCOMING TIME GAP OF SENTIMENT CLASSIFICATION

UDC 004.912

This paper presents a neural network model for improving sentiment classification in dynamically updated natural-language text collections. Since social networks are constantly updated by their users, it is essential to take new jargon and newly discussed topics into account when solving the classification task. Therefore, a neural network model for this problem is proposed, along with a supervised and an unsupervised machine learning method; all of them were used for sentiment analysis. The paper shows that the quality of text classification by sentiment drops by up to 15 % in terms of F-measure over one and a half years. The aim of the approach is therefore to minimise this decrease in F-measure when classifying text collections that are spaced over time. Experiments were carried out on sufficiently representative text collections, which are briefly described in the paper.

Automatic sentiment classification is a topical subject. A great amount of the information contained in social networks is represented as text in natural language, so computational linguistics methods are required to process it. Over roughly the past ten years, many researchers worldwide have worked on automatically extracting and analysing the texts of social media, and sentiment classification of natural-language texts has been considered one of the main tasks.

Research and experiments on automatic text classification show that the final results of classification depend strongly on the training text set and on the subject area to which the training collection corresponds. A great number of projects centre on feature engineering and the involvement of additional data, such as external text collections (that do not overlap with the training collection) or sentiment vocabularies. Additional information can reduce the reliance on the training collection and improve classification results. In order to successfully classify texts by sentiment, it is necessary to have text collections tagged by sentiment. Moreover, in order to improve sentiment classification in dynamically updated text collections, it is necessary to have several collections with identical properties, compiled in different periods of time. The prepared text collections formed the basis for the training and test collections of Twitter posts used to assess the sentiment of tweets towards a given subject at the SentiRuEval classifier competitions in 2015 and 2016. It was shown that the collections are complete and sufficiently representative.

Previously the author showed quite good results with models built on a feature space for training the classifier; such a feature space is based on the training collection and is therefore highly dependent on the quality and completeness of this collection. As described above, there are no semantic relationships between the terms, and the addition of new terms increases the dimension of the feature vector space. Another way to overcome the obsolescence of a lexicon is to use distributed word representations as features to train the classifier, so this paper focuses on distributed word representations.

At the basis of this approach are the concept of a distributed word representation and the Skip-gram neural language model. External resources were used here: the distributed word representation space was built on an untagged collection of tweets gathered in 2013 that was many times larger than the automatically tagged training collection. It is important to mention that the length of the vectors was only 300; this is the first advantage of the approach. The second advantage is the classification results: the difference between the collections of 2013 and 2015 is 0.26 % in terms of F-measure.
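A minimal sketch of how distributed word representations can serve as classifier features. The paper does not spell out the feature construction, so the averaging scheme, the toy vocabulary and the random vectors below are illustrative assumptions; only the 300-dimensional vector size is taken from the text:

```python
import numpy as np

# Toy stand-in for a Skip-gram embedding table. In the paper the vectors
# are 300-dimensional and trained on an untagged 2013 tweet collection;
# here the vocabulary and values are illustrative only.
DIM = 300
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=DIM) for w in ["good", "bad", "movie", "service"]}

def tweet_vector(tokens, emb, dim=DIM):
    """Represent a tweet as the mean of its word vectors.

    Out-of-vocabulary tokens are skipped; an all-zero vector is
    returned when no token is known."""
    known = [emb[t] for t in tokens if t in emb]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

features = tweet_vector(["good", "movie", "zzz"], embeddings)
print(features.shape)  # (300,)
```

Because the tweet vector has a fixed small dimension regardless of vocabulary growth, new terms do not enlarge the feature space, which is the point made above.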

In summary, the proposed approach can reduce the deterioration of sentiment classification results for collections staggered over time.

Key words: natural language processing, sentiment analysis, sentiment classification, machine learning.

Bibliographic reference: Rubtsova Y. V. Neural network model for overcoming time gap of sentiment classification //journal “Problems of informatics”. 2018, № 2. P. 4-14.

Article


Razakova M. G.

JSC National center of space researches and technologies, 050010, Almaty, Kazakhstan

RADAR REMOTE SENSING METHODS OF FOREST COVER

UDC 550.388.2

By analysing the statistical information of a group of radar images, we determine an optimal level of filtration for the automatic allocation of the main objects of the observed surface. The analysis is based on TerraSAR-X satellite data in both parallel and cross-polarization modes (VV, VH). Since radar imaging is performed at an angle, preliminary radiometric calibration of the magnitudes is required; for this, we need to extract the value of sigma naught (radiometric calibration). Many different filters can be used to eliminate high-frequency noise; the simplest is the average filter. To determine the degree of averaging, it is possible to calculate an analytical relationship between the size of the signature smoothing and the residual values of the magnitudes of the radar data. The ability of radar to detect texture is a major advantage over other types of imagery, where texture is not a quantitative characteristic. Filtration methods can distinguish the object component of the signal.
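The average filter mentioned above, the simplest noise-suppression choice, can be sketched as a k × k box filter; the edge-padding strategy and sample values below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def average_filter(img, k=3):
    """Smooth a 2D intensity array with a k x k mean (box) filter.

    Edges are handled by padding with edge values; k is assumed odd."""
    r = k // 2
    padded = np.pad(img, r, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += padded[r + dy : r + dy + img.shape[0],
                          r + dx : r + dx + img.shape[1]]
    return out / (k * k)

noisy = np.array([[1., 1., 1.],
                  [1., 10., 1.],
                  [1., 1., 1.]])
print(average_filter(noisy, 3)[1, 1])  # 2.0 — the speckle spike is averaged away
```

Increasing k corresponds to stronger signature smoothing, the quantity related analytically to the residual magnitudes in the abstract above.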

There are three main groups of image processing algorithms on computers: (a) initial (pre-)processing algorithms for restoration, cleaning from random noise, improving quality, and correcting the geometric distortions of radar systems; (b) thematic image processing and pattern recognition algorithms, performed to determine the parameters of image detail; they include finding areas of the image homogeneous in light and colour, extracting feature shapes, identifying the coordinates of the singular points of objects, and so on; (c) algorithms for the targeted isolation of specific objects by image binarization: in accordance with a predetermined threshold, the separate elements whose values are suitable under the conditions of the task at hand are identified.
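Group (c), target isolation by thresholding, admits a very small sketch; the threshold and sample magnitudes are hypothetical:

```python
import numpy as np

def binarize(img, threshold):
    """Target isolation by binarization: elements whose backscatter
    magnitude exceeds the threshold are marked 1, the rest 0."""
    return (img > threshold).astype(np.uint8)

scene = np.array([[0.2, 0.8],
                  [0.9, 0.1]])
mask = binarize(scene, 0.5)
print(mask)  # [[0 1]
             #  [1 0]]
```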

In this work, we show that under regular application of filtration methods to radar data, the area covered by vegetation has the maximum difference in the cross-polarization image in comparison with the classified forest area.

Key words: radar satellite data, forest classification.

Bibliographic reference: Razakova M.G. Radar remote sensing methods of forest cover //journal “Problems of informatics”. 2018, № 2. P. 15-23.

Article


Bredikhin S. V.,  Lyapunov V. M.,  Shcherbakova N. G.

Institute of Computational Mathematics and Mathematical Geophysics SB RAS, 630090, Novosibirsk, Russia

SPECTRAL ANALYSIS OF THE JOURNAL CITATION NETWORK

UDC 001.12+303.2

In this paper we investigate methods of spectral clustering for the analysis of journal citation networks. The clustering problem is reduced to min-cut graph partitioning: finding a partition of the graph such that the edges between different groups have very low weights and the edges within a group have high weights. This means that objects in different clusters are dissimilar from each other and objects within the same cluster are similar to each other, see C. J. Alpert, S.-Z. Yao (1995). Graph partitioning problems cannot be solved exactly in polynomial time, so for practical applications approximate solution methods have been developed. One of the most widely used is the spectral partitioning method. Spectral methods usually involve taking the eigenvectors of some matrix based on relations between data elements. Most spectral clustering algorithms cluster the data with the help of eigenvectors of graph Laplacian matrices.

We study two major versions of spectral clustering, the so-called “unnormalized” and “normalized” spectral clustering, which reveal the relationship between the formulation of the objective function and the matrix used in the eigenvalue equation. Unnormalized spectral bi-clustering algorithms use the Laplacian matrix L = D - A, solving the problem Lv = λv and assigning vertices to clusters according to the signs of the elements of the eigenvector v corresponding to the second smallest eigenvalue. Simplified versions of the unnormalized spectral bi-clustering method are presented as techniques for confirming the consistency of the approach. As shown in M. E. J. Newman, M. Girvan (2004), this class of spectral clustering is only consistent under strong additional assumptions, which are not always satisfied in real data. Most normalized spectral bi-clustering algorithms use the symmetric normalized Laplacian matrix for these purposes, see J. Shi, J. Malik (2000). As shown in M. Meila, J. Shi (2001), the same results can be obtained by using the largest eigenvector of the random walk matrix. Spectral k-way clustering uses not only the second but also the next few eigenvectors to construct a partition.
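The unnormalized bi-clustering step described above (build L = D - A, split by the signs of the eigenvector of the second smallest eigenvalue) can be sketched as follows; the toy graph is an illustrative assumption, not data from the paper:

```python
import numpy as np

def spectral_bipartition(A):
    """Unnormalized spectral bi-clustering: form L = D - A and assign
    vertices to clusters by the signs of the components of the Fiedler
    vector (eigenvector of the second smallest eigenvalue of L)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    vals, vecs = np.linalg.eigh(L)   # eigh: L is symmetric, eigenvalues ascending
    fiedler = vecs[:, 1]             # second smallest eigenvalue's eigenvector
    return fiedler >= 0              # boolean cluster labels

# Two triangles (vertices 0-2 and 3-5) joined by a single weak bridge edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = spectral_bipartition(A)
print(labels[:3], labels[3:])  # one triangle per cluster
```

The min-cut intuition is visible here: the sign change in the Fiedler vector falls exactly on the low-weight bridge between the two dense groups.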

The journal citation network under study is built on the basis of the bibliographic information extracted from the DB RePEc. The main component of the corresponding weighted digraph G has 1729 vertices (journals) and 135702 arcs (citations). We analyze the work of two spectral clustering algorithms in the context of three versions of transformation of the digraph G to an undirected form; the graphs examined are represented by matrices derived from the journal-journal citation matrix A. Algorithm WTR, P. Pons, M. Latapy (2005), is an agglomerative algorithm based on the random walk matrix. Algorithm LEV, M. E. J. Newman (2006), is a bi-clustering algorithm based on the modularity matrix. The algorithms are implemented using the igraph package (C library). We use similarity indexes as measures of the similarity of two data clusterings. In some cases the similarity is low; the highest similarity is reached for one of the graph transformations. WTR clusters of small size (less than 200) can be interpreted in terms of thematic fields. The results are presented in Tables 1–6. We can see that the results strongly depend on the digraph transformation and the algorithm used.
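For illustration, a bi-clustering step in the spirit of the LEV algorithm (Newman's leading-eigenvector method on the modularity matrix) applied to a symmetrized digraph might look as follows; the symmetrization A + Aᵀ and the toy graph are assumptions for the sketch, not details taken from the paper:

```python
import numpy as np

def leading_eigenvector_split(A_directed):
    """Bi-clustering in the spirit of Newman's leading-eigenvector (LEV)
    method: symmetrize the digraph, form the modularity matrix
    B = A - k k^T / (2m), and split vertices by the signs of the
    eigenvector of the largest eigenvalue of B."""
    A = A_directed + A_directed.T       # one way to make the digraph undirected
    k = A.sum(axis=1)                   # vertex degrees
    two_m = k.sum()
    B = A - np.outer(k, k) / two_m      # modularity matrix
    vals, vecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    return vecs[:, -1] >= 0             # signs of the leading eigenvector

# Two directed triangles (vertices 0-2 and 3-5) linked by a single arc 2 -> 3.
A_dir = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A_dir[i, j] = 1.0
labels = leading_eigenvector_split(A_dir)
print(labels[:3], labels[3:])  # one triangle per cluster
```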

Key words: journal citation network, co-citation network, bibliographic coupling network, weighted directed graph, graph partitioning, spectral clustering.

Bibliographic reference:  Bredikhin S.V., Lyapunov V.M., Shcherbakova N.G. Spectral analysis of the journal citation network //journal “Problems of informatics”. 2018, № 2. P. 24-40.

Article


Blagodarniy  A. I. 

Institute of Computational Technologies of SB RAS, 630090, Novosibirsk, Russia

SOFTWARE TOOLS FOR BUILDING AUTOMATED CONTROL SYSTEMS IN THE ENVIRONMENT OF THE DOMESTIC OPERATING SYSTEM

UDC 004.9

The article describes the construction, the schemes of component interrelation and the tool kernel of a SCADA system on the platform of the Russian networked real-time operating system Neutrino KPDA.10964-01, which is developed and maintained by the Russian company SVD Embedded Systems from St. Petersburg. The use of a Russian operating system answers the task of software import substitution, whose importance is increasing because of geopolitical risks and the growing pressure of economic sanctions from unfriendly states.

The product described in this article is the result of developing the SCADA system BLACART, created at the ICT SB RAS and functioning in the environment of the operating system QNX 4.25. This SCADA system has displayed high performance in dozens of deployed control systems in different branches of industry, as a rule in dangerous production. A certificate of conformity and a permit for its application at mining enterprises were received.

As a result of porting the SCADA system Blacart to the Neutrino operating system, the structure of the SCADA system has not changed. Only the program code was changed; in particular, all calls to the various system subprogram libraries were converted to POSIX-compatible form. Having preserved the structure as well as the main characteristics and functional capabilities of the SCADA system while porting it to the Neutrino environment, it also became possible to keep all of the previous product qualities that proved efficient over long-term operation of the deployed control systems.

The newly developed SCADA system is realized as a distributed computing technological network, which is simultaneously a local area network based on the Qnet networking protocol of the Neutrino operating system.

The software of the SCADA system is a hierarchical association of two subsystems: an upper-level subsystem and a lower-level one, deployed at one or more units of the technological network. The upper-level subsystem is the operator's automated workstation, which includes the operator's graphical interface and the control system for the operational and archive databases. The lower-level subsystem implements the interface with the monitoring and control of the technological equipment.

The elaboration or modification of a specific automated process control project on the basis of the newly developed SCADA system is reduced to constructing the operator's graphical interface with the system application builder (graphics editor) Photon Application Builder and to compiling a set of text configuration files for the upper- and lower-level subsystems.
All operator workstations are fully equal in status display and technological equipment control operations, their local databases being synchronized with each other. The absence of a dedicated data server in the technological network is the first and major peculiarity of the SCADA system software. Such an approach realizes the principle of multiple hot backup in automated control systems.
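The server-less replication principle described above can be caricatured in a few lines; everything here (the class, the tag names, the broadcast scheme) is an illustrative assumption, not the Blacart code:

```python
# Sketch: every operator workstation keeps a full local copy of the tag
# database and replicates each update to all peers, so there is no
# dedicated data server and any surviving node still holds the complete
# current state (multiple hot backup).
class Workstation:
    def __init__(self, name):
        self.name = name
        self.tags = {}          # local operational database
        self.peers = []

    def update(self, tag, value):
        """Apply a change locally and replicate it to every peer."""
        self.tags[tag] = value
        for peer in self.peers:
            peer.receive(tag, value)

    def receive(self, tag, value):
        self.tags[tag] = value  # peers apply the replicated change

a, b, c = Workstation("A"), Workstation("B"), Workstation("C")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
a.update("pump_1.pressure", 4.2)
print(b.tags == c.tags == {"pump_1.pressure": 4.2})  # True
```

If any single workstation fails, the remaining nodes continue with identical databases, which is exactly the hot-backup property claimed for the architecture.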
The described SCADA system has been designed as a real-time SCADA system and is intended for automating technological processes that put tough demands on the reaction time of the control system to one or another event. A guaranteed short reaction time to any single event is one more major peculiarity of the SCADA system.

Protection of the newly developed SCADA system from cyber-attacks can only be achieved by strict control of users' access to management functions in accordance with their categories, as well as by blocking all functions of direct use of the operating system by the operators. The characteristics of the Neutrino operating system itself also contribute to safety. As for software vulnerabilities, they can be detected and neutralized relatively easily, since the program code of the SCADA system is open.

Key words: SCADA, import substitution of software, QNX, Neutrino.

Bibliographic reference:  Blagodarniy A.I. Software tools for building automated control systems in the environment of the domestic operating system //journal “Problems of informatics”. 2018, № 2. P. 41-51.

Article


Kulikov I. M. ,  Chernykh I. G.

Institute of Computational Mathematics and Mathematical Geophysics SB RAS, 630090, Novosibirsk, Russia

gooPhi: A NEW CODE FOR NUMERICAL MODELING OF ASTROPHYSICAL FLOWS ON INTEL XEON PHI SUPERCOMPUTERS

UDC 519.6, 524.3

In this paper, a new hydrodynamics code called gooPhi for simulating astrophysical flows on modern Intel Xeon Phi processors with the KNL architecture is presented. An astrophysical phenomenon, the formation of a jellyfish galaxy, is considered. It is known that the main scenarios of the formation of these objects are based on the ram-pressure mechanism of the intergalactic gas or on the galactic wind driven by an active galactic nucleus. However, the ram pressure can also arise as a result of a collision of galaxies with different masses. This scenario was investigated in the present work using the developed code. A new vector numerical method implemented as a program code for massively parallel architectures is proposed. For the numerical solution of the hydrodynamic equations, a modification of the original numerical method based on a combination of the operator splitting method, the Godunov method and the HLL solver was used. This method combines the advantages of the above methods and has a high degree of parallelism. The parallel implementation is based on a multi-level decomposition of computations. At the first level, geometric decomposition of the computational domain by means of the MPI library is used. At the second level, computations are decomposed between the threads of the Intel Xeon Phi accelerator by means of the OpenMP library. Within each thread, computations are vectorized by means of AVX-512. It should be noted that the construction of the numerical method allows all these kinds of decomposition. The results of the verification of the numerical method on three Godunov tests and on the Sedov blast wave test are presented. The purpose of the first test is to check the correctness of the contact discontinuity description. Most methods for solving the hydrodynamics equations yield either oscillations or diffusion of shock waves.
The authors' method gives some diffusion of the shock wave, while at the same time correctly reproducing the location of the shock wave, the contact discontinuity and the waveform of the rarefaction wave. In the second test, a gas with the same thermodynamic parameters expands in different directions, forming a rarefied region in the center. The test reveals the ability to simulate such a situation in a physically believable way. It is known from the literature that many methods give an erroneous (unphysical) temperature jump in the region of strong rarefaction, and as a result the solution is distorted. The authors' method successfully simulates the rarefaction region. The main idea of the third test is to check the stability of the numerical method. A big pressure drop (5 decimal orders of magnitude) should reveal the ability of the method to stably model strong perturbations with the emergence of rapidly propagating shock waves. The authors' method successfully simulates such a strong wave. The Sedov blast wave test is a standard test that verifies the ability of a method and its implementation to reproduce strong shock waves with large Mach numbers. The authors' numerical method reproduces quite well the position of the shock wave, as well as the density profile. A detailed description is given, and a parallel implementation of the code is made. A performance of 173 gigaflops and a 48-fold speedup are obtained on a single Intel Xeon Phi processor. A 97 per cent scalability is reached with 16 processors. In this paper, we considered the scenario of the formation of jellyfish-like galaxies based on the collision of two dwarf dSph galaxies that differ by an order of magnitude in mass. We also considered the chemical processes taking place in the tail of the galaxies by means of a complete system of chemical reactions and a shortened version that allows an analytical solution to be constructed. It is worth noting that the asymptotics of these solutions have the same nature.
Behind the front of the massive galaxy a tail is formed, in which the development of the Kelvin-Helmholtz instability produces an analog of turbulent flow, due to which the tail fragments into the tentacles observed in jellyfish galaxies. For characteristic temperature values and the characteristic concentration of neutral atomic hydrogen in the tentacles, the behavior of the concentrations of the various forms of hydrogen was modeled by means of the ChemPAK code; the hydrogen turned out to be for the most part ionized, with the molecular fraction amounting to a few thousandths of a percent. Evidently, the process of formation of molecular hydrogen plays a smaller role than the processes leading to the ionization of hydrogen. In this connection, an analytic solution of the ionization process is of main interest.
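The HLL solver named above as one ingredient of the scheme can be sketched for the 1D Euler equations as follows; the simple wave-speed estimates and the adiabatic index are common textbook choices, not necessarily those used in gooPhi:

```python
import numpy as np

GAMMA = 5.0 / 3.0  # adiabatic index (illustrative choice)

def euler_flux(rho, u, p):
    """Physical flux of the 1D Euler equations for state (rho, u, p)."""
    E = p / (GAMMA - 1.0) + 0.5 * rho * u * u
    return np.array([rho * u, rho * u * u + p, u * (E + p)])

def hll_flux(left, right):
    """HLL approximate Riemann solver for the 1D Euler equations.

    `left`/`right` are (rho, u, p) primitive states."""
    rl, ul, pl = left
    rr, ur, pr = right
    cl, cr = np.sqrt(GAMMA * pl / rl), np.sqrt(GAMMA * pr / rr)
    sl = min(ul - cl, ur - cr)          # leftmost wave-speed estimate
    sr = max(ul + cl, ur + cr)          # rightmost wave-speed estimate
    fl, fr = euler_flux(rl, ul, pl), euler_flux(rr, ur, pr)
    if sl >= 0.0:                       # supersonic to the right
        return fl
    if sr <= 0.0:                       # supersonic to the left
        return fr
    # conserved variables U = (rho, rho*u, E) for the subsonic HLL average
    Ul = np.array([rl, rl * ul, pl / (GAMMA - 1.0) + 0.5 * rl * ul * ul])
    Ur = np.array([rr, rr * ur, pr / (GAMMA - 1.0) + 0.5 * rr * ur * ur])
    return (sr * fl - sl * fr + sl * sr * (Ur - Ul)) / (sr - sl)

# Sanity check: for identical left and right states the HLL flux reduces
# to the physical flux.
state = (1.0, 0.5, 1.0)
print(np.allclose(hll_flux(state, state), euler_flux(*state)))  # True
```

In a Godunov-type scheme such as the one described, this flux is evaluated at every cell interface and combined with operator splitting for the multidimensional update.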

Key words: Numerical modeling, computational astrophysics, Intel Xeon Phi.

Bibliographic reference: Kulikov I.M., Chernykh I.G. gooPHI: a new code for numerical modeling of astrophysical flows on Intel Xeon phi supercomputers //journal “Problems of informatics”. 2018, № 2. P. 52-74.

Article