Volume 4(65)

CONTENTS

  1. Venzel A.S., Ivanisenko T.V., Demenkov P.S., Ivanisenko V.A. Software pipeline for predicting the impact of mutations on the stability of protein spatial structures using free energy change estimation methods and artificial intelligence 
  2. Venzel A. S., Klimenko A. I., Ivanisenko T. V., Demenkov P. S., Lashin S. A., Ivanisenko V. A. An approach for predicting protein abundance in yeast cells based on their genomical sequences 
  3. Demenkov P.S., Mukhin A.M., Ivanisenko V.A., Lashin S.A., Kolchanov N.A. The “Microbitech” digital platform: architecture and purpose 
  4. Ivanisenko T. V., Demenkov P.S., Ivanisenko V.A. Combined Approach to Associative Network Reconstruction: Integrating GraphSAGE and Co-occurrence Statistics 
  5. Lakhova T., Kazantsev F., Khlebodarova T., Matushkin Yu., Lashin S. Software module for studying the regulation of bacterial metabolic pathways by mathematical modeling methods 
  6. Lashin S., Kazantsev F., Lakhova T., Matushkin Yu. DynMicrobiotech: a software module for automatic reconstruction of frame-based dynamic models of microbial gene networks 
  7. Mukhin A., Oschepkov D., Lashin S. A computational pipeline for de novo recognition of transcription factor binding sites in bacterial genomes 

A.S. Venzel, T.V. Ivanisenko, P.S. Demenkov, V.A. Ivanisenko

Institute of Cytology and Genetics, SB RAS, 630090, Novosibirsk, Russia
Kurchatov Genomic Center of the Institute of Cytology and Genetics, SB RAS, 630090, Novosibirsk, Russia
Novosibirsk State University, 630090, Novosibirsk, Russia

SOFTWARE PIPELINE FOR PREDICTING THE IMPACT OF MUTATIONS ON THE STABILITY OF PROTEIN SPATIAL STRUCTURES USING FREE ENERGY CHANGE ESTIMATION METHODS AND ARTIFICIAL INTELLIGENCE

DOI: 10.24412/2073-0667-2024-4-6-16
EDN:EAMKIP

In this work, a computational pipeline was developed to predict the impact of mutations on the stability of protein structure. The pipeline combines state-of-the-art artificial intelligence methods for protein structure prediction with classical algorithms for free energy change estimation: the protein structure is first predicted with the ESM3 model, and the free energy changes of mutant forms are then calculated with PyRosetta. This approach helps overcome the limitations of existing methods by combining the advantages of deep learning with the interpretability of energy calculations. The developed tool can find applications in structural bioinformatics, biotechnology, and medicine, especially given the limited number of experimentally determined protein structures.
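
As a rough illustration of the last step, the stability change of a point mutation is commonly reported as the difference between the mutant and wild-type energies. The sketch below is not the authors' code. It parses a mutation written in the common `K3A` notation, applies it to a sequence, and takes the difference of two energy scores. The helper names and the sign convention (a positive value read as destabilizing) are assumptions. The energies would come from a scoring tool such as PyRosetta and are passed in here as plain numbers.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    wild: str      # wild-type residue, one-letter code
    position: int  # 1-based position in the sequence
    mutant: str    # substituted residue, one-letter code


def parse_mutation(text: str) -> Mutation:
    """Parse a point mutation such as 'K3A' (hypothetical helper)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", text)
    if match is None:
        raise ValueError(f"unrecognized mutation: {text!r}")
    return Mutation(match.group(1), int(match.group(2)), match.group(3))


def apply_mutation(sequence: str, mutation: Mutation) -> str:
    """Return the mutant sequence after checking the wild-type residue."""
    index = mutation.position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("mutation position is outside the sequence")
    if sequence[index] != mutation.wild:
        raise ValueError("wild-type residue does not match the sequence")
    return sequence[:index] + mutation.mutant + sequence[index + 1:]


def ddg(energy_wild: float, energy_mutant: float) -> float:
    """Energy difference, mutant minus wild type (assumed sign convention)."""
    return energy_mutant - energy_wild
```

For example, `apply_mutation("MAKT", parse_mutation("K3A"))` gives the mutant sequence `"MAAT"`, whose predicted structure would then be scored against the wild type.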

Key words: protein structure prediction, protein structure stability, molecular modeling, ESM3.

References

  1. Jumper J. et al. Highly accurate protein structure prediction with AlphaFold // Nature. 2021. V. 596, № 7873. P. 583-589.
  2. Abramson J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3 // Nature. 2024. P. 1-3.
  3. Baek M. et al. Accurate prediction of protein structures and interactions using a three-track neural network // Science. 2021. V. 373, № 6557. P. 871-876.
  4. Lin Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model // Science. 2023. V. 379, № 6637. P. 1123-1130.
  5. Thomas P.J., Qu B.H., Pedersen P.L. Defective protein folding as a basis of human disease // Trends in biochemical sciences. 1995. V. 20, № 11. P. 456-459.
  6. Kellogg E.H., Leaver-Fay A., Baker D. Role of conformational sampling in computing mutation-induced changes in protein structure and stability // Proteins: Structure, Function, and Bioinformatics. 2011. V. 79, № 3. P. 830-838.
  7. Dehouck Y. et al. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality // BMC bioinformatics. 2011. V. 12. P. 1-12.
  8. Schymkowitz J. et al. The FoldX web server: an online force field // Nucleic acids research. 2005. V. 33, № suppl_2. P. W382-W388.
  9. Montanucci L. et al. DDGun: an untrained method for the prediction of protein stability changes upon single and multiple point variations // BMC bioinformatics. 2019. V. 20. P. 1-10.
  10. Pires D.E.V., Ascher D.B., Blundell T.L. mCSM: predicting the effects of mutations in proteins using graph-based signatures // Bioinformatics. 2014. V. 30, № 3. P. 335-342.
  11. Nikam R. et al. ProThermDB: thermodynamic database for proteins and mutants revisited after 15 years // Nucleic Acids Research. 2021. V. 49, № D1. P. D420-D424.
  12. Xavier J.S. et al. ThermoMutDB: a thermodynamic database for missense mutations // Nucleic Acids Research. 2021. V. 49, № D1. P. D475-D479.
  13. Stourac J. et al. FireProtDB: database of manually curated protein stability data // Nucleic Acids Research. 2021. V. 49, № D1. P. D319-D324.
  14. Cao H. et al. DeepDDG: Predicting the Stability Change of Protein Point Mutations Using Neural Networks // J. Chem. Inf. Model. 2019. V. 59, № 4. P. 1508-1514.
  15. Umerenkov D. et al. PROSTATA: a framework for protein stability assessment using transformers // Bioinformatics. 2023. V. 39, № 11. P. btad671.
  16. Pak M.A. et al. Using AlphaFold to predict the impact of single mutations on protein stability and function // PLoS One. 2023. V. 18, № 3. P. e0282689.
  17. Mansoor S. et al. Zero-shot mutation effect prediction on protein stability and function using RoseTTAFold // Protein Science. 2023. V. 32, № 11. P. e4780.
  18. Akdel M. et al. A structural biology community assessment of AlphaFold2 applications // Nature Structural & Molecular Biology. 2022. V. 29, № 11. P. 1056-1067.
  19. Burley S.K. et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning // Nucleic Acids Research. 2023. V. 51, № D1. P. D488-D508.
  20. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023 // Nucleic Acids Research. 2023. V. 51, № D1. P. D523-D531.
  21. Hayes T. et al. Simulating 500 million years of evolution with a language model // bioRxiv. 2024. P. 2024.07.01.600583.
  22. Frenz B. et al. Prediction of Protein Mutational Free Energy: Benchmark and Sampling Improvements Increase Classification Accuracy // Front. Bioeng. Biotechnol. 2020. V. 8.
  23. Pancotti C. et al. Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset // Briefings in Bioinformatics. 2022. V. 23, № 2. P. bbab555.
  24. Chaudhury S., Lyskov S., Gray J.J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta // Bioinformatics. 2010. V. 26, № 5. P. 689-691.
  25. Alford R.F. et al. The Rosetta all-atom energy function for macromolecular modeling and design // Journal of chemical theory and computation. 2017. V. 13, № 6. P. 3031-3048.
  26. Zhang Y., Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score // Nucleic acids research. 2005. V. 33, № 7. P. 2302-2309.
  27. Kunzmann P., Hamacher K. Biotite: a unifying open source computational biology framework in Python // BMC Bioinformatics. 2018. V. 19, № 1. P. 346.

Bibliographic reference: Venzel A.S., Ivanisenko T.V., Demenkov P.S., Ivanisenko V.A. Software pipeline for predicting the impact of mutations on the stability of protein spatial structures using free energy change estimation methods and artificial intelligence // Journal “Problems of Informatics”. 2024, № 4. P. 6-16. DOI: 10.24412/2073-0667-2024-4-6-16.


A.S. Venzel1,2,3, A.I. Klimenko1,2, T.V. Ivanisenko1,2,3, P.S. Demenkov1,2,3, S.A. Lashin1,2,3, V.A. Ivanisenko1,2,3

1Institute of Cytology and Genetics, SB RAS, 630090, Novosibirsk, Russia
2Kurchatov Genomic Center of the Institute of Cytology and Genetics, SB RAS, 630090, Novosibirsk, Russia
3Novosibirsk State University, 630090, Novosibirsk, Russia

AN APPROACH FOR PREDICTING PROTEIN ABUNDANCE IN YEAST CELLS BASED ON THEIR GENOMICAL SEQUENCES

DOI: 10.24412/2073-0667-2024-4-17-26
EDN:HIAEDZ

This work presents a new method for predicting protein abundance in cells of baker’s yeast (Saccharomyces cerevisiae), based on the analysis of their biological sequences using pre-trained language models. ESM2 family models were applied to the amino acid sequences of proteins, and the GENA-LM model was applied to the nucleotide sequences of genes, which yielded informative embeddings of the input data. The study evaluates the impact of the architecture and size of the pre-trained language models on prediction accuracy. The proposed method has potential applications in biotechnology, optimization of biosynthesis processes, and computer-aided design of producer strains with enhanced expression of target proteins. The results may contribute to a deeper understanding of the mechanisms that regulate gene expression and open up prospects for predicting protein abundance in other microorganisms.
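
One common way to turn per-token embeddings from a language model into a single prediction is to average them into a fixed-length vector and feed that vector to a regression head. The abstract does not say which pooling or head was used, so the sketch below is only an illustration. Mean pooling, the linear head, and all names in it are assumptions, and the embeddings are plain Python lists rather than real ESM2 or GENA-LM outputs.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one fixed-length sequence embedding."""
    if not token_embeddings:
        raise ValueError("at least one token embedding is required")
    dim = len(token_embeddings[0])
    if any(len(row) != dim for row in token_embeddings):
        raise ValueError("all token embeddings must have the same dimension")
    count = len(token_embeddings)
    return [sum(row[j] for row in token_embeddings) / count for j in range(dim)]


def predict_abundance(features: Sequence[float],
                      weights: Sequence[float],
                      bias: float) -> float:
    """Linear regression head on the pooled embedding (assumed model form)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias
```

Pooling makes sequences of different lengths comparable, because the regression head always sees a vector of the same dimension.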

Key words: protein abundance, transformer, ESM2, machine learning.

References

  1. Vogel C., Marcotte E.M. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses // Nat Rev Genet. 2012. V. 13, № 4. P. 227-232.
  2. Schwanhäusser B. et al. Global quantification of mammalian gene expression control // Nature. 2011. V. 473, № 7347. P. 337-342.
  3. Rives A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences // Proceedings of the National Academy of Sciences. 2021. V. 118, № 15. P. e2016239118.
  4. Ji Y. et al. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome // Bioinformatics. 2021. V. 37, № 15. P. 2112-2120.
  5. Ferreira M. et al. Protein Abundance Prediction Through Machine Learning Methods // Journal of Molecular Biology. 2021. V. 433, № 22. P. 167267.
  6. Lin Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model // Science. 2023. V. 379, № 6637. P. 1123-1130.
  7. Fishman V. et al. GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences. 2023.
  8. Cherry J.M. et al. SGD: Saccharomyces Genome Database // Nucleic Acids Research. 1998. V. 26, № 1. P. 73-79.
  9. Huang Q. et al. PaxDb 5.0: Curated Protein Quantification Data Suggests Adaptive Proteome Changes in Yeasts // Molecular & Cellular Proteomics. 2023. V. 22, № 10.
  10. Schmirler R., Heinzinger M., Rost B. Fine-tuning protein language models boosts predictions across diverse tasks // Nat Commun. 2024. V. 15, № 1. P. 7407.

Bibliographic reference: Venzel A.S., Klimenko A.I., Ivanisenko T.V., Demenkov P.S., Lashin S.A., Ivanisenko V.A. An approach for predicting protein abundance in yeast cells based on their genomical sequences // Journal “Problems of Informatics”. 2024, № 4. P. 17-26. DOI: 10.24412/2073-0667-2024-4-17-26.


P.S. Demenkov, A.M. Mukhin, V.A. Ivanisenko, S.A. Lashin, N.A. Kolchanov

Kurchatov Genomic Center of the Institute of Cytology and Genetics of the Siberian Branch of the Russian Academy of Sciences (KGC ICG SB RAS), 630090, Novosibirsk, Russia

THE “MICROBITECH” DIGITAL PLATFORM: ARCHITECTURE AND PURPOSE

DOI: 10.24412/2073-0667-2024-4-27-36
EDN:HXJHCY

The article examines the architecture of the “Microbitech” digital platform, developed for solving a wide range of problems in systems and structural biology. It discusses the use of software integrated into the platform for processing and analyzing large volumes of genetic information, as well as for predicting the structure and function of proteins. Using the platform can increase research productivity, improve the accuracy of data analysis, and support the development of new research methods.

Key words: information platform, systems biology, bioinformatics.

References

  1. Bharadwaj, A., El Sawy, O.A., Pavlou, P. A., & Venkatraman, N. (2013). Digital business strategy: toward a next generation of insights. MIS quarterly, 471-482.
  2. Yoo, Y., Henfridsson, O., & Lyytinen, K. (2010). Research commentary—the new organizing logic of digital innovation: an agenda for information systems research. Information systems research, 21(4), 724-735.
  3. Goecks, J., Nekrutenko, A., Taylor, J., & Galaxy Team. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome biology, 11(8), R86.
  4. Grüning, B.A., Rasche, E., Rebolledo-Jaramillo, B., Eberhard, C., Houwaart, T., Chilton, J., ... & Backofen, R. (2017). Jupyter and Galaxy: Easing entry barriers into complex data analyses for biomedical researchers. PLoS computational biology, 13(5), e1005425.
  5. Afgan, E., Baker, D., van den Beek, M., Blankenberg, D., Bouvier, D., Cech, M., ... & Goecks, J. (2016). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic acids research, 44(W1), W3-W10.
  6. Lowe, R., Shirley, N., Bleackley, M., Dolan, S., & Shafee, T. (2017). Transcriptomics technologies. PLoS computational biology,
  7. Grüning, B., Chilton, J., Köster, J., Dale, R., Soranzo, N., van den Beek, M., ... & Backofen, R. (2019). Practical computational reproducibility in the life sciences. Cell systems, 8(3), 183-188.
  8. Afgan, E., Baker, D., Batut, B., van den Beek, M., Bouvier, D., Cech, M., ... & Blankenberg, D. (2018). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic acids research, 46(W1), W537-W544.
  9. Kluyver, T., Ragan-Kelley, B., Perez, F., Granger, B.E., Bussonnier, M., Frederic, J., ... & Ivanov, P. (2016). Jupyter Notebooks - a publishing format for reproducible computational workflows. In ELPUB (pp. 87-90).
  10. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., ... & Ghodsi, A. (2016). Apache Spark: a unified engine for big data processing. Communications of the ACM, 59(11), 56-65.
  11. Varia, J., & Mathew, S. (2014). Overview of Amazon Web Services. Amazon Web Services, 16.
  12. Chee, B. J., Franklin, J.C., & Chee, B. J. (2009). Cloud computing: Technologies and strategies of the ubiquitous data center. CRC Press.
  13. Pronozin A. Y., Salina E. A., Afonnikov D. A. GBS-DP: a bioinformatics pipeline for processing data coming from genotyping by sequencing. Vavilovskii Zhurnal Genetiki i Selektsii = Vavilov Journal of Genetics and Breeding. 2023; 27(7):737-745. DOI 10.18699/VJGB-23-86
  14. Ivanisenko VA, Saik OV, Ivanisenko NV, Tiys ES, Ivanisenko TV, Demenkov PS, Kolchanov NA. ANDSystem: an associative network discovery system for automated literature mining in the field of biology. BMC Syst Biol. 2015;9 Suppl 2(Suppl 2):S2. doi: 10.1186/1752-0509-9-S2-S2.
  15. Ivanisenko VA, Demenkov PS, Ivanisenko TV, Mishchenko EL, Saik OV. A new version of the ANDSystem tool for automatic extraction of knowledge from scientific publications with expanded functionality for reconstruction of associative gene networks by considering tissue-specific gene expression. BMC Bioinformatics. 2019;20(Suppl 1):34. doi: 10.1186/s12859-018-2567-6.

Bibliographic reference: Demenkov P.S., Mukhin A.M., Ivanisenko V.A., Lashin S.A., Kolchanov N.A. The “Microbitech” digital platform: architecture and purpose // Journal “Problems of Informatics”. 2024, № 4. P. 27-36. DOI: 10.24412/2073-0667-2024-4-27-36.


T.V. Ivanisenko, P. S. Demenkov, V.A. Ivanisenko

Kurchatov Genomic Center of ICG SB RAS, Novosibirsk 630090, Russia
Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences (SB RAS), Novosibirsk 630090, Russia

COMBINED APPROACH TO ASSOCIATIVE NETWORK RECONSTRUCTION: INTEGRATING GRAPHSAGE AND CO-OCCURRENCE STATISTICS

DOI: 10.24412/2073-0667-2024-4-37-45
EDN:LEXHCE

This study focuses on developing a hybrid approach for predicting molecular-genetic interactions that combines graph neural networks (GNNs) with co-occurrence analysis of entities in the scientific literature. The method’s effectiveness is demonstrated on the associative network of Escherichia coli, reconstructed using ANDSystem and its ANDDigest module. Compared to using GNNs alone, the combined approach significantly improved the accuracy of interaction predictions in terms of conformity to the original graph topology: the F1-score rose from 0.815 to 0.97, and the loss function value fell from 0.405 to 0.08. Evaluation on experimentally confirmed protein-protein interactions also demonstrated high model efficiency (F1-score 0.9799, Matthews correlation coefficient 0.9597). The proposed method can be applied in analyzing complex biological systems, planning experiments, and optimizing biotechnological processes.
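
The two quality measures quoted above are both computed from the confusion matrix of a binary link classifier. The sketch below shows the standard formulas. The function names are local helpers chosen for illustration, and the counts in the usage note are made-up numbers, not data from the study.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0
```

With 8 true positives, 8 true negatives, 2 false positives, and 2 false negatives, `f1_score(8, 2, 2)` is 0.8 and `mcc(8, 8, 2, 2)` is 0.6. Unlike F1, the Matthews coefficient also takes true negatives into account.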

Key words: graph neural networks, molecular-genetic interactions, text-mining, Escherichia coli, ANDSystem, ANDDigest, GraphSAGE.

References

  1. Bornmann L., Haunschild R., Mutz R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases // Humanities and Social Sciences Communications. 2021. № 8. P. 224.
  2. Kitano H. Systems biology: a brief review // Science. 2002. V. 295, № 5560. P. 1662-1664.
  3. Zhou J., Cui G., Hu S., Zhang Z., Yang C., Liu Z., Wang L., Li C., Sun M. Graph neural networks: A review of methods and applications // AI Open. 2020. V. 1. P. 57-81.
  4. Scarselli F., Gori M., Tsoi A.C., Hagenbuchner M., Monfardini G. The graph neural network model // IEEE Transactions on Neural Networks. 2008. V. 20, № 1. P. 61-80.
  5. Kolchanov N.A., Ignat’eva E.V., Podkolodnaya O.A., Likhoshvai V.A., Matushkin Yu.G. Gene networks // Vavilovskii Zhurnal Genetiki i Selektsii = Vavilov Journal of Genetics and Breeding. 2013. V. 17, № 4/2. P. 833-850.
  6. Zitnik M., Agrawal M., Leskovec J. Modeling polypharmacy side effects with graph convolutional networks // Bioinformatics. 2018. V. 34, № 13. P. i457-i466.
  7. Ivanisenko T.V., Demenkov P.S., Kolchanov N.A., Ivanisenko V.A. The new version of the ANDDigest tool with improved AI-based short names recognition // International Journal of Molecular Sciences. 2022. V. 23, № 23. P. 14934.
  8. Von Mering C., Jensen L. J., Snel B., Hooper S. D., Krupp M., Foglierini M. et al. STRING: known and predicted protein-protein associations, integrated and transferred across organisms // Nucleic Acids Research. 2005. V. 33, Suppl. 1. P. D433-D437.
  9. Ivanisenko V.A., Saik O.V., Ivanisenko N.V. et al. ANDSystem: an Associative Network Discovery System for automated literature mining in the field of biology // BMC Systems Biology. 2015. V. 9, Suppl. 2. P. S2.
  10. Ivanisenko V.A., Demenkov P. S., Ivanisenko T.V., Mishchenko E.L., Saik O.V. A new version of the ANDSystem tool for automatic extraction of knowledge from scientific publications with expanded functionality for reconstruction of associative gene networks by considering tissue-specific gene expression // BMC Bioinformatics. 2019. V. 20. P. 5-15.
  11. Hamilton W.L., Ying R., Leskovec J. Inductive representation learning on large graphs // Advances in Neural Information Processing Systems. 2017. V. 30.
  12. Blount Z.D. The unexhausted potential of E. coli // eLife. 2015. V. 4. P. e05826.
  13. Pontrelli S., Chiu T.Y., Lan E.I., Chen F.Y., Chang P., Liao J.C. Escherichia coli as a host for metabolic engineering // Metabolic Engineering. 2018. V. 50. P. 16-46.
  14. Choi K.R., Jang W.D., Yang D., Cho J.S., Park D., Lee S.Y. Systems metabolic engineering strategies: integrating systems and synthetic biology with metabolic engineering // Trends in Biotechnology. 2019. V. 37, № 8. P. 817-837.
  15. Hermjakob H., Montecchi-Palazzi L., Lewington C., Mudali S., Kerrien S., Orchard S., Vingron M., Roechert B., Roepstorff P., Valencia A., Margalit H., Armstrong J., Bairoch A., Cesareni G., Sherman D., Apweiler R. IntAct: an open source molecular interaction database // Nucleic Acids Research. 2004. V. 32, Suppl. 1. P. D452-D455.
  16. Wren J.D., Garner H.R. Shared relationship analysis: ranking set cohesion and commonalities within a literature-derived relationship network // Bioinformatics. 2004. V. 20, № 2. P. 191-198.
  17. Ivanisenko T.V., Saik O.V., Demenkov P. S., Ivanisenko N.V., Savostianov A.N., Ivanisenko V. A. ANDDigest: a new web-based module of ANDSystem for the search of knowledge in the scientific literature // BMC Bioinformatics. 2020. V. 21. P. 1-21.
  18. Loshchilov I., Hutter F. Decoupled Weight Decay Regularization // International Conference on Learning Representations (ICLR). 2019.

Bibliographic reference: Ivanisenko T.V., Demenkov P.S., Ivanisenko V.A. Combined Approach to Associative Network Reconstruction: Integrating GraphSAGE and Co-occurrence Statistics // Journal “Problems of Informatics”. 2024, № 4. P. 37-45. DOI: 10.24412/2073-0667-2024-4-37-45.


T. Lakhova, F. Kazantsev, T. Khlebodarova, Yu. Matushkin, S. Lashin

Kurchatov Genomic Center of the Institute of Cytology and Genetics SB RAS, 630090, Novosibirsk, Russia
Institute of Cytology and Genetics SB RAS, 630090, Novosibirsk, Russia
Novosibirsk State University, 630090, Novosibirsk, Russia

SOFTWARE MODULE FOR STUDYING THE REGULATION OF BACTERIAL METABOLIC PATHWAYS BY MATHEMATICAL MODELING METHODS

DOI: 10.24412/2073-0667-2024-4-46-55
EDN:LWFDRD

Mathematical modeling is widely used in microbial biotechnology. It is used to describe and understand metabolite fluxes and changes in metabolite concentrations, to examine pathways of protein biosynthesis, to make predictions about culture media costs for a given yield of target products, and so on. Standard approaches to modeling bacterial metabolism usually miss the regulatory processes operating at the genetic level. Meanwhile, the development of computational methods of genomic analysis reveals more and more such regulatory relationships. Accounting for these relationships during model reconstruction makes it possible to investigate finer details of bacterial metabolism control. This paper presents a software module that generates frame-based mathematical models from the structure of the bacterial gene network, extended with tools that take into account regulatory relationships in the bacterial genome. Models are generated as systems of ordinary differential equations in the SBML standard. The resulting mathematical model can then be studied in a variety of specialized modeling tools.
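
To show what a regulatory relationship looks like inside an ordinary differential equation, the sketch below integrates a single gene product whose synthesis is repressed by a regulator. It is a deliberately simplified illustration, not the module's output. It uses a plain Hill function rather than the generalized Hill functions cited in the reference list, a fixed-step Euler scheme, and made-up parameter values. All names are assumptions.

```python
def hill_repression(repressor: float, k_half: float, n: float) -> float:
    """Fraction of maximal synthesis left at a given repressor level."""
    return k_half ** n / (k_half ** n + repressor ** n)


def simulate_gene(repressor: float,
                  k_syn: float = 2.0,    # maximal synthesis rate (assumed)
                  k_deg: float = 0.5,    # first-order degradation rate (assumed)
                  k_half: float = 1.0,   # repressor level giving half repression
                  n: float = 2.0,        # Hill coefficient
                  x0: float = 0.0,
                  dt: float = 0.01,
                  steps: int = 5000) -> float:
    """Euler integration of dx/dt = k_syn * h(R) - k_deg * x; returns x(t_end)."""
    x = x0
    for _ in range(steps):
        dxdt = k_syn * hill_repression(repressor, k_half, n) - k_deg * x
        x += dt * dxdt
    return x
```

The product level settles near `k_syn * h(R) / k_deg`, so with these defaults raising the repressor from 0 to 1 halves the steady state, from about 4 to about 2.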

Key words: mathematical modeling, operon, gene network, differential equations.

References

  1. Faeder J.R., Blinov M.L., Hlavacek W.S. Rule-Based Modeling of Biochemical Systems with BioNetGen // Systems Biology. Methods in Molecular Biology. Humana Press, 2009. P. 113-167.
  2. Machado D. et al. Fast automated reconstruction of genome-scale metabolic models for microbial species and communities // Nucleic Acids Res. 2018. V. 46, № 15. P. 7542-7553.
  3. Kolchanov N.A. et al. Gene networks // Vavilovskii Zhurnal Genetiki i Selektsii = Vavilov Journal of Genetics and Breeding. 2013. V. 17, № 4/2. P. 833-850.
  4. Ratner V.A. A molecular genetic control system // Priroda = Nature. 2001. V. 3. P. 16-22.
  5. Kazantsev F.V. et al. System of automated generation of mathematical models of gene networks // Informatsionnyj vestnik VOGIS. 2009. V. 13, № 1. P. 163-169.
  6. Dräger A. et al. SBMLsqueezer 2: context-sensitive creation of kinetic equations in biochemical networks // BMC Syst. Biol. 2015. V. 9, № 1. P. 68.
  7. Lakhova T.N. et al. Algorithm for the Reconstruction of Mathematical Frame Models of Bacterial Transcription Regulation // Mathematics. 2022. V. 10, № 23. P. 4480.
  8. Likhoshvai V., Ratushny A. Generalized Hill function method for modeling molecular processes // J. Bioinform. Comput. Biol. 2007. V. 5, № 2b. P. 521-531.
  9. Skiena S.S. Graph Traversal // The Algorithm Design Manual. Springer, London, 2012. P. 145-190.
  10. Landini P. et al. The leucine-responsive regulatory protein (Lrp) acts as a specific repressor for σS-dependent transcription of the Escherichia coli aidB gene // Mol. Microbiol. 1996. V. 20, № 5. P. 947-955.
  11. Rippa V. et al. Specific DNA Binding and Regulation of Its Own Expression by the AidB Protein in Escherichia coli // J. Bacteriol. 2010. V. 192, № 23. P. 6136-6142.
  12. Hucka M. et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models // Bioinformatics. 2003. V. 19, № 4. P. 524-531.
  13. Keating S.M. et al. SBML Level 3: an extensible format for the exchange and reuse of biological models // Mol. Syst. Biol. 2020. V. 16, № 8.
  14. Welsh C. et al. libRoadRunner 2.0: a high performance SBML simulation and analysis library // Bioinformatics. 2023. V. 39, № 1.
  15. Hoops S. et al. COPASI - a COmplex PAthway Simulator // Bioinformatics. 2006. V. 22, № 24. P. 3067-3074.
  16. Ligon T.S. et al. GenSSI 2.0: multi-experiment structural identifiability analysis of SBML models // Bioinformatics. 2018. V. 34, № 8. P. 1421-1423.

Bibliographic reference: Lakhova T., Kazantsev F., Khlebodarova T., Matushkin Yu., Lashin S. Software module for studying the regulation of bacterial metabolic pathways by mathematical modeling methods // Journal “Problems of Informatics”. 2024, № 4. P. 46-55. DOI: 10.24412/2073-0667-2024-4-46-55.


S. Lashin, F. Kazantsev, T. Lakhova, Yu. Matushkin

Kurchatov Genomic Center of the Institute of Cytology and Genetics SB RAS,
Institute of Cytology and Genetics SB RAS,
Novosibirsk State University, 630090, Novosibirsk, Russia

DYNMICROBIOTECH: A SOFTWARE MODULE FOR AUTOMATIC RECONSTRUCTION OF FRAME-BASED DYNAMIC MODELS OF MICROBIAL GENE NETWORKS

DOI: 10.24412/2073-0667-2024-4-56-68
EDN: QVDKVH

Modern genetic technologies are used in industrial biotechnology to design microbial producer strains with target characteristics, based on close integration of experimental and computational approaches. The increasing availability of genomic data and of methods for their functional annotation requires the development of new methods of systems biology: in particular, methods for reconstructing the gene networks and metabolic pathways that control target processes and characteristics of microorganisms from information about sequenced genomes, as well as methods for building mathematical models of these networks and pathways. This paper presents the DynMicrobiotech software module for automatic reconstruction of frame-based mathematical models based on the generalized chemical-kinetic modeling method. The input to the module is the annotation and markup of the genome; the output is the generated model in the form of a system of ordinary differential equations written in SBML format.
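
For readers unfamiliar with SBML, the sketch below builds the outer skeleton of an SBML Level 3 Version 1 document (one compartment and a list of species) using only the Python standard library. It is an illustration of the output format, not of DynMicrobiotech itself. The function name and identifiers are hypothetical. Reactions and rate laws, which carry the actual differential equations, are omitted, and the result has not been checked with an SBML validator.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Namespace of SBML Level 3 Version 1 Core.
SBML_NS = "http://www.sbml.org/sbml/level3/version1/core"


def build_sbml_skeleton(model_id: str,
                        species_ids: Iterable[str],
                        compartment_id: str = "cell") -> str:
    """Return a minimal SBML-like document as a string (no reactions)."""
    root = ET.Element("sbml", {"xmlns": SBML_NS, "level": "3", "version": "1"})
    model = ET.SubElement(root, "model", {"id": model_id})

    compartments = ET.SubElement(model, "listOfCompartments")
    ET.SubElement(compartments, "compartment",
                  {"id": compartment_id, "constant": "true"})

    species_list = ET.SubElement(model, "listOfSpecies")
    for species_id in species_ids:
        ET.SubElement(species_list, "species", {
            "id": species_id,
            "compartment": compartment_id,
            "hasOnlySubstanceUnits": "false",
            "boundaryCondition": "false",
            "constant": "false",
        })
    return ET.tostring(root, encoding="unicode")
```

In practice a dedicated library such as libSBML would normally be used to write and validate such files. The standard-library version here only keeps the example self-contained.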

Key words: generalized chemical-kinetic method of modeling, differential equations, gene networks.

References

  1. Goodwin S., McPherson J.D., McCombie W.R. Coming of age: ten years of next-generation sequencing technologies // Nat. Rev. Genet. 2016. V. 17, № 6. P. 333-351.
  2. Quail M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers // BMC Genomics. 2012. V. 13, № 1. P. 341.
  3. Cowan A.E., Mendes P., Blinov M.L. ModelBricks - modules for reproducible modeling improving model annotation and provenance // npj Syst. Biol. Appl. 2019. V. 5, № 1.
  4. Gilbert D. et al. Towards dynamic genome-scale models // Brief. Bioinform. 2019. V. 20, № 4. P. 1167-1180.
  5. Karr J.R. et al. A Whole-Cell Computational Model Predicts Phenotype from Genotype // Cell. 2012. V. 150, № 2. P. 389-401.
  6. Kim W.J., Kim H.U., Lee S.Y. Current state and applications of microbial genome-scale metabolic models // Curr. Opin. Syst. Biol. 2017. V. 2. P. 10-18.
  7. Akberdin I.R. et al. In Silico Cell: Challenges and Perspectives // Math. Biol. Bioinforma. 2013. V. 8, № 1.
  8. Demin O., Goryanin I. Kinetic Modelling in Systems Biology. 2008.
  9. Hellerstein J.L. et al. Recent advances in biomedical simulations: a manifesto for model engineering // F1000Research. 2019. V. 8. P. 261.
  10. Ocone A., Millar A.J., Sanguinetti G. Hybrid regulatory models: a statistically tractable approach to model regulatory network dynamics // Bioinformatics. 2013. V. 29, № 7. P. 910-916.
  11. Funahashi A. et al. CellDesigner 3.5: A Versatile Modeling Tool for Biochemical Networks // Proc. IEEE. 2008. V. 96, № 8. P. 1254-1265.
  12. King Z.A. et al. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models // Nucleic Acids Res. 2016. V. 44, № D1. P. D515-D522.
  13. Lloyd C.M. et al. The CellML Model Repository // Bioinformatics. 2008. V. 24, № 18. P. 2122-2123.
  14. Malik-Sheriff R.S. et al. BioModels - 15 years of sharing computational models in life science // Nucleic Acids Res. 2019.
  15. Henkel R., Wolkenhauer O., Waltemath D. Combining computational models, semantic annotations and simulation experiments in a graph database // Database. 2015. V. 2015. P. 1-16.
  16. Kirk P.D.W., Babtie A.C., Stumpf M.P.H. Systems biology (un)certainties // Science. 2015. V. 350, № 6259. P. 386-388.
  17. Stanford N.J. et al. The evolution of standards and data management practices in systems biology // Mol. Syst. Biol. 2015. V. 11, № 12. P. 851.
  18. Beal J. et al. Communicating Structure and Function in Synthetic Biology Diagrams // ACS Synth. Biol. 2019. V. 8, № 8. P. 1818-1825.
  19. Bruggeman F.J., Westerhoff H.V. The nature of systems biology // Trends Microbiol. 2007. V. 15, № 1. P. 45-50.
  20. Likhoshvai V.A. et al. Generalized chemokinetic method for gene network simulation // Mol. Biol. 2001. V. 35, № 6. P. 919-925.
  21. Palsson B. The challenges of in silico biology // Nat. Biotechnol. 2000. V. 18. P. 1147-1150.
  22. Kurata H. et al. BioFNet: Biological functional network database for analysis and synthesis of biological systems // Brief. Bioinform. 2013. V. 15, № 5. P. 699-709.
  23. Ratner V.A. A molecular genetic control system // Priroda = Nature. 2001. V. 3. P. 16-22.
  24. Moodie S. et al. Systems Biology Graphical Notation: Process Description language Level 1 Version 1.3 // J. Integr. Bioinform. 2015. V. 12, № 2.
  25. Norsigian C.J. et al. BiGG Models 2020: multi-strain genome-scale models and expansion across the phylogenetic tree // Nucleic Acids Res. 2019. V. 48, № D1. P. D402-D406.
  26. Zhang F. et al. Systems biology markup language (SBML) level 3 package: multistate, multicomponent and multicompartment species, version 1, release 2 // J. Integr. Bioinform. 2020. V. 17, № 2-3. P. 0-74.
  27. Likhoshvai V.A. et al. A generalized chemical-kinetic method for modeling complex biological systems. Computer model of bacteriophage Lambda ontogenesis // Vychislitelnye texnologii = Journal of Computational Technologies. 2000. V. 5, № Special issue dedicated to the 10th anniversary of the Laboratory of Theoretical Genetics of the Institute of Cytology and Genetics SB RAS. P. 87-99.
  28. Kazantsev F.V. et al. System of automated generation of mathematical models of gene networks // Informatsionnyj vestnik VOGIS. 2009. V. 13, № 1. P. 163-169.
  29. Akberdin I.R. et al. “Electronic cell”: problems and perspectives // Matematicheskaya Biologiya i Bioinformatika = Mathematical Biology and Bioinformatics. 2013. V. 8, № 1. P. 287-307.
  30. Zhabotinsky A.M. Concentration auto-oscillations. Moscow: Nauka, 1974. P. 1-179.
  31. Hucka M. et al. The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 1 Core // J. Integr. Bioinform. 2015. V. 12, № 2. P. 382-549.
  32. Lakhova T.N. et al. Algorithm for the Reconstruction of Mathematical Frame Models of Bacterial Transcription Regulation // Mathematics. 2022. V. 10, № 23. P. 4480.
  33. Kanehisa M. Enzyme Annotation and Metabolic Reconstruction Using KEGG. 2017. P. 135-145.
  34. McDonald A.G., Boyce S., Tipton K.F. ExplorEnz: the primary source of the IUBMB enzyme list // Nucleic Acids Res. 2009. V. 37, № Database. P. D593-D597.
  35. Wittig U. et al. SABIO-RK - database for biochemical reaction kinetics // Nucleic Acids Res. 2012. V. 40, № D1. P. D790-D796.
  36. Kazantsev F.V. et al. MAMMOTh: A new database for curated mathematical models of biomolecular systems // J. Bioinform. Comput. Biol. 2018. V. 16, № 1. P. 1740010 (16 pages).
  37. Otasek D. et al. Cytoscape Automation: empowering workflow-based network analysis // Genome Biol. 2019. V. 20, № 1. P. 185.
  38. Hoops S. et al. COPASI - a COmplex PAthway Simulator // Bioinformatics. 2006. V. 22, № 24. P. 3067-3074.
  39. Cock P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics // Bioinformatics. 2009. V. 25, № 11. P. 1422-1423.

Bibliographic reference: Lashin S., Kazantsev F., Lakhova T., Matushkin Yu. DynMicrobiotech: a software module for automatic reconstruction of frame-based dynamic models of microbial gene networks // Journal “Problems of Informatics”. 2024, № 4. P. 56-68. DOI: 10.24412/2073-0667-2024-4-56-68.


A. Mukhin, D. Oschepkov, S. Lashin

Kurchatov Genomic Center of the Institute of Cytology and Genetics, SB RAS, 630090, Novosibirsk, Russia
Institute of Cytology and Genetics, SB RAS, 630090, Novosibirsk, Russia
Novosibirsk State University, 630090, Novosibirsk, Russia

A COMPUTATIONAL PIPELINE FOR DE NOVO RECOGNITION OF TRANSCRIPTION FACTOR BINDING SITES IN BACTERIAL GENOMES

DOI: 10.24412/2073-0667-2024-4-69-83
EDN: UGUBKF

The search for transcription factor binding sites (TFBSs) in bacterial genomes is one of the most important steps in studying these genomes and subsequently using them in biotechnology and microbiology. A TFBS is typically 5-20 base pairs long, and each transcription factor can bind to a set of sites with similar sequences. The concept of a motif describes such a spectrum of sequences with substantial (non-random) similarity. In molecular biology, a motif is a group (or, depending on the context, a representative of a group) of relatively short nucleotide (or amino acid) sequences that are similar because they perform the same biological function, e.g., binding a single transcription factor. Various bioinformatics approaches exploit this similarity directly for de novo motif detection in samples of genomic sequences; detection is possible only if the sample is sufficiently enriched in sequences sharing that similarity. When a bacterial genome is insufficiently annotated, as with a newly sequenced genome, de novo motif detection proves to be the most effective way of finding TFBSs. In this paper, we propose a set of computational motif search pipelines that take bacterial genome data and its primary annotation as input. The pipelines use two approaches: full-genome search, in which de novo motifs are sought in the set of promoters of a single genome, and phylogenetic footprinting, in which motifs are sought among the promoters of similar genes and/or operons. Together they give the researcher a comprehensive set of settings for obtaining the most complete site annotation, both of the whole genome and, in more detail, of the regulatory region of a selected gene.
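As a minimal illustration of the motif concept described above, the sketch below summarizes a group of aligned sites as a position frequency matrix and derives a consensus sequence. It is not the method used by the pipelines or by any of the tools they call. The function names and the four example sites are hypothetical and were invented for illustration only.

```python
from collections import Counter


def position_frequency_matrix(sites):
    """Count the nucleotides observed at each position of aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(length)]


def consensus(pfm):
    """Take the most frequent nucleotide per position (ties: alphabetical)."""
    return "".join(
        min(column, key=lambda nt: (-column[nt], nt)) for column in pfm
    )


# Hypothetical binding sites, invented for illustration only.
sites = ["TTGACA", "TTGACT", "TTTACA", "CTGACA"]
pfm = position_frequency_matrix(sites)
motif_consensus = consensus(pfm)
print(motif_consensus)
```

Here each site differs from the consensus in at most one position, which is the kind of non-random similarity that de novo motif detection relies on.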
The pipelines were implemented using the Nextflow platform together with scripts written in Python. The following tools are used within the pipelines: BoBro, for de novo motif search in the promoters of a single organism; MP3, for de novo motif search by phylogenetic footprinting in a set of promoters; GOST, for identifying similar genes and/or operons between two genome assemblies; OperonMapper, for determining the operon structure of the genome; and TomTom, for annotating de novo motifs. We have also developed an indexed metadata database of known bacterial genomes using the embedded SQLite DBMS, which significantly accelerates data retrieval for further calculations.
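The idea of an indexed metadata database in embedded SQLite can be sketched with Python's standard sqlite3 module. This is only an assumed illustration: the table name, column names, index name, and accession values below are hypothetical and do not reproduce the schema actually used in the pipelines.

```python
import sqlite3

# In-memory database for the sketch; a real deployment would use a file.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row of metadata per genome assembly.
conn.execute(
    """
    CREATE TABLE genome (
        accession TEXT PRIMARY KEY,
        organism  TEXT NOT NULL,
        level     TEXT
    )
    """
)
# The index lets lookups by organism avoid scanning the whole table.
conn.execute("CREATE INDEX idx_genome_organism ON genome (organism)")

# Placeholder records, invented for illustration only.
rows = [
    ("ACC_0001", "Escherichia coli", "complete"),
    ("ACC_0002", "Geobacillus icigianus", "scaffold"),
    ("ACC_0003", "Escherichia coli", "contig"),
]
conn.executemany("INSERT INTO genome VALUES (?, ?, ?)", rows)


def find_by_organism(connection, organism):
    """Return the accessions stored for one organism, sorted by accession."""
    cursor = connection.execute(
        "SELECT accession FROM genome WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return [row[0] for row in cursor]


# Ask SQLite how it would execute the lookup, to check the index is used.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT accession FROM genome WHERE organism = ?",
    ("Escherichia coli",),
).fetchall()

ecoli_accessions = find_by_organism(conn, "Escherichia coli")
print(ecoli_accessions)
```

Inspecting the query plan is a simple way to confirm that a lookup is served by the index rather than by a full table scan.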

Key words: pipeline, motifs, TFBS, genomics, Nextflow, Python, SQLite, JBrowse2, bioinformatics.

References

  1. Seemann T. Prokka: rapid prokaryotic genome annotation // Bioinformatics. 2014. V. 30. N 14. P. 2068-2069.
  2. Pachkov M., Balwierz P. J., Arnold P., Ozonov E., Nimwegen E. SwissRegulon, a database of genome-wide annotations of regulatory sites: recent updates // Nucleic Acids Research. 2012. 11. V. 41. N D1. P. D214-D220. https://academic.oup.com/nar/article-pdf/41/D1/D214/3645388/gks1145.pdf.
  3. Robison K., McGuire A. M., Church G. M. A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. Edited by R. Ebright // Journal of Molecular Biology. 1998. V. 284. N 2. P. 241-254. Access mode: https://www.sciencedirect.com/science/article/pii/S002228369892160X.
  4. Dudek C.-A., Jahn D. PRODORIC: state-of-the-art database of prokaryotic gene regulation // Nucleic Acids Research. 2022. V. 50. N D1. P. D295-D302.
  5. Liu B., Zhang H., Zhou C., Li G., Fennell A., Wang G., Kang Y., Liu Q., Ma Q. An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes // BMC Genomics. 2016. V. 17. P. 1-12.
  6. Tagle D. A., Koop B. F., Goodman M., Slightom J. L., Hess D. L., Jones R. T. Embryonic ε and γ globin genes of a prosimian primate (Galago crassicaudatus): Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints // Journal of Molecular Biology. 1988. V. 203. N 2. P. 439-455.
  7. Yang J., Chen X., McDermaid A., Ma Q. DMINDA 2.0: integrated and systematic views of regulatory DNA motif identification and analyses // Bioinformatics. 2017. V. 33. N 16. P. 2586-2588.
  8. Bailey T. L., Johnson J., Grant C. E., Noble W. S. The MEME Suite // Nucleic Acids Research. 2015. 05. V. 43. N W1. P. W39-W49. https://academic.oup.com/nar/article-pdf/43/W1/W39/17435890/gkv416.pdf.
  9. Sayers E. W., Bolton E. E., Brister J. R., Canese K., Chan J., Comeau D., Connor R., Funk K., Kelly C., Kim S., Madej T., Marchler-Bauer A., Lanczycki C., Lathrop S., Lu Z., Thibaud-Nissen F., Murphy T., Phan L., Skripchenko Y., Tse T., Wang J., Williams R., Trawick B., Pruitt K., Sherry S. Database resources of the national center for biotechnology information // Nucleic Acids Research. 2021. 12. V. 50. N D1. P. D20-D26. https://academic.oup.com/nar/article-pdf/50/D1/D20/42058080/gkab1112.pdf.
  10. Mukhin A. M., Kazantsev F. V., Klimenko A. I., Lakhova T. N., Demenkov P. S., Lashin S. A. The Web Platform for Storing Biotechnologically Significant Properties of Bacterial Strains // International Conference on Parallel Computing Technologies / Springer. 2021. P. 445-450.
  11. Taboada B., Estrada K., Ciria R., Merino E. Operon-mapper: a web server for precise operon identification in bacterial and archaeal genomes // Bioinformatics. 2018. 06. V. 34. N 23. P. 4118-4120. https://academic.oup.com/bioinformatics/article-pdf/34/23/4118/48921148/bioinformatics_34_23_4118.pdf.
  12. Ma Q., Liu B., Zhou C., Yin Y., Li G., Xu Y. An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale // Bioinformatics. 2013. 07. V. 29. N 18. P. 2261-2268. https://academic.oup.com/bioinformatics/article-pdf/29/18/2261/50782707/bioinformatics_29_18_2261.pdf.
  13. Bailey T. L. STREME: accurate and versatile sequence motif discovery // Bioinformatics. 2021. 03. V. 37. N 18. P. 2834-2840. https://academic.oup.com/bioinformatics/article-pdf/37/18/2834/50579626/btab203.pdf.
  14. Di Tommaso P., Chatzou M., Floden E. W., Barja P. P., Palumbo E., Notredame C. Nextflow enables reproducible computational workflows // Nature Biotechnology. 2017. V. 35. N 4. P. 316-319.
  15. Li G., Ma Q., Mao X., Yin Y., Zhu X., Xu Y. Integration of sequence-similarity and functional association information can overcome intrinsic problems in orthology mapping across bacterial genomes // Nucleic Acids Research. 2011. V. 39. N 22. P. e150-e150.
  16. Li G., Liu B., Ma Q., Xu Y. A new framework for identifying cis-regulatory motifs in prokaryotes // Nucleic Acids Research. 2011. V. 39. N 7. P. e42-e42.
  17. Mao X., Ma Q., Zhou C., Chen X., Zhang H., Yang J., Mao F., Lai W., Xu Y. DOOR 2.0: presenting operons and their functions through dynamic and integrated views // Nucleic Acids Research. 2014. V. 42. N D1. P. D654-D659.
  18. Peltek S., Bannikova S., Khlebodarova T. M., Uvarova Y., Mukhin A. M., Vasiliev G., Scheglov M., Shipova A., Vasilieva A., Oshchepkov D., Bryanskaya A., Popik V. The Transcriptomic Response of Cells of the Thermophilic Bacterium Geobacillus icigianus to Terahertz Irradiation // International Journal of Molecular Sciences. 2024. V. 25. N 22.
  19. Diesh C., Stevens G. J., Xie P., De Jesus Martinez T., Hershberg E. A., Leung A., Guo E., Dider S., Zhang J., Bridge C., et al. JBrowse 2: a modular genome browser with views of synteny and structural variation // Genome Biology. 2023. V. 24. N 1. P. 74.
  20. Pratt H., Weng Z. LogoJS: a Javascript package for creating sequence logos and embedding them in web applications // Bioinformatics. 2020. 03. V. 36. N 11. P. 3573-3575. https://academic.oup.com/bioinformatics/article-pdf/36/11/3573/50670952/bioinformatics_36_11_3573.pdf.

Bibliographic reference: Mukhin A., Oschepkov D., Lashin S. A computational pipeline for de novo recognition of transcription factor binding sites in bacterial genomes // Journal “Problems of Informatics”. 2024, № 4. P. 69-83. DOI: 10.24412/2073-0667-2024-4-69-83.