Chapter 2 What is Computational Biology?

This chapter is mostly notes.

2.1 Notes

2.1.1 NIH definition

See file NIH_def_bioinfo_2000.rmd for full text of the NIH document

NOTES:

  • Bioinformatics is relevant to all life sciences (not just biology)
  • Bioinformatics and computational biology are both “rooted” in the life sciences, computer science, and information sciences and technologies
  • Bioinformatics: application focused; accessing and understanding data
  • Computational biology: theory/discovery approaches

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

2.2 Luscombe et al 2001

N.M. Luscombe, D. Greenbaum, M. Gerstein. Bioinformatics definition committee. 2001. What is Bioinformatics? A Proposed Definition and Overview of the Field https://www.thieme-connect.com/products/ejournals/abstract/10.1055/s-0038-1634431

“Our definition is as follows: Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.” Luscombe et al 2001

“Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences, and the results of functional genomics experiments (eg expression data). Additional information includes the text of scientific papers and “relationship data” from metabolic pathways, taxonomy trees, and protein-protein interaction networks.” Luscombe et al 2001

2.3 Tapprich et al 2021

Tapprich et al. 2021. An instructional definition and assessment rubric for bioinformatics instruction. https://iubmb.onlinelibrary.wiley.com/doi/full/10.1002/bmb.21361

“An interdisciplinary field that is concerned with the development and application of algorithms that analyze biological data to investigate the structure and function of biological polymers and their relationships to living systems.”

2.4 Smith 2015 Frontiers in Genetics

Smith. 2015. Broadening the definition of a bioinformatician. Frontiers in Genetics https://www.frontiersin.org/articles/10.3389/fgene.2015.00258/full

“On a given day, I spend much of my research time staring at nucleotide sequences on a computer screen and theorizing about the evolution of genomes; thus, I feel comfortable calling myself a bioinformatician, or at the very least a scientist who primarily uses bioinformatics for his research. If asked, most of my colleagues, mentors, and students would also define me as a bioinformatician. But there is one small catch: I don’t know how to program computer software or curate databases, and I am even quite pathetic at writing UNIX commands, which according to some precludes me from having the title of bioinformatician.” Smith (2015)

“I imagine that many of the scientists reading this essay will consider me an imposter, an amateur who points, clicks, and stumbles his way through the complicated landscape of bioinformatics. … I believe that as a research community we need to broaden our definition of what it means to be a bioinformatician, not restricting it to only those who develop software or design and maintain data resources. … [T]he term bioinformatician should encompass the countless and ever growing number of scientists who use computers and bioinformatics programs to address fundamental questions in biology.” Smith (2015)

2.5 Good readings / references

Koonin, EV. 20xx. Primer: Computational genomics. Current Biology - Magazine. R155-158…

2.7 Original info, from Eco Data Science project

[NOTE: This section is currently under development. The paper by Touchon & McCoy (2016) and its references lay out many of the reasons for the statistical focus of this book and relate to all biology, not just ecology.]

“Ecological questions and data are becoming increasingly complex and as a result we are seeing the development and proliferation of sophisticated statistical approaches in the ecological literature. … It is no longer sufficient to only ask ‘whether’ or ‘which’ experimental manipulations significantly deviate from null expectations. Instead, we are moving toward parameter estimation and asking ‘how much’ and in ‘what direction’ ecological processes are affected by different mechanisms” (Touchon & McCoy 2016, Ecosphere, emphasis mine)

“Spreadsheets are often used as the basis of data collection and education; but this is potentially problematic since spreadsheets typically do not promote good data management practices…. The features of spreadsheets that make them desirable for the average researcher, such as extensibility, use of formatting for organization, embedding charts, make them undesirable for preparing data for long‐term archiving and reuse.” (Strasser & Hampton 2012, Ecosphere)

2.8 What is data science?

Data analysis includes “procedures for analyzing data, techniques for interpreting the results…, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.” John Tukey, “The future of data analysis”, Annals of Mathematical Statistics, 1962.

  • People argue about what data science is
  • What Tukey calls “data analysis” is now termed “data science” by many.
  • Some define data science as closely allied with computer science and want its use most closely associated with things like “big data”, data mining, machine learning, and artificial intelligence.
  • Others, such as RStudio’s Hadley Wickham (creator of ggplot2, dplyr, and most of the infrastructure of the tidyverse of R packages), define it more broadly to involve all aspects of the life cycle of data.
  • (Wickham also offers the definition “A data scientist is a statistician who is wearing a bow tie” https://twitter.com/hadleywickham/status/906146116412039169?lang=en)

2.9 References

Strasser & Hampton 2012. The fractured lab notebook: undergraduates and ecological data management training in the United States. Ecosphere. https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/ES12-00139.1

Touchon & McCoy 2016. The mismatch between current statistical practice and doctoral training in ecology. Ecosphere. https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/ecs2.1394

Tukey. 1962. The future of data analysis. Annals of Mathematical Statistics. https://www.jstor.org/stable/2237638

2.10 Bibliography

Relevant papers cited by Touchon & McCoy 2016.

Barraquand, F., T. H. G. Ezard, P. S. Jørgensen, N. Zimmerman, S. Chamberlain, R. Salguero‐Gómez, T. J. Curran, and T. Poisot. 2014. Lack of quantitative training among early‐career ecologists: a survey of the problem and potential solutions. PeerJ 2:e285.

Butcher, J. A., J. E. Groce, C. M. Lituma, M. C. Cocimano, Y. Sánchez‐Johnson, A. J. Campomizzi, T. L. Pope, K. S. Reyna, and A. C. S. Knipps. 2007. Persistent controversy in statistical approaches in wildlife sciences: a perspective of students. Journal of Wildlife Management 71:2142–2144.

Ellison, A. M., and B. Dennis. 2009. Paths to statistical fluency for ecologists. Frontiers in Ecology and the Environment 8:362–370.

Germano, J. D. 2000. Ecology, statistics, and the art of misdiagnosis: the need for a paradigm shift. Environmental Reviews 7:167–190.

Quinn, J. F., and A. E. Dunham. 1983. On hypothesis testing in ecology and evolution. American Naturalist 122:602–617.