The birth of Bioinformatics
Turning the first turfs - The origins of a novel discipline
To do bioinformatics, we need computers. In the 1950s, the word "computer" still mostly referred to a person, usually a woman, who performed calculations by hand, but gradually this started to change. The ERA 1101, later renamed the UNIVAC 1101, was designed by Engineering Research Associates, a company that Remington Rand later acquired, and was one of the first commercially produced computers in 1950. It was followed by the IBM 701, the IBM 650 with its magnetic drum memory, the Digital Electronic Universal Computing Engine (DEUCE), and machines from RCA, Autonetics, and GE.
In the 1960s, computers were expensive, room-sized machines, typically programmed with punched cards. Examples include the DEC PDP-1, the NEAC 2203, and several IBM models, among them the 7000 series, the company's first line to use transistors. RCA's Spectra series and machines from Burroughs, Honeywell, and Packard Bell were also in use, to mention some.
Programming in assembly language was tedious, so the development of high-level languages, which made programming much easier, was essential. In 1957, John Backus and his team at IBM released the FORTRAN (FORmula TRANslation) programming language for the IBM 704. The following year, John McCarthy at MIT invented LISP. A collaboration between several computer manufacturers and the Pentagon produced COBOL (Common Business-Oriented Language) in 1960. The 1970s saw the introduction of Pascal by Niklaus Wirth and of C, created by Dennis Ritchie and his team, who also rewrote the UNIX code in C.
The notable early pioneers of biological data analysis and storage were Margaret O. Dayhoff, Richard V. Eck, and Robert S. Ledley.
Ledley drafted the book 'Use of Computers in Biology and Medicine' in 1960 and published it in 1965. It explored the possible applications of digital computing in biology and medicine.
James A. Fowley commented on the book in the Quarterly Review of Biology: "Because of the 'computer revolution' there are many new research opportunities which are not obvious to most biologists ... Why has biological research been so slow to take advantage of what seems to be a great breakthrough? Obviously, one reason is the great gap between what the computer engineers know can be done, and what the biological researchers might ask them to do."
In their early attempts, Eck and Dayhoff searched the published literature and compiled all known protein sequences, using punched cards to store them all in a computer for analysis. In short, they estimated the number of amino acid changes over time by comparing known protein sequences. These analyses resulted in the widely used PAM (point accepted mutation) matrices in 1978(*). They also introduced the one-letter code for amino acids, which is still in use today.
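The idea of comparing sequences to count amino acid changes can be illustrated with a minimal Python sketch. It is not Dayhoff's actual procedure, which worked from many closely related proteins; it only assumes two sequences that are already aligned, written in the one-letter code, with "-" marking a gap. The function names and the toy sequences are hypothetical and chosen for illustration only.

```python
from collections import Counter


def count_substitutions(seq_a, seq_b):
    """Count unordered amino acid pairs that differ between two aligned sequences.

    Assumes both sequences use the one-letter code, have equal length,
    and mark gaps with "-". Gapped positions are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter()
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-" or a == b:
            continue
        # Sort the pair so that (S, T) and (T, S) are counted together.
        counts[tuple(sorted((a, b)))] += 1
    return counts


def fraction_different(seq_a, seq_b):
    """Return the fraction of ungapped aligned positions that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


if __name__ == "__main__":
    # Toy sequences, invented for this example (not real proteins).
    first = "MKTAYIAK"
    second = "MKSAYLAK"
    print(dict(count_substitutions(first, second)))
    print(fraction_different(first, second))
```

Tallying such pair counts over many related sequences gives the kind of raw substitution data from which a scoring matrix could be derived; the further steps that lead to a PAM matrix are not shown here.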
The first printed edition of Dayhoff and Eck's collection was published in 1965 and entitled the Atlas of Protein Sequence and Structure, followed by early publications. The first edition contained 65 proteins. This collection of sequences later developed into an international effort of data collection centers that included the International Protein Information Database in Japan (now PDBj) and the Martinsried Institute for Protein Sequences (MIPS), together with the National Biomedical Research Foundation (NBRF), which founded the Protein Identification Resource (PIR). In 2002, PIR became part of UniProt.
In 1982, Dayhoff and colleagues published the first nucleic acid sequence database in a paper entitled 'Nucleic acid sequence database computer system,' which they had submitted in September 1981. In the same year that the article was published, the National Institutes of Health (NIH) created GenBank.
When Francis Crick published a revamped version of his Central Dogma in 1970(*), he likely reminded many scientists of the importance of DNA acting as an information database in all living organisms and that the flow of information is the central constituent of life. Interestingly, in the same year, Paulien Hogeweg and Ben Hesper used the term bioinformatics and proposed the first definition, defining it as "the study of informatic processes in biotic systems."(*)