Date: 2020-06-01

Camille Terfve, associate and patent attorney at IP firm Mewburn Ellis, assesses what the emergence of big data means for medical sciences.

What is bioinformatics?

Bioinformatics is used here as an umbrella term that also covers work referred to as computational biology or medical informatics. It is an interdisciplinary field of science that combines biology, computer science, mathematics, statistics and engineering (including subfields such as control theory, information theory, thermodynamics, machine learning and artificial intelligence) to analyse and interpret biological and medical data. Both molecular and phenotypic data have become available on an unprecedented scale from genomics, proteomics, functional genomics, high-throughput cell screening, metabolomics and imaging platforms (amongst others).

This data has enabled researchers to study biological systems with both greater specificity (looking simultaneously at single molecules, molecular species or cells) and greater breadth (looking at entire proteomes, genomes, metabolomes or transcriptomes in a single assay). It has further been combined with higher-level medical data, such as the presence of a disease or disease subtype, response to treatment, relapse, comorbidity and medical history, to develop a better understanding of the aetiology of diseases and to tailor therapies to patients’ molecular profiles.

Exponential growth

The field was arguably born in the mid-nineties and has grown exponentially since, underpinned by parallel major technological advances in two areas: biological data collection technologies and computer science / information technology.

In the area of computer science, major gains in our ability to store, process and share data, driven by improvements in CPUs and disk storage, the development of the internet and, more recently, cloud computing, have revolutionised just about every aspect of human life in the last three decades.

In the area of biological data collection technologies, perhaps the first paradigm shift that contributed to the birth of the bioinformatics field relates to nucleic acid sequencing technologies. Whereas in the 1980s and 1990s Sanger sequencing allowed the sequencing of a single short fragment (up to a few hundred bases) at a time in a lengthy process, today high-throughput sequencing (born in the mid-nineties, also referred to as ‘next generation sequencing’ or NGS) allows the sequencing of an entire genome in a few hours and at an ever-decreasing cost.

Today, most types of biological or medical data that can be collected have a high-throughput and/or high-content equivalent. High-throughput assays (enabling the rapid collection of data from a large number of samples) and high-content assays (enabling the rapid collection of data about a large number of features from each sample) have dramatically changed the nature and sheer amount of data that comes out of biological and biomedical assays, and have opened up new possibilities.

To support, organise and share the vast amount of data generated through these technologies, databases and data resources have been created which are pivotal to bioinformatics, and hence to modern biological and medical sciences. These include well-known resources such as GenBank (a genetic sequence database comprising nucleotide sequences for over 300k organisms), UniProt (a database of protein sequence and function comprising 500k+ manually curated entries and over 180M automatically annotated entries), KEGG (a database of pathways, functions and utilities), Ensembl (a vertebrate genome browser for 200+ species), GEO (a gene expression data repository) and COSMIC (a database of somatic mutations in cancer), amongst many others. Tremendous and invaluable effort is spent on developing tools to create, maintain, share and analyse these data resources, many of which are freely accessible and at least partially publicly funded, for example by the NIH or EMBL-EBI.

How has bioinformatics changed the biological and medical science landscape?

This new approach to biological and biomedical sciences, with large amounts of data at its core, has had a large impact from health services to R&D, where projects that involve at least one bioinformatics element are becoming increasingly common, if not the norm. For example, in the pharmaceutical industry, bioinformatics tools and resources are used to identify drug targets, to perform drug repurposing, to identify drug candidates, to stratify patients, to study the effect of compounds in model systems, and to perform simulations that can reduce the experimental burden associated with drug development, optimisation and testing. In health services, bioinformatics is used to analyse genetic tests, to process and visualise medical images, and to build improved medical devices such as heart rate monitors and intelligent insulin pumps. Bioinformatics is also central to the fields of personalised and precision medicine, which affect both the pharma side and the healthcare provider side.

As a result, companies that have bioinformatics as a central element of their R&D, or even of their product, are becoming ever more numerous. These include giants such as Illumina, Roche (see e.g. Roche’s Ariosa), Google (via Google Health) and AstraZeneca, and smaller players such as Sano Genetics, Inivata, Eagle Genomics, Genomics plc, Cambridge Cancer Genomics and Seven Bridges Genomics, amongst many others. Further, the recognition that data sharing, openness and collaboration are essential to realise the full potential of the field has led to an increasing prevalence of academic-industrial partnerships (which were previously uncommon in the pharmaceuticals field), such as the Open Targets initiative, and to a push towards open science.