In this guide, Manuel Duval, PhD, from Scientist.com looks at how pharma and biopharma companies can best prepare for the impact of big data in healthcare.
Reducing the odds
The ultimate goal of the set of technologies collectively referred to as big data is to improve the accuracy of decisions and to reduce the time it takes to make them. In competitive business environments, applying big data is not a choice; it is a necessity for sustaining the organisation and achieving growth. In healthcare, it is a moral obligation: reducing uncertainty and delivering optimised treatments. In short, big data is a means of reducing the odds.
Big data has been enabled primarily by the ability to collect and consume census-scale data sets for a given problem space. Before its advent, analysts relied on sampling from a population and, depending on the size of the sample, could only draw conclusions bounded by some level of uncertainty.
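To make the statistical point concrete, the toy sketch below (numbers invented for illustration) estimates a response rate from samples of increasing size: the 95% margin of error shrinks roughly as 1/√n and, once the whole population is observed, sampling uncertainty effectively disappears.

```python
import math
import random

random.seed(42)

# Hypothetical population of one million patients, 30% of whom respond
# to a treatment; the numbers are illustrative only.
population = [1] * 300_000 + [0] * 700_000
random.shuffle(population)

for n in (100, 10_000, 1_000_000):
    sample = population[:n]               # n == 1,000,000 is the census case
    p_hat = sum(sample) / n               # estimated response rate
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)  # 95% margin of error
    # Note: this formula ignores the finite-population correction, under
    # which a full census carries no sampling uncertainty at all.
    print(f"n={n:>9,}  estimate={p_hat:.3f}  margin=+/-{margin:.4f}")
```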
At the base of the big data stack is the physical layer: the very large clusters of racks in data centres operated by numerous cloud service providers (CSPs) and accessible via broad network access. Aside from analytics, the ‘big’ attribute of big data also has a meaning in information technology: above a certain size threshold (terabytes, petabytes and beyond), data sets cannot be handled with traditional data storage technologies. The prevailing solution in that area is the open-source Hadoop Distributed File System (HDFS), whose specifications are maintained by the non-profit Apache Software Foundation.
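In practice, HDFS can be used much like an ordinary file system. A minimal sketch, assuming a reachable cluster, using the open-source pyarrow library (the namenode host and paths below are hypothetical):

```python
# A minimal sketch, assuming a reachable HDFS cluster; the namenode host and
# paths are hypothetical, and pyarrow needs a local Hadoop client (libhdfs).
import pyarrow.fs as fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Directories are listed exactly as on a local disk, even though HDFS shards
# and replicates the underlying blocks across the cluster.
for info in hdfs.get_file_info(fs.FileSelector("/data/clinical")):
    print(info.path, info.size)

# Reading a file transparently reassembles it from its distributed blocks.
with hdfs.open_input_stream("/data/clinical/trial_results.csv") as stream:
    print(stream.read(200))  # peek at the first bytes
```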
From the perspective of a new or prospective big data user, the good news is that there is no need to become an expert in HDFS technology to take advantage of the power of big data. Another great advantage is that big data does not require capital expenditure. Instead, it is funded from the operational side of the organisation, which removes the ‘big’ headache of procuring hardware. It is no longer necessary to allocate significant time to planning the right IT infrastructure and hoping to target the right size. However, some planning remains necessary.
To a first approximation, procuring computing services can be loosely compared to purchasing transportation services. One does not need to invest in a vehicle that may not fit one's varying needs all year round. Instead, one uses a ride-sharing service that fits the current need, with no parking, no vehicle maintenance and no consequences of an accident or broken parts to deal with. The same more or less holds true for cloud computing services, with some differences worth considering:
- Data hosting: For big data, one needs a ‘home’ for the data sets the organisation plans to mine. In that respect, CSPs offer several options: (i) free data storage with pay-per-analysis pricing. This option is usually offered by large CSPs whose extremely efficient query engines allow users to mine terabytes of data in minutes or even seconds; the CSP charges on the amount of data queried, not on the storage, which comes in handy when one has very large data sets to host but only needs to query them sparsely (a sketch of this model follows the list); (ii) a mix of storage and usage fees, renting space for the data at a discount with an additional charge when IO- and CPU-intensive queries are run.
- Resource availability: Some CSPs offer pre-emptible resources at a very large discount, which means that if the business deploys an application that requires long CPU time over numerous nodes but can be stopped and restarted without losing its state, it can procure pre-emptible virtual machines. In the event the CSP needs them back because of a spike on its side, the user is alerted at short notice, with enough time to stop and then restart once the resources become available again (a checkpointing sketch follows the list). Another point to consider is network access: large CSPs provide multi-regional data centres in the US, Europe and Asia. When high availability and low data latency are required, it is preferable to procure services from the geographically closest data centre.
- Portability and range of services: This is another critical aspect in selecting a CSP, depending on the business needs. Cloud services come in three flavours: (i) IaaS, Infrastructure as a Service, which applies when the organisation only needs to procure computing, storage and network resources, regardless of OS and software; some IaaS providers allow users to deploy their own virtual machine solution on top of the bare-metal hardware; (ii) PaaS, Platform as a Service: on top of the hosted operating system, users have the option to run services such as a web server, file system management, database management and so on. Users in that case are not involved in maintaining these applications; they deploy their own higher-level programs on top of these services via CSP-provided APIs; (iii) SaaS, Software as a Service: the turnkey solution, where a specific application is hosted on the cloud by the CSP and users consume it directly, usually via a web browser. Google Docs is a typical example of SaaS.
- Applications: Some CSPs provide advanced machine learning solutions that one can readily apply to one's own data space (provided, of course, that the data is large enough).
- Access management: Regarding data security, the jury is still out, and one could argue that data is safer when hosted by one of the major CSPs than on premises, depending on who administers the firewall. That being said, when dealing with large data sets, user access management is clearly critical. Good CSPs provide identity and access management (IAM) solutions that allow fine-grained control over each user’s access to assets and resources (an illustrative policy follows the list). The most useful IAM offerings provide several levels of administration.
- Last but not least, precious time is usually spent developing applications and keeping them consistent across different environments. Portability of systems is paramount and readily achievable with technologies derived from virtual machines. Containers allow you to create an image of a full deployment stack and port it to a completely different environment. Another major advantage of containers is that they can be stopped and restarted quite easily, which allows savings on resource usage: once done with a job, one can stop the container, store it as an image in a free hosting hub and restart it only when needed (see the container sketch below). Major CSPs also provide ways to configure the environment to scale, allowing a true model of computing on demand.
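On the data hosting point, here is a minimal sketch of the pay-per-analysis model, assuming Google BigQuery as the query service; the project and table names are hypothetical, and only the bytes actually scanned by a query are billed:

```python
# A minimal sketch of the pay-per-analysis model, assuming Google BigQuery
# (pip install google-cloud-bigquery). Project and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-pharma-project")

sql = """
    SELECT compound_id, AVG(response) AS mean_response
    FROM `my-pharma-project.trials.outcomes`
    GROUP BY compound_id
"""

# Dry run first: estimate what the query would scan, and hence cost, for free.
dry = client.query(sql, job_config=bigquery.QueryJobConfig(
    dry_run=True, use_query_cache=False))
print(f"Query would scan {dry.total_bytes_processed:,} bytes")

# Actual run: terabytes can be scanned in seconds, billed per byte queried.
for row in client.query(sql).result():
    print(row.compound_id, row.mean_response)
```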
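On resource availability, the stop-and-restart discipline that pre-emptible machines demand boils down to periodic checkpointing. The sketch below is generic and not tied to any particular CSP's preemption API; the file name and work loop are illustrative:

```python
# A generic checkpointing sketch for pre-emptible machines: persist progress
# often enough that losing the VM only costs one unit of work. The file name
# and the work loop are illustrative.
import json
import os

CHECKPOINT = "checkpoint.json"

def load_checkpoint():
    # Resume from the last saved state, or start fresh on the first run.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_batch": 0, "results": []}

def save_checkpoint(state):
    # Write to a temp file then rename, so a preemption mid-write cannot
    # leave a corrupt checkpoint behind.
    with open(CHECKPOINT + ".tmp", "w") as f:
        json.dump(state, f)
    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

state = load_checkpoint()
for batch in range(state["next_batch"], 1000):
    state["results"].append(batch * batch)  # stand-in for the real workload
    state["next_batch"] = batch + 1
    save_checkpoint(state)  # if the VM is reclaimed here, nothing is lost
```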
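On access management, the snippet below illustrates what fine-grained IAM looks like in practice, using an AWS-style policy document as an example; the bucket name is hypothetical, and other CSPs express the same idea with their own schemas. It grants an analyst read access to one data set and nothing more:

```python
# An illustrative AWS-style IAM policy: the analyst may read objects in one
# (hypothetical) bucket and nothing else. Other CSPs use similar schemas.
import json

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AnalystReadOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::pharma-trial-data",
                "arn:aws:s3:::pharma-trial-data/*",
            ],
        }
    ],
}

print(json.dumps(read_only_policy, indent=2))
```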
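Finally, on containers, the stop, snapshot and restart workflow can be driven programmatically. A minimal sketch with the Docker SDK for Python (pip install docker); the image, container and repository names are hypothetical:

```python
# A minimal sketch of the container workflow using the Docker SDK for Python.
# Image, container and repository names are hypothetical.
import docker

client = docker.from_env()

# Run the analysis stack packaged as an image; detach keeps it in the background.
container = client.containers.run("pharma/analysis-stack:latest",
                                  name="nightly-job", detach=True)

# Once the job is done, stop the container to release compute resources...
container.stop()

# ...snapshot it as an image that can be pushed to a hub and restarted anywhere.
container.commit(repository="myhub/analysis-stack", tag="after-run")
container.start()  # or restart later, on a different machine entirely
```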