The Advent Of Big Data Will Solve More Problems Than It Causes

‘Big Data’ is a somewhat nebulous term commonly used to describe the huge and complex datasets that arise from living in a largely digitised world. It encompasses all kinds of information across many domains: from global weather measurements or the medical statistics of hospital patients, to the efficiency of the electric grid or your online shopping habits. The huge amount of information generated all over the world may offer important and beneficial insights if it can be processed appropriately; however, there are several inherent challenges to overcome. The NIST Big Data Interoperability Framework [1] defines big data as “refer[ring] to the inability of traditional data architectures to efficiently handle the new datasets” – so essentially any data that we are struggling to process or derive value from with existing technology. This analysis will consider some of the problems associated with big data, and ultimately argue that the potential benefits outweigh the costs.

Big data has already been instrumental in allowing researchers to extract useful insights from large simulations across a range of scientific domains as varied as climatology, astrophysics and engineering. A number of hardware and software tools, such as Hadoop, Mahout, and Giraph, have been developed to facilitate analysis of these large datasets [2]. Tools of this kind are key to making big data accessible to groups who may not have deep knowledge of machine learning algorithms, and as such the potential contribution of big data relies upon them to some extent.
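To give a sense of the programming model that tools like Hadoop build upon, the following is a minimal, single-process sketch of the map/shuffle/reduce pattern in Python; the records and the counting task are purely illustrative rather than drawn from any real dataset, and a real framework would distribute each phase across a cluster.

```python
# A minimal, single-process sketch of the map/shuffle/reduce pattern that
# frameworks such as Hadoop distribute across a cluster. The records and
# the counting task are purely illustrative.
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs; here, one count per observed event type."""
    for event in record["events"]:
        yield event, 1

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate each key's values; here, a simple sum of counts."""
    return key, sum(values)

records = [
    {"events": ["search", "click", "purchase"]},
    {"events": ["search", "search", "click"]},
]

pairs = (pair for record in records for pair in map_phase(record))
totals = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(totals)  # {'search': 3, 'click': 2, 'purchase': 1}
```

The value of frameworks like Hadoop is that they handle the distribution, fault tolerance and shuffling between phases automatically, so analysts only need to supply the map and reduce logic.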

Making proper use of big data is not necessarily a simple task: any insights gained are only as valid as the data they rest on and the methods used to derive them. Given the popularity of big data, it is easy to treat it as some kind of magic bullet, suited to any problem where large datasets are available, but this can lead to dangerous assumptions about the accuracy of any insights derived from it. Bollier writes:

“How can the numbers be interpreted accurately without unwittingly introducing bias? As a large mass of raw information, Big Data is not self-explanatory. And yet the specific methodologies for interpreting the data are open to all sorts of philosophical debate. Can the data represent an ‘objective truth’ or is any interpretation necessarily biased by some subjective filter or the way that data is ‘cleaned?’” [3]

Google Flu Trends’ failure to accurately predict influenza in 2012 [4] underlines this point. The failure was not due to any inherent shortcoming in the data itself, but rather to Google’s interpretation and use of it. However, as better practices, tools and methodologies for dealing with big data become available, it is likely that these kinds of problems will become easier to avoid.

The rapid growth of big data and its accompanying technologies poses another problem: due to the huge volume of information being generated, the costs of maintaining secure and reliable storage are significant. Whilst Google, Amazon, Microsoft, Facebook and others have spent billions of dollars building commercial data centers, it appears that research groups and governments have not yet made sufficient investment in data management [5]. This is problematic because massive amounts of data require massive systems, which themselves represent engineering, environmental and design challenges – and without sufficient and timely investment in such infrastructure, non-commercial entities may struggle to keep pace with the giant tech corporations that currently dominate the big data landscape. However, earlier investment would solve this problem, and that seems likely to occur as the benefits of big data become increasingly apparent. Similarly, another potential short-term problem arising from big data’s rapid rise is a lack of professionals with relevant skills – but this can be surmounted by education and training programs, at least until those jobs become automated themselves.

A key issue that has yet to be adequately addressed is the balance between big data and individual privacy. The cost to individual autonomy of widespread data collection and use must be considered, as there are several privacy risks associated with the current handling of big data, and the problem is only becoming more complex as the world becomes more interconnected. It is not uncommon for data regarding an individual to include many personal and potentially identifying details, such as health information, location, browsing history, power use and more. In the past, to maintain the privacy of their users, institutions and businesses collecting large amounts of data implemented a variety of ‘de-identification’ methods, such as encryption, anonymization, pseudonymization, data sharding and key-coding [6]. Recently, however, it has become clear that these methods may not be sufficient to protect an individual’s right to privacy. Ohm describes how:

“scientists have demonstrated they can often ‘reidentify’ or ‘deanonymize’ individuals hidden in anonymized data with astonishing ease. By understanding this research, we will realize we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed.” [7]
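To make the de-identification techniques listed above more concrete, here is a minimal sketch of key-coding in Python, in which direct identifiers are replaced with a keyed hash; the records, field names and key handling are hypothetical, and a real system would manage the secret key far more carefully.

```python
# A minimal sketch of key-coding: direct identifiers are replaced with a
# keyed hash so the dataset can be analysed without plain names. The
# records, field names and key handling are hypothetical.
import hashlib
import hmac

SECRET_KEY = b"kept-separately-from-the-dataset"  # illustrative only

def pseudonymise(record):
    """Replace the direct identifier with a keyed hash; keep other fields."""
    token = hmac.new(SECRET_KEY, record["name"].encode(), hashlib.sha256)
    out = dict(record)
    out["name"] = token.hexdigest()[:16]
    return out

patients = [
    {"name": "Alice Example", "age": 34, "postcode": "2000", "diagnosis": "flu"},
    {"name": "Bob Example", "age": 71, "postcode": "2621", "diagnosis": "flu"},
]

for row in map(pseudonymise, patients):
    print(row)  # age, postcode and diagnosis survive as quasi-identifiers
```

Note that the quasi-identifiers left behind (age, postcode) are precisely the foothold that the re-identification research Ohm describes relies upon.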

However, this may not be as troubling as it seems. As Masiello and Whitten note, if the data in question cannot be attributed to a given individual with certainty (which is likely to be the case in a very large, anonymised dataset), it does not pose a significant privacy risk, as there can be no definitive identification. If the individual cannot be identified, then her anonymity cannot be said to have been breached:

“Many of the most pressing privacy risks, however, exist only if there is certainty in re-identification, that is if the information can be authenticated. As uncertainty is introduced into the reidentification equation, we cannot know that the information truly corresponds to a particular individual.” [8]
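A toy illustration of this uncertainty argument: if an attacker knows only a target’s age bracket and postcode, the number of released records matching that background knowledge bounds how confidently any single record can be attributed to the target. The dataset and choice of quasi-identifiers below are hypothetical.

```python
# A toy measure of re-identification certainty: how many released records
# match an attacker's background knowledge about a target? The dataset and
# the chosen quasi-identifiers (age bracket, postcode) are hypothetical.
released = [
    {"age_bracket": "30-39", "postcode": "2000", "diagnosis": "flu"},
    {"age_bracket": "30-39", "postcode": "2000", "diagnosis": "asthma"},
    {"age_bracket": "30-39", "postcode": "2000", "diagnosis": "flu"},
    {"age_bracket": "70-79", "postcode": "2621", "diagnosis": "flu"},
]

def anonymity_set_size(dataset, **quasi_identifiers):
    """Count records consistent with the attacker's knowledge of the target."""
    return sum(
        all(row[key] == value for key, value in quasi_identifiers.items())
        for row in dataset
    )

k = anonymity_set_size(released, age_bracket="30-39", postcode="2000")
print(f"{k} matching records -> at best a 1-in-{k} guess")  # 3 matching records
```

The larger that matching set, the weaker the claim that any particular individual has been identified, which is exactly the uncertainty Masiello and Whitten invoke.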

Perhaps a more pressing issue for the field of big data as a whole is that these concerns over privacy and autonomy could instigate an overzealous regulatory backlash, halting beneficial progress and innovation in the field. For this reason, it is wise to address privacy concerns as a matter of urgency, and develop some system of best practice to ensure that progress in big data may continue without sacrificing too much individual autonomy. Tene and Polonetsky [9] also call for “the development of a model where the benefits of data for businesses and researchers are balanced against individual privacy rights” which seems like an excellent idea, although that balance may be difficult to pinpoint.

Is big data useful enough to warrant the time and investment necessary to solve these complex problems? Absolutely. Big data offers a whole new way of discovering knowledge, allowing the creation of entirely data-driven theories. Anderson elaborates:

“Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” [10]

Indeed, applied big data is already changing the world for the better, and in more ways than there is room to list here. A tiny sample: South African researchers using the health records of HIV and AIDS patients discovered an important link between delayed AIDS onset and vitamin B consumption [11]; the Mexican government is using big data to reform not only its education system, but its tax and energy systems too [12]; big data is being used to push for climate change action [13]; and it seems likely to revolutionise healthcare [14].

It is undeniable that the rise of big data has created several problems that demand attention, the most complex of which appear to be privacy and infrastructure – but it seems certain that these will be overcome as the field matures over the coming years, as laid out above. In any case, these concerns pale in comparison to the potential of big data to provide crucial scientific, economic and sociological insights, drive important research, spark innovation and generate knowledge across many domains. It therefore appears obvious that the advent of big data will solve far more problems than it causes.

References

  1. NIST Big Data Public Working Group. (n.d.). NIST Big Data Interoperability Framework: Volume 1, Definitions. NIST Special Publication 1500-1. https://doi.org/10.6028/NIST.SP.1500-1

  2. Reed, D. A., & Dongarra, J. (2015). Exascale computing and big data. Communications of the ACM, 58(7), 56–68. https://doi.org/10.1145/2699414

  3. Bollier, D. (2010). The Promise and Peril of Big Data. Eighteenth Annual Aspen Institute Roundtable on Information Technology. Retrieved from https://assets.aspeninstitute.org/content/uploads/files/content/docs/pubs/The_Promise_and_Peril_of_Big_Data.pdf

  4. Butler, D. (2013). When Google got flu wrong. Nature, 494(7436), 155–156. https://doi.org/10.1038/494155a

  5. Reed, D. A., & Dongarra, J. (2015). Exascale computing and big data. Communications of the ACM, 58(7), 56–68. https://doi.org/10.1145/2699414

  6. Tene, O., & Polonetsky, J. (2012). Privacy in the Age of Big Data. Stanford Law Review Online. Retrieved October 10, 2017, from https://www.stanfordlawreview.org/online/privacy-paradox-privacy-and-big-data/

  7. Ohm, P. (2009, August 13). Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006

  8. Masiello, B., & Whitten, A. (2010). Engineering Privacy in an Age of Information Abundance. 2010 AAAI Spring Symposium Series, 36(1), 119–124. Retrieved from https://www.aaai.org/ocs/index.php/SSS/SSS10/paper/viewFile/1188/1497

  9. Tene, O., & Polonetsky, J. (2012). Privacy in the Age of Big Data. Stanford Law Review Online. Retrieved October 10, 2017, from https://www.stanfordlawreview.org/online/privacy-paradox-privacy-and-big-data/

  10. Anderson, C. (2008). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired, 16(7).

  11. Masiello, B., & Whitten, A. (2010). Engineering Privacy in an Age of Information Abundance. 2010 AAAI Spring Symposium Series, 36(1), 119–124. Retrieved from https://www.aaai.org/ocs/index.php/SSS/SSS10/paper/viewFile/1188/1497

  12. Ueti, R. da M., Espinosa, D. F., Rafferty, L., & Hung, P. C. K. (2016). Case Studies of Government Use of Big Data in Latin America: Brazil and Mexico (pp. 197–214). Springer, Cham. https://doi.org/10.1007/978-3-319-30146-4_9

  13. United Nations Global Pulse. (n.d.). Big Data Climate Challenge. Retrieved October 11, 2017, from https://www.unglobalpulse.org/big-data-climate-challenge-2014

  14. Cano, I., Tenyi, A., Vela, E., Miralles, F., Roca, J., & Pi Sunyer, A. (2017). Perspectives on Big Data applications of health information. Current Opinion in Systems Biology, 3, 36–42. https://doi.org/10.1016/j.coisb.2017.04.012