
DATA-150

Knowledge Creation 3

Big data is considered a revolutionary change in the way we process information and approach problems: we can now grapple with the full complexity of an issue rather than rely on surface-level models that attempt, but ultimately fail, to capture the whole picture. This new wave of data science techniques is crucial to understanding the complex systems that surround us in everyday life, including health care, urban development, engineering, humanitarian aid, infrastructure assessment, economic development, disease modeling, flood modeling, carbon emissions, and even human behavior. These are all systems that at one time seemed too complex or unpredictable to understand, but big data now provides a lens through which to study them. The systems that make up everyday life are exceedingly complex, and data science offers both an important understanding of those systems and an optimistic future, given the proper precautions.

Geoffrey West similarly describes the incredible complexity of the world in his book Scale. He examines a variety of systems, including biological organisms and city development, and describes how these systems are more complex than we initially believe. West attempts to explain these complex systems as an extension of people and our biological makeup. For example, he points to the fact that people are composed of even smaller subunits: cells. These cells are incredibly small and have their own complex processes. Despite this, people do not self-identify with one of their cells, or even a collection of them. Someone is highly unlikely to look at a piece of hair that fell out and think that it is them, even if it is a part of them. When the simple pieces are put together, they create a complex being in which the whole is much greater than the sum of its parts.

West further applies this theory to cities. These are complex systems that consist of infrastructure, networks of people, and more. Take one road, and that is not considered the city itself; similarly, one person does not make up a social network. These systems transcend the characteristics of their parts and must instead be examined as a whole. This illustrates why it is so difficult to break a complex system down by pure logic alone; such attempts tend to end in failure.

With this in mind, West points out that humanity's initial instinct is to expect that everything scales linearly. This is not inherently wrong thinking, as relationships do exist, but it is often too simplistic to capture the complexity of what is actually occurring. Instead, West illustrates how data science can reveal the true proportions of a variety of systems. He begins with the scaling of biological organisms. For example, one would initially expect an animal of double the weight, and thus double the size, to require double the food. This intuition turns out to be wrong, because there is an economy of scale: metabolic needs grow roughly as mass to the three-quarters power (Kleiber's law), so an animal twice as heavy needs only about 85% of the food a linear doubling would predict.
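To make that arithmetic concrete, here is a minimal sketch of the sublinear relationship, assuming the roughly three-quarters-power exponent; the constant and masses below are arbitrary illustrative values, not figures from the book.

```python
# Sketch of sublinear (Kleiber's-law-style) scaling: requirement grows as mass**0.75.
def metabolic_need(mass_kg: float, k: float = 70.0, exponent: float = 0.75) -> float:
    """Illustrative metabolic requirement; k is an arbitrary constant."""
    return k * mass_kg ** exponent

base = metabolic_need(50)      # a 50 kg animal
doubled = metabolic_need(100)  # an animal twice as heavy

print(f"ratio vs. the smaller animal: {doubled / base:.2f}")             # ~1.68, not 2.0
print(f"fraction of the naive linear doubling: {doubled / (2 * base):.2f}")  # ~0.84
```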

West extended this understanding of non-linearity to cities themselves, trying to uncover the underlying relationships that comprise them. More specifically, he looked at the wages, GDP per capita, population, infrastructure, patents, AIDS cases, and crime rates of cities and examined how they scaled from one place to another, comparing, for example, Oklahoma City and Los Angeles. The initial inclination is to assume that a city with double the population has double the wages, double the patents, and so on. Again, this assumption of linearity is wrong. Instead, there is superlinearity: these quantities scale with an exponent of roughly 1.15, so that each doubling of population brings not merely double but roughly 15% more per capita of these outputs. This factor ties together all of the stated variables, creating concrete links between them that would previously have gone unseen. Data science methods and big data provide the capability to see these underlying patterns and relationships that hold the world together as we know it, expanding our understanding beyond logic-based reasoning alone.
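For a sense of how such a scaling exponent is estimated in practice, the sketch below fits a line in log-log space to synthetic city data; the populations, constant, and noise are invented for illustration and are not West's actual measurements.

```python
# Estimating a superlinear urban-scaling exponent (~1.15) via a log-log fit on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
population = np.array([6.5e5, 1.3e6, 2.6e6, 5.2e6])               # hypothetical city populations
wages = 40.0 * population ** 1.15 * rng.uniform(0.95, 1.05, 4)    # synthetic, slightly noisy output

slope, intercept = np.polyfit(np.log(population), np.log(wages), deg=1)
print(f"estimated scaling exponent: {slope:.2f}")   # close to 1.15

# Consequence of superlinearity: doubling population more than doubles output.
print(f"factor from doubling population: {2 ** 1.15:.2f}")  # ~2.22, i.e. more than double
```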

The use of data science for good, and for reducing unfreedoms, is already well documented despite the field's relative youth. A clear example comes from Vietnam, whose people face increased flood risk as a result of rising sea levels and peri-urbanization. Without data science methods, researchers would be unable to predict where flooding is likely to occur with the specificity they can achieve now. Even a topic as seemingly simple as flooding involves a large number of variables that combine to produce an outcome, including surface material, ground type, soil type, elevation, sea level, slope, vegetation, and more. Before the advent of remote sensing and GIS tools such as ArcGIS, researchers had only a broad and incomplete sense of where flooding might occur based on sea level and slope. Now, researchers can understand where flooding is likely month to month and can warn people in advance to clear the area, protecting both people and businesses.
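As a purely illustrative sketch of this kind of predictive modeling (not the Vietnamese researchers' actual pipeline), one could imagine training a classifier on raster-derived variables like elevation, slope, soil class, and vegetation; everything below, from the feature names to the labels, is synthetic.

```python
# Toy flood-susceptibility classifier over hypothetical raster-derived features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.uniform(0, 30, n),    # elevation above sea level (m)
    rng.uniform(0, 15, n),    # slope (degrees)
    rng.integers(0, 4, n),    # soil / surface-material class (encoded)
    rng.uniform(0, 1, n),     # vegetation index (e.g., NDVI)
])
# Synthetic ground truth: low, flat cells flood more often.
y = ((X[:, 0] < 5) & (X[:, 1] < 3)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
print("per-cell flood probability:", model.predict_proba(X_test[:3])[:, 1])
```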

Staying with Vietnam, data science techniques have proven effective in understanding and predicting the spread of avian influenza in poultry. This disease has demanded government spending for years as it ravages the poultry population and occasionally transmits to humans. It has caused great unfreedoms in economic capability, as poultry, the main livelihood of many citizens in Vietnam, have died en masse, leading to large losses in economic output. Data science methods, including satellite imagery and kernel-based Bayesian models, have been able to predict that avian influenza cases cluster in peri-urban areas, an insight that would previously have been unavailable to the government and researchers without advanced data science techniques. This understanding led to a revised relief plan that vaccinated peri-urban poultry at a higher rate and with greater frequency than other areas, decreasing cases and restoring the freedoms of the people.
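The study's kernel-based Bayesian model itself is beyond a short snippet, but a simple kernel density estimate over case coordinates conveys the basic idea of detecting spatial clustering. The coordinates below are made up, and scikit-learn's KernelDensity stands in for the actual method.

```python
# Illustrative kernel density estimate showing a peri-urban cluster of (invented) case locations.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
# Hypothetical outbreak locations (lon, lat): a dense peri-urban cluster plus scattered rural cases
peri_urban = rng.normal(loc=[105.85, 21.02], scale=0.05, size=(80, 2))
rural = rng.uniform(low=[105.0, 20.5], high=[106.5, 21.5], size=(20, 2))
cases = np.vstack([peri_urban, rural])

kde = KernelDensity(kernel="gaussian", bandwidth=0.05).fit(cases)

# Compare estimated case density at a peri-urban point vs. a rural point
grid = np.array([[105.85, 21.02], [106.3, 20.7]])
density = np.exp(kde.score_samples(grid))
print(f"peri-urban density: {density[0]:.1f}   rural density: {density[1]:.1f}")
```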

It is also important to consider how data science techniques are perceived to impact the future of scientific theory as a whole. Chris Anderson believes that data science techniques remove the need for theory entirely. He argues that theories are often too broad and rely too heavily on establishing causal relationships, and he warns against this, saying that the data can speak for itself and that correlation alone is enough. In his view, theory and the requirement of causality restrict the amount of information that can be drawn from the data itself.

Anderson has examples to back his beliefs. The most prominent is J. Craig Venter discovering new species without ever being in the field. Anderson describes how Venter was able to analyze the genetic code of many organisms from his computer, and at lightning pace. The discovery did not require years of research built around a proposed hypothesis or prediction; instead, it was made almost instantly using big data. This, for Anderson, is why theory is dead.

A further example that supports Anderson's point is targeted advertising at Target. Target uses machine learning algorithms to predict which products a customer will be interested in, much like online advertising. In one well-known instance, Target analyzed the buying behavior of a teenage girl and began recommending baby products such as diapers, implicitly predicting that she was pregnant. She and her father were outraged, believing it had to be a mistake. They were wrong and the algorithm was right: it turned out she was indeed pregnant. This example illustrates the eerily accurate predictive power of data science methods when coupled with large data sets.
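Target's actual model is proprietary, so the following is only a hypothetical sketch of the general technique the story describes: a classifier that scores purchase histories to produce a pregnancy-style prediction. The basket features and data are entirely invented.

```python
# Hypothetical purchase-propensity model: score baskets with a logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 500
# Invented binary basket signals: unscented lotion, prenatal vitamins, large tote bags, cotton balls
X = rng.integers(0, 2, size=(n, 4))
# Synthetic label driven mostly by the first two signals
y = (X[:, 0] + X[:, 1] + 0.3 * rng.standard_normal(n) > 1.5).astype(int)

model = LogisticRegression().fit(X, y)
new_customer = [[1, 1, 0, 1]]  # bought lotion and vitamins
print(f"predicted probability: {model.predict_proba(new_customer)[0, 1]:.2f}")
```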

While these two examples are fascinating and awe-inspiring for the future, they must be taken with an attitude of cautious optimism. It is true that big data can surface incredible results that change our perception of fundamental systems, but data also contains noise, and some of the patterns it yields are nonsense. One important point is that with a large enough sample size, seemingly impossible coincidences become quite possible, and big data supplies exactly that sample size. One example comes from the book How Not to Be Wrong: The Power of Mathematical Thinking by Jordan Ellenberg. He examines an instance in which a group analyzing the Torah found that it appeared to predict real-world events when its letters were read in particular patterns. This caused hysteria that the book was all-seeing. The hysteria was put to rest by the observation that the same "predictive" patterns could supposedly be found in other books with no religious basis, illustrating that the patterns were instead products of random chance, which was always the far more likely explanation.
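That "look hard enough and patterns appear" effect is easy to simulate: generate many purely random variables and some pairs will still correlate strongly by chance alone. The numbers below are arbitrary.

```python
# Multiple-comparisons demo: unrelated random series still produce "strong" correlations.
import numpy as np

rng = np.random.default_rng(3)
n_vars, n_obs = 200, 30
data = rng.standard_normal((n_vars, n_obs))    # 200 unrelated random series

corr = np.corrcoef(data)                       # all pairwise correlations
upper = corr[np.triu_indices(n_vars, k=1)]     # each pair counted once
print(f"pairs tested: {upper.size}")
print(f"pairs with |r| > 0.5 by pure chance: {int((np.abs(upper) > 0.5).sum())}")
```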

This initial reservation about leaning fully into big data without theory is echoed by Robert Kitchin. He responded directly to Chris Anderson, declaring that theory was in fact not dead but could serve side by side with big data. He pointed to the growing potential for false conclusions when big data is used on its own, arguing that these errors come from sampling bias, since researchers must decide where and how they collect their data, as well as from spurious correlations that lack true causation. It is a well-documented phenomenon that correlations often arise by chance or from confounding factors rather than from any genuine causal link between the two variables.

Because of these pitfalls, Kitchin encourages a hybrid approach that incorporates big data into the current scientific model. He believes in the need to demonstrate true causation rather than settle for tenuous correlations, and in the need to vet data collection methods for potential biases. Kitchin is thus a proponent of big data, but one with realistic expectations for how it should be applied going forward.

My prediction for data science methods as applied to big data falls closely in line with Robert Kitchin's view. While I appreciate Chris Anderson's exuberant optimism, I find it ill-advised to throw away a tried and true system like the scientific method, which has been the backbone of investigative inquiry for centuries and has shown time and time again its unparalleled ability to decipher phenomena. The two approaches, the scientific method and big data, can work hand in hand to provide a comprehensive view of complex systems, building upon the investigative foundations of the past. Big data provides tremendous insight, but it must be checked by the safeguards of study we have already deemed so important in order to make conclusive discoveries of known merit.