Data manipulation is the process in which scientific data is forged, presented in an unprofessional way or changed with disregard for the rules of the academic world. Data manipulation may result in a distorted perception of a subject, which may lead to false theories being built and tested. An experiment based on manipulated data is risky and unpredictable.
In the modern world we encounter data manipulation every day. Arguably the most common kind of data manipulation is misuse of statistics: many click-bait article titles on the internet are based on misuse of statistics, as are some political and economic arguments. It is these examples that we will concentrate on in this chapter.
Misuse of statistics does include data forgery, the process in which data is created without any connection to the object of the data. But the most important kinds of misuse of statistics are those that involve real data presented in a manner that may be misleading and even dangerous.
This chapter will try to describe the kinds of data manipulation that exist and the ways to deal with them. Most importantly, it covers the red flags to look for when reading an article or a project that might be signs of data manipulation.
Part one: Kinds of Data Manipulation and Reasons behind Them
Omitting important facts and factors
This is part of an even bigger issue: scientists are looking for results (because results mean research grants and so on), and so they sometimes deliberately or unintentionally manipulate data to fit their hypothesis.
When designing an experiment, scientists have to compile a list of relevant factors; for a political poll, for example, these can be the age, income or religious beliefs of the participants. The weak point here is that the scientists may fail to treat an important factor as relevant. If a study “Computer games – art or not?” were conducted on participants between the ages of fifty and sixty, its results would probably be quite different from the results of the same study conducted on participants between the ages of fifteen and twenty. If the resulting publication does not clearly state the age of the participants, that is an example of data manipulation, and specifically of misuse of statistics.
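The effect of an unreported age range can be sketched in a few lines of Python. This is only an illustration: all counts are invented and do not come from any real poll, and the names POLL and share_yes are assumptions made for this sketch.

```python
# Hypothetical answers to "Are computer games art?".
# All counts are invented for illustration, not real data.
POLL = {
    "15-20": {"yes": 85, "no": 15},
    "50-60": {"yes": 30, "no": 70},
}

def share_yes(groups):
    """Share of 'yes' answers across the given age groups."""
    yes = sum(POLL[g]["yes"] for g in groups)
    total = sum(POLL[g]["yes"] + POLL[g]["no"] for g in groups)
    return yes / total

# The same question gives very different headline numbers
# depending on which age groups were actually asked.
for groups in (["15-20"], ["50-60"], ["15-20", "50-60"]):
    print(f"ages {' and '.join(groups)}: {share_yes(groups):.1%} say yes")
```

A publication that reports only the headline percentage, without saying which of these groups was asked, leaves the reader unable to tell the three cases apart.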
How can this be avoided?
Most scientists conducting polls and studies in the modern world do not gather the data themselves. In the “Computer games – art or not?” example described above, the scientist or scientists behind the study would put down all the specifics of the poll, all of the questions and any other relevant information, and then send them to a specialized polling company that would do the actual data gathering.
There are many such companies in the world: some are famous (for example Gallup, Inc.), some are infamous, and some are not well known. Whichever company the scientists, journalists or anyone else publishing poll data uses, it has to be named in the publication. If no polling company (consulting company etc.) is mentioned in the publication, the data should not be taken seriously.
The better the company used in the experiment, the higher the chances that it has a transparent system that lets anyone interested investigate how its polls are conducted. Companies with a reputation for working for big firms like Apple or Microsoft are usually quite transparent, because for them being caught manipulating data would mean losing clients.
Small, little-known companies do not risk as much and thus may be more tempted to manipulate data. There are two main reasons for doing so. First, the company wants to please the client by delivering the expected poll results. A good example is companies that provide political ratings for dictatorships: Nicolae Ceaușescu, the Romanian communist leader, reportedly had an approval rating of 94 % two days before his death by firing squad (which happened as a result of a revolution in the country). Many dictators had such high ratings right before their downfalls; this can only be explained by data manipulation (well, data imagination really).
But this can also happen with big manufacturing corporations. Imagine a big tobacco company that wants to commission research on the probability of cancer resulting from smoking the cigarettes it sells. There is a definite result the company wants to get: that the probability is no higher than for non-smokers. Knowing this might lead a polling company to manipulate data to get results that would please the client. Some big corporations even have pocket polling companies, which exist only to create data that supports the corporation’s intentions.
A good real-life example is the Volkswagen scandal of 2015, in which the Volkswagen corporation falsified information about the nitrogen oxide emissions of its diesel cars. This led to the sale of cars that polluted up to forty times more than allowed by law. The falsification was done using a “defeat device”: software that turned on full emission controls only when it detected that the car was being tested in a laboratory.
The easiest way to expose this kind of manipulation is to look at the reputation of the company that conducted the research and the clients it has had.
The second reason for a small company to manipulate data is that research is hard to do. Conducting a poll with a few thousand participants is a lot of work in many different respects. It is easier to just make up data or to copy data from other research on the same topic (this is technically plagiarism, not data manipulation, but the two go together). We will expand on this later with the case of “Irregularities in LaCour”.
False causality and illogical sequences
False causality and illogical sequences are another way of manipulating data. This kind of falsification is done to deceive those who are not quite familiar with the subject of the research. Imagine there is a cage of mice with a specific fur color. A scientist follows the family through multiple generations and comes up with this theory, which is later published: “Every generation of brown mice has more deaths than the previous one”. This is a really simple example but it gets the point across. The reason for this statistic (which is true) is not the color of the mice but the fact that each new generation has more mice than the previous one and thus has more deaths.
A graph of deaths per generation is misleading here (note that it also omits the growing number of births). This helps to understand the idea of “misuse of statistics”: the facts in the graph are true, but they are not connected in the way the work is trying to show.
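The mouse example can be checked with a small calculation. The sketch below uses invented numbers (a starting population of 20, a population that doubles every generation and a constant 10 % death rate); none of these figures come from a real experiment, and the function name is an assumption made for illustration.

```python
# Hypothetical mouse colony: the population doubles every generation
# while the per-mouse death rate stays constant. Numbers are invented.
def deaths_per_generation(start_population, growth_factor, death_rate, generations):
    rows = []
    population = start_population
    for gen in range(1, generations + 1):
        deaths = round(population * death_rate)
        # Record both the raw death count and the death rate.
        rows.append((gen, population, deaths, deaths / population))
        population *= growth_factor
    return rows

rows = deaths_per_generation(start_population=20, growth_factor=2,
                             death_rate=0.10, generations=5)
for gen, population, deaths, rate in rows:
    print(f"generation {gen}: population={population:4d} "
          f"deaths={deaths:3d} death rate={rate:.0%}")
```

The death count grows with every generation, yet the death rate never moves. Reporting only the count, without the population size, is what makes the true statement misleading.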
How can this be avoided?
Always remember that you can’t compare apples and oranges. A graph or a statistic usually compares data, and it is important to understand what kind of data it is and how it is connected. For example: people buy more lighters, more people get cancer, therefore lighters are bad for you. The wrong turn here is that in actuality people get cancer because they smoke, and they buy lighters because they smoke. Smoking is the common cause behind both numbers. So examining the logical connections that the research makes is quite important.
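One way to test such a chain of reasoning is to split the data by the suspected common cause and compare again. The sketch below does this for the lighter example. The population counts are invented purely for illustration, and the names GROUPS, cancer_rate and rate_for are assumptions of this sketch, not part of any real study.

```python
# Hypothetical population counts, invented for illustration only.
# Each row: (smokes, buys_lighters, people, cancer_cases)
GROUPS = [
    (True,  True,  2400, 360),
    (True,  False,  600,  90),
    (False, True,   700,  14),
    (False, False, 6300, 126),
]

def cancer_rate(rows):
    people = sum(row[2] for row in rows)
    cases = sum(row[3] for row in rows)
    return cases / people

def rate_for(rows, buys_lighters, smokes=None):
    """Cancer rate among lighter buyers (or non-buyers), optionally
    restricted to smokers only or to non-smokers only."""
    selected = [
        row for row in rows
        if row[1] == buys_lighters and (smokes is None or row[0] == smokes)
    ]
    return cancer_rate(selected)

# Pooled comparison: lighters look dangerous.
print(f"all people:  buyers {rate_for(GROUPS, True):.1%}, "
      f"non-buyers {rate_for(GROUPS, False):.1%}")

# Comparison within each smoking group: the difference disappears.
for smokes, label in ((True, "smokers"), (False, "non-smokers")):
    print(f"{label}: buyers {rate_for(GROUPS, True, smokes):.1%}, "
          f"non-buyers {rate_for(GROUPS, False, smokes):.1%}")
```

In these made-up numbers lighter buyers have a far higher cancer rate overall, but among smokers alone and among non-smokers alone the rate is the same for buyers and non-buyers. The apparent effect of lighters comes entirely from smoking.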
Data dredging and fact fitting
Data dredging is part of a bigger problem mentioned earlier: scientists wanting to get results. Data dredging is the process in which researchers look through large amounts of data trying to find patterns. The amount of data picked for dredging is usually so large that there will be at least one or two coincidences on which a theory can be based. With the introduction of computers this became even easier, because a computer is much better at finding patterns in even larger amounts of data.
This leads to publications that are irrelevant or based on pure coincidence. The pattern is a coincidence in the sense that it does not mean anything, but not in the sense of being unexpected: with that much effort put into searching, some pattern was bound to turn up.
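How quickly coincidences pile up can be estimated with a short calculation. The sketch below assumes independent statistical tests and the conventional 5 % significance threshold; the function names are illustrative, and the simulation relies on the simplifying assumption that a test of a hypothesis with no real effect produces a p-value spread uniformly between 0 and 1.

```python
import random

def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    looks 'significant' purely by chance."""
    return 1 - (1 - alpha) ** num_tests

def dredge(num_tests, alpha=0.05, seed=0):
    """Count chance 'discoveries' among hypotheses that are all false.

    Assumption: with no real effect, each p-value is uniformly
    distributed between 0 and 1, so it is drawn at random here.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(num_tests)]
    return sum(p < alpha for p in p_values)

for m in (1, 20, 100):
    print(f"{m:3d} tests: chance of at least one false positive = "
          f"{chance_of_false_positive(m):.1%}")

print("chance findings among 1000 meaningless hypotheses:", dredge(1000))
```

With twenty independent tests the chance of at least one spurious "discovery" is already close to two in three, and with a hundred it is nearly certain. A researcher who reports only the pattern that passed, and not the number of patterns tried, hides exactly the information needed to judge it.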
Fact fitting is in a sense the opposite of fact omitting: facts are shaped to fit a certain theory. This happens, for example, to a lot of historians. For hundreds of years historians around the world have tried to come up with an ideal theory of history that would help us predict the future of humanity and understand its past. Probably the guiltiest of fact fitting are some Marxist philosophers of the twentieth century, whose quite simple view of history as a repetitive process led them to say quite fascinating things.
How can this be avoided?
Check the facts. If a scientific fact is actually a scientific fact, then it probably comes up in more than one publication. It might be something revolutionary that has not been mentioned before (as many click-bait titles suggest), but consider this semi-realistic click-bait title: “Businessmen are hiding this way to make money. Get $$$ today”. Think about the number of people involved: if something like that were actually true, would you be the first one to discover it? This applies to scientific studies too. A shocking new discovery? Ask why it has not been made before, then go over all the previous points and check the research.
These three points do not cover everything concerning data manipulation and misuse of statistics, but they do provide the most important tools for detecting them in a field of which one has at least a basic comprehension. Most people never get a chance as big as the one in the example in part two, but most meet questionable articles at some point in their lives.
Part two: LaCour and Scientifically Based Data Manipulation
There are unique situations when data is not just manipulated, but manipulated professionally. Facts are pushed in different directions by a person who knows what he is doing; an easy place to see this at work is any political debate. But here we would like to look at the previously mentioned case of “Irregularities in LaCour”: an exposure by David Broockman, then a graduate student at UC Berkeley, of the work of another graduate student, Michael LaCour, who forged enormous amounts of data in a significant study on the perception of gay marriage. The case unfolded in 2014-2015 and rocked the scientific world.
Summary of LaCour’s research (As put down by LaCour)
According to LaCour, he hired the company USamp for research that would prove his theory: people’s views on gay marriage can change dramatically after a conversation with someone who is homosexual. It was a large-scale poll with about ten thousand respondents. The research supported LaCour’s theory, which was a new and unique result: all previous results in similar works had shown that people hardly change their political and social views.
What Broockman did
Broockman was familiar with LaCour’s work from its early stages and was greatly impressed by it. He wanted to conduct similar experiments. After looking into it he found out that neither USamp nor any other company could have conducted such research on a graduate student’s budget. This was his first hint. He was not sure about it; at the time he was not trying to debunk LaCour’s theory. He did not go public with it because it is easy to gain the reputation of someone who does no work of his own and just tries to ruin the work of others. Many of the scientists and researchers Broockman talked to told him not to publish such material.
Later Broockman and Josh Kalla (a friend and colleague) noticed some specific irregularities (a polite term for “mistakes and falsifications”) in the data used: it did not look random enough. Kalla then found the dataset (the 2012 CCAP) that LaCour had copied, which was the last piece of evidence Broockman and Kalla needed to publish their report.
LaCour lost his newly obtained position at Princeton and his reputation; it will now be really hard for him to return to the world of science.
Broockman made the headlines and spoke a lot about debunking and academic integrity.
While some people still argue about the competence of LaCour’s research, it has now been revealed that he had not in fact hired any polling company, had forged a letter from USamp and had lied in later interviews. After all of this there seems to be no reason to raise the question of “competence” at all.
Lessons to be learned
LaCour wanted a result that would make him a first-class researcher, and for a while he succeeded: before the exposure he had time to become quite famous. To obtain such a result he plagiarized data and manipulated it to fit his theory, thus committing a whole set of violations of academic integrity. He was caught and is now a great example of how debunking works. The possibility of such exposure is one of the main protections of the scientific world against academic dishonesty and thus should be encouraged.
The Conclusive Part Three: Education on Data Manipulation
While a lot is being done to expose and debunk data manipulation, the subject is not part of popular culture. Many people see this kind of work as scientific mumbo jumbo where one nerd tells another nerd that he lost an “X” somewhere in an equation. First of all, as seen in the LaCour example, debunking does not always involve high-level science: Broockman simply started checking how LaCour did his research.
Second, and maybe even more importantly, publication of false data may cause harm, as in the Volkswagen example already described or in the medical field. LaCour’s data nearly led to the reform of many systems and structures concerned with political and social views because of its “original” content. Even modest knowledge like that provided by this chapter can stop people from being misled about certain subjects and teach them to question others’ research even when it looks quite legitimate at first glance.