Thinking Data
On the importance of data, how it is created, and how it can be used to mislead you.
“It's easier to fool people than to convince them that they have been fooled.”
-Attributed to Mark Twain
You can be smart and good at math but still be fooled by the data you read.
Did you know that the Australian Bureau of Statistics considers you “employed” even if you work only one hour a week, and that you are “fully employed” simply if you don’t want more working hours than you currently have?
If you are like most people I talk to, you are probably wearing the same look of surprise they did right now — understandably, since the definition used by the ABS on the matter of employment flies in the face of what most people think employment and full employment mean. This example, however, is far from the only one in the world of high-profile data. In fact, there are countless studies and reports where the definition of a variable can similarly catch your attention and make you wonder “why?”. Another interesting example comes from the World Economic Forum’s Global Gender Gap Report 2023. Section A: Computation and composition of the Global Gender Gap Index (page 62) states under “Gender equality vs. women’s empowerment” that
the [Global Gender Gap Index] rewards countries that reach the point where outcomes for women equal those for men, but it neither rewards nor penalizes cases in which women are outperforming men in particular indicators in some countries. Thus, a country that has higher enrolment for girls rather than boys in secondary school will score equal to a country where boys’ and girls’ enrolment is the same.
Regardless of where you stand on the matter of gender equality, it stands to reason that if you design an index to measure a gap between two populations — and to reward or penalise a country based on said gap — the index should not turn a blind eye, by design, to data that might not be politically palatable to report on or take into consideration.
Even when an article or report does go into the detail of why and how these definitions are used, the perceived disconnect between the definition used and what people might expect it to be raises the point I want to bring forth: the data you see reported in every article, report, study, communication, etc. is the direct consequence of the method used to collect and filter it, and of the biases of the people in charge of producing it. In scientists’ and analysts’ defence, we have to make decisions every day about what data to include in or exclude from our models and reports, and it is sometimes not easy to know how best to define a variable or create a model that best represents the problem we want to engage with — but this is not an excuse to condone explicitly biased behaviour for political expediency or otherwise convenient reasons. Sadly, nonetheless, these behaviours abound, and this is especially true when people deal with and study complex problems like “society”, “health”, “poverty”, “violence”, and so on.
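To make the point concrete, here is a minimal sketch, using entirely made-up survey responses rather than actual ABS microdata, of how the same raw answers produce different headline figures depending on the definition adopted:

```python
# Toy illustration: the same "survey" classified under two different
# definitions of employment. The numbers are invented for illustration
# and do not come from any real labour-force survey.
from dataclasses import dataclass

@dataclass
class Respondent:
    hours_worked_last_week: float
    wants_more_hours: bool

survey = [
    Respondent(0, True),    # not working, looking for work
    Respondent(1, True),    # one casual hour, wants more
    Respondent(8, True),    # part-time, wants more
    Respondent(38, False),  # full-time, satisfied
    Respondent(45, False),  # full-time, satisfied
]

def employment_rate(people, min_hours):
    """Share of people counted as 'employed' under a given hours threshold."""
    employed = [p for p in people if p.hours_worked_last_week >= min_hours]
    return len(employed) / len(people)

# "Employed" if you worked at least one hour in the reference week (the broad definition)
print(f"1-hour definition:  {employment_rate(survey, 1):.0%} employed")
# A stricter, hypothetical definition requiring at least 15 hours
print(f"15-hour definition: {employment_rate(survey, 15):.0%} employed")
```

Nothing about the underlying responses changes between the two print statements; only the definition does, and with it the number that ends up in the headline.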
There is no effective policy that we can concoct to prevent people from trying to misrepresent reality, or from outright lying to others. There will never be. It therefore becomes critical that people like us pause to understand the data we come across — instead of simply outsourcing our critical thinking to the “experts” and their army of acolytes chanting that we should “trust the science”. In this light, the first and most crucial step in forming a solid line of defence is to truly understand the data we encounter on a daily basis by understanding how it was created — or, if we do not have the time to commit to that understanding, to at least take it with a grain of salt. After all, we might all know how to calculate something like the mean or the median of a dataset, but these metrics are useless at best and misleading at worst if the data underneath them does not appropriately capture the problem at hand, and if we do not understand whether that data is what others claim it to be.
We can all be fooled simply by not understanding, or not caring, how the data we read about was created in the first place — and we will then strongly resist any evidence that we have been fooled, leaving us at the mercy of statistical and scientific sophistry.
Methodology: the cradle and graveyard of data.
Broadly speaking, the framework used within a report or study to collect and analyse data — and which therefore shapes the results — is known as the “methodology”, and this framework is the birthplace of data. Depending on the type and style of publication and reporting, this section is usually located in the middle or at the end of the publication, or sometimes buried two or three links away from the report on a website. Sometimes, however — and this is mostly the case for news pieces, fact-checking exercises, and other forms of short communication — the methodology is simply not presented at all, and depending on the data and report at hand you might need to file a Freedom of Information request to get it.
As implied earlier, the methodology behind a study or report is the bedrock of the validity of its data, and thus of the usefulness and truth of the work done. This is why the methodology is not only the place where data is born, but also where it goes to die.
If I were to present a study of the number of stars in the sky, for example, and my methodology were simply to count how many I could see with the naked eye at night from a single spot on Earth, I’d be laughed out of the room. Bad methodology simply begets bad studies, bad results, and bad conclusions — and sometimes bad policy. The latter is especially problematic when complex matters, such as economic, ecological, social, and human behaviour, are not or cannot appropriately be accounted for and brought into the model, leading to the infamous Law of Unintended Consequences. Not even the best studies are immune to it.
While the aforementioned example is deliberately simplistic for the sake of conveying the point, it nonetheless serves to underline how critical the methodology of a study is, so that real-life, serious examples like the ones listed in the first section can be properly appreciated.
Biases in methodology abound, and they are especially pernicious in academic circles where scholars behave like activists first and foremost, and where scientific rigour appears to be a luxury that can only be afforded provided it does not interfere with the political. Famous amongst these examples is the recent case of the so-called “Grievance Studies”, where papers arguing, for example, that “men who masturbate while thinking about a woman without her consent are perpetrators of sexual violence” (as reported by The Atlantic in the preceding link) can still be accepted for publication in scientific journals.
Other examples are much more subtle, and while they cannot easily be argued, let alone proven, to constitute a lack of scientific integrity or wilful deception for activist reasons, they nonetheless invite the curious mind to wonder. Two come to mind.
The first comes from the Australian Bureau of Meteorology’s changes to and closures of weather stations. How different are the metrics (say, rainfall or temperature) in a given location when a station has been closed and the conclusions for a given area (e.g. a region) now rely on fewer stations, or on stations distributed differently compared to historical records? What about the composition of the total data for a large region (the whole of Australia) for historical comparisons when stations are simply closed for good and thus removed from the total sample? Some have already criticised the BoM for the impact these changes can have on our records and on our understanding of weather and climate more generally. Without access to the raw data collected across the BoM’s systems, I can only speculate about what these changes mean, but I can nonetheless ask: to what degree are the changes in records between years a reflection of the weather/climate, and to what degree are they a reflection of changes in the stations used to measure it? Changes in methodology can completely distort our perception of weather from a historical perspective, and thus the narratives surrounding climate change. The science on the latter can then land in even hotter water when an insider decides to speak up and denounce what he believes are biases in the cherry-picking of data, studies, and preferred narratives in high-profile journals, and argue that factors besides climate change that nonetheless contribute to issues such as wildfires (like, for example, the amount of vegetation available to burn, which is an issue of Government management rather than of climate) are often ignored because “they don’t sell”.
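I cannot reproduce the BoM’s data here, but a small, entirely hypothetical sketch illustrates how the mere composition of a station network can shift a “regional average” even when every individual station’s readings stay exactly the same:

```python
# Hypothetical illustration of how closing a weather station shifts a
# regional average even if no individual station's readings change.
# Station names and temperatures are invented for the example.
stations_before = {
    "coastal":      21.0,   # mean annual temperature, degrees C
    "inland_plain": 24.5,
    "high_plateau": 15.0,   # cooler, high-altitude site
}

# Suppose the high-altitude station is closed and drops out of the sample.
stations_after = {k: v for k, v in stations_before.items() if k != "high_plateau"}

def regional_mean(stations):
    """Naive regional average: the plain mean of the available stations."""
    return sum(stations.values()) / len(stations)

print(f"Regional mean, all stations:      {regional_mean(stations_before):.2f} C")
print(f"Regional mean, after one closure: {regional_mean(stations_after):.2f} C")
# The 'region' now appears ~2.6 C warmer, yet not a single thermometer read differently.
```

The sketch is deliberately crude (real agencies work with anomalies and apply homogenisation adjustments), but it shows why changes in which stations are counted, and when, can masquerade as changes in the climate itself.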
The second example comes from an allegedly banned TED Talk by Rupert Sheldrake that we can still access via YouTube. While discussing the speed of light and the gravitational constant (timestamp 10:00), Sheldrake recounts his interactions with the Head of Metrology at the National Physical Laboratory in Teddington, England, and how metrologists apparently “solved” the problem of variability in the measured values of universal constants by decreeing them constant by definition, while still releasing new values at set intervals. The whole exchange invites an infinitude of questions: if these are constants, why do you have to reconvene periodically to revise measurements and release corrected values? Doesn’t the word “constant” imply something that does not change? It is interesting to note that the measured value of the speed of light has changed over time, arguably due to changes in methodology, from the time it was first measured by Ole Roemer to the values we find in modern physics books. This necessarily opens the door to the possibility that our current methods and knowledge will also prove inadequate by future standards, just as Roemer’s were back in 1676.
Questions of this nature, however, meet fierce opposition and invite widespread derision in scientific circles, yet upon inquiry it is often revealed that many scientists operate from dogma and trust rather than solely from knowledge they generated themselves. In their defence, we all do. It is impractical to re-calculate or re-derive everything others have done in the past — we could not advance knowledge at all if we had to. But this should not mean that dissenting or “heretical” lines of inquiry are met with contempt and dismissal, lest we impede scientific progress, make a religion out of science, and lock ourselves into our current understanding of reality. Nor should we deprive ourselves of knowledge simply because it may cause harm or be used by someone the wrong way, or make the publication of research conditional on meeting such harm-avoidance criteria, lest we further bias our own understanding of reality in the name of peacekeeping.
All of this simply underscores that the way we generate knowledge and data, and the biases we indulge when deciding what knowledge should be produced and disseminated, are as important as, or more important than, the data itself. After all, data is downstream of methodology, and methodology is downstream of personal biases, and sometimes collective ones.
What data allows you to do… and what it doesn’t.
One last topic demands our attention, and it lies at the intersection of data and what we decide to use it for. Why would we generate data if not to guide our decision-making processes? Linking scientific outcomes, insights, and data with effective widespread action is, however, a process fraught with unintended consequences — sometimes catastrophic ones — especially when the people working with the data do not understand what it actually conveys.
Often we encounter data that is a representation of some other, underlying data. Statistical aggregates such as the average, the median, and percentages are common in everyday parlance, and they convey a high-level view of the data they represent… a view that is often distorted, and whose limitations people easily lose sight of.
If I say, by way of an example, that Australian men are on average 5 ft 9 in tall, does this mean that all Australian men are 5 ft 9 in? Presented with an example like this, everyone can appreciate that the mean is only a general representation of the distribution of the data; a representation that not all data points conform to. This realisation, however, often escapes people from all walks of life, regardless of how smart they are. Nowhere is this lack of appreciation for the limitations of statistics more prevalent than when crafting policy — when using data to “solve a problem”. To keep this brief, I will limit myself to one example.
In a recent talk on male inequality, for example, Richard Reeves proposes that, because boys’ brains develop a year later than girls’, boys should enter school a year later to address some of the issues of men lagging behind in education. Yet I am not sure that Reeves appreciates that the averages coming from those studies on brain development conceal an undeniable biological variability: some boys’ development will be on par with that of some girls, some will be ahead of some girls, and some will be behind them. In the same way, we all understand that when measuring physical strength some women will be stronger than some men, even though men are, on average, stronger than women. This is because these statistical aggregates are simply the reduction of a distribution of data points to a single number that conceals said distribution.
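To see how much individual variability a gap in averages can hide, here is a short sketch with two hypothetical populations whose means differ by one “developmental year” but whose spread is wider than that gap; the parameters are invented for illustration and are not taken from Reeves or from any brain-development study:

```python
# Two hypothetical populations whose means differ by one unit ("one year"),
# with a spread wider than that difference. All parameters are invented
# purely to illustrate how group averages conceal individual overlap.
import random
from statistics import fmean

random.seed(42)

N = 100_000
girls = [random.gauss(mu=12.0, sigma=2.0) for _ in range(N)]  # "developmental age"
boys = [random.gauss(mu=11.0, sigma=2.0) for _ in range(N)]

girls_mean = fmean(girls)
boys_mean = fmean(boys)

# Fraction of boys at or above the *average* girl, despite the lower group mean.
boys_above_avg_girl = sum(b >= girls_mean for b in boys) / N

print(f"Mean (girls): {girls_mean:.2f}   Mean (boys): {boys_mean:.2f}")
print(f"Share of boys at or above the average girl: {boys_above_avg_girl:.0%}")
```

Under these assumed numbers, roughly three in ten boys would already be at or above the average girl; a policy keyed to group means alone ignores every one of them.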
Should we craft discriminatory policy that apparently dismisses out of hand the underlying variability of the data in favour of what the means convey? I would argue strongly against it, especially when the proposed “solution” is mere window-dressing: we would simply be shifting the age at which boys start school so that the attainment gaps for each school year shrink, while the normalised, age-equivalent data keep telling us the truth we want to escape. And this is without even contemplating what the unintended consequences of said policy might be.
More importantly, and as I have argued before, data does not tell you what you should do with it. When we use data to solve a problem, we have a priori decided that the morals embedded in the desired outcome are the ones that should prevail, usually forgetting that complex social, environmental, or health problems are real-life trolley problems in which we do not know exactly what lies beyond the pulling of the lever. Stating that we do know what lies beyond is effectively stating that we can predict the future in all its complexity, accounting for all the variables, even the ones we do not know exist — a statement that reveals both a failure to grasp the complexity of reality and an eagerness to “solve the problem”, whatever unintended consequences the solution might entail. This is why it is paramount that complex problems and their devised solutions are informed by as many points of view as possible — especially from opposing political and moral standpoints — and not only by the ones that suit a predetermined agenda. It is often helpful to ask ourselves: do we really want to solve the problem, or do we simply want to push our personal interests forward and hope the problem gets solved along the way?
TLDR (Too Long, Didn’t Read).
If the present article quickly got put in the “too long, didn’t read” basket, then let me at least leave you with some take-home messages:
Data is the outcome of a thinking process (the methodological framework), not a statement of undeniable reality. Reality, after all, is much more complex than what can be captured in a study, and the thinking process may be inadequate to capture it.
Data is only as valuable as the process that creates it — if the process is bad, the data is worthless. Your aim is therefore to understand the process, not simply to ingest the data. If you do not have time to commit to understanding the process, be wise and take the data with a massive grain of salt.
Data only allows you to inform yourself about a very small fraction of a problem, from one perspective only (the one adopted by the people writing the study or report). Always remember to look for alternative points of view, and be especially wary of fields and topics where you simply cannot find dissenting voices, and of people trying to dissuade you from looking for them.
Data does not tell you what to do with it. What drives people are their a priori moral and political beliefs, which are then justified using the data to effectively disguise a moral stance as a scientific one — making them look impartial and authoritative in the process.
If you can take at least one of these statements to heart, I will consider my job done. We are all swimming in oceans of data daily, and parsing information is a critical skill that is highly sought after by people from all walks of life, not only by universities seeking to hire the most brilliant minds. But it is unbelievably easy to forget what data actually means, and what the process behind it represents. Understanding the latter lies at the centre of critical thinking; that rare skill that we all forget to exercise regularly, but that we all so desperately need to withstand the tidal waves of lies, half-truths, and propaganda we have to endure in the age of the internet.