Data Representation
general principles and pointers
Compiled May 25, 2020
The current draft aims to introduce researchers to the key ideas in data representation that help prepare their data for analysis.
Our target audience is primarily the research community at VUB / UZ Brussel, in particular those who might request data analysis support at ICDS.
We invite you to help improve this document by sending feedback to wilfried.cools@vub.be or anonymously via icds.be/consulting (right side, bottom).
In preparation of data analysis, it is wise to think carefully about how to represent your data. The key ideas are listed first, and are explained and exemplified in more detail throughout this draft.
Test yourself: create a data file for the following 4 participants (assuming many more), ready for analysis.
Read through this draft and if necessary alter your solution.
A possible solution is included at the end.
This draft addresses data representation with the following outline:
In subsequent drafts, data manipulation, modeling and visualization are considered. Typically, all are more straightforward when data are tidier.
To avoid problems and frustration in your data analysis, it may be worthwhile to consider the checklist below. It points at various issues that have been encountered in actual data at ICDS and that are easy to avoid. In general, most data offered by researchers who did not attempt their own analysis, or at least preliminary descriptives, are full of issues like the ones highlighted in this section.
In summary:
When labeling or scoring properties for research units (cells), avoid typos, inconsistent labeling, inconsistent scoring, …
Often observed problems:
- typos and inconsistent spelling, eg., man next to women and womem, or likely next to likly and Likely,
- inconsistent capitalization, eg., man next to Man, or woman next to Woman; most statistical software is case sensitive (eg., R),
- leading or trailing spaces, eg., "man " next to "man", or " woman" next to "woman",
- inconsistent decimal separators, eg., 4.2 next to 5,3 and 5,9; a comma is often used locally, a dot is used internationally (scientifically),
- inconsistent missing value codes, eg., NA next to -99; software differs in its defaults, but consistency is key!
Advice: frequency tables often suffice to detect most of these errors; for numeric values, a summary helps.
[example tables omitted]
Note that the average score for the table on the left appears to be 3.65; do you see what went wrong?
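The advice above can be sketched in base R (a hypothetical example; the variable names and values are made up): a frequency table exposes label variants, and a numeric summary flags values distorted by, eg., a decimal-comma mishap.

```r
# Hypothetical sketch: detecting labeling and decimal errors with base R
gender <- c("man", "Man", "woman", "womem", "woman", " woman")
table(gender)    # frequency table exposes case, spelling and whitespace variants

score <- c(4.2, 5.3, 5.9, 4.8, 59)   # one value entered as 59 instead of 5,9
summary(score)   # the maximum (and the inflated mean) immediately flag it
```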
When labeling or scoring properties for research units (cells), avoid ambiguity and incompleteness.
Often observed problems within cells:
- implicit values, eg., an empty cell meant to be read as group 1,
- ambiguous codes such as missing or none: no answer is different from the answer 0 or "" (eg., the types variable in the ambiguous - incomplete example below),
- mixed values, eg., 3.9 combined with >10 (eg., the score variable in the ambiguous - incomplete example below),
- combined values, eg., A:B, A:C, B to signal treatments received (none or A, B, and/or C) (eg., the types variable in the ambiguous - incomplete example below).
Each cell should ideally be fully interpretable on its own, with reference to its row and column only. A codebook, discussed below, serves to alleviate any possible discrepancy between the data representation and the actual data.
Often observed problems combining cells:
- one column or header covering two variables, eg., blood volume for both baseline and after treatment,
- values entered only once but applying to several rows, eg., a baseline measurement.
When labeling or scoring, or when specifying a variable name, avoid characters that may not be understood properly. Note that some characters call for specific operations in certain statistical software.
Often observed inconveniences follow from using special characters such as $, %, #, ", and '.
When labeling variables or values, strike a balance between meaningful and simple. This is especially important when requesting help from data analysts, who typically program their analysis and often do not understand your line of research. Some analysts may even prefer all values numeric (eg., 0 vs. 1), while others prefer short alphanumeric values (eg., male vs. female).
Advice: to keep headers meaningful but avoid long and complex names, use a second line with simple headers to read in for the analysis, eg., patientID and id1 instead of patient_identifiers_of_first_block and patient_number_1.
[example table omitted]
Advice: to ensure a correct interpretation, now and later, the researcher could make the distinction explicit, eg., labeling the values of an ordinal scale notAgree - neutral - agree, or r1 - r2 - r3, so that the scale is not treated as continuous.
[example table omitted]
A codebook could address the relation between labels and their interpretation as well.
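One way to make such a distinction explicit, sketched here in base R with made-up values, is to store the item as an ordered factor, so the software itself refuses to treat it as a continuous score:

```r
# Sketch: an ordinal item stored as an ordered factor, not as a number
resp <- c("agree", "neutral", "notAgree", "agree")
resp_ord <- factor(resp, levels = c("notAgree", "neutral", "agree"),
                   ordered = TRUE)
is.ordered(resp_ord)   # TRUE; mean(resp_ord) warns rather than average labels
```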
When starting the analysis, or offering data to third parties, retain only the data of interest for the analysis. Store the remainder of the data in a secure place with an appropriate link.
Advice: remove what is not of interest for the analysis.
[example table omitted]
Spreadsheets are convenient for representing data because their base structure is a table, with rows and columns, which you need for most statistical analysis, and because they allow for straightforward manipulations of data.
Manually constructed spreadsheets, Excel or other, unfortunately, promote the use of implicit information rather than the required explicit information. For example, cells are left empty because it is, at least for a human, clear from the context what the value should be (eg., Excel showcase, empty field meaning group 1 or 2).
Excel deserves special attention. Understandably very popular, it often does more than expected and can cause serious problems.
Often observed problems:
Advice: a safe way to store data, once fully ready, is a tab-delimited text file. While less convenient to manipulate, it eliminates the risk of unwanted behavior, and it is straightforward to convert one format into the other.
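A minimal base R sketch of this round trip (the data and file are made up; a temporary file stands in for a real path):

```r
# Sketch: store cleaned data as tab-delimited text and read it back in
d <- data.frame(id = 1:3, group = c("a", "b", "a"), score = c(4.2, 5.3, 5.9))
f <- tempfile(fileext = ".txt")
write.table(d, f, sep = "\t", row.names = FALSE, quote = FALSE)
d2 <- read.delim(f)   # read.delim expects tabs and a dot as decimal separator
```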
For data analysis, data are most often represented in one or more tables.
While it is best to avoid a bad data table from the start, it is in many cases possible to convert tables into more appropriate forms.
Purely for illustration purposes, R code using the tidyverse package is included to show a possible data transformation, starting from a bad example and turning it into another data representation. In the current draft the focus is on data representation, not on changing it. More details on how to manipulate, visualize and model data are offered in future drafts.
Consider this monstrous dataset, showing various features that are common in data offered for analysis.
[table: the example dataset omitted]
Apparently, substances (subst) can be s1, s2, both, or none. So, having s1,s2 partly overlaps with s1, but how does the algorithm know? Let's turn this multiple-selection item into multiple columns. Apparently, young and old are two variables, which makes no sense because you are either young or old, so let's remove one of them.
badExample <- tBadBad %>%
  mutate(s1 = grepl('s1', subst), s2 = grepl('s2', subst)) %>%
  select(-subst, -old)
[table omitted]
Apparently, various columns contain variable values (consider the 4th to 9th columns). As the variable names suggest, observations are obtained under certain conditions, A or B, and at various time points, time 0, 1 or 2. In this example condA_time1 partly overlaps with condA_time2, with which it shares a condition, and partly overlaps with condB_time1, with which it shares a time point. Let's turn these columns into values first, and at the same time simply ignore the missing values.
Observe that the names of the columns turn into values in a column named messyStuff, making the dataframe less wide and more long.
badExample <- badExample %>%
  pivot_longer(cols = -c(id, young, stat, s1, s2),
               names_to = "messyStuff", values_to = "scores") %>%
  filter(!is.na(scores))
[table omitted]
The new column still combines two types of information, condition and time. The column should be split into two columns.
badExample <- badExample %>%
separate(messyStuff,c('cond','time'))
[table omitted]
Much better. A last issue here is that the minimum and maximum could be variables and not values. No hard rules here, but often it is intuitively clear. So, let’s turn these values into variables to represent two types of observation.
goodExample <- badExample %>%
pivot_wider(names_from=stat,values_from=scores)
[table omitted]
While not necessary here, if there are many variables it may be worthwhile to split the table into different tables, each specific to a research unit. So, let's create a persons file and an observations file, and merge them together again afterwards.
persons <- goodExample %>% select(id,young) %>% distinct()
observations <- goodExample %>% select(-young)
combinedAgain <- observations %>% full_join(persons)
[tables omitted]
Various issues were highlighted, and will be discussed in more detail below.
If within a research unit several scores are obtained, they can be represented within a row but often it is better or even necessary to unfold them into multiple rows that are identified with an indicator variable.
For example, consider a repeated measurements datafile, with multiple observations for each participant. The observations within a patient could be represented on a patient specific row (wide) with an identifier column for the participant, or one below the other covering several rows (long) with an indicator variable for both the participant (includes multiple rows) and the time of observation.
[tables: wide and long representation omitted]
Note: the switch between both representations is easy. In Excel use pivot tables; in R many functions exist, for example pivot_wider and pivot_longer in tidyr. Knowing how to transform data between wide and long form is very convenient and worth the effort of learning.
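As a sketch with made-up data, base R can make the same switch with the reshape function, without extra packages:

```r
# Sketch: wide to long and back with base R reshape()
wide <- data.frame(id = 1:2, t0 = c(5, 7), t1 = c(6, 8))
long <- reshape(wide, direction = "long", idvar = "id",
                varying = c("t0", "t1"), v.names = "score",
                timevar = "time", times = c(0, 1))
# long has one row per id x time combination: columns id, time, score
back <- reshape(long, direction = "wide")   # attributes allow the reverse trip
```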
It may be appropriate to split up a table into different tables, as is done with relational databases, in order to combine all information in research unit specific tables. Different tables can be combined when of interest using key variables. This is particularly interesting as datafiles get bigger and as values are constant within blocks.
For example, a datafile could be split into a person datafile and an observation datafile. A person file only consists of person related properties that are constant for a particular person. An observation file consists of observation related properties that are constant for a particular observation. Note that the person providing the observation is represented once per observation.
[tables omitted]
For example, an additional table could be used to add item specific information about what the correct response is, how to score a particular response, or whether a score should be inverted when using it to summarize over an underlying scale. The main observation file includes the actual responses, not the scores.
Note: splitting and merging tables is easy. In Excel use merge; in R use the join functions in dplyr, for example. Knowing how to split and combine data can be very convenient.
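A base R equivalent of such a join, with made-up tables and names:

```r
# Sketch: a persons table and an observations table, merged by a key variable
persons      <- data.frame(id = c(1, 2), young = c(TRUE, FALSE))
observations <- data.frame(id = c(1, 1, 2), time = c(0, 1, 0),
                           score = c(5, 6, 7))
combined <- merge(observations, persons, by = "id")   # inner join on id
```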
A full data representation not only considers the actual data but also the possible data. The way to include this type of information is with additional tables that specify all possible outcomes. A codebook can also be used to provide this information in textual format.
For example, consider a question for which the response option fully agree was never selected; a separate table could include that option nevertheless.
Or consider a question for which selecting none of the alternatives is a viable response; a separate table could make this explicit as well.
[tables omitted]
Note: it is possible to add option specific information, for example a score or indication of correctness. This has the advantage that the score can easily be changed and the used scores are easy to determine.
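In base R this can be mimicked by declaring all possible levels of a factor, so that never-selected options keep a zero count (responses and level names are made up):

```r
# Sketch: keep never-selected response options visible via factor levels
answers <- c("agree", "neutral", "agree")
lev <- c("notAgree", "neutral", "agree", "fullyAgree")
tab <- table(factor(answers, levels = lev))   # fullyAgree appears with count 0
```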
A main point of interest is to include only one piece of information within a cell, unambiguously interpretable. Typically this involves bringing in additional columns.
It could be of interest to distinguish between a missing value due to non-response, and a missing value by design. A full data registration can include an extra column for example, to signal for each missing value how to interpret it. A codebook can be an alternative in which codes are specified for different types of missing data.
[tables omitted]
Note: the original information is still available, but each variable contains only one type of information and cells have only numbers or (implied) ranges.
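A sketch of such an extra column in base R (all values and labels are made up):

```r
# Sketch: one column per piece of information, including why a value is missing
scores <- data.frame(
  id      = 1:4,
  score   = c(3.2, NA, 5.1, NA),
  missing = c(NA, "refused", NA, "by_design")   # NA when the score is present
)
subset(scores, is.na(score))   # each missing value now explains itself
```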
Values sometimes partially overlap so that they do not offer a single piece of information. A possible full data registration adds columns to isolate the different pieces of information.
[tables omitted]
Note that this way the combination of A and B is correctly considered as a combination of two constituent parts, neither of which was required on its own. The original information is still easily retrieved from the available variables, while each variable contains only one type of information and cells hold only numbers or boolean values.
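A base R sketch of isolating the pieces (the treatment codes are made up):

```r
# Sketch: split overlapping codes like "A:B" into one boolean column per part
treat <- c("A:B", "A:C", "B", "none")
d <- data.frame(treat,
                A = grepl("A", treat),
                B = grepl("B", treat),
                C = grepl("C", treat))
# the original code is easily reconstructed from the boolean columns
```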
It is best to let data be as self-explanatory as possible and ready for automated processing.
The information that is impossible or very impractical to include in the actual table(s) should be explained in a codebook. A codebook explains any discrepancy between the data as represented and their meaning.
A possible solution to the challenge above is presented here. Other, simpler solutions are possible.
[tables: possible solution omitted]
Methodological and statistical support to help make a difference
ICDS provides complementary support in methodology and statistics to our research community, for both individual researchers and research groups, in order to get the best out of them.
ICDS aims to address all questions related to quantitative research, and to further enhance the quality of both the research and how it is communicated.
website: https://www.icds.be/ includes information on who we serve, and how
booking: https://www.icds.be/consulting/ for individual consultations