“Is it dishonest to remove outliers and / or transform data?”
October 18, 2011, 10:02 pm

An outlier is a data point that lies far outside the norm of other values in a random sample from a population. If most of the data fit a particular trend, an outlier is a point that falls radically outside that trend, far away from the line of fit. In other words, an outlier is an unusually large or unusually small value in comparison to the other data points.

My initial thought on the removal of data was that it was cheating, but after considering how outliers are caused I can see that most are not valid data points and add nothing to a study. Outliers make statistical analyses difficult and can distort the interpretation of the data, influencing the mean and the variability (standard deviation), and consequently the findings of a study.
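To illustrate that distortion, here is a minimal sketch using made-up reaction-time data (the values, and the 9.8 s outlier, are hypothetical): a single extreme value drags the mean upward and inflates the standard deviation.

```python
import statistics

# Hypothetical reaction times in seconds; 9.8 is the outlier
scores = [1.2, 1.4, 1.3, 1.5, 1.1, 1.3, 1.4, 9.8]

with_outlier_mean = statistics.mean(scores)
with_outlier_sd = statistics.stdev(scores)

cleaned = [x for x in scores if x != 9.8]
clean_mean = statistics.mean(cleaned)
clean_sd = statistics.stdev(cleaned)

# One extreme point nearly doubles the mean and inflates the SD ~20-fold
print(round(with_outlier_mean, 2), round(with_outlier_sd, 2))  # 2.38 3.0
print(round(clean_mean, 2), round(clean_sd, 2))                # 1.31 0.13
```

With the outlier in place, the mean (2.38 s) describes none of the actual responses well, which is exactly the distortion described above.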

An outlier may be the result of an error in measurement, such as a human error in data collection, recording or entry. These errors may be corrected by returning to the original data, double-checking and recalculating, but if they cannot be corrected they should be removed, as they do not represent valid data points.

Another cause could be a participant’s misinterpretation of the task: their interpretation leads them to perform the task wrongly, in a different manner to all other participants, so their data are not a fair assessment of performance on the same task and are therefore not valid. Participants may also choose to behave in certain ways, such as purposely giving false, invalid answers in order to appear socially acceptable; for example, in studies investigating sexual experience, educational achievement, truancy rates or financial income. Individual participant effects, such as high levels of stress on the day of the test, illness or fatigue, or immediate environmental effects such as a distracting noise outside the testing lab, can also affect results. Participants may simply become bored with a task and answer carelessly, again producing data that are not valid.

Other causes of outliers are researcher effects: an attractive researcher may affect a participant’s answers, or multiple researchers may record data in different ways. A participant may also guess the true nature of the study and its desired outcome, and adjust their answers accordingly, either to please the researcher or to oppose them.

Outliers can also occur due to an error in sampling. For example, in a study of nurses and their income, some ward sisters with a considerably higher income could mistakenly be included in the sample. These could produce undesirable outliers, which should be removed as they do not reflect the target population.

Incorrect assumptions about the distribution of the data can also lead to outliers. Data may not fit the original assumption and may be affected by unanticipated long- or short-term trends. For example, a study of library usage rates for the month of September might find outlying values: low rates at the beginning of the month and high rates at the end. These data may have a legitimate place in the data set, as they may reflect the return of students for the new semester (the low rates) and the run-up to midterm exams (the high rates).

An outlier can also come legitimately from the population being sampled, through random chance. Sample size matters for the probability of outlying values: within a normally distributed population, a given data point is far more likely to be drawn from the densely concentrated centre of the distribution than from one of the tails. As a data set becomes larger, the sample more closely resembles the population from which it was drawn, increasing the likelihood that outlying values will appear. Under a normal distribution, there is only about a 1% chance that any given data point will lie more than roughly 2.6 standard deviations from the mean.
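That roughly-1% figure can be checked with a short sketch. This computes the two-tailed probability of a standard normal value beyond a given z using the error function; 2.58 is the conventional two-tailed 1% cut-off.

```python
import math

def two_tailed_prob(z):
    """P(|Z| > z) for a standard normal variable, via the error function."""
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# About 1% of values lie more than 2.58 SDs from the mean
print(round(two_tailed_prob(2.58), 4))  # 0.0099
```

The same function gives the familiar 5% figure at z = 1.96, which is where the usual significance cut-offs come from.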

Before proceeding with any formal analysis, researchers need to consider whether outlying data contain valuable information. If an outlier is a genuine result, it might indicate an extreme of behaviour that inspires and requires further inquiry into what makes these participants different and whether we can learn from them.

Outliers can represent an error or genuine data, which must be examined carefully and should not be removed without justification.


7 Comments so far

Your blog describes outliers very well: as you have said, an outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the researcher to decide what will be considered abnormal.

Before abnormal observations can be singled out, it is necessary to characterise normal observations. Two actions are essential for describing a set of data: investigation of the overall shape of the graphed data for important features, including symmetry, and investigation of the data for unusual observations that are far removed from the mass of the data.

A box plot is a graphical method well suited to describing outliers: an outlier can be picked up very easily on a box plot, and scatter graphs are also a good way to spot one. As you have pointed out, the decision to remove an outlier should not be taken lightly by the researcher, as some outliers are important results and removing them can cause problems with your data.
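The box-plot rule can be sketched in code. This is an illustrative implementation of Tukey's fences (flagging values beyond 1.5 × IQR from the quartiles), using a simple linear-interpolation quartile rather than any particular package's method, so exact quartile values may differ slightly from other software:

```python
def iqr_outliers(data):
    """Flag values outside Tukey's fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    s = sorted(data)
    n = len(s)

    def quantile(p):
        # Linear interpolation between order statistics
        k = p * (n - 1)
        f = int(k)
        return s[f] + (k - f) * (s[min(f + 1, n - 1)] - s[f])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

print(iqr_outliers([1.2, 1.4, 1.3, 1.5, 1.1, 1.3, 1.4, 9.8]))  # [9.8]
```

This is the same rule a box plot draws as its whiskers, so anything the function returns is exactly what would appear as a lone point on the plot.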


Comment by smmitch

I also think it is extremely important to remove outliers from data; it is an important step when conducting research because of the implications they can have for the results. Until looking in detail at the effects outliers can have, I would never have appreciated how much of an impact they make. For example, they can completely change the mean of the results, and this could simply be down to a participant misunderstanding the task instructions, which can cause either a Type I or a Type II error.

If researchers do not remove outliers and this causes a Type I or Type II error, a treatment that doesn’t have an effect may be used, or a treatment that does have an effect won’t be used.

However, I think outliers should only be removed when there is a good reason for doing so, and not simply to support the researcher’s hypothesis.

Comment by serenapsychology

Outliers are definitely a danger to the validity of a study’s results, and as you described, they can occur in so many ways that it would be difficult to avoid them completely, and even more difficult to accurately discover their origin. I think pilot studies are key to preventing outliers caused by things like participant selection bias, measurement errors, or participant boredom. A pilot study would help to identify these issues so that the fully developed study runs with a much lower risk of picking up outlying results at all.

Comment by statisticalperrin

[…] https://ellies1mpson.wordpress.com/2011/10/18/%e2%80%9cis-it-dishonest-to-remove-outliers-and-or-tran… […]

Pingback by Links to comments, happy to help my good ol’ TA « statisticalperrin

I agree with you that the data set should be examined very carefully in order to find any outliers and then make a well-informed decision about what to do with them. If the decision were made to include an outlier that perhaps should not have been, this increases the chance of Type I and Type II errors, which could mean that conclusions drawn about the results of the experiment are incorrect.

One example of when removing an outlier would be justified is if a participant repeatedly made incorrect responses during an experiment and their reaction times were either very fast or very slow. Consistently fast, incorrect responses may indicate that the participant wasn’t paying attention and was simply pressing buttons; a consistently slow, incorrect participant may not understand the task they are supposed to be completing, so in both circumstances removing their results could be justified. However, if a participant was repeatedly making incorrect responses but their reaction times were similar to those of other participants, this may simply indicate an extreme score and so should still be included. This example highlights the need to look carefully at the data set so you can make better-informed decisions about what to do with extreme scores.
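A screening rule along those lines could be sketched as follows. The trial data, the 0.25 s fast-response cut-off and the 30% thresholds are all hypothetical choices for illustration, not standard values:

```python
# Hypothetical trial records for one participant: (reaction_time_s, correct)
trials = [(0.45, True), (0.19, False), (0.50, True),
          (0.18, False), (0.21, False), (0.48, True)]

FAST_CUTOFF = 0.25  # hypothetical: responses faster than this suggest button-mashing

# Trials that were both implausibly fast and wrong
fast_errors = [t for t in trials if t[0] < FAST_CUTOFF and not t[1]]
error_rate = sum(not correct for _, correct in trials) / len(trials)

# Exclude the participant only if fast, incorrect responses dominate;
# wrong-but-normally-paced responses alone would not trigger exclusion
exclude = len(fast_errors) / len(trials) > 0.3 and error_rate > 0.3
print(exclude)  # True for this illustrative participant
```

The point of the two-part condition is the one made above: errors alone are not grounds for removal, but errors combined with suspicious reaction times are.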

Comment by laurenpsychology
