NavList:
A Community Devoted to the Preservation and Practice of Celestial Navigation and Other Methods of Traditional Wayfinding
Re: Rejecting outliers
From: George Huxtable
Date: 2011 Jan 2, 23:12 -0000
I'm failing to understand some aspects of Peter Hakel's approach, though at least it seems numerical, definable, and repeatable.

First, though, I should say that problems arise when viewing his Excel output. It looks as though some columns need to be widened to see their contents, but it seems that the file is deliberately crippled to prevent me taking many of the actions I am familiar with in Excel. But I should add that mine is an old 2000 version, which may be part of the problem. The alternative .png version opens with "paint" but only allows me to see the top-left area of the sheet. Is that all I need?

Now to more substantive matters. I don't understand how the various weighting factors have been derived, and why many of them are exactly 1.0000, when others are much less. Presumably, they depend on the divergence of each data point from some calculated line, which is then readjusted by iteration, but I have failed to follow how that initial straight-line norm was assessed, or what algorithm was used to obtain the weights. Answers in words rather than in statistical symbols would be most helpful to my simple mind.

You seem to have ended up with a best-fit slope of about 24' in a 5-minute period, as I did when I let Excel make a best fit with freedom to alter the slope as it thought fit. But the slope can be pre-assessed with sufficient accuracy from known information, and unless there is some error in the information given, such as an (unlikely) star mis-identification, we can be sure that the actual slope is nearer to 32', and the apparent lower figure is no more than a product of scatter in the data. This is a point that Peter Fogg keeps reiterating, perhaps the only valid point in all he has written on this topic.
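To put a rough number on how far scatter alone can pull a freely fitted slope, here is a minimal Python sketch. Only the 32' per 5 minutes comes from the discussion above; the number of sights (`N_OBS`), the scatter per sight (`SIGMA`), and the even spacing of the times are hypothetical values chosen for illustration, not taken from the actual data set.

```python
import random
import statistics

TRUE_SLOPE = 32.0 / 5.0  # arc-minutes per minute of time (32' in 5 minutes)
N_OBS = 9                # hypothetical number of sights
SIGMA = 3.0              # hypothetical Gaussian scatter per sight, arc-minutes


def fit_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    t_mean = statistics.fmean(times)
    v_mean = statistics.fmean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


def simulate_slopes(trials=2000, sigma=SIGMA, seed=1):
    """Fit a free slope to many artificial runs of sights with known true slope."""
    rng = random.Random(seed)
    # Assumed: sights evenly spread over a 5-minute interval.
    times = [5.0 * i / (N_OBS - 1) for i in range(N_OBS)]
    slopes = []
    for _ in range(trials):
        alts = [TRUE_SLOPE * t + rng.gauss(0.0, sigma) for t in times]
        slopes.append(fit_slope(times, alts))
    return slopes


if __name__ == "__main__":
    slopes = simulate_slopes()
    # Scale by 5 to express results as change in altitude over 5 minutes.
    print("mean fitted slope over 5 min:", 5 * statistics.fmean(slopes))
    print("std dev of fitted slope over 5 min:", 5 * statistics.stdev(slopes))
```

Whether a fitted 24' is a plausible accident of scatter depends entirely on the real number of sights and their real scatter, which would have to be substituted for the assumed values.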
As a result, we could, if we wished, subtract off that line of known constant slope from all the data, and end up with a set of numbers, all of which should be roughly equal, simply scattering around some mean value that we wish to discover. Then the statistical task of weeding outliers becomes somewhat simpler.

=================

If you apply your procedure to an artificially-generated data-set, scattering in a known Gaussian manner about a known mean (which could be zero), and known to contain no non-Gaussian outliers, what is the resulting scatter in the answer? How does it compare with the predicted scatter from simple averaging? I suspect (predict) that it can only be worse, though perhaps not by much.

This is the way I picture it. If there is any lopsidedness in the distribution, then each added observation that is in the direction of the imbalance will be given greatest weight, whereas any observation that would act to rebalance it, on the other side, will be attenuated, being further from the trend. So there will be some effect, however small, that acts to enhance any unbalance, though probably not to the extent of causing instability. Does that argument make sense to you? It could be checked out by some Monte Carlo procedure.

I presume that the proposed procedure is entirely empirical, and has no theoretical backing, though there may not be anything wrong with that, if it works.

George.

contact George Huxtable, at george@hux.me.uk
or at +44 1865 820222 (from UK, 01865 820222)
or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.
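The Monte Carlo check suggested above might be sketched in Python as follows. The weighting rule here is an assumption, not Peter Hakel's algorithm, which this message does not describe: it is a Huber-style weight (exactly 1 within a cutoff of the current estimate, falling off as cutoff/distance beyond it), used only as a stand-in for "some iterative reweighting". The values `n_obs`, `sigma`, `cutoff`, and `iterations` are likewise hypothetical. The data are assumed to have had the known slope subtracted already, so they scatter about a constant mean of zero.

```python
import math
import random
import statistics


def reweighted_mean(data, cutoff, iterations=20):
    """Iteratively reweighted mean using a Huber-style weight (a stand-in rule)."""
    estimate = statistics.fmean(data)  # start from the simple average
    for _ in range(iterations):
        weights = []
        for x in data:
            r = abs(x - estimate)
            # Weight is exactly 1 near the estimate, reduced further away.
            weights.append(1.0 if r <= cutoff else cutoff / r)
        estimate = sum(w * x for w, x in zip(weights, data)) / sum(weights)
    return estimate


def rms(values):
    """Root-mean-square about the known true mean of zero."""
    return math.sqrt(statistics.fmean(v * v for v in values))


def compare(n_obs=9, sigma=1.0, trials=5000, seed=2):
    """Return (scatter of simple mean, scatter of reweighted mean) for pure Gaussian data."""
    rng = random.Random(seed)
    plain, robust = [], []
    for _ in range(trials):
        data = [rng.gauss(0.0, sigma) for _ in range(n_obs)]
        plain.append(statistics.fmean(data))
        robust.append(reweighted_mean(data, cutoff=1.5 * sigma))
    return rms(plain), rms(robust)


if __name__ == "__main__":
    plain_scatter, robust_scatter = compare()
    print("predicted scatter of simple mean:", 1.0 / math.sqrt(9))
    print("simulated scatter of simple mean:", plain_scatter)
    print("simulated scatter of reweighted mean:", robust_scatter)
```

For purely Gaussian data the simple mean is the minimum-variance unbiased estimator, so the reweighted figure should not come out smaller except by sampling noise. How much larger it comes out would depend on the actual weighting rule, and the same harness could be run with a deliberately lopsided sample to test the imbalance argument.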