Hey wavelet, turn that noise down!

One way to prevent noise from affecting your data-driven projects

Piero Paialunga
Towards Data Science


Photo by Valentin Salja on Unsplash (https://unsplash.com/@valentinsalja)

Noise is so difficult to treat; every data scientist knows that.

The fact is that, as one dear friend of mine loves to say,

“The hardest part of getting what you want is figuring out what it is”

Indeed, we can’t precisely define what noise really is. As a physicist, I often find myself studying a dataset and trying to understand whether my data makes physical sense. When a clear pattern can’t be identified in a part of my data (or my signal), I tend to classify that part as “noise”. But this approach can be dangerous and misleading. Moreover, sometimes you simply don’t know what to expect, or the problem is too complex for you to anticipate anything before you actually see it.

So what do we do?

The first step is to figure out what kinds of noise can affect your data. Noise is usually classified by its colour (i.e. its dependency on frequency in the Fourier spectrum). But I want to put myself in the worst-case situation.

Let’s pretend we know nothing about the noise sources of our system.

Then a safe assumption is that our system is disturbed by Gaussian white noise: a noise that lives everywhere in the frequency spectrum with roughly the same amplitude and follows a Gaussian distribution with zero mean.
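To make that assumption concrete, here is a quick toy sketch of mine (not from the original article) showing that Gaussian white noise has near-zero mean and a roughly flat Fourier spectrum:

```python
import numpy as np

rng = np.random.default_rng(42)

# Gaussian white noise: zero mean, same expected power at every frequency
noise = rng.normal(loc=0.0, scale=1.0, size=4096)

# Power spectrum via the real FFT
spectrum = np.abs(np.fft.rfft(noise)) ** 2

# Average power in the lower vs the upper half of the spectrum:
# for white noise this ratio is close to 1 (no frequency is privileged)
low = spectrum[1 : len(spectrum) // 2].mean()
high = spectrum[len(spectrum) // 2 :].mean()
print(round(low / high, 2))  # close to 1
```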

Ok. Let’s play.

The instrument

If you think about noise, I’m sure that the image that you picture in your mind is close to this one:

Image by Fran Jacquier

As already said, if you think about it in one dimension, white noise is a signal that lives at all frequencies, even the highest ones. But your signal is typically band-limited. That means that if you have a way to separate out the highest frequencies of your signal, you can try to distinguish where the noise really lives.

You actually have such a tool: it is called the wavelet transform.

I’m far from being a wavelet expert, and you can find plenty of information about them on your own. What we need to know to proceed is that wavelets can filter your signal in a way similar to a Fourier transform, but using different basis functions. In particular, they do so at different levels: the first levels use a small-scale filter, thus investigating the high frequencies, while the last levels use a large-scale filter, thus capturing the lowest ones.
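As a sketch of that idea (a toy example of mine, assuming the PyWavelets package, `pywt`, which the article does not name explicitly): decomposing a signal that mixes a slow and a fast sine shows the fast component landing in the first-level details and the slow one in the final approximation.

```python
import numpy as np
import pywt  # PyWavelets

# Toy signal: a slow sine (4 Hz) plus a weaker fast sine (300 Hz),
# sampled at ~1024 Hz over one second
t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 4 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)

# Multilevel discrete wavelet transform:
# one final approximation plus one detail array per level
cA5, cD5, cD4, cD3, cD2, cD1 = pywt.wavedec(signal, "db4", level=5)

# First-level details (small scale) capture the fast component;
# the last-level details (large scale) barely see it
print(np.abs(cD1).mean() > np.abs(cD5).mean())  # True
```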

Now let’s start coding.

The libraries:

Friends don’t lie :)
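The code embeds from the original post are not reproduced here; a plausible set of imports for the rest of the walkthrough (my assumption: PyWavelets for the wavelet transform and SciPy for the Gaussian fit) would be:

```python
import numpy as np                    # arrays and statistics
import pandas as pd                   # loading the Kaggle CSV
import pywt                           # PyWavelets: discrete wavelet transform
import matplotlib.pyplot as plt       # plotting
from scipy.optimize import curve_fit  # Gaussian fit of the histogram
```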

The dataset

The example I’ve used is a daily climate time series from Kaggle: https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data

The mean temperature column has been used as the example signal.
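Since the Kaggle CSV can’t be bundled here, the load is only sketched: the `read_csv` call is commented out and a synthetic daily series with the same flavour stands in. The file name and the column names (`date`, `meantemp`) are my assumptions about the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Real data: https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data
# df = pd.read_csv("DailyDelhiClimateTrain.csv", parse_dates=["date"])
# signal = df["meantemp"].to_numpy()

# Synthetic stand-in: a yearly temperature cycle plus Gaussian white noise
rng = np.random.default_rng(0)
days = pd.date_range("2013-01-01", periods=1462, freq="D")
seasonal = 25 + 10 * np.sin(2 * np.pi * np.arange(len(days)) / 365.25)
signal = seasonal + rng.normal(0, 1.5, size=len(days))

df = pd.DataFrame({"date": days, "meantemp": signal})
print(df.shape)  # (1462, 2)
```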

The wavelet method

Using the wavelet decomposition you get this scenario:

As I’ve said, the detail coefficients of the first levels are associated with high frequencies, while those of the last levels are associated with low ones. The detail coefficients of the first level plus the approximation coefficients form the original signal. The first approximation coefficient may actually serve as our reconstructed signal, since it is the original signal minus what we would like to call “noise”.
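A minimal sketch of that reconstruction (my own toy stand-in for the temperature series, assuming `pywt` and a `db4` wavelet, neither of which is confirmed by the article): keep the first-level approximation, zero the first-level details, and call the difference “noise”.

```python
import numpy as np
import pywt

# Synthetic stand-in for the daily temperature signal
rng = np.random.default_rng(1)
t = np.arange(1462)
signal = 25 + 10 * np.sin(2 * np.pi * t / 365.25) + rng.normal(0, 1.5, len(t))

# Level-1 decomposition: approximation cA (low freq.) + detail cD (high freq.)
cA, cD = pywt.dwt(signal, "db4")

# Reconstruct from the approximation alone: zero the detail coefficients
denoised = pywt.idwt(cA, np.zeros_like(cD), "db4")[: len(signal)]

# What we would like to call "noise"
residual = signal - denoised
print(residual.std() < signal.std())  # True: the residual is far smaller
```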

The “noise” has actually been filtered out, but we must not be too proud of ourselves.

In fact, an important check that you MUST do when dealing with noise is to see whether the difference between your reconstructed signal and the original one is correlated with the original signal.
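That check fits in a couple of lines (a sketch of mine, not the article’s code): correlate the removed residual with the original signal. Note that even a perfect reconstruction leaves some correlation, because the original signal contains the noise itself.

```python
import numpy as np

def residual_correlation(original, reconstructed):
    """Pearson correlation between the original signal and the
    residual (what was removed as "noise")."""
    residual = np.asarray(original) - np.asarray(reconstructed)
    return np.corrcoef(original, residual)[0, 1]

# Toy check: a sine plus mild Gaussian noise, "reconstructed" perfectly
rng = np.random.default_rng(2)
clean = np.sin(np.linspace(0, 20, 1000))
noisy = clean + rng.normal(0, 0.2, 1000)
print(round(abs(residual_correlation(noisy, clean)), 2))
```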

Umh. 5% correlation is not that low. Can we do better than that?

The refinement

As we said before, the noise we would like to treat is Gaussian. But let’s take a look at the histogram of the first detail coefficients:

1. Build the histogram:

2. Make it symmetric:

3. Plot the Gaussian fit:
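The three steps above can be sketched as follows (my own implementation, assuming SciPy’s `curve_fit` and a zero-mean Gaussian model; the article’s actual binning choices are unknown):

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amp, sigma):
    # Zero-mean Gaussian, as assumed for white noise
    return amp * np.exp(-x**2 / (2 * sigma**2))

# Stand-in for the first-level detail coefficients
rng = np.random.default_rng(3)
detail = rng.normal(0, 2.0, 2000)

# 1. Build the histogram on edges symmetric around zero
counts, edges = np.histogram(detail, bins=41, range=(-8, 8))
centers = (edges[:-1] + edges[1:]) / 2

# 2. Make it symmetric: average each bin with its mirror image
sym_counts = (counts + counts[::-1]) / 2

# 3. Fit the zero-mean Gaussian to the symmetrised histogram
(amp, sigma), _ = curve_fit(gaussian, centers, sym_counts,
                            p0=[counts.max(), 1.0])
print(round(abs(sigma), 1))  # close to the true sigma of 2.0
```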

I know, it’s bad, but wait a minute.

If you look closely, the ups and downs in the core of the distribution may be considered statistical fluctuations. On the other hand, from a certain point of the tail onwards, the distribution lies entirely above our Gaussian fit.

This is where we want to attack!

The strategy here is to suppress the whole core of the Gaussian and extract the important information found in the tails.

Let’s see if it works (SPOILER: it does.)

Since the Gaussian fit gives you the fitted sigma, it has been used to set the threshold: at a certain point of the Gaussian (fitted sigma * threshold), everything between that value and its symmetric counterpart is set to 0.
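In code, that thresholding step might look like this (again a sketch with my synthetic stand-in signal; the article’s fitted sigma is replaced here by the detail coefficients’ standard deviation):

```python
import numpy as np
import pywt

# Synthetic stand-in for the daily temperature signal
rng = np.random.default_rng(4)
t = np.arange(1462)
signal = 25 + 10 * np.sin(2 * np.pi * t / 365.25) + rng.normal(0, 1.5, len(t))

cA, cD = pywt.dwt(signal, "db4")

# Hard threshold: zero every detail coefficient inside ±(fitted_sigma * TH),
# keeping only the tails, where the non-Gaussian (informative) part lives
fitted_sigma = cD.std()  # stand-in for the sigma from the Gaussian fit
TH = 4
cD_kept = np.where(np.abs(cD) > fitted_sigma * TH, cD, 0.0)

denoised = pywt.idwt(cA, cD_kept, "db4")[: len(signal)]
print(np.count_nonzero(cD_kept) < len(cD))  # True: most coeffs are zeroed
```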

Now, as already mentioned, let’s take a look at the correlation obtained using different thresholds:
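The threshold sweep can be sketched like this (synthetic stand-in again, so the numbers will not match the article’s; on this toy signal, whose residual is pure Gaussian noise, small thresholds trivially win, whereas the article’s real data favours TH = 4):

```python
import numpy as np
import pywt

# Synthetic stand-in for the daily temperature signal
rng = np.random.default_rng(5)
t = np.arange(1462)
signal = 25 + 10 * np.sin(2 * np.pi * t / 365.25) + rng.normal(0, 1.5, len(t))

cA, cD = pywt.dwt(signal, "db4")
sigma = cD.std()  # stand-in for the fitted sigma

# Correlation between signal and residual, for each candidate threshold
correlations = {}
for th in range(1, 8):
    cD_kept = np.where(np.abs(cD) > sigma * th, cD, 0.0)
    rec = pywt.idwt(cA, cD_kept, "db4")[: len(signal)]
    residual = signal - rec
    correlations[th] = abs(np.corrcoef(signal, residual)[0, 1])

best_th = min(correlations, key=correlations.get)
print(best_th, round(correlations[best_th], 3))
```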

And this result is obtained with the optimal TH = 4.

This refinement outclasses the naive wavelet method, obtaining an error that is considerably less correlated with the signal.

The result is pretty good even in terms of RMSE. In fact, when this method is applied, the RMSE is very similar (just slightly higher) to the RMSE obtained by using only the first approximation coefficient (i.e. discarding all the first-level detail).

WARNING:

This method is not magic!

Just because it works surprisingly well on this dataset does not mean it can be applied to every dataset you have. Some data may not carry Gaussian white noise, but pink, red, or blue noise. Your signal itself could have a Gaussian distribution, in which case this method becomes powerless. Your signal may not be band-limited. And the list goes on.

I just wanted to shine a light on a method that is really efficient when its assumptions hold, and that could be used as part of a more complex Machine Learning algorithm or data-driven process.

If you liked the article and want to know more about Machine Learning, or if you just want to ask me something, you can:

A. Follow me on Linkedin, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have a “maximum number of stories for the month” and can read whatever I (and thousands of other top Machine Learning and Data Science writers) write about the newest technology available.


PhD in Aerospace Engineering at the University of Cincinnati. Machine Learning Engineer @ Gen Nine, Martial Artist, Coffee Drinker, from Italy.