Adrian PD, Author at Towards Data Science
https://towardsdatascience.com/author/adrianpd/

How to do fast multiplication using the FFT
https://towardsdatascience.com/how-to-perform-fast-multiplications-in-science-using-the-fft-b751fafc2bac/

Current Deep Learning workflows rely on thousands of integer multiplications, so getting an efficient multiplication performance is critical nowadays.
Photo by Gsightfotos on Unsplash

The multiplication operation is the workhorse kernel in many scientific and Data Science applications. The complexity of this operation is not linear, thus scaling it in time can be a difficult task. Current Deep Learning workflows rely on thousands of integer multiplications, and this number grows together with the complexity of the DL architectures or the data available to train models. Therefore, getting an efficient multiplication performance is critical nowadays. This article shows how to perform integer multiplications using one of the most important signal-processing discoveries of the 20th century, the Fast Fourier Transform.

Deep Learning convolutions are not the only workloads that depend on integer multiplication; other scientific and computing applications, such as rendering fractal images at high magnification and public-key cryptography, rely on it as well. The classical vector multiplication has a complexity of O(N²), where N is the number of digits. Roughly speaking, any complexity well above O(N log N) is difficult to scale on many-core platforms, which means that it is going to be the bottleneck of our computations no matter how many computing units we add to our Cloud infrastructures. Any compiler optimizations, in addition to other performance techniques like tiling or caching strategies, are useless unless the quadratic complexity is softened. Is it possible to decrease the complexity of the classical multiplication operation? Yes, it is.

All is about Polynomials

Before explaining how to reduce the complexity of the multiplication operation, let me show you that any integer in base x can be decomposed into a polynomial coefficient vector. The resulting polynomial is described by a set of coefficients and the original integer is built by evaluating the polynomial in base x:

Image by the author

The number of coefficients is equal to the number of digits; that is, the size of the polynomial. This is called coefficient representation. Remember from your math lessons that the product of two polynomials results in a third polynomial of size 2N, and this process is called vector convolution.

Above I claimed that any integer with N digits can be decomposed into a polynomial of size N, and that the multiplication of two polynomials of size N is another polynomial of size 2N. The next question is how we can perform the polynomial multiplication:

Image by the author

As I pointed out in the image above, the natural way of convolving two polynomials is to use the distributive property, multiply term by term, and then reduce redundant terms by adding the coefficients with the same polynomial degree. This is the classical multiplication that I mentioned before, with O(N²) steps. However, look at the points that I marked on the image. Each point has coordinates (i, P(i)), where P(i) is the evaluation of P(x) for x=i. Any polynomial of degree d is uniquely defined by d+1 points on the plane, and this is called value representation. The idea behind FFT multiplication is to sample A(x) and B(x) at at least d+1 points, (x_i, A(x_i)) and (x_i, B(x_i)), and then simply multiply the function values one by one (pairwise product) in order to get the value representation of their product:

Multiplication in value representation significantly reduces the number of operations compared to multiplication in coefficient representation. Hence, the idea is to change from the coefficient representation to the value representation, perform the multiplication in a pairwise fashion, and transform the result back from the value representation to the coefficient representation. We need some more concepts before understanding how to do it.
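Before moving on, here is a quick sanity check of the pairwise-product idea (a small NumPy sketch of my own, not a figure from the original article): we multiply two polynomials in coefficient representation via convolution and in value representation via a pairwise product, and verify that both agree.

```python
import numpy as np

a = [3, 2, 1]            # A(x) = 3 + 2x + x^2   (coefficients, lowest degree first)
b = [1, 0, 2]            # B(x) = 1 + 2x^2

# Value representation: evaluate both polynomials at (at least) deg(A*B)+1 points.
points = np.arange(len(a) + len(b) - 1)
A_vals = np.polyval(a[::-1], points)     # np.polyval expects highest degree first
B_vals = np.polyval(b[::-1], points)
C_vals = A_vals * B_vals                 # pairwise product = value representation of A*B

# Coefficient representation: the classical convolution of the coefficient vectors.
c = np.convolve(a, b)                    # [3, 2, 7, 4, 2] -> 3 + 2x + 7x^2 + 4x^3 + 2x^4
print(np.allclose(C_vals, np.polyval(c[::-1], points)))   # True: both representations agree
```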

The Discrete Fourier Transform for dummies

Well, after having refreshed our polynomial knowledge, we are now able to dig into the signals world. If you have never heard about the Fourier Transform, you just have to know that signals can be decomposed as a sum of other signals. While a time-domain graph shows how a signal’s amplitude changes over time, a frequency-domain graph shows how much of the signal lies within each frequency band over a range of frequencies:

Image by author

The Fourier Transform decomposes any signal into a sum of complex sine and cosine waves. Although it can be applied to continuous and discrete waves, I am only focusing on the discrete waves in this article, aka the Discrete Fourier Transform (DFT). The DFT is the function that allows you to change from one domain to the other, and the FFT is an algorithm that computes the DFT very very efficiently, but we will talk about it later.

Image by the author

Now we have a global idea about polynomials and the DFT, so we can move forward with our multiplication problem. We said that:

Hence, the idea is to change from the coefficient representation to the value representation, perform the multiplication in a pairwise fashion, and transform back the value representation to the coefficient representation.

An intuitive way for understanding the FFT

Maybe you are thinking at this point that the DFT can help to transform from one representation to the other one. You are right, let’s see why. We said that we would need to sample at least d+1 points to have the polynomial values that accurately define the polynomial. There are some tricks that may help to reduce the number of sample evaluations even more:

Image by the author

If we are able to apply some polynomial properties, we can reduce the number of sample evaluations we need. Keep in mind that for an even polynomial, P(-x)=P(x), which means it is symmetric with respect to the y-axis. Let’s split P(x) into its even-degree and odd-degree terms, P(x)=P_e(x²) + x*P_o(x²), where P_e collects the even-degree coefficients and P_o the odd-degree ones:

Image by the author

In such a case, the polynomial evaluation of the point -k reuses the polynomial evaluation of k:

Image by the author

This means that we only need to evaluate n/2 points (the complementary n/2 points are obtained instantly), which reduces the number of operations even more. At this point, you are likely thinking of a recursive procedure, and you are right. We can build this process intuitively as follows. For the given polynomial P(x) and n points (in fact, n/2 paired points), we write P(x)=P_e(x²)+x*P_o(x²) and recursively evaluate P_e at the squares of the n/2 positive points on the one hand, and P_o at the same squared points on the other. After getting their results, it is easy to get the evaluation of P(x) at the remaining n/2 negative points:

Image by the author

The next question to answer is which n points we choose to evaluate our polynomial at. Can we select any set of paired numbers? No, we cannot:

  1. We are passing half of the points to the next recursive level. The algorithm assumes that the polynomial will have positive and negative paired points for evaluation on every recursion call. However, each point is squared and ends up being positive. Is there any number whose square can still be negative, so that the pairing survives? The answer lies in the complex number i, since i² = -1.
  2. In addition to this, each recursive level relies on the fact that the second half of the received points is equal to the first half but with the opposite sign. This ends up being an equation system, where the bottom level of the recursion has a single point whose value is 1. There are different ways of solving this, but the most immediate is to use the N-th roots of unity (the complex numbers that yield 1 when raised to some positive integer power N).

Imagine that our n is equal to 12; solving for the 12th roots of unity gives the following points to evaluate:

Source askIITians

Note that the recursion keeps halving the number of points, thus it is convenient to choose n as the smallest power of two greater than d+1, even though we only evaluate n/2 of them at each level. The good news: we have just arrived at the most important algorithm of signal processing! This algorithm is known as the Fast Fourier Transform.

Let’s put it all together into a pseudo-code:

Reducible Youtube Channel
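The pseudo-code in the image comes from the Reducible YouTube channel; for readers who prefer something runnable, here is a minimal recursive sketch of the same idea in Python (the names and structure are my own, and n is assumed to be a power of two):

```python
import cmath

def fft(coeffs):
    """Evaluate the polynomial with coefficients `coeffs` (lowest degree first)
    at the n-th roots of unity, where n = len(coeffs) is a power of two."""
    n = len(coeffs)
    if n == 1:
        return coeffs
    even = fft(coeffs[0::2])   # P_e evaluated at the (n/2)-th roots of unity
    odd = fft(coeffs[1::2])    # P_o evaluated at the (n/2)-th roots of unity
    values = [0] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)          # k-th of the n-th roots of unity
        values[k] = even[k] + w * odd[k]              # P(w^k)  = P_e(w^2k) + w^k * P_o(w^2k)
        values[k + n // 2] = even[k] - w * odd[k]     # P(-w^k) reuses the same sub-results
    return values

print(fft([3, 2, 1, 0]))   # evaluations of 3 + 2x + x^2 at the 4th roots of unity
```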

Thanks to the FFT, we have obtained the value representation of each polynomial by applying the Discrete Fourier Transform, and this is done in just O(N log N) operations. To multiply two polynomials in value representation, we just do a pairwise multiplication of the function evaluations at each point, where pairwise multiplication means multiplying the vectors in pairs, element by element, which is very cheap computationally speaking.

The Strassen FFT algorithm for multiplying large integers

This algorithm was invented by Strassen and Schönhage in 1971, but at this point of the article, you will be able to understand it easily.

If we want to multiply two large integers A and B of size N, we first transform them into their polynomial coefficient representation in base x. We store the resulting coefficients in vectors a and b, respectively. According to the convolution theorem, if c is the convolution of the two input vectors a and b, c = a ∗ b, then the Discrete Fourier Transform (DFT) of c is equal to the pairwise multiplication of the DFTs of the input vectors, DFT(c) = DFT(a)DFT(b).

Thus, the c vector can be also obtained as the Inverse Discrete Fourier Transform (IDFT) of this pairwise multiplication, c = IDFT(DFT(a)DFT(b)).

If the input vectors do not have the same length, M < N, the shorter one is padded with zeros until M = N. Finally, the coefficients have to be normalized to the same base as the one in which the integer is represented, a step commonly known as carry propagation. This is important since, after the inverse FFT, each element of c may hold an integer larger than the base x because of the nature of the pairwise multiplication. The following image shows an example of the carry propagation method, where the elements end up in base 10.

Image by the author
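To make the whole pipeline concrete, here is a minimal NumPy sketch of the method: convolve the digit vectors through the FFT, then run carry propagation to bring the result back to the base. It is only a toy, double-precision illustration (not the GPU implementation discussed below), and the little-endian digit lists and base-10 example are my own choices.

```python
import numpy as np

def multiply(a_digits, b_digits, base=10):
    """Multiply two integers given as little-endian digit lists, via FFT convolution."""
    n = 1
    while n < len(a_digits) + len(b_digits):
        n *= 2                                    # smallest power of two that fits the product
    fa = np.fft.fft(a_digits, n)                  # DFT(a)
    fb = np.fft.fft(b_digits, n)                  # DFT(b)
    c = np.rint(np.fft.ifft(fa * fb).real)        # c = IDFT(DFT(a)DFT(b)), rounded to integers
    carry, result = 0, []
    for v in c.astype(np.int64):                  # carry propagation: renormalize to `base`
        carry += int(v)
        result.append(carry % base)
        carry //= base
    while carry:
        result.append(carry % base)
        carry //= base
    while len(result) > 1 and result[-1] == 0:    # drop leading zeros
        result.pop()
    return result

print(multiply([4, 3, 2, 1], [8, 7, 6, 5]))       # 1234 * 5678 = 7006652 -> [2, 5, 6, 6, 0, 0, 7]
```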

Computing approaches

When implementing the Strassen algorithm, most of the existing libraries do not use the FFT over the complex field. In the complex field, the integer coefficients are transformed into complex numbers, which ends up producing values like 14.999878 instead of 15 after the inverse FFT operation. In contrast, most implementations use the finite field Z/pZ, with p prime. In this case, it is desirable to use a p that minimizes the latency of the modulo operation, and Fermat primes are chosen for this purpose in most cases.

Nevertheless, there are several important restrictions in the finite field, in addition to finding the n-th root of unity:

  • The maximum value must fit in the field, that is, (n/2)(x-1)² <p.
  • Multiplying in Z/pZ must be done modulo p, thus the existence of a fast modulo-p operator is desirable (like the Montgomery reduction algorithm).

Traditionally, implementations over the finite field have been less efficient than implementations over the complex field, as different modulo operations are required. In this recent paper, we have demonstrated that the Strassen algorithm with FFTs working on the complex field runs better than other approaches, and that it is well suited for GPU architectures. We also demonstrate mathematically that it is safe to use it (despite working on complex numbers) for integers with fewer than 500K digits.

The important thing to understand here is that although classical multiplication can perform well for small sizes, the complexity never lies. When the number of digits increases, performance drops despite any optimizations like tiling because of the O(N²) complexity. Instead, the Strassen algorithm, with O(N log N log log N) when counting all the operations involved, is much faster. The following graph plots the performance of Real/Complex FFT Strassen implementations (with two different libraries, ID/CuFFT) vs. the tiling-based implementation of the classical algorithm:

Image by the author

Adrian Perez researches at the Berkeley Lab and has a PhD in Parallel Algorithms for Supercomputing. You can check out more about his Spanish and English content in his Medium profile.

Bibliography

Dieguez et al. Efficient high-precision integer multiplication on the GPU. International Journal of High Performance Computing Applications (2022).

An interesting walk from Bayesian statistics: Differences between MAP and MLE.
https://towardsdatascience.com/an-interesting-walk-from-bayesian-statistics-differences-between-map-and-mle-2daea5fc15b/

This article will allow you to understand the basis of any machine learning model. With simple words and few formulas, you will become fully proficient in Bayes, Maximum Likelihood Estimation and the Maximum A Posteriori estimator.

Specifically, let’s start by discussing binary classification in terms of probability and discriminant functions. An interesting walk to understand the principles of machine learning algorithms beyond the facilities of scikit-learn and keras.

Imagine we have to classify cars into two families: eco and polluting. The information we have about the cars is the manufacturing year and the consumption, which we represent by two random variables X1 and X2. Of course, there are other factors which influence the classification, but they are non-observable. Bearing in mind what we can observe, the classification of a car is denoted by a Bernoulli random variable C (C=1 indicates eco, C=0 polluting) conditioned on the observables X=[X1,X2]. Thus, if we know P(C|X1,X2), when a new datapoint arrives with X1=x1 and X2=x2, we would say:

Image by author

where the probability of error is 1-max(P(C=1|x1,x2),P(C=0|x1,x2)). Denoting x=[x1,x2] then we can use Bayes Theorem and it can be written as:

Image by author

P(C) is called the prior probability because it is the information about C that we have in our dataset prior to analyzing the observables x. As you might have guessed, P(C=1)+P(C=0)=1.

P(x|C) is called the class likelihood because it is the probability that an event, which belongs to C=c, has the associated observation x.

P(x) is the evidence, since it is the probability that an observation x is seen in our dataset. This term is also known as the normalization term, since it divides the expression to get a value between 0 and 1. The evidence probability is the sum over all possible scenarios (also known as the Law of Total Probability):

P(x) = P(x|C=1)P(C=1) + P(x|C=0)P(C=0)

Finally, P(C|x) is what we want to know, the probability of each class given an observation. This is called posterior probability as we classify it after seeing the observation x. Of course P(C=0|x)+P(C=1|x)=1.
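Here is a tiny numeric sketch of these formulas; the priors and class likelihoods below are made up for illustration and do not come from any dataset in the article.

```python
p_c1 = 0.3                    # prior P(C=1): proportion of eco cars in the dataset
p_c0 = 1 - p_c1               # prior P(C=0)
lik_c1 = 0.08                 # class likelihood P(x | C=1) for the observed x
lik_c0 = 0.02                 # class likelihood P(x | C=0)

evidence = lik_c1 * p_c1 + lik_c0 * p_c0      # P(x), by the Law of Total Probability
post_c1 = lik_c1 * p_c1 / evidence            # posterior P(C=1 | x)
post_c0 = lik_c0 * p_c0 / evidence            # posterior P(C=0 | x)

print(post_c1, post_c0, post_c1 + post_c0)    # the two posteriors sum to 1
print("eco" if post_c1 > post_c0 else "polluting")   # Bayes classifier: pick the largest posterior
```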

In case of having several classes, we can extend this expression to K classes:

Image by author

And the Bayes classifier will choose the class with the highest posterior probability P(Ci|x) from the K options.

In a previous article, I explained how to understand classification in terms of discrimination functions, which limit the hypothesis class region in the space. Following the same idea, we can implement that set of discriminant functions. Specifically, we choose C_i if:

Image by author

If we see the discriminant function as the Bayes classifier, the maximum discriminant function corresponds to the maximum posterior probability, obtaining:

Image by author

Please, observe that we have removed the normalization term (evidence) as it is the same for all discriminant functions. This set divides the feature space into K decision regions:

Image by author

In practice, the pieces of information we observe, aka datapoints, usually follow a given probability distribution, which we represent as P(X=x), as we have discussed previously. Bayes classification seems easy to calculate when we have values like P(C=1|x) or P(X=x). However, if we do not exactly know P(X), because we do not have access to all the datapoints, we try to estimate it from the available information, and statistics come on stage.

A statistic is any value calculated from a given sample.

P(X) can have different shapes, but if the given sample is drawn from a known distribution family, then we say it follows a parametric distribution. Parametric distributions are quite desirable because we can draw a complete distribution just by knowing a couple of parameters. What do we do? Based on what we see (the sample), we estimate the parameters of the distribution by calculating statistics from the sample, the sufficient statistics of the distribution. This gives us an estimated distribution of our data, which we will use to make assumptions and take decisions.

Maximum Likelihood Estimation (MLE)

If you have reached this point, please forget everything about the previous Bayes explanation.

Let’s start from scratch: we have N datapoints in our dataset. It seems that the probability density of these datapoints follows a clear distribution family p(x), but we have no clue about the shape of the real distribution (its real center, its real dispersion, and so on). If we are facing a parametric distribution, it is as simple as calculating the parameters which define the real distribution. How? Let’s pick the most likely option based on the given sample we have.

We have an independent and identically distributed (iid) sample:

Image by author

Its probability density is defined by a set of parameters θ. Let’s find the specific θ parameters that make the distribution as likely as possible for the given sample. As the datapoints are independent of each other, their joint likelihood is the product of the likelihoods of the individual points. Specifically, for a given θ, let’s look at the probability p(x|θ). The higher the probability, the better the match.

Image by author

Then, the likelihood of θ for a given dataset X, l(θ|X), is equal to p(X|θ). In order to make things easier, we usually apply the log to the likelihood, as the computations are faster (sums instead of multiplications):

Image by author

Now, we just have to take the derivative of the log likelihood and maximize it with respect to θ (for example, using Gradient Descent on the negative log likelihood).

IMPORTANT NOTE: We usually interchange the terms "likelihood" and "probability". However, they are different things. The probability is the area under a fixed distribution (a particular situation): with probability we want to know the chance of a situation given the distribution. In contrast, the likelihood is the y-axis value for a fixed point under a distribution that is allowed to move: likelihood refers to finding the best distribution for a fixed value of some feature. The likelihood contains much less information about uncertainty; it is just a point instead of an area.

Image by author

We run a search for different θ values, and we choose the θ that maximizes the log likelihood:

Image by author. Blue points represent the datapoints of the sample. Different distributions are drawn from different θ values, choosing the one with "the best match".

If we observe many different samples, the estimator of the parameters of the distribution will be more accurate and deviate less from the real parameters. Later we will see how to evaluate the quality of the estimators, in other words, how much the estimator is different from θ.

Applying MLE to a Bernoulli distribution

In Bernoulli, there are two possible states: true or false, yes or no. An event occurs or it doesn’t. Thus, each datapoint can take two values: 0 or 1. Actually, it takes the value 1 with probability p, and the value 0 with probability 1-p. Therefore:

Image by author

The distribution is modeled with a single parameter: p. Thus, we want to calculate its estimator p̂ from a sample. Let’s calculate its log likelihood in terms of p:

Image by author

And now, solving dL/dp=0 in order to obtain the p̂ which maximizes the expression:

Image by author
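A quick simulation of this result (the sample and the "true" p are invented for illustration): the MLE of the Bernoulli parameter is simply the sample mean of the 0/1 outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3                                   # the "real" parameter we pretend not to know
sample = rng.binomial(1, p_true, size=1000)    # iid Bernoulli sample of 0/1 values

p_hat = sample.mean()                          # MLE: p̂ = (1/N) * sum(x_i)
print(p_hat)                                   # close to 0.3 for a sample this large
```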

Maximum A Posteriori (MAP)

Now it is time to come back to Bayes theory. Sometimes, we could have some information about the possible value range that parameters θ may take, before looking at the sample. This information can be very useful when the sample is small.

Thus, the prior probability p(θ) defines the likely values that θ may take, prior to looking at the sample. If we want to know the likely θ values after looking at the sample X, p(θ|X), we can use Bayes:

Image by author based on [1]

Note: We can ignore the normalizing term (or evidence) when we are talking about optimization, since it does not affect the result.

Thus, if we want a model to predict an output for a given input x, y=g(x), as in regression, we really calculate y=g(x|θ), as it depends on the distribution.

However, since we do not know the real value of θ, we would have to take an average over the predictions using all possible values of θ (weighted by their probability):

Image by author

Evaluating an integral can be very complex and expensive. Instead of doing that, we can take just a single point, the most likely point, the maximum a posteriori.

Thus, we want to get the most likely value of θ, or in other words, maximize P(θ|X), which is the same as maximizing P(X|θ)P(θ):

Image by author based on [1]

If you take a second to compare this formula to the MLE equation, you will realize that it only differs in the inclusion of the prior P(θ) in MAP. In other words, the likelihood is weighted by the information which comes from the prior. The MLE is a special case of MAP with no prior information.

The MLE is a special case of MAP.

If this prior information is constant (a uniform distribution), it is not contributing to the maximization, and we would have the MLE formula.

Specifically, imagine that θ can take six different values. Thus, P(θ) is 1/6 everywhere in the distribution:

Image by author based on [1]

However, if the prior is not constant (it depends on the region of the distribution), the probability is not constant either and it has to be taken into account (for example, if the prior is Gaussian).
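As a concrete illustration (not part of the original text), here is a sketch using a Beta prior on the Bernoulli parameter, which is conjugate to the likelihood: with a uniform Beta(1, 1) prior the MAP estimate collapses to the MLE, while an informative prior pulls the estimate toward it, and the effect is strongest for small samples.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.binomial(1, 0.7, size=20)          # a small sample, so the prior matters
heads, n = int(sample.sum()), len(sample)

p_mle = heads / n                               # MLE: ignores any prior information

# MAP with a Beta(a, b) prior: the posterior is Beta(a + heads, b + n - heads),
# whose mode is (heads + a - 1) / (n + a + b - 2).
a, b = 2.0, 2.0                                 # a mild prior centered at 0.5
p_map = (heads + a - 1) / (n + a + b - 2)

print(p_mle, p_map)                             # with a = b = 1 (uniform prior) both coincide
```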

IMPORTANT: The maximum a posteriori can be chosen instead of computing the integral only if we can assume that P(θ|X) has a narrow peak around its mode. Why? Because we are taking a single point instead of integrating over the whole region, so there is no representation of the uncertainty. Things like the following may happen when there is no narrow peak around the mode:

Image by author

Bibliography:

[1] Agustinus Kristiadi’s Blog https://wiseodd.github.io/techblog/2017/01/01/mle-vs-map/

[2] Ethem Alpaydin. Introduction to Machine Learning, 4th edition.

Maths behind Supervised Learning for Dummies: The theory in plain words (Part II)
https://towardsdatascience.com/maths-behind-supervised-learning-for-dummies-the-theory-in-plain-words-part-ii-b3681d690c6e/

In my previous article, I wrote the first part of this series: Maths behind Supervised Learning for Dummies: The theory in plain words, where we saw an overview of how geometry and algebra are the basis of supervised learning. In this second part of the series, I will introduce the Vapnik-Chervonenkis dimension, Probably Approximately Correct learning, how to extend our binary classifier to multiple classes, and regression.

The Vapnik-Chervonenkis Dimension a.k.a Brute Force

If four individuals (x1,x2,x3,x4) can be classified into 2 classes (class A or class B), then there are 2^4=16 possible classifications: (x1 in A, x2 in A, x3 in A, x4 in A), or (x1 in A, x2 in A, x3 in A, x4 in B), or (x1 in A, x2 in A, x3 in B, x4 in A), and so on. Thus, 2^N learning problems (or possible labelings) can be given by N data points. If, for all these learning problems, we can establish an h (which belongs to the class H) that separates the elements correctly into their two classes, then H shatters the N points. In other words, the given N points are always correctly separated with no error by some hypothesis h from the class H.

The maximum number of points shattered by H is called the Vapnik-Chervonenkis dimension (it is enough to shatter one specific arrangement of N points in the space, not every possible arrangement of N points in two dimensions). For example, four points can be shattered by rectangles: although there is no way to do it if they are aligned, we have found an arrangement for which it works. For those 4 points in two dimensions, we can find a rectangle h which correctly separates any of the 2⁴=16 learning problems (or possible labelings); thus, 4 points are shattered by H:

Image by the author

Nevertheless, there is no way of shattering 5 points with rectangles in two dimensions. For any 5 points in two dimensions, there are always some of the 2⁵=32 possible labelings which cannot be solved:

Image by the author

The Vapnik-Chervonenkis dimension of H, VC(H), is equal to 4 when H is the hypothesis class of axis-aligned rectangles in two dimensions.
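Here is a brute-force sketch of the shattering argument (the point arrangements are toy examples of my own): it enumerates all 2^N labelings of a set of points and checks whether some axis-aligned rectangle realizes each of them.

```python
from itertools import product

def rectangles_shatter(points):
    """Check whether axis-aligned rectangles realize every labeling of `points`."""
    xs, ys = sorted({p[0] for p in points}), sorted({p[1] for p in points})
    cand_x = [xs[0] - 1] + xs + [xs[-1] + 1]      # candidate boundaries around the coordinates
    cand_y = [ys[0] - 1] + ys + [ys[-1] + 1]
    for labeling in product([0, 1], repeat=len(points)):
        found = any(
            tuple(int(x1 <= px <= x2 and y1 <= py <= y2) for px, py in points) == labeling
            for x1, x2 in product(cand_x, repeat=2)
            for y1, y2 in product(cand_y, repeat=2)
        )
        if not found:
            return False                           # this labeling cannot be separated
    return True

print(rectangles_shatter([(1, 2), (1, 0), (0, 1), (2, 1)]))   # a "diamond" of 4 points -> True
print(rectangles_shatter([(0, 0), (1, 0), (2, 0), (3, 0)]))   # 4 aligned points -> False
```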

Why did I say that this is also known as Brute Force? Well, most real-life problems do not need to cover all 2^N possibilities. Real-life problems are generally conditioned by a distribution from which their points are drawn, and points that are close in the space usually have the same label. We do not have to worry about all possible learning problems, and we are going to prefer a hypothesis class with a small VC dimension over other hypothesis classes with higher dimensions. This will simplify our search and it will work fine for most given points (do you remember the definition of bias?). There will be cases where we cannot divide all drawn points correctly into two classes. In such cases, we will have to say how good or accurate our hypothesis class is; for example, it works with 94% accuracy, or it works with an error of less than 3% in more than 95% of the cases.

Probably Approximately Correct Learning aka Real-life approximations

So, imagine we have a rectangle hypothesis h from H to solve a real-life problem whose data points are drawn from an unknown probability distribution. We want to find the number of points, N, for which the hypothesis h has an error of at most E with probability 1-d. In other words, if the original class from H which correctly divides all points is C, then the region of difference between C and h will produce false negatives when using h as the hypothesis. The probability of misassigning points in that region must be less than E (the error probability) in (1-d) of the cases (the confidence probability):

Image by author

The region between C and h is another rectangle (the shaded rectangle in the image), thus it can be divided into 4 strips (overlapping in the corners). The probability of falling in each strip is E/4, and the probability that a randomly drawn point misses a given strip is 1-E/4. The probability that all N independent points miss that strip is (1-E/4)^N. Additionally, the probability that all N independent points miss any one of the four strips is at most 4(1-E/4)^N. This probability must be at most d, thus 4(1-E/4)^N ≤ d. We do not want to go into greater detail, but dividing by four on both sides, taking logs and rearranging, we get N ≥ (4/E)log(4/d). So we need at least this sample size N to guarantee the error bound E with the given confidence. This bound depends on your hypothesis class (we were working with rectangles) and must be derived for each case.
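A small helper for this bound (the logarithm in the derivation is the natural one, and the error/delta values below are just an example):

```python
import math

def pac_sample_size(error, delta):
    """N >= (4 / error) * ln(4 / delta), the bound derived above for axis-aligned rectangles."""
    return math.ceil((4 / error) * math.log(4 / delta))

print(pac_sample_size(error=0.05, delta=0.05))   # 351 points for at most 5% error with 95% confidence
```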

Learning Multiple Classes

So far, we have focused on binary classifiers, but what if there are several classes? If we have K different classes, where each instance belongs to only one of them, we can see this as K two-class problems. The empirical error on a training set X is now the sum of the prediction errors over all classes (k) and all instances (N):

Regression

In classification, we calculate the empirical error with a binary behavior: the prediction is either right or wrong. For a given instance x, C(x) was either 0 or 1. When we are not predicting classes but values instead, the problem is different and is called regression. Now, r is a real value and is obtained from the inputs as r=f(x)+E. We must add noise E to the output, which represents the effect of hidden variables that we cannot see. To be 100% accurate, we should say that r=f(x,z), where z denotes the hidden variables. If there were no noise, it would be interpolation instead of regression. Please observe that the noise in the natural output function is not due to imprecision in recording or errors in labeling; those errors only affect the predictions.

So we want to approximate f(·) with our hypothesis class G(·), and the empirical error on the training set X must consider the numeric distance in the output, and not just the equal/not equal used in classification. A way to consider this distance is to use the square of the difference :

Our G(·) can be linear, quadratic, or any other higher-order function. We just have to replace G(·) with its definition in the empirical error formula and minimize it by taking partial derivatives with respect to the coefficients w (which gives an analytical solution for the parameters). Please observe that we first choose the class G(·) (for example, we decide to use a linear function), but once we set the coefficients that minimize the empirical error, we have an instance g(·) of the class G(·). Which G(·) should we choose? As I said in the previous article, when the order of the function increases, the training error decreases, but you may fall into overfitting.
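A minimal sketch of this procedure for the linear case, with synthetic data of my own: it computes the squared-difference empirical error and obtains the coefficients that minimize it with NumPy's least-squares fit instead of writing out the partial derivatives by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
r = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=x.size)   # r = f(x) + E, with Gaussian noise E

def empirical_error(w1, w0):
    g = w1 * x + w0                                    # a linear hypothesis g(x) from the class G
    return np.mean((r - g) ** 2)                       # squared-difference loss on the training set

w1, w0 = np.polyfit(x, r, deg=1)                       # least-squares solution for the coefficients
print(w1, w0, empirical_error(w1, w0))                 # close to 2.0, 1.0 and the noise variance
```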

Adrian Perez works as Head of Data Science and has a PhD in Parallel Algorithms for Supercomputing. You can check out more about his Spanish and English content in his Medium profile.

Bibliography

Introduction to Machine Learning. Fourth Edition. Ethem Alpaydin. The MIT Press 2020.

Maths behind Supervised Learning for Dummies: The theory in plain words (Part I).
https://towardsdatascience.com/maths-behind-supervised-learning-for-dummies-the-theory-in-plain-words-part-i-8f9be4d7e33a/

Photo by Tim Mossholder on Unsplash

Machine learning underlies the coolest technologies of today. This is how Ethem Alpaydin starts the Preface of his book, "Introduction to Machine Learning", which I had to read some years ago when I started in the Data Science world. Today, the book is still considered a bible in academia. The difference with many of the books you can find on Amazon is that it does not talk about technology; it only covers the algorithmic part of Machine Learning. This is the point. Most people who work with machine learning today just use the technology but have no idea about the mathematical basis of the algorithms that this technology encapsulates. It is very simple to call the ‘LDA’ algorithm in Python on a pandas dataframe, but it is even easier to run a linear regression with caret on an R dataframe. This is fine if you need to build a model with some accuracy; but if you want to really exploit your resources, you need to completely understand the algorithms and maths behind the library or package you employ.

I write Medium stories because it is a straightforward way to review concepts and not forget them. Thus, I am going to write a couple of series about the maths behind Machine Learning, starting with Supervised Learning. Do not worry, I know that theory is boring, so I will combine theory with practical tips in order to make it as helpful as possible.

What is Supervised Learning? How is it possible?

All of us know that an algorithm is a sequence of instructions which transforms an input into an output. However, it happens many times that we have no idea about the instructions needed to transform the input into the output; we are not able to build the algorithm despite years of research. What we do is learn to do the task, even if we do not know the exact steps of the algorithm. Our approach starts with a very general model based on different parameters, and we see how well the model approximates the output with those parameters. The learning consists of refining the parameters until we find the most accurate output; learning corresponds to adjusting the values of those parameters. We know that we cannot build the perfect approach, but we can build a good and useful approximation.

In the process, there is a statistical perspective, since the core task is making inferences from a sample of available data: get the parameters that best suit that sample of data, create a model based on those parameters, and extrapolate that model to the global case.

The model can be predictive or descriptive, depending on if we want to make predictions or to gain explainability from data. With this, we are able to build different applications: Association Rules, Classification tasks, Regression, Pattern Recognition, Knowledge Extraction, Compression or Outlier detection, among many others.

Both regression and classification are supervised learning. The standard shorthand is to refer to the input variables as X and the output variables as Y (a number in regression and a class code in classification). The task is to learn the mapping from X to Y, and we assume we can do it thanks to a model defined up to a set of parameters: Y=G(X|P), where G(·) is the model and P the parameters. During the learning process, the machine learning algorithm optimizes the parameters in order to minimize the error, trying to be as close as possible to the correct values.

Before continuing, there is an important note here: the difference between Machine Learning algorithms and models, since it seems we can use both concepts interchangeably. The model is the final representation learned from specific data with specific, defined parameters; while the algorithm is the process used to build that model: Model = Algorithm(Data). Linear regression is a machine learning algorithm; but y = 3x1 + 3x2 + 1 is a model.

Parametric vs Nonparametric. Supervised vs Unsupervised

Source: STHDA

From the previous section, we use a learning target function G(·) that best maps the input variables X to an output Y, trying to represent the real function F(·), which we do not know and try to approximate. If we knew what F(·) looks like, we would use it directly and we would skip the learning phase.

Y=G(X)+E

There is an error E in our estimations; our model is accurate but not perfect. This error can have up to three different components: bias error, variance error and irreducible error. We will see them later.

Parametric vs Nonparametric

In parametric algorithms, we map the model to a form, while nonparametric algorithms can learn any mapping from inputs to outputs.

In the case of parametric algorithms, we are assuming the form. This simplifies the process, but it also limits what the model can learn. So we have to decide which form the model will have and then look for the parameters of that form. In the case of linear regression we are saying to the algorithm: "hey dude, you will learn something with this form: Y=B2X2 + B1X1 + B0", and the algorithm will look for the B2, B1 and B0 parameters that minimize the error of that expression on the given data. Other examples of parametric algorithms are LDA or Logistic Regression. The advantages are clear: simpler, faster and less data required. The drawbacks are that you are forcing the model to fit a form that maybe is not the best one for your case, and therefore the accuracy is poorer.

In contrast, nonparametric algorithms are free to learn the shape; they do not require previous knowledge (SVM, Random Forest, Bayes, DNN), but of course they will require more data and there is a higher risk of overfitting to the training data.

Supervised vs Unsupervised

The difference between these two kinds of algorithms is clear: if you have the real output with which you can compare your accuracy during training, then it is supervised learning (you can answer: how well did I do my estimation?); but if you have no clue about the output, then it is unsupervised learning (Does this person fit better in cluster 1 or in cluster 2? Can anyone tell me the real cluster of this person?).

In the case of supervised learning, we have a reference for how good our model is, as we have a target output to compare against during training. The learning stops when it reaches an acceptable accuracy with respect to the given outputs of the training data. However, with unsupervised algorithms, you only have the input data (X) but no output target, so there is no way to calculate the accuracy of your estimations. This happens when clustering or finding association rules (HDBSCAN, Apriori, k-medoids).

Bias, Variance and Irreducible Errors in Supervised algorithms

A few paragraphs above, we saw that Y=G(X)+E in supervised learning. These supervised algorithms can be seen as a trade-off between bias and variance. Bias is the error introduced by the assumptions we make when we say that the model has to fit a specific form. This makes the problem easier to solve. Variance is the sensitivity of a model to the training data.

In supervised learning, an algorithm learns a model from training data. We estimate G(·) from the training data, and G(·) is almost Y, but there is an error E. This error can be split into: bias error, variance error and irreducible error.

Irreducible error. It is always present regardless of the dataset you use for training or the algorithm you choose. This error is caused by external factors like hidden variables. You always have this error no matter how much effort you put in.

Bias error. The assumptions we make to make the G(·) function easier to learn. Thus, parametric algorithms have a higher bias. High bias makes an algorithm faster to learn but less flexible. For example, Linear Regression and Logistic Regression are high-bias algorithms, while SVM or kNN are low-bias algorithms.

Variance error. How different our model is if we choose another portion of training data. Ideally, it shouldn’t change but it unfortunately changes in many cases. High-variance suggests that changes in the training dataset will largely change the G(·) function. Examples of high-variance algorithms are: kNN and SVM.

As we cannot control the irreducible error, we will focus on the bias and variance errors. Ideally, the goal is to get low bias and low variance. But it is difficult to achieve both, so we have to balance them. Parametric or linear algorithms often achieve high bias but low variance, while non-parametric or non-linear algorithms usually get low bias but high variance.

Overfitting and Underfitting

Our learning process consists of learning a G(·) function from training data, trying to infer general patterns from specific examples, in contrast to deduction, which derives specific conclusions from general rules. The final goal of machine learning is to generalize well from specific data to any other data from the problem domain.

On the one hand, overfitting means learning the training data too well but not generalizing to other data. This is the most common problem in practice, since the model picks up the noise from the training data. Overfitting is more common with nonparametric and nonlinear models (low bias). On the other hand, underfitting means training our model with insufficient data. That is the reason why we usually iterate until reaching a certain degree of error.

In practice, we usually plot the evolution of the error on a training set together with the error on a test set. If we iterate too much, we will reach a point where the training error keeps decreasing and decreasing, but suddenly the test error starts to grow. This is because the model is picking up the noise from the training data over the iterations. The sweet spot is just before the test error starts to rise; however, in practice we use resampling and a validation dataset in order to find that sweet spot.

Image by the author

Classification

So far, we have described different concepts you should know for understanding any machine learning analysis. Let’s analyze now the maths behind classification, one of the most important fields within machine learning.

When learning from training data, we would like to have a specific pattern shared by all A-class examples (aka positive examples) and none of the B-class examples (aka negative examples). For instance, we train our model with millions of car descriptions, which say whether each description belongs to a family car or not. Our model learns and, after that, it is able to say if a car that we have not seen before is a family car or not, just based on the given description.

Imagine the following example: we have a dataset X which describes different car models. Each datapoint x is composed of two features, x1 and x2, which represent the price (x1) and the engine power (x2) of each car x. Some of these cars are family cars (red points) and some are not (green points):

Image by the author

What we have to do is find the rectangle C which discriminates red from green points. Specifically, we have to find the p1 and p2 values for the price (x1) and the e1 and e2 values for the engine power (x2) which delimit the rectangle C. In other words, being a family car is the same as being plotted between p1<x1<p2 AND e1<x2<e2 for suitable values of p1, p2, e1 and e2. This equation represents H, the hypothesis class from which we believe C is drawn, namely, the set of possible rectangles (the algorithm). However, the learning finds the particular hypothesis/rectangle h (which belongs to H) specified by a particular set of p1, p2, e1 and e2 values (the model). So, parametric learning starts from a hypothesis class H and finds values for p1, p2, e1 and e2 based on the available data, creating a hypothesis h which is equal or closest to the real C. Note that we restrict our search to this hypothesis class (we are forcing a rectangle, but maybe the real C is a circle), introducing a bias. Additionally, please observe that if there were a green point inside C, it would be a false positive; while if there were a red point outside C, it would be a false negative.

In real life, we do not know C. We have a dataset with the tag "it is a family car" or the tag "it is not a family car" for each vehicle x. This tag is named r. Then, we compare h(x) to r, and we see whether they match or not. This is called the empirical error of a model h trained from a dataset X:

Image by the author
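A toy sketch of the rectangle hypothesis and its empirical error; the data points and the p1, p2, e1, e2 thresholds below are invented for illustration.

```python
# Each entry is ((price, engine_power), r) with r = 1 for a family car and 0 otherwise.
data = [((20, 110), 1), ((22, 130), 1), ((25, 120), 1),
        ((12, 80), 0), ((45, 300), 0), ((18, 95), 0)]

def h(x, p1, p2, e1, e2):
    """The rectangle hypothesis: predict 'family car' iff p1 < price < p2 and e1 < power < e2."""
    price, power = x
    return int(p1 < price < p2 and e1 < power < e2)

def empirical_error(params, dataset):
    return sum(h(x, *params) != r for x, r in dataset) / len(dataset)

print(empirical_error((19, 30, 100, 150), data))   # 0.0: this rectangle separates the toy sample
```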

The question to be answered is how well will it classify future datapoints which are not included in the training set? This is the problem of generalization. Depending on how generous or restrictive we are for drawing our rectangle h when finding C, we could have the tightest rectangle S or the most general G:

Image by the author

Any h from H between S and G is consistent with the training set. However, which one would be better for future datapoints?

Depending on H and X, we will have different S and G. Let’s choose an h halfway between S and G. We define the margin as the halfway distance from the boundary to the closest datapoint. So we choose the h with the largest margin to balance the possibilities. The datapoints underlined in yellow define, or support, the margin:

Image by the author

We use an error (loss) function that, instead of just checking whether an instance is in the correct class or not, includes how far away it is from the boundary. In some critical applications, any future instance which falls between S and G is a case of doubt and is automatically discarded.

In the next article of this series, we will review the Vapnik-Chervonenkis dimension, Probably Approximately Correct learning, learning multiple classes, and regression.

Adrian Perez works as a Data Scientist and has a PhD in Parallel Algorithms for Supercomputing. You can check out more of his Spanish and English content in his Medium profile.

Bibliography

Introduction to Machine Learning. Fourth Edition. Ethem Alpaydin. The MIT Press 2020.

Master Machine Learning Algorithms. PhD Jason Brownlee. 2016.

How to load and store MLFlow models in R on DataBricks: Hacking the constraints.
https://towardsdatascience.com/how-to-load-and-store-mlflow-models-in-r-on-databricks-hacking-the-constraints-93ce458af7ff/


Databricks has become an important building block in Cloud Computing, especially now, after Google announced the launch of Databricks on Google Cloud. However, it must be said that it is still an experimental technology, with a long way to go in terms of efficiency and coverage.

R is one of the most powerful and widely used languages and environments for Data Science, with years and years of statistical folks working hard on developing different libraries. Thanks to that, one can find almost any mathematical model implemented in R, which is quite useful for succeeding in most DS projects you can imagine. However, R has an important drawback: Big Data. SparkR and other formulas are still far from providing a huge catalog of ML libraries for working with massive data. Most Big Data SaaS and PaaS offerings, such as Databricks, are focused on Python and Spark, making efforts to develop a complete framework and trying to define the end-to-end cycle of an ML artifact, rather than paying attention to fully supporting their products in R.

This happens, for example, when using MLFlow for R models within Databricks. It is true that Databricks supports R, Python and Scala code, but different weaknesses appear when working with MLFlow and R, specifically when trying to register an ML model.

What MLFlow is

MLFlow is an ML platform built around REST APIs which allows you to record instances of ML models in a repository. These instances are packaged in order to be sent to a deployment tool, but MLFlow also registers the different metrics, data around the model, and config or code versions when running your experiments. Later you can visualize them and compare the output of multiple runs.

If you are not familiar with MLFlow, please take a look at their documentation and try a free trial with Databricks.

An example from the DataBricks website

MLFlow is currently in alpha, which is why it still has weaknesses, but it is a powerful idea to keep in mind when thinking about deploying our ML models.

How MLFlow works on Python/Spark

Let’s see a quick example of MLFlow on Python. First, we have to install and load the libraries:
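The original code cell is not included in this extract; a minimal version of the install-and-import step in a Databricks Python notebook could look like this.

```python
# First notebook cell: install the library (Databricks magic command).
%pip install mlflow

# Next cell: load what we need.
import mlflow
import mlflow.sklearn
```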

After this, we must create or set the experiment using the following command: mlflow.set_experiment(path_to_experiment). Before registering any experiment, we can go to the given path and we will see that the experiment was created but it is empty:

Then, we can run our model and register the execution in our experiment:
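Again, the original snippet is missing from this extract; here is a hedged sketch of setting the experiment and logging a run, where the experiment path and the scikit-learn model are placeholders of my own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("/Users/<your-user>/my_experiment")   # placeholder path
mlflow.sklearn.autolog()                                    # log params, metrics and the model automatically

X, y = load_iris(return_X_y=True)
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
```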

With autolog() we automatically log all parameters, scores and the model itself, but if we do not want to keep track of the whole context, we can select what to save with the log_metric(), log_param(), log_artifact() or log_model() functions. Actually, I find the log_model function really interesting, as it allows us to track the model together with a label, which can help us load the model later from the repository. Check the doc here.

After this, if you go to the experiments path now, you will see the record of all experiments you have logged. Something like this:

An example from the DataBricks website

Once the model is trained and logged, it can be promoted from None to the Staging and Production stages (look for the transition_model_version_stage() function) in the model repository. In Databricks, the repository can be visualized in the "Models" tab:

An example from the DataBricks website

Then, it can be easily loaded from the repository to be used in our predictions. One just has to call the mlflow.load_model(path_to_model) instruction to use the desired model within a notebook.

Trying to use the R API of MLFlow in Databricks

Following the same structure as the Python flow, we firstly load the libraries and install mlflow:

The first difference appears when setting the experiment. In R, we cannot set an experiment that does not exist; so it is mandatory to use a try-catch to capture the error, just-in-case, and create the experiment if needed. After that, let’s try to run the execution and log the model:

Now, if we check the experiment path, we will see that the experiment and the run were created; however, no model was tracked in the repository. This is surprising, since no error was returned. Most people get stuck here, and MLFlow has to work on this, but here is the reason:

The registered_model_name parameter (‘predictorLDA’ in our example) is a useful parameter of mlflow.log_model; unfortunately, it is not currently available in the R library.

This is a pain in the neck if we want to load MLFlow models in our R notebooks, but there is a solution. In any Databricks notebook, you can override the default language by specifying the language magic command %<language> at the beginning of a cell. The supported magic commands are: %python, %r, %scala, and %sql. The solution involves calling Python (or using the REST API) from the R notebook:

We first set the experiment we have created in the previous R commands, and then we look for the last run of that experiment. Remember that both the experiment and the run were registered in R; the only thing we were not able to track was the model. Once we have found the run, we can get the URI of the created artifact, and we will use this URI to get the model in Python and register it. We just create an MLFlow client to do the registration with the create_model_version instruction. This operation can take up to 300 seconds to finish the creation, so keep that in mind (it could be interesting to use a time.sleep(300) before the next command). Now, after this "trick", the model has been correctly registered in the Model repository and it is ready to be used.
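The original Python cell is not included in this extract; the sketch below reproduces the described steps with the standard MLFlow client API. The experiment path, the model name 'predictorLDA' and the artifact path 'model' are assumptions based on the text above.

```python
# %python  <- magic command at the top of this cell inside the R notebook
import mlflow
from mlflow.tracking import MlflowClient

experiment = mlflow.get_experiment_by_name("/Users/<your-user>/my_r_experiment")  # the experiment set from R

# Find the most recent run of that experiment (the one the R code created).
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
last_run = runs.sort_values("start_time", ascending=False).iloc[0]

model_source = f"{last_run['artifact_uri']}/model"          # assumes the model was logged under "model"

client = MlflowClient()
try:
    client.create_registered_model("predictorLDA")          # create the entry if it does not exist yet
except Exception:
    pass                                                    # already registered
client.create_model_version(name="predictorLDA",
                            source=model_source,
                            run_id=last_run["run_id"])
# The version creation can take a while (up to ~300 s), so wait before using or transitioning it.
```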

Both Databricks and MLFlow are powerful and promising technologies that are being developed. Especially MLFlow is still too green, so we must be patient. In the meantime, I hope this trick can help you with your cloud deploys.

Adrian Perez works as a Data Scientist and has a PhD in Parallel Algorithms for Supercomputing. You can check out more of his Spanish and English content in his Medium profile.

The Definitive Guide for Data Preparation that Beginners should read
https://towardsdatascience.com/the-definitive-guide-for-data-preparation-that-beginners-should-read-71d9cb4b193c/

The best tips and steps to start any Data Science project from scratch.
Photo by Kaleidico on Unsplash

In my humble opinion, the different data scientists can be classified into:

  • People who focus on testing many different algorithms without paying attention to the distributions the algorithms require. The algorithm is a black box for them; they just memorize the recipes.
  • Old-school scientists who address all challenges with wonderful (but rare) R algorithms, without considering how the problem scales with real (and huge) datasets or how difficult it is to find those libraries in Spark.
  • Fancy guys who directly decline to use linear regressions because Ockham was wrong and, for them, the simplest way to do things is never the right one.
  • People obsessed with drawing the best plots ever, in order to tell the story behind the numbers, but using the same models over and over.
  • The good ones. The Data Scientists who pay attention to every detail, from using efficient data structures in their notebooks to understanding what the algorithm does, taking care of documentation, writing clean code and thinking in advance about the later deployment stage. Sadly, they are a small portion of the cake.

Which one do you want to be? I am sure you are answering "the good one". However, to gain such expertise, it is necessary not to put the cart before the horse. The first stage in any Data Science project is the Data Preparation stage. Most people pay much more attention to later stages of the data pipeline, rather than spending their time on analyzing and preparing the data. Training a Machine Learning model is easy; the challenging part is to understand the distribution of your data and choose the correct algorithm based on that distribution (or prepare your data for the chosen algorithm). The Data Preparation stage, with a dose of common sense, is still not replaceable by automatic tools, and it is what lets Data Scientists earn their money.

The goal of this article is to give you some tips: how to process your project's data before you start thinking about models, and how to process the data after you have chosen a model.

Tip 1: Plot data. Never rely on aggregated metrics.

First of all, you should review your statistical notes and completely understand what the mode, the median and the mean are. Additionally, you should also review the concepts of skewed and long-tail distributions. Why do I say this?

Most beginners try to figure out how the data are distributed just by analyzing the values of their aggregated metrics. However, if you have read about Anscombe's quartet, you know that the mean/variance/correlation of your data does not tell you the whole story. If you haven't, let me show you the following four distributions:

Anscombe's Quartet (Image by author)

The four datasets have the same mean, variance and correlation. Nevertheless, the reader can observe that the four distributions are completely different: the first one is well described by a linear regression; the second one should be approximated by a polynomial; the third one follows a linear trend whose regression slope is slightly affected by outliers; and the last one is centered on x=8, where the effect of a single outlier clearly introduces noise into the regression. Only by plotting, or carefully analyzing, the data will you be able to notice these issues.
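
As a quick check in R (a minimal sketch using the anscombe dataset that ships with base R, which is the standard toy example for this quartet):

```r
# All four x/y pairs in the built-in 'anscombe' data frame share (almost)
# identical aggregated metrics...
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y))
})

# ...yet plotting them reveals four completely different shapes.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       main = paste("Dataset", i), xlab = "x", ylab = "y")
}
par(op)
```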

Tip 2. Do hypothesis testing.

Can I assume that my data follow a normal distribution? This is a prerequisite for many algorithms. Do these two samples belong to the same population distribution? This is especially useful when you are preprocessing data and are not 100% sure about the data sources, or when working with stratified sampling. Does my recommendation engine actually work well?

There are many scenarios where testing your hypothesis at some level of significance is critical, so this is a must. Let me summarize some tests you may find useful; there is life beyond the Kolmogorov-Smirnov and Student's t tests:

Hypothesis testing
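
For instance, a few of these tests are one-liners in base R (a hedged sketch on simulated data; the point is only to show the calls, not a full analysis):

```r
set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)   # a sample that really is normal
y <- rexp(200, rate = 0.1)           # a clearly non-normal sample

shapiro.test(x)      # Shapiro-Wilk: H0 = the sample comes from a normal distribution
ks.test(x, y)        # two-sample Kolmogorov-Smirnov: H0 = both samples share a distribution
wilcox.test(x, y)    # Mann-Whitney / Wilcoxon: non-parametric comparison of locations
```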

Additionally, before running your A/B test, I strongly recommend performing a power analysis with the expected effect size in order to obtain the minimal sample size for your test/control groups.
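
A hedged sketch of such a power analysis with base R's power.prop.test (the baseline and expected conversion rates below are made-up numbers):

```r
# How many users per group do I need to detect a lift from 10% to 12% conversion?
power.prop.test(p1 = 0.10, p2 = 0.12,
                sig.level = 0.05,   # significance level (alpha)
                power     = 0.80)   # desired power (1 - beta)
# The 'n' in the output is the minimal sample size *per group*.
```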

Related to hypothesis testing, it is very important to understand the Central Limit Theorem (CLT):

Take samples of many observations which are independent and whose variance is neither null nor infinite. The distribution of the sample mean approaches a normal distribution as the sample size gets larger.

There is no assumption about the probability distribution of the variables; the theorem remains true no matter what the underlying distribution is. This very powerful theorem is core to Data Science, since it helps us draw inferences about the population from our samples.

Statistical diagram (Image by author)

In addition to the CLT, understanding Confidence Intervals (CI) is necessary. Once we have obtained a metric value from our sample data, we can infer the approximate value of this metric in the population. This is done at a given confidence level (generally 99%, 95% or 90%). If I take an infinite number of samples and calculate the confidence interval for a metric in each of them, then 99/95/90% of those intervals will contain the real population metric. Thus, 1/5/10% of the samples will not contain the real population value for that metric. This should not be ignored: we cannot guarantee that the real value is inside the interval.

Let me show you an interesting shortcut that I have applied to obtain a minimal sample size when the whole dataset (millions of rows) is available, in order to train a model that works on average values (when computing resources are limited, time benefits from small sample sizes). If I know that my dataset follows a normal distribution and the features are (mostly) independent of each other, I can use the formula for the CI of the population mean when the standard deviation is known. For that, I just take the most important feature (column) of my whole dataset. Then I obtain its average (let's say 32.87), its standard deviation (26.59) and the margin of error I am willing to assume (for example, 3% of the mean). Et voilà, I can save time and get a reliable result by taking a sample size of 2793 elements:

A trick when my computation is bounded by dataset size and certain requisites are fulfilled (Image by author)
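
The same calculation in R (a sketch reproducing the numbers above; the 3% margin of error is interpreted as 3% of the mean):

```r
m <- 32.87               # mean of the key feature
s <- 26.59               # its standard deviation
E <- 0.03 * m            # margin of error: 3% of the mean
z <- qnorm(0.975)        # z-score for a 95% confidence level (~1.96)

n <- (z * s / E)^2
round(n)                 # ~2793 rows are enough for the stated precision
```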

Step 1. Inspect carefully each feature (column) of your dataset

If you have reached this point of the text, I assume you have read Tip 1 and Tip 2. Now it is time to put all the theory into practice. I will focus on R in my examples, but do not worry: equivalent syntax exists in Python.

In addition to using the command summary(), plotting each feature of the dataset is mandatory, as it will help us understand the distribution of the data. For that, I always recommend the Histogram, the Density Plot, the Box Plot and the Bar Plot. Furthermore, it is also a good idea to use the Missing Plot.

The histogram is quite useful for detecting outliers, although it only applies to numerical attributes. The command hist in R will let you draw it. To improve your analysis, you should also draw a Density Plot with the command plot(density(…)), since it is a smoothed version of the histogram that helps you recognize the shape of the distribution without being affected by the bins. You can also run a Bar Plot analysis (command barplot) for categorical features.

Histogram and Density Plot (Image by author)

The Box Plot is much more interesting (command boxplot): it tells us about the symmetry of the data, whether they are skewed and how they are grouped. The interquartile range (IQR) is the difference between the upper quartile and the lower quartile (Q3 − Q1), while the whiskers extend up to 1.5*IQR beyond the quartiles. Outside this limit, we obtain the candidates for outliers. In the case of a normal distribution, as we see in the figure below, 50% of the data is contained in the IQR, while outliers amount to about 0.7% of the data.

The Box-Plot on a Normal distribution (Source: KDNuggets)

After this, you can continue your analysis with a Missing Plot (command missmap). It helps you see the portion of missing data in your dataset, but also easily identify the features with the most NA occurrences. The x-axis shows the features, while the y-axis shows the row number; that means horizontal lines mark the missing attributes of a row, and vertical blocks show the amount of missing data for a specific feature.
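
Putting Step 1 together, here is a minimal sketch on a made-up data frame (missmap comes from the Amelia package, one common provider of this plot, so it is assumed to be installed):

```r
library(Amelia)   # for missmap()

set.seed(1)
df <- data.frame(
  age    = c(rnorm(95, mean = 35, sd = 8), rep(NA, 5)),    # numeric, with some NAs
  income = rlnorm(100, meanlog = 10, sdlog = 0.6),          # long-tail numeric feature
  city   = factor(sample(c("Madrid", "Paris", "Rome"), 100, replace = TRUE))
)

summary(df)                       # aggregated metrics, as a first glance

hist(df$income)                   # histogram: shape and outlier candidates
plot(density(df$income))          # smoothed version, not affected by the bins
boxplot(df$age)                   # symmetry, IQR and outlier candidates
barplot(table(df$city))           # frequencies of a categorical feature
missmap(df)                       # where the NAs are, feature by feature
```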

Finally, there are more plots for showing the interactions of features with each other: Correlation plot, Scatterplot, etc.

Step 2. Formatting the data

At this point, you should have an idea of the distribution, the outliers and the missing values of your data.

The first question is whether your data are consistent. For that, you should ask yourself:

  • Do all data follow the business constraints? (Validity)
  • If data are missing, can I re-check the source? (Completeness)
  • Do values contradict each other? (Consistency)
  • Do all instances use the same unit of measure? (Uniformity)

After that, remove extra white spaces and pad string values. Also, fix typos (you can find an interesting article on CRAN for that). As a piece of advice, the order of columns matters, especially when computing correlations or working with dummies: try to sort your dataset, placing the numeric attributes at the beginning. It is also good practice to make sure you are not working with duplicated entries (use group-by or unique routines on your data).

The next thing to do is to deal with missing values. This is one of the most challenging aspects of the Data Preparation stage. Here you can find an extensive article. Basically, check whether the missing values correspond to a pattern or are random among your instances. If they are random, you can choose between dropping the whole instance or using statistical information to regenerate the value. Here, the median is a good candidate for numerical values, since it is more robust than the mean; the mean can be used if the data are not skewed. For categorical values, the mode is a good approximation.

How can you detect whether the missing values of column A follow a pattern? Assuming that no entries of column B are missing, take a sample from the instances that have no missing values in column A, and take another sample from the instances with missing values in column A. Then analyze the distribution of column B in both samples. Do they follow the same distribution? Repeat this operation for the other columns with no missing values.

If you do find a pattern (i.e. a cluster), you can use a KNN, a linear regressor or the statistical metrics of the involved cluster to regenerate the missing values.
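
A rough sketch of that check in R (df, A and B are placeholder names; the median fill-in at the end is the simple option mentioned above, not the KNN one):

```r
# Are the NAs in column A random with respect to column B?
b_when_a_missing <- df$B[is.na(df$A)]    # B where A is missing
b_when_a_present <- df$B[!is.na(df$A)]   # B where A is observed

ks.test(b_when_a_missing, b_when_a_present)   # H0: both come from the same distribution

# If the missingness looks random, a simple and robust fill-in:
df$A[is.na(df$A)] <- median(df$A, na.rm = TRUE)
```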

What about outliers? The most accurate rule here is common sense, and for the sake of common sense it is necessary to understand the distribution of the data first. Do not drop outliers if the resulting model is critical (if you work at NASA, Airbus or similar, please think twice before dropping anything). Otherwise, if the outlier is there because of a mistake during the process, drop it. Drop outliers if they do not affect the accuracy or the final result. However, if dropping an outlier changes the results, run an analysis before and after dropping it in order to understand what is happening.

If you finally decide to drop the outlier, you have two options. On the one hand, you can eliminate the whole row (instance) if you are not completely certain about the cause but you have enough data. On the other hand, you can try to fill the gap: the median or the mode are good options in case of a data-entry mistake, but use the nearest value otherwise. If you have decided to keep the outlier, then transform the data: log, sqrt and softmax work well, as does PCA.

And how can you identify outliers? If your data follow a normal distribution, use Peirce's criterion, Chauvenet's criterion, Grubbs' test or Dixon's Q test. An easier way is the 68-95-99.7 rule: if your distribution is approximately normal, the 1sd, 2sd and 3sd ranges will comprise 68%, 95% and 99.7% of the dataset, respectively. Instances farther than 3*sd from the mean can be considered outliers. Another technique is to use the Box Plot we have seen above. For other kinds of distributions, read about DBSCAN, Isolation Forest or Random Cut Forest for detecting outliers.

The 99.7% rule for detecting outliers in Normal distributions (Source: TowardsDataScience)
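
In R, both the 3-sigma rule and the box-plot criterion take a couple of lines (a sketch; x is a placeholder for an approximately normal numeric feature):

```r
x <- df$income                 # placeholder feature
x <- x[!is.na(x)]

z <- abs(x - mean(x)) / sd(x)
x[z > 3]                       # candidates beyond 3 standard deviations

boxplot.stats(x)$out           # candidates beyond the 1.5*IQR whiskers
```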

Regarding the question of working with the whole dataset or just a sample of it: I strongly recommend training on and analyzing all the data. Using sampling to save time and costs was fine some years ago, but computing resources are no longer a limit thanks to the Cloud. Sampling requires making statistical inferences about the population, and those inferences will contain errors: selection bias and sampling error. Here you can find parametric and non-parametric methods for inference and sample size calculations.

As you likely know, in order to work with categorical values, we use dummy variables, which are artificial variables created to represent an attribute with 2 or more distinct levels. This way of coding is also known as one-hot encoding. If your data column has K different levels, then you must use K-1 dichotomous variables to represent it, as the K-th variable is redundant and creates multicollinearity.
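
A quick illustration in R of the K-1 encoding (a sketch; model.matrix drops one level automatically by treating it as the reference level, and caret::dummyVars(..., fullRank = TRUE) behaves similarly):

```r
city <- factor(c("Madrid", "Paris", "Rome", "Paris"))

model.matrix(~ city)   # intercept + K-1 = 2 dummy columns (Madrid is the reference)
```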

Finally, Feature Selection is another core part of the Data Preparation stage. You can do this in several ways, using backward or forward techniques; an interesting approach is automatic feature elimination (basically a backward method; caret's rfe, Recursive Feature Elimination, is one implementation). It iterates over the number of features: on each iteration, a model is trained and tested with the current dataset, and each attribute is scored by importance; the least important one is removed for the next iteration. Basically, it looks for the attributes that contribute most to the prediction of the target variable. This is an easy way to find the importance of your attributes. However, if the attributes are highly correlated, its ability to identify strong attributes might be affected.
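
A hedged sketch with caret's rfe() (df and target are placeholder names; rfFuncs relies on the randomForest package being installed):

```r
library(caret)

set.seed(7)
ctrl <- rfeControl(functions = rfFuncs,   # random-forest based importance scoring
                   method = "cv",
                   number = 5)

result <- rfe(x = df[, setdiff(names(df), "target")],
              y = df$target,
              sizes = c(2, 4, 8),         # candidate subset sizes to evaluate
              rfeControl = ctrl)

predictors(result)                        # the attributes that survived
```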

Step 3. Transforming the data

Some Machine Learning algorithms need the data in a specific form, whereas others can perform better if the data are accurately transformed. However, these transforms are application specific. Here is a list of transformations you can find in R (see the caret sketch right after the list):

  • Scale: Divides each value by the standard deviation of the attribute.
  • Center: Subtracts the mean of the attribute from each value.
  • Standardize: Scale + Center.
  • Normalize: Scales values into the range [0, 1].
  • Log, SQRT.
  • Box-Cox: Transforms non-normal variables towards a normal shape. Only for positive values.
  • Yeo-Johnson: Similar to Box-Cox, but it also handles negative and zero values.
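
Most of these are available through caret's preProcess (a hedged sketch; num_cols is a placeholder for the numeric columns of your data frame):

```r
library(caret)

num_cols <- df[, sapply(df, is.numeric)]

pp <- preProcess(num_cols, method = c("center", "scale", "YeoJohnson"))
transformed <- predict(pp, num_cols)

# Range scaling to [0, 1] is a separate method:
pp01     <- preProcess(num_cols, method = "range")
scaled01 <- predict(pp01, num_cols)
```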

For example, the log transformation is very useful when you have a long-tail distribution (mainly caused by the effect of outliers, which is very common in Pareto distributions) and you want to use an algorithm that expects a normal curve. In that case, the log transformation can help. However, once you have applied the transformation, you should check whether the resulting feature actually follows the normal curve. You can do it in two ways: comparing graphically what you got against what you expected, or running a hypothesis test.

In the picture below, see how much money people spend on pizza depending on the season. As you can see, this is a long-tail distribution: few people spend a lot of money, while most people spend much less. If we apply log(x), the resulting distribution should look like a normal one. Actually, when you plot it, it seems so. But this is not enough: use a Kolmogorov-Smirnov test to check whether you can reject the normality hypothesis (the null hypothesis).

Transforming a tail distribution into normal with Log (Image by the author)
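
A sketch of that workflow in R on simulated spending data (the numbers are made up; note that the Shapiro-Wilk test is also a common choice for checking normality):

```r
set.seed(3)
spend     <- rlnorm(500, meanlog = 3, sdlog = 0.8)   # simulated long-tail spending
log_spend <- log(spend)

hist(log_spend)                                      # graphical check

ks.test(as.numeric(scale(log_spend)), "pnorm")       # KS against a standard normal
shapiro.test(log_spend)                              # Shapiro-Wilk normality test
```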

Another issue may be working with unbalanced datasets, especially with classification algorithms (think about decision trees). In order not to fool ourselves, we should work with robust metrics like Precision, Recall, F1 Score, Kappa or ROC curves (analyzing sensitivity and specificity), rather than Accuracy. It may also be interesting to consider resampling to balance the different classes of the dataset. Up- and down-sampling are common options, but please also dive into the ROSE and SMOTE techniques, which create synthetic values that balance the dataset. In my experience, SMOTE works better than ROSE, since it creates fewer unrealistic values than ROSE does. However, think about the consequences of doing this; it is not free at all.
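
A hedged sketch with the ROSE package (churn and df are placeholder names; SMOTE has several R implementations, e.g. the smotefamily package, with slightly different interfaces):

```r
library(ROSE)

table(df$churn)                                  # check the original class imbalance

balanced <- ovun.sample(churn ~ ., data = df,
                        method = "both",         # mix of over- and under-sampling
                        p = 0.5, seed = 1)$data
table(balanced$churn)                            # roughly balanced classes
```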

Finally, as I briefly mentioned above, many algorithms can suffer if the attributes are highly correlated. One option is to detect pairs of attributes with a high degree of correlation (>0.7) and discard one of them. However, a more sophisticated way is to transform the attributes in order to make them uncorrelated. This is possible with techniques such as Principal Component Analysis (PCA).

PCA is an important technique from linear algebra that works with orthogonal projections in order to produce linearly uncorrelated attributes. It can be understood as the way of taking a picture of someone: depending on the angle of the photo, you zoom in or out on each dimension of the body. If you find the correct angles (orthogonal projections in PCA), you will know how much each dimension contributes to the overall body (think of plan, elevation and section drawings).

Orthogonal projections (Source: builtin.com)

Before applying PCA, the variance is more or less the same for the projection of each instance directly onto each axis (see the figure below). However, if we project each instance vector onto an orthogonal basis (uncorrelated components), we realize that some components have more variance than others. The principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the dataset.

PCA transformation (Image by author)

An interesting application of PCA is dimensionality reduction. Among all the components, I keep only the ones with the highest variance (as they carry most of the significance of the dataset). Of course, the loss will not be important, since most of the information is carried by the components with the highest variance. In any case, we should standardize the data prior to PCA, as this technique is quite sensitive to the variances of the initial variables. Other recommended techniques are LDA and Auto-Encoders, with which you can do the same. I invite you to explore these methods in greater detail.
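
In R, prcomp covers the standardization and the projection in one call (a sketch; num_cols is again a placeholder for the numeric columns, and the 95% threshold is an arbitrary choice):

```r
num_cols <- df[, sapply(df, is.numeric)]

pca <- prcomp(num_cols, center = TRUE, scale. = TRUE)
summary(pca)                                  # proportion of variance per component

# Keep the components that explain, say, 95% of the variance:
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k       <- which(cum_var >= 0.95)[1]
reduced <- pca$x[, 1:k, drop = FALSE]
```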

To conclude, I would like to say that each project requires a great deal of context understanding and common sense; spending a lot of time analyzing the data you have matters more than being an expert in Data Science. In any case, I think you can use these tips and steps as a first approach to fully understanding your data and putting your project on a solid footing.

Adrian Perez works as a Data Scientist and holds a PhD in Parallel Algorithms for Supercomputing. You can check out more of his Spanish and English content on his Medium profile.

The post The Definitive Guide for Data Preparation that Beginners should read appeared first on Towards Data Science.

How the hell are GPUs so fast? https://towardsdatascience.com/how-the-hell-are-gpus-so-fast-a-e770d74a0bf/ Tue, 06 Oct 2020 18:57:15 +0000 https://towardsdatascience.com/how-the-hell-are-gpus-so-fast-a-e770d74a0bf/ A HPC walk along Nvidia CUDA-GPU architectures. From zero to nowadays.

A HPC walk along Nvidia CUDA-GPU architectures. From zero to nowadays.

Photo by Rafael Pol on Unsplash

Someone defined Machine Learning as the perfect harmony among maths (algorithms), engineering (High Performance Computing) and human ability (experience). So, any progress in any of these fields helps Machine Learning grow. Today it is the turn of HPC; specifically, we are talking about advances in GPUs.

Nvidia just announced its GeForce RTX 30 Series (RTX 3090, RTX 3080, RTX 3070), based on the Ampere architecture. Ampere is the latest architecture from our favorite GPU brand, but several generations of CUDA-capable GPUs have been released so far. In the following paragraphs, I will give a global overview of the CUDA architectures from the beginning up to today; let's drive the interesting road from Fermi to Ampere together. But before going into further detail, I strongly recommend visiting my previous post about the CUDA execution model if you are not familiar with GPU computing.

Following the natural timeline of Nvidia GPUs, the company first produced a chip capable of programmable shading in 2001, the GeForce 3, a derivative of which (the NV2A) was used in the original Xbox. Prior to the GeForce 3 (codename NV20) there were some others: NV1 (1995), NV3 (1997), NV4 (1998), NV5 (1999), GeForce I (late 1999) and GeForce II (2000). However, the GeForce 3 was likely the first popular Nvidia GPU.

It is interesting to point out the difference between target categories and architectures in the Nvidia world, which can be confusing for readers. Traditionally, Nvidia has designed a different type of product for each target category of clients, giving rise to four different product lines: GeForce, Quadro, Tesla and (more recently) Jetson, although the underlying architecture used internally is the same for all four. In Nvidia's terms, all four have the same compute capability. The GeForce line is focused on desktops and gamers; Quadro is aimed at workstations and developers who create video content; Tesla is designed for supercomputers and HPC. Finally, the Jetson line provides embedded GPUs on chips.

As we just saw above, Nvidia started its adventure in the early 90s with GPUs focused on graphics, but we had to wait until 2007 to work with the first CUDA architecture: Tesla (yes, you are right, they later used the same name for a product line, which is why I said it can be confusing). Tesla is a quite simple architecture, so I decided to start directly with Fermi, which introduced Error-Correcting Code (ECC) memory and considerably improved context switching, the memory hierarchy and double-precision performance.

Fermi Architecture

Each Fermi Streaming Multiprocessor (SM) is composed of 32 CUDA cores (Streaming Processors), 16 load/store units (LD/ST units) to address memory operations for sixteen threads per clock, four special function units (SFU) to execute transcendental mathematical instructions, a memory hierarchy and warp schedulers.

Fermi Streaming Multiprocessor (Image by author)

The board has six 64-bit memory partitions with a 384-bit memory interface, which supports up to 6 GB of GDDR5 DRAM memory. The CPU is connected to the GPU via a PCI-e bus. Each CUDA core has a fully pipelined arithmetic logic unit (ALU) as well as a floating point unit (FPU). In order to execute double precision, the 32 CUDA cores can operate as 16 FP64 units. Each SM has two warp schedulers, which enable two warps to be issued and executed concurrently.

A key block of this architecture is the memory hierarchy. It introduces 64 KB of configurable shared memory and L1 cache per SM, which can be configured as 16 KB of L1 cache with 48 KB of shared memory, or 16 KB of shared memory with 48 KB of L1 cache. Whereas the CPU L1 cache is designed for spatial and temporal locality, the GPU L1 is only optimized for spatial locality: frequent accesses to a cached L1 memory location do not increase the probability of hitting the datum, but the cache is attractive when several threads access adjacent memory locations. The 768 KB L2 cache is unified, shared among all SMs, and services all operations (load, store and texture). Both caches are used to store data in local and global memory, including register spilling. However, it is necessary to configure whether reads are cached in both L1 and L2, or only in L2. This architecture is labelled compute capability 2.x, the Nvidia term that describes the hardware version of the GPU and comprises a major revision number (left digit) and a minor revision number (right digit). Devices with the same major revision number belong to the same core architecture, whereas the minor revision number corresponds to an incremental improvement to the core architecture.

Fermi Memory Hierarchy (Image by author)

Kepler Architecture

Kepler includes up to 15 SMs and six 64-bit memory controllers. Each SM has 192 single-precision CUDA cores, 64 double-precision units, 32 SFUs, 32 LD/ST units and 16 texture units.

Kepler Streaming Multiprocessor (Image by author)

There are also four warp schedulers, each with two dispatch units, which allow four warps to be issued and executed concurrently. Kepler also increases the number of registers accessible by each thread, from 63 in Fermi to 255; it introduces the shuffle instructions and improves atomic operations with native support for FP64 atomics in global memory. It also introduces CUDA Dynamic Parallelism, the capability of launching kernels from within a kernel. Additionally, the memory hierarchy is organized similarly to Fermi.

Kepler Memory Hierarchy (Image by author)

The 64 KB shared memory/L1 cache is improved by permitting a 32 KB/32 KB split between the L1 cache and shared memory. It also increases the shared memory bank width from 32 bits in Fermi to 64 bits, and introduces a 48 KB Read-Only Data cache to cache constant data. The L2 cache is also increased to 1536 KB, doubling the Fermi L2 cache capacity. Additionally, Kepler compute capabilities are represented with the 3.x code.

Maxwell architecture

Maxwell consists of up to 16 SMs and four memory controllers. Each SM has been reconfigured to improve performance per watt. It contains four warp schedulers, each capable of dispatching two instructions per warp every clock cycle. The SM is partitioned into four 32-CUDA-core processing blocks, each with eight texture units, 8 SFUs and 8 LD/ST units.

Maxwell Streaming Multiprocessor (Image by author)

Regarding the memory hierarchy, it features 96 KB of dedicated shared memory (although each threadblock can only use up to 48 KB), while the L1 cache is shared with the texture caching function. The L2 cache provides 2048 KB of capacity. The memory bandwidth is also increased, from 192 GB/s in Kepler to 224 GB/s, and native support is introduced for FP32 atomics in shared memory. Maxwell is represented as compute capability 5.x.

Maxwell Memory Hierarchy (Image by author)

Pascal Architecture

A Pascal board is composed of up to 60 SMs and eight 512-bit memory controllers. Each SM has 64 CUDA cores and four texture units. It has the same number of registers as Kepler and Maxwell, but provides much more SMs, thus many more registers overall. It has been designed to support many more active warps and threadblocks than previous architectures. The shared memory bandwidth is doubled to execute code more efficiently. It allows the overlapping of load/store instructions to increase floating point utilization, also improving the warp scheduling, where each warp scheduler is capable of dispatching two warp instructions per clock. CUDA cores are able to process both 16-bit and 32-bit instructions and data, facilitating the use of deep learning programs, but also providing 32 FP64 CUDA cores for numerical programs. Global memory native support is also extended to include FP64 atomics.

Pascal Streaming Multiprocessor (Image by author)

The memory hierarchy configuration is also changed. Each memory controller is attached to 512 KB of L2 cache, providing 4096 KB of L2 cache, and introduces HBM2 memory, providing a bandwidth of 732 GB/s. It presents 64 KB of shared memory per SM, and an L1 cache that can also serve as texture cache, which acts as a coalescing buffer to increase warp data locality. Its compute capability is represented with the 6.x code.

Pascal Memory Hierarchy (Image by author)

Finally, Pascal introduces the NVLink technology. The idea behind it is that 4-GPU and 8-GPU system configurations can work on the same problem. Several groups of multi-GPU systems can even be interconnected using InfiniBand® and 100 Gb Ethernet to form much larger and more powerful systems.

Volta Architecture

A Volta board has up to 84 SMs and eight 512-bit memory controllers. Each SM has 64 FP32 CUDA cores, 64 INT32 CUDA cores, 32 FP64 CUDA cores, 8 tensor cores for deep learning matrix arithmetic, 32 LD/ST units and 16 SFUs. Each SM is divided into 4 processing blocks, each containing a new L0 instruction cache, to provide higher efficiency than previous instruction buffers, and a warp scheduler with a single dispatch unit, as opposed to Pascal's 2-partition setup with two dispatch ports per sub-core warp scheduler. This means that Volta loses the capability to issue a second, independent instruction from a thread in a single clock cycle.

Volta Streaming Multiprocessor (Image by author)

A merged 128 KB L1 Data Cache/shared memory is introduced, providing 96 KB of shared memory. The HBM2 bandwidth is also improved, obtaining 900 GB/s. Additionally, the full GPU includes a total of 6144 KB of L2 cache and its compute capability is represented with the 7.0 code.

Volta Memory Hierarchy (Image by author)

However, the biggest change comes from its independent thread scheduling. Previous architectures execute warps in SIMT fashion, where a single program counter is shared among the 32 threads. In the case of divergence, an active mask indicates which threads are active at any given time, leaving some threads inactive and serializing the execution for the different branch options. Volta includes a program counter and call stack per thread. It also introduces a schedule optimizer that determines what threads from the same warp must execute together into SIMT units, giving more flexibility, as threads can now diverge at sub-warp granularity.

The newest breakout feature of Volta is the Tensor Core, which makes it up to 12x faster than the previous Pascal P100 accelerator for deep learning applications. Tensor cores are essentially arrays of mixed-precision FP16/FP32 units. Each of the 640 tensor cores operates on a 4×4 matrix, and their associated datapaths are custom-designed to increase the floating-point compute throughput of the operations over this kind of matrix. Each tensor core performs 64 floating-point fused multiply-add (FMA) operations per clock, delivering up to 125 TFLOPS for training and inference applications.
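
As a rough sanity check of that 125 TFLOPS figure (assuming the V100's published boost clock of about 1.53 GHz, which is not stated in this article):

$$
640~\text{cores} \times 64~\tfrac{\text{FMA}}{\text{core}\cdot\text{clock}} \times 2~\tfrac{\text{FLOP}}{\text{FMA}} \times 1.53\times10^{9}~\tfrac{\text{clocks}}{\text{s}} \approx 1.25\times10^{14}~\text{FLOPS} \approx 125~\text{TFLOPS}
$$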

Additionally, the second generation of NVLink delivers higher bandwidth, more links and improved scalability for multi-GPU system configurations. Volta GV100 supports up to 6 NVLink links with a total bandwidth of 300 GB/s, compared to 4 NVLink links and 160 GB/s of total bandwidth on GP100.

Turing (Not a completely new) Architecture

Nvidia CEO Jen-Hsun Huang provided an interesting reply about the architectural differences between Pascal, Volta and Turing. Basically, he explained that Volta and Turing target different markets. Volta is intended for large-scale training, with up to eight GPUs that can be connected, the fastest HBM2, and other features aimed specifically at datacenters. Turing, on the other hand, is designed with three applications in mind: professional visualization, video gaming, and image generation using the Tensor Cores. Actually, Turing has the same major compute capability as Volta, 7.x, which is why I said that Turing is not a completely new architecture.

The most remarkable achievements are the use of GDDR6 memory and the introduction of RT cores, which enable the rendering of visually realistic 3D games and complex professional models:

Turing Streaming Multiprocessor. Source: NVIDIA

Like Volta, the Turing SM is divided into 4 processing blocks, each with a single warp scheduler and dispatch unit. Turing is almost identical to Volta in performing instructions over two cycles, but with schedulers that can issue an independent instruction every cycle. Additionally, rather than per warp as in Pascal, Volta and Turing keep scheduling resources per thread, with a program counter and stack per thread to track thread state.

Ampere Architecture

The most recent CUDA architecture is called Ampere and delivers the highest GPU performance so far. Someone else has done a really complete review here, which I strongly recommend.

Each Ampere SM contains four processing blocks, each with an L0 instruction cache, a warp scheduler, 16 INT32 CUDA cores, 16 FP32 CUDA cores, 8 FP64 CUDA cores, 8 LD/ST cores, a Tensor Core for matrix multiplication, and a 16K 32-bit register file. Each SM has 192 KB of combined shared memory and L1 data cache, and at the GPU level there is 40 MB of L2 cache to increase performance (7x larger than the V100 in Volta). The L2 cache is split into two partitions to achieve higher bandwidth.

Ampere Streaming Multiprocessor. Source: NVIDIA.

The GA100 GPU chip is composed of 128 SMs, but mainly for manufacturing (and marketing) reasons, the different Ampere GPUs only enable some of them. For example, the A100 GPU only exposes 108 SMs. In any case, the full GA100 is composed of 8 GPCs, each with 16 SMs, and 6 HBM2 stacks. In the case of the A100 GPU, this translates into 40 GB of HBM2 DRAM memory at 1555 GB/s.

It also introduces the third generation of NVIDIA Tensor Cores (after Volta and Turing), each able to compute an 8×4×8 mixed-precision matrix multiplication per clock (multiplying an 8×4 matrix by a 4×8 matrix). For example, each A100 Tensor Core executes 256 FP16 FMA (Fused Multiply-Add) operations per clock. Ampere supports many data types on its Tensor Cores, including FP16, BF16, TF32, FP64, INT8, INT4 and binary.

Finally, the third generation of NVLink is introduced with Ampere. In the case of the A100 GPU, it has 12 NVLink links with 600 GB/s of total bandwidth for multi-GPU computing. Regarding PCIe connections, the A100 GPU supports PCIe Gen 4, which provides 31.5 GB/s of bandwidth per direction (for x16 connections), doubling the bandwidth of PCIe 3.

References

Fermi whitepaper

Kepler whitepaper

Maxwell whitepaper

Pascal whitepaper

Volta whitepaper

Turing whitepaper

Ampere whitepaper

Parallel algorithms on a GPU

The post How the hell are GPUs so fast? appeared first on Towards Data Science.
