Images from the Convolutional World

We take neural networks one step further than in our previous post, towards classifying images with an operation called convolution.

A CNN (Convolutional Neural Network) is a type of artificial neural network whose layers process information in a way loosely inspired by the human visual system, making it possible for it to identify objects and “see”.

A CNN contains several specialized hidden layers arranged in a hierarchy: for example, the first layers can detect lines, curves, and edges, and the layers become progressively more specialized until the deeper ones recognize complex shapes such as a face or a part of the human body.

On this journey, we first need to understand how to work with this new type of input data:

A digital image has 3 attributes: width and height in pixels (the resolution), and color in RGB (red, green, blue) format called channels.

An RGB image

In our example we will work with a dataset of tiny images called CIFAR-10 that serves our learning purposes just fine. It consists of 60,000 32×32 color (RGB) images, labeled with an integer corresponding to 1 of 10 classes (airplane (0), automobile (1), bird (2), cat (3), deer (4), dog (5), frog (6), horse (7), ship (8), and truck (9)).

CIFAR-10 dataset classes

We will continue using PyTorch for our project, so we will import this dataset from the torchvision library:
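Something like the following should do it (data_path is an assumed local folder):

```python
from torchvision import datasets

data_path = './data'  # assumed local directory for the downloaded files
cifar10 = datasets.CIFAR10(data_path, train=True, download=True)
cifar10_val = datasets.CIFAR10(data_path, train=False, download=True)
```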

Ok, each image of the dataset is an instance of an RGB PIL image, so let’s visualize some of them:
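A quick way to plot one sample (the class-name list and the index here are illustrative):

```python
import matplotlib.pyplot as plt

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

img, label = cifar10[1]  # an arbitrary sample: a PIL image and its integer label
plt.title(f"Label:{label} — Class:{class_names[label]}")
plt.imshow(img)
plt.show()
```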

Label:1 — Class:automobile

Now, we will need to convert our image to a tensor before we can do anything with PyTorch:
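Using torchvision's ToTensor transform, for instance (img is the PIL image from the previous snippet):

```python
from torchvision import transforms

to_tensor = transforms.ToTensor()
img_t = to_tensor(img)   # converts the PIL image to a C x H x W float tensor
print(img_t.shape)
```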

torch.Size([3, 32, 32])

So, the image has been turned into a 3 (RGB channels) × 32 × 32 tensor. Ok, let's convert the whole image dataset to PyTorch tensors using the dataset's transform parameter:
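For example, reusing the data_path from before:

```python
from torchvision import datasets, transforms

tensor_cifar10 = datasets.CIFAR10(data_path, train=True, download=False,
                                  transform=transforms.ToTensor())
img_t, _ = tensor_cifar10[1]
print((img_t.shape, img_t.dtype))
```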

(torch.Size([3, 32, 32]), torch.float32)

The ToTensor transform turns the data into 32-bit floating-point values per channel, scaling them down to the range 0.0 to 1.0. Also keep in mind that to view the image with matplotlib again we must change the order of the axes (from RGB channels-H-W to H-W-RGB channels).
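A small sketch of that axis permutation (img_t and plt come from the snippets above):

```python
# matplotlib expects H x W x C, so permute the channels-first tensor before plotting
plt.imshow(img_t.permute(1, 2, 0))
plt.show()
```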

Normalize image data

Why normalize? By choosing activation functions that are roughly linear around 0, plus or minus 1 (or 2), and keeping the data in that same range, neurons are more likely to have non-zero gradients and therefore learn sooner. Also, normalizing each channel to have the same distribution ensures that channel information can be mixed and updated through gradient descent using the same learning rate.

So, we have to compute the mean value and the standard deviation of each channel across the dataset, and then apply a normalization transform with those values.

Let’s compute them for the CIFAR-10 training set:
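One way to do it is to stack the whole training set into a single tensor (this assumes the tensor_cifar10 dataset created above):

```python
import torch

# Stack every training image along a new last dimension: 3 x 32 x 32 x 50000
imgs = torch.stack([img_t for img_t, _ in tensor_cifar10], dim=3)
print(imgs.shape)

# Per-channel mean over all pixels of all images
print(imgs.view(3, -1).mean(dim=1))
```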

torch.Size([3, 32, 32, 50000])

tensor([0.4914, 0.4822, 0.4465])

view(3, -1) keeps the 3 channels and merges all the remaining dimensions into one, so our 3 × 32 × 32 image is transformed into a 3 × 1,024 vector, then the mean is taken over the 1,024 elements of each channel.
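The per-channel standard deviation is computed the same way:

```python
print(imgs.view(3, -1).std(dim=1))
```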

tensor([0.2470, 0.2435, 0.2616])
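With those values we can build the normalized dataset, roughly like this:

```python
from torchvision import datasets, transforms

# Compose ToTensor with Normalize, using the mean and std computed above
transformed_cifar10 = datasets.CIFAR10(
    data_path, train=True, download=False,
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465),
                             (0.2470, 0.2435, 0.2616)),
    ]))
```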

Suppose we want to build a fully connected NN (as explained in the previous post), and this time we will make it deep by adding new hidden layers. Our input layer will take 3,072 values (3 * 32 * 32) and produce 1,024 values for the 1st hidden layer, then 512 for the 2nd hidden layer and 128 for the 3rd, ending in an output layer with 10 values (the probability of each of the classes).
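A sketch of that network in PyTorch (the Tanh activations here are an assumption; they do not affect the parameter count):

```python
import torch.nn as nn

fc_model = nn.Sequential(
    nn.Linear(3072, 1024),   # 3 * 32 * 32 input values
    nn.Tanh(),
    nn.Linear(1024, 512),
    nn.Tanh(),
    nn.Linear(512, 128),
    nn.Tanh(),
    nn.Linear(128, 10),      # one output per class
)
```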

Let's see how many parameters (weights and biases, listed per layer) we would have in our network:
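Counting them by iterating over the model's parameters:

```python
numel_list = [p.numel() for p in fc_model.parameters()]
print((sum(numel_list), numel_list))
```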

(3738506, [3145728, 1024, 524288, 512, 65536, 128, 1280, 10])

Wow!! 3,738,506 parameters, why so many?

Remember that a linear layer computes y = weight * x + bias, and if x has length 3,072, and y must have length 1,024, then the weight tensor needs to be of size 1,024 × 3,072 and the bias size must be 1,024. So 1,024 * 3,072 + 1,024 = 3,146,752 parameters for the 1st layer.

This is telling us that our NN does not scale well with the number of pixels: imagine what would happen if we had 1024x1024 RGB images. That is already 3.1 million input values, and the first hidden layer alone would need more than 3 billion parameters!!

Convolutions to the rescue

If we want to recognize patterns corresponding to objects, such as a car on a road, we will probably need to look at how nearby pixels are arranged, and we will be less interested in how pixels that are far apart appear in combination: the important feature combinations tend to come from pixels that are close together. If we wanted to detect our Ford car in an image, it wouldn't matter whether there is a tree or a cloud in the corner.

To translate this mathematically, we could compute the weighted sum of a pixel with its immediate neighbors, rather than with all other pixels in the image. This would be equivalent to constructing weight matrices, one per output feature and output pixel location, in which all weights beyond a certain distance from a central pixel would be zero.

For these localized patterns to have an effect on the output regardless of their location in the image, we need translation invariance.

Fortunately, we have available a linear operation that is local and translation invariant: the convolution.

We can define the convolution for a 2D image as the dot product of a weight matrix (the kernel) with each neighborhood of the input, generating a new output matrix.

convolutional kernel

That kernel will move from left to right and top to bottom through the input image, as if a patch were being put on the image.
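To make the sliding-window idea concrete, here is a naive single-channel convolution (illustrative only; this is not how PyTorch implements it):

```python
import torch

def conv2d_naive(image, kernel):
    """Single-channel 2D convolution, no padding, stride 1."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = torch.zeros(H - kH + 1, W - kW + 1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of the kernel with the current neighborhood
            out[i, j] = (image[i:i + kH, j:j + kW] * kernel).sum()
    return out
```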

In summary, the advantages are: the operation is local (each output depends only on a neighborhood of the input), it is translation invariant, and it needs far fewer parameters, because the same small kernel is reused across the whole image.

Kernels (also called convolution matrices) are generally square and small (3x3, 5x5) and are usually initialized with random values. Of course, there is a tradeoff in choosing the kernel size that we will talk about later.

Let's start looking at some code. PyTorch provides convolutions for 1, 2, and 3 dimensions: nn.Conv1d for time series, nn.Conv2d for images, and nn.Conv3d for volumes/videos. We will create a 2D convolution for an image:
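For example:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print((conv.weight.shape, conv.bias.shape))
```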

(torch.Size([16, 3, 3, 3]), torch.Size([16]))

Ok, let's apply the convolution to our example image:
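Reusing the img_t tensor from before:

```python
batch = img_t.unsqueeze(0)   # add a batch dimension: 1 x 3 x 32 x 32
output = conv(batch)
print((batch.shape, output.shape))
```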

(torch.Size([1, 3, 32, 32]), torch.Size([1, 16, 30, 30]))

The unsqueeze adds a new dimension 0 to the input, because Conv2d expects a tensor of shape B(atch) × C × H × W; in this case the batch is just one image.

And let's display the convolved image:
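For instance, plotting the first output channel:

```python
# detach() because the convolution output still carries gradient information
plt.imshow(output[0, 0].detach(), cmap='gray')
plt.show()
```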

Note that the shape of the output is 30x30 and not 32x32, so after the convolution we're missing two pixels in each dimension. To solve this, PyTorch gives us the possibility of padding the image by creating ghost pixels around the border that have value zero.
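For example:

```python
# padding=1 adds a one-pixel border of zeros, keeping the 32 x 32 output size
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
print(conv(img_t.unsqueeze(0)).shape)   # torch.Size([1, 16, 32, 32])
```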

Now let’s say we want our kernel to perform an edge detection. How could we assign the weights to this new kernel?
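One illustrative possibility is to write the weights by hand; the exact values below are just an example of a vertical edge detector:

```python
import torch

# Negative weights on the left column, positive on the right:
# the output responds strongly to vertical edges
with torch.no_grad():
    conv.weight[:] = torch.tensor([[-1.0, 0.0, 1.0],
                                   [-1.0, 0.0, 1.0],
                                   [-1.0, 0.0, 1.0]])
    conv.bias.zero_()
```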

In this way we can build many more elaborate filters. The job of a CNN is to estimate the kernels of a set of filter banks in successive layers that transform a multichannel image into another multichannel image, where different channels correspond to different features.

We will have one output channel per kernel (e.g., one channel for the local average, another channel for vertical edges, etc.).

Kernel size tradeoff

A smaller kernel gives you a lot of detail, but it can lead to overfitting and is computationally more expensive.

A larger kernel loses a lot of detail and can lead to underfitting, but computation is faster and memory usage is smaller.

So, you should tune your model to find the best size. It's very common to use odd kernel sizes, with 3x3 and 5x5 being the most used.

Downsampling, or pooling, aims to reduce the spatial dimensions of the image using mathematical operations such as average or max pooling. Combining convolutions and downsampling can help us recognize larger structures.

For example, scaling an image down by half is the equivalent of taking 4 neighboring pixels (locality) as input and producing one pixel as output. This downsampling could be done by averaging the 4 pixels (average pooling) or by taking their maximum (max pooling):

Max Pooling example
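In PyTorch this is a one-liner; a minimal sketch on a random 16-channel tensor:

```python
import torch
import torch.nn as nn

# 2 x 2 max pooling halves the spatial dimensions
pool = nn.MaxPool2d(2)
pooled = pool(torch.randn(1, 16, 32, 32))
print(pooled.shape)   # torch.Size([1, 16, 16, 16])
```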

Downsampling helps capture essential structural features of rendered images without fussing with fine details and generally acts as a noise suppressant.

Advantages of combining convolutions and pooling

In the example above, the first set of kernels operates on small neighborhoods and low-level features, while the second set of kernels effectively operates on larger neighborhoods, producing features that are compositions of the previous ones.

This combination gives CNNs the ability to see very complex scenes.

Feature mapping


The feature map is the output of one filter applied to the previous layer.

So, if we have a 1st convolution with 16 kernels, we will have 16 output matrices (feature maps).

CNN coding time!

Ok, it's time to rebuild our neural network with convolutions and pooling, and then check whether we end up with an acceptable number of parameters so that training is faster and computationally cheaper than the fully connected NN.

The first convolution takes the 3 channels to 16, so it generates 16 independent features that serve to discriminate the low-level characteristics of the image; then we apply a Tanh activation function, and finally the 16-channel 32 × 32 image is pooled to a 16-channel 16 × 16 image (MaxPool2d).

The same process applies to a second convolution, Tanh, and pooling; finally, we pass the 8-channel 8x8 image to a linear module that outputs 32 elements, and then to a final linear layer that outputs 10 elements (10 probabilities, 1 per class of image in the CIFAR-10 dataset).
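A sketch of that model as nn.Sequential (note the missing reshape before the first linear layer, which we will run into shortly):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.Tanh(),
    nn.MaxPool2d(2),                  # 16 channels, 16 x 16
    nn.Conv2d(16, 8, kernel_size=3, padding=1),
    nn.Tanh(),
    nn.MaxPool2d(2),                  # 8 channels, 8 x 8
    # a reshape to a 512-element vector is still missing here (see below)
    nn.Linear(8 * 8 * 8, 32),
    nn.Tanh(),
    nn.Linear(32, 10),
)
```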

Now we obtain the number of parameters that this network needs to compare with the fully connected:
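Counting the parameters of the sketch above, as before:

```python
numel_list = [p.numel() for p in model.parameters()]
print((sum(numel_list), numel_list))
```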

(18354, [432, 16, 1152, 8, 16384, 32, 320, 10])

18,354 vs 3,738,506 parameters, that's a great reduction!!

If we try to apply an image to our model to make a prediction, it will give an error. This is because after the last convolution and pooling we must reshape the 8-channel 8x8 image into a 512-element 1D vector.

But unfortunately, we don't have any explicit visibility of the output of each module when we use nn.Sequential in PyTorch, so we must subclass nn.Module:
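A possible subclass, registering every module explicitly (a sketch consistent with the parameter counts above):

```python
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.act1 = nn.Tanh()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.act2 = nn.Tanh()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(8 * 8 * 8, 32)
        self.act3 = nn.Tanh()
        self.fc2 = nn.Linear(32, 10)

    def forward(self, x):
        out = self.pool1(self.act1(self.conv1(x)))
        out = self.pool2(self.act2(self.conv2(out)))
        out = out.view(-1, 8 * 8 * 8)   # the reshape that nn.Sequential was missing
        out = self.act3(self.fc1(out))
        out = self.fc2(out)
        return out
```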

We will do a new refactoring, since some modules like nn.Tanh and nn.MaxPool2d have no parameters, so it is not necessary to register them in the subclass. For this, PyTorch has a functional API (torch.nn.functional) that we will use to perform this task:
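The refactored version might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # only the modules that hold parameters are registered
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 8 * 8, 32)
        self.fc2 = nn.Linear(32, 10)

    def forward(self, x):
        # parameterless activations and pooling use the functional API
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
        out = out.view(-1, 8 * 8 * 8)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out
```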

Apply an image to a Net model:
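For example, assuming the normalized dataset defined earlier:

```python
model = Net()
img_t, _ = transformed_cifar10[0]      # any image from the normalized dataset
out = model(img_t.unsqueeze(0))
print(out.shape)   # torch.Size([1, 10])
```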

The loss function to use in a CNN classification model: Softmax Cross Entropy Loss

In our previous post about NNs we used MSE as a loss function, but here we are facing a classification problem, so we should use a function that better interprets the output values as probabilities. That is, each value of the output vector must be between 0 and 1, and the vector must sum to 1 for each sample.

Softmax Cross Entropy exploits these characteristics, producing steeper gradients than MSE for the same input. It has 2 components: the softmax function, which converts the raw output values into probabilities (unlike plain normalization, it amplifies the differences between values), and the cross-entropy loss computed on those probabilities. Compare normalization and softmax on the same input:

normalize(np.array([10, 6, 4])) ==> array([0.5, 0.3, 0.2])

softmax(np.array([10, 6, 4])) ==> array([0.98, 0.018, 0.002])
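A minimal NumPy sketch of those two helpers (the function names are just illustrative):

```python
import numpy as np

def normalize(x):
    # plain normalization: divide each value by the sum
    return x / x.sum()

def softmax(x):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max())
    return e / e.sum()

print(normalize(np.array([10, 6, 4])))   # [0.5 0.3 0.2]
print(softmax(np.array([10, 6, 4])))     # ~[0.98 0.018 0.002]
```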

Example:

Cross entropy loss versus MSE when y = 0

Ok, now we have to train our model, so let's get to work. To execute our training more quickly we will use a reduced dataset derived from CIFAR-10 that only has images of 3 classes instead of the original 10 (airplanes, cars, and ships), which we call Cifar3 :). You can see the code in the GitHub repo.

We create a method to train the network in a loop of n epochs. We will use the DataLoader provided by PyTorch to feed the network with batches of images (64 in our example):
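A sketch of such a training loop (the names are assumptions, not the exact code from the repo):

```python
import datetime

def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            outputs = model(imgs)                 # forward pass on the batch
            loss = loss_fn(outputs, labels)       # cross entropy on raw scores
            optimizer.zero_grad()
            loss.backward()                       # backpropagation
            optimizer.step()                      # update the weights
            loss_train += loss.item()
        if epoch == 1 or epoch % 10 == 0:
            print(f'{datetime.datetime.now()} Epoch {epoch}, '
                  f'Training loss {loss_train / len(train_loader)}')
```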

Then we will create an instance of the CNN model, an SGD optimizer, and our Cross Entropy loss, and pass them to the training loop:
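Roughly like this (cifar3_train stands for the 3-class subset from the repo; the learning rate is an assumption):

```python
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

train_loader = DataLoader(cifar3_train, batch_size=64, shuffle=True)

model = Net()
optimizer = optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()   # applies softmax + cross entropy internally

training_loop(100, optimizer, model, loss_fn, train_loader)
```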

2021-08-27 15:35:11.218508 Epoch 1, Training loss 1.2309487659880456

2021-08-27 15:36:10.194867 Epoch 10, Training loss 0.46954463948594766

2021-08-27 15:37:15.662480 Epoch 20, Training loss 0.38312686726133877

2021-08-27 15:38:21.163712 Epoch 30, Training loss 0.3325024703081618

2021-08-27 15:44:56.936097 Epoch 90, Training loss 0.18055609463060157

2021-08-27 15:46:02.644376 Epoch 100, Training loss 0.1627131676103206

Now we will measure the accuracy of our model against the validation data:
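A simple accuracy helper could look like this (train_loader and val_loader are the assumed loaders for the Cifar3 training and validation splits):

```python
import torch

def accuracy(model, loader):
    correct, total = 0, 0
    with torch.no_grad():                       # no gradients needed for evaluation
        for imgs, labels in loader:
            outputs = model(imgs)
            _, predicted = torch.max(outputs, dim=1)   # index of the highest score
            total += labels.shape[0]
            correct += (predicted == labels).sum().item()
    return correct / total

print(f"Accuracy train: {accuracy(model, train_loader):.4f}")
print(f"Accuracy val: {accuracy(model, val_loader):.4f}")
```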

Accuracy train: 0.9295

Accuracy val: 0.8713

Ok folks, too much information in this post. In the next one we will be talking a little about:

how to store and retrieve the trained parameters of our neural network, methods to regularize the network to fight overfitting, and how to run our training on GPUs.

Your comments are appreciated, thanks.

