CAMERA IMAGE PROCESSING ON ESP32 MICROCONTROLLER WITH HELP OF CONVOLUTIONAL NEURAL NETWORK

— This paper analyzes a common ESP32 microcontroller with a built-in camera for image classification tasks using a convolutional neural network. The ESP32 is commonly used in IoT devices to read data and control sensors, so its computing power is modest, which has a positive effect on the cost of the device. The prevalence of ultra-low-power embedded devices such as the ESP32 will allow the widespread use of artificial intelligence built into IoT devices. The paper measures the duration of photographing and photo processing, since these can be a bottleneck of the microcontroller, especially together with machine learning algorithms. A convolutional neural network of the MobileNet architecture, pre-trained on another device, was deployed on the microcontroller, showing that the capacity of the ESP32 is sufficient for simultaneous operation of both the camera and the convolutional neural network.


I. INTRODUCTION
The Internet of Things (IoT) has become a reality. Smart home devices range from laptops to TVs, doorway cameras, and dishwashers.
Intelligent buildings often include sensors and electronic devices connected to the network for control, monitoring, and recording. Some sensors can measure vital signs, location, or user activity. Finally, there are environmental sensors that detect conditions such as temperature, light, or user presence.
These data, as well as energy consumption measurements of some devices, can be recorded during the day and then uploaded to a remote server.
They can also be used to train machine learning models that provide the greatest comfort, economy, and safety for the residents of a smart building.
On the other hand, the technologies and solutions used to create IoT devices have significant limitations.
For example, if a wireless camera is placed in front of the door, there are problems with organizing its power supply. A nearby outlet may not be available, and battery power is not very practical, because the batteries would have to be changed often due to the rapid discharge caused by streaming video transmission.
Thus, organizing the power supply becomes an important task in the development of IoT devices that are constantly in an active state. To overcome this barrier, IoT devices must be "intelligent" [1]: they should act as independent processing devices and perform data processing on their own, thus reducing both the volume of transferred traffic and the energy consumption.
One of the ways to solve these problems is to integrate machine learning, namely neural networks, into the intelligent block [2].
Neural networks solve tasks with which traditional methods cannot compete: they can successfully handle non-conventional, noisy, and corrupted information. A neural network is a system consisting of many simple computing elements (neurons) interconnected in some way. The most widespread are multilayer networks, in which neurons are combined into layers. A layer, in turn, is a set of neurons that receive information from the neurons of the previous layer in parallel at every clock cycle.
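As a textbook illustration (general background, not specific to this paper), each neuron in such a layer computes a weighted sum of its inputs passed through an activation function f:

    y = f\Big( \sum_{i=1}^{n} w_i x_i + b \Big)

where x_i are the inputs received from the previous layer, w_i are the connection weights, and b is the bias.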
However, one of the disadvantages of neural networks is that they require a significant amount of computing resources, which are available only on powerful computer systems. Over time, with the development of mobile devices, running neural networks correctly in systems with limited resources, such as microcontrollers (a microcontroller is a cheap programmable system that often includes memory and input-output interfaces on one chip), became relevant.

The solution to this problem was the MobileNet family of neural networks [3].
Neural networks will make the house more useful and responsive to the needs of users by prediction, instead of relying entirely on direct commands or manually programmed procedures. Neural network integration can also make energy management more efficient by limiting the use of a device to only when it is needed, without causing inconvenience to the residents.

II. MOBILENET
MobileNet is a family of general-purpose computer vision neural networks designed for mobile devices to support classification, detection, and similar tasks. Mobile networks are small, low-latency, low-power models parameterized according to the resource constraints of different use cases. Although the basic MobileNet architecture (Table I) is already tiny and has low latency, it is sometimes possible to make the model even smaller and faster. To address this, MobileNetV2 was developed on the basis of MobileNetV1, using depthwise separable convolution as an effective building block. In addition, it introduces two new features into the architecture (Fig. 1): 1) linear bottlenecks between layers; 2) shortcut connections between bottlenecks. A further development of these ideas was MobileNetV3, which is tuned to phone processors by combining hardware-aware network architecture search, augmented by the NetAdapt algorithm [17], with a new architecture design (Fig. 2). To build a less resource-intensive model, MobileNet introduces a parameter α (alpha), called the width multiplier, whose role is to thin the network uniformly at each layer.
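For reference, the effect of the width multiplier can be written out using the cost formula from the original MobileNet paper [3] (kernel size D_K, feature map size D_F, M input channels, N output channels); the equation below is quoted from that paper, not derived here:

    D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F

The computational cost and parameter count thus fall roughly quadratically in α, which is why the setting α = 0.10 used later in this paper shrinks the model so strongly.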
III. THE PROBLEM OF THE LIMITED TRAINING SAMPLE
Due to the active development of neural networks in the last decade, the formation of training data sets has taken on particular importance, since in many tasks deep neural networks demonstrate quality that significantly exceeds other machine learning methods. However, the use of neural networks involves training them on a training sample. The first mention of the concept of transfer learning in machine learning dates back to 1993 [5]. The concept consists in transferring knowledge obtained in one or more source tasks and using it to improve learning in the current task (see Fig. 3). The techniques that enable knowledge transfer aim to make the machine learning process as effective as human learning. As a result, a convolutional neural network trained on one sample of data can be retrained to perform tasks on a new data set, which significantly accelerates the training of the network and, as a consequence, substantially reduces the material and time costs of forming the training sample and training the neural network.

IV. PROBLEM STATEMENT
Given the advances in implementation and operability of the MobileNet architecture and the possibility of transfer learning, it is possible to run neural networks pre-trained on other devices (such as computers) in systems with low processing power, such as microcontrollers. Over the past few years, many microcontroller manufacturers have worked on implementing machine learning on microcontrollers. Some of them have developed special libraries with machine learning features [6], [7], and others have implemented special hardware with advanced machine learning capabilities [8], [9].
Therefore, the goal of this paper is to organize the input of information from the camera to the ESP32 microcontroller and to process it using a convolutional neural network that is deployed on this microcontroller but has been trained on a different device.
The ESP32 was chosen because it has generated a lot of interest since the beginning of its production. The M5CAMERA board (which is based on the ESP32 microcontroller) was used in this work. The board has 4 MB of PSRAM memory and an OV2640 camera module [10].
Since camera operation and image generation are a complex task for the microcontroller, before starting to work with neural networks it is necessary to investigate the image generation time in order to find the optimal quality without critical degradation of system performance. Another task is to investigate the effect of PSRAM memory [11] on the performance of such a system, since modules with this memory are more expensive.
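To make the PSRAM discussion below concrete, here is a minimal camera initialization sketch for the Arduino environment with the esp32-camera driver; the pin mapping is board-specific (M5CAMERA) and omitted, and the choice of two frame buffers under PSRAM is a typical configuration assumed for illustration, not the authors' exact code.

#include <Arduino.h>
#include "esp_camera.h"

// Minimal camera setup sketch. The board-specific pin assignments are
// omitted; fb_count = 2 under PSRAM is a common choice that lets one
// frame be captured while another is processed.
bool initCamera() {
  camera_config_t config = {};
  // ... fill in the M5CAMERA pin assignments here ...
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer   = LEDC_TIMER_0;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;
  config.frame_size   = FRAMESIZE_QVGA;        // 320x240, found optimal below
  config.jpeg_quality = 12;
  config.fb_count     = psramFound() ? 2 : 1;  // extra buffer only with PSRAM
  return esp_camera_init(&config) == ESP_OK;
}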

V. RESEARCH OF TIME OF THE CREATION OF A PHOTO ON THE ESP32 MICROCONTROLLER
To begin with, the effect of PSRAM memory on the performance of the controller should be studied. The measurements were made by inserting the micros() function into the program code [12]. This function returns the number of microseconds elapsed since the microcontroller started; subtracting two such values gives the execution time between the inserted calls:

time1 = micros(); cam.getPhoto(); time2 = micros(); total_time = time2 - time1;

The shooting time is the time between the command to shoot and the moment the command completes. The total duration for a photo depends on its resolution and also on the photo creation time.
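As a self-contained version of this measurement, the following sketch (assuming the Arduino environment; the paper's cam.getPhoto() wrapper is replaced here by the driver call esp_camera_fb_get()) prints the shooting time for one frame:

#include <Arduino.h>
#include "esp_camera.h"

// Measure the shooting time: the interval between the command to take
// a photo and the moment the command completes.
void measureCaptureTime() {
  unsigned long t1 = micros();
  camera_fb_t *fb = esp_camera_fb_get();   // command to take a photo
  unsigned long t2 = micros();
  if (!fb) {
    Serial.println("capture failed");
    return;
  }
  Serial.printf("capture: %lu us, %u bytes\n", t2 - t1, (unsigned)fb->len);
  esp_camera_fb_return(fb);                // hand the frame buffer back
}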

Without PSRAM memory, the M5CAMERA board can use only one photo buffer. Table II shows four different resolutions with the corresponding durations of photo creation and processing. The tables also contain columns for the maximum number of pixels in the image and the number of pixels actually used; these differ because the camera takes the photo at full resolution, but the microcontroller turns only a square part of the photo into an image, so the pixel counts of the captured and resulting photos are not the same. Next, the running time of the ESP32 microcontroller with PSRAM memory is checked (Table III). It can be noticed that the shooting duration is below one millisecond regardless of the photo resolution; the photo processing time, however, is much higher than in Table II. Taking the data of Tables II and III together, using PSRAM memory decreases photo capture time but significantly increases photo processing time. Therefore, the optimal camera mode is QVGA both for boards with PSRAM memory and without it, because machine learning algorithms usually use low-resolution images.

VI. PREPARATION OF THE TRAINING SAMPLE AND TRAINING OF THE NEURAL NETWORK
Before training the network, a training sample must be formed. To investigate transfer learning, the generated sample consists of 210 initial images (70 images per class). After the formation of the training sample, it is augmented automatically by making small random changes to the training data (cropping or rotating images). For this purpose, the script shown in Fig. 4 was implemented.
After augmentation, the images are resized to 96x96 pixels. In addition to resizing, the color gradation is changed from RGB to grayscale, which lowers the color depth [13]. Working in grayscale also reduces the amount of memory required for inference.
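A minimal sketch of this preprocessing step, assuming the frame arrives in the driver's RGB565 format; nearest-neighbour sampling and the usual luminance weights are an illustrative choice, not necessarily the exact pipeline used in the paper:

#include <stdint.h>

// Downscale an RGB565 frame to a 96x96 8-bit grayscale buffer using
// nearest-neighbour sampling and a luminance-weighted average.
void toGray96(const uint16_t *src, int srcW, int srcH, uint8_t *dst) {
  const int W = 96, H = 96;
  for (int y = 0; y < H; y++) {
    for (int x = 0; x < W; x++) {
      uint16_t p = src[(y * srcH / H) * srcW + (x * srcW / W)];
      uint8_t r = (p >> 11) << 3;           // 5-bit red   -> 8 bit
      uint8_t g = ((p >> 5) & 0x3F) << 2;   // 6-bit green -> 8 bit
      uint8_t b = (p & 0x1F) << 3;          // 5-bit blue  -> 8 bit
      dst[y * W + x] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
  }
}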
After these operations on the training sample, we move on to training the network. Transfer learning is performed on a MobileNetV1 model [16] pre-trained on the ImageNet dataset [14], using Edge Impulse Studio [15].
MobileNetV1 is used because MobileNetV2 would need about 1.3 MB of RAM and 2.6 MB of ROM to run the model, which would cause significant delays in operation. Using MobileNetV1 with α = 0.10 results in lower accuracy but needs only about 53.2 KB of RAM and 101 KB of ROM. The results of the network training are shown in Fig. 5. The model achieved about 77% accuracy, and the amount of RAM used at inference is about 60 KB, which is reasonable for using the ESP32 controller together with the camera without significant problems.

VII. DEPLOYMENT OF THE TRAINED NEURAL NETWORK ON THE MICROCONTROLLER
The trained model can be deployed as a library generated by Edge Impulse Studio, which stores the weights of the network and is connected to the project in code (Fig. 6), where #include <ESP32-CAM.h> includes the file with the weights.
This code is a template that receives the unprocessed image (stored in the features array) and runs the classifier to obtain the network output. First, the image must be obtained from the camera and pre-processed: resized to 96x96, converted to grayscale, and smoothed. This becomes the input tensor of the model. The output tensor is a vector whose values give the probability of each class.
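For illustration, a condensed version of such a template using the Edge Impulse C++ SDK is sketched below. The header name my_project_inferencing.h stands in for whatever name Edge Impulse Studio actually generates (the paper's listing uses #include <ESP32-CAM.h>), and the features buffer is assumed to already hold the 96x96 grayscale pixels as floats:

#include <my_project_inferencing.h>  // hypothetical name of the generated library
#include <string.h>

static float features[EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE];  // 96*96 input values

// Callback that hands slices of the feature buffer to the classifier.
static int get_feature_data(size_t offset, size_t length, float *out_ptr) {
  memcpy(out_ptr, features + offset, length * sizeof(float));
  return 0;
}

void classifyFrame() {
  signal_t signal;
  signal.total_length = EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE;
  signal.get_data = &get_feature_data;

  ei_impulse_result_t result = { 0 };
  if (run_classifier(&signal, &result, false) != EI_IMPULSE_OK) return;

  // The output tensor: one probability per class.
  for (size_t i = 0; i < EI_CLASSIFIER_LABEL_COUNT; i++) {
    ei_printf("%s: %.3f\n", result.classification[i].label,
              result.classification[i].value);
  }
}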
For this, the official camera test code available at https://github.com/edgeimpulse/example-esp32-cam was taken and merged with the code of the trained neural network.

VIII. CONCLUSIONS
The ESP32 microcontroller was used in this work because it is a proven platform. The OV2640 camera has sufficient resolution for most machine learning tasks.
Photo creation can take a long time on the ESP32, but machine learning algorithms usually use low-resolution images, so it is recommended to set the camera resolution to the lowest level.
It can be concluded that the ESP32 with the OV2640 camera has enough processing power to perform simple machine learning and camera photography tasks.
In a future analysis it will be interesting to test the ESP32 with different neural networks and to try using both Tensilica Xtensa LX6 cores in the calculations.

Fig. 2. General architecture of MobileNetV3

Fig. 3. Learning process of transfer learning

TABLE I. GENERAL ARCHITECTURE OF MOBILENETV1

TABLE II. PROCESSING SPEED WITHOUT PSRAM

TABLE III. PROCESSING SPEED WITH PSRAM