1. Project Goals and Assumptions
The goal was to create a relatively inexpensive, easy-to-use, portable device that allows blind or visually impaired individuals to read printed text from documents or books.
Key Assumptions:
- Use of widely available, budget-friendly components, including the Raspberry Pi 4B microcomputer.
- Recognition of text in Polish (including diacritical marks) and conversion to speech.
- Support for two classification variants: a custom CNN neural network and the ResNet50 model (transfer learning).
2. Device Construction
2.1 Raspberry Pi 4B
The project utilizes a Raspberry Pi 4B with 4 GB of RAM, running Raspberry Pi OS (Bookworm). It serves as the core of the device, responsible for:
- Image acquisition from a camera.
- Image processing (filtering and segmentation).
- Classification of recognized letters using a neural network model.
- Speech synthesis (eSpeak or another TTS) and output via headphones/Bluetooth speaker.
2.2 5MP OV5647 Camera
For image capture, the 5MP OV5647 camera module, compatible with the Raspberry Pi, was selected. It offers good image quality at a low price. With a resolution of 2592×1944 pixels, it enables effective text recognition even at standard font sizes.
2.3 Enclosure
To integrate the components, a two-piece (top/bottom) enclosure was designed in Autodesk Inventor Professional 2025 and manufactured by 3D printing. The enclosure provides stable mounting for the Raspberry Pi and the camera while ensuring easy access to ports and connectors.
2.4 Communication with a PC
The device connects to a computer (e.g., a laptop) over Wi-Fi; a connection sketch follows the list below. The computer enables:
- Real-time image preview.
- Selection of the neural network model (custom or ResNet50).
- Testing and diagnostics (e.g., using MATLAB/Simulink).
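As an illustration, the Wi-Fi link and camera preview can be set up from MATLAB roughly as follows. This is a minimal sketch assuming the MATLAB Support Package for Raspberry Pi Hardware; the IP address and login credentials are placeholders, not the project's actual values.

```matlab
% Minimal sketch: connect to the Raspberry Pi over Wi-Fi and grab one frame.
r   = raspi('192.168.1.10', 'pi', 'raspberry'); % placeholder address/credentials
cam = cameraboard(r, 'Resolution', '1280x720'); % OV5647 module on the RPi
img = snapshot(cam);                            % capture a single frame
imshow(img);                                    % preview on the PC side
```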
3. Image Processing and Text Recognition Algorithm
- Image Capture – capturing an image from the camera and transferring it to the Raspberry Pi.
- Filtering and Binarization – conversion to grayscale, contrast enhancement, noise reduction (Wiener filter), followed by conversion to a black-and-white image.
- Segmentation – separating the image into individual letters (best results were achieved using contour detection-based segmentation).
- Classification – each extracted character is processed by a neural network. The project includes two classification options:
  - Custom CNN – smaller and faster to train, but slightly less accurate on complex fonts.
  - ResNet50 (transfer learning) – a deep residual network pretrained on a large dataset, then adapted to Polish letters. More complex, achieving higher accuracy at the cost of longer processing time.
- Speech Synthesis – the recognized text is passed to a speech synthesizer (e.g., eSpeak) and played through headphones or a Bluetooth speaker. A MATLAB sketch of the whole chain follows this list.
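The sketch below assumes the Image Processing Toolbox and reuses the `img` and `r` variables from the connection sketch in Section 2.4; the Wiener window, the minimum blob area, and the espeak invocation are illustrative assumptions rather than the project's exact parameters.

```matlab
% Sketch of the pipeline: grayscale -> contrast -> Wiener filter ->
% binarization -> blob segmentation -> per-letter classification -> TTS.
gray  = rgb2gray(img);                 % grayscale conversion
gray  = imadjust(gray);                % contrast enhancement
den   = wiener2(gray, [5 5]);          % Wiener noise reduction (assumed window)
bw    = ~imbinarize(den);              % binarize; dark letters become white blobs
cc    = bwconncomp(bw);                % connected components ~ letter candidates
stats = regionprops(cc, 'BoundingBox', 'Area');
for k = 1:numel(stats)
    if stats(k).Area < 20, continue; end       % skip noise blobs (assumed threshold)
    glyph = imcrop(bw, stats(k).BoundingBox);  % single-character sub-image
    glyph = imresize(glyph, [28 28]);          % match the custom CNN input size
    % ... classify 'glyph' with the selected network (Section 4) ...
end
% Hand the assembled string to eSpeak on the Pi (assumes eSpeak is installed):
system(r, 'espeak -v pl "rozpoznany tekst"');  % Polish voice
```

In practice the bounding boxes would also have to be sorted into reading order (top-to-bottom, left-to-right) before the recognized characters are assembled into text.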
4. Neural Networks
4.1 Custom CNN
- Two convolutional layers, ReLU activation, pooling, batch normalization, and dropout (see the layer sketch after this list).
- Input images of size 28×28 pixels (binary).
- Accuracy in tests: approximately 79–84% (for a dataset of fonts and larger characters).
- Advantage: shorter training time and lower hardware requirements.
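This description maps naturally onto a Deep Learning Toolbox layer array. The configuration below is an assumed sketch: the filter counts, dropout rate, and class count are illustrative, not the trained model's exact values.

```matlab
numClasses = 35;                                 % assumed number of character classes
layers = [
    imageInputLayer([28 28 1])                   % 28x28 binary input images
    convolution2dLayer(3, 16, 'Padding', 'same') % first conv layer (assumed 16 filters)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 32, 'Padding', 'same') % second conv layer (assumed 32 filters)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    dropoutLayer(0.3)                            % assumed dropout rate
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
```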
4.2 Transfer Learning: ResNet50
- A 50-layer deep residual network composed of convolutional layers grouped into residual blocks.
- Modified final classification layer, adapted to Polish characters (see the sketch after this list).
- Accuracy in tests: 87–92% (depending on dataset and image quality).
- Drawback: slightly longer processing time on Raspberry Pi.
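The adaptation step might look like the sketch below, assuming the stock resnet50 model from the Deep Learning Toolbox, whose final layers are named 'fc1000', 'fc1000_softmax', and 'ClassificationLayer_fc1000'; the class count is again an assumption.

```matlab
net    = resnet50;                               % pretrained on ImageNet
lgraph = layerGraph(net);
numClasses = 35;                                 % assumed number of character classes
lgraph = replaceLayer(lgraph, 'fc1000', ...
    fullyConnectedLayer(numClasses, 'Name', 'fc_polish'));
lgraph = replaceLayer(lgraph, 'fc1000_softmax', ...
    softmaxLayer('Name', 'softmax_polish'));
lgraph = replaceLayer(lgraph, 'ClassificationLayer_fc1000', ...
    classificationLayer('Name', 'class_polish'));
% The modified network is then retrained with trainNetwork on the letter
% dataset. Note that ResNet50 expects 224x224x3 inputs, so the 28x28 binary
% glyphs must be resized and replicated across three channels.
```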
5. Testing and Results
- Recognition Accuracy:
  - For well-lit, clear texts, the ResNet50 model significantly outperforms the custom CNN.
  - In poor lighting or with small, densely packed letters, both models struggle; segmentation quality and image quality become the limiting factors.
- Processing Time:
  - Custom CNN: ~12 s on average for a typical text fragment.
  - ResNet50: ~20 s, but with higher accuracy.
- Drawbacks:
  - Problems recognizing very small or distorted characters, especially with poor segmentation or low lighting.
6. Implementation and Operation
1. Software in MATLAB/Simulink:
   - The MATLAB/Simulink Support Packages for Raspberry Pi Hardware enabled quick deployment and testing.
   - The MATLAB application provides a camera preview, allows switching between models (CNN/ResNet50), and exchanges data with the microcomputer (a model-selection sketch follows this list).
2. Wireless Connections:
   - The Raspberry Pi connects over Wi-Fi to a computer (image preview, model selection).
   - Audio output is handled via Bluetooth (e.g., headphones).
3. 3D Enclosure:
   - 3D-printed in plastic; a compact design housing the Raspberry Pi and camera.
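The model switch in the MATLAB application might look like the following sketch; the .mat file names and the variable name `net` are hypothetical, and `glyph` is the segmented character from the Section 3 sketch.

```matlab
useResNet = true;                                % toggled from the application UI
if useResNet
    data = load('trainedResNet50.mat');          % hypothetical file name
else
    data = load('trainedCustomCNN.mat');         % hypothetical file name
end
net    = data.net;                               % assumed variable name in the file
inSize = net.Layers(1).InputSize;                % [28 28 1] or [224 224 3]
glyph  = im2uint8(imresize(glyph, inSize(1:2))); % match input size, logical -> uint8
if inSize(3) == 3, glyph = repmat(glyph, [1 1 3]); end
label  = classify(net, glyph);                   % predicted character class
```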
7. Summary and Development Potential
The designed device meets its primary goal: it can capture an image from a camera, recognize printed text, and convert it into audio output. Thanks to the use of two models (a custom CNN and ResNet50), users can choose between a faster or more accurate recognition solution.
Possible Future Improvements:
- Enhancing image quality: LED lighting, a better camera, and higher resolution.
- Expanding segmentation algorithms (e.g., intelligent text line detection).
- Improving the enclosure (materials, ergonomics, aesthetics).
- Further refining the neural network models (more training data, additional layers, alternative transfer-learning approaches).
- Implementing faster TTS libraries with improved voice quality.
The project provided hands-on experience in embedded systems, image processing, machine learning, and software engineering. The device could serve as a valuable foundation for further commercialization and the development of assistive technologies for visually impaired individuals.