This is part of the Hackathon Raptors Engineering Excellence series, where we share insights from judges and technical leaders to help participants excel.
Millions of visually impaired users face significant barriers to essential online activities in a digital world optimized for visual interaction. VoxSurf, the first-place winner of the 2025 AI-Powered DEI Web Accessibility Hackathon, demonstrated how AI-driven agentic technology can fundamentally transform this experience, creating a truly accessible browser that goes beyond traditional screen readers.
The Technical Foundation
“What makes VoxSurf technically exceptional is its comprehensive approach to the browsing experience,” explains Igor Kiselev, Principal Director at Accenture and technical judge at the hackathon. “Unlike solutions that merely add accessibility features to existing browsers, they’ve reimagined the entire browsing architecture from the ground up.”
VoxSurf’s technical stack combines several sophisticated components:
- PyQt5 framework providing core browser functionality
- Fetch AI’s agentic framework orchestrating AI-driven interactions
- Gemini models processing text and image-based queries
- Custom HTML parsing for semantic understanding and element prioritization
- Groq integration enabling low-latency decision-making
- DeepGram delivering high-accuracy speech-to-text and text-to-speech capabilities
“The technical integration of these components is seamless,” notes Kiselev. “Rather than feeling like separate tools cobbled together, VoxSurf presents a unified interaction model that maintains consistency across different websites and user scenarios.”
VoxSurf's architecture is organized as a processing pipeline of seven components:
Component | Primary Function | Technology | Processing Location |
---|---|---|---|
Speech Recognition | Convert voice to text | DeepGram Nova-2 | Client-side with server fallback |
Command Parser | Identify user intent | Custom NLP + Gemini Pro | Hybrid |
Visual Analyzer | Detect page elements | Gemini Pro Vision | Server-side |
DOM Parser | Extract semantic structure | Custom HTML5 Parser | Client-side |
Decision Engine | Determine actions | Groq Inference API | Server-side |
Text-to-Speech | Provide audio feedback | DeepGram Speech API | Client-side |
Context Manager | Maintain session state | Fetch AI Agentic Framework | Hybrid |
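
To make the flow through these components concrete, here is a minimal sketch of how such stages could be chained. It is not VoxSurf's code: the `Turn` class, the stage functions, and the stage order are all assumptions made for illustration, and each stage is a local placeholder rather than a call to DeepGram, Gemini, or Groq.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Turn:
    """State carried through one voice interaction (hypothetical structure)."""
    audio: bytes
    transcript: str = ""
    intent: str = ""
    target: str = ""
    action: str = ""
    reply: str = ""
    trace: List[str] = field(default_factory=list)


Stage = Callable[[Turn], Turn]


def recognize(turn: Turn) -> Turn:
    # Placeholder: treat the audio bytes as UTF-8 text instead of calling a speech API.
    turn.transcript = turn.audio.decode("utf-8").strip().lower()
    return turn


def parse_command(turn: Turn) -> Turn:
    # Placeholder intent detection: the first word is a verb, otherwise it is a question.
    verb, _, rest = turn.transcript.partition(" ")
    turn.intent = verb if verb in {"click", "read", "open"} else "ask"
    turn.target = rest if turn.intent != "ask" else turn.transcript
    return turn


def decide(turn: Turn) -> Turn:
    # Placeholder decision step: encode the chosen browser action as a string.
    turn.action = f"{turn.intent}:{turn.target}"
    return turn


def speak(turn: Turn) -> Turn:
    # Placeholder for text-to-speech: build the sentence that would be spoken.
    turn.reply = f"OK, {turn.intent} {turn.target}"
    return turn


# Assumed execution order; the table lists components but does not state one.
STAGES: List[Tuple[str, Stage]] = [
    ("speech_recognition", recognize),
    ("command_parser", parse_command),
    ("decision_engine", decide),
    ("text_to_speech", speak),
]


def run_pipeline(turn: Turn, stages: List[Tuple[str, Stage]] = STAGES) -> Turn:
    """Run each stage in order and record which stages touched the turn."""
    for name, stage in stages:
        turn = stage(turn)
        turn.trace.append(name)
    return turn
```

Keeping each stage behind the same `Turn -> Turn` signature means a client-side stage can be swapped for a server-side one without changing the rest of the chain, which is one plausible way to support the hybrid processing locations listed in the table.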
Technical Innovations in Navigation
Traditional screen readers present web content sequentially, often forcing users to navigate through dozens of elements before reaching their target. VoxSurf’s approach is fundamentally different, using what Kiselev calls “intelligent semantic navigation”:
“Their implementation of bounding box detection for navigation is particularly impressive,” Kiselev explains. “By combining computer vision with semantic HTML analysis, VoxSurf can identify clickable elements even when they lack proper accessibility tags—a common issue on many websites.”
The technical implementation involves:
- Visual analysis of the rendered page using Gemini’s vision capabilities
- Correlation of visual elements with underlying DOM structure
- Prioritization of elements based on likely user intent and interface conventions
- Natural language mapping between user commands and detected elements
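
The correlation and mapping steps above can be sketched as follows. This is an illustrative approximation, not the team's implementation: it assumes the vision model returns pixel bounding boxes, uses intersection-over-union (IoU) to pair each box with a DOM rectangle, and substitutes simple word overlap for real natural language matching. The `DomElement` class, the selectors, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


@dataclass
class DomElement:
    selector: str  # hypothetical CSS selector for the node
    text: str      # visible text or label
    box: Box       # layout rectangle reported by the renderer


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned rectangles."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detection(detected: Box, elements: List[DomElement],
                    threshold: float = 0.5) -> Optional[DomElement]:
    """Return the DOM element whose rectangle best overlaps a visual detection."""
    best = max(elements, key=lambda e: iou(detected, e.box), default=None)
    if best is not None and iou(detected, best.box) >= threshold:
        return best
    return None


def resolve_command(command: str,
                    candidates: List[DomElement]) -> Optional[DomElement]:
    """Pick the candidate whose text shares the most words with the command."""
    words = set(command.lower().split())
    scored = [(len(words & set(e.text.lower().split())), e) for e in candidates]
    score, element = max(scored, key=lambda pair: pair[0], default=(0, None))
    return element if score > 0 else None


elements = [
    DomElement("#buy", "Add to cart", (100, 100, 200, 140)),
    DomElement("#nav-login", "Sign in", (500, 10, 560, 40)),
]
clicked = resolve_command("click add to cart", elements)
```

Because the pairing relies on geometry rather than on ARIA attributes, a clickable `div` with no role can still be matched, which is the case the hybrid approach is meant to cover.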
“This hybrid approach solves a significant technical challenge in accessibility,” Kiselev observes. “While properly coded websites should have semantic markup, the reality is that many don’t. VoxSurf bridges this gap by combining visual understanding with code analysis.”
In the team’s internal testing, the hybrid approach achieved higher element detection accuracy than either method alone in every category:
Detection Method | Properly Tagged Elements | Improperly Tagged Elements | Dynamic Elements |
---|---|---|---|
Screen Reader Only | 94% | 23% | 18% |
Visual Analysis Only | 82% | 76% | 65% |
VoxSurf Hybrid Approach | 97% | 89% | 83% |
“These detection rates represent a step-change in accessibility,” Kiselev notes. “Particularly for improperly tagged and dynamic elements, where traditional screen readers struggle significantly.”
Agentic Intelligence: Beyond Voice Commands
What sets VoxSurf apart technically is its use of agentic AI built on Fetch AI’s framework. Unlike traditional voice interfaces that rely on predefined commands, VoxSurf employs AI agents that can:
- Understand complex, natural language requests
- Break down multi-step tasks into executable browser actions
- Maintain context across browsing sessions
- Learn from user interactions to improve performance over time
“The technical sophistication of their agentic implementation is remarkable,” Kiselev notes. “By structuring the AI-driven interactions through Fetch AI’s framework, they’ve created a system that can handle ambiguity and adapt to different user needs while maintaining predictable behavior.”
A key technical innovation is VoxSurf’s implementation of Retrieval-Augmented Generation (RAG):
“Their RAG approach allows users to ask contextual questions about page content,” Kiselev explains. “Rather than simply reading the entire page, users can ask specific questions like ‘What are the shipping options?’ or ‘Is this product available in different colors?’ and receive targeted responses.”
The performance metrics of VoxSurf’s RAG implementation showcase its efficiency:
Metric | Traditional Screen Reader | Standard LLM | VoxSurf RAG |
---|---|---|---|
Response Time | 45+ seconds (manual) | 3.2 seconds | 0.8 seconds |
Answer Accuracy | N/A (requires manual search) | 76% | 94% |
Context Retention | None | Minimal | Full Session |
Information Density | Full page text | Verbose | Concise |
The technical implementation includes:
- Real-time content indexing as pages load
- Contextual embedding of page content for efficient retrieval
- Query formulation based on natural language understanding
- Response generation that combines retrieved information with conversational fluency
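
The indexing and retrieval steps above can be illustrated with a small sketch. It is a simplification under stated assumptions: term-frequency vectors stand in for learned embeddings, the `PageIndex` class and `build_prompt` helper are hypothetical names, and the final generation step is left as a prompt string rather than a call to any particular model.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Toy embedding: a term-frequency vector (stand-in for a learned model)."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PageIndex:
    """Holds embedded chunks of the current page (hypothetical helper)."""

    def __init__(self) -> None:
        self.chunks: List[Tuple[str, Counter]] = []

    def add_chunk(self, text: str) -> None:
        # Called as content loads, so the index grows with the page.
        self.chunks.append((text, embed(text)))

    def retrieve(self, question: str, k: int = 2) -> List[str]:
        # Rank chunks by similarity and drop those with no overlap at all.
        q = embed(question)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)
        return [text for text, vec in ranked[:k] if cosine(q, vec) > 0]


def build_prompt(question: str, passages: List[str]) -> str:
    """Combine retrieved passages with the question for a language model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this page content:\n{context}\nQuestion: {question}"


index = PageIndex()
index.add_chunk("Shipping options: standard delivery in 5 days or express shipping in 2 days.")
index.add_chunk("This jacket is available in black, navy, and olive colors.")
index.add_chunk("Customer reviews average 4.5 stars.")
prompt = build_prompt("What are the shipping options?",
                      index.retrieve("What are the shipping options?", k=1))
```

Only the retrieved passages reach the model, so the spoken answer can stay short even when the page itself is long.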
“From a technical perspective, implementing this RAG capability within a browser environment—maintaining both performance and accuracy—is a significant achievement,” Kiselev emphasizes.
Solving Real-World Technical Challenges
The VoxSurf team tackled several complex technical problems that have historically limited browser accessibility:
Form Interaction
Forms present particular challenges for visually impaired users, often requiring a precise understanding of labels, input types, and validation requirements.
“Their technical approach to form interaction is elegant,” Kiselev notes. “By combining NLP with a contextual understanding of form structure, VoxSurf allows users to complete forms using natural language rather than navigating field by field.”
The technical implementation involves:
- Identification of form elements and their relationships
- Extraction of field labels, placeholders, and validation rules
- Natural language processing to map user input to appropriate fields
- Validation feedback delivered through conversational responses
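
A rough sketch of the mapping and validation steps follows. It assumes the form's fields have already been extracted into a list, and it replaces real NLP with a regular expression that recognizes phrases of the form "label is value". The `FormField` class, the field names, and the feedback wording are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FormField:
    name: str            # hypothetical name attribute of the input
    label: str           # label text extracted from the page
    kind: str = "text"   # input type, for example "text" or "email"
    required: bool = True


EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def extract_values(utterance: str, fields: List[FormField]) -> Dict[str, str]:
    """Map '<label> is <value>' phrases in an utterance to field names."""
    values: Dict[str, str] = {}
    for field in fields:
        # Capture the value up to a comma, the word "and", or the end of input.
        pattern = rf"\b{re.escape(field.label)}\s+is\s+(.+?)(?=,|\s+and\s+|$)"
        match = re.search(pattern, utterance, flags=re.IGNORECASE)
        if match:
            values[field.name] = match.group(1).strip()
    return values


def validate(values: Dict[str, str], fields: List[FormField]) -> List[str]:
    """Return conversational feedback for missing or malformed values."""
    messages: List[str] = []
    for field in fields:
        value = values.get(field.name, "")
        if field.required and not value:
            messages.append(f"I still need your {field.label}.")
        elif field.kind == "email" and value and not EMAIL_RE.match(value):
            messages.append(f"'{value}' does not look like a valid {field.label}.")
    return messages


fields = [
    FormField("full_name", "name"),
    FormField("email", "email", kind="email"),
    FormField("phone", "phone number"),
]
values = extract_values("My name is Ana Silva and my email is ana@example.com", fields)
feedback = validate(values, fields)
```

Treating the whole form as one unit lets a single utterance fill several fields, and the feedback list tells the user only what is still missing or wrong.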
“This represents a significant technical advancement over traditional form-filling approaches,” Kiselev explains. “Rather than treating each field as an isolated interaction, VoxSurf understands forms as cohesive units with interrelated parts.”
Real-Time Page Updates
Modern websites frequently update content dynamically without page reloads, creating significant challenges for accessibility tools.
“VoxSurf’s implementation of change detection is technically sophisticated,” Kiselev observes. “By monitoring DOM mutations and applying relevance filtering, they can notify users of meaningful changes while avoiding alert fatigue.”
Their technical approach includes:
- DOM mutation observation to detect changes at the code level
- Semantic analysis to determine change significance
- Contextual filtering based on user focus and activity
- Prioritized notifications for critical updates
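
The filtering and prioritization steps can be sketched as a scoring function over observed changes. This is an assumption-laden illustration, not VoxSurf's logic: the `Mutation` record, the region names, the keyword list, and the scoring weights are invented for the example. The `live` values mirror the standard `aria-live` politeness settings.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Mutation:
    region: str        # hypothetical page region id, e.g. "cart" or "ad-banner"
    text: str          # text content that was added or changed
    live: str = "off"  # aria-live politeness: "off", "polite", or "assertive"


# Assumed keyword list for changes the user should always hear about.
CRITICAL_WORDS = {"error", "failed", "expired", "required"}


def priority(mutation: Mutation, focused_region: str) -> int:
    """Score a change by politeness level, user focus, and critical wording."""
    score = 0
    if mutation.live == "assertive":
        score += 3
    elif mutation.live == "polite":
        score += 1
    if mutation.region == focused_region:
        score += 2
    if CRITICAL_WORDS & set(re.findall(r"[a-z]+", mutation.text.lower())):
        score += 3
    return score


def notifications(mutations: List[Mutation], focused_region: str,
                  threshold: int = 3) -> List[str]:
    """Return the texts worth announcing, most important first."""
    scored = [(priority(m, focused_region), m) for m in mutations if m.text.strip()]
    kept = [(s, m) for s, m in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [m.text for _, m in kept]


changes = [
    Mutation("ad-banner", "New deals every day"),
    Mutation("cart", "1 item added to cart", live="polite"),
    Mutation("checkout", "Payment failed: card expired", live="assertive"),
]
announced = notifications(changes, focused_region="cart")
```

The threshold is what guards against alert fatigue: the rotating ad banner scores zero and is never announced, while the payment error is read first.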
“This is a textbook example of how technical excellence enhances user experience,” Kiselev notes. “The system handles a complex technical challenge in a way that feels natural and unobtrusive to the user.”
Testing Methodology: Technical Rigor in Action
What particularly impressed Kiselev was VoxSurf’s comprehensive testing methodology:
“Their technical approach to testing demonstrates a real commitment to solving practical accessibility problems,” he explains. “Rather than testing with simplified scenarios, they simulated real-world conditions like creating accounts on unfamiliar websites—tasks that represent significant challenges for visually impaired users.”
This testing methodology included:
- Navigation through complex, real-world websites with varying structures
- Multi-step processes requiring context maintenance
- Error recovery scenarios when commands were misinterpreted
- Performance testing under varying network conditions
“This level of technical rigor in testing separates production-ready solutions from proofs of concept,” Kiselev emphasizes.
VoxSurf’s comprehensive testing across diverse website categories demonstrated remarkable improvements in task completion rates: