AI Translation & Transcription Service
A real-time transcription and translation service tailored to intermediate language learners consuming foreign media.
Media
Design Decisions
I took a human-first approach: before writing code, I sat down with my Chinese teacher and some classmates who were serious about learning the language, to understand and build for their needs rather than just my own.

With those constraints in hand, I broke the problem down. From a software design perspective, I built separate transcription and translation services, each decoupled from the underlying models. I also took my classmates' advice that a desktop app would serve learners better than the web extension I initially had in mind. To make the app source-agnostic (able to translate across TikTok, Instagram Reels, and different web browsers), it captures system audio via WASAPI, Windows' operating-system audio API.

For model selection, I knew speed was the highest priority, followed by accuracy, which pointed toward a smaller model. Testing several candidates, I found that SenseVoice-small won on latency: it is non-autoregressive, so it processes live audio very quickly. For translation, I used Gemma 4B served through an Ollama endpoint, which struck a similar balance between speed and accuracy. Critically, both models are multilingual.

Integrating these models together required more thoughtful design:
+ I used signal-preprocessing techniques (a sliding window over RMS energy) to detect sentence boundaries, so the models receive clean, complete sentences rather than fragments, which meaningfully improves translation accuracy.
+ Because the models are multilingual, I also implemented a "jitter window" to prevent the detected language from switching unnecessarily.
+ I am currently designing a "meaning matching" feature that explicitly links words in the source sentence to their counterparts in the translation. To accomplish this, I am experimenting with several approaches, including semantic similarity and GPT-based matching.
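The two preprocessing ideas above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual implementation: the class names, the silence threshold, the window size, and the `patience` value are all my own placeholder choices.

```python
import math
from collections import deque

def rms(frame):
    """Root-mean-square energy of one audio frame (a list of float samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

class BoundaryDetector:
    """Sliding window of per-frame RMS values: a sentence boundary is
    declared once every frame in the window falls below a silence
    threshold, i.e. the speaker has paused (thresholds are illustrative)."""
    def __init__(self, window_frames=5, silence_threshold=0.01):
        self.window = deque(maxlen=window_frames)
        self.threshold = silence_threshold

    def feed(self, frame):
        self.window.append(rms(frame))
        full = len(self.window) == self.window.maxlen
        return full and all(v < self.threshold for v in self.window)

class LanguageJitterWindow:
    """Only switch the active language after the same new language has been
    detected for `patience` consecutive segments, suppressing spurious flips."""
    def __init__(self, initial="zh", patience=3):
        self.active = initial
        self.candidate = None
        self.count = 0
        self.patience = patience

    def update(self, detected):
        if detected == self.active:
            self.candidate, self.count = None, 0  # agreement resets any pending switch
        elif detected == self.candidate:
            self.count += 1
            if self.count >= self.patience:
                self.active, self.candidate, self.count = detected, None, 0
        else:
            self.candidate, self.count = detected, 1  # start tracking a new candidate
        return self.active
```

In use, each incoming audio frame is fed to the boundary detector, and each transcribed segment's detected language passes through the jitter window before the pipeline is allowed to change translation direction.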
When I returned to consult with my Chinese teacher, she was greatly impressed by the MVP. She advised me to make the app aware of the user's fluency level (following HSK standards) and to tag words above that level for review, and she recommended a flashcard system for those new words. She has since expressed interest in bringing the tool into Duke's Language Program curriculum, and that possibility, that something built by one person could shape how a university teaches a language, is exactly the kind of impact I want to keep building toward. With her permission, I also secured a pool of 60 beta users: roughly 40 intermediate learners and 20 beginners. While the project is still in its early stages, I am encouraged by the great feedback I have been getting so far. To this day, I "dogfood" the app myself as I work on it, using it regularly to watch films.
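A minimal sketch of the fluency-aware tagging my teacher suggested: words above the learner's HSK level are flagged for the flashcard queue. The `HSK_LEVEL` table and `tag_for_review` helper are hypothetical, and the level assignments below are placeholders rather than official HSK data.

```python
# Illustrative vocabulary-to-level table; placeholder values, not real HSK data.
HSK_LEVEL = {
    "你好": 1,
    "谢谢": 1,
    "朋友": 2,
    "旅行": 3,
    "环境": 4,
    "措施": 5,
}

def tag_for_review(words, user_level):
    """Return the words above the learner's HSK level, preserving order.
    Words missing from the table are also flagged, since out-of-list
    vocabulary is likely advanced."""
    review = []
    for word in words:
        level = HSK_LEVEL.get(word)
        if level is None or level > user_level:
            review.append(word)
    return review
```

The flagged words would then feed the planned flashcard system for spaced review.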
Key Learnings & Takeaways
In building this project, I found it deeply rewarding not just to develop with the latest AI tools and agentic workflows, but to use AI to iterate rapidly on problems with real people, real constraints, and real feedback, crafting a practical learning tool that is tailored to users' experiences and joyful to use. I also learned about keeping guardrails on projects: I reviewed AI-generated edits carefully for security concerns and over-complicated implementations, and corrected them promptly. Working on human-centered projects like this is always a fulfilling experience.