
This was during campus placements in Dec '24 (freshers, take notes). CTC: <b>84 LPA</b> (including ESOPs). Disclaimer: <b>No DSA was asked</b>.


To get an interview call, we had to build a VAD (Voice Activity Detector) from scratch in 2.5 hours on-site (proctored). We were allowed any tool we wanted <b>except external APIs</b> (I do remember @ChatGPTapp giving me hallucinated responses, so I had to go back to the docs). A dataset was provided (~50 audio files). We were judged on: 1) accuracy of speech detection, 2) code quality, 3) possible improvements to the approach that we couldn't implement. Any kind of architecture was welcome for building the VAD. I went with a Denoiser + WebRTC (GMM-based) approach, as I knew it would give the highest accuracy, and accuracy carried the highest weightage. 7 got shortlisted and I was one of them. The interview was led by the head of the ASR team.
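For anyone who has not built one, here is a minimal sketch of the simplest kind of VAD: a per-frame energy threshold with a short hangover. This is a swapped-in baseline for illustration, not the Denoiser + WebRTC (GMM-based) pipeline described above. The function names, the 30 ms frame size, the threshold and the hangover length are all assumptions and are not tuned on any dataset.

```python
import math

def frame_rms(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's RMS energy."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms

def energy_vad(samples, sample_rate=16000, frame_ms=30, threshold=0.02, hangover=2):
    """Return one boolean per frame: True where speech is assumed present.

    samples: floats in [-1.0, 1.0]. threshold and hangover are illustrative values.
    """
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    quiet_run = hangover + 1  # quiet frames since the last loud one; start as silent
    for energy in frame_rms(samples, frame_len):
        if energy >= threshold:
            quiet_run = 0
        else:
            quiet_run += 1
        # Stay "in speech" for up to `hangover` quiet frames after the last loud
        # frame, so short pauses inside a word do not split the segment.
        flags.append(quiet_run <= hangover)
    return flags
```

A fixed energy threshold breaks down on noisy audio, which is one reason to put a denoiser in front and use a statistical model such as the GMM in WebRTC's VAD instead.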

---

We started with my internship experience in Tokyo, where I led the ASR, VAD and open-source LLM integration for a company that built warehouse management robots and was pivoting into adding speech functionality to them. We discussed:
> how I patched the WER using NLP to correct/fill in the gaps if the voice breaks in between
> what VAD architecture I used
> how I reduced CPU/GPU load
> how I used different @OpenAI Whisper models to get p95 latency under 800 ms
> the high-level scaling methodologies I used to benchmark and stress-test STT models

Then we moved on to ML and transformer basics (because I was more into LLMs):
> explain the whisper-jax architecture and how it processes audio chunks
> coding naive gradient descent from scratch on docs (<s>as @GoogleColab was auto completing for me lmao</s>)
> explain perplexity and what other benchmarks we use for LLMs
> touched on self-attention and the differences between encoder and decoder architectures; that day I realized that almost all the new SOTA models are decoder-only
> he also went into a deep discussion on how we can relate linear algebra to transformers (<s>I took a LinAl course</s>)

Finally, we discussed @SarvamAI Bulbul models, especially why they use latent space decomposition and how that helps separate speech content from speaker/style representations.

PS: No tokens were harmed in writing this.
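For reference, "naive gradient descent from scratch" can look like the sketch below. The post does not say which objective was used in the interview, so this assumes a 1-D linear fit with mean squared error. The function name, learning rate and step count are illustrative choices.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by minimizing mean squared error with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = 2.0 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = 2.0 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Usage: recover the line y = 2x + 1 from noiseless points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = gradient_descent(xs, ys)
```

The learning rate has to stay small enough for the updates to converge; for this data 0.05 is well inside the stable range.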