My interview experience with @SarvamAI for the ML engineer role.

@ranaharshraj7
Jun 15, 2026

This was during campus placements in Dec '24 (freshers, take notes).
CTC: 84 LPA (including ESOPs)

Disclaimer : No DSA was asked

To get an interview call, we had to build a VAD (Voice Activity Detector) from scratch in 2.5 hours on-site, under proctoring. We were allowed to use any tool except external APIs. (I do remember @ChatGPTapp giving me hallucinated responses, so I had to go back to the docs.)

Dataset was provided (~50 audio files).

We were judged on :
1) Accuracy of speech detection
2) Code quality
3) Possible improvements to the approach that we couldn't implement.

Any kind of architecture was welcome for building the VAD. I went with a Denoiser + WebRTC (GMM-based) approach because I was confident it would give the highest accuracy, and accuracy carried the highest weightage.
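For anyone who has never built a VAD, the simplest baseline is a short-time energy threshold. The sketch below is not the Denoiser + WebRTC pipeline I used; it is a much cruder stand-in, and the function names, frame length (160 samples, i.e. 10 ms at 16 kHz) and threshold are illustrative assumptions, not values from the assignment.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (trailing partial frame dropped)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

def energy_vad(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True where the frame energy exceeds the threshold."""
    return [e > threshold for e in frame_energies(samples, frame_len)]

# Synthetic demo (made-up signal): 3 silent frames, 3 frames of a 440 Hz tone, 3 silent frames.
sample_rate = 16000
silence = [0.0] * 480
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(480)]
flags = energy_vad(silence + tone + silence)
print(flags)
```

A fixed threshold like this breaks down on noisy audio, which is roughly why a denoiser in front of a statistical model such as WebRTC's GMM tends to do better.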

7 of us got shortlisted, and I was one of them.
The interview was led by the head of the ASR team.


We started with my internship experience in Tokyo, where I led the ASR, VAD and open-source LLM integration for a company that worked on warehouse management robots and was pivoting into adding speech functionality to them.
We discussed:
> how I improved the WER using NLP to correct/fill in the gaps when the voice breaks in between
> what VAD architecture I used
> how I reduced CPU/GPU load
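Since WER comes up a lot in ASR interviews, here is a minimal sketch of how it is usually computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function name and the example sentences are made up for illustration; this is not the evaluation code from the internship.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical transcript pair: one dropped word ("the") and one substitution ("four" -> "for").
score = wer("move the pallet to bay four", "move pallet to bay for")
print(round(score, 3))
```

An NLP correction pass that restores the dropped word or fixes the homophone would lower this number without touching the acoustic model.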

We also covered how I used different @OpenAI Whisper models to get p95 latency under 800 ms, and the high-level scaling methodologies I used to benchmark and stress-test STT models.
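For context on the p95 number, this is a minimal sketch of a nearest-rank percentile over per-request latencies. The latency values are invented for the example and the helper name is hypothetical; real benchmarks would use far more samples.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up latencies in milliseconds for 20 transcription requests.
latencies_ms = [510, 455, 1240, 480, 790, 412, 610, 530, 700, 498,
                523, 547, 560, 575, 590, 630, 655, 680, 735, 760]
p95 = percentile_nearest_rank(latencies_ms, 95)
print(p95, "ms", "(meets target)" if p95 < 800 else "(misses target)")
```

The point of tracking p95 rather than the mean is that a single slow outlier (the 1240 ms request here) does not hide or dominate the picture of what most users experience.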

Then we moved on to ML and transformer basics (because I was more into LLMs):

> explain the whisper-jax architecture and how it processes audio chunks
> code naive gradient descent from scratch in a doc (as @GoogleColab was auto-completing for me lmao)
> explain perplexity and what other benchmarks we use for LLMs
> touched on self-attention and the differences between encoder and decoder architectures; that day I realized that almost all the new SOTA models are decoder-only
> he also went into a deep discussion on how we can relate linear algebra to transformers (I took a LinAl course)
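For freshers wondering what "naive gradient descent from scratch" looks like, here is a minimal sketch on a one-variable linear regression with mean squared error. This is not the exact code from the interview; the toy data, learning rate and step count are assumptions picked so the example converges.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the fit should recover w = 2 and b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

If the learning rate is pushed too high for the data scale the updates diverge, which is a common follow-up question.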

Finally, we discussed @SarvamAI's Bulbul models, especially why they use latent space decomposition and how that helps separate speech content from speaker/style representations.

PS: No tokens were harmed in writing this.
