My interview experience with @SarvamAI for ML engineer role.

This was during campus placements, Dec '24 (freshers, take notes).
CTC : 84 LPA (including ESOPs)

Disclaimer : No DSA was asked

To get an interview call, we had to build a VAD (Voice Activity Detector) from scratch in 2.5 hours on-site (proctored). We were allowed any tool except external APIs (I do remember @ChatGPTapp giving me hallucinated responses, so I had to go back to the docs).

Dataset was provided (~50 audio files).

We were judged on :
1) Accuracy of speech detection
2) Code quality
3) Possible improvements to the approach that we couldn't implement.

Any kind of architecture was welcome for building the VAD. I went with a Denoiser + WebRTC (GMM-based) approach, as I expected it to give the highest accuracy, and accuracy carried the highest weightage.
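
For anyone who hasn't built one: a VAD labels each short audio frame as speech or non-speech. Below is a minimal sketch of the simplest possible baseline, an energy-threshold VAD with a short hangover. It is not the Denoiser + WebRTC pipeline described above, and the function names, the -35 dB threshold, the 30 ms frame size and the hangover length are all assumptions picked for illustration.

```python
import math

def frame_energies_db(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's RMS energy in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        # Small epsilon avoids log10(0) on digital silence.
        energies.append(20 * math.log10(rms + 1e-10))
    return energies

def energy_vad(samples, sample_rate=16000, frame_ms=30,
               threshold_db=-35.0, hangover_frames=3):
    """Return one boolean per frame: True = speech, False = non-speech.

    Assumes samples are floats in [-1, 1]. The threshold and hangover
    values are illustrative defaults, not tuned on any dataset.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = []
    hang = 0
    for energy in frame_energies_db(samples, frame_len):
        if energy > threshold_db:
            flags.append(True)
            hang = hangover_frames      # keep the gate open briefly after speech
        elif hang > 0:
            flags.append(True)
            hang -= 1
        else:
            flags.append(False)
    return flags
```

A GMM-based detector like WebRTC's replaces the fixed threshold with a learned speech/noise model per frequency band, which is why it holds up better on noisy audio; the framing and hangover logic stay conceptually similar.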

7 got shortlisted and I was one of them.
The interview was led by the head of the ASR team.

We started with my internship experience in Tokyo, where I led the ASR, VAD and open-source LLM integration for a company that built warehouse management robots and was pivoting into adding speech functionality to them.
We discussed :
> how I brought down the WER using NLP to correct/fill in the gaps when the voice breaks mid-utterance
> what VAD architecture I used
> how I reduced CPU/GPU load
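
Quick refresher on the metric in the first point: WER (word error rate) is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization (the function name and example sentences are made up for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (ref word missing from hyp)
                curr[j - 1] + 1,     # insertion (extra word in hyp)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One dropped word plus one wrong word against a 6-word reference.
print(word_error_rate("move the pallet to bay four", "move pallet to bay for"))
```

Any post-processing step that repairs dropped or misheard words lowers this number only if the repaired words match the reference exactly, so normalization (casing, numerals) matters when comparing systems.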

We also covered how I used different @OpenAI Whisper models to get p95 latency under 800 ms, and the high-level scaling methodologies I used to benchmark and stress-test STT models.
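
For context, p95 latency is the value that 95% of requests come in at or under. A minimal sketch of computing it from a list of measured latencies, assuming the nearest-rank definition of a percentile (benchmarking tools differ on interpolation; the function name and sample numbers are invented for illustration):

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: smallest rank such that at least pct% of samples are <= it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds.
latencies_ms = [120, 300, 450, 790, 810]
p95 = percentile_nearest_rank(latencies_ms, 95)
print("p95 under 800 ms budget:", p95 < 800)
```

With few samples the p95 is dominated by the single slowest request, so a stress test needs enough requests for the tail to be meaningful.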

Then we moved on to ML and transformer basics (because I was more into LLMs) :

> explain whisper-jax architecture and how it processes audio chunks
> coding naive gradient descent from scratch on docs (~~as @GoogleColab was auto completing for me lmao~~)
> explain perplexity and what other benchmarks we use for LLMs
> touched on self-attention and the differences between encoder and decoder architectures, and that day I realized that almost all the new SOTA models are decoder-only
> He also went into a deep discussion on how we can relate linear algebra to transformers (~~I took a LinAl course~~)
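
Two of the items above fit in a few lines each. This is a rough sketch of what "naive gradient descent from scratch" and "perplexity" can look like, not a reconstruction of what was written in the interview: the linear model, learning rate, step count and function names are all assumptions.

```python
import math

def gradient_descent_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = (w * x + b) - y
            grad_w += 2 * err * x / n   # d(MSE)/dw
            grad_b += 2 * err / n       # d(MSE)/db
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens).

    token_probs: probability the model assigned to each actual next token.
    """
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Data generated from y = 2x + 1, so the fit should approach w=2, b=1.
w, b = gradient_descent_linear([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(round(w, 3), round(b, 3))
```

A model that spreads probability uniformly over k choices at every step has perplexity k, which is a handy sanity check: `perplexity([0.25] * 4)` comes out at 4 (up to floating-point error).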

Finally, we discussed @SarvamAI's Bulbul models, especially why they use latent space decomposition and how that helps separate speech content from speaker/style representations.

**PS: No tokens were harmed in writing this.
