Getting a transcript is the first step to analyzing a call with LLMs.
So if you’re building a sales coaching tool, an interview coaching tool, a meeting summarization app, or any sort of app that analyzes calls, you’ll need to figure out how to get the transcript of the call.
Zoom is one of the most popular platforms for video calls, so you might be wondering: “How can I get transcripts from Zoom programmatically? Is there a Zoom transcript API?”
Unfortunately, Zoom does not have a transcription API, but don’t worry — there are many other ways to get a transcript from Zoom programmatically, which we’ll go through in this blog post.
In this post, we’ll also help you consider the pros and cons of each method, and help you weigh factors you might not have thought about, such as the best method for getting real-time transcription, the best method to get accurate speaker diarization, the lowest cost solution, and so on.
There are 7 ways to get transcripts from a Zoom call programmatically:
- Method 1: Download Transcripts from the Zoom Cloud Recording API
- Method 2: Zoom Cloud Recording + Transcription API
- Method 3: RTMP live streaming + Transcription API
- Method 4: Desktop app to capture system audio + Transcription API
- Method 5: Web app to capture microphone + Transcription API
- Method 6: Build a Meeting Bot + Transcription API
- Method 7: Use Recall.ai
At the bottom of this blog post, we’ve also included a comparison chart of all the different methods to get transcription from Zoom.
Now, let’s dive in.
Method 1: Download Transcripts from the Zoom Cloud Recording API
Zoom Cloud Recording is a powerful feature built natively into Zoom that allows the meeting host to record the call directly into Zoom’s cloud storage. Not only are the audio and video from the meeting recorded, but Zoom Cloud Recording will also produce a transcript of the meeting in VTT format. The transcript file can be retrieved from Zoom’s Get Meeting Recording endpoint.
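To make the retrieval step concrete, here is a minimal Python sketch of calling the recordings endpoint and picking out the transcript file. Treat it as an illustration rather than a reference implementation: the endpoint path and the `recording_files`, `file_type`, and `"TRANSCRIPT"` names reflect our understanding of Zoom's API and should be checked against Zoom's current documentation, and `find_transcript_url` and `fetch_meeting_recordings` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional

# Assumed base URL for Zoom's REST API; verify against Zoom's API reference.
ZOOM_API_BASE = "https://api.zoom.us/v2"


def find_transcript_url(recording_response: dict) -> Optional[str]:
    """Return the download URL of the first transcript file, or None."""
    for item in recording_response.get("recording_files", []):
        # "TRANSCRIPT" is the assumed file_type value for the VTT transcript.
        if item.get("file_type") == "TRANSCRIPT":
            return item.get("download_url")
    return None


def fetch_meeting_recordings(meeting_id: str, access_token: str) -> dict:
    """Call the recordings endpoint for one meeting and decode the JSON body."""
    request = urllib.request.Request(
        f"{ZOOM_API_BASE}/meetings/{meeting_id}/recordings",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In practice you would pass the result of `fetch_meeting_recordings` to `find_transcript_url` and treat a `None` result as "no transcript yet", since the transcript may not exist until processing finishes or if the user never enabled transcription.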
On the surface this seems very straightforward - just call the endpoint to get transcripts from Zoom, right? Unfortunately, it’s not so simple.
Let’s start with the benefits of using this method, and then we’ll dive into what you should watch out for.
Pros
-
Speaker separated transcripts
The transcript you get is annotated with the speaker’s name. This is especially important if you’re analyzing the transcript with an LLM because knowing who spoke each word allows the LLM to produce much better results.
-
Transcripts are included free of cost to you
If your users are already on the Pro, Business, or Enterprise tier, transcripts are already included, and free.
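Before handing a speaker-annotated transcript to an LLM, you will usually want it in a structured form. The sketch below parses VTT text into timed cues with a speaker field. It assumes each cue's text starts with the speaker's name followed by a colon and a space (for example `Alice Example: Hi everyone.`); that layout is our assumption about Zoom's output, so inspect a real file before relying on it. `Cue` and `parse_vtt` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class Cue(NamedTuple):
    start: float  # seconds from the start of the recording
    end: float
    speaker: Optional[str]  # None when no "Name: " prefix is present
    text: str


# Matches a WebVTT timing line such as "00:00:01.500 --> 00:00:04.250".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d+):(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_vtt(vtt_text: str) -> List[Cue]:
    """Split VTT text into cues, separating the speaker name from the text."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        timing_index = next(
            (i for i, line in enumerate(lines) if TIMING.search(line)), None
        )
        if timing_index is None:
            continue  # the "WEBVTT" header or a note block
        groups = TIMING.search(lines[timing_index]).groups()
        start, end = _seconds(*groups[:4]), _seconds(*groups[4:])
        payload = " ".join(lines[timing_index + 1:])
        speaker, separator, rest = payload.partition(": ")
        if separator and speaker:
            cues.append(Cue(start, end, speaker, rest))
        else:
            cues.append(Cue(start, end, None, payload))
    return cues
```

One caveat with this simple approach: a cue with no speaker prefix whose text happens to contain ": " would be misread as having a speaker, so a production parser might check the prefix against the known participant list.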
Cons
-
Only available on paid Zoom plans
To take advantage of this feature, your users need to be on the Pro, Business or Enterprise tier.
-
Your users must record to Zoom Cloud
A transcript is only produced if your user records their meeting using Zoom Cloud Recording (not Zoom local recording). If your users aren’t used to recording on Zoom Cloud this will be a behavior change they will need to make.
-
Only the host can record
Only the host of the meeting can record to Zoom Cloud, so if your user is not the host of the meeting, this method won’t be applicable.
-
Users must turn on transcription in the settings
The audio transcript feature in Zoom is disabled by default. Your users must enable this setting in Zoom for transcription to be produced. If the toggle is grayed out, they must contact their workspace admin to enable it. If they can’t find the toggle, double-check that they are on a paid Zoom plan. The good news is that your users only need to turn it on once, and it will apply to all future meetings.
-
Long wait time
Zoom Cloud Recordings typically take about 2 times the duration recorded to process, but occasionally may take up to 24 hours due to higher processing loads at that time. For example, the recording of a 30 minute long meeting will be available 1 hour after the meeting is done. This means that your analysis will be delayed.
-
English only
English is the only language supported by Zoom’s transcript feature right now.
-
No customization
You don’t have any control over the quality of the transcripts, because they’re produced automatically by Zoom. If there is specific vocabulary you want transcribed (e.g. company names, person names, medical terms), you won’t be able to fine-tune the transcription to your needs.
-
OAuth Integration needed
To access the Zoom Cloud Recording API, your users need to connect their Zoom accounts via OAuth. This may need to go through the Zoom workspace admin and can cause additional friction during onboarding.
-
Zoom will need to review your app
Because an OAuth integration is needed, you must build a Zoom app. To use your Zoom app in production, the app will need to go through a Zoom app review process, which takes around 4 weeks.
-
No per-word time stamps
If you want to build out a UI where you can click on a word in the transcript to jump to a moment in the recording, you will need the timestamps for each word. But if that isn’t a need for you, this won’t be a problem.
-
No real-time transcripts
Zoom Cloud transcripts are only available after the meeting is done, so you won’t be able to get the transcription in real-time.
Method 2: Zoom Cloud Recording + Transcription API
Suppose you require higher quality transcription than the default transcription produced by Zoom Cloud Recording. In that case, another option is to use a transcription API like AWS Transcribe, Google Speech-to-Text, OpenAI Whisper, or others to transcribe the recording produced by Zoom Cloud Recording.
Concretely, here is how this method would work:
- Record the meeting using Zoom Cloud Recording (same as in Method 1).
- After the recording is complete, call Zoom’s Get Meeting Recording endpoint and get the download_url of the recording, which will let you download an MP4 of the recording.
- Then, pass the MP4 file to the transcription provider of your choice. The transcription provider will give you back the transcript of the file, typically in JSON format.
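The steps above can be sketched as a small pipeline. This is only an outline under stated assumptions: the `recording_files` and `file_type` field names are our understanding of Zoom's response shape, and `download` and `transcribe` are hypothetical callables you would supply yourself (an authenticated HTTP download and your transcription provider's client, respectively), since every provider's API is different.

```python
from typing import Callable, Optional


def find_mp4_url(recording_response: dict) -> Optional[str]:
    """Return the download_url of the first MP4 file, or None."""
    for item in recording_response.get("recording_files", []):
        if item.get("file_type") == "MP4":
            return item.get("download_url")
    return None


def transcribe_recording(
    recording_response: dict,
    download: Callable[[str], bytes],
    transcribe: Callable[[bytes], dict],
) -> dict:
    """Download the MP4 and hand it to the transcription provider.

    `download` and `transcribe` are injected so this sketch stays
    independent of any particular HTTP client or provider SDK.
    """
    url = find_mp4_url(recording_response)
    if url is None:
        raise LookupError("no MP4 file in the recording response")
    return transcribe(download(url))
```

Keeping the provider behind a plain callable also makes it easier to swap transcription vendors later, or to compare two of them on the same recording.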
Now that you understand at a high level how this works, let’s go through the pros and cons.
Pros
-
Higher quality transcription
By using a third-party transcription provider, you can get higher quality transcription than Zoom Cloud Recording produces by default.
-
Custom vocabulary support
By using a third-party transcription provider, you can customize the results of the transcription to better suit your specific use case. For example, many transcription providers support “word boost” or “custom vocabulary” to enable accurate transcription of industry-specific terms or company names.
-
Multi-language support
Most transcription APIs, like Google STT, support multiple languages.
Cons
-
Third-party transcription is required, which is an additional cost
You’ll need to pay an additional cost to the third-party transcription provider, which can be expensive at scale.
-
No speaker separation
Because Zoom only provides a mixed audio stream from Cloud Recordings, you’re not able to get the transcript annotated with speaker names.
-
Your users must record to Zoom Cloud
In this method, a transcript is only produced if your user records their meeting using Zoom Cloud Recording. So if your users aren’t used to recording on Zoom Cloud, this will be a behavior change they will need to make.
-
Only the host can record
Only the host of the meeting can record to Zoom Cloud, so if your user is not the host of the meeting, this method won’t be applicable.
-
Long wait time
This method relies on Zoom Cloud Recording, so you will need to wait for the Zoom Cloud Recording to finish processing, which typically takes about twice the duration of the recording (around 2 hours for a 1 hour long call). On top of that, you will also need to pass the recording to the transcription provider, which adds further latency.
-
Only available on paid Zoom plans
Just like with getting transcripts directly from Zoom Cloud, your users must be on a paid Zoom plan, otherwise this functionality is not available.
-
OAuth Integration needed
Just like with getting transcripts directly from Zoom Cloud, your users need to connect their Zoom accounts via OAuth, which can cause user onboarding friction.
-
Zoom will need to review your app
Just like with getting transcripts directly from Zoom Cloud, you will need to create a Zoom app to access your user’s Zoom cloud recordings. The Zoom app will need to go through a review before it can be used in production. This review takes 4 weeks on average.
-
No real-time transcripts
Zoom Cloud recordings are only available after the meeting is done, so you won’t be able to get the transcription in real time.
Method 3: RTMP live streaming + Transcription API
Another option that uses the Zoom API but can provide real-time data is Zoom’s RTMP live streaming feature.
RTMP, or the Real-Time Messaging Protocol, is a technology that allows you to stream audio and video in real time over the internet. Zoom supports streaming Zoom meetings through RTMP, and you can set up an RTMP endpoint to receive and process these streams to get a transcription.
Here is how this method would work:
- Initiate a live stream using Zoom RTMP, providing your Stream URL and Stream Key.
- The audio will start live streaming to the Stream URL you provided.
- When you receive the audio, you can either:
- Store the audio until the meeting is done, and give the transcription provider the full recording. Or,
- Stream the audio to the transcription provider in chunks. The transcription provider will give you back the live transcription.
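For the chunked option, the core of the work is slicing the incoming audio into fixed-duration pieces. The sketch below assumes you have already decoded the RTMP stream's audio track into raw 16-bit mono PCM (for example with a media tool running in front of your code) and can read it from a file-like object. The sample rate, chunk length, and the `pcm_chunks` name are illustrative assumptions, not requirements of Zoom or of any particular provider.

```python
from typing import BinaryIO, Iterator

BYTES_PER_SAMPLE = 2  # assumes 16-bit PCM, mono


def pcm_chunks(
    stream: BinaryIO, sample_rate: int = 16000, chunk_ms: int = 250
) -> Iterator[bytes]:
    """Yield fixed-duration chunks of raw PCM audio read from `stream`.

    The final chunk may be shorter if the stream ends mid-chunk.
    """
    chunk_size = sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE
    buffer = b""
    while True:
        # A pipe or socket may return fewer bytes than requested,
        # so keep reading until a full chunk has accumulated.
        data = stream.read(chunk_size - len(buffer))
        if not data:
            break
        buffer += data
        if len(buffer) == chunk_size:
            yield buffer
            buffer = b""
    if buffer:
        yield buffer
```

Each yielded chunk would then be sent to your provider's streaming interface. Smaller chunks reduce the delay you add on top of RTMP's own latency, at the cost of more requests.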
Now that you understand at a high level how this works, let’s go through the pros and cons.
Pros
-
No wait time
Because the data is sent in real-time, you don’t need to wait for a Cloud Recording to complete. You can produce the transcript while the call is in progress to have immediate results after the call is done.
-
Real-time support
By streaming the audio to the transcription provider in real-time, you could get the transcription in real-time too.
-
Custom vocabulary & multi-language support
Because you are using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.
Cons
-
Live streaming badge can make participants feel uncomfortable
When you live stream, a “live streaming” badge shows up in your Zoom meeting so all participants know the call is being streamed. This is natively built into Zoom for compliance reasons, and can’t be removed. Understandably, this badge can cause some participants to feel uncomfortable.
-
Only the host can start a stream
Only the meeting host can start a live stream, so if your user is not the host, this method won’t be applicable.
-
The meeting host must be on a paid Zoom account
The meeting organizer must have a paid Zoom account to use live streaming.
-
Users must turn on live streaming in settings
The live streaming feature in Zoom is disabled by default. Your users must enable the “Allow livestreaming of meetings” setting in Zoom for live streaming to be available. If the option is grayed out, it has been locked at either the group or account level, and your user will need to contact their Zoom admin to make changes. The good news is that your users only need to turn it on once, and it will apply to all future meetings.
-
Setup can be a hassle
Setting up live streaming can be a hassle: only the meeting host can initiate the live stream, and they must do so manually for every meeting they host. Alternatively, you can use the Zoom API to start the live stream automatically, but in that case users will need to connect their accounts via OAuth.
-
Third-party transcription is required, which is an additional cost
There is no built-in transcript with Zoom RTMP – you’ll need to work with a third-party transcription provider which will come with an additional cost.
-
No speaker separation
Zoom live streaming only provides a mixed audio stream with no speaker metadata, so you cannot get a transcript annotated with speaker names. Some transcription providers can use AI to separate the transcription by speaker. However, in those cases you’d get speakers labeled as “Speaker 1”, “Speaker 2”, and so on instead of the actual person’s name.
-
High latency
Because RTMP is an inherently high latency protocol, latencies of 10-30s are expected. However, if you aren’t working with the audio in real-time, this will not be a problem for you.
-
You don’t get per-word time stamps
Whether or not this is a problem depends on your use case. For example, if you want to build out a UI where you can click on a word to jump to a timestamp, you will need the timestamp for each word. But if this isn’t a product need for you, this won’t be a problem.
For additional information on this option, here’s a link to our blog post on the pros and cons of Zoom RTMP streaming.
Method 4: Desktop app to capture system audio + Transcription API
An alternative way to get access to audio streams that doesn’t require Zoom APIs is to build a desktop app that records audio from the user’s computer. You can then use a transcription API to transcribe that audio, or you could even run an open-source transcription model like Whisper locally on your user’s computer.
⚠️ Capturing computer audio means that you do not benefit from Zoom's built-in recording consent functionality. Your app will need to handle recording consent to ensure compliance with local laws and regulations.
Let’s weigh the pros and cons of this method.
Pros
-
No wait time
Because the data can be sent in real time, you don’t need to wait for a Cloud Recording to complete. You can produce the transcript while the call is in progress to have immediate results after the call is done.
-
Real-time support
By streaming live audio to the transcription provider, you could get the transcription in real time too.
-
Works on free Zoom plans
Unlike Methods 1, 2, and 3, a desktop app can record calls hosted on free Zoom accounts.
-
Your user can record the meeting, even if they are not the host
Also unlike Methods 1, 2, and 3, a desktop app can record the system audio even if your user is not the host of the meeting.
-
Works on all meeting platforms
This method works the same across all video conferencing platforms (Zoom, Google Meet, Microsoft Teams, etc), which means your users will have a more consistent experience, and you also save the engineering effort of building a new integration for each platform you need to work with.
-
Custom vocabulary & multi-language support
Because you are using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.
Cons
-
Significant engineering burden
If you don’t already have a desktop app, building one is a significant engineering challenge. Shipping a high-quality, stable desktop recorder can take a skilled engineering team several months, and continuous maintenance is required. Desktop apps are particularly challenging because end-user computers are much more varied environments than your servers, which can trigger more bugs, and you need to support multiple platforms such as macOS, Windows, and Linux.
-
Users need to install a desktop app
Having your users install a desktop app can be a barrier to adoption as it leads to higher friction. Many companies disallow employees from installing new desktop apps, and many users don’t want to install an app if they’re just trying out your product.
-
No speaker separation
Because you can only get 2 audio streams from a desktop app, the speakers and the microphone, if there are more than 2 participants in a meeting you won’t be able to separate the transcript by speaker.
Just like in Method 3, you could leverage your transcription API’s machine diarization capabilities to split speakers out into “Speaker 1”, “Speaker 2”, etc.
-
Recording consent must be implemented separately
If you use an officially supported Zoom recording mechanism, you benefit from Zoom's built in recording consent functionality, which ensures that all participants have consented to being recorded, and that your app is compliant with recording consent laws and regulations.
If you choose to record outside of that framework, your app will need to handle all of this, in order to make sure that you and your customers are following the relevant laws.
-
Can make your user’s computer slow
Running the desktop app can put additional load on your customer’s computers causing them to heat up, perform more slowly, and have a shorter battery life. This can be especially severe if your desktop app is not highly optimized or if you’re running heavy computations like media encoding or transcription locally.
-
Third-party transcription is required, which is an additional cost
You’ll need to pay an additional cost to the third-party transcription provider, which can be expensive at scale.
Method 5: Web app to capture microphone + Transcription API
For this method to work, the Zoom meeting audio needs to be played out loud so the computer microphone picks up on it. The web app records the audio coming from the microphone and passes it to a transcription provider. This is much easier to build than a desktop app and doesn’t require much maintenance either.
Pros
-
No wait time
Because the data can be sent in real time, you can produce the transcript while the call is in progress to have immediate results after the call is done.
-
Real-time support
By streaming the live audio to the transcription provider, you could get the transcription in real time too.
-
Works on free Zoom plans
A web app can record the microphone audio even if your user is on a free Zoom account.
-
Your user can record the meeting, even if they are not the host
A web app can record the microphone audio even if your user is not the host of the meeting.
-
Works on all meeting platforms
This method works the same across all video conferencing platforms (Zoom, Google Meet, Microsoft Teams, etc), which means your users will have a more consistent experience, and you also save the engineering effort of building a new integration for each platform you need to work with.
-
Custom vocabulary & Multi-language support
Because you are using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.
Cons
-
Method won’t work if your user has headphones in
The web app won’t be able to capture what the other participants in the meeting are saying if your user is using headphones. If you want to get the audio from the meeting, your user needs to play the meeting audio out loud so their microphone picks up on it.
-
Method won’t work if the computer speaker volume is too quiet
Similarly, if the user’s computer speaker volume is too low, the microphone won’t pick up the audio.
-
No speaker separation
Because you can only get 1 audio stream from the microphone, you won’t be able to separate the transcript by speaker.
Just like in the previous methods, you could leverage your transcription API’s machine diarization capabilities to split speakers out into “Speaker 1”, “Speaker 2”, etc.
-
Third-party transcription is required, which is an additional cost
You’ll need to pay an additional cost to the third-party transcription provider, which can be expensive at scale.
Method 6: Build a Meeting Bot + Transcription API
Building a meeting bot is an option that doesn’t involve a change in user behavior, and also doesn’t put additional load on your customer’s computers. A meeting bot is essentially an instance of Zoom that you run on your servers, which you can use to capture the audio and video data from the meeting in real-time. You can then pass the audio and video to a transcription API. Note that the bot will show up as another participant in the meeting.
Pros
-
Speaker separated transcripts
Meeting bots have access to data on when each person in the meeting is speaking. This means you’ll be able to diarize your transcript accurately and figure out what words were said by which person, no matter which transcription API or model you’re using.
-
Works on free Zoom plans
A meeting bot can record the meeting even if your user is on a free Zoom account.
-
Your user can record the meeting, even if they are not the host
A meeting bot can record the meeting even if your user is not the host of the meeting.
-
No wait time
Because the audio can be streamed in real time, you can produce the transcript while the call is in progress to have immediate results after the call is done.
-
Low latency, real-time support
Meeting bots are a real-time and low-latency form of data capture because they connect directly to the meeting. You can expect minimal latency, from 200-500 ms while using a meeting bot, and you’ll be able to do real-time analysis of the transcription.
-
Consistent user experience
Meeting bots have the same user experience across all the major platforms of Zoom, Meet, Teams, and Webex. Therefore, if you go the meeting bot route, your users will have a consistent experience across all platforms (though you’ll need to maintain separate bot implementations for each platform).
-
Custom vocabulary & multi-language support
Because you are using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.
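The speaker-separation benefit above comes down to aligning two timelines: the word timestamps returned by your transcription provider and the "who was speaking when" data the bot observes. Here is a minimal sketch of that alignment, assuming you have both as lists of timed records. The `Word`, `SpeakerTurn`, and `label_words` names and shapes are hypothetical and will differ from whatever your SDK and provider actually return.

```python
from typing import List, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    text: str
    start: float  # seconds from the start of the meeting
    end: float


class SpeakerTurn(NamedTuple):
    name: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(
    words: List[Word], turns: List[SpeakerTurn]
) -> List[Tuple[Optional[str], str]]:
    """Attach to each word the speaker whose turn overlaps it the most.

    Words that overlap no turn at all are labeled None.
    """
    labeled = []
    for word in words:
        best_name, best_overlap = None, 0.0
        for turn in turns:
            amount = overlap(word.start, word.end, turn.start, turn.end)
            if amount > best_overlap:
                best_name, best_overlap = turn.name, amount
        labeled.append((best_name, word.text))
    return labeled
```

This nested loop is fine for a single meeting. For very long calls you could sort both lists and sweep through them once instead.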
Cons
-
Significant engineering burden
Meeting bots are a major effort to build, and the effort must be repeated for each platform you want to support, as a bot built for Zoom won’t work on any other platform. It takes a skilled engineering team over a year to build stable, scalable, and cost-effective bots for the 3 major platforms: Zoom, Meet, and Teams.
-
Ongoing maintenance burden
Meeting bots come with a lot of maintenance and infrastructure toil to operate at scale. Because meeting bots connect to meeting platforms that in many cases don’t have official APIs, there’s a significant amount of maintenance required to ensure that your bots keep up with any platform-level changes that could break them. Additionally, because meeting bots run in your infrastructure, you’ll need to scale, monitor, and debug issues with bots as your customer base grows. The maintenance work required to run meeting bots at scale can take the full-time effort of 3-6 senior engineers.
-
Can be costly to operate
You also need to pay for the infrastructure you’re using to host the meeting bot. Not only are meeting bots difficult to build and maintain, but they can also be costly to run. Each bot is running an instance of Zoom, which means you end up managing a fleet of servers, which can be costly until you reach economies of scale.
-
Third-party transcription is required, which is an additional cost
Because transcription isn’t built-in, you’ll need to pay for transcription whether it comes from a third-party provider, or from an open-source model you’re hosting yourself.
-
Zoom will need to review your app
Zoom meeting bots typically join the Zoom meeting by running an instance of the Zoom SDK. Before your Zoom SDK credentials can be used in production, Zoom will need to review your meeting bot. This takes around 4 weeks.
Method 7: Use Recall.ai
Recall.ai is a hosted meeting bot service. Recall manages the infrastructure, monitors and updates the bot implementations to handle updates on each platform, and allows you to use meeting bots with minimal engineering effort.
Pros
-
Works on all meeting platforms
Recall is a unified API, which means you integrate once, and you’re able to integrate with all the video conferencing platforms (Zoom, Microsoft Teams, Google Meet, etc). Your users will have a more consistent experience, and you also save the engineering effort of building a new integration for each platform you need to work with.
-
Works on free Zoom plans
Recall can record the meeting even if your user is on a free Zoom account.
-
Your user can record the meeting, even if they are not the host
Recall can record the meeting even if your user is not the host of the meeting.
-
Speaker separated transcripts
Recall has access to data on when each person in the meeting is speaking. This means you’ll be able to diarize your transcript accurately and figure out what words were said by which person, no matter which transcription API or model you’re using.
-
No wait time
Recall gives you the complete transcription within seconds after the meeting is done.
-
Low latency, real-time support
Recall is an API for meeting bots, and meeting bots are a real-time and low-latency form of data capture because they connect directly to the meeting. You can expect minimal latency, from 200-500 ms while using Recall, and you’ll be able to do real-time analysis of the transcription.
-
Consistent user experience
Recall gives you the same user experience across all the major platforms of Zoom, Meet, Teams, and Webex. Therefore, if you go the Recall route, your users will have a consistent experience across all platforms: a bot joins the meeting. Also, once you integrate with Recall, your implementation automatically works across the other platforms, with no extra code required.
-
You can use third-party transcription APIs OR Zoom’s native transcription
The Recall bot API can scrape captions from the meeting platforms, so you can use Zoom’s native transcription without needing your user to enable any settings.
Recall is also integrated with major transcription API providers such as AWS Transcribe, Deepgram, Assembly, Rev, and more. So if you choose to use a third-party provider for more advanced features, such as custom vocabulary recognition, Recall has that support natively built in.
Both options are available with Recall, so you can pick whichever makes the most sense for your use case.
-
Custom vocabulary & multi-language support
If you do end up using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.
-
Fast build time
Out of all the options here, Recall is the fastest to build with. On average it takes a developer 72 hours to fully integrate Recall.
Cons
-
Additional cost
It costs money to use the Recall API, compared to options such as the Zoom Cloud Recording transcript, which is free for you to access. However, because the Recall platform is highly optimized, it is generally cheaper to use Recall than to run bots yourself.
-
Zoom will need to review your app
Zoom meeting bots join the Zoom meeting by running an instance of the Zoom SDK. Before your Zoom SDK credentials can be used in production, Zoom will need to review your meeting bot. This typically takes around 4 weeks. Although Recall can’t control Zoom’s review timelines, Recall can make the review process less stressful and guide you, as they have helped hundreds of customers through it.
Comparison chart
Method | Method 1: Download Transcripts from the Zoom Cloud Recording API | Method 2: Zoom Cloud Recording + Transcription API | Method 3: RTMP live streaming + Transcription API | Method 4: Desktop app to capture system audio + Transcription API | Method 5: Web app to capture microphone + Transcription API | Method 6: Build a Meeting Bot + Transcription API | Method 7: Use Recall.ai |
---|---|---|---|---|---|---|---|
Speaker separated transcripts | x | Speaker-separated transcripts are available if the user turns on the “create audio transcript” setting in Zoom. | x | x | |||
Works on any Zoom Plan | x | x | x | x | |||
Users can record, even if they are not the host | x | x | x | x | |||
Transcripts are available instantly after the meeting | x | x | x | x | x | ||
Transcripts are available in real-time | Real-time transcription is available but with high latency (10-30 seconds). | x | x | x | x | ||
Transcripts support multiple languages | x | x | x | x | x | x | |
Transcript vocabulary can be customized | x | x | x | x | x | x | |
Transcripts have per-word timestamps | x | x | x | x | x | x | |
Zoom app review not required | x | x | |||||
Fast to integrate with | x | x | x | ||||
No maintenance required | x | x | x | x | |||
Don’t need to pay a 3rd party for additional costs | x | You don’t need to pay for third-party transcription if you’re transcribing on-device. | |||||
Doesn’t slow down your user’s computer | x | x | x | x | x | x | |
Doesn’t require users to install a desktop app | x | x | x | x | x | x | |
Doesn’t require users to OAuth their Zoom account | x | x | x | x |
Still don’t know which one to go with?
If you’ve read all the options and are still unsure which one makes the most sense for you, we’re happy to help. We’ve seen hundreds of use cases and we’ll be 100% honest with you on which method makes the most sense for you - no BS.
Book a chat with an expert, or sign up for an account, and see you soon!