Technology stack

Please read the introduction first, since it also explains the general architecture of the project: Welcome to project W!

The Backend

The backend is written in Python and uses the Flask framework to build the REST API. For more information about the API, refer to API.

All information about users, jobs and runners is stored in a SQLite database, which we manage with the SQLAlchemy library. The backend also keeps track of which runners are currently online and maintains a queue of all pending jobs (using an addressable priority queue). However, this information is only held in memory at runtime and is not stored in the database.
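To illustrate what "addressable" means here, the following is a minimal sketch of such a queue built on Python's heapq module. It is not the backend's actual implementation: the class name, the method names and the convention that a lower number means a higher priority are assumptions made for this example. The point is that a pending job can be looked up and cancelled by its id without scanning the whole heap.

```python
import heapq
import itertools


class AddressablePriorityQueue:
    """Min-heap of jobs whose entries can also be found and removed by job id."""

    _REMOVED = object()  # sentinel that marks a cancelled entry inside the heap

    def __init__(self):
        self._heap = []                    # entries: [priority, insertion_count, job_id]
        self._entries = {}                 # job_id -> entry (the "addressable" part)
        self._counter = itertools.count()  # tie-breaker: FIFO order for equal priorities

    def __len__(self):
        return len(self._entries)

    def __contains__(self, job_id):
        return job_id in self._entries

    def push(self, job_id, priority):
        """Add a job; pushing a known job id again replaces its old priority."""
        if job_id in self._entries:
            self.remove(job_id)
        entry = [priority, next(self._counter), job_id]
        self._entries[job_id] = entry
        heapq.heappush(self._heap, entry)

    def remove(self, job_id):
        """Cancel a pending job. Raises KeyError if the job is not queued."""
        entry = self._entries.pop(job_id)
        entry[2] = self._REMOVED  # lazily deleted: skipped when it reaches the top

    def pop(self):
        """Return the job id with the lowest priority number."""
        while self._heap:
            _, _, job_id = heapq.heappop(self._heap)
            if job_id is not self._REMOVED:
                del self._entries[job_id]
                return job_id
        raise KeyError("pop from an empty queue")


queue = AddressablePriorityQueue()
queue.push(job_id=1, priority=5)
queue.push(job_id=2, priority=1)
queue.push(job_id=3, priority=5)
queue.remove(1)                  # e.g. the user aborted job 1
print(queue.pop(), queue.pop())  # prints "2 3"
```

Because the queue lives only in memory, a structure like this would have to be rebuilt (for example from the jobs stored in the database) after a backend restart.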

User passwords are hashed using Argon2, the hasher recommended by OWASP. After the initial authentication (using their password), users get a JWT which is used for authentication in subsequent API calls.
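To make the token part more concrete, here is a rough sketch of what issuing and checking an HS256-signed JWT involves, written with the standard library only. The backend presumably relies on a JWT library rather than hand-written code like this; the function names, the claim names (sub, iat, exp) and the one-hour lifetime are assumptions for the example. The Argon2 password check that precedes token issuance is not shown.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(user_id, secret: bytes, lifetime_s=3600, now=None) -> str:
    """Build header.payload.signature after the password check has succeeded."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": str(user_id), "iat": now, "exp": now + lifetime_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes, now=None) -> dict:
    """Return the payload if signature and expiry are valid, else raise ValueError."""
    now = int(time.time()) if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison; the payload is only parsed after this check.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("invalid signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] <= now:
        raise ValueError("token expired")
    return payload
```

A client would send the token back with every API call, and the backend would run something like verify_token before handling the request.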

For reading values from the config file, the backend uses the libraries pyaml-env and jsonschema. The former loads the config values into memory, taking environment variables into account while doing so, and the latter then validates those values against a JSON schema. If the validation fails, the backend will exit during startup and provide the administrator with a (hopefully) helpful error message. This way we can ensure that once the user gets to use the backend, it will not crash or behave incorrectly due to an invalid config file.
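The two-step idea (resolve environment variables first, validate second, refuse to start on errors) can be sketched as follows. Note that this example deliberately does not use pyaml-env or jsonschema: to stay dependency-free it swaps in a small regex-based substitution and a hand-written rule table. The placeholder syntax, the RULES table and the option names url and port are all hypothetical and only serve the illustration.

```python
import os
import re
import sys

# Sketch-only placeholder syntax: ${NAME} or ${NAME:default}.
_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::([^}]*))?\}")

# Hypothetical rules standing in for a JSON schema: key -> (expected type, required?)
RULES = {"url": (str, True), "port": (int, False)}


def resolve_env(value, environ):
    """Recursively replace placeholders in all strings of a parsed config."""
    if isinstance(value, dict):
        return {key: resolve_env(item, environ) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, environ) for item in value]
    if not isinstance(value, str):
        return value

    def substitute(match):
        name, default = match.groups()
        if name in environ:
            return environ[name]
        if default is None:
            raise ValueError(f"environment variable {name} is not set")
        return default

    return _PLACEHOLDER.sub(substitute, value)


def validation_errors(config, rules=RULES):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, (expected_type, required) in rules.items():
        if key not in config:
            if required:
                errors.append(f"'{key}' is required but missing")
        elif not isinstance(config[key], expected_type):
            errors.append(f"'{key}' must be of type {expected_type.__name__}")
    errors.extend(
        f"'{key}' is not a known option" for key in sorted(config.keys() - rules.keys())
    )
    return errors


def load_or_exit(raw_config, environ=os.environ):
    """Resolve and validate the config, or stop the program during startup."""
    try:
        config = resolve_env(raw_config, environ)
    except ValueError as exc:
        sys.exit(f"Invalid config file: {exc}")
    errors = validation_errors(config)
    if errors:
        sys.exit("Invalid config file:\n" + "\n".join(f"  - {error}" for error in errors))
    return config
```

Collecting all problems before exiting, instead of stopping at the first one, lets the administrator fix the config file in a single pass.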

The Runners

The runner is also written in Python and does the actual transcription using OpenAI's open-source Python whisper package.

One runner cannot do more than one job at a time. If you want to increase the throughput through parallelization, just add more runners, since a backend can handle as many runners as you want. However, please note that the backend currently has no way of knowing how powerful a runner is, so it will treat them all the same. This means that if you have one runner that is really powerful and one that is not, the backend might still assign a long and demanding job to the weaker runner. Ideally, all runners should have more or less the same hardware (e.g. one runner per GPU on a GPU server).

The communication between runner and backend always goes from the runner to the backend, never the other way around (i.e. the runner always initiates the communication). The backend is the HTTP server, while the runner is the HTTP client. The runner uses the asyncio and aiohttp libraries for this. This design has two major advantages:

The runner doesn’t need a publicly reachable IP address or any special firewall settings (it can run behind a NAT and a company firewall if you want). It just needs to be able to reach the backend.
The runner doesn’t need a certificate or key pair for encryption. As long as the backend has an SSL certificate, the communication between backend and runner is automatically encrypted via HTTPS.

Each runner periodically sends a heartbeat to the backend (currently every 15 seconds). If the backend has assigned a job to this runner, it notifies the runner through the heartbeat's response. The runner then downloads the job from the backend and processes it. During processing, it reports the current progress to the backend in its heartbeats. After finishing, it uploads the transcript to the backend.
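The following sketch simulates this heartbeat cycle with asyncio. It is only an illustration: FakeBackend is an in-memory stand-in for the HTTP server (the real runner would issue aiohttp requests instead), and every method name, the response field job_assigned and the max_idle_heartbeats parameter (which exists only so the example terminates) are assumptions and not the project's actual API.

```python
import asyncio


class FakeBackend:
    """In-memory stand-in for the backend; each method models one HTTP request."""

    def __init__(self, pending_jobs):
        self.pending_jobs = list(pending_jobs)  # [(job_id, audio_name), ...]
        self.assigned = {}                      # runner_id -> (job_id, audio_name)
        self.progress = {}                      # job_id -> last reported progress
        self.transcripts = {}                   # job_id -> finished transcript

    async def heartbeat(self, runner_id, progress=None):
        if runner_id in self.assigned:
            if progress is not None:
                self.progress[self.assigned[runner_id][0]] = progress
            return {"job_assigned": True}
        if self.pending_jobs:
            self.assigned[runner_id] = self.pending_jobs.pop(0)
            return {"job_assigned": True}
        return {"job_assigned": False}

    async def download_job(self, runner_id):
        return self.assigned[runner_id][1]

    async def upload_transcript(self, runner_id, transcript):
        job_id, _ = self.assigned.pop(runner_id)
        self.transcripts[job_id] = transcript


async def transcribe(audio, state):
    """Stand-in for the real transcription; it only advances a progress value."""
    for step in range(1, 5):
        await asyncio.sleep(0.01)
        state["progress"] = step / 4
    return f"transcript of {audio}"


async def run_runner(backend, runner_id, interval=15.0, max_idle_heartbeats=3):
    idle = 0
    while idle < max_idle_heartbeats:  # a real runner would loop forever
        response = await backend.heartbeat(runner_id)
        if not response["job_assigned"]:
            idle += 1
            await asyncio.sleep(interval)
            continue
        idle = 0
        audio = await backend.download_job(runner_id)
        state = {"progress": 0.0}
        work = asyncio.create_task(transcribe(audio, state))
        while not work.done():
            # Keep sending heartbeats while the job is being processed.
            await asyncio.sleep(interval)
            await backend.heartbeat(runner_id, progress=state["progress"])
        await backend.upload_transcript(runner_id, work.result())


demo_backend = FakeBackend([(1, "interview.mp3")])
asyncio.run(run_runner(demo_backend, "runner-1", interval=0.005))
print(demo_backend.transcripts)  # prints "{1: 'transcript of interview.mp3'}"
```

Note that every exchange is started by the runner; the backend only ever answers requests, which is what allows the runner to sit behind a NAT or firewall.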

The runner authenticates to the backend using runner tokens. You can create them using the /api/runners/create API route of the backend. Refer to Get a new runner token for exactly how to do that. Each runner has a unique token, which the backend uses to identify the runner, for example to send it the correct job data.
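As a rough sketch of how a unique token can identify a runner, consider the registry below. Everything in it is hypothetical: the class and method names, the use of a Bearer authorization header, and the choice to store only a SHA-256 digest of each token are assumptions for this example and are not taken from the project's code.

```python
import hashlib
import secrets


class RunnerRegistry:
    """Maps runner tokens to runner ids (hypothetical sketch)."""

    def __init__(self):
        self._runner_id_by_digest = {}
        self._next_id = 1

    @staticmethod
    def _digest(token):
        # Assumption: only a digest is kept, so a leaked table reveals no usable tokens.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def create_runner(self):
        """Register a new runner and return (runner_id, token)."""
        token = secrets.token_urlsafe(32)  # random and unique per runner
        runner_id = self._next_id
        self._next_id += 1
        self._runner_id_by_digest[self._digest(token)] = runner_id
        return runner_id, token

    def authenticate(self, authorization_header):
        """Return the runner id for 'Bearer <token>', or None if it is unknown."""
        scheme, _, token = authorization_header.partition(" ")
        if scheme != "Bearer" or not token:
            return None
        return self._runner_id_by_digest.get(self._digest(token))
```

With a lookup like this, the backend can tell from the token alone which runner sent a request, and therefore which job data belongs to it.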

The runner uses ffmpeg to decode the audio data from the provided files into a format that whisper understands. Whisper can then transcribe that audio using AI models of different sizes (e.g. tiny, medium, large, …). It can auto-detect the language, but you can also explicitly tell it which language the audio is in.
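The decoding step can be sketched like this: ffmpeg converts any supported input file into raw 16 kHz mono 16-bit PCM, which is then scaled to floating-point samples. This mirrors the general approach of whisper's own audio loader, but the function names here are made up for the example, and it uses the standard library's array module where a real implementation would more likely use NumPy. Running decode_audio requires ffmpeg to be installed.

```python
import array
import subprocess
import sys

SAMPLE_RATE = 16_000  # whisper works on 16 kHz mono audio


def build_ffmpeg_command(path, sample_rate=SAMPLE_RATE):
    """Command that writes raw signed 16-bit little-endian mono PCM to stdout."""
    return [
        "ffmpeg", "-nostdin",
        "-i", path,
        "-f", "s16le",           # raw output, no container
        "-ac", "1",              # downmix to one channel
        "-acodec", "pcm_s16le",
        "-ar", str(sample_rate),
        "-",                     # write to stdout
    ]


def pcm_to_floats(raw):
    """Convert 16-bit little-endian PCM bytes to floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()
    return [sample / 32768.0 for sample in samples]


def decode_audio(path):
    """Decode an audio or video file into a list of float samples."""
    result = subprocess.run(build_ffmpeg_command(path), capture_output=True, check=True)
    return pcm_to_floats(result.stdout)
```

Because ffmpeg does the decoding, the runner can accept most common audio and video formats without handling each of them itself.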

For managing its config file the runner uses the same system as the backend (pyaml-env + jsonschema).

The Frontend

The frontend is a SPA written in the JavaScript framework Svelte. I chose Svelte because it is easy to learn and use, and its code is nice and compact. Svelte is also compiled into plain JavaScript, meaning that we do not need to ship a runtime environment to the user, which decreases the package size a lot. I opted to use TypeScript to get some type safety and to quickly identify potential issues before even running the code.

The development environment we use is Vite. The repository structure and a lot of the default configuration were generated by it. It also provides access to the Svelte compiler and a development server with hot module replacement (HMR). Because of this, the development server can automatically reload the component you just edited without you having to restart it or even refresh the webpage in the browser (the component just gets replaced in place).

As a router I chose svelte-spa-router. Svelte has its own router as part of SvelteKit, but I didn’t want to use it since it would have been overkill for this project. SvelteKit does a lot more than just routing (e.g. server-side rendering) and, most importantly, runs on the server. This would have meant a lot of additional dependencies for the server that serves the frontend (most notably Node.js), which would have been unnecessarily complicated. We already have our Python backend, so there is no need for us to also run JavaScript code on the server. svelte-spa-router, on the other hand, runs exclusively on the client. It uses hash-based routing, so the server is not queried for every route (which means no page reloads during navigation). This way we have a strict client-server model between our frontend running in the browser and our Python backend API.

I made heavy use of the flowbite-svelte UI component library (as you can see in the result, since I didn’t bother to change the color scheme from the default). Most if not all components of the frontend are from this library. I also used flowbite-svelte-icons for the icons. This also means that the CSS framework tailwindcss is a big part of the project, since flowbite makes heavy use of it. It makes writing CSS a lot easier and more convenient through the pre-made CSS classes it provides.

All the dependencies of the project are managed using the pnpm package manager. It was the recommended way to use flowbite-svelte and offers some nice benefits over npm (like being faster and more efficient). The package versions we used are locked in the pnpm-lock.yaml file in the root of the repository. I strongly recommend using pnpm and this lock file, since otherwise you might not get the same versions of all the dependencies we used, which might lead to a different result than intended or even to code that doesn't compile (flowbite-svelte in particular is under constant development).