Manuel edited this page Sep 20, 2025 · 2 revisions

API Documentation (not fully tested; please verify against your own server first)

The API is primarily webhook-driven. You submit a file for processing, and the server notifies your specified callback_url when the job is complete.

Base URL: https://your-server-address.com

Authentication

All API endpoints require authentication via a Bearer Token unless the server is running in LOCAL_ONLY_MODE.

Include an OIDC access token in the Authorization header of your requests.

Header format:

Authorization: Bearer <YOUR_OIDC_ACCESS_TOKEN>
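As a minimal sketch of the header format above, the following Python helper attaches the token to a request object without sending it. The helper name `authorized_request` is hypothetical, and the base URL is the placeholder used on this page.

```python
import urllib.request

# Placeholder base URL from this page; replace with your own server address.
BASE_URL = "https://your-server-address.com"


def authorized_request(path: str, access_token: str, method: str = "GET") -> urllib.request.Request:
    """Build (but do not send) a request carrying the Bearer token."""
    req = urllib.request.Request(BASE_URL + path, method=method)
    req.add_header("Authorization", f"Bearer {access_token}")
    return req
```

Passing the returned object to `urllib.request.urlopen` would perform the call. In LOCAL_ONLY_MODE the header should not be needed.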

API v1 Endpoints

This section details the primary endpoints for programmatic interaction.

Process a File

This is the main endpoint for submitting a file for any supported task. It accepts a file and task parameters, queues the job, and immediately returns a job identifier.

  • Endpoint: POST /api/v1/process

  • Description: Submits a file for asynchronous processing. The result is sent to a callback URL.

  • Request type: multipart/form-data

Form fields

| Name | Required | Type | Description |
|---|---|---|---|
| file | Yes | File | The file to be processed. |
| task_type | Yes | String | The type of task to perform. See the Task Types section below for valid options. |
| callback_url | Yes | String | The URL where the server will send a POST request with the job results upon completion. Must be on the server's allowed list. |
| model_size | Optional | String | For transcription tasks. Specifies the Whisper model size (e.g., tiny, base, small). Defaults to base. |
| model_name | Optional | String | For tts tasks. Specifies the TTS model/voice to use (e.g., piper/en_US-lessac-medium). See GET /api/v1/tts-voices. |
| output_format | Optional | String | For conversion tasks. Specifies the tool and target format (e.g., libreoffice_pdf, ghostscript_pdf_screen, sox_wav_48k_16b). |
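The form fields above could be sent from Python roughly as follows. This is a sketch using only the standard library: the helper names `encode_multipart` and `build_process_request` are hypothetical, and the upload is sent as `application/octet-stream`, which is an assumption about what the server accepts.

```python
import urllib.request
import uuid
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str], filename: str, file_bytes: bytes) -> Tuple[bytes, str]:
    """Encode text fields plus one 'file' part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",  # assumed generic content type
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def build_process_request(base_url: str, token: str, fields: Dict[str, str],
                          filename: str, file_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/process request."""
    body, content_type = encode_multipart(fields, filename, file_bytes)
    req = urllib.request.Request(f"{base_url.rstrip('/')}/api/v1/process", data=body, method="POST")
    req.add_header("Content-Type", content_type)
    req.add_header("Authorization", f"Bearer {token}")
    return req
```

Sending the request with `urllib.request.urlopen(req)` should, per this page, yield a 202 Accepted response whose JSON body contains `job_id` and `status`.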

Example Workflow (Transcription)

  1. Submit the Job: Your application sends a POST request to /api/v1/process with your audio file, task_type=transcription, and your callback_url.

  2. Receive Job ID: The server immediately responds with 202 Accepted and a JSON body: {"job_id": "...", "status": "pending"}. Your application should store this job_id.

  3. Wait for Webhook: The server processes the file in the background. Your application waits for an incoming POST request at the callback URL it provided.

  4. Handle Webhook: Once the job is done, the server sends the webhook payload to your callback URL:

    • If status is completed, your application can now use the download_url to fetch the resulting text file.

    • If status is failed, your application can inspect the error_message to diagnose the issue.

  5. (Optional) Check Status: At any point after submission, your application can make a GET request to /job/{job_id} to poll for the job's status.
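Steps 4 and 5 of the workflow might be handled on the client side as in the sketch below. Only the `status`, `download_url`, and `error_message` fields are named on this page; the function names and the returned strings are illustrative assumptions, and the rest of the payload is not documented here.

```python
import json


def job_status_url(base_url: str, job_id: str) -> str:
    """URL for the optional polling step (path as given in step 5)."""
    return f"{base_url.rstrip('/')}/job/{job_id}"


def handle_webhook(raw_body: bytes) -> str:
    """Decide what to do with a webhook body; returns a short action string."""
    payload = json.loads(raw_body)
    status = payload.get("status")
    if status == "completed":
        # The result can now be fetched from download_url.
        return "fetch " + payload["download_url"]
    if status == "failed":
        return "failed: " + payload.get("error_message", "no error_message in payload")
    # Any other status is not described on this page, so it is only reported.
    return f"unhandled status: {status!r}"
```

In a real receiver this function would be called from whatever web framework serves your callback URL, with the request body passed in unchanged.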


Notes

  • Ensure the callback_url you provide is on the server’s allowed list.

  • The exact available output_format values for conversion tasks are defined in the server's settings.yml.

  • Webhook requests may include an Authorization: Bearer <callback_bearer_token> header if configured.
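If a callback bearer token is configured, the receiving service can check it before trusting a webhook. The sketch below assumes you have the incoming `Authorization` header value as a string; the function name is hypothetical, and a constant-time comparison is used as a general precaution rather than something this page requires.

```python
import hmac
from typing import Optional


def webhook_is_authentic(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True only if the header is 'Bearer <expected_token>'."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```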



Settings File Documentation (settings.yml)

This file configures all the operational parameters of the File Processor server. It's written in YAML, a human-readable data format. The server reads this file on startup to determine its capabilities, limits, and behavior.

app_settings

This section controls global application settings like file limits and allowed types.

  • max_file_size_mb: The maximum size for any single uploaded file, in megabytes. In the example, it's set to 2000 MB (2 GB).
  • app_public_url: The publicly accessible URL of the server. This is crucial for generating correct download links in webhook notifications. It's commented out by default but should be set to your server's public address (e.g., https://files.example.com).
  • allowed_all_extensions: A comprehensive list of all file extensions that are permitted for upload across all tools. This acts as a primary security and validation filter. If an extension is not in this list, the file will be rejected.
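To illustrate how the two upload limits interact, here is a sketch of a client-side pre-check. It is not the server's code: whether extensions are listed with a leading dot and whether a megabyte means 1024 × 1024 bytes are both assumptions, so the helper normalizes the former and states the latter in a comment.

```python
from pathlib import Path
from typing import Iterable, List


def validate_upload(filename: str, size_bytes: int, max_file_size_mb: int,
                    allowed_extensions: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems = []
    # Normalize entries so both "pdf" and ".pdf" are accepted in the list.
    allowed = {"." + ext.lower().lstrip(".") for ext in allowed_extensions}
    suffix = Path(filename).suffix.lower()
    if suffix not in allowed:
        problems.append(f"extension {suffix or '(none)'} is not in allowed_all_extensions")
    # Assumption: 1 MB is treated as 1024 * 1024 bytes.
    if size_bytes > max_file_size_mb * 1024 * 1024:
        problems.append(f"file exceeds max_file_size_mb={max_file_size_mb}")
    return problems
```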

auth_settings

This section configures the OpenID Connect (OIDC) authentication, which is used to secure the web UI and the API.

Important: In the OAuth application you register for this app, add https://your-server-address.com/ and https://your-server-address.com/auth as allowed redirect URLs.

  • oidc_client_id: The Client ID for this application, as registered with your OIDC provider.
  • oidc_client_secret: The Client Secret for this application. Keep this value confidential.
  • oidc_server_metadata_url: The URL to your OIDC provider's discovery document. This URL typically ends in .well-known/openid-configuration and allows the server to automatically discover other required endpoints.
  • oidc_userinfo_endpoint: The direct URL to fetch user profile information.
  • oidc_end_session_endpoint: The URL to redirect users to when they log out, ensuring their session with the OIDC provider is also terminated.
  • admin_users: A list of email addresses for users who should have administrator privileges. Admins can view and modify the server's settings from the UI.
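The discovery document usually already lists the userinfo and end-session endpoints, so it can be used to look up the values for the two endpoint settings. The sketch below works on an already-parsed discovery document. The JSON keys `userinfo_endpoint` and `end_session_endpoint` come from the OpenID Connect specifications, while the function names and the example issuer are hypothetical.

```python
from typing import Any, Dict, Optional


def discovery_url(issuer: str) -> str:
    """Conventional location of an issuer's discovery document."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def pick_endpoints(metadata: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Map discovery-document keys to the settings.yml keys described above."""
    return {
        "oidc_userinfo_endpoint": metadata.get("userinfo_endpoint"),
        # Not every provider publishes an end-session endpoint; None means absent.
        "oidc_end_session_endpoint": metadata.get("end_session_endpoint"),
    }
```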

webhook_settings

This section controls the behavior of the API and its webhook notifications.

  • enabled: A boolean (True or False) that toggles the entire programmatic API (/api/v1/...). If False, all API calls will be rejected.
  • allow_chunked_api_uploads: A boolean (True or False) to enable or disable the chunked upload endpoints for the API. This is an advanced feature and can be left False if not needed.
  • allowed_callback_urls: A list of URL prefixes that are allowed to be used as callback URLs in API requests. This is a security measure to prevent the server from sending notifications to arbitrary, potentially malicious, endpoints.
  • callback_bearer_token: An optional secret token that the server will include in the Authorization: Bearer <TOKEN> header of every webhook it sends. This allows your receiving service to verify that the webhook is genuinely from this server.
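The prefix rule for allowed_callback_urls can be pictured with the sketch below. It assumes plain string-prefix matching, which this page implies but does not spell out, and the function name is hypothetical.

```python
from typing import Iterable


def callback_url_allowed(callback_url: str, allowed_prefixes: Iterable[str]) -> bool:
    """True if the callback URL starts with any configured prefix."""
    return any(callback_url.startswith(prefix) for prefix in allowed_prefixes)


# A prefix without a trailing slash also matches look-alike hosts,
# so "https://hooks.example.com/" is safer than "https://hooks.example.com".
strict = ["https://hooks.example.com/"]
loose = ["https://hooks.example.com"]
lookalike = "https://hooks.example.com.evil.test/x"
```

Under this assumption, `callback_url_allowed(lookalike, loose)` is true while `callback_url_allowed(lookalike, strict)` is false, which is a reason to end each configured prefix with a slash or a full path.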

transcription_settings

Configures the audio transcription feature, which uses the Whisper model.

  • whisper:
    • compute_type: Specifies the precision for model computation (e.g., int8, float16). int8 is generally faster and uses less memory at a small cost to accuracy.
    • allowed_models: A list of Whisper model sizes that users are allowed to choose from. The available models in the example are tiny, base, small, medium, large-v3, and distil-large-v2.
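A request's model_size can be checked against allowed_models before submitting a job. The sketch below mirrors the documented default of base; the function name and the choice to raise `ValueError` are assumptions, not the server's behavior.

```python
from typing import List, Optional


def resolve_model_size(requested: Optional[str], allowed_models: List[str],
                       default: str = "base") -> str:
    """Pick the model to use, falling back to the default when none is requested."""
    choice = requested or default
    if choice not in allowed_models:
        raise ValueError(f"model {choice!r} is not in allowed_models: {allowed_models}")
    return choice
```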

ocr_settings

Configures the Optical Character Recognition (OCR) feature for PDF files, which uses OCRmyPDF.

  • ocrmypdf:
    • deskew: If True, the tool will attempt to straighten crooked pages before performing OCR.
    • clean: If True, the tool will clean up noise and specks from the pages.
    • optimize: Sets the optimization level (0-3). 1 provides a good balance of size reduction and quality.
    • force_ocr: If True, OCR will be performed even if the PDF already contains text.
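These four options correspond to flags of the same names on the ocrmypdf command line. The sketch below shows one way the settings could be turned into an argument list. How the server actually invokes OCRmyPDF is not described on this page, so the function is purely illustrative.

```python
from typing import Any, Dict, List


def ocrmypdf_args(settings: Dict[str, Any], input_pdf: str, output_pdf: str) -> List[str]:
    """Translate the ocrmypdf settings block into command-line arguments."""
    args = ["ocrmypdf"]
    if settings.get("deskew"):
        args.append("--deskew")
    if settings.get("clean"):
        args.append("--clean")
    if "optimize" in settings:
        level = int(settings["optimize"])
        if not 0 <= level <= 3:
            raise ValueError("optimize must be between 0 and 3")
        args += ["--optimize", str(level)]
    if settings.get("force_ocr"):
        args.append("--force-ocr")
    return args + [input_pdf, output_pdf]
```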

tts_settings

Configures the Text-to-Speech engines.

  • piper: Settings for the Piper TTS engine.
    • model_dir: The local directory where Piper voice models are stored.
    • use_cuda: If True, a compatible NVIDIA GPU will be used for synthesis, which is typically much faster than running on the CPU. Requires the correct drivers and libraries to be installed.
    • synthesis_config: Advanced parameters to control the voice generation.
      • length_scale: Controls the speed of the speech (>1 is slower, <1 is faster).
      • noise_scale / noise_w: Control the variability and expressiveness of the voice.
  • kokoro: Settings for the Kokoro TTS engine.
    • model_dir: The directory where Kokoro's model files are stored.
    • command_template: The shell command used to run Kokoro TTS. It uses placeholders like {input} and {output} that the server will replace with the correct file paths.

conversion_tools

This is the largest section and defines all the file conversion capabilities of the server. Each top-level key (e.g., libreoffice, ffmpeg) represents a command-line tool that the server can use.

Each tool has the following structure:

  • name: The user-friendly name displayed in the UI (e.g., "LibreOffice").
  • command_template: The actual shell command to execute. The server uses placeholders like {input}, {output}, {output_ext}, etc., which are replaced with the appropriate values for each job.
  • timeout: The maximum number of seconds a job is allowed to run before being terminated.
  • formats: A dictionary where each key is a short identifier for an output format (e.g., pdf, mp4_hevc) and the value is the user-friendly description (e.g., "PDF Document", "MP4 Video (H.265/AAC)"). These are the options presented to the user in the UI.
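The sketch below shows how a tool entry with this structure might be used. It makes two assumptions: that an API output_format such as sox_wav_48k_16b is the tool key joined to a formats key with an underscore (inferred from the examples on this page), and that placeholders are filled by simple string substitution. The sox template and both function names are hypothetical.

```python
import shlex
from typing import Any, Dict, List, Tuple


def split_output_format(output_format: str, tools: Dict[str, Dict[str, Any]]) -> Tuple[str, str]:
    """Split e.g. 'sox_wav_48k_16b' into ('sox', 'wav_48k_16b') using the configured tools."""
    for tool_key, tool in tools.items():
        prefix = tool_key + "_"
        if output_format.startswith(prefix) and output_format[len(prefix):] in tool["formats"]:
            return tool_key, output_format[len(prefix):]
    raise ValueError(f"unknown output_format: {output_format!r}")


def render_command(template: str, **values: str) -> List[str]:
    """Fill placeholders, quoting each value so spaces in paths stay in one argument."""
    quoted = {name: shlex.quote(value) for name, value in values.items()}
    return shlex.split(template.format(**quoted))


# Hypothetical tool entry following the structure described above.
tools = {
    "sox": {
        "name": "SoX",
        "command_template": "sox {input} -r 48000 -b 16 {output}",
        "timeout": 300,
        "formats": {"wav_48k_16b": "WAV (48 kHz, 16-bit)"},
    }
}
```

The resulting argument list could then be run with `subprocess.run(argv, timeout=tools["sox"]["timeout"])`, which raises `subprocess.TimeoutExpired` when the limit is exceeded.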