Serverless deployment in Dagster Cloud#

This guide is applicable to Dagster Cloud.

Dagster Cloud Serverless is a fully managed version of Dagster Cloud, and is the easiest way to get started with Dagster. With Serverless, you can run your Dagster jobs without spinning up any infrastructure.


When to choose Serverless#

Serverless works best with workloads that primarily orchestrate other services or perform light computation. Most workloads fit into this category, especially those that orchestrate third-party SaaS products like cloud data warehouses and ETL tools.

If any of the following are applicable, you should select Hybrid deployment:

  • You require substantial computational resources. For example, training a large machine learning (ML) model in-process.
  • Your dataset is too large to fit in memory. For example, training a large machine learning (ML) model in-process on a terabyte of data.
  • You need to distribute computation across many nodes for a single run. Dagster Cloud runs currently execute on a single node with 4 CPUs.
  • You don't want to add Elementl as a data processor.

Limitations#

Serverless is currently in early access and is subject to the following limitations:

  • Maximum of 100 GB of bandwidth per day
  • Maximum of 4500 step-minutes per day
  • Runs receive 4 vCPU cores, 16 GB of RAM and 128 GB of ephemeral disk
  • Sensors receive 0.25 vCPU cores and 1 GB of RAM
  • All Serverless jobs run in the United States

Enterprise customers may request a quota increase by contacting Sales.


Getting started with Serverless#

With GitHub#

If you are a GitHub user, our GitHub integration is the fastest way to get started. It uses a GitHub app and GitHub Actions to set up a repo containing skeleton code and configuration consistent with Dagster Cloud's best practices with a single click.

When you create a new Dagster Cloud organization, you'll be prompted to choose Serverless or Hybrid deployment. Once activated, our GitHub integration will scaffold a new git repo for you with Serverless and Branch Deployments already configured. Pushing to the main branch will deploy to your prod Serverless deployment. Pull requests will spin up ephemeral branch deployments using the Serverless agent.

Without GitHub (GitLab, BitBucket, or local development)#

If you don't want to use our GitHub integration, we offer a powerful CLI that you can use in another CI environment or on your local laptop.

First, create a new project with the Dagster open-source CLI.

pip install dagster
dagster project from-example \
  --name my-dagster-project \
  --example assets_modern_data_stack

Once scaffolded, add dagster-cloud as a dependency in your setup.py file.

Next, install the dagster-cloud CLI and log in to your org. Note: The CLI requires a recent version of Python 3 and Docker.

pip install dagster-cloud
dagster-cloud configure

You can also configure the dagster-cloud tool noninteractively; see the CLI docs for more information.

Finally, deploy your project with Dagster Cloud Serverless:

dagster-cloud serverless deploy-python-executable \
  --location-name example \
  --package-name assets_modern_data_stack

Adding secrets#

Often you'll need to securely access secrets from your jobs. Dagster Cloud supports several methods for adding secrets - refer to the Dagster Cloud environment variables and secrets documentation for more info.

Adding dependencies#

Any dependencies specified in either requirements.txt or setup.py will be installed for you automatically by the Dagster Cloud Serverless infrastructure.


Customizing the runtime environment#

Dagster Cloud Serverless packages your code as PEX files and deploys them on Docker images. Using PEX files significantly reduces the time to deploy since it does not require building a new Docker image and provisioning a new container for every code change. Many apps will work fine with the default Dagster Cloud Serverless setup. However, some apps may need to make changes to the runtime environment, either to include data files, use a different base image, different Python version, or install some native dependencies. You can customize the runtime environment using various methods described below.

Including data files#

To add data files to your deployment, use the Data Files Support built into Python's setup.py. This requires adding a package_data or include_package_data keyword in the call to setup() in setup.py. For example, given this directory structure:

- setup.py
- my_dagster_project/
  - __init__.py
  - repository.py
  - data/
    - file1.txt
    - file2.csv

If you want to include the data folder, modify your setup.py to add the package_data line:

# setup.py
from setuptools import find_packages, setup

if __name__ == "__main__":
    setup(
        name="my_dagster_project",
        packages=find_packages(exclude=["my_dagster_project_tests"]),
        # Add the following line. Here "data/*" is relative to the my_dagster_project sub directory.
        package_data={"my_dagster_project": ["data/*"]},
        install_requires=[
            "dagster",
            ...
        ],
    )

Using a different Python version#

The default version of Python for Serverless deployments is Python 3.8. Versions 3.7, 3.9 and 3.10 are also supported. You can specify the version you want by updating your GitHub workflow or using the --python-version command line argument:

  • With GitHub: Change the python_version parameter for the build_deploy_python_executable job in your .github/workflows files. For example:

    - name: Build and deploy Python executable
      if: env.ENABLE_FAST_DEPLOYS == 'true'
      uses: dagster-io/dagster-cloud-action/actions/build_deploy_python_executable@pex-v0.1
      with:
        dagster_cloud_file: "$GITHUB_WORKSPACE/project-repo/dagster_cloud.yaml"
        build_output_dir: "$GITHUB_WORKSPACE/build"
        python_version: "3.9" # Change this value to the desired Python version
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    
  • With the CLI: Add the --python-version CLI argument to the deploy command to specify the registry path to the desired base image:

    dagster-cloud serverless deploy --location-name=my_location --python-version=3.9
    

Using a different base image or using native dependencies#

Dagster Cloud runs your code on a Docker image that we build as follows:

  1. The standard Python "slim" Docker image, such as python:3.8-slim is used as the base.
  2. The dagster-cloud[serverless] module installed in the image.

As far as possible, add all dependencies by including the corresponding native Python bindings in your setup.py. When that is not possible, you can build and upload a custom base image that will be used to run your Python code.

To build and upload the image, use the command line:

  1. Build your Docker image using docker build or your usual Docker toolchain. Ensure the dagster-cloud[serverless] dependency is included. You can do this by adding the following to your Dockerfile:

    RUN pip install "dagster-cloud[serverless]"
    
  2. Upload your Docker image to Dagster Cloud using the upload-base-image command. Note that this command prints out the tag used in Dagster Cloud to identify your image:

    $ dagster-cloud serverless upload-base-image local-image:tag
    
    ...
    To use the uploaded image run: dagster-cloud deploy-python-executable ... --base-image-tag=sha256_518ad2f92b078c63c60e89f0310f13f19d3a1c7ea9e1976d67d59fcb7040d0d6
    
  3. To use a Docker image you have published to Dagster Cloud, use the --base-image-tag tag printed out by the above command.

    • With GitHub: Set the SERVERLESS_BASE_IMAGE_TAG environment variable in your GitHub Actions configuration (usually at .github/workflows/deploy.yml):

      env:
        DAGSTER_CLOUD_URL: ...
        DAGSTER_CLOUD_API_TOKEN: ...
        SERVERLESS_BASE_IMAGE_TAG: "sha256_518ad2f92b078c63c60e89f0310f13f19d3a1c7ea9e1976d67d59fcb7040d0d6"
      
    • With the CLI: Add the --base-image-tag CLI argument to the deploy command:

      dagster-cloud serverless deploy-python-executable \
        --location-name example \
        --package-name assets_modern_data_stack \
        --base-image-tag sha256_518ad2f92b078c63c60e89f0310f13f19d3a1c7ea9e1976d67d59fcb7040d0d6
      

Disabling PEX-based deploys#

Prior to using PEX files, Dagster Cloud deployed code using Docker images. This feature is still available. To deploy using a Docker image instead of PEX:

  • With GitHub: Delete the ENABLE_FAST_DEPLOYS: 'true' line in your GitHub Actions configuration (usually at .github/workflows/deploy.yml):

    env:
      DAGSTER_CLOUD_URL: ...
      DAGSTER_CLOUD_API_TOKEN: ...
      # ENABLE_FAST_DEPLOYS: 'true' # disabled
    
  • With the CLI: Use the deploy command instead of the deploy-python-executable command:

    dagster-cloud serverless deploy \
      --location-name example \
      --package-name assets_modern_data_stack
    

The Docker image deployed can be customized using either lifecycle hooks or customizing the base image.

This method is the easiest to set up, and does not require setting up any additional infrastructure.

In the root of your repo, you can provide two optional shell scripts: dagster_cloud_pre_install.sh and dagster_cloud_post_install.sh. These will run before and after Python dependencies are installed. They are useful for installing any non-Python dependencies or otherwise configuring your environment.


Transitioning to Hybrid#

If your organization begins to hit the limitations of Serverless, you should transition to a Hybrid deployment. Hybrid deployments allow you to run an agent in your own infrastructure and give you substantially more flexibility and control over the Dagster environment.

To switch to Hybrid, navigate to Status > Agents in your Dagster Cloud account. On this page, you can disable the Serverless agent on and view instructions for enabling Hybrid.


Security and data protection#

Unlike Hybrid, Serverless Deployments on Dagster Cloud require direct access to your data, secrets and source code.

  • Dagster Cloud Serverless does not provide persistent storage. Ephemeral storage is deleted when a run concludes.
  • Secrets and source code are built into the image directly. Images are stored in a per-customer container registry with restricted access.
  • User code is securely sandboxed using modern container sandboxing techniques.
  • All production access is governed by industry-standard best practices which are regularly audited.

Run isolation#

Dagster Cloud Serverless offers two settings for run isolation: isolated and non-isolated. Non-isolated runs are for iterating quickly and trade off isolation for speed. Isolated runs are for production and compute heavy Assets/Jobs.

Isolated runs#

Isolated runs each take place in their own container with their own compute resources: 4 cpu cores and 16GB of RAM.

These runs may take up to 3 minutes to start while these resources are provisioned.

When launching runs manually, select Isolate run environment in the Launchpad to launch an isolated runs. Scheduled, sensor, and backfill runs are always isolated.

The isolated run toggle in Dagster Cloud

Note: if non-isolated runs aren't enabled (see the section below), the toggle won't appear and all runs will be isolated.

Non-isolated runs#

This can be enabled or disabled in deployment settings with

non_isolated_runs:
  enabled: True

Non-isolated runs provide a faster start time by using a standing, shared container for each code location.

They have fewer compute resources: 0.25 vCPU cores and 1GB of RAM. These resources are shared with other processes for a code location like sensors. As a result, it's recommended to use isolated runs for compute intensive jobs and asset materializations.

While launching runs from the Launchpad, leave Isolate run environment unchecked to launch a non-isolated run. Materializing assets from the UI also defaults to non-isolated.

The isolated run toggle in Dagster Cloud

By default only one non-isolated run will execute at once. While a run is in progress, the the Launchpad will swap to only launching isolated runs.

This limit can be configured in deployment settings. Take caution; The limit is in place to help wih avoiding crashes due to OOMs.

non_isolated_runs:
  enabled: True
  max_concurrent_non_isolated_runs: 1