Summary and Setup
Workshop Overview
This workshop introduces you to foundational workflows in Amazon SageMaker, covering data setup, code repo setup, model training, and hyperparameter tuning within AWS’s managed environment. You’ll learn how to use SageMaker notebooks to control data pipelines, manage training and tuning jobs, and evaluate model performance effectively. We’ll also cover strategies to help you scale training and tuning efficiently, with guidance on choosing between CPUs and GPUs, as well as when to consider parallelized workflows (i.e., using multiple instances).
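To give a concrete preview of what this looks like in practice, below is a minimal sketch of launching a managed training job from a SageMaker notebook. The script name, S3 path, and instance choice are placeholder assumptions; we build up the real version step by step during the workshop.

```python
# Minimal sketch of launching a SageMaker training job (hypothetical
# script name and S3 path; the workshop walks through the real setup).
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = sagemaker.get_execution_role()  # IAM role attached to the notebook

estimator = SKLearn(
    entry_point="train.py",        # hypothetical training script
    role=role,
    instance_count=1,              # increase for parallelized workflows
    instance_type="ml.m5.large",   # CPU instance; GPU types also available
    framework_version="1.2-1",
)

# Point the job at training data that has been uploaded to S3
estimator.fit({"train": "s3://your-bucket/titanic/titanic_train.csv"})
```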
To keep costs manageable, this workshop provides tips for tracking and monitoring AWS expenses, so your experiments remain affordable. While AWS isn’t entirely free, it can be very cost-effective for common machine learning (ML) workflows used in research. For example, training roughly 100 small to medium-sized models (e.g., logistic regression, random forests, or lightweight deep learning models with a few million parameters) on a small dataset (under 10 GB) can cost under $20, making it accessible for many research projects.
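As one example of cost tracking, spend can be queried programmatically with the Cost Explorer API via boto3. The sketch below is a minimal example; the dates are placeholders, and note that Cost Explorer must be enabled on the account and charges a small fee per API request.

```python
# Minimal sketch: query monthly AWS spend with the Cost Explorer API
# (dates are placeholders; Cost Explorer must be enabled on the account,
# and each query incurs a small per-request charge).
import boto3

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-01-01", "End": "2025-01-31"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
)
for period in response["ResultsByTime"]:
    print(period["TimePeriod"], period["Total"]["UnblendedCost"]["Amount"])
```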
What This Workshop Does Not Cover (Yet)
Currently, this workshop does not include:
- AWS Lambda: Lambda lets you run small pieces of code without setting up or managing a server. You write a function, tell AWS when to run it (for example, when a file is uploaded or a request comes in), and it automatically runs and scales as needed. This is useful for simple tasks like cleaning data as it arrives, kicking off a training job, or running a quick analysis without having to keep a server running all the time. A minimal handler sketch follows this list.
- Bedrock: Amazon Bedrock can be used to build and scale generative AI applications by providing API access to a range of foundation models from AWS and third-party providers, without managing infrastructure. It’s designed to simplify integrating text, image, and other generative capabilities into your workflows using familiar AWS tools. Much of what Bedrock enables can also be done in SageMaker, but Bedrock trades flexibility for simplicity: you get faster access to models and lower setup overhead, but less control over training, fine-tuning, and the underlying infrastructure, which can matter for research workflows that need custom architectures, reproducibility, or integration with existing pipelines. A minimal invocation sketch also follows this list.
- Additional AWS services beyond the core SageMaker ML workflows.
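To make the Lambda programming model above concrete, here is a minimal, hypothetical handler (not part of the workshop materials), written as if it were triggered by file uploads to an S3 bucket:

```python
# Hypothetical Lambda handler: runs whenever a new file lands in an
# S3 bucket that has been configured to trigger this function.
import json

def lambda_handler(event, context):
    # For S3 triggers, the event lists the uploaded object(s)
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New upload: s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps("done")}
```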
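And to illustrate what “API access to foundation models” means in practice, here is a minimal, hypothetical Bedrock call via boto3. The model ID is just an example; access to it must be enabled in your account, and requests are billed per token.

```python
# Hypothetical Bedrock example: one-off text generation through the
# unified "converse" API (model ID is an example; model access must be
# enabled in your account, and requests are billed per token).
import boto3

bedrock = boto3.client("bedrock-runtime")
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize the Titanic dataset."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```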
If there’s a specific ML or AI workflow or AWS service you’d like to see included in this curriculum, please let us know! We’re happy to develop additional content to meet the needs of researchers, students, and ML/AI practitioners. Please post an issue on the lesson GitHub or contact endemann@wisc.edu with suggestions or requests.
Setup (Complete Before The Workshop)
Before attending this workshop, you’ll need to complete a few setup steps to ensure you can follow along smoothly. The main requirements are:
- GitHub Account – Create an account and be ready to fork a repository.
- AWS Access – Use a shared AWS account (if attending Machine Learning Marathon or Research Bazaar) or sign up for an AWS Free Tier account.
- Titanic Dataset – Download the required CSV files in advance.
- Workshop Repository – Fork the provided GitHub repository for use in AWS.
- Visit Glossary — Find and briefly review the workshop glossary.
- (Optional) AWS Skill Builder — For a broader overview of AWS, visit the Getting Started with the AWS Cloud Essentials course.
Details on each step are outlined below.
1. GitHub Account
You will need a GitHub account to access the code provided during this lesson. If you don’t already have a GitHub account, please sign up for GitHub to create a free account. Don’t worry if you’re a little rusty on using GitHub/git; we will only use a couple of git commands during the lesson, and the instructor will provide guidance on these steps.
2. AWS Account
There are two ways to get access to AWS for this lesson. Please wait for a pre-workshop email from the instructor to confirm which option to choose.
Option 1) Shared Account
If you are attending this lesson as part of the Machine Learning Marathon or Research Bazaar, the instructors will provide a shared AWS account for all attendees. You do not need to set up your own AWS account. What to expect:
- Before the workshop, you will receive an email invitation from the instructor with access details for the shared AWS account.
- During the lesson, you will log in using the credentials provided in the email.
- This setup ensures that all participants have the same environment and eliminates concerns about unexpected costs for attendees.
- These shared AWS credits should not be wasted, as we repurpose them for additional training events each year.
- Attendees are expected to stick to the lesson materials to ensure expensive pipelines (e.g., training/tuning LLMs) do not lead to high costs and wasted credits.
- Do not use any tools we do not explicitly cover without discussing with the instructors first.
Option 2) AWS Free Tier — Skip If Using Shared Account
If you are attending this lesson as part of the Machine Learning Marathon or Research Bazaar, you can skip this step. We will provide all attendees with a shared account. Otherwise, please follow these steps:
- Go to the AWS Free Tier page and click Create a Free Account.
- Complete the sign-up process. AWS offers a free tier with limited monthly usage. Some services, including SageMaker, may incur charges beyond free-tier limits, so be mindful of usage during the workshop. If you follow along with the materials, you can expect to incur around $10 in compute fees (e.g., from training and tuning several different models with GPU enabled at times).
Once your AWS account is set up, log in to the AWS Management Console to get started with SageMaker.
3. Download the Data
For this workshop, you will need the Titanic dataset, which can be used to train a classifier predicting survival.
Please download the following zip file (Right-click -> Save as): data.zip
Extract the zip folder contents (Right-click -> Extract all on Windows; double-click on Mac).
Save the two data files (train and test) to a location where you can easily access them, for example:
~/Downloads/data/titanic_train.csv
~/Downloads/data/titanic_test.csv
In the first episode, you will create an S3 bucket and upload this data to use with SageMaker.
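As a rough preview of that step, uploading the files with boto3 can look like the sketch below; the bucket name is a placeholder for the bucket you will create in the episode.

```python
# Rough preview of the first episode: upload the Titanic CSVs to S3
# ("my-titanic-bucket" is a placeholder for the bucket you will create).
import os
import boto3

s3 = boto3.client("s3")
local_dir = os.path.expanduser("~/Downloads/data")
for fname in ["titanic_train.csv", "titanic_test.csv"]:
    s3.upload_file(os.path.join(local_dir, fname), "my-titanic-bucket", fname)
```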
4. Get Access To Workshop Code (Fork GitHub Repo)
You will need a copy of our AWS_helpers repo on GitHub to explore how to manage your repo in AWS. This setup will allow you to follow along with the workshop and test out the Interacting with Repositories episode.
To do this:
- Go to the AWS_helpers GitHub repository.
- Click Fork (top right) to create your own copy of the repository under your GitHub account. You will only need the main branch. You can leave “Copy the main branch only” selected.
- Once forked, you don’t need to do anything else. We’ll clone this fork once we start working in the AWS Jupyter environment using…
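For reference, the clone step can look roughly like the sketch below; YOUR_USERNAME is a placeholder for your GitHub account, and there is nothing you need to run before the workshop.

```python
# Hypothetical preview of the clone step inside a SageMaker Jupyter
# notebook (YOUR_USERNAME is a placeholder for your GitHub account).
# In a notebook cell you can run the shell command directly:
#   !git clone https://github.com/YOUR_USERNAME/AWS_helpers.git
# or, equivalently, from Python:
import subprocess

subprocess.run(
    ["git", "clone", "https://github.com/YOUR_USERNAME/AWS_helpers.git"],
    check=True,
)
```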
5. Review the Workshop Glossary Page
When learning cloud tools for the first time, understanding new terminology is half the battle. We encourage learners to briefly review the Glossary page (also accessible from the top menu of each lesson page) before the workshop. You don’t need to memorize the terms—just a quick read-through will help familiarize you with key concepts. Once we start running our own AWS SageMaker experiments, these terms will start to make more sense in context. If you feel lost at any point during the workshop, please ask the instructor for assistance or refer back to the glossary.
6. (Optional) AWS Skill Builder — Getting Started with the AWS Cloud Essentials
Attendees who want a stronger foundational understanding of AWS before diving into SageMaker and other ML services are encouraged to complete this self-paced course, Getting Started with the AWS Cloud Essentials, as optional pre-work for this workshop. It’s designed for beginners and provides a broad overview of the AWS Cloud, including core services, global infrastructure, pricing basics, and security concepts.