Troubleshooting
Common issues and their solutions.
Installation Issues
UV Command Not Found
Problem: After installing UV, the command is not recognized.
Solution:
- Restart your terminal
- Add cargo bin directory to PATH:
- Make it permanent by adding to your shell profile:
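The steps above can be sketched as follows (install locations are the usual UV defaults — `~/.cargo/bin` for older installers, `~/.local/bin` for newer ones — so verify where your installer put the binary):

```shell
# Add UV's install directory to PATH for the current session
export PATH="$HOME/.cargo/bin:$HOME/.local/bin:$PATH"
# To make it permanent, append the same export line to ~/.bashrc or ~/.zshrc.
```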
Python Version Mismatch
Problem: Wrong Python version installed.
Solution:
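A sketch of the fix, assuming the project targets Python 3.12 (adjust the version as needed; `uv python pin` writes a `.python-version` file):

```shell
# See which interpreter is currently active
python3 --version
# Pin the project to the supported version and re-sync:
#   uv python install 3.12
#   uv python pin 3.12
#   uv sync
```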
Dependency Resolution Errors
Problem: UV can't resolve dependencies.
Solution:
# Clear UV cache
uv cache clean
# Reinstall dependencies
uv sync --reinstall
# If still failing, check pyproject.toml for conflicts
Environment Configuration
Environment Variables Not Set
Problem: nnU-Net can't find data directories.
Solution:
- Ensure the .env file exists
- Load the environment variables (source .env)
- Verify they are set (echo $nnUNet_raw)
- If they are empty, check that .env uses valid export syntax
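A minimal .env using the three directories nnU-Net expects might look like this (the variable names are the standard nnU-Net ones; the paths are placeholders and must be absolute):

```shell
# .env - loaded with `source .env`, so use plain shell export syntax
export nnUNet_raw="/absolute/path/to/data/nnUNet_raw"
export nnUNet_preprocessed="/absolute/path/to/data/nnUNet_preprocessed"
export nnUNet_results="/absolute/path/to/models/nnUNet_results"
```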
Kaggle API Credentials Invalid
Problem: Can't download data from Kaggle.
Solution:
- Verify the credentials in .env
- Check that your Kaggle API key is current:
- Go to kaggle.com/settings
- Create a new token
- Update .env with the new credentials
- Ensure there are no extra spaces or quotes around the values
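For reference, the Kaggle client reads these two environment variables (values below are placeholders, taken from the kaggle.json token you download):

```shell
# .env - no spaces around '=' and no stray quotes inside the values
export KAGGLE_USERNAME="your-kaggle-username"
export KAGGLE_KEY="your-api-key-from-kaggle.json"
```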
Weights & Biases Authentication
Problem: W&B login fails.
Solution:
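One way to fix this, assuming your API key lives in .env: grab a fresh key from wandb.ai/authorize, export it, and log in again (sketch):

```shell
# Export a fresh key (placeholder value) so the CLI can pick it up
export WANDB_API_KEY="your-wandb-api-key"
# Then re-run the login:
#   uv run wandb login            # uses WANDB_API_KEY if set
#   uv run wandb login --relogin  # force re-authentication
```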
Data Issues
Data Download Fails
Problem: Download from Kaggle times out or fails.
Solution:
# Try again with verbose output
uv run python -c "import kaggle; kaggle.api.dataset_download_files(
'santurini/semantic-segmentation-drone-dataset',
path='data/raw',
unzip=True
)"
# Check internet connection
ping kaggle.com
# Check disk space
df -h
Data Export Fails
Problem: Converting to nnU-Net format fails.
Solution:
- Verify data was downloaded:
- If missing, re-download:
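As a quick check (the data/raw path matches the download snippet in the previous section; the invoke task name is an assumption about this project's tasks):

```shell
# Verify the raw data is present; prints a hint if it is not
ls data/raw 2>/dev/null || echo "data/raw is empty or missing - re-download"
# Re-download, e.g.:
#   uv run invoke download-data   # task name is hypothetical
```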
Preprocessing Fails
Problem: nnU-Net preprocessing crashes.
Solution:
- Check data integrity:
- Verify environment variables are absolute paths:
- Re-export data:
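A small helper for the second point — nnU-Net refuses relative paths, and the variable names below are the standard nnU-Net ones:

```shell
# Flag any nnU-Net directory variable that is unset or not an absolute path
check_abs() {
    case "$2" in
        /*) echo "$1: OK ($2)" ;;
        *)  echo "$1: NOT absolute or unset ('$2')" ;;
    esac
}
check_abs nnUNet_raw "$nnUNet_raw"
check_abs nnUNet_preprocessed "$nnUNet_preprocessed"
check_abs nnUNet_results "$nnUNet_results"
```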
Python 3.13 Compatibility / Distutils Error
Problem: Preprocessing can fail on Python 3.13 with errors related to distutils (e.g., ModuleNotFoundError: No module named 'distutils'). This is due to distutils being removed in recent Python versions. Note that this error may not occur for everyone, as it depends on your specific environment and how dependencies are resolved.
Solution: Downgrade your environment to Python 3.12, which is the fully supported version for this project's dependencies.
# Force UV to use Python 3.12
uv python install 3.12
uv venv --python 3.12
uv sync
# Then try preprocessing again
uv run invoke preprocess
Training Issues
CUDA Out of Memory
Problem: GPU runs out of memory during training.
Solution:
- Reduce the batch size (nnU-Net chooses it automatically, but you can force a smaller one)
- Use a smaller model configuration (e.g. 2d instead of 3d_fullres)
- Close other applications that are using the GPU
- Monitor GPU usage
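For the last point, a check that degrades gracefully (the --query-gpu flags are standard nvidia-smi options):

```shell
# Show GPU memory usage; falls back to a message on machines without an NVIDIA driver
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv
else
    echo "nvidia-smi not found - no NVIDIA driver on this machine"
fi
```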
Frozen Training / Background Workers Stopped
Problem: Training hangs indefinitely or fails with an error like DataLoader worker (pid X) is killed or Background workers stopped. No explicit "Out of Memory" error is shown. This is often due to shared memory (shm) exhaustion or system resource leaks during long training runs.
Solution:
- Clear System Cache: A system restart often clears the leaked resources.
- Reduce Workers: If it happens frequently, try reducing the number of data loader workers in your training configuration (though nnU-Net handles this, system-level pressure can still trigger it).
- Check Shared Memory: If running in Docker, ensure you have increased --shm-size (see the Docker Out of Memory section below).
- Monitor RAM: Ensure your system is not swapping heavily.
Docker Out of Memory
Problem: Training crashes inside Docker with a "Killed" message or generic memory errors, even if your GPU has enough memory. This usually happens because the Docker VM itself (on macOS or Windows) has allocated too little RAM.
Solution: Increase the memory allocated to Docker Desktop.
- Open Docker Desktop Settings (gear icon).
- Go to Resources -> Advanced.
- Increase Memory (we recommend at least 16GB for stable training).
- Increase Swap to 4GB or more.
- Click Apply & Restart.
For Linux / CLI Users:
On Linux, Docker usually has access to all system memory unless limited. If you encounter issues, ensure you are providing enough shared memory (essential for multi-processing in nnU-Net) and avoiding hard limits.
# Add these flags to your 'docker run' command if needed:
#   --shm-size=8gb    -> increase shared memory (essential for nnU-Net's workers)
#   --memory=16g      -> hard limit, useful if the host is unstable
#   --memory-swap=20g -> total limit including swap
docker run --shm-size=8gb --memory=16g --memory-swap=20g ...
Device Not Found
Problem: Can't use GPU for training.
Solution:
- Check CUDA availability
- If it reports False, train on the CPU
- On a Mac with Apple silicon, use the MPS backend
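The checks above can be run like this (falls back cleanly when torch is missing from the current interpreter; inside the project, prefix with `uv run`):

```shell
python3 - <<'EOF'
# Report the best available device for PyTorch: cuda, mps, or cpu
try:
    import torch
    if torch.cuda.is_available():
        print("cuda")
    elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        print("mps")
    else:
        print("cpu")
except ImportError:
    print("torch is not installed in this interpreter - run via 'uv run python'")
EOF
```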
Training Extremely Slow
Problem: Training takes forever.
Solution:
- Use GPU:
- Check you have preprocessed data:
- Monitor system resources:
API Issues
Port Already in Use
Problem: Can't start API server.
Solution:
# Find process using port 8000
lsof -i :8000
# Kill the process
kill -9 <PID>
# Restart it
uv run invoke app
Module Not Found Error
Problem: API or scripts refer to the project package but can't find it.
Solution:
- Reinstall dependencies:
- If it persists, install in editable mode manually:
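The two steps, sketched (`drone_segmentation` is a placeholder for this project's actual package name):

```shell
# Reinstall dependencies, then force an editable install if needed:
#   uv sync
#   uv pip install -e .
# Confirm the package is importable (prints None if it is not found):
python3 -c "import importlib.util as u; print(u.find_spec('drone_segmentation'))"
```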
Model Not Found
Problem: API can't find model checkpoint.
Solution:
- Download pre-trained model:
- Or train your own:
- Verify checkpoint exists:
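For the last step: nnU-Net saves final weights as checkpoint_final.pth, so a search from the project root works as a quick verification:

```shell
# List any trained checkpoints; prints nothing if none exist yet
find . -name "checkpoint_final.pth" 2>/dev/null
```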
Prediction Fails
Problem: API returns error on prediction.
Solution:
- Check image format (should be PNG or JPG)
- Verify file is uploaded correctly in request:
- Check API logs for detailed error
- Run integration tests:
- Test with a sample image from the dataset
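For the second point, a multipart upload with curl looks like this (the endpoint path and form field name are assumptions about this API — check the API docs or code):

```shell
# POST an image as multipart form data; prints a hint if the server is not
# running or the file is missing
curl -s -F "file=@test.png" http://localhost:8000/predict \
    || echo "request failed - is the API running on port 8000?"
```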
Docker Issues
Permission Denied
Problem: Docker build or run fails with permission error.
Solution:
# Add user to docker group (Linux)
sudo usermod -aG docker $USER
newgrp docker
# Or run with sudo
sudo docker build ...
GPU Not Available in Docker
Problem: Docker container can't access GPU.
Solution:
- Install the NVIDIA Container Toolkit
- Verify Docker can see the GPU
- Ensure you run the container with the --gpus all flag
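A quick host-side check (the CUDA image tag in the comment is only an example):

```shell
# Is an NVIDIA runtime registered with Docker on this host?
docker info 2>/dev/null | grep -i nvidia || echo "no NVIDIA runtime visible (or Docker not running)"
# Then verify from inside a container:
#   docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```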
Build Context Too Large
Problem: Docker build is very slow.
Solution:
- Check that a .dockerignore file exists
- Add large directories to .dockerignore
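A typical .dockerignore for this kind of project (directory names are examples — keep whatever your image actually needs at build time):

```
.git/
.venv/
data/
models/
outputs/
*.pth
```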
DVC Issues
DVC Pull Fails
Problem: Can't download data from DVC.
Solution:
- Check you have access to GCS bucket (team members only)
- Verify you are logged into Google Cloud with Application Default Credentials (ADC):
- Ensure the correct project is set:
- Force re-pull:
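The commands, for reference (the project ID is a placeholder, and `gcloud auth application-default login` opens a browser, so run these interactively):

```shell
# Run interactively; the login step opens a browser window:
#   gcloud auth application-default login
#   gcloud config set project <your-project-id>
#   dvc pull --force
# Quick sanity check that dvc is installed and the repo is initialized:
command -v dvc >/dev/null 2>&1 && dvc status || echo "dvc not available here"
```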
DVC Push Fails
Problem: Can't push data to DVC remote.
Solution:
- Check you have write permissions to GCS bucket.
- Re-authenticate with ADC:
Testing Issues
Tests Fail
Problem: Pytest tests fail.
Solution:
- Ensure dependencies are installed:
- Check test requirements:
- Run specific failing test with more output:
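For the last point (the test path is illustrative; -x stops at the first failure and --showlocals prints local variables in tracebacks):

```shell
# Re-run a single failing test file with verbose output
uv run pytest tests/test_data.py -v -x --showlocals \
    || echo "pytest failed or uv is not on PATH"
```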
Import Errors in Tests
Problem: Tests can't import modules.
Solution:
- Install the package in editable mode
- Verify the pythonpath setting in pyproject.toml
Documentation Issues
MkDocs Build Fails
Problem: Documentation won't build.
Solution:
# Ensure you're running from project root
cd /path/to/DTU_MLOps_111
# Check config file exists
ls docs/mkdocs.yaml
# Build with explicit config path
uv run mkdocs build --config-file docs/mkdocs.yaml
Missing Plugin
Problem: MkDocs complains about missing plugin.
Solution:
# Install mkdocs dependencies
uv add --dev mkdocs-material mkdocstrings mkdocstrings-python
# Or sync existing dependencies
uv sync
General Tips
Check Logs
Most tools provide verbose logging:
# UV with debug output
uv -v sync
# Invoke with command echo (the core --echo flag goes before the task name)
uv run invoke --echo preprocess
# Python with debug
PYTHONVERBOSE=1 uv run python script.py
Clean Start
If all else fails, start fresh:
# Remove virtual environment
rm -rf .venv
# Remove UV cache
uv cache clean
# Reinstall everything
uv sync --reinstall
# Reload environment
source .env
Get Help
If you're still stuck:
- Check error messages carefully
- Search GitHub issues
- Check nnU-Net documentation: nnU-Net GitHub
- Contact team members (see homepage)
Still Having Issues?
If your problem isn't listed here:
- Check the nnU-Net documentation
- Review error messages and stack traces
- Enable verbose/debug logging
- Contact the project team