Troubleshooting
Common issues and their solutions.
Installation Issues
UV Command Not Found
Problem: After installing UV, the command is not recognized.
Solution:
- Restart your terminal
- Add cargo bin directory to PATH:
- Make it permanent by adding to your shell profile:
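The steps above can be sketched as follows (install locations are the usual UV defaults — `~/.cargo/bin` for older installers, `~/.local/bin` for newer ones — so verify where your installer put the binary):

```shell
# Add UV's install directory to PATH for the current session
export PATH="$HOME/.cargo/bin:$HOME/.local/bin:$PATH"
# To make it permanent, append the same export line to ~/.bashrc or ~/.zshrc.
```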
Python Version Mismatch
Problem: Wrong Python version installed.
Solution:
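A sketch of the fix, assuming the project targets Python 3.12 (adjust the version as needed; `uv python pin` writes a `.python-version` file):

```shell
# See which interpreter is currently active
python3 --version
# Pin the project to the supported version and re-sync:
#   uv python install 3.12
#   uv python pin 3.12
#   uv sync
```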
Dependency Resolution Errors
Problem: UV can't resolve dependencies.
Solution:
# Clear UV cache
uv cache clean
# Reinstall dependencies
uv sync --reinstall
# If still failing, check pyproject.toml for conflicts
Environment Configuration
Environment Variables Not Set
Problem: nnU-Net can't find data directories.
Solution:
- Ensure the .env file exists
- Load the environment variables (source .env)
- Verify they are set (echo $nnUNet_raw)
- If they are empty, check that .env uses valid export syntax
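A minimal .env using the three directories nnU-Net expects might look like this (the variable names are the standard nnU-Net ones; the paths are placeholders and must be absolute):

```shell
# .env - loaded with `source .env`, so use plain shell export syntax
export nnUNet_raw="/absolute/path/to/data/nnUNet_raw"
export nnUNet_preprocessed="/absolute/path/to/data/nnUNet_preprocessed"
export nnUNet_results="/absolute/path/to/models/nnUNet_results"
```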
Kaggle API Credentials Invalid
Problem: Can't download data from Kaggle.
Solution:
- Verify the credentials in .env
- Check that your Kaggle API key is current:
- Go to kaggle.com/settings
- Create a new token
- Update .env with the new credentials
- Ensure there are no extra spaces or quotes around the values
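For reference, the Kaggle client reads these two environment variables (values below are placeholders, taken from the kaggle.json token you download):

```shell
# .env - no spaces around '=' and no stray quotes inside the values
export KAGGLE_USERNAME="your-kaggle-username"
export KAGGLE_KEY="your-api-key-from-kaggle.json"
```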
Weights & Biases Authentication
Problem: W&B login fails.
Solution:
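One way to fix this, assuming your API key lives in .env: grab a fresh key from wandb.ai/authorize, export it, and log in again (sketch):

```shell
# Export a fresh key (placeholder value) so the CLI can pick it up
export WANDB_API_KEY="your-wandb-api-key"
# Then re-run the login:
#   uv run wandb login            # uses WANDB_API_KEY if set
#   uv run wandb login --relogin  # force re-authentication
```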
Data Issues
Data Download Fails
Problem: Download from Kaggle times out or fails.
Solution:
# Try again with verbose output
uv run python -c "import kaggle; kaggle.api.dataset_download_files(
'santurini/semantic-segmentation-drone-dataset',
path='data/raw',
unzip=True
)"
# Check internet connection
ping kaggle.com
# Check disk space
df -h
Data Export Fails
Problem: Converting to nnU-Net format fails.
Solution:
- Verify data was downloaded:
- If missing, re-download:
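As a quick check (the data/raw path matches the download snippet in the previous section; the invoke task name is an assumption about this project's tasks):

```shell
# Verify the raw data is present; prints a hint if it is not
ls data/raw 2>/dev/null || echo "data/raw is empty or missing - re-download"
# Re-download, e.g.:
#   uv run invoke download-data   # task name is hypothetical
```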
Preprocessing Fails
Problem: nnU-Net preprocessing crashes.
Solution:
- Check data integrity:
- Verify environment variables are absolute paths:
- Re-export data:
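A small helper for the second point — nnU-Net refuses relative paths, and the variable names below are the standard nnU-Net ones:

```shell
# Flag any nnU-Net directory variable that is unset or not an absolute path
check_abs() {
    case "$2" in
        /*) echo "$1: OK ($2)" ;;
        *)  echo "$1: NOT absolute or unset ('$2')" ;;
    esac
}
check_abs nnUNet_raw "$nnUNet_raw"
check_abs nnUNet_preprocessed "$nnUNet_preprocessed"
check_abs nnUNet_results "$nnUNet_results"
```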
Python 3.13 Compatibility / Distutils Error
Problem: Preprocessing can fail on Python 3.13 with errors related to distutils (e.g., ModuleNotFoundError: No module named 'distutils'). This is due to distutils being removed in recent Python versions. Note that this error may not occur for everyone, as it depends on your specific environment and how dependencies are resolved.
Solution: Downgrade your environment to Python 3.12, which is the fully supported version for this project's dependencies.
# Force UV to use Python 3.12
uv python install 3.12
uv venv --python 3.12
uv sync
# Then try preprocessing again
uv run invoke preprocess
Training Issues
CUDA Out of Memory
Problem: GPU runs out of memory during training.
Solution:
- Reduce the batch size (nnU-Net chooses it automatically, but you can force a smaller one)
- Use a smaller model configuration (e.g. 2d instead of 3d_fullres)
- Close other applications that are using the GPU
- Monitor GPU usage
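For the last point, a check that degrades gracefully (the --query-gpu flags are standard nvidia-smi options):

```shell
# Show GPU memory usage; falls back to a message on machines without an NVIDIA driver
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv
else
    echo "nvidia-smi not found - no NVIDIA driver on this machine"
fi
```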
Frozen Training / Background Workers Stopped
Problem: Training hangs indefinitely or fails with an error like DataLoader worker (pid X) is killed or Background workers stopped. No explicit "Out of Memory" error is shown. This is often due to shared memory (shm) exhaustion or system resource leaks during long training runs.
Solution:
- Clear System Cache: A system restart often clears the leaked resources.
- Reduce Workers: If it happens frequently, try reducing the number of data loader workers in your training configuration (though nnU-Net handles this, system-level pressure can still trigger it).
- Check Shared Memory: If running in Docker, ensure you have increased --shm-size (see the Docker Out of Memory section below).
- Monitor RAM: Ensure your system is not swapping heavily.
Docker Out of Memory
Problem: Training crashes inside Docker with a "Killed" message or generic memory errors, even if your GPU has enough memory. This usually happens because the Docker VM itself (on macOS or Windows) has allocated too little RAM.
Solution: Increase the memory allocated to Docker Desktop.
- Open Docker Desktop Settings (gear icon).
- Go to Resources -> Advanced.
- Increase Memory (we recommend at least 16GB for stable training).
- Increase Swap to 4GB or more.
- Click Apply & Restart.
For Linux / CLI Users:
On Linux, Docker usually has access to all system memory unless limited. If you encounter issues, ensure you are providing enough shared memory (essential for multi-processing in nnU-Net) and avoiding hard limits.
# Add these flags to your 'docker run' command if needed:
#   --shm-size=8gb    -> increase shared memory (essential for nnU-Net's workers)
#   --memory=16g      -> hard limit, useful if the host is unstable
#   --memory-swap=20g -> total limit including swap
docker run --shm-size=8gb --memory=16g --memory-swap=20g ...
Device Not Found
Problem: Can't use GPU for training.
Solution:
- Check CUDA availability
- If it reports False, train on the CPU
- On a Mac with Apple silicon, use the MPS backend
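The checks above can be run like this (falls back cleanly when torch is missing from the current interpreter; inside the project, prefix with `uv run`):

```shell
python3 - <<'EOF'
# Report the best available device for PyTorch: cuda, mps, or cpu
try:
    import torch
    if torch.cuda.is_available():
        print("cuda")
    elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        print("mps")
    else:
        print("cpu")
except ImportError:
    print("torch is not installed in this interpreter - run via 'uv run python'")
EOF
```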
Training Extremely Slow
Problem: Training takes forever.
Solution:
- Use GPU:
- Check you have preprocessed data:
- Monitor system resources:
API Issues
Port Already in Use
Problem: Can't start API server.
Solution:
# Find process using port 8000
lsof -i :8000
# Kill the process
kill -9 <PID>
# Restart it
uv run invoke app
Module Not Found Error
Problem: API or scripts refer to the project package but can't find it.
Solution:
- Reinstall dependencies:
- If it persists, install in editable mode manually:
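The two steps, sketched (`drone_segmentation` is a placeholder for this project's actual package name):

```shell
# Reinstall dependencies, then force an editable install if needed:
#   uv sync
#   uv pip install -e .
# Confirm the package is importable (prints None if it is not found):
python3 -c "import importlib.util as u; print(u.find_spec('drone_segmentation'))"
```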
Model Not Found
Problem: API can't find model checkpoint.
Solution:
- Download pre-trained model:
- Or train your own:
- Verify checkpoint exists:
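For the last step: nnU-Net saves final weights as checkpoint_final.pth, so a search from the project root works as a quick verification:

```shell
# List any trained checkpoints; prints nothing if none exist yet
find . -name "checkpoint_final.pth" 2>/dev/null
```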
Prediction Fails
Problem: API returns error on prediction.
Solution:
- Check image format (should be PNG or JPG)
- Verify file is uploaded correctly in request:
- Check API logs for detailed error
- Run integration tests:
- Test with a sample image from the dataset
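For the second point, a multipart upload with curl looks like this (the endpoint path and form field name are assumptions about this API — check the API docs or code):

```shell
# POST an image as multipart form data; prints a hint if the server is not
# running or the file is missing
curl -s -F "file=@test.png" http://localhost:8000/predict \
    || echo "request failed - is the API running on port 8000?"
```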
Docker Issues
Permission Denied
Problem: Docker build or run fails with permission error.
Solution:
# Add user to docker group (Linux)
sudo usermod -aG docker $USER
newgrp docker
# Or run with sudo
sudo docker build ...
GPU Not Available in Docker
Problem: Docker container can't access GPU.
Solution:
- Install the NVIDIA Container Toolkit
- Verify Docker can see the GPU
- Ensure you run the container with the --gpus all flag
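A quick host-side check (the CUDA image tag in the comment is only an example):

```shell
# Is an NVIDIA runtime registered with Docker on this host?
docker info 2>/dev/null | grep -i nvidia || echo "no NVIDIA runtime visible (or Docker not running)"
# Then verify from inside a container:
#   docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```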
Build Context Too Large
Problem: Docker build is very slow.
Solution:
- Check that a .dockerignore file exists
- Add large directories to .dockerignore
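A typical .dockerignore for this kind of project (directory names are examples — keep whatever your image actually needs at build time):

```
.git/
.venv/
data/
models/
outputs/
*.pth
```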
DVC Issues
DVC Pull Fails
Problem: Can't download data from DVC.
Solution:
- Check you have access to GCS bucket (team members only)
- Verify you are logged into Google Cloud with Application Default Credentials (ADC):
- Ensure the correct project is set:
- Force re-pull:
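The commands, for reference (the project ID is a placeholder, and `gcloud auth application-default login` opens a browser, so run these interactively):

```shell
# Run interactively; the login step opens a browser window:
#   gcloud auth application-default login
#   gcloud config set project <your-project-id>
#   dvc pull --force
# Quick sanity check that dvc is installed and the repo is initialized:
command -v dvc >/dev/null 2>&1 && dvc status || echo "dvc not available here"
```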
DVC Push Fails
Problem: Can't push data to DVC remote.
Solution:
- Check you have write permissions to GCS bucket.
- Re-authenticate with ADC:
Testing Issues
Tests Fail
Problem: Pytest tests fail.
Solution:
- Ensure dependencies are installed:
- Check test requirements:
- Run specific failing test with more output:
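For the last point (the test path is illustrative; -x stops at the first failure and --showlocals prints local variables in tracebacks):

```shell
# Re-run a single failing test file with verbose output
uv run pytest tests/test_data.py -v -x --showlocals \
    || echo "pytest failed or uv is not on PATH"
```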
Import Errors in Tests
Problem: Tests can't import modules.
Solution:
- Install the package in editable mode
- Verify the pythonpath setting in pyproject.toml
Documentation Issues
MkDocs Build Fails
Problem: Documentation won't build.
Solution:
# Ensure you're running from project root
cd /path/to/DTU_MLOps_111
# Check config file exists
ls docs/mkdocs.yaml
# Build with explicit config path
uv run mkdocs build --config-file docs/mkdocs.yaml
Missing Plugin
Problem: MkDocs complains about missing plugin.
Solution:
# Install mkdocs dependencies
uv add --dev mkdocs-material mkdocstrings mkdocstrings-python
# Or sync existing dependencies
uv sync
General Tips
Check Logs
Most tools provide verbose logging:
# UV with debug output
uv -v sync
# Invoke with command echo (the core --echo flag goes before the task name)
uv run invoke --echo preprocess
# Python with debug
PYTHONVERBOSE=1 uv run python script.py
Clean Start
If all else fails, start fresh:
# Remove virtual environment
rm -rf .venv
# Remove UV cache
uv cache clean
# Reinstall everything
uv sync --reinstall
# Reload environment
source .env
Get Help
If you're still stuck:
- Check error messages carefully
- Search GitHub issues
- Check nnU-Net documentation: nnU-Net GitHub
- Contact team members (see homepage)
Still Having Issues?
If your problem isn't listed here:
- Check the nnU-Net documentation
- Review error messages and stack traces
- Enable verbose/debug logging
- Contact the project team