Welcome to the high-octane world of production ML pipelines! We're thrilled to present an epic demonstration showcasing numerous MLOps concepts packed into a single dbt project. Strap in as we unveil this treasure trove of tools, tailored to empower data teams within organizations, speeding up the journey of ML models to production!
Imagine a daily (or weekly) sports betting scenario where you're on a quest to outsmart the bookies. This project houses the code for a data warehouse powered by the European Soccer Database. Using team and player statistics, performance metrics, FIFA stats, and bookie odds, we hunt down opportunities where our model paints a more accurate picture than at least one bookie. When our estimated odds stack up better against theirs, it's our chance to strike gold: for example, if our model puts a home win at 50% while a bookie's odds imply only 40%, backing that outcome has positive expected value. 💰
Within our pipeline, you can:
- Version Your Dataset: run preprocessing to (re)generate your ML dataset
- Experiment & Store: run and save experiments
- Model Management: save and compare models
- Reproducibility: ensure inference pipelines run without train/serving skew (run simulations)
- Feature Store: house all input features together with the KPIs available at that point in time
- Prediction Audit: maintain a log of all predictions
This thrilling adventure requires:
- Python
- Access to a Databricks cluster (e.g., Azure free account)
- A firm grasp of dbt for seamless execution of these examples
Buckle up for the setup ride:
- Install a virtual environment:

```shell
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
- Download the data from here -> you need a Kaggle account. Drop the resulting `database.sqlite` file in the `data` folder.
- Convert the data to parquet and csv files:

```shell
python scripts/convert_data.py
```
- Databricks
- Create a SQL warehouse -> check the connection details for your profile in the next step
- Create a personal access token; keep this token close and use it to connect dbt to your SQL warehouse.
- Upload the data (parquet files) to the warehouse, into the `default` schema in the `hive_metastore` catalog. Your catalog should look something like this:
- Create a compute cluster
- Check the cluster id (you can find it in the Spark UI) and set it as an env var:

```shell
COMPUTE_CLUSTER_ID=...
```
- dbt
- Initialise the project and install dependencies:

```shell
cd dbt_your_best_bet
dbt deps
```
- Set up your dbt profile; it should look something like this (a quick connection check is sketched after this setup list):

```yaml
databricks:
  outputs:
    dev:
      catalog: hive_metastore
      host: xxx.cloud.databricks.com
      http_path: /sql/1.0/warehouses/$SQL_WAREHOUSE_ID
      schema: football
      threads: 4  # max number of parallel processes
      token: $PERSONAL_ACCESS_TOKEN
      type: databricks
  target: dev
```
- `riskrover` python package, managed with poetry
  - Build and install the package in your local environment:

```shell
cd riskrover
poetry build
pip install dist/riskrover-x.y.z.tar.gz
```
  - Install the resulting `riskrover` whl file on your Databricks compute cluster
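With the dbt profile in place, you can optionally verify that dbt reaches your SQL warehouse before building anything:

```shell
# Run from the dbt project root; checks profile configuration and connectivity
dbt debug
```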
You should now be able to run the entire pipeline without any trained models (i.e. the preprocessing):
```shell
dbt build --selector gold
```
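If you want to see which models and tests the `gold` selector covers, dbt can list them from the project root:

```shell
# Preview the nodes matched by the gold selector (defined in the project's selectors.yml)
dbt ls --selector gold
```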
Explore and command the powers of our pipeline.
For these examples to work -> you need to move to the root dir of the dbt project, i.e. `dbt_your_best_bet`. The default variables are stored in `dbt_project.yaml`. We find ourselves on 2016-01-01 in our simulation, with the option to run until 2016-05-25.
```shell
cd dbt_your_best_bet

# Preprocessing
dbt build --selector gold

# Experimentation (by default -> training set up to 2015-07-31, trains a simple logistic regression with cross validation)
dbt build --selector ml_experiment

# Inference on the test set (2015-08-01 -> 2015-12-31)
dbt build --selector ml_predict_run

# Moving forward in time, for example with a weekly run
dbt build --vars '{"run_date": "2016-01-08"}'
dbt build --vars '{"run_date": "2016-01-15"}'
dbt build --vars '{"run_date": "2016-01-22"}'
...
```
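If you prefer to script that weekly progression rather than typing each date, here is a minimal sketch (the dates are arbitrary examples within the simulation window):

```shell
# Step the simulated weekly run forward; adjust the date list as needed
for run_date in 2016-01-29 2016-02-05 2016-02-12; do
  dbt build --vars "{\"run_date\": \"${run_date}\"}"
done
```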
```shell
cd dbt_your_best_bet
dbt docs generate
dbt docs serve
```
It's a grand lineage tale, with no models documented yet (stay tuned!). We can already check the lineage:
Mostly maintenance; no plans for new features unless requested.
- Documentation
- Tests
- Extra sql analysis models
All contributions are welcome!
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License.