Luigi ML Pipeline
This repository showcases an easy-to-follow method for automating data transformations and modeling with a Luigi data pipeline.
Key Components
- Python version 3.7 or higher
- Streamlit for interactive applications
- Scikit-learn for machine learning tasks
- Pandas for data handling
- Luigi for workflow automation
Concept
The entire workflow is encapsulated in an interactive application found in the pipeline.py script. Refer to the “Configuration” and “Running the Scripts” sections for details on setting up and launching the application.
Configuration
- Set up a dedicated virtual environment (using conda is suggested):
  conda create --name data_workflow python=3.7
- Activate your new virtual environment:
  conda activate data_workflow
- Install the necessary packages:
  pip install -r requirements.txt
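After installation, a quick check can confirm that the key components are importable from the active environment. The snippet below is a minimal sketch and not part of the repository; the helper names `missing_modules` and `python_is_supported` are hypothetical, and the module list assumes the usual import names (scikit-learn is imported as `sklearn`).

```python
import importlib.util
import sys

# Import names for the key components listed above.
# Assumption: scikit-learn is importable under its usual name "sklearn".
REQUIRED_MODULES = ["streamlit", "sklearn", "pandas", "luigi"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 7)):
    """Return True if the running interpreter is at least version `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.7 or higher is required.")
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All key components are installed.")
```

Running this with the virtual environment activated should report any package that still needs to be installed.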
Running the Scripts
Interactive Application
To launch the interactive app, use the Streamlit command within your activated virtual environment:
(data_workflow) streamlit run pipeline.py
This will start a local server accessible at: http://localhost:8501.
Data Workflow
To run a specific task, for instance TaskX located in the workflow.py script, use the following command:
PYTHONPATH=. luigi --module workflow TaskX --local-scheduler
Feel free to expand upon the code by adding your own custom tasks!