Synopsis
The Challenge
Building AI/ML systems requires immense resources: training runs on computationally demanding hardware (expensive GPUs), input datasets range from terabytes to petabytes for large models, and the resulting models weigh in anywhere from a few hundred MiB to tens of GiB.
The process involves:
- Data collection & preparation: assemble and prepare massive datasets for the training process.
- Pre-training & fine-tuning: feed the prepared data to the model and let it learn.
- Inference: let end users interact with the model, where an input leads to an output; this stage also requires storage & compute, though far less than training (see the sketch after this list).
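For orientation, the three stages map onto roughly this shape of code. The model, synthetic data, and training loop below are generic PyTorch placeholders for illustration, not tied to any particular system:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# 1. Data collection & preparation: assemble features and labels into a dataset.
features = torch.randn(1024, 16)            # stand-in for collected data
labels = (features.sum(dim=1) > 0).float()  # stand-in for prepared labels
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

# 2. Pre-training & fine-tuning: let the model learn from the prepared data.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        optimizer.step()

# 3. Inference: an input leads to an output for the end user.
with torch.no_grad():
    prediction = torch.sigmoid(model(torch.randn(1, 16)))
print(prediction.item())
```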
Our Solution
Basin enables object storage with verifiable data pipelines, a decentralized architecture, and built-in access control and ownership, providing tools that solve common challenges in ML/AI, including:
- Pool and collaborate on data: many actors write to a single repo, aggregating fragmented data into a unified and valuable asset.
- Provision access & monetize data with programmable read & write access control, configurable pricing & licensing, and flexible governance options.
- Add verifiability & provenance to data: sign it at its source so consumers can verify where it came from (see the sketch after this list).
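To make the provenance idea concrete, here is a minimal sketch of source-side signing, assuming an Ed25519 producer key and a local `dataset.parquet` file (both hypothetical placeholders); Basin's actual signing scheme may differ:

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Hypothetical producer key; in practice this is the data source's identity key.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# Hash the dataset file in chunks so arbitrarily large files fit in memory.
digest = hashlib.sha256()
with open("dataset.parquet", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)
dataset_hash = digest.digest()

# Sign the hash at the source; the signature travels with the data
# as provenance metadata.
signature = private_key.sign(dataset_hash)

# Any consumer holding the producer's public key can verify origin before use.
public_key.verify(signature, dataset_hash)  # raises InvalidSignature on mismatch
print("dataset provenance verified")
```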
Collaboration Over Large Datasets
How it Works
Basin makes data available by replicating datasets & models to decentralized storage for open access.
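As an illustration only, a collaborative write might look like the following sketch; the `push_object` helper, the repo path, and the `ObjectRef` type are hypothetical stand-ins, not Basin's actual API:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ObjectRef:
    """Content-addressed reference to a stored object (hypothetical)."""
    repo: str
    key: str
    sha256: str

def push_object(repo: str, key: str, payload: bytes) -> ObjectRef:
    """Sketch of writing an object to a shared repo.

    A real client would upload `payload` to the storage network and
    replicate it across providers; here we only model the bookkeeping.
    """
    digest = hashlib.sha256(payload).hexdigest()
    # ... network upload + replication would happen here ...
    return ObjectRef(repo=repo, key=key, sha256=digest)

# Many actors can write into the same repo under distinct keys,
# aggregating fragmented data into one addressable dataset.
ref = push_object("ml-team/shared-dataset", "images/batch-0001.tar", b"...")
print(ref)
```

Content addressing (hashing the payload) is what lets independent writers share one repo safely: any consumer can check that the bytes they retrieve match the reference they were given.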
Benefits
Provides redundancy, fault tolerance, and retrieval options that reduce hosted storage costs, guarantee data liveness, and enable open data access, driving a better experience for data consumers.
Data Provenance & Transparency
How it Works