Open-source data version control platform that transforms object storage into Git-like repositories. Enables atomic operations, branching, and versioning for data lakes with support for S3, GCS, Azure Blob, and other cloud storage systems.
Use it when
•Managing data lakes with Git-like operations (branch, commit, merge)
•Running data pipelines that need atomic, isolated operations
•Creating isolated dev/test environments from production data
•Implementing data governance and audit trails for compliance
•Rolling back data changes safely without data duplication
•Setting up CI/CD workflows for data quality validation
•Managing multiple data environments with lightweight branching
•Integrating with existing MLOps tools for complete ML lifecycle management
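The branch, commit, and merge cycle named above can be sketched with the lakectl CLI. This is a minimal illustration, not a recommended workflow: the repository name example-repo, the branch name experiment, and the file data.csv are hypothetical placeholders, and the run helper is an addition for this sketch that only prints each command unless DRY_RUN=0 is set. It also assumes lakectl is already configured against a running server.

```shell
#!/bin/sh
# Sketch of a branch -> upload -> commit -> merge cycle with lakectl.
# Hypothetical names: example-repo, experiment, data.csv.
set -eu

REPO="lakefs://example-repo"
BRANCH="experiment"

# Print each command by default; set DRY_RUN=0 to execute it for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

branch_workflow() {
  # Create an isolated branch from main.
  run lakectl branch create "$REPO/$BRANCH" --source "$REPO/main"
  # Write a local file to the branch; main is untouched until the merge.
  run lakectl fs upload --source data.csv "$REPO/$BRANCH/raw/data.csv"
  # Record the change as a commit on the branch.
  run lakectl commit "$REPO/$BRANCH" -m "Add raw data"
  # Merge the branch into main (source ref first, destination second).
  run lakectl merge "$REPO/$BRANCH" "$REPO/main"
}

branch_workflow
```

Run as-is, the script only echoes the four commands, which makes it safe to review before pointing it at a real repository with DRY_RUN=0.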
Watch out
⚠Performance bottlenecks with extremely large datasets (millions of files)
⚠Additional storage overhead from branching and versioning mechanisms
⚠Not suitable for real-time streaming or low-latency requirements
⚠Operational complexity in managing branching strategies and merge conflicts
⚠Some analytics tools may need modifications to work with the S3-compatible API
⚠Limitations with certain file formats and data processing engines
⚠Learning curve for teams new to data versioning concepts
⚠Infrastructure requirements for self-hosting and maintenance overhead
Available in stages
Data Versioning
Installation
docker run --pull always --rm --publish 8000:8000 treeverse/lakefs:latest run --quickstart