lakeFS

Open-source data version control platform that transforms object storage into Git-like repositories. Enables atomic operations, branching, and versioning for data lakes with support for S3, GCS, Azure Blob, and other cloud storage systems.

Use it when

  • Managing data lakes with Git-like operations (branch, commit, merge)
  • Needing atomic and isolated operations in data pipelines
  • Creating isolated dev/test environments from production data
  • Implementing data governance and audit trails for compliance
  • Rolling back data changes safely without data duplication
  • Setting up CI/CD workflows for data quality validation
  • Managing multiple data environments with lightweight branching
  • Integrating with existing MLOps tools for complete ML lifecycle management
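
The branch, commit, merge cycle listed above can be sketched with lakectl, the lakeFS command-line client. This is a minimal sketch, not a reference workflow: it assumes lakectl is installed and configured, and the repository name example-repo, the branch name experiment, and the local file data.csv are all hypothetical. The run helper is only an illustration device; it prints each command unless DRY_RUN=0 is set, so the script can be read and tried without a server.

```shell
#!/bin/sh
# Hypothetical names, used for illustration only.
REPO="lakefs://example-repo"
BRANCH="experiment"

# Print commands by default; set DRY_RUN=0 to execute them against a real server.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

workflow() {
  # Branch off main; lakeFS branches are lightweight and do not copy data.
  run lakectl branch create "$REPO/$BRANCH" --source "$REPO/main"
  # Upload a local file to the new branch only; main is untouched.
  run lakectl fs upload "$REPO/$BRANCH/raw/data.csv" --source data.csv
  # Record the change as a single commit on the branch.
  run lakectl commit "$REPO/$BRANCH" -m "Add raw data"
  # Merge the branch back into main once the data has been validated.
  run lakectl merge "$REPO/$BRANCH" "$REPO/main"
}

workflow
```

In a CI/CD setup the validation step would sit between the commit and the merge, so that failing data never reaches main.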

Watch out

  • Performance bottlenecks with extremely large datasets (millions of files)
  • Additional storage overhead from branching and versioning mechanisms
  • Not suitable for real-time streaming or low-latency requirements
  • Operational complexity in managing branching strategies and merge conflicts
  • Some analytics tools may need configuration changes to work with the S3-compatible API
  • Limitations with certain file formats and data processing engines
  • Learning curve for teams new to data versioning concepts
  • Infrastructure requirements for self-hosting and maintenance overhead
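
On the S3-compatible API point above: through the lakeFS S3 gateway, the repository takes the place of the bucket and the branch (or other ref) becomes the first component of the object key. The helper below is a small sketch of that addressing scheme; the function name lakefs_s3_uri and the repository, branch, and object names are hypothetical.

```shell
#!/bin/sh
# Build the S3-style address of an object served through the lakeFS S3 gateway.
# Arguments: repository, ref (branch, tag, or commit ID), object path.
lakefs_s3_uri() {
  repo="$1"; ref="$2"; path="$3"
  # Strip one leading slash from the path so the result has no double slash.
  printf 's3://%s/%s/%s\n' "$repo" "$ref" "${path#/}"
}

lakefs_s3_uri example-repo main tables/events/part-0000.parquet
# prints: s3://example-repo/main/tables/events/part-0000.parquet

# The same object on another branch: only the ref changes.
lakefs_s3_uri example-repo experiment /tables/events/part-0000.parquet
# prints: s3://example-repo/experiment/tables/events/part-0000.parquet
```

A tool that reads such paths usually also needs its S3 endpoint pointed at the lakeFS server instead of the cloud provider. With the AWS CLI, for example, that is the --endpoint-url option.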

Available in stages

Data Versioning

Installation

docker run --pull always --rm --publish 8000:8000 treeverse/lakefs:latest run --quickstart
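
The command above runs in the foreground, so scripts that depend on the server typically wait for it from a second terminal. A minimal sketch of such a wait loop follows; the wait_until helper is hypothetical, and the commented usage line assumes the server answers HTTP on the published port 8000 and that curl is available.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget runs out.
# Arguments: max attempts, delay in seconds between attempts, then the command.
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Succeed as soon as the probed command exits with status 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (assumption: the quickstart server listens on localhost:8000):
# wait_until 30 2 curl -fsS http://localhost:8000/ && echo "lakeFS is reachable"
```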
