🚀 Deduplication Engine

A high-performance, scalable deduplication engine built with Go, featuring variable-block chunking, intelligent caching, and a microservices architecture.

📋 Overview

The Deduplication Engine is a cloud-native system for space-efficient data storage and backup. It is written in Go, packaged as Docker containers, and split into microservices that communicate over gRPC.

Deduplication Ratio: 99.92%
Chunks/Second: 640
Processing Speed: 10MB/s
Microservices: 4

✨ Features

🔧 Variable-Block Chunking

Uses content-defined chunking, with each chunk identified by its BLAKE3 hash, to deduplicate data across different file types and sizes.
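As a rough illustration of content-defined chunking, the sketch below cuts a byte slice wherever a gear-style rolling hash matches a bit mask, then hashes each chunk. It is not the engine's actual chunker: the size bounds, the mask, the `Split` function, and the `Chunk` type are all hypothetical, and SHA-256 from the Go standard library stands in for BLAKE3 so the example runs without third-party packages.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Chunk describes one content-defined slice of the input and its digest.
type Chunk struct {
	Offset int
	Length int
	Hash   string
}

const (
	minSize = 2 * 1024  // hypothetical lower bound on chunk size
	maxSize = 64 * 1024 // hypothetical upper bound on chunk size
	mask    = 1<<13 - 1 // cut when the low 13 bits are zero (about 8 KiB on average)
)

// gearTable maps each byte value to a pseudo-random 64-bit number.
// It is generated deterministically so the same input always chunks the same way.
var gearTable = func() [256]uint64 {
	var t [256]uint64
	x := uint64(0x9E3779B97F4A7C15)
	for i := range t {
		// One xorshift64 step per table entry.
		x ^= x << 13
		x ^= x >> 7
		x ^= x << 17
		t[i] = x
	}
	return t
}()

// Split cuts data at boundaries chosen by the rolling hash, so boundaries
// depend on nearby content rather than on absolute file offsets.
func Split(data []byte) []Chunk {
	var chunks []Chunk
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + gearTable[b]
		size := i - start + 1
		if (size >= minSize && h&mask == 0) || size >= maxSize {
			chunks = append(chunks, newChunk(data, start, i+1))
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, newChunk(data, start, len(data)))
	}
	return chunks
}

// newChunk hashes data[from:to]. SHA-256 is a stand-in for BLAKE3 here.
func newChunk(data []byte, from, to int) Chunk {
	sum := sha256.Sum256(data[from:to])
	return Chunk{Offset: from, Length: to - from, Hash: hex.EncodeToString(sum[:])}
}

func main() {
	data := make([]byte, 256*1024)
	for i := range data {
		data[i] = byte(i*31 + i/7)
	}
	for _, c := range Split(data) {
		fmt.Printf("offset=%d length=%d hash=%s\n", c.Offset, c.Length, c.Hash[:12])
	}
}
```

Two chunks with the same hash can be stored once, which is where the deduplication comes from. The minimum and maximum sizes keep pathological inputs from producing tiny or unbounded chunks.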

⚡ Intelligent Caching

An LRU cache combined with a cuckoo filter detects duplicate chunks quickly, reducing storage overhead and improving performance.
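The LRU half of that design can be sketched with the standard library alone. The example below is an assumption about how such a cache might look, not the engine's code: `HashCache`, `Seen`, and `Add` are hypothetical names, and the cuckoo filter is left out entirely (in a full design it would act as a probabilistic pre-check before the cache is consulted).

```go
package main

import (
	"container/list"
	"fmt"
)

// HashCache remembers the most recently used chunk hashes, up to a fixed capacity.
type HashCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // hash -> its element in order
}

// NewHashCache returns a cache holding at most capacity hashes (minimum 1).
func NewHashCache(capacity int) *HashCache {
	if capacity < 1 {
		capacity = 1
	}
	return &HashCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Seen reports whether hash is cached and, if so, marks it as recently used.
func (c *HashCache) Seen(hash string) bool {
	el, ok := c.items[hash]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

// Add records hash, evicting the least recently used entry when the cache is full.
func (c *HashCache) Add(hash string) {
	if el, ok := c.items[hash]; ok {
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(string))
	}
	c.items[hash] = c.order.PushFront(hash)
}

func main() {
	cache := NewHashCache(2)
	for _, h := range []string{"aa", "bb", "aa", "cc", "bb"} {
		if cache.Seen(h) {
			fmt.Println(h, "duplicate (cache hit)")
			continue
		}
		fmt.Println(h, "not cached, would fall through to the chunk index")
		cache.Add(h)
	}
}
```

A cache miss does not prove a chunk is new, only that it was not seen recently, so a miss would still need to be checked against the persistent chunk index.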

πŸ—οΈ Microservices Architecture

Distributed services for ingest, storage, and stream handling with gRPC communication for high performance.

🐳 Containerized

Full Docker support with docker-compose for easy deployment and scaling in any environment.

🗄️ Database Integration

CockroachDB for metadata storage with ACID compliance and distributed capabilities.

☁️ Object Storage

MinIO integration for scalable chunk storage with S3-compatible API.

πŸ—οΈ Architecture

Stream Handler File Reading gRPC Client Ingest Node Chunking Deduplication Data Storage Metadata DB Object Store CockroachDB Metadata & Chunk Index MinIO Chunk Storage & Object Store

Technology Stack

📊 Performance Results

File Type Testing

| File Type | Size | Deduplication | Chunks | Result |
|---|---|---|---|---|
| Text (unique) | 28B | 0% | 1 | ✅ Expected |
| Text (edited) | 46B | 0% | 1 | ✅ Expected |
| Small binary | 32B | 0% | 1 | ✅ Expected |
| Large binary | 5MB | 0% | 640 | ✅ Expected |
| Compressed (.zip) | 204B | 0% | 3 | ✅ Expected |
| Compressed (.gz) | 60B | 0% | 1 | ✅ Expected |
| Repetitive | 10MB | 99.92% | 1280 | ✅ Excellent |
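The 99.92% figure for the repetitive file is consistent with a simple reading of the table: if all 1280 chunks were identical and only one had to be stored, the ratio would be 1 - 1/1280. That interpretation is an assumption, since the source does not say how the ratio is computed, and `DedupRatio` below is a hypothetical helper written only to check the arithmetic.

```go
package main

import "fmt"

// DedupRatio returns the percentage of chunks that did not need to be stored,
// given the total number of chunks and how many of them were unique.
func DedupRatio(totalChunks, uniqueChunks int) float64 {
	if totalChunks == 0 {
		return 0
	}
	return 100 * float64(totalChunks-uniqueChunks) / float64(totalChunks)
}

func main() {
	// Repetitive 10MB row: 1280 chunks, assumed to collapse to 1 unique chunk.
	fmt.Printf("%.2f%%\n", DedupRatio(1280, 1)) // prints 99.92%

	// Rows without duplicates: every chunk is unique.
	fmt.Printf("%.2f%%\n", DedupRatio(640, 640)) // prints 0.00%
}
```

If "MB" in the table means MiB, both large files also work out to 8192 bytes per chunk (5 MiB / 640 and 10 MiB / 1280).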

Performance Metrics

🚀 Quick Start

Prerequisites

1. Clone the Repository

git clone https://github.com/radhakrish-venkat/dedupe-engine.git
cd dedupe-engine

2. Start the Services

docker-compose up -d

This will start:

3. Test the System

# Test with a small file
docker run --rm --network dedupe-engine_dedupe-net \
  -v $(pwd):/data dedupe-engine-stream-handler \
  -file /data/test-file.txt -ingest-addr ingest-node:50051

4. Monitor Services

# Check service status
docker-compose ps

# View logs
docker-compose logs -f ingest-node

☸️ Kubernetes Deployment

Prerequisites

1. Build and Load Images

# Build all images (from project root)
docker build -f Dockerfile.data-storage -t dedupe-engine-data-storage-node:latest .
docker build -f Dockerfile.ingest -t dedupe-engine-ingest-node:latest .
docker build -f Dockerfile.stream-handler -t dedupe-engine-stream-handler:latest .

# Load images into kind cluster (replace 'desktop' with your cluster name if different)
kind load docker-image dedupe-engine-data-storage-node:latest --name desktop
kind load docker-image dedupe-engine-ingest-node:latest --name desktop
kind load docker-image dedupe-engine-stream-handler:latest --name desktop

2. Deploy to Kubernetes

cd k8s
./deploy-no-build.sh

3. Test the Deployment

kubectl apply -f stream-handler-job.yaml

# Check job status
kubectl get pods -n dedupe-engine

# View job logs
kubectl logs job/stream-handler-test -n dedupe-engine

Troubleshooting

Cleanup

# Remove all resources and namespace
./cleanup.sh

🔍 API Reference

gRPC Services

BackupService (Ingest Node)

service BackupService {
  rpc StreamBackup(stream BackupRequest) returns (stream BackupResponse);
}

StorageService (Data Storage Node)

service StorageService {
  rpc StoreChunk(StoreChunkRequest) returns (StoreChunkResponse);
  rpc RetrieveChunk(RetrieveChunkRequest) returns (RetrieveChunkResponse);
}
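To make the store/retrieve contract concrete, here is a minimal in-memory sketch that mirrors the two StorageService RPCs as plain Go methods. Everything in it is an assumption for illustration: the source does not show the request or response message fields, so `MemoryStore`, its method signatures, and `ErrChunkNotFound` are hypothetical, and SHA-256 again stands in for BLAKE3. The real service stores chunks in MinIO and is reached over gRPC.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// ErrChunkNotFound is returned when no chunk exists for a hash (hypothetical name).
var ErrChunkNotFound = errors.New("chunk not found")

// MemoryStore is a hypothetical in-memory stand-in for the data storage node.
type MemoryStore struct {
	mu     sync.RWMutex
	chunks map[string][]byte // content hash -> chunk bytes
}

// NewMemoryStore returns an empty store.
func NewMemoryStore() *MemoryStore {
	return &MemoryStore{chunks: make(map[string][]byte)}
}

// StoreChunk saves data under its content hash and reports whether it was new.
// Storing the same bytes twice keeps a single copy.
func (s *MemoryStore) StoreChunk(data []byte) (hash string, stored bool) {
	sum := sha256.Sum256(data) // stand-in for BLAKE3
	hash = hex.EncodeToString(sum[:])
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.chunks[hash]; exists {
		return hash, false
	}
	s.chunks[hash] = append([]byte(nil), data...) // copy so callers cannot mutate it
	return hash, true
}

// RetrieveChunk returns a copy of the chunk stored under hash.
func (s *MemoryStore) RetrieveChunk(hash string) ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	data, ok := s.chunks[hash]
	if !ok {
		return nil, ErrChunkNotFound
	}
	return append([]byte(nil), data...), nil
}

func main() {
	store := NewMemoryStore()
	hash, stored := store.StoreChunk([]byte("hello"))
	fmt.Println("first store, new chunk:", stored)
	_, stored = store.StoreChunk([]byte("hello"))
	fmt.Println("second store, new chunk:", stored)
	data, err := store.RetrieveChunk(hash)
	fmt.Println(string(data), err)
}
```

Keying chunks by their content hash is what lets a second `StoreChunk` call with identical bytes become a no-op.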

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| COCKROACH_HOST | localhost | CockroachDB host |
| COCKROACH_PORT | 26257 | CockroachDB port |
| MINIO_ENDPOINT | localhost:9000 | MinIO endpoint |
| MINIO_ACCESS_KEY | minioadmin | MinIO access key |
| MINIO_SECRET_KEY | minioadmin | MinIO secret key |

📦 GitHub Repository

The complete source code, issue tracker, and discussions are on GitHub: https://github.com/radhakrish-venkat/dedupe-engine

Repository Features