$cat blogs/sparse-matrix-optimization-ondc.md

Optimizing 300 Billion Data Points: Building ONDC's High-Performance Serviceability System

How we built a sub-50ms query system handling 20M+ merchants across 30K+ pincodes using Apache HBase and smart caching strategies. A deep dive into scaling sparse matrix operations.

2024-11-28
5 min read
Blog
system-design, performance, hbase, optimization, scale, ondc


Optimizing 300 Billion Data Points: Building ONDC's High-Performance Serviceability System

Picture this: You're building a system where 20 million merchants need to define which of India's 30,000+ pincodes they can serve. That's potentially 600 billion data points in a sparse matrix where 99% of the values are empty.

Oh, and it needs to respond in under 50ms while handling 10,000+ requests per minute.

Welcome to the ONDC Pincode Serviceability Challenge - and how we built a system that earned us TOP 7 National Finalist status at Build for Bharat 2024.

The Problem: Scale Meets Speed

ONDC (Open Network for Digital Commerce) separates serviceability definition from verification. Merchants define their serviceable pincodes, while buyer apps verify if a merchant can serve a specific location. Simple concept, massive scale problem.

The numbers that kept us awake:

  • 30,000+ Indian pincodes
  • 20+ million potential merchants
  • 10% enabling pincode serviceability = 2M active merchants
  • Sparse matrix: 2M × 30K = 60 billion potential data points
  • Target: Sub-50ms response time
  • Throughput: 10,000+ requests/minute

The Architecture: HBase + Smart Caching

Why Apache HBase?

Traditional databases crumble under this scale. We needed something built for:

  • Sparse data storage - Most merchant-pincode combinations don't exist
  • Horizontal scaling - Linear performance as data grows
  • Fast random access - Sub-millisecond key-value lookups
  • Column-family storage - Efficient sparse matrix representation
// HBase table design for optimal sparse storage
// Row Key:          merchant_id
// Column Family:    'pincodes'
// Column Qualifier: pincode
// Value:            serviceability_metadata
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ServiceabilityTable {
    private static final String TABLE_NAME = "merchant_serviceability";
    private static final String CF_PINCODES = "pincodes";

    // Handle to the TABLE_NAME table, obtained from the HBase Connection
    private final Table table;

    public ServiceabilityTable(Table table) {
        this.table = table;
    }

    public void addServiceability(String merchantId, List<String> pincodes) throws IOException {
        Put put = new Put(Bytes.toBytes(merchantId));
        for (String pincode : pincodes) {
            // One sparse cell per serviceable pincode; absent combinations cost nothing
            put.addColumn(
                Bytes.toBytes(CF_PINCODES),
                Bytes.toBytes(pincode),
                Bytes.toBytes("ACTIVE")
            );
        }
        table.put(put);
    }
}
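The write path above is keyed by merchant, but buyer-side queries run the other way (pincode → merchants), which a merchant-keyed table can only answer with a full scan. The pincode lookup in the next section implies a second, pincode-keyed reverse-index table; the post doesn't show it explicitly, so the sketch below is one plausible shape (table and class names are assumptions):

// Hypothetical reverse-index table: Row Key = pincode,
// Column Family 'merchants', Column Qualifier = merchant_id.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PincodeIndex {
    private static final byte[] CF_MERCHANTS = Bytes.toBytes("merchants");

    // Handle to the assumed "pincode_merchants" reverse-index table
    private final Table indexTable;

    public PincodeIndex(Table indexTable) {
        this.indexTable = indexTable;
    }

    // Single-row read: every merchant serving this pincode comes back
    // as a column qualifier of one sparse row.
    public List<String> getMerchants(String pincode) throws IOException {
        Result row = indexTable.get(new Get(Bytes.toBytes(pincode)).addFamily(CF_MERCHANTS));
        NavigableMap<byte[], byte[]> cells = row.getFamilyMap(CF_MERCHANTS);
        List<String> merchants = new ArrayList<>();
        if (cells != null) {
            for (byte[] qualifier : cells.keySet()) {
                merchants.add(Bytes.toString(qualifier));
            }
        }
        return merchants;
    }
}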

The Smart Caching Layer

Raw HBase performance wasn't enough. We implemented a multi-tier caching strategy:

@Service
public class ServiceabilityService {

    private final Cache<String, List<String>> localCache;            // L1: Caffeine
    private final RedisTemplate<String, List<String>> redisTemplate; // L2: Redis cluster

    public ServiceabilityService(Cache<String, List<String>> localCache,
                                 RedisTemplate<String, List<String>> redisTemplate) {
        this.localCache = localCache;
        this.redisTemplate = redisTemplate;
    }

    public List<String> getMerchantsByPincode(String pincode) {
        // L1: local in-process cache (Caffeine) - ~1ms lookup
        List<String> cached = localCache.getIfPresent(pincode);
        if (cached != null) return cached;

        // L2: Redis cluster - ~5ms lookup
        cached = redisTemplate.opsForValue().get("merchants:" + pincode);
        if (cached != null) {
            localCache.put(pincode, cached);
            return cached;
        }

        // L3: HBase - ~20ms lookup; the result is written back to L1/L2
        return fetchFromHBase(pincode);
    }
}

This approach reduced database queries by 70% and achieved our sub-50ms target.

Performance Optimizations: The Details That Matter

1. Batch Processing for Bulk Uploads

Merchants don't add pincodes one by one - they upload CSVs with thousands of entries. We optimized for this:

public class BulkUploadProcessor {
    private static final int BATCH_SIZE = 1000;

    // Handle to the merchant_serviceability table
    private final Table table;

    public BulkUploadProcessor(Table table) {
        this.table = table;
    }

    public void processBulkUpload(String merchantId, List<String> pincodes) throws IOException {
        List<Put> puts = new ArrayList<>();
        for (String pincode : pincodes) {
            puts.add(createPut(merchantId, pincode));
            // Flush to HBase in fixed-size batches to bound memory usage
            if (puts.size() >= BATCH_SIZE) {
                table.put(puts);
                puts.clear();
            }
        }
        // Flush the final partial batch
        if (!puts.isEmpty()) {
            table.put(puts);
        }
    }

    private Put createPut(String merchantId, String pincode) {
        Put put = new Put(Bytes.toBytes(merchantId));
        put.addColumn(Bytes.toBytes("pincodes"), Bytes.toBytes(pincode), Bytes.toBytes("ACTIVE"));
        return put;
    }
}

Result: 40-50 rows/second processing speed for bulk uploads.

2. Intelligent Row Key Design

HBase performance depends heavily on row key design. Our strategy:

// Bad: sequential keys concentrate writes on a single region (hotspotting)
String badRowKey = merchantId;

// Good: a short hash prefix spreads rows evenly across regions
String goodRowKey = DigestUtils.md5Hex(merchantId).substring(0, 4) + "_" + merchantId;

This distributed load evenly across HBase regions, preventing bottlenecks.
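The hash prefix only pays off if the table's regions are actually split along those prefixes, so that writes land on different region servers from day one. A minimal sketch of pre-splitting at table-creation time using the HBase 2.x Admin API; splitting on the first hex character into 16 regions is an illustrative choice, not the exact split scheme we used:

import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class TableBootstrap {
    // Pre-split merchant_serviceability on the first hex character of the
    // MD5 prefix: 16 regions covering 0-f, so load spreads immediately.
    public void createPreSplitTable(Admin admin) throws IOException {
        String hexChars = "123456789abcdef"; // 15 split points -> 16 regions
        byte[][] splits = new byte[hexChars.length()][];
        for (int i = 0; i < hexChars.length(); i++) {
            splits[i] = Bytes.toBytes(String.valueOf(hexChars.charAt(i)));
        }
        admin.createTable(
            TableDescriptorBuilder.newBuilder(TableName.valueOf("merchant_serviceability"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("pincodes"))
                .build(),
            splits
        );
    }
}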

3. Parallel Query Processing

For multi-pincode queries, we process them concurrently:

public Map<String, List<String>> getMerchantsForPincodes(List<String> pincodes) {
    Map<String, List<String>> results = new ConcurrentHashMap<>();
    pincodes.parallelStream().forEach(pincode -> {
        List<String> merchants = getMerchantsByPincode(pincode);
        results.put(pincode, merchants);
    });
    return results;
}
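One caveat worth flagging: parallelStream() runs on the JVM's shared ForkJoinPool, which is sized for CPU-bound work, so blocking it on cache and HBase I/O can starve everything else using that pool. A sketch of the same fan-out on a dedicated executor; the pool size here is an assumption to tune, not a measured value:

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelLookup {
    // Dedicated pool for blocking lookups, isolated from the common ForkJoinPool
    private final ExecutorService lookupPool = Executors.newFixedThreadPool(32);
    private final ServiceabilityService serviceabilityService;

    public ParallelLookup(ServiceabilityService serviceabilityService) {
        this.serviceabilityService = serviceabilityService;
    }

    public Map<String, List<String>> getMerchantsForPincodes(List<String> pincodes) {
        Map<String, List<String>> results = new ConcurrentHashMap<>();
        CompletableFuture.allOf(
            pincodes.stream()
                .map(p -> CompletableFuture.runAsync(
                    () -> results.put(p, serviceabilityService.getMerchantsByPincode(p)),
                    lookupPool))
                .toArray(CompletableFuture[]::new)
        ).join(); // wait for every pincode lookup to finish
        return results;
    }
}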

Real-World Performance Results

After optimizations, our system delivered:

  • Query Response: < 50ms for single pincode lookup
  • Throughput: 10,000+ requests per minute
  • Storage Efficiency: 90%+ compression for sparse data
  • Concurrent Users: 1000+ simultaneous connections
  • Bulk Upload: 40-50 rows/second processing

The Spring Boot Microservices Architecture

We built separate services for different concerns:

Seller API Service

@RestController
@RequestMapping("/seller")
public class SellerController {

    @PostMapping("/upload/csv")
    public ResponseEntity<UploadResponse> uploadCsv(
            @RequestParam("file") MultipartFile file) throws IOException {
        String processingId = UUID.randomUUID().toString();

        // Read the bytes before going async: the multipart temp file can be
        // cleaned up as soon as this request returns
        byte[] csvBytes = file.getBytes();

        // Async processing; the caller tracks progress via the returned id
        CompletableFuture.runAsync(() ->
            bulkUploadService.processCsv(processingId, csvBytes)
        );

        return ResponseEntity.ok(new UploadResponse(processingId));
    }
}

Buyer API Service

@RestController
@RequestMapping("/buyer")
public class BuyerController {

    @GetMapping("/merchants")
    public ResponseEntity<Map<String, List<String>>> getMerchants(
            @RequestParam String pincodes,
            @RequestParam(defaultValue = "pincodes") String mode) {
        // Comma-separated pincodes, e.g. ?pincodes=560001,560002
        List<String> pincodeList = Arrays.asList(pincodes.split(","));

        // Returns pincode -> merchants, matching getMerchantsForPincodes
        Map<String, List<String>> merchants = serviceabilityService
            .getMerchantsForPincodes(pincodeList);

        return ResponseEntity.ok(merchants);
    }
}

Key Lessons for Building High-Scale Systems

1. Choose the Right Database for Your Access Patterns

Traditional RDBMS couldn't handle our sparse matrix efficiently. HBase's column-family storage was perfect for our use case.

2. Cache Strategically, Not Everything

Our multi-tier caching reduced database load by 70%, but we only cached frequently accessed data to avoid memory bloat.
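In practice that means bounding the L1 cache rather than letting it grow with the keyspace. A minimal sketch of a bounded Caffeine cache; the size and TTL values here are illustrative, not the ones we shipped:

import java.time.Duration;
import java.util.List;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class LocalCacheConfig {
    // Bounded L1: size-capped and TTL'd, so hot pincodes stay resident
    // while cold entries are evicted instead of bloating the heap.
    public Cache<String, List<String>> pincodeMerchantsCache() {
        return Caffeine.newBuilder()
            .maximumSize(10_000)                      // hot pincodes only, not all 30K+
            .expireAfterWrite(Duration.ofMinutes(10)) // tolerate slightly stale serviceability
            .recordStats()                            // expose hit rate for tuning
            .build();
    }
}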

3. Design for Your Data Distribution

Understanding that 99% of merchant-pincode combinations would be empty guided our entire architecture.
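Concretely: at a ~1% fill rate, the 2M × 30K matrix holds roughly 0.01 × 60 × 10⁹ ≈ 600 million populated cells. Because HBase stores only the cells that actually exist, that is about a 100× reduction in stored data before compression even applies.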

4. Optimize for Bulk Operations

Real users don't make single requests - they upload thousands of records at once. Design for this reality.

5. Monitor What Matters

We tracked P95 response times, not just averages. That 95th percentile tells you about real user experience.
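A sketch of how P95 tracking can be wired up with Micrometer; the metric name and percentile list are ours, chosen for illustration:

import java.util.List;
import java.util.function.Supplier;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class LookupMetrics {
    private final Timer lookupTimer;

    public LookupMetrics(MeterRegistry registry) {
        // Publish P95/P99, not just the mean: tail latency is what users feel
        this.lookupTimer = Timer.builder("serviceability.lookup")
            .publishPercentiles(0.95, 0.99)
            .register(registry);
    }

    // Wrap any lookup so its latency lands in the timer
    public List<String> timedLookup(Supplier<List<String>> lookup) {
        return lookupTimer.record(lookup);
    }
}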

Beyond the Hackathon: Production Considerations

While our hackathon solution proved the concept, production deployment would need:

  • Data partitioning across geographic regions
  • Disaster recovery with cross-region replication
  • Rate limiting to prevent abuse (see the sketch after this list)
  • Monitoring and alerting for system health
  • A/B testing framework for optimizations
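Of these, rate limiting maps most directly onto code. A minimal per-client token bucket; class names and parameters are ours, and a real deployment would more likely lean on an API gateway or an existing library:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Refills ratePerSecond tokens per second per client and rejects
// requests once a client's bucket runs dry.
public class TokenBucketLimiter {
    private final double capacity;
    private final double ratePerSecond;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    public TokenBucketLimiter(double capacity, double ratePerSecond) {
        this.capacity = capacity;
        this.ratePerSecond = ratePerSecond;
    }

    public boolean tryAcquire(String clientId) {
        Bucket b = buckets.computeIfAbsent(clientId, id -> new Bucket(capacity));
        synchronized (b) {
            long now = System.nanoTime();
            // Refill proportionally to elapsed time, capped at capacity
            b.tokens = Math.min(capacity, b.tokens + (now - b.lastRefill) * ratePerSecond / 1e9);
            b.lastRefill = now;
            if (b.tokens < 1.0) return false; // bucket empty: reject
            b.tokens -= 1.0;
            return true;
        }
    }

    private static final class Bucket {
        double tokens;
        long lastRefill = System.nanoTime();
        Bucket(double tokens) { this.tokens = tokens; }
    }
}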

The Impact

This project demonstrated that with the right architecture, you can build systems that handle massive scale while maintaining blazing-fast performance. The ONDC network can now efficiently verify serviceability for millions of merchants across thousands of pincodes in real-time.

$ Terminal v1.0.25 █