BEP XET: Xet Protocol Extension for Content-Defined Chunking and Deduplication¶
Overview¶
The Xet Protocol Extension (BEP XET) is a BitTorrent protocol extension that enables content-defined chunking (CDC) and cross-torrent deduplication through a peer-to-peer Content Addressable Storage (CAS) system. Its goal is to let BitTorrent act as an updatable peer-to-peer file system suited to collaboration and efficient data sharing.
Rationale¶
The Xet protocol extension addresses key limitations of traditional BitTorrent:
- Fixed Piece Sizes: Traditional BitTorrent uses fixed piece sizes, leading to inefficient redistribution when files are modified. CDC adapts to content boundaries.
- No Cross-Torrent Deduplication: Each torrent is independent, even if sharing identical content. Xet enables chunk-level deduplication across torrents.
- Centralized Storage: Traditional CAS systems require external services. Xet builds CAS directly into the BitTorrent network using DHT and trackers.
- Inefficient Updates: Updating a shared file requires redistributing the entire file. Xet only redistributes changed chunks.
Together, CDC, deduplication, and P2P CAS allow a modified file to be re-shared by transferring only its changed chunks, without relying on an external storage service.
Key Features¶
- Content-Defined Chunking (CDC): Gearhash-based variable-size file segmentation (8 KB to 128 KB chunks)
- Cross-Torrent Deduplication: Chunk-level deduplication across multiple torrents
- Peer-to-Peer CAS: Decentralized Content Addressable Storage using DHT and trackers
- Merkle Tree Verification: BLAKE3-256 hashing with SHA-256 fallback for integrity
- Xorb Format: Efficient storage format for grouping multiple chunks
- Shard Format: Metadata storage for file information and CAS data
- LZ4 Compression: Optional compression for Xorb data
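The hashing and Merkle tree features listed above can be illustrated with a short sketch. It assumes the optional third-party `blake3` package for BLAKE3-256 and falls back to SHA-256 from the standard library when it is missing. The pairing rule (duplicate the last node on odd levels) and the names `chunk_hash` and `merkle_root` are assumptions for illustration and may differ from what ccBitTorrent does.

```python
import hashlib

try:
    import blake3  # optional third-party package

    def chunk_hash(data: bytes) -> bytes:
        """Return a 32-byte BLAKE3 digest."""
        return blake3.blake3(data).digest()
except ImportError:
    def chunk_hash(data: bytes) -> bytes:
        """Return a 32-byte SHA-256 digest (fallback)."""
        return hashlib.sha256(data).digest()


def merkle_root(leaf_hashes: list[bytes]) -> bytes:
    """Fold a list of 32-byte chunk hashes into a single root hash."""
    if not leaf_hashes:
        return chunk_hash(b"")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed rule: duplicate the last node
        level = [
            chunk_hash(level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Because the root depends on every leaf, changing any single chunk changes the root, which is what allows a root hash to identify a file version.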
Use Cases¶
1. Collaborative File Sharing¶
Xet enables efficient collaboration by:
- Deduplication: Shared files across multiple torrents share the same chunks
- Fast Updates: Only changed chunks need to be redistributed
- Version Control: Track file versions through Merkle tree roots
2. Large File Distribution¶
For large files or datasets:
- Content-Defined Chunking: Intelligent boundaries reduce chunk redistribution on edits
- Parallel Downloads: Download chunks from multiple peers simultaneously
- Resume Capability: Track individual chunks for reliable resume
3. Peer-to-Peer File System¶
Transform BitTorrent into a P2P file system:
- CAS Integration: Chunks stored in DHT for global availability
- Metadata Storage: Shards provide file system metadata
- Fast Lookups: Direct chunk access via hash eliminates need for full torrent download
Implementation Status¶
The Xet protocol extension is fully implemented in ccBitTorrent:
- ✅ Content-Defined Chunking (Gearhash CDC)
- ✅ BLAKE3-256 hashing with SHA-256 fallback
- ✅ SQLite deduplication cache
- ✅ DHT integration (BEP 44)
- ✅ Tracker integration
- ✅ Xorb and Shard formats
- ✅ Merkle tree computation
- ✅ BitTorrent protocol extension (BEP 10)
- ✅ CLI integration
- ✅ Configuration management
Configuration¶
CLI Commands¶
# Enable Xet protocol
ccbt xet enable
# Show Xet status
ccbt xet status
# Show deduplication statistics
ccbt xet stats
# Clean up unused chunks
ccbt xet cleanup --max-age-days 30
Enable Xet Protocol¶
Configure Xet support in ccbt.toml:
[disk]
# Xet Protocol Configuration
xet_enabled = false # Enable Xet protocol
xet_chunk_min_size = 8192 # Minimum chunk size (bytes)
xet_chunk_max_size = 131072 # Maximum chunk size (bytes)
xet_chunk_target_size = 16384 # Target chunk size (bytes)
xet_deduplication_enabled = true # Enable chunk-level deduplication
xet_cache_db_path = "data/xet_cache.db" # SQLite cache database path
xet_chunk_store_path = "data/xet_chunks" # Chunk storage directory
xet_use_p2p_cas = true # Use P2P Content Addressable Storage
xet_compression_enabled = true # Enable LZ4 compression for Xorb data
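The three chunk-size settings are related: the target size is expected to lie between the minimum and maximum. A minimal sketch of that check follows, operating on an already-parsed `[disk]` table. The function name `validate_xet_chunk_sizes` is hypothetical and not part of ccBitTorrent.

```python
def validate_xet_chunk_sizes(disk_cfg: dict) -> None:
    """Raise ValueError unless 0 < min <= target <= max."""
    lo = disk_cfg["xet_chunk_min_size"]
    target = disk_cfg["xet_chunk_target_size"]
    hi = disk_cfg["xet_chunk_max_size"]
    if not (0 < lo <= target <= hi):
        raise ValueError(
            f"expected 0 < min ({lo}) <= target ({target}) <= max ({hi})"
        )


# Values taken from the configuration shown above.
defaults = {
    "xet_chunk_min_size": 8192,
    "xet_chunk_max_size": 131072,
    "xet_chunk_target_size": 16384,
}
validate_xet_chunk_sizes(defaults)  # passes silently
```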
Protocol Specification¶
Message Types¶
The Xet extension defines four message types:
- CHUNK_REQUEST (0x01): Request a specific chunk by hash
- CHUNK_RESPONSE (0x02): Response containing chunk data
- CHUNK_NOT_FOUND (0x03): Peer does not have the requested chunk
- CHUNK_ERROR (0x04): Error occurred while retrieving chunk
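As a rough illustration of how these four message types could be framed, the sketch below packs a type byte and a request ID ahead of the payload. The layout (1-byte type, 4-byte big-endian request ID, then payload) and the helper names `pack_message` and `unpack_message` are assumptions made for this example, not ccBitTorrent's actual wire format.

```python
import struct
from enum import IntEnum


class XetMessageType(IntEnum):
    CHUNK_REQUEST = 0x01
    CHUNK_RESPONSE = 0x02
    CHUNK_NOT_FOUND = 0x03
    CHUNK_ERROR = 0x04


# Assumed header: message type (1 byte) + request ID (4 bytes, big-endian).
HEADER = struct.Struct(">BI")


def pack_message(msg_type: XetMessageType, request_id: int,
                 payload: bytes = b"") -> bytes:
    """Prefix the payload with the assumed 5-byte header."""
    return HEADER.pack(msg_type, request_id) + payload


def unpack_message(data: bytes) -> tuple[XetMessageType, int, bytes]:
    """Split a message into (type, request_id, payload)."""
    if len(data) < HEADER.size:
        raise ValueError("message shorter than header")
    msg_type, request_id = HEADER.unpack_from(data)
    # XetMessageType() raises ValueError for unknown type bytes.
    return XetMessageType(msg_type), request_id, data[HEADER.size:]
```

Under this layout a CHUNK_REQUEST carries a 32-byte chunk hash as its payload, and a CHUNK_RESPONSE carries the chunk data.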
Message Format¶
CHUNK_REQUEST¶
CHUNK_RESPONSE¶
CHUNK_NOT_FOUND¶
CHUNK_ERROR¶
Extension Handshake¶
The Xet extension follows the BEP 10 (Extension Protocol) handshake:
- Client sends ut_metadata extension handshake with Xet extension ID
- Server responds with Xet extension ID and message ID mapping
- Messages are sent using the assigned extension message ID
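In BEP 10, the extended handshake is a bencoded dictionary whose `m` key maps extension names to the message IDs the sender has assigned to them. The sketch below builds such a handshake with a tiny bencoder. The extension name `xet` and the ID `3` are assumptions for illustration; the name ccBitTorrent actually registers is not stated here.

```python
import struct


def bencode(value) -> bytes:
    """Minimal bencoder for ints, bytes, str, and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, dict):
        items = [
            (k.encode("utf-8") if isinstance(k, str) else k, v)
            for k, v in value.items()
        ]
        items.sort(key=lambda kv: kv[0])  # bencode requires sorted keys
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_extended_handshake(extensions: dict[str, int]) -> bytes:
    """Frame a BEP 10 handshake: length prefix, message ID 20, extended ID 0."""
    payload = bencode({"m": extensions})
    return struct.pack(">IBB", len(payload) + 2, 20, 0) + payload


frame = build_extended_handshake({"xet": 3})  # name and ID are hypothetical
```

After both sides exchange handshakes, each peer sends Xet messages using the ID that the remote peer advertised for the extension.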
Chunk Discovery¶
Chunks are discovered through multiple mechanisms:
- DHT (BEP 44): Store and retrieve chunk metadata using DHT
- Trackers: Announce chunk availability to trackers
- Peer Exchange: Exchange chunk availability information with peers
- Torrent Metadata: Extract chunk hashes from torrent Xet metadata
Architecture¶
Core Components¶
1. Protocol Extension (ccbt/extensions/xet.py)¶
The Xet extension implements BEP 10 (Extension Protocol) messages for chunk requests and responses.
Xet Protocol Extension implementation.
Initialize Xet Extension.
Source code in ccbt/extensions/xet.py
decode_chunk_request(data: bytes) -> tuple[int, bytes]
¶
Decode chunk request message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Encoded request message | required |
Returns:
| Type | Description |
|---|---|
| tuple[int, bytes] | Tuple of (request_id, chunk_hash) |
Raises:
| Type | Description |
|---|---|
| ValueError | If message is invalid |
Source code in ccbt/extensions/xet.py
decode_chunk_response(data: bytes) -> tuple[int, bytes]
¶
Decode chunk response message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Encoded response message | required |
Returns:
| Type | Description |
|---|---|
| tuple[int, bytes] | Tuple of (request_id, chunk_data) |
Raises:
| Type | Description |
|---|---|
| ValueError | If message is invalid |
Source code in ccbt/extensions/xet.py
decode_handshake(data: dict[str, Any]) -> bool
¶
Decode Xet extension handshake data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Extension handshake data dictionary | required |
Returns:
| Type | Description |
|---|---|
| bool | True if peer supports Xet extension |
Source code in ccbt/extensions/xet.py
encode_chunk_error(request_id: int, error_code: int = 0) -> bytes
¶
Encode chunk error message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request_id | int | Request ID | required |
| error_code | int | Error code (0 = generic error) | 0 |
Returns:
| Type | Description |
|---|---|
| bytes | Encoded error message |
Source code in ccbt/extensions/xet.py
encode_chunk_not_found(request_id: int) -> bytes
¶
Encode chunk not found message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request_id | int | Request ID | required |
Returns:
| Type | Description |
|---|---|
| bytes | Encoded not found message |
Source code in ccbt/extensions/xet.py
encode_chunk_request(chunk_hash: bytes) -> bytes
¶
Encode chunk request message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| bytes | Encoded request message |
Source code in ccbt/extensions/xet.py
encode_chunk_response(request_id: int, chunk_data: bytes) -> bytes
¶
Encode chunk response message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request_id | int | Request ID to respond to | required |
| chunk_data | bytes | Chunk data bytes | required |
Returns:
| Type | Description |
|---|---|
| bytes | Encoded response message |
Source code in ccbt/extensions/xet.py
encode_handshake() -> dict[str, Any]
¶
Encode Xet extension handshake data.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing Xet extension capabilities |
Source code in ccbt/extensions/xet.py
get_capabilities() -> dict[str, Any]
¶
Get Xet extension capabilities.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Capabilities dictionary |
Source code in ccbt/extensions/xet.py
handle_chunk_request(peer_id: str, request_id: int, chunk_hash: bytes) -> bytes
async
¶
Handle chunk request from peer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| peer_id | str | Peer identifier | required |
| request_id | int | Request ID | required |
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| bytes | Response message (chunk data, not found, or error) |
Source code in ccbt/extensions/xet.py
handle_chunk_response(peer_id: str, request_id: int, chunk_data: bytes) -> None
async
¶
Handle chunk response from peer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| peer_id | str | Peer identifier | required |
| request_id | int | Request ID | required |
| chunk_data | bytes | Chunk data bytes | required |
Source code in ccbt/extensions/xet.py
set_chunk_provider(provider: Callable[[bytes], bytes | None]) -> None
¶
Set function to provide chunks by hash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| provider | Callable[[bytes], bytes \| None] | Callable that takes chunk_hash (32 bytes) and returns chunk data bytes or None if not available | required |
Source code in ccbt/extensions/xet.py
Message Types:
```python
# ccbt/extensions/xet.py
from enum import IntEnum


class XetMessageType(IntEnum):
    """Xet Extension message types."""

    CHUNK_REQUEST = 0x01  # Request chunk by hash
    CHUNK_RESPONSE = 0x02  # Response with chunk data
    CHUNK_NOT_FOUND = 0x03  # Chunk not available
    CHUNK_ERROR = 0x04  # Error retrieving chunk
```
Key Methods:
- encode_chunk_request(): ccbt/extensions/xet.py:89 - Encode chunk request message with request ID
- decode_chunk_request(): ccbt/extensions/xet.py:108 - Decode chunk request message
- encode_chunk_response(): ccbt/extensions/xet.py:136 - Encode chunk response with data
- handle_chunk_request(): ccbt/extensions/xet.py:210 - Handle incoming chunk request from peer
- handle_chunk_response(): ccbt/extensions/xet.py:284 - Handle chunk response from peer
Extension Handshake:
- encode_handshake(): ccbt/extensions/xet.py:61 - Encode Xet extension capabilities
- decode_handshake(): ccbt/extensions/xet.py:75 - Decode peer's Xet extension capabilities
2. Content-Defined Chunking (ccbt/storage/xet_chunking.py)¶
Gearhash CDC algorithm for intelligent file segmentation with variable-sized chunks based on content patterns.
Content-defined chunking using Gearhash algorithm.
The Gearhash algorithm uses a rolling hash with a precomputed gear table to find content-defined chunk boundaries. This ensures that similar content in different files will produce the same chunk boundaries, enabling cross-file deduplication.
Attributes:
| Name | Type | Description |
|---|---|---|
| target_size | | Target average chunk size (default: 16 KB) |
| gear_table | | Precomputed 256-element gear table for rolling hash |
Initialize chunker with target chunk size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target_size | int | Target average chunk size in bytes (default: 16 KB). Must be between MIN_CHUNK_SIZE and MAX_CHUNK_SIZE. | TARGET_CHUNK_SIZE |
Source code in ccbt/storage/xet_chunking.py
chunk_buffer(data: bytes) -> list[bytes]
¶
Chunk data using Gearhash CDC.
This method processes the input data and finds content-defined chunk boundaries using the Gearhash rolling hash algorithm. Chunks will be between MIN_CHUNK_SIZE and MAX_CHUNK_SIZE bytes, with an average size close to target_size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Input data to chunk | required |
Returns:
| Type | Description |
|---|---|
| list[bytes] | List of chunks, each between MIN_CHUNK_SIZE and MAX_CHUNK_SIZE bytes |
Source code in ccbt/storage/xet_chunking.py
chunk_file(file_path: str, chunk_size_hint: int = 1024 * 1024) -> Iterator[bytes]
¶
Chunk a file using Gearhash CDC.
This method reads a file in chunks and applies CDC chunking, yielding content-defined chunks as they are found.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to file to chunk | required |
| chunk_size_hint | int | Hint for read buffer size (default: 1 MB) | 1024 * 1024 |
Yields:
| Type | Description |
|---|---|
| bytes | Content-defined chunks (bytes) |
Source code in ccbt/storage/xet_chunking.py
chunk_stream(stream: Iterator[bytes]) -> Iterator[bytes]
¶
Chunk a stream of data using Gearhash CDC.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| stream | Iterator[bytes] | Iterator yielding bytes chunks | required |
Yields:
| Type | Description |
|---|---|
| bytes | Content-defined chunks (bytes) |
Source code in ccbt/storage/xet_chunking.py
Constants:
- MIN_CHUNK_SIZE: ccbt/storage/xet_chunking.py:21 - 8 KB minimum chunk size
- MAX_CHUNK_SIZE: ccbt/storage/xet_chunking.py:22 - 128 KB maximum chunk size
- TARGET_CHUNK_SIZE: ccbt/storage/xet_chunking.py:23 - 16 KB default target chunk size
- WINDOW_SIZE: ccbt/storage/xet_chunking.py:24 - 48 bytes rolling hash window
Key Methods:
- chunk_buffer(): ccbt/storage/xet_chunking.py:210 - Chunk data using Gearhash CDC algorithm
- _find_chunk_boundary(): ccbt/storage/xet_chunking.py:242 - Find content-defined chunk boundary using rolling hash
- _init_gear_table(): ccbt/storage/xet_chunking.py:54 - Initialize precomputed gear table for rolling hash
Algorithm: The Gearhash algorithm uses a rolling hash with a precomputed 256-element gear table to find content-defined boundaries. This ensures similar content in different files produces the same chunk boundaries, enabling cross-file deduplication.
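The boundary search described above can be sketched in a few lines. This is a simplified illustration and not the code in ccbt/storage/xet_chunking.py: the gear table is filled from a fixed-seed generator, the boundary test masks the top bits of a 64-bit hash, and WINDOW_SIZE is not modeled. All of these choices are assumptions.

```python
import random

MIN_CHUNK_SIZE = 8 * 1024
MAX_CHUNK_SIZE = 128 * 1024
TARGET_CHUNK_SIZE = 16 * 1024
MASK64 = (1 << 64) - 1

# Assumed gear table: 256 pseudo-random 64-bit values from a fixed seed,
# so every peer derives the same table.
_rng = random.Random(0x5E7)
GEAR_TABLE = [_rng.getrandbits(64) for _ in range(256)]


def boundary_mask(target_size: int) -> int:
    """Mask with log2(target_size) high bits set (14 bits for 16 KB)."""
    bits = target_size.bit_length() - 1
    return ((1 << bits) - 1) << (64 - bits)


def chunk_buffer(data: bytes, target_size: int = TARGET_CHUNK_SIZE) -> list[bytes]:
    """Split data at content-defined boundaries."""
    mask = boundary_mask(target_size)
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear rolling hash: shift left, add the table entry for this byte.
        h = ((h << 1) + GEAR_TABLE[byte]) & MASK64
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK_SIZE and (h & mask) == 0
        if at_boundary or length >= MAX_CHUNK_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # final chunk may be shorter than the minimum
    return chunks
```

Because the hash at any position depends only on the most recent bytes, an edit early in a file tends to shift boundaries only near the edit, and later chunks keep their original content and hashes.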
3. Deduplication Cache (ccbt/storage/xet_deduplication.py)¶
SQLite-based local deduplication cache with DHT integration for chunk-level deduplication.
Chunk-level deduplication manager.
Manages local deduplication cache using SQLite and provides integration with DHT for global chunk discovery.
Attributes:
| Name | Type | Description |
|---|---|---|
| cache_path | | Path to SQLite cache database |
| chunk_store_path | | Directory where chunks are physically stored |
| db | | SQLite database connection |
| dht_client | | Optional DHT client for global chunk discovery |
Initialize deduplication with local cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cache_db_path | Path \| str | Path to SQLite cache database file | required |
| dht_client | Any \| None | Optional DHT client instance for global chunk discovery | None |
Source code in ccbt/storage/xet_deduplication.py
check_chunk_exists(chunk_hash: bytes) -> Path | None
async
¶
Check if chunk exists locally.
Queries the database for the chunk hash and updates the last_accessed timestamp if found.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| Path \| None | Path to stored chunk if exists, None otherwise |
Source code in ccbt/storage/xet_deduplication.py
cleanup_unused_chunks(max_age_seconds: int = 30 * 24 * 60 * 60) -> int
async
¶
Remove chunks that haven't been accessed recently.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_age_seconds | int | Maximum age in seconds before chunk is considered unused | 30 * 24 * 60 * 60 |
Returns:
| Type | Description |
|---|---|
| int | Number of chunks removed |
Source code in ccbt/storage/xet_deduplication.py
close() -> None
¶
get_cache_stats() -> dict
¶
Get statistics about the deduplication cache.
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with cache statistics |
Source code in ccbt/storage/xet_deduplication.py
get_chunk_info(chunk_hash: bytes) -> dict | None
¶
Get information about a stored chunk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| dict \| None | Dictionary with chunk information or None if not found |
Source code in ccbt/storage/xet_deduplication.py
query_dht_for_chunk(chunk_hash: bytes) -> PeerInfo | None
async
¶
Query DHT for peers that have this chunk.
Uses existing DHT infrastructure to find peers that have the specified chunk. This enables global deduplication across the peer network.
The method:
1. Converts 32-byte chunk hash to 20-byte DHT key (using SHA-1)
2. Queries DHT using BEP 44 get_data() method
3. Parses returned value to extract peer information
4. Returns PeerInfo if found
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| PeerInfo \| None | PeerInfo if found, None otherwise |
Source code in ccbt/storage/xet_deduplication.py
remove_chunk_reference(chunk_hash: bytes) -> bool
¶
Remove a reference to a chunk.
Decrements the reference count. If ref_count reaches zero, the chunk file is deleted.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| bool | True if chunk was removed, False otherwise |
Source code in ccbt/storage/xet_deduplication.py
store_chunk(chunk_hash: bytes, chunk_data: bytes) -> Path
async
¶
Store chunk with deduplication.
Checks if chunk already exists. If it does, increments reference count. Otherwise, stores the chunk physically and creates a database entry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
| chunk_data | bytes | Chunk data to store | required |
Returns:
| Type | Description |
|---|---|
| Path | Path to stored chunk (may be existing or new) |
Source code in ccbt/storage/xet_deduplication.py
Database Schema:
- chunks table: ccbt/storage/xet_deduplication.py:65 - Stores chunk hash, size, storage path, reference count, timestamps
- Indexes: ccbt/storage/xet_deduplication.py:75 - On size and last_accessed for efficient queries
Key Methods:
- check_chunk_exists(): ccbt/storage/xet_deduplication.py:85 - Check if chunk exists locally and update access time
- store_chunk(): ccbt/storage/xet_deduplication.py:112 - Store chunk with deduplication (increments ref_count if exists)
- get_chunk_path(): ccbt/storage/xet_deduplication.py:165 - Get local storage path for chunk
- cleanup_unused_chunks(): ccbt/storage/xet_deduplication.py:201 - Remove chunks not accessed within max_age_seconds
Features:
- Reference counting: Tracks how many torrents/files reference each chunk
- Automatic cleanup: Removes unused chunks based on access time
- Physical storage: Chunks stored in xet_chunks/ directory with hash as filename
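Reference counting on top of SQLite can be sketched as follows. The column names, the index name, and the helpers `open_cache`, `add_reference`, and `remove_reference` are assumptions for illustration. The real schema lives in ccbt/storage/xet_deduplication.py and may differ.

```python
import sqlite3
import time


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (assumed) chunks table and access-time index."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS chunks (
               hash BLOB PRIMARY KEY,
               size INTEGER NOT NULL,
               storage_path TEXT NOT NULL,
               ref_count INTEGER NOT NULL DEFAULT 1,
               created_at REAL NOT NULL,
               last_accessed REAL NOT NULL)"""
    )
    db.execute(
        "CREATE INDEX IF NOT EXISTS idx_chunks_last_accessed "
        "ON chunks(last_accessed)"
    )
    return db


def add_reference(db: sqlite3.Connection, chunk_hash: bytes,
                  size: int, storage_path: str) -> int:
    """Insert a new chunk row or bump ref_count; return the new count."""
    now = time.time()
    cur = db.execute(
        "UPDATE chunks SET ref_count = ref_count + 1, last_accessed = ? "
        "WHERE hash = ?",
        (now, chunk_hash),
    )
    if cur.rowcount == 0:
        db.execute(
            "INSERT INTO chunks VALUES (?, ?, ?, 1, ?, ?)",
            (chunk_hash, size, storage_path, now, now),
        )
    db.commit()
    row = db.execute(
        "SELECT ref_count FROM chunks WHERE hash = ?", (chunk_hash,)
    ).fetchone()
    return row[0]


def remove_reference(db: sqlite3.Connection, chunk_hash: bytes) -> bool:
    """Decrement ref_count; return True when the row was deleted."""
    row = db.execute(
        "SELECT ref_count FROM chunks WHERE hash = ?", (chunk_hash,)
    ).fetchone()
    if row is None:
        return False
    if row[0] <= 1:
        db.execute("DELETE FROM chunks WHERE hash = ?", (chunk_hash,))
        db.commit()
        return True  # caller would also delete the chunk file here
    db.execute(
        "UPDATE chunks SET ref_count = ref_count - 1 WHERE hash = ?",
        (chunk_hash,),
    )
    db.commit()
    return False
```

Storing the same chunk twice only bumps the counter, so the chunk file is written once no matter how many torrents reference it.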
4. Peer-to-Peer CAS (ccbt/discovery/xet_cas.py)¶
DHT and tracker-based chunk discovery and exchange for decentralized Content Addressable Storage.
Peer-to-peer Content Addressable Storage client.
Uses DHT and trackers for chunk discovery instead of HuggingFace CAS. This enables distributed chunk storage and retrieval without external dependencies.
Attributes:
| Name | Type | Description |
|---|---|---|
| dht | | DHT client instance |
| tracker | | Optional tracker client instance |
| local_chunks | dict[bytes, str] | Dictionary mapping chunk hash to local storage path |
Initialize P2P CAS with DHT and tracker clients.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dht_client | Any \| None | DHT client instance (will be obtained from session if None) | None |
| tracker_client | Any \| None | Optional tracker client instance | None |
| key_manager | Any | Optional Ed25519KeyManager for signing chunks | None |
Source code in ccbt/discovery/xet_cas.py
announce_chunk(chunk_hash: bytes) -> None
async
¶
Announce chunk availability to DHT/trackers.
Stores chunk metadata in DHT (BEP 44) and announces to tracker if configured. Other peers can discover this chunk via hash lookup.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Source code in ccbt/discovery/xet_cas.py
download_chunk(chunk_hash: bytes, peer: PeerInfo, torrent_data: dict[str, Any] | None = None, connection_manager: Any | None = None) -> bytes
async
¶
Download chunk from peer using BitTorrent protocol extension.
Uses BEP 10 extension protocol with Xet extension for chunk requests.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
| peer | PeerInfo | Peer that has the chunk | required |
| torrent_data | dict[str, Any] \| None | Torrent data for connection (required) | None |
| connection_manager | Any \| None | AsyncPeerConnectionManager instance (optional) | None |
Returns:
| Type | Description |
|---|---|
| bytes | Chunk data bytes |
Raises:
| Type | Description |
|---|---|
| ValueError | If download fails |
| NotImplementedError | If extension protocol not available |
Source code in ccbt/discovery/xet_cas.py
find_chunk_peers(chunk_hash: bytes) -> list[PeerInfo]
async
¶
Find peers that have a specific chunk.
Queries DHT and tracker (if configured) to find peers that can provide the requested chunk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| list[PeerInfo] | List of peers that can provide this chunk |
Source code in ccbt/discovery/xet_cas.py
get_local_chunk_path(chunk_hash: bytes) -> str | None
¶
Get local path for a chunk if available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| str \| None | Local path if available, None otherwise |
Source code in ccbt/discovery/xet_cas.py
register_local_chunk(chunk_hash: bytes, local_path: str) -> None
¶
Register a locally stored chunk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
| local_path | str | Path to local chunk file | required |
Source code in ccbt/discovery/xet_cas.py
Key Methods:
- announce_chunk(): ccbt/discovery/xet_cas.py:50 - Announce chunk availability to DHT (BEP 44) and trackers
- find_chunk_peers(): ccbt/discovery/xet_cas.py:112 - Find peers that have a specific chunk via DHT and tracker queries
- request_chunk_from_peer(): ccbt/discovery/xet_cas.py:200 - Request chunk from a specific peer using Xet extension protocol
DHT Integration:
- Uses BEP 44 (Storing arbitrary data in the DHT) to store chunk metadata
- Chunk metadata format: ccbt/discovery/xet_cas.py:68 - {"type": "xet_chunk", "available": True}
- Supports multiple DHT methods: store(), store_chunk_hash(), get_chunk_peers(), get_peers(), find_value()
Tracker Integration:
- Announces chunks to trackers using first 20 bytes of chunk hash as info_hash
- Enables tracker-based peer discovery for chunks
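Both discovery paths need a 20-byte key derived from the 32-byte chunk hash: the DHT lookup hashes it with SHA-1, while the tracker announce truncates it. A minimal sketch of the two derivations follows. The function names are hypothetical.

```python
import hashlib


def dht_key_for_chunk(chunk_hash: bytes) -> bytes:
    """20-byte DHT key: SHA-1 of the 32-byte chunk hash."""
    if len(chunk_hash) != 32:
        raise ValueError("chunk hash must be 32 bytes")
    return hashlib.sha1(chunk_hash).digest()


def tracker_info_hash_for_chunk(chunk_hash: bytes) -> bytes:
    """20-byte tracker info_hash: first 20 bytes of the chunk hash."""
    if len(chunk_hash) != 32:
        raise ValueError("chunk hash must be 32 bytes")
    return chunk_hash[:20]
```

The two keys differ for the same chunk, so a DHT key cannot be reused as a tracker info_hash or the other way around.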
Storage Formats¶
Xorb Format¶
Xorbs group multiple chunks for efficient storage and retrieval.
Xorb (XOR of blocks) format handler.
Groups multiple chunks into a single xorb for efficient storage. Each xorb can contain multiple chunks up to MAX_XORB_SIZE.
Attributes:
| Name | Type | Description |
|---|---|---|
| chunks | list[tuple[bytes, bytes]] | List of (hash, data) tuples |
| total_size | | Total size of all chunks in bytes |
Initialize empty xorb.
Source code in ccbt/storage/xet_xorb.py
add_chunk(chunk_hash: bytes, chunk_data: bytes) -> bool
¶
Add chunk to xorb.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
| chunk_data | bytes | Chunk data bytes | required |
Returns:
| Type | Description |
|---|---|
| bool | True if added, False if xorb would exceed MAX_XORB_SIZE |
Source code in ccbt/storage/xet_xorb.py
clear() -> None
¶
deserialize(data: bytes) -> Xorb
staticmethod
¶
Deserialize xorb from binary format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Serialized xorb data | required |
Returns:
| Type | Description |
|---|---|
| Xorb | Xorb instance |
Raises:
| Type | Description |
|---|---|
| ValueError | If data is invalid or format is incorrect |
Source code in ccbt/storage/xet_xorb.py
get_chunk_by_hash(chunk_hash: bytes) -> bytes | None
¶
Get chunk data by hash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Returns:
| Type | Description |
|---|---|
| bytes \| None | Chunk data if found, None otherwise |
Source code in ccbt/storage/xet_xorb.py
get_chunk_count() -> int
¶
get_compressed_size(compress: bool = True) -> int
¶
Get size of xorb when serialized with compression.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| compress | bool | Whether to calculate compressed size | True |
Returns:
| Type | Description |
|---|---|
| int | Size in bytes |
Source code in ccbt/storage/xet_xorb.py
get_compression_ratio() -> float
¶
Get compression ratio if compression is enabled.
Returns:
| Type | Description |
|---|---|
| float | Compression ratio (compressed_size / uncompressed_size) |
| float | Returns 1.0 if compression is not available or not beneficial |
Source code in ccbt/storage/xet_xorb.py
get_total_size() -> int
¶
get_xorb_hash() -> bytes
¶
Compute xorb hash for deduplication.
Returns the hash of the serialized xorb data, which can be used to identify identical xorbs for deduplication.
Returns:
| Type | Description |
|---|---|
| bytes | 32-byte hash (BLAKE3-256 or SHA-256) |
Source code in ccbt/storage/xet_xorb.py
is_full() -> bool
¶
Check if xorb is full (would exceed MAX_XORB_SIZE with next chunk).
Returns:
| Type | Description |
|---|---|
| bool | True if full, False otherwise |
serialize(compress: bool = False) -> bytes
¶
Serialize xorb to binary format.
Format:
- Header (16 bytes):
  - Magic: 4 bytes ("XORB")
  - Version: 1 byte
  - Flags: 1 byte (compression flag, reserved bits)
  - Reserved: 10 bytes
- Chunk count: 4 bytes (uint32)
- Chunk entries (variable), for each chunk:
  - Hash: 32 bytes
  - Size: 4 bytes (uint32, uncompressed size)
  - Compressed size: 4 bytes (uint32, 0 if not compressed)
  - Data: variable (compressed if flags indicate)
- Metadata (variable):
  - Total size: 8 bytes (uint64, uncompressed)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| compress | bool | Whether to compress chunk data with LZ4 | False |
Returns:
| Type | Description |
|---|---|
| bytes | Serialized xorb data |
Source code in ccbt/storage/xet_xorb.py
Format Specification:
- Header: ccbt/storage/xet_xorb.py:123 - 16 bytes (magic 0x24687531, version, flags, reserved)
- Chunk count: ccbt/storage/xet_xorb.py:149 - 4 bytes (uint32, little-endian)
- Chunk entries: ccbt/storage/xet_xorb.py:140 - Variable (hash, sizes, data for each chunk)
- Metadata: ccbt/storage/xet_xorb.py:119 - 8 bytes (total uncompressed size as uint64)
Constants:
- MAX_XORB_SIZE: ccbt/storage/xet_xorb.py:35 - 64 MiB maximum xorb size
- XORB_MAGIC_INT: ccbt/storage/xet_xorb.py:36 - 0x24687531 magic number
- FLAG_COMPRESSED: ccbt/storage/xet_xorb.py:42 - LZ4 compression flag
Key Methods:
- add_chunk(): ccbt/storage/xet_xorb.py:62 - Add chunk to xorb (fails if exceeds MAX_XORB_SIZE)
- serialize(): ccbt/storage/xet_xorb.py:84 - Serialize xorb to binary format with optional LZ4 compression
- deserialize(): ccbt/storage/xet_xorb.py:200 - Deserialize xorb from binary format with automatic decompression
Compression:
- Optional LZ4 compression: ccbt/storage/xet_xorb.py:132 - Compresses chunk data if compress=True and LZ4 available
- Automatic detection: ccbt/storage/xet_xorb.py:22 - Falls back gracefully if LZ4 not installed
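The uncompressed xorb layout can be sketched with the standard `struct` module. This is an illustration of the documented field order and not the code in ccbt/storage/xet_xorb.py. It assumes little-endian encoding for every integer field (the source states this only for the chunk count), omits LZ4, and uses the ASCII magic `XORB` from the serialize() description as a placeholder, since the constants list gives 0x24687531 instead.

```python
import struct

MAGIC = b"XORB"                      # placeholder magic, see note above
HEADER = struct.Struct("<4sBB10s")   # magic, version, flags, reserved = 16 bytes
ENTRY = struct.Struct("<32sII")      # hash, size, compressed size (0 = none)


def serialize(chunks: list[tuple[bytes, bytes]]) -> bytes:
    """Pack (hash, data) pairs into the documented field order."""
    parts = [HEADER.pack(MAGIC, 1, 0, b"\x00" * 10),
             struct.pack("<I", len(chunks))]
    total = 0
    for chunk_hash, data in chunks:
        parts.append(ENTRY.pack(chunk_hash, len(data), 0))
        parts.append(data)
        total += len(data)
    parts.append(struct.pack("<Q", total))  # trailing metadata: total size
    return b"".join(parts)


def deserialize(blob: bytes) -> list[tuple[bytes, bytes]]:
    """Inverse of serialize(); raises ValueError on malformed input."""
    try:
        magic, _version, _flags, _reserved = HEADER.unpack_from(blob, 0)
        if magic != MAGIC:
            raise ValueError("bad magic")
        (count,) = struct.unpack_from("<I", blob, HEADER.size)
        offset = HEADER.size + 4
        chunks = []
        for _ in range(count):
            chunk_hash, size, _csize = ENTRY.unpack_from(blob, offset)
            offset += ENTRY.size
            data = blob[offset:offset + size]
            if len(data) != size:
                raise ValueError("truncated chunk data")
            chunks.append((chunk_hash, data))
            offset += size
        (total,) = struct.unpack_from("<Q", blob, offset)
    except struct.error as exc:
        raise ValueError("truncated xorb") from exc
    if total != sum(len(d) for _, d in chunks):
        raise ValueError("total size mismatch")
    return chunks
```

Storing both sizes per entry lets a reader skip over chunks it does not need without decompressing them.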
Shard Format¶
Shards store file metadata and CAS information for efficient file system operations.
Shard format handler for metadata storage.
Shards group file metadata and CAS information for efficient retrieval. Each shard contains:
- File information (paths, sizes, hashes)
- Xorb references
- Chunk hashes
- HMAC for integrity verification
Attributes:
| Name | Type | Description |
|---|---|---|
| files | list[dict] | List of file metadata dictionaries |
| xorbs | list[bytes] | List of xorb hashes |
| chunks | list[bytes] | List of chunk hashes |
Initialize empty shard.
Source code in ccbt/storage/xet_shard.py
add_chunk_hash(chunk_hash: bytes) -> None
¶
Add a chunk hash to the shard.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_hash | bytes | 32-byte chunk hash | required |
Source code in ccbt/storage/xet_shard.py
add_file_info(file_path: str, file_hash: bytes, xorb_refs: list[bytes], total_size: int) -> None
¶
Add file information to shard.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to the file | required |
| file_hash | bytes | 32-byte Merkle root hash of the file | required |
| xorb_refs | list[bytes] | List of 32-byte xorb hashes that contain this file's chunks | required |
| total_size | int | Total file size in bytes | required |
Source code in ccbt/storage/xet_shard.py
add_xorb_hash(xorb_hash: bytes) -> None
¶
Add a xorb hash to the shard.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| xorb_hash | bytes | 32-byte xorb hash | required |
Source code in ccbt/storage/xet_shard.py
deserialize(data: bytes, hmac_key: bytes | None = None) -> XetShard
staticmethod
¶
Deserialize shard from binary format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Serialized shard data | required |
| hmac_key | bytes \| None | Optional HMAC key for verification | None |
Returns:
| Type | Description |
|---|---|
| XetShard | XetShard instance |
Raises:
| Type | Description |
|---|---|
| ValueError | If data is invalid or HMAC verification fails |
Source code in ccbt/storage/xet_shard.py
get_file_by_path(file_path: str) -> dict | None
¶
Get file information by path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | Path to file | required |

Returns:

| Type | Description |
|---|---|
| `dict \| None` | File info dictionary if found, `None` otherwise |
Source code in ccbt/storage/xet_shard.py
get_file_count() -> int
¶
serialize(hmac_key: bytes | None = None) -> bytes
¶
Serialize shard to binary format with optional HMAC.
Format:

- [Header: 24 bytes]
    - Magic: 4 bytes ("SHAR")
    - Version: 1 byte
    - Flags: 1 byte (HMAC flag, reserved bits)
    - Reserved: 2 bytes
    - File count: 4 bytes (uint32)
    - Xorb count: 4 bytes (uint32)
    - Chunk count: 4 bytes (uint32)
    - Reserved: 4 bytes
- [File Info Section: variable], for each file:
    - Path length: 4 bytes (uint32)
    - Path: variable (UTF-8)
    - Hash: 32 bytes
    - Size: 8 bytes (uint64)
    - Xorb count: 4 bytes (uint32)
    - Xorb refs: variable (32 bytes each)
- [CAS Info Section: variable]
    - Xorb hashes: variable (32 bytes each)
    - Chunk hashes: variable (32 bytes each)
- [Footer with HMAC: variable]
    - HMAC: 32 bytes (if key provided)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hmac_key` | `bytes \| None` | Optional HMAC key for integrity verification | `None` |

Returns:

| Type | Description |
|---|---|
| `bytes` | Serialized shard data |
Source code in ccbt/storage/xet_shard.py
Format Specification:
- Header: ccbt/storage/xet_shard.py:142 - 24 bytes (magic "SHAR", version, flags, file/xorb/chunk counts)
- File Info Section: ccbt/storage/xet_shard.py:145 - Variable (path, hash, size, xorb refs for each file)
- CAS Info Section: ccbt/storage/xet_shard.py:148 - Variable (xorb hashes, chunk hashes)
- HMAC Footer: ccbt/storage/xet_shard.py:150 - 32 bytes (HMAC-SHA256 if key provided)
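The 24-byte header layout can be illustrated with Python's `struct` module. This is a sketch under stated assumptions rather than the code in `xet_shard.py`: the byte order (little-endian here) and the position of the HMAC flag (bit 0 here) are not specified in this section, and `pack_header` / `unpack_header` are hypothetical names.

```python
import struct

SHARD_MAGIC = b"SHAR"
SHARD_VERSION = 1
FLAG_HMAC = 0x01  # assumption: bit 0 of the flags byte marks an HMAC footer

# Assumption: little-endian, no padding.
# magic (4s), version (B), flags (B), 2 reserved bytes (2x),
# file/xorb/chunk counts (3 x uint32), 4 reserved bytes (4x).
HEADER = struct.Struct("<4sBB2xIII4x")


def pack_header(file_count, xorb_count, chunk_count, has_hmac=False):
    """Build the fixed-size shard header."""
    flags = FLAG_HMAC if has_hmac else 0
    return HEADER.pack(SHARD_MAGIC, SHARD_VERSION, flags,
                       file_count, xorb_count, chunk_count)


def unpack_header(data):
    """Parse and validate the header at the start of `data`."""
    if len(data) < HEADER.size:
        raise ValueError("data too short for shard header")
    magic, version, flags, files, xorbs, chunks = HEADER.unpack_from(data)
    if magic != SHARD_MAGIC:
        raise ValueError("bad shard magic")
    if version != SHARD_VERSION:
        raise ValueError("unsupported shard version")
    return {
        "file_count": files,
        "xorb_count": xorbs,
        "chunk_count": chunks,
        "has_hmac": bool(flags & FLAG_HMAC),
    }
```

The field widths add up to 4 + 1 + 1 + 2 + 4 + 4 + 4 + 4 = 24 bytes, matching the documented header size.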
Constants:
- SHARD_MAGIC: ccbt/storage/xet_shard.py:19 - b"SHAR" magic bytes
- SHARD_VERSION: ccbt/storage/xet_shard.py:20 - Format version 1
- HMAC_SIZE: ccbt/storage/xet_shard.py:22 - 32 bytes for HMAC-SHA256
Key Methods:
- add_file_info(): ccbt/storage/xet_shard.py:47 - Add file metadata with xorb references
- add_chunk_hash(): ccbt/storage/xet_shard.py:80 - Add chunk hash to shard
- add_xorb_hash(): ccbt/storage/xet_shard.py:93 - Add xorb hash to shard
- serialize(): ccbt/storage/xet_shard.py:106 - Serialize shard to binary format with optional HMAC
- deserialize(): ccbt/storage/xet_shard.py:201 - Deserialize shard from binary format with HMAC verification
Integrity:

- HMAC verification: ccbt/storage/xet_shard.py:170 - Optional HMAC-SHA256 for shard integrity
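The optional HMAC-SHA256 footer can be sketched with the standard `hmac` module. The helper names `seal` and `open_sealed` are hypothetical, and the assumption that the HMAC covers every byte preceding the footer is not confirmed by this section.

```python
import hashlib
import hmac
from typing import Optional

HMAC_SIZE = 32  # HMAC-SHA256 digest length


def seal(body: bytes, hmac_key: Optional[bytes] = None) -> bytes:
    """Append an HMAC-SHA256 footer when a key is given (hypothetical helper)."""
    if hmac_key is None:
        return body
    # Assumption: the tag is computed over all bytes before the footer.
    return body + hmac.new(hmac_key, body, hashlib.sha256).digest()


def open_sealed(data: bytes, hmac_key: Optional[bytes] = None) -> bytes:
    """Verify and strip the footer; raise ValueError on mismatch."""
    if hmac_key is None:
        return data
    if len(data) < HMAC_SIZE:
        raise ValueError("data too short to contain an HMAC footer")
    body, tag = data[:-HMAC_SIZE], data[-HMAC_SIZE:]
    expected = hmac.new(hmac_key, body, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return body
```

Raising `ValueError` on a mismatch mirrors the documented behavior of `deserialize()`, which rejects data whose HMAC verification fails.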
Merkle Tree Computation¶
Files are verified using Merkle trees built from chunk hashes for efficient integrity verification.
Xet protocol hashing functions.
Provides BLAKE3-256 hashing for chunks and Merkle tree construction for file-level hashing. Falls back to SHA-256 if blake3 is not available.
build_merkle_tree(chunks: list[bytes]) -> bytes
staticmethod
¶
Build Merkle tree from chunk data.
Hashes each chunk, then constructs a binary Merkle tree bottom-up from the resulting chunk hashes. Each level pairs adjacent hashes and hashes each pair together until a single root hash remains.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunks` | `list[bytes]` | List of chunk data (not hashes - will be hashed) | required |

Returns:

| Type | Description |
|---|---|
| `bytes` | 32-byte root hash (Merkle tree root) |
Source code in ccbt/storage/xet_hashing.py
build_merkle_tree_from_hashes(chunk_hashes: list[bytes]) -> bytes
staticmethod
¶
Build Merkle tree from existing chunk hashes.
This variant takes pre-computed chunk hashes instead of chunk data. Useful when you already have the hashes and don't need to recompute them.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_hashes` | `list[bytes]` | List of 32-byte chunk hashes | required |

Returns:

| Type | Description |
|---|---|
| `bytes` | 32-byte root hash (Merkle tree root) |
Source code in ccbt/storage/xet_hashing.py
compute_chunk_hash(chunk_data: bytes) -> bytes
staticmethod
¶
Compute BLAKE3-256 hash for a chunk.
Uses BLAKE3 if available for better performance, otherwise falls back to SHA-256 for compatibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_data` | `bytes` | Chunk data to hash | required |

Returns:

| Type | Description |
|---|---|
| `bytes` | 32-byte hash (BLAKE3-256 or SHA-256) |
Source code in ccbt/storage/xet_hashing.py
compute_xorb_hash(xorb_data: bytes) -> bytes
staticmethod
¶
Compute hash for xorb data.
Xorbs are collections of chunks stored together. This method computes the hash of the xorb data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `xorb_data` | `bytes` | Xorb data to hash | required |

Returns:

| Type | Description |
|---|---|
| `bytes` | 32-byte hash |
Source code in ccbt/storage/xet_hashing.py
hash_file_incremental(file_path: str, chunk_callback: Callable[[bytes], None] | None = None) -> bytes
staticmethod
¶
Compute file hash incrementally by reading and hashing chunks.
This method reads a file in chunks and computes the hash incrementally, which is memory-efficient for large files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | Path to file to hash | required |
| `chunk_callback` | `Callable[[bytes], None] \| None` | Optional callback function called with each chunk | `None` |

Returns:

| Type | Description |
|---|---|
| `bytes` | 32-byte file hash |
Source code in ccbt/storage/xet_hashing.py
verify_chunk_hash(chunk_data: bytes, expected_hash: bytes) -> bool
staticmethod
¶
Verify chunk data against expected hash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_data` | `bytes` | Chunk data to verify | required |
| `expected_hash` | `bytes` | Expected hash (32 bytes) | required |

Returns:

| Type | Description |
|---|---|
| `bool` | `True` if hash matches, `False` otherwise |
Source code in ccbt/storage/xet_hashing.py
Hash Functions:
- compute_chunk_hash(): ccbt/storage/xet_hashing.py:43 - Compute BLAKE3-256 hash for chunk (falls back to SHA-256)
- compute_xorb_hash(): ccbt/storage/xet_hashing.py:63 - Compute hash for xorb data
- verify_chunk_hash(): ccbt/storage/xet_hashing.py:158 - Verify chunk data against expected hash
Merkle Tree Construction:
- build_merkle_tree(): ccbt/storage/xet_hashing.py:78 - Build Merkle tree from chunk data (hashes chunks first)
- build_merkle_tree_from_hashes(): ccbt/storage/xet_hashing.py:115 - Build Merkle tree from pre-computed chunk hashes
Algorithm: The Merkle tree is built bottom-up by pairing hashes at each level:

1. Start with the chunk hashes (leaf nodes)
2. Pair adjacent hashes and hash the combination
3. Repeat until a single root hash remains
4. If a level has an odd number of hashes, duplicate the last hash for pairing
Incremental Hashing:
- hash_file_incremental(): ccbt/storage/xet_hashing.py:175 - Compute file hash incrementally for memory efficiency
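Incremental hashing keeps memory use bounded by feeding the file to the hasher one block at a time. The sketch below shows the general pattern with SHA-256; the name `hash_file_streaming`, the fixed 64 KiB block size, and the flat (non-Merkle) digest are assumptions, since this section does not say how `hash_file_incremental()` splits the file or combines the pieces.

```python
import hashlib
from typing import Callable, Optional


def hash_file_streaming(
    file_path: str,
    block_size: int = 64 * 1024,  # assumption: fixed-size reads
    chunk_callback: Optional[Callable[[bytes], None]] = None,
) -> bytes:
    """Hash a file block by block without loading it fully into memory."""
    hasher = hashlib.sha256()
    with open(file_path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # empty read means end of file
                break
            hasher.update(block)
            if chunk_callback is not None:
                # Let the caller observe each block, e.g. to store or index it.
                chunk_callback(block)
    return hasher.digest()
```

Only one block is held in memory at a time, so the peak footprint depends on `block_size` rather than on the file size.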
Hash Size:
- HASH_SIZE: ccbt/storage/xet_hashing.py:40 - 32 bytes for BLAKE3-256 or SHA-256
BLAKE3 Support:

- Automatic detection: ccbt/storage/xet_hashing.py:21 - Uses BLAKE3 if available, falls back to SHA-256
- Performance: BLAKE3 provides better performance for large files
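The detect-and-fall-back behavior can be sketched as below. The function names `chunk_hash` and `verify` are hypothetical stand-ins for `compute_chunk_hash()` and `verify_chunk_hash()`, and the length check and constant-time comparison are choices made for this sketch rather than documented behavior.

```python
import hashlib
import hmac

try:
    import blake3  # optional third-party package
    HAS_BLAKE3 = True
except ImportError:
    HAS_BLAKE3 = False

HASH_SIZE = 32  # both BLAKE3-256 and SHA-256 produce 32-byte digests


def chunk_hash(chunk_data: bytes) -> bytes:
    """Hash a chunk with BLAKE3 when available, else SHA-256."""
    if HAS_BLAKE3:
        return blake3.blake3(chunk_data).digest()
    return hashlib.sha256(chunk_data).digest()


def verify(chunk_data: bytes, expected_hash: bytes) -> bool:
    """Return True only if the chunk hashes to expected_hash."""
    if len(expected_hash) != HASH_SIZE:
        return False
    return hmac.compare_digest(chunk_hash(chunk_data), expected_hash)
```

BLAKE3 and SHA-256 give different digests for the same input, so a hash computed under one algorithm will not verify under the other; both sides of an exchange need to use the same algorithm for chunk hashes to match.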