Reduce token waste in BOS bestfit by cropping shortest doc (#445)

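When no document fits the remaining row space, crop the shortest
document in the buffer instead of the first. Every buffered document
is longer than the remaining space at that point, so cropping
discards len(doc) - remaining tokens, and the shortest document
discards the fewest.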

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Yamahammer
2026-01-16 21:50:34 -05:00
committed by GitHub
parent 6460dc6382
commit e1dafc510f
+3 -2
@@ -178,8 +178,9 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
         doc = doc_buffer.pop(best_idx)
         row.extend(doc)
     else:
-        # No doc fits - crop first doc to fill remaining
-        doc = doc_buffer.pop(0)
+        # No doc fits - crop shortest in buffer to fill remaining and minimize waste
+        shortest_idx = min(range(len(doc_buffer)), key=lambda i: len(doc_buffer[i]))
+        doc = doc_buffer.pop(shortest_idx)
         row.extend(doc[:remaining])
 
 rows.append(row[:row_capacity])
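
The effect of the change can be illustrated with a small standalone sketch. This is not code from the repository: `crop_waste`, its `policy` argument, and the token counts are hypothetical, and the sketch assumes the fallback branch has been reached, so every buffered document is longer than `remaining`.

```python
def crop_waste(doc_lengths, remaining, policy):
    """Return the tokens discarded when one document must be cropped.

    Hypothetical helper for illustration only. Assumes no document fits,
    i.e. every length in doc_lengths exceeds `remaining`.
    """
    if not all(n > remaining for n in doc_lengths):
        raise ValueError("sketch assumes no document fits the remaining space")
    if policy == "first":
        # Old behavior: crop whatever sits at the front of the buffer.
        idx = 0
    elif policy == "shortest":
        # New behavior: same selection expression as in the diff above.
        idx = min(range(len(doc_lengths)), key=lambda i: doc_lengths[i])
    else:
        raise ValueError(f"unknown policy: {policy}")
    # Only the first `remaining` tokens are kept; the rest are discarded.
    return doc_lengths[idx] - remaining


buffer_lengths = [900, 120, 400]  # made-up token counts per buffered document
remaining = 100                   # made-up free space left in the row

print(crop_waste(buffer_lengths, remaining, "first"))     # 800
print(crop_waste(buffer_lengths, remaining, "shortest"))  # 20
```

With these made-up numbers, cropping the first document throws away 800 tokens, while cropping the shortest throws away 20. The two policies coincide only when the shortest document happens to be first in the buffer.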