Advanced¶
Bulk Import¶
For high-volume imports (1k-10k+ records), the default per-record persistence can be slow: one database call and one progress broadcast per row. Bulk mode switches to `insert_all`, giving 10-100x the throughput for simple create scenarios.
Bulk mode is opt-in per target -- existing targets are unaffected.
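The difference in database traffic is easy to see with plain ActiveRecord (a minimal sketch outside DataPorter; `Guest` and `rows`, an array of attribute hashes, are stand-ins):

```ruby
# Per-record persistence: one INSERT (and one round-trip) per row.
rows.each { |attrs| Guest.create!(attrs) }

# Bulk mode: one multi-row INSERT per batch. Note that insert_all
# skips validations and callbacks (see Performance tips below).
Guest.insert_all(rows)
```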
Basic usage¶
Add `bulk_mode` to your target. The default `persist_batch` calls `insert_all` on the model class:
```ruby
class GuestTarget < DataPorter::Target
  label "Guests"
  model_name "Guest"
  sources :csv

  bulk_mode batch_size: 500

  columns do
    column :first_name, type: :string, required: true
    column :last_name, type: :string
    column :email, type: :email
  end

  def persist(record, context:)
    Guest.create!(record.attributes)
  end
end
```
`persist` is still required as a fallback (see conflict strategies below).

The default `persist_batch` automatically injects `created_at` and `updated_at` timestamps into each row.
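Conceptually, the default implementation behaves roughly like the sketch below (illustrative only, not the library's exact code):

```ruby
# Rough shape of the built-in persist_batch (assumed for illustration).
def persist_batch(records, context:)
  now = Time.current
  rows = records.map do |record|
    record.attributes.merge("created_at" => now, "updated_at" => now)
  end
  Guest.insert_all(rows)  # the class named via model_name
end
```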
Options¶
| Option | Default | Description |
|---|---|---|
| `batch_size` | `500` | Number of records per `insert_all` call |
| `on_conflict` | `:retry_per_record` | What to do when a batch fails |
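Both options can be set together; for example, larger batches with fast failure:

```ruby
bulk_mode batch_size: 1_000, on_conflict: :fail_batch
```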
Conflict strategies¶
When a batch fails (e.g. a unique constraint violation), the `on_conflict` option controls what happens:

| Strategy | Behavior |
|---|---|
| `:retry_per_record` | Re-process the failed batch record-by-record via `persist`. Records that succeed are counted as created; records that fail are counted as errored. This is the safest default. |
| `:fail_batch` | Mark all records in the failed batch as errored. No individual retry. Use this when you want fast failure and don't need partial recovery. |
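To make `:retry_per_record` concrete, the recovery flow is roughly the following (illustrative pseudocode, not the library's actual internals; `mark_created!` and `mark_errored!` are hypothetical bookkeeping helpers):

```ruby
begin
  persist_batch(batch, context: context)
  batch.each(&:mark_created!)            # hypothetical: count the whole batch as created
rescue ActiveRecord::ActiveRecordError
  batch.each do |record|
    begin
      persist(record, context: context)  # falls back to your per-record persist
      record.mark_created!
    rescue StandardError => e
      record.mark_errored!(e)            # hypothetical: count this record as errored
    end
  end
end
```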
Custom batch logic¶
Override `persist_batch` for full control over batch persistence. This is useful for `upsert_all`, custom conflict handling, or injecting extra data:
```ruby
class OrderTarget < DataPorter::Target
  label "Orders"
  model_name "Order"
  sources :csv

  bulk_mode batch_size: 200, on_conflict: :fail_batch

  columns do
    column :external_id, type: :string, required: true
    column :total, type: :decimal
  end

  def persist_batch(records, context:)
    Order.upsert_all(
      records.map { |r| r.data.merge("shop_id" => import_params["shop_id"]) },
      unique_by: :external_id
    )
  end

  def persist(record, context:)
    Order.create!(record.data.merge("shop_id" => import_params["shop_id"]))
  end
end
```
Note
When overriding `persist_batch`, you are responsible for handling timestamps and any extra attributes. The automatic `created_at`/`updated_at` injection applies only to the default implementation.
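For example, a custom `persist_batch` that preserves timestamp behavior for the `OrderTarget` above might look like this (a sketch; adjust to taste):

```ruby
def persist_batch(records, context:)
  now = Time.current
  Order.upsert_all(
    records.map do |r|
      r.data.merge(
        "shop_id"    => import_params["shop_id"],
        "created_at" => now,  # note: on conflict this rewrites created_at too;
        "updated_at" => now   # drop it from the update if that matters to you
      )
    end,
    unique_by: :external_id
  )
end
```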
How it works¶
- After parsing, importable records are sliced into batches of `batch_size` (see the sketch after this list)
- Each batch is passed to `persist_batch` (custom or the default `insert_all`)
- On success, all records in the batch are counted as created
- On failure, the conflict strategy kicks in (retry or fail)
- Progress is broadcast once per batch, not per record
- Transform and validate still run per-record during the parse phase (unchanged)
Performance tips¶
- Batch size: 500 is a good default. Going above 1,000 may lock the table for too long or stress the database WAL. Going below 100 reduces the throughput benefit.
- Indexes: Ensure your table has the right indexes for `insert_all`/`upsert_all` (especially the `unique_by` columns).
- No ActiveRecord callbacks: `insert_all` bypasses validations, callbacks (`before_save`, `after_create`, etc.), and Ruby-level defaults. All validation should happen in the parse phase via column types and `validate`. If you need callbacks, override `persist_batch` with `create!` calls or use the default per-record mode (see the illustration after this list).
- Memory: Records are processed lazily in batches -- only one batch is in memory at a time during the import phase.
insert_allbypasses validations, callbacks (before_save,after_create, etc.), and Ruby-level defaults. All validation should happen in the parse phase via column types andvalidate. If you need callbacks, overridepersist_batchwithcreate!calls or use the default per-record mode. - Memory: Records are processed lazily in batches -- only one batch is in memory at a time during the import phase.