refactor(ntx-builder): simplify coordinator-actor messaging with Notify #1699
SantiagoPittella wants to merge 9 commits into next from
Conversation
crates/ntx-builder/src/actor/mod.rs
Outdated
    let resolved = self.db
        .is_transaction_resolved(account_id, awaited_id)
        .await
        .expect("should be able to check tx status");
Do we really want to stop everything if a single actor fails on this basis?
If an actor panics, the coordinator catches the error in `coordinator.next()` and logs it, and the system keeps running without the actor. Though now that you mention it, the best approach would probably be to return a specific error for this DB failure case.
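To illustrate the suggestion above, here is a minimal, self-contained sketch of surfacing the DB failure as a dedicated shutdown reason instead of calling `.expect(...)`. The types (`ActorShutdownReason` carrying only a `u64` account id, `String` as the DB error) are hypothetical simplifications of the crate's actual types:

```rust
#[derive(Debug)]
enum ActorShutdownReason {
    // Hypothetical: carries only an account id; the real variant may hold more.
    DbError(u64),
}

// Map the DB result into the shutdown reason, logging on failure, instead of
// panicking the actor with `.expect(...)`.
fn check_resolved(
    db_result: Result<bool, String>,
    account_id: u64,
) -> Result<bool, ActorShutdownReason> {
    db_result.map_err(|err| {
        eprintln!("db error for account {account_id}: {err}");
        ActorShutdownReason::DbError(account_id)
    })
}

fn main() {
    assert!(matches!(check_resolved(Ok(true), 7), Ok(true)));
    assert!(matches!(
        check_resolved(Err("disk io".into()), 7),
        Err(ActorShutdownReason::DbError(7))
    ));
    println!("db failure surfaced as ActorShutdownReason");
}
```

The coordinator can then log the `DbError` and drop the actor, rather than the actor panicking mid-task.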
crates/ntx-builder/src/actor/mod.rs
Outdated
    /// before the failure counts are updated in the database.
    async fn mark_notes_failed(&self, nullifiers: &[Nullifier], block_num: BlockNumber) {
        let (ack_tx, ack_rx) = tokio::sync::oneshot::channel();
        let _ = self
`_` is of type `Result<(), SendError<ActorRequest>>`. Are we sure we don't want to handle that error here?
Same with `let _ = ack_rx.await;` below.
Replaced them with `.is_err()` checks.
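For reference, a minimal sketch of the `.is_err()` pattern being adopted, using std's `mpsc` to mirror the tokio channels (the helper name `send_checked` is illustrative, not from the PR):

```rust
use std::sync::mpsc;

/// Send a request, logging (instead of silently discarding with `let _ =`)
/// a closed-channel error. Returns true when the send succeeded.
fn send_checked(tx: &mpsc::Sender<u32>, value: u32) -> bool {
    if tx.send(value).is_err() {
        eprintln!("receiver dropped; request {value} not delivered");
        return false;
    }
    true
}

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();
    assert!(send_checked(&tx, 1));

    // Once the receiver is dropped, send returns Err(SendError(..)).
    drop(rx);
    assert!(!send_checked(&tx, 2));
    println!("send errors checked");
}
```

The same shape applies to the oneshot ack: a failed send/await means the peer is gone, which is worth at least a log line.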
sergerad left a comment:
LGTM, just a few questions on errors we don't handle.
igamigo left a comment:
Leaving some comments. Would be good to load test this, even if lightly, to see that it performs well.
    ) -> Result<T, ActorShutdownReason> {
        result.map_err(|err| {
            tracing::error!(err = err.as_report(), account_id = %account_id, "{context}");
            ActorShutdownReason::DbError(account_id)
Should this error variant include the `miden_node_db::DatabaseError`?
    use crate::store::StoreClient;

    // ACTOR NOTIFICATION

    /// Converts a database result into an `ActorShutdownReason` error, logging the error on failure.
Docs are a bit misleading (it's an internal fn so feel free to disregard):

- /// Converts a database result into an `ActorShutdownReason` error, logging the error on failure.
+ /// Converts a database result mapping the error into an `ActorShutdownReason`, logging it on failure.
    let has_notes = match db_query(
        account_id,
        self.db.has_available_notes(account_id, block_num, self.max_note_attempts).await,
        "failed to check for available notes",
    ) {
Minor nit, feel free to disregard: a pattern similar to `self.db.has_available_notes(...).map_err(|err| map_db_err(acc_id, "failed to check notes"))` reads a bit better IMO.
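A standalone sketch of the suggested style: the wrapping `db_query(...)` match is replaced by a `map_err` at the call site. `map_db_err` is a hypothetical helper, and the error/account types are simplified stand-ins:

```rust
#[derive(Debug, PartialEq)]
enum ActorShutdownReason {
    DbError(u64),
}

// Hypothetical helper returning a closure suitable for `map_err`: it logs the
// DB error with its context, then converts it into the shutdown reason.
fn map_db_err(
    account_id: u64,
    context: &'static str,
) -> impl Fn(String) -> ActorShutdownReason {
    move |err| {
        eprintln!("{context}: {err} (account {account_id})");
        ActorShutdownReason::DbError(account_id)
    }
}

fn main() {
    let db_result: Result<bool, String> = Err("busy".into());
    let has_notes = db_result.map_err(map_db_err(7, "failed to check for available notes"));
    assert_eq!(has_notes, Err(ActorShutdownReason::DbError(7)));
    println!("map_err style applied");
}
```

The behavior is the same as the `db_query` wrapper; the difference is purely call-site readability.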
    match request {
        ActorRequest::NotesFailed { nullifiers, block_num, ack_tx } => {
            if let Err(err) = self.db.notes_failed(nullifiers, block_num).await {
                tracing::error!(err = %err, "failed to mark notes as failed");
            }
            let _ = ack_tx.send(());
Is this supposed to be acked even if the `db.notes_failed()` call failed?
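One possible answer to the question above, as a hedged sketch: forward the DB outcome through the ack channel instead of an unconditional `()`, so the waiter is still unblocked but can see whether persistence succeeded. Std's `mpsc` stands in for the tokio oneshot, and the error type is simplified:

```rust
use std::sync::mpsc;

// Hypothetical variant of the actor's handler: still always acks (so the
// caller never deadlocks), but the ack now carries the persistence result.
fn handle_notes_failed(
    db_result: Result<(), String>,
    ack_tx: mpsc::Sender<Result<(), String>>,
) {
    if let Err(err) = &db_result {
        eprintln!("failed to mark notes as failed: {err}");
    }
    let _ = ack_tx.send(db_result);
}

fn main() {
    let (ack_tx, ack_rx) = mpsc::channel();
    handle_notes_failed(Err("db locked".into()), ack_tx);
    assert!(ack_rx.recv().unwrap().is_err());
    println!("caller observed the persistence failure");
}
```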
    ActorShutdownReason::DbError(account_id) => {
        tracing::error!(account_id = %account_id, "Account actor shut down due to DB error");
        Ok(())
    },
Is the idea not to update the registry and remove the actor handle when this happens? I took a quick look at the other branches and could not find it either.
    pub fn broadcast(&self) {
        for handle in self.actor_registry.values() {
            handle.notify.notify_one();
        }
    }
Will this scale well? Broadcasting to all actors feels like it will cost a lot of unnecessary queries. For instance, I saw that `select_candidate_from_db` queries account state and notes through two different DB queries. The account-state one is probably not trivial, though the other one might be.
I know this is partially what the PR was trying to change, but I wonder if sending the data here and filtering it in memory is worth it.
Also, I believe there is a low-hanging optimization opportunity with the two DB queries: you can check for notes first and bail if none are found.
First task of #1694
- Replaces the per-actor `mpsc<Arc<MempoolEvent>>` channels with `Arc<Notify>`. The DB is already the source of truth (since chore(ntx): replace in memory with sqlite database #1662), so actors only need a "re-check your state" signal, not the event payload. This removes all `SendError` handling, failed-actor cleanup, and the `send()` helper from the coordinator.
- In `TransactionInflight` mode, actors now query the DB (`is_transaction_resolved`) to check whether their awaited transaction was committed/reverted, since `Notify` doesn't carry a payload.
- Actor requests are consolidated into a single `ActorRequest` enum over one `mpsc` channel. `NotesFailed` carries a oneshot ack to prevent a race condition where the actor could re-select notes before failure counts are persisted (context). `CacheNoteScript` remains fire-and-forget.
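The request-channel design described above can be sketched as follows. This is a simplified stand-in, not the crate's code: std's `mpsc` replaces the tokio channels, and `u64`/`u32` replace `Nullifier`/`BlockNumber`:

```rust
use std::sync::mpsc;

// Single request enum carried over one channel per actor.
enum ActorRequest {
    NotesFailed { nullifiers: Vec<u64>, block_num: u32, ack_tx: mpsc::Sender<()> },
    CacheNoteScript { script: Vec<u8> }, // fire-and-forget: no ack channel
}

fn main() {
    let (req_tx, req_rx) = mpsc::channel();
    let (ack_tx, ack_rx) = mpsc::channel();

    req_tx
        .send(ActorRequest::NotesFailed { nullifiers: vec![1, 2], block_num: 10, ack_tx })
        .unwrap();

    // Actor side: persist the failure counts first, then ack, so the sender
    // cannot re-select these notes before the counts are visible in the DB.
    match req_rx.recv().unwrap() {
        ActorRequest::NotesFailed { nullifiers, block_num, ack_tx } => {
            println!("marking {} notes failed at block {block_num}", nullifiers.len());
            ack_tx.send(()).unwrap();
        }
        ActorRequest::CacheNoteScript { .. } => {}
    }

    assert!(ack_rx.recv().is_ok());
    println!("ack received; note re-selection may proceed");
}
```

The ack closes the race window: the sender blocks on `ack_rx` until the write has happened, while `CacheNoteScript` needs no such ordering and stays ack-free.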