zahere/README.md

Zaher Khateeb | AI/ML Engineer

Founder of AgentiCraft — infrastructure layer for production multi-agent systems.

I specialize in multi-agent systems architecture, LLM infrastructure, and distributed systems reliability. My work sits at the intersection of formal methods and production engineering — building systems that are provably correct, not just empirically okay.


Current Research

Fault-Dependent Resilience in Multi-Agent LLM Systems

Extending classical network reliability theory to stochastic agent quality. The core result is an if-and-only-if characterization of when topology choice actually matters: under crash-stop faults all mesh topologies are equivalent (a mathematical identity), while Byzantine faults break that equivalence in ways determined by the coordination protocol, not the graph structure.

Validated across ~34,000 LLM experiments spanning 13 coordination topologies, two fault regimes, two task domains, and two model generations. The work is being prepared for submission to a top-tier ML systems venue.

Standalone libraries from this research:

| Library | Description |
| --- | --- |
| stochastic-circuit-breaker | CUSUM-optimal circuit breaker for LLM agents and stochastic systems. 4-state FSM with statistically principled degradation detection and provably minimax detection delay. |
| reliability-polynomials | Generalized reliability polynomials where coefficients encode quality, not just connectivity. Fault-dependent crossover analysis, three theorems. |
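
To make the CUSUM idea concrete, here is a minimal sketch of a one-sided CUSUM detector for a rise in failure rate. It is not the stochastic-circuit-breaker API: the class name `CusumDetector`, the parameters `p0`, `p1`, and `threshold`, and the Bernoulli failure model are all assumptions chosen for illustration, and the 4-state FSM is not modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class CusumDetector:
    """One-sided CUSUM detector for a rise in a Bernoulli failure rate.

    Hypothetical illustration only; names and parameters are assumptions.
    """

    p0: float          # assumed failure probability when healthy
    p1: float          # assumed failure probability when degraded (p1 > p0)
    threshold: float   # alarm threshold on the CUSUM statistic
    statistic: float = 0.0

    def update(self, failed: bool) -> bool:
        """Fold in one call outcome; return True when the alarm fires."""
        # Log-likelihood ratio of "degraded" vs "healthy" for this outcome.
        if failed:
            llr = math.log(self.p1 / self.p0)
        else:
            llr = math.log((1.0 - self.p1) / (1.0 - self.p0))
        # CUSUM recursion: accumulate evidence, clamped at zero.
        self.statistic = max(0.0, self.statistic + llr)
        return self.statistic >= self.threshold

    def reset(self) -> None:
        """Clear accumulated evidence, e.g. after the breaker recovers."""
        self.statistic = 0.0


detector = CusumDetector(p0=0.05, p1=0.5, threshold=5.0)
healthy_alarms = [detector.update(False) for _ in range(20)]
failure_alarms = [detector.update(True) for _ in range(3)]
```

With these illustrative settings, successes keep the statistic pinned at zero, and each failure adds log(10) ≈ 2.30, so the alarm fires on the third consecutive failure. A real breaker would map the alarm onto its state machine and choose the threshold from a target false-alarm rate.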

Technical Focus

Multi-Agent Systems — mesh coordination architecture, fault-dependent topology selection, Byzantine fault tolerance for LLM systems, stochastic service mesh, MCP/A2A protocol integration

Formal Methods — session type theory for deadlock-freedom guarantees, runtime property verification, CSP process algebra, refinement checking

LLM Infrastructure — provider-agnostic inference abstraction, statistical circuit breakers with CUSUM-optimal change detection, quality-weighted reliability theory

Distributed Systems — consensus protocols, fault injection and fault modeling, observability, Kubernetes-native deployment


Tech Stack

Languages: Python (expert), C++, TypeScript, SQL, Bash

AI/ML: PyTorch, RAG, fine-tuning (LoRA, QLoRA), LLM evaluation, OpenTelemetry

Infrastructure: Kubernetes, Docker, Helm, CI/CD, service mesh, PostgreSQL, Redis, Qdrant

Cloud: AWS, GCP, Azure, Nebius AI Cloud


Background

  • B.Sc. Industrial Engineering & Management (Data Science concentration) — Tel Aviv University
  • Advanced Data Science & AI Program — Nebius Academy (Y-DATA), Tel Aviv University
  • Previously: AI & Infrastructure Engineer at Visual Arena (Gothenburg, Sweden)

Website | AgentiCraft | LinkedIn

Pinned

  1. stochastic-circuit-breaker (Public)

    Statistically optimal circuit breaker for stochastic systems. 4-state CUSUM-based FSM with provably minimax detection delay (Moustakides 1986). Zero dependencies.

    Python

  2. reliability-polynomials (Public)

    Generalized reliability polynomials for quality-weighted network analysis. Every reliability library assumes binary survival — this one doesn't. Zero dependencies.

    Python
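
For context on the binary-survival baseline that reliability-polynomials generalizes, the sketch below computes the classical all-terminal reliability polynomial by brute force. It does not reproduce the library's quality-weighted coefficients or its API: the function names are hypothetical, and each edge is assumed to survive independently with probability `p`.

```python
from itertools import combinations


def is_connected(n, edges):
    """Return True if the n nodes 0..n-1 form one component under `edges`."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components == 1


def connected_subgraph_counts(n, edges):
    """counts[k] = number of k-edge subsets that keep all n nodes connected."""
    return [
        sum(1 for subset in combinations(edges, k) if is_connected(n, subset))
        for k in range(len(edges) + 1)
    ]


def reliability(n, edges, p):
    """Classical all-terminal reliability: sum_k counts[k] * p^k * (1-p)^(m-k)."""
    m = len(edges)
    counts = connected_subgraph_counts(n, edges)
    return sum(c * p**k * (1 - p) ** (m - k) for k, c in enumerate(counts))


triangle = [(0, 1), (1, 2), (0, 2)]
counts = connected_subgraph_counts(3, triangle)
r_half = reliability(3, triangle, 0.5)
```

For the triangle, any two edges or all three keep the graph connected, so the polynomial is 3p²(1-p) + p³. The enumeration is exponential in the number of edges and only suitable for small graphs; here the coefficients count connected subgraphs, whereas a quality-weighted variant would attach something richer to each term.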