[{"content":"The library described in this section is available on GitHub.\nThe project serves as an experimental work for:\ndesigning a non-trivial machine learning framework. implementing low-level primitives, on top of mature libraries such as cuDNN and cuBLAS. exposing Python bindings for common workflows. Its long-term goal is to implement classical CNN architectures such as LeNet, AlexNet, GoogLeNet, VGGNet, and ResNet without relying on high-level deep learning frameworks.\n","permalink":"https://rfvasile.github.io/posts/rml/core-infrastructure/","summary":"Overview of the core infrastructure, covering architectural design, CUDA primitives and Python bindings.","title":"Infrastructure of the rml library"},{"content":" Update (2026-03-25): See GitHub for a NumPy/Torch implementation of the experiment described below. Update (2026-04-03): Here is a list of common differentiation rules. Logistic regression is used to model a binary label (i.e. 0/1) through a linear function followed by a sigmoid. Broadly speaking, the sigmoid is the common approach for modelling a Bernoulli distribution, which is a statistical model for predicting outcomes involving two distinct categories. Training a model involves maximizing conditional likelihood, or equivalently minimizing binary cross-entropy.\nSolution\u0026nbsp;(Python sketch) The following pseudocode provides a broad idea of the derivations described below. 
Full code available here.\nclass LogisticRegression(ABC):\n    \u0026#34;\u0026#34;\u0026#34;Base wrapper for the training loop.\u0026#34;\u0026#34;\u0026#34;\n    def _sigmoid(self, logits: Tensor) -\u0026gt; Tensor:\n        \u0026#34;\u0026#34;\u0026#34;Used to obtain probabilities from specified logits.\u0026#34;\u0026#34;\u0026#34;\n        return 1/(1+torch.exp(-logits))\n    def fit(self, X: Tensor, y: Tensor) -\u0026gt; LogisticRegression:\n        for _ in range(self.n_iters):\n            logits = X @ self.weights\n            # Apply the sigmoid and the log likelihood function\n            p = self._sigmoid(logits)\n            p = torch.clamp(p, min=1e-12, max=1-1e-12)\n            loss = -torch.mean(y*torch.log(p) + (1-y)*torch.log(1-p))\n            # Update step\n            step = self.optimizer(X, y)  # N\n            self.weights = self.weights - self.lr * step\nclass GradientDescent(LogisticRegression):\n    \u0026#34;\u0026#34;\u0026#34;Calculate gradient using gradient descent (first order method).\u0026#34;\u0026#34;\u0026#34;\n    def optimizer(self, X: Tensor, y: Tensor) -\u0026gt; Tensor:\n        \u0026#34;\u0026#34;\u0026#34;Calculate the gradient.\u0026#34;\u0026#34;\u0026#34;\n        logits = X @ self.weights  # (MxN) x (N) -\u0026gt; M\n        p = self._sigmoid(logits)  # M\n        return torch.mean(X.T * (p-y), dim=1)  # (NxM) x (M) -\u0026gt; N x M -\u0026gt; N\nclass NewtonsMethod(LogisticRegression):\n    \u0026#34;\u0026#34;\u0026#34;Calculate the update step using Newton\u0026#39;s method (second order method).\u0026#34;\u0026#34;\u0026#34;\n    def optimizer(self, X: Tensor, y: Tensor) -\u0026gt; Tensor:\n        \u0026#34;\u0026#34;\u0026#34;Calculate the damped Newton step.\u0026#34;\u0026#34;\u0026#34;\n        z = X @ self.weights  # (MxN) x N -\u0026gt; M\n        p = self._sigmoid(z)  # M\n        S = torch.diag(p*(1-p))  # MxM\n        # Dimensions: (NxM x MxM x MxN)^-1 x NxM x M -\u0026gt; N\n        H = X.T @ S @ X\n        g = X.T @ (p - y)\n        # Apply a damping term so the system becomes stable\n        eps = 1e-4\n        I = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)\n        step = torch.linalg.solve(H + eps * I, g)\n        return step\nFigure\u0026nbsp;1: Red and green points denote the two classes. 
The red line is the decision boundary found by logistic regression. (Image Source: Hastie et al. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. https://books.google.it/books?id=eBSgoAEACAAJ , pp.134) Model Setup Setup. Let $X \\in \\mathbb{R}^{N \\times p}$ contain rows $x_i^\\top \\in \\mathbb{R}^{1 \\times p}$ and let $y \\in \\{0,1\\}^N$ contain the binary targets. If a bias is used, either augment each feature vector with a leading 1 and absorb the bias into $w$, or keep the bias separate. For coding convenience the former approach is preferred since it streamlines calculations (see here). The explanation below keeps the bias separate.\nDefinition 1 (Sigmoid). For each sample, logistic regression maps the linear score to a Bernoulli probability. For calculating probabilities we apply the sigmoid function: \\begin{equation} \\label{eq:logistic-sigmoid} z_i = x_i^\\top w + b, \\qquad p_i = \\sigma(z_i) = \\frac{1}{1+e^{-z_i}}. \\end{equation} Here $\\sigma$ denotes the sigmoid (or logistic) function, which maps the real-valued score $z_i$ to a number in $(0,1)$.\nThe sigmoid derivative is $$ \\sigma\u0026rsquo;(z) = \\sigma(z)\\bigl(1-\\sigma(z)\\bigr). $$\nTo obtain binary outputs, a common approach is to apply thresholding at $0.5$: $$ \\hat y_i = \\begin{cases} 1, \u0026amp; p_i \u0026gt; 0.5, \\\\ 0, \u0026amp; \\text{otherwise}. \\end{cases} $$\nDerivation\u0026nbsp;(Sigmoid Derivative) Start from the sigmoid definition:\n$$ p_i = \\sigma(z_i) = \\frac{1}{1+e^{-z_i}}. $$\nDifferentiate with respect to $z_i$:\n$$ \\frac{\\partial p_i}{\\partial z_i} = \\frac{\\partial}{\\partial z_i}(1+e^{-z_i})^{-1} = -(1+e^{-z_i})^{-2}\\frac{\\partial}{\\partial z_i}(1+e^{-z_i}). 
$$\nSince:\n$$ \\frac{\\partial}{\\partial z_i}(1+e^{-z_i}) = -e^{-z_i}, $$\nsubstituting gives:\n$$ \\frac{\\partial p_i}{\\partial z_i} = -(1+e^{-z_i})^{-2}(-e^{-z_i}) = \\frac{e^{-z_i}}{(1+e^{-z_i})^2}. $$\nNow rewrite the result in terms of $p_i$. Because\n$$ p_i = \\frac{1}{1+e^{-z_i}}, \\qquad 1-p_i = 1-\\frac{1}{1+e^{-z_i}} = \\frac{e^{-z_i}}{1+e^{-z_i}}, $$\nwe have:\n$$ p_i(1-p_i) = \\frac{1}{1+e^{-z_i}}\\frac{e^{-z_i}}{1+e^{-z_i}} = \\frac{e^{-z_i}}{(1+e^{-z_i})^2}. $$\nTherefore,\n$$ \\frac{\\partial p_i}{\\partial z_i} = p_i(1-p_i). $$\n$\\blacksquare$ Objective Definition 2 (Negative Log-Likelihood). Now, to train a model to predict probabilities, logistic regression minimizes the Bernoulli negative log-likelihood: \\begin{equation} \\label{eq:logistic-objective} L(w,b) = -\\frac{1}{N}\\ell(w,b) = -\\frac{1}{N}\\sum_{i=1}^N \\left[y_i \\log p_i + (1-y_i)\\log(1-p_i)\\right]. \\end{equation} With $z = Xw + b\\mathbf 1$ and $p = \\sigma(z)$, the same loss is: $$ L(w,b) = -\\frac{1}{N}\\left(y^\\top \\log p + (\\mathbf 1-y)^\\top \\log(\\mathbf 1-p)\\right). $$\nDerivation For one sample: $$ \\ell_i = -\\left[y_i\\log p_i + (1-y_i)\\log(1-p_i)\\right]. $$ Differentiate with respect to $p_i$: $$ \\frac{\\partial \\ell_i}{\\partial p_i} = -\\frac{y_i}{p_i} + \\frac{1-y_i}{1-p_i} = \\frac{p_i-y_i}{p_i(1-p_i)}. $$ Since: $$ \\frac{\\partial p_i}{\\partial z_i} = p_i(1-p_i), $$\nthe chain rule gives: \\begin{equation} \\label{eq:logistic-dlogit} \\frac{\\partial \\ell_i}{\\partial z_i} = \\frac{\\partial \\ell_i}{\\partial p_i}\\frac{\\partial p_i}{\\partial z_i} = p_i-y_i. \\end{equation} Here the sigmoid and cross-entropy combine so that the extra factor $p_i(1-p_i)$ cancels. $\\blacksquare$\nResult 1 (Stable Per-Sample Form). 
Implementation-wise, a more stable form is obtained by writing the sample loss as a function of the logit $z_i = x_i^\\top w + b$ and the label $y_i$:\nFor one sample with $p_i = \\sigma(z_i)$ and $y_i \\in \\{0,1\\}$, the stable per-sample loss is: \\begin{equation*} \\ell_i(z_i,y_i) = \\log(1+e^{z_i}) - y_i z_i = \\operatorname{softplus}(z_i) - y_i z_i. \\end{equation*} This form avoids evaluating $\\log(\\sigma(z_i))$ and $\\log(1-\\sigma(z_i))$ separately when $|z_i|$ is large. (In practice, $\\operatorname{softplus}$ itself is evaluated in the overflow-safe form $\\max(z_i,0)+\\log(1+e^{-|z_i|})$.)\nDerivation Start from the Bernoulli negative log-likelihood for one sample: $$ \\ell_i = -\\left[y_i\\log p_i + (1-y_i)\\log(1-p_i)\\right]. $$ Using $$ p_i = \\sigma(z_i) = \\frac{1}{1+e^{-z_i}} = \\frac{e^{z_i}}{1+e^{z_i}}, \\qquad 1-p_i = \\frac{1}{1+e^{z_i}}, $$ substitute into the loss: $$ \\ell_i = -\\left[y_i \\log\\left(\\frac{e^{z_i}}{1+e^{z_i}}\\right) + (1-y_i)\\log\\left(\\frac{1}{1+e^{z_i}}\\right)\\right]. $$ Expanding the logarithms gives: $$ \\log\\left(\\frac{e^{z_i}}{1+e^{z_i}}\\right) = z_i - \\log(1+e^{z_i}), \\qquad \\log\\left(\\frac{1}{1+e^{z_i}}\\right) = -\\log(1+e^{z_i}), $$ so: $$ \\ell_i = -\\left[y_i\\bigl(z_i-\\log(1+e^{z_i})\\bigr) - (1-y_i)\\log(1+e^{z_i})\\right]. $$ Collect the two log terms: $$ y_i\\log(1+e^{z_i}) + (1-y_i)\\log(1+e^{z_i}) = \\log(1+e^{z_i}), $$ which yields: $$ \\ell_i(z_i,y_i) = \\log(1+e^{z_i}) - y_i z_i. $$ Finally, by definition of the softplus function, $$ \\operatorname{softplus}(z_i) := \\log(1+e^{z_i}), $$ so: $$ \\ell_i(z_i,y_i) = \\operatorname{softplus}(z_i) - y_i z_i. $$\n$\\blacksquare$ Optimization First-order methods are the usual choice for logistic regression. For smaller, more difficult problems, second-order methods such as Newton\u0026rsquo;s method and IRLS are also common because they can converge in fewer iterations, at the cost of more computation per iteration.\nFirst-Order Methods Gradient. Stack the logits into $z \\in \\mathbb{R}^N$ and the probabilities into $p \\in \\mathbb{R}^N$. 
Using equation $\\eqref{eq:logistic-dlogit}$ and $z = Xw + b\\mathbf 1$, the gradient formulas are: \\begin{equation} \\label{eq:logistic-gradient} \\nabla_w L(w,b) = \\frac{1}{N}X^\\top(p-y), \\qquad \\frac{\\partial L}{\\partial b} = \\frac{1}{N}\\sum_{i=1}^N (p_i-y_i). \\end{equation}\nMotivation\u0026nbsp;(Backprop View) For one sample, it is useful to think of logistic regression as a one-layer model: $$ z_i = x_i^\\top w + b, \\qquad p_i = \\sigma(z_i), \\qquad \\ell_i = \\ell_i(p_i,y_i). $$ Then the chain rule gives: $$ \\frac{\\partial \\ell_i}{\\partial w} = \\frac{\\partial \\ell_i}{\\partial p_i} \\frac{\\partial p_i}{\\partial z_i} \\frac{\\partial z_i}{\\partial w}. $$ From equation $\\eqref{eq:logistic-dlogit}$, $$ \\frac{\\partial \\ell_i}{\\partial z_i} = p_i-y_i. $$ Also, $$ \\frac{\\partial z_i}{\\partial w} = x_i, \\qquad \\frac{\\partial z_i}{\\partial b} = 1. $$ So for one sample, $$ \\frac{\\partial \\ell_i}{\\partial w} = (p_i-y_i)x_i, \\qquad \\frac{\\partial \\ell_i}{\\partial b} = p_i-y_i. $$ Averaging over all samples gives: $$ \\nabla_w L(w,b) = \\frac{1}{N}\\sum_{i=1}^N (p_i-y_i)x_i = \\frac{1}{N}X^\\top(p-y), \\qquad \\frac{\\partial L}{\\partial b} = \\frac{1}{N}\\sum_{i=1}^N (p_i-y_i). $$ This is the same chain-rule pattern used in backpropagation; logistic regression is just the single-neuron case with sigmoid output. $\\blacksquare$\nNote that $X^\\top(p-y) \\in \\mathbb{R}^p$, so the gradient has the same shape as the weight vector. If an $L_2$ penalty $\\frac{\\lambda}{2N}\\lVert w\\rVert_2^2$ is added, then $$ \\nabla_w L_{\\mathrm{reg}}(w,b) = \\frac{1}{N}X^\\top(p-y) + \\frac{\\lambda}{N}w, $$ while the bias derivative is commonly left unregularized.\nUpdate Rule. 
Applying gradient descent to equation $\\eqref{eq:logistic-objective}$ using equation $\\eqref{eq:logistic-gradient}$ gives: \\begin{equation} \\label{eq:logistic-update} w_{t+1} = w_t - \\eta \\frac{1}{N}X^\\top(p_t-y), \\qquad b_{t+1} = b_t - \\eta \\frac{1}{N}\\sum_{i=1}^N (p_{t,i}-y_i). \\end{equation} With $L_2$ regularization, the weight update becomes: $$ w_{t+1} = w_t - \\eta \\left[\\frac{1}{N}X^\\top(p_t-y) + \\frac{\\lambda}{N}w_t\\right]. $$\nSecond-Order Methods For Newton\u0026rsquo;s method and IRLS (iteratively reweighted least squares), use the augmented matrix $\\tilde X = [\\mathbf 1; X]$, the augmented parameter $\\tilde w = (b,w)$, and the diagonal matrix $$ S := \\operatorname{diag}(p_i(1-p_i)). $$\nHessian. The Hessian of the unregularized loss is: \\begin{equation} \\label{eq:logistic-hessian} \\nabla^2 L(\\tilde w) = \\frac{1}{N}\\tilde X^\\top S\\tilde X. \\end{equation} Because $S$ has nonnegative diagonal entries, equation $\\eqref{eq:logistic-hessian}$ is positive semidefinite1, so the objective is convex (hence any local minimum is global). This is the matrix used by Newton\u0026rsquo;s method and IRLS.\nNewton Step. Using equation $\\eqref{eq:logistic-hessian}$, a Newton step has the form: $$ \\tilde w_{t+1} = \\tilde w_t - (\\tilde X^\\top S_t \\tilde X)^{-1}\\tilde X^\\top(p_t-y), $$ which is the basis of IRLS. In difficult problems it often converges in fewer iterations than plain gradient descent, but each step is more expensive because it involves solving a linear system.\nImplementation notes. The main points are:\nFor stability, the predicted probabilities are clamped to $[10^{-12}, 1-10^{-12}]$ before taking logarithms. In Newton’s method, a damping term is used, replacing $H$ with $H+\\varepsilon I$ for $\\varepsilon=10^{-4}$, to stabilize the linear solver when the Hessian is ill-conditioned (i.e. when the columns of $X$ are nearly linearly dependent). 
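The damping described above can be sketched in isolation. The following is a minimal NumPy sketch, not code from the library: the synthetic data, variable names, and problem size are illustrative. It performs a single damped Newton step on a tiny logistic regression problem and evaluates the clamped cross-entropy before and after:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(X, y, w):
    # Binary cross-entropy with probabilities clamped, as in the notes above.
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
# Tiny synthetic problem: 32 samples, bias absorbed via a leading column of ones.
X = np.column_stack([np.ones(32), rng.normal(size=32)])
y = (X[:, 1] + 0.3 * rng.normal(size=32) > 0).astype(float)

# One damped Newton step: solve (H + eps*I) step = g instead of H step = g.
w = np.zeros(2)
p = sigmoid(X @ w)
g = X.T @ (p - y)                 # gradient of the (unnormalized) loss
S = np.diag(p * (1 - p))          # Bernoulli variance weights
H = X.T @ S @ X                   # Hessian
eps = 1e-4                        # damping term
step = np.linalg.solve(H + eps * np.eye(H.shape[0]), g)
w_new = w - step

loss_before, loss_after = nll(X, y, w), nll(X, y, w_new)
```

Starting from $w = 0$ the loss is $\log 2 \approx 0.693$; a single damped step should already reduce it, and the damping keeps the solve well-posed even if the columns of $X$ were nearly dependent.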
Experiments confirm that Newton’s method converged faster, tolerating a full step $\\eta=1.0$, while gradient descent converged more slowly even with a relatively large learning rate, $\\eta=0.1$. A symmetric matrix $A$ is positive semidefinite if $v^\\top A v \\ge 0$ for every vector $v$, so its quadratic form is never negative.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://rfvasile.github.io/posts/experiments/logistic_regression/","summary":"Binary logistic regression as a Bernoulli model: setup, cross-entropy objective, and gradient-based fitting.","title":"Logistic Regression"},{"content":" Update (2026-03-14): See GitHub for a NumPy/Torch implementation of the linear regressor described below. Linear regression estimates a linear function that maps each input vector to a target variable. In ordinary least squares, the model parameters are estimated by minimizing the sum of squared errors between predicted and observed values. The gradient descent update rule follows directly from this loss function, making linear regression a sensible initial example. For a more thorough review, see (Hastie et al. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. https://books.google.it/books?id=eBSgoAEACAAJ ).\nSolution\u0026nbsp;(Python sketch) The following pseudocode provides a broad idea of the derivations described below. Full code available here.\nfor _ in range(self.n_iters):\n    # Linear model forward pass and calculate MSE.\n    y_pred = X.dot(self.w)\n    error = y - y_pred\n    L = np.mean(error**2) + self.regularization(self.w)\n    self.loss_values.append(L)\n    # Gradient of mean squared error: dL/dw = -(2/n) X^T (y - y_hat).\n    dL_dw = -2 * X.T.dot(error) / X.shape[0] + self.regularization.grad(self.w)\n    # Update weights\n    self.w -= self.lr * dL_dw\nFigure\u0026nbsp;1: A linear model imposes a single affine decision boundary in the input space. (Image Source: Hastie et al. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. https://books.google.it/books?id=eBSgoAEACAAJ , pp.13) Setup. Let $X \\in \\mathbb{R}^{N \\times p}$ contain rows $x_i^\\top$ and let $y \\in \\mathbb{R}^N$ contain the targets. If an intercept (bias term) is added, either augment each feature vector with a leading 1 and absorb the bias into $w$, or keep the bias separate. For simplicity, we choose the former approach, as it simplifies the code.\nDefinition 1 (Linear Predictor). For each sample, linear regression predicts by \\begin{equation} \\hat y_i = x_i^\\top w + b. \\end{equation}\nDefinition 2 (Squared Error Objective). Given the sample $(X,y)$, ordinary least squares estimates $\\hat w = \\arg\\min_w L(w)$ (this is the optimization problem being solved) through the squared-error objective below. \\begin{equation} \\label{eq:ols-objective} L(w) = \\frac{1}{N}\\lVert y-Xw \\rVert_2^2 = \\frac{1}{N}\\sum_{i=1}^N (y_i-x_i^\\top w)^2. \\end{equation} One key term is the residual vector $$ r = Xw-y $$ as it is the main intermediate quantity: it has shape $N$, and both the objective and gradient are built from it.\nDerivation. Expanding equation $\\eqref{eq:ols-objective}$ gives \\begin{equation*} L(w) = \\frac{1}{N}(y-Xw)^\\top(y-Xw) = \\frac{1}{N}\\left(y^\\top y - 2y^\\top Xw + w^\\top X^\\top Xw\\right). \\end{equation*} Differentiating equation $\\eqref{eq:ols-objective}$ with respect to $w$ gives \\begin{equation} \\label{eq:ols-gradient} \\nabla_w L(w) = -\\frac{2}{N}X^\\top y + \\frac{2}{N}X^\\top Xw = \\frac{2}{N}X^\\top(Xw-y). 
\\end{equation} Note that $X^\\top(Xw-y)\\in\\mathbb{R}^p$, so the gradient has the same shape as the parameter vector, a useful sanity check when coding. For a general backpropagation theory beyond linear models, see Deep Learning (Goodfellow et al. (2016). Deep Learning. MIT Press.), pp. 212-213. $\\blacksquare$\nClosed Form. Setting the gradient in equation $\\eqref{eq:ols-gradient}$ to zero gives the normal equations1; when $X^\\top X$ is invertible, they yield the closed-form estimator below: \\begin{equation} \\label{eq:ols-normal} X^\\top X \\hat w = X^\\top y. \\end{equation} Because OLS2 minimizes the differentiable quadratic loss in equation $\\eqref{eq:ols-objective}$, any minimizer must satisfy $\\nabla_w L(w)=0$. If $X^\\top X$ is invertible, equation $\\eqref{eq:ols-normal}$ gives the closed form solution: \\begin{equation} \\label{eq:ols-closed-form} \\hat w = (X^\\top X)^{-1}X^\\top y. \\end{equation} For coding, the linear system can be solved via QR or SVD, which is more numerically stable than explicitly forming the inverse.\nImplementation Note The normal equations (exact solution), QR, SVD, and gradient descent are different ways to compute the estimator. If $X^\\top X$ is invertible, the inverse formula, QR and SVD all give the same solution in exact arithmetic; in practice QR or SVD are preferred for their numerical stability.\nRegularization is different: once a penalty is added to the loss function, the estimator is no longer the common OLS problem but a modified objective, usually named ridge regression.\nGradient Update. Applying gradient descent to equation $\\eqref{eq:ols-objective}$ using equation $\\eqref{eq:ols-gradient}$ gives \\begin{equation} \\label{eq:ols-update} w_{t+1} = w_t - \\eta \\nabla_w L(w_t) = w_t - \\eta \\frac{2}{N}X^\\top(Xw_t-y). 
\\end{equation} If the bias is kept separate instead of being merged into $X$, the gradients become $$ \\nabla_w L = \\frac{2}{N}X^\\top(\\hat y-y), \\qquad \\frac{\\partial L}{\\partial b} = \\frac{2}{N}\\sum_{i=1}^N (\\hat y_i-y_i). $$ The bias derivative is just a scaled sum of the residuals because $\\hat y_i = x_i^\\top w + b$ gives $\\partial \\hat y_i / \\partial b = 1$ for every sample.\nCaveats. The algebra above is fairly simple, and other problems follow the same broad steps (see the logistic regressor). Nonetheless, this estimator can still be a poor fit when the true relationship is nonlinear or the problem is numerically ill-conditioned. If columns of $X$ are nearly collinear, $X^\\top X$ becomes ill-conditioned and the closed-form solution is unstable. Squared loss also magnifies large residuals, so a small number of outliers can dominate.\nIf the true conditional mean $\\mathbb{E}[Y \\mid X = x]$ is nonlinear in $x$, OLS still returns only the best linear approximation in the chosen feature space. In this case, $\\mathbb{E}[Y \\mid X = x]$ is the true mean of $Y$ conditional on $X = x$; that is, the average outcome obtained by fixing $X = x$ and sampling repeatedly from the population.\nConnection. Ridge regression keeps the same setup and gradient, but replaces $X^\\top X$ with $X^\\top X + \\lambda I$. 
That is the minimal modification when plain ordinary least squares is too unstable.\nA normal equation is a closed-form (exact) solution for calculating the parameter vector.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nOrdinary least squares (OLS) is a linear regression method in which the model parameters are estimated by minimizing the squared-error loss.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://rfvasile.github.io/posts/experiments/linear_regression/","summary":"Ordinary least squares as a linear mean model: setup, normal equations, and gradient updates.","title":"Linear Regression"},{"content":" Q:\u0026nbsp;How can I get an update when a new post comes out?\nIt is possible to subscribe to the RSS feed to be notified via email. Q:\u0026nbsp;How to find interesting Wikipedia articles?\nPetScan is a useful tool to filter and analyse Wikipedia articles based on categories and other criteria. It is a great tool to discover new topics to read about. Q:\u0026nbsp;How do you organize your code snippets?\nI use Github Gists to store code that is useful to reuse across projects. Q:\u0026nbsp;Blogs that are particularly interesting?\nHuggingface Blog Deeplearning.ai ","permalink":"https://rfvasile.github.io/resources/","summary":"A list of resources I find useful.","title":"FAQ"},{"content":"Neurosymbolic Reasoning and Constraint Satisfaction This group highlights work at the intersection of artificial intelligence and logic.\nMaster's Thesis: Distilling Neurosymbolic Reasoning for Linear Algebra in Small Language Models GitHub · Report · Poster · Demo · HF Space The main drawback of using generative AI models for advanced mathematics is that they are not deterministic logical reasoning engines. Because neural models are probabilistic, they can produce outputs that look convincing while still containing subtle errors that are hard to detect without careful algorithmic verification. 
Symbolic methods can largely address this issue by performing exact, deterministic calculations through code execution. However, solving a full problem typically requires an explicit plan: the correct sequence of tool invocations and the dependencies between intermediate results must be specified, and this often requires human intervention. Our premise is that this planning burden can be reduced by using neural models to orchestrate tool use, while symbolic solvers provide the exact computation.\nIn this thesis, we demonstrate an end-to-end workflow that combines neural models with symbolic solvers to solve linear algebra problems through tool-use interactions, in a controlled, verifiable setting with a small, audited tool library. Our results show that, starting from a small pre-trained base model (Qwen2.5-3B), it is possible to achieve 90% test-set accuracy (verifier-checked on a fixed held-out evaluation set) on problem traces requiring up to three tool interactions.\nThe pipeline includes synthetic dataset generation, distillation, supervised fine-tuning (SFT), and reinforcement learning via Group Sequence Policy Optimization (GSPO). Using parameter-efficient fine-tuning (LoRA) and on-demand cloud GPUs, the full pipeline is reproducible within a $75 budget. This provides a concrete recipe for practitioners to train self-hostable tool-using models, and a pedagogical blueprint for students learning to build tool-calling agents beyond prompt engineering.\nBachelor's Thesis: Solving Sudoku Using Propositional Logic GitHub · Report · Presentation This project replicates algorithms and techniques used to visualize the structure of difficult Sudoku puzzles. 
The approach involves implementing an efficient CDCL SAT solver, incorporating well-known procedures such as Backjumping and Conflict-Driven Clause Learning, alongside two decision heuristics: Variable State Independent Decaying Sum (VSIDS) and Largest Individual Sum (LIS).\nAdditionally, we encode the Sudoku puzzle into SAT format and implement dynamic visualization of the resolution process. This provides fine-grained insights into the solver’s state, the sequence of logical decisions, and the propagation and conflict resolution mechanisms as they occur.\nFinally, we perform an ablation study to assess the impact of the two decision heuristics (VSIDS vs. LIS) on solver performance. We also profile the application’s performance using flame graphs, comparing two separate SAT encodings: minimal and extended.\nDeep Learning and training pipelines This group focuses on deep learning including training pipelines and low-level programming.\nDistributed and Parallel Techniques for Deep Neural Networks GitHub · Report · Poster This paper presents a systematic literature review of distributed and parallel techniques for running deep neural networks on multiple machines and GPUs. It is composed of three parts: 1) a review of the available libraries that enable distributed training across GPU clusters, 2) a review of the most popular frameworks that facilitate parallelizing the training process on GPUs, and 3) a practical section that demonstrates a proof-of-concept implementation of the training process using common frameworks, namely PyTorch DDP and cuDNN.\nThe review synthesizes research from the past decade, examining various approaches to distributed training, their effectiveness, and implementation challenges. 
The work is aimed at students and practitioners, with the goal of providing an introduction to the topic and a general idea of the most common libraries in each domain.\nThe distributed experiments use data parallelism to accelerate the training process, while the GPU experiments use cuDNN, cuBLAS and manual kernel implementations to train a small network. The effectiveness of each approach is demonstrated, and Docker environments are provided to aid reproduction and experimentation. This makes it possible to simulate a multi-GPU setup on a single Nvidia GPU, promoting ease of use by not relying on cloud services.\nVisual QA: Using Generative Models on Classification Tasks GitHub · Report · Poster · W\u0026amp;B A lot of attention has been given in recent years to the development of complex architectures designed to integrate multimodal capabilities. As existing infrastructures are extended with increasingly intricate modules, some of the more common use-cases are not immediately covered. While popular libraries like Transformers offer rich APIs, they often lack direct support for common classification tasks when using multimodal models like BLIP-2. To address this limitation, I adapt the BLIP-2 architecture, which was originally designed for question-answering, to perform classification tasks within the Transformers library.\nThe algorithm is evaluated using two datasets: Easy-VQA and Daquar. Easy-VQA contains simple questions about geometric shapes, while Daquar is more challenging, requiring answers to questions about objects in indoor scenes. The model achieves 91% accuracy on Easy-VQA and 78% accuracy on Daquar, outperforming generative baselines on both benchmarks. 
UMAP visualization of the learned features confirms that the model is effectively capturing semantic distinctions, particularly for the simpler geometric cases.\nAgents, vector stores and tool use This category covers broad tool-using pipelines, where agents interact with external environments.\nGit Inspector: Querying GitHub Repositories with Local LLMs GitHub · Report · Poster · Demo · Website Navigating large, unfamiliar codebases can be a significant challenge for developers, often leading to a steep learning curve and inefficient debugging processes. However, recent advances in machine learning offer promising ways of addressing this problem. Building on these advances, this work presents a Retrieval Augmented Generation (RAG) pipeline designed to facilitate code retrieval and understanding within GitHub repositories. The approach empowers LLMs by combining non-parametric memory (retrieved code snippets) with parametric memory (pre-trained LLM weights) to generate insightful, context-aware answers.\nThe project emphasizes the engineering process, adhering to the agile methodology and documenting the development process. Notably, the system utilizes open-source technologies such as Ollama and Qdrant, enabling the utilization of various open LLMs through local indexing and retrieval of code snippets without reliance on proprietary services.\nThe work aims to reduce the steep learning curve associated with understanding large codebases, and provide insightful explanations for complex coding concepts. The library is built in Scala, on top of the Langchain4j framework, and facilitates integration with the LLM through interfaces built with Gradio and Scala.js. A usability study validated the interface, achieving an average SUS score of 85%.\nLibrarian Assistant GitHub · Report · Poster · Demo Large Language Models (LLMs) have gained significant popularity in recent years due to their remarkable question answering capabilities. 
However, when tackling a large corpus of text, the quality of the answers varies, largely due to the model’s inability to focus on contextualized information. This may lead to less accurate answers, poor handling of long-tail questions and exposure bias to the data it was pre-trained on. I present a creative approach to tackle these challenges by employing data-agents powered through LLMs.\nThese agents employ complex workflows to intelligently perform operations over the knowledge base. These operations can be characterized as follows: 1) decompose the task into a series of function calls (thoughts), 2) employ multiple fetch operations over the knowledge base to retrieve relevant information (actions), 3) summarize at each step the extracted information to facilitate the final aggregation (observations) and 4) synthesize a final answer by combining the results. The project supports the adoption of Open LLMs, making the library usable freely without the financial burden of using proprietary providers.\nQuestLlama: An Autonomous Agent in Minecraft GitHub · Presentation · Demo This project extends QuestLlama, a Voyager-based autonomous agent capable of completing complex in-game tasks in Minecraft through retrieval-augmented generation and code execution. The system leverages open-source Large Language Models (LLMs) to generate and execute Python code, allowing for dynamic interaction with the game environment. A key contribution is the integration of local-model backends via Ollama and OpenAI-compatible APIs, enabling experimentation and deployment without reliance on proprietary cloud providers. 
This approach demonstrates the feasibility of using lightweight, locally-hosted models for autonomous agent tasks that traditionally require heavy, closed-source infrastructure.\nDeveloper tools This section illustrates software engineering work aimed at improving developer workflows.\nDocker UI GitHub · Report · Figma · Docker Hub This project presents a web-based interface for the management and orchestration of Docker Compose environments. The system interacts with the Docker daemon via HTTP API to enable container lifecycle management, log inspection, and environment configuration directly through a browser. Its purpose is to reduce the steep learning curve associated with container management, with a web interface that abstracts command-line operations into visual controls. Usability and interface design were refined through iterative prototyping in Figma and validated via user feedback questionnaires.\n","permalink":"https://rfvasile.github.io/projects/","summary":"\u003ch2 id=\"neurosymbolic-reasoning-and-constraint-satisfaction\"\u003eNeurosymbolic Reasoning and Constraint Satisfaction\u003c/h2\u003e\n\u003cp\u003eThis group highlights work at the intersection of artificial intelligence and logic.\u003c/p\u003e\n\u003cdetails class=\"statement statement--foldable\" open\u003e\n  \u003csummary\u003e\u003cstrong\u003eMaster's Thesis: Distilling Neurosymbolic Reasoning for Linear Algebra in Small Language Models\u003c/strong\u003e\u003c/summary\u003e\n  \u003cdiv class=\"statement__content statement--plain statement__content--with-subheader\"\u003e\n    \u003cdiv class=\"statement__subheader\"\u003e\u003ca href=\"https://github.com/rfvasile/linalg-zero\"\u003eGitHub\u003c/a\u003e · \u003ca href=\"https://github.com/rfvasile/linalg-zero/blob/main/docs/report.pdf\"\u003eReport\u003c/a\u003e · \u003ca href=\"https://github.com/rfvasile/linalg-zero/blob/main/docs/poster.pdf\"\u003ePoster\u003c/a\u003e · \u003ca 
href=\"https://www.youtube.com/watch?v=Dxc3yTr-AE0\"\u003eDemo\u003c/a\u003e · \u003ca href=\"https://huggingface.co/spaces/rfvasile/linalg-zero\"\u003eHF Space\u003c/a\u003e\u003c/div\u003e\n\u003cp\u003eThe main drawback of using generative AI models for advanced mathematics is that they are not deterministic logical reasoning engines. Because neural models are probabilistic, they can produce outputs that look convincing while still containing subtle errors that are hard to detect without careful algorithmic verification. Symbolic methods can largely address this issue by performing exact, deterministic calculations through code execution. However, solving a full problem typically requires an explicit plan: the correct sequence of tool invocations and the dependencies between intermediate results must be specified, and this often requires human intervention. Our premise is that this planning burden can be reduced by using neural models to orchestrate tool use, while symbolic solvers provide the exact computation.\u003c/p\u003e","title":"Projects"}]