[{"content":" A prototype of Bayesweeper can be found at https://bayesweeper.vercel.app/\nIntroduction For those unfamiliar with Minesweeper, I suggest reading up on it and trying it out for yourself. Briefly, it is a game that involves uncovering hidden tiles on a rectangular board without opening any tiles that have a mine hidden underneath them. Every uncovered tile reveals a very important piece of information: the number of mines in its neighboring tiles. This number can range from 0 to 8, since any tile has at most 8 neighboring tiles (essentially the cardinal and ordinal directions). Intelligent Minesweeper play traditionally involves uncovering tiles that are logically forced to be mine-free. In particular, the player may construct a mini-proof that having a mine under a particular tile would contradict the information revealed thus far, thus making that tile safe to uncover.\nOne of the things about the Windows version of Minesweeper that has always bothered me is the existence of so-called 50-50 scenarios: scenarios where there are two possible underlying configurations of the hidden tiles, both entirely consistent with all information revealed thus far. More generally, these are called Forced Guess scenarios. For example, there exist situations in common implementations of Minesweeper in which not every tile is equally likely to be safe if we give equal weight to all consistent configurations. The easiest illustrative example is the \u0026#34;Finned Box\u0026#34; configuration:\nThis configuration admits the following valid configurations for the hidden tiles:\nM (empty), (empty) (empty) M\nM (empty), (empty) M (empty)\n(empty) M, M (empty) (empty)\nIf we assume that each of the three configurations above has a $\\frac{1}{3}$ chance of being the \u0026#34;true\u0026#34; state, the induced probability that any given hidden tile is a mine is uneven across the hidden tiles. 
For example, tiles in positions (4, 4), (5, 3), and (5, 5) (using 1-indexing) each have two configurations where they don\u0026#39;t have an underlying mine, so each has a $\\frac{1}{3}$ chance of being a mine. Meanwhile, tiles in positions (4, 3) and (5, 4) each have probability $\\frac{2}{3}$ of being a mine.\nForcing a guess out of the player has always felt unfair and against the spirit of a logic challenge. There are several no-guess implementations of Minesweeper available these days. These implementations ensure that at every time step $t$, the board has at least one hidden tile that is logically guaranteed to be mine-free. Playing some of these no-guess options got me thinking: would it be possible to develop a variant of Minesweeper that preserves the existence of Forced Guess scenarios, but instead of punishing players for making Forced Guesses, rewards them for being good Bayesians?\nDesign The general idea with Bayesweeper is to reward players for making probabilistically optimal guesses, and to optionally punish them for making suboptimal guesses.\nBackground Let $X = (k, S)$ denote the state of the board, where $k$ is the total number of mines on the board, and $S$ is a matrix of dimension $m \\times n$ where each cell contains the state of the corresponding tile: either \u0026#34;hidden\u0026#34; or the number of neighboring mines. It turns out that if we start playing the game with the prior that every board state with the correct dimensions and mine count is equally likely, Bayes\u0026#39; theorem reveals a neat posterior rule.\nIf $x$ is a hidden tile and $S$ is the current board, we have the following statement:\n$$P(x \\text{ is not a mine} \\mid S) = \\dfrac{P(S \\mid x \\text{ is not a mine}) \\cdot P(x \\text{ is not a mine})}{P(S)}$$\nLet $\\Omega$ be the set of all possible fully revealed configurations for the fixed number of mines $k$ and the board shape $m \\times n$. 
Then $P(S)$ is the number of those configurations that are consistent with $S$ divided by the size of $\\Omega$:\n$$P(S) = \\dfrac{|\\{\\omega \\in \\Omega : \\omega \\text{ is consistent with } S\\}|}{|\\Omega|}$$\nThe likelihood $P(S \\mid x \\text{ is not a mine})$ is given by the number of states in $\\Omega$ that are consistent with $S$ and in which $x$ is not a mine, divided by the number of states in $\\Omega$ in which $x$ is not a mine:\n$$P(S \\mid x \\text{ is not a mine}) = \\dfrac{|\\{\\omega \\in \\Omega : \\omega \\text{ is consistent with } S \\text{ and } x \\text{ is not a mine in } \\omega\\}|}{|\\{\\omega \\in \\Omega : x \\text{ is not a mine in } \\omega\\}|}$$\nFinally, the prior is\n$$P(x \\text{ is not a mine}) = \\dfrac{|\\{\\omega \\in \\Omega : x \\text{ is not a mine in } \\omega\\}|}{|\\Omega|}$$\nTo keep symbolic manipulation concise, we define the following shorthands:\n$A_x(\\omega) := x \\text{ is not a mine in } \\omega$ $B_S(\\omega) := \\omega \\text{ is consistent with } S$ Then putting everything together, we get:\n$$P(x \\text{ is not a mine} \\mid S) = \\dfrac{|\\{\\omega \\in \\Omega : A_x(\\omega) \\text{ and } B_S(\\omega)\\}| \\cdot |\\{\\omega \\in \\Omega : A_x(\\omega)\\}| \\cdot |\\Omega|}{|\\{\\omega \\in \\Omega : A_x(\\omega)\\}| \\cdot |\\Omega| \\cdot |\\{\\omega \\in \\Omega : B_S(\\omega)\\}|}$$\nCancelling out like terms, we get\n$$P(x \\text{ is not a mine} \\mid S) = \\dfrac{|\\{\\omega \\in \\Omega : A_x(\\omega) \\text{ and } B_S(\\omega)\\}|}{|\\{\\omega \\in \\Omega : B_S(\\omega)\\}|}$$\nIn short, the probability that the tile $x$ is not a mine, given the board state $S$, is the number of consistent configurations in which $x$ is safe divided by the total number of configurations consistent with $S$.\nThe Twist: Interventionist Creator The main twist introduced by Bayesweeper is a benevolent creator that perfectly rewards optimal play and (optionally) punishes suboptimal play. 
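This counting rule can be checked directly by brute-force enumeration on tiny boards. Below is a minimal Python sketch written for this post; the function and variable names are illustrative and not taken from the prototype, which uses a far more efficient engine.

```python
from itertools import combinations

def safe_posteriors(m, n, k, revealed):
    """P(tile is safe | board S) by brute-force enumeration over Omega.

    revealed: dict mapping (row, col) -> neighbor-mine count.
    Only feasible for tiny boards; real engines avoid full enumeration.
    """
    cells = [(r, c) for r in range(m) for c in range(n)]
    hidden = [rc for rc in cells if rc not in revealed]

    def neighbors(rc):
        r, c = rc
        return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0) and 0 <= r + dr < m and 0 <= c + dc < n]

    # collect all configurations omega consistent with S
    consistent = []
    for mines in combinations(hidden, k):  # revealed tiles are never mines
        mine_set = set(mines)
        if all(sum(nb in mine_set for nb in neighbors(rc)) == cnt
               for rc, cnt in revealed.items()):
            consistent.append(mine_set)

    # posterior = (# consistent configs where x is safe) / (# consistent configs)
    return {x: sum(x not in cfg for cfg in consistent) / len(consistent)
            for x in hidden}
```

For instance, on a 2x2 board with one mine and the corner (0, 0) revealed showing 1, every hidden tile is a mine in exactly one of the three consistent configurations, so each is safe with probability 2/3.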
When the player starts by clicking anywhere on the board, a small, local portion of the tiles is revealed. To make sure that this initial state is even feasible, we check that it is consistent with at least one global configuration. Next, for each tile, we maintain the probability that it is a safe tile given the current board state, as determined by the formula from the previous subsection. We define the set of optimal tiles $M(S)$:\n$$M(S) = \\underset{x \\, \\in \\, \\text{hidden tiles}}{\\text{argmax}} \\; P(x \\text{ is not a mine} \\mid S)$$\nIf the player chooses a tile $x$ belonging to $M(S)$, we force the distribution to shrink by randomly sampling a configuration that is consistent with $S$ and in which $x$ is safe. That is the \u0026#34;perfect\u0026#34; reward. Suppose the player chooses a tile outside $M(S)$. We have two options:\nHarsh: the distribution immediately collapses to a configuration in which $x$ is a mine. This mode is punishing and would be extremely difficult to win.\nLenient: we sample the outcome for $x$ according to its computed posterior. This is a simpler mode that allows slightly suboptimal players to still have a chance. It would be a very interesting task to train an RL agent to get good at this mode.\nThe Implementation A fully functional prototype was created entirely with Cursor\u0026#39;s plan mode, which can be tried now here.\nI described the game\u0026#39;s specification and let Cursor figure out the implementation details. I also mentioned that I wanted it to be a webapp in order to make it accessible. My biggest concern going into it was how it would resolve the complexity associated with enumeration. Naively, performing the enumeration needed to compute $M(S)$ is very expensive. The agent correctly identified this and searched the web for solutions. 
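To make the reward rule concrete, here is a small hypothetical sketch of how a click might be resolved once per-tile posteriors are available. The names and the harsh/lenient switch are mine, mirroring the two options above; this is not code from the prototype.

```python
import random

def optimal_set(posteriors, tol=1e-12):
    """M(S): the hidden tiles that maximize P(x is not a mine | S)."""
    best = max(posteriors.values())
    return {x for x, p in posteriors.items() if p >= best - tol}

def resolve_click(x, posteriors, mode="lenient", rng=random):
    """Return True if the clicked tile x is revealed as safe.

    Optimal clicks are always rewarded with a safe reveal; suboptimal
    clicks either hit a mine immediately (harsh) or are resolved by
    sampling the tile's posterior (lenient).
    """
    if x in optimal_set(posteriors):
        return True
    if mode == "harsh":
        return False
    return rng.random() < posteriors[x]
```

In a full game loop, a safe resolution would additionally sample a consistent configuration matching the outcome, shrinking the distribution as described above.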
The article at https://www.lrvideckis.com/blog/2020/07/17/minesweeper_probability.html describes a heuristically improved approach to computing the probability that a given tile is a mine. It uses clever optimization tricks, including splitting by components, dynamic programming, and local deductions, to arrive at an algorithm that runs quickly in many cases. The agent used this algorithm along with some information from other articles to build an engine that could compute the necessary probabilities. While I haven\u0026#39;t had the time to do a deep dive into the algorithm design, it seems to work well for toy test cases. A much deeper dive into the algorithm and all the optimizations is a potential topic for a future article. For now, in Opus we trust.\nThe current implementation operates under the \u0026#34;harsh\u0026#34; mode described earlier, where any suboptimal move is punished by triggering a mine immediately. Adding the \u0026#34;lenient\u0026#34; mode is in the works, and will allow for a more forgiving, human-friendly version of the game. There is also currently an option to display tile safety probabilities, which is essentially a \u0026#34;cheat\u0026#34; mode.\nTwo things to note about the current implementation:\nEven with the optimizations, the game can sometimes take a long time to sample a board update. Making it more responsive in general is one direction of improvement.\nIt is often the case that there are no Forced Guess scenarios for the entire game, especially on smaller boards. Another direction to further develop the game would be to modify the sampling process to increase the likelihood of Forced Guess scenarios arising. 
I invite readers to play around with the code base at https://github.com/gokhaleaaroh/Bayesweeper if they would like to make improvements or adjustments to the existing prototype.\nConclusion The main focus of this article has been to introduce a no-guess variant of Minesweeper that requires the player to move beyond pure logic and into the realm of Bayesian statistics to achieve perfect play. While this variant of Minesweeper admits theoretically perfect play, it would be very difficult for a human to actually learn how to play it perfectly. Though my initial hopes were to create a more intellectually stimulating version of Minesweeper, I have my doubts about whether Bayesweeper would even be fun for humans to play. For now, it can exist as a fun demo and a playground for AI. A future project idea that I am seriously considering is to train a reinforcement learning agent to play this game well. There is a good chance that an RL agent could learn to perform quite well in the \u0026#34;lenient\u0026#34; mode, though I have my doubts regarding the \u0026#34;harsh\u0026#34; mode.\n","permalink":"https://aarohgokhale.github.io/projects/bayesweeper/","summary":"Introducing Bayesweeper - A Bayesian take on no-guess Minesweeper. Traditional no-guess Minesweeper ensures the existence of at least one deterministically mine-free tile at every point in the game. In this project, I introduce a twist on no-guess Minesweeper that maintains the non-deterministic nature of classic Minesweeper while allowing a player to, in theory, always win.","title":"Bayesweeper"},{"content":" All code for experiments can be found at https://github.com/gokhaleaaroh/diffusion-vs-flows/\nIntroduction Last week I talked about neural ODEs and some of their applications. One of these applications was to use neural ODEs for generative modeling, and more specifically for implementing a continuous-time variant of normalizing flows. 
This week, I will be exploring normalizing flows in further depth and comparing them to the current heavyweight in generative modeling, diffusion models.\nWhat is Generative Modeling? For those unfamiliar with the term and for those that need a quick refresher, generative modeling is the field of machine learning that seeks to design and train models that are capable of learning an unknown data distribution from samples, with the goal of producing new samples that look like they come from the original distribution. More concretely, suppose we have samples $x \\in \\mathcal{X}$ drawn from some underlying distribution $p$. We train a model distribution $q_{\\theta}$ that minimizes some notion of discrepancy from $p$, e.g. Kullback–Leibler divergence. We usually require $q_{\\theta}$ to provide a tractable mechanism for producing fresh samples that follow this learned distribution. Normalizing Flows Normalizing flows were developed to model complex probability distributions while allowing for tractable evaluation of exact likelihood. The key trick used to achieve this is to start with a simple probability distribution with a known exact likelihood, such as the standard normal distribution, and then to morph it with learnable transformations that are differentiable, invertible, and whose inverses are differentiable, known otherwise as diffeomorphisms. Since the composition of diffeomorphisms is a diffeomorphism, we only need to ensure that each layer in the overall transformation is a diffeomorphism. 
To evaluate the likelihood of a sample in the transformed distribution, we first invert it to find its origin in the source distribution, compute the likelihood there, and then use the Jacobian determinant of the inverse transformation to account for how the density expands or contracts, yielding the following formula:\n$$\\log p(x) = \\log p(z) + \\log{\\left | \\det D_{x}f^{-1}(x) \\right |}$$\nwhere $z$ is the point in the source distribution that maps to the sample $x$ in the transformed distribution under the learned map $f$. This is the change-of-variables rule for probability densities, the same idea as the change-of-variables formula from multivariable calculus. Though the Jacobian determinant is usually multiplied, we see an addition here due to the fact that $\\log(ab) = \\log a + \\log b$. The map $f$ is defined as the composition of the series of transformations that take the simple source distribution to the desired target distribution, and can be parameterized by a neural network. As discussed in my previous article, if all the layers take the form $x_{t} = x_{t - 1} + f_{t - 1}(x_{t - 1})$, we can also express the flow as a continuous-time differential equation.\nFor a more detailed survey of normalizing flows, the reader may refer to \u0026#34;Normalizing Flows: An Introduction and Review of Current Methods\u0026#34; by Kobyzev et al.\nDiffusion Models Diffusion models are a class of generative models that learn how to model a data distribution by successively adding noise to sample data and then learning how to \u0026#34;denoise\u0026#34; the result to reconstruct the original sample. In their popular formulation, the step that adds noise isn\u0026#39;t invertible, making exact likelihood computation difficult. In contrast, normalizing flows are explicitly designed with the goal of exact likelihood computation in mind. 
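The change-of-variables formula is easy to verify on a toy example. The sketch below, written for this post, uses a 1D affine map $f(z) = az + b$ as the "flow": the formula should then reproduce the density of $\mathcal{N}(b, a^2)$ exactly.

```python
import math

def log_std_normal(z):
    # log density of the standard normal source distribution
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def flow_log_prob(x, a, b):
    """Exact log-likelihood under a toy 1D flow x = f(z) = a*z + b, z ~ N(0, 1).

    Change of variables: log p(x) = log p_z(f^{-1}(x)) + log|det D f^{-1}(x)|.
    Here f^{-1}(x) = (x - b) / a, so the log-Jacobian term is -log|a|.
    """
    z = (x - b) / a
    return log_std_normal(z) - math.log(abs(a))
```

The result agrees with the analytic density of $\mathcal{N}(b, a^2)$, which is the sanity check one would run before trusting the same bookkeeping inside a learned, multi-layer flow.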
A common formulation of diffusion models is Denoising Diffusion Probabilistic Models (DDPMs). In DDPMs, the idea of successively adding noise and then learning how to successively denoise is formally modeled with two Markov chains, one going forward in time, and the other going backward. The forward Markov chain is fixed and models the process of adding noise over time, with the transition function being defined as a distribution $q(\\mathbf{x}_t \\mid \\mathbf{x}_{t - 1})$, typically defined by\n$$q(\\mathbf{x}_t \\mid \\mathbf{x}_{t - 1}) = \\mathcal{N}(\\mathbf{x}_t; \\sqrt{1 - \\beta_t}\\mathbf{x}_{t - 1}, \\beta_t I)$$\nwhich is the Gaussian distribution centered at $\\sqrt{1 - \\beta_t} \\mathbf{x}_{t - 1}$ with covariance $\\beta_t I$, where $\\beta_t \\in (0, 1)$ is a hyperparameter. The backward Markov chain is composed of learnable neural networks that learn how to transition backward in time, defined typically as\n$$p_{\\theta}(\\mathbf{x}_{t - 1} \\mid \\mathbf{x}_t) = \\mathcal{N}(\\mathbf{x}_{t - 1}; \\mu_{\\theta}(\\mathbf{x}_t, t), \\Sigma_{\\theta}(\\mathbf{x}_t, t))$$\nwhere $\\mu_{\\theta}$ and $\\Sigma_{\\theta}$ are parameterized by neural networks. These are trained so that the joint probability distribution of the forward chain closely matches the joint probability distribution of the reverse chain. Once the model has been trained, generating a new sample involves drawing a noise sample, typically Gaussian, and passing it through the reverse Markov chain to reconstruct a sample from the learned distribution.\nReaders interested in understanding diffusion models in further depth may refer to \u0026#34;Diffusion Models: A Comprehensive Survey of Methods and Applications\u0026#34; by Yang et al.\nWhich one should I use? On the surface, we can see that both models give us a practical way to generate new samples that \u0026#34;look like\u0026#34; the samples from our data. Why should someone choose one model over the other? 
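The fixed forward chain is simple enough to simulate in a few lines. Here is a minimal sketch of the noising process defined by $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ above, operating coordinate-wise on a plain list (the function name and schedule are illustrative):

```python
import math
import random

def forward_noise(x0, betas, seed=0):
    """Simulate the forward (noising) chain x_1, ..., x_T.

    Each step samples x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    coordinate-wise, matching the transition q(x_t | x_{t-1}) above.
    """
    rng = random.Random(seed)
    x, trajectory = list(x0), []
    for beta in betas:
        x = [math.sqrt(1 - beta) * xi + math.sqrt(beta) * rng.gauss(0.0, 1.0)
             for xi in x]
        trajectory.append(x)
    return trajectory
```

With a long enough schedule, $\mathbf{x}_T$ is close to a standard Gaussian regardless of $\mathbf{x}_0$, which is why sampling can start from pure noise.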
In their most common formulations, the clearest difference between these models is that normalizing flows allow you to easily compute an exact likelihood for any given sample, whereas diffusion models don\u0026#39;t. In a situation where you would want to compute exact likelihoods, normalizing flows seem to be the way to go. Why, then, do we see such a dominance of diffusion models in modern generative AI?\nThere are a few key advantages that diffusion models have over normalizing flows that make them well suited for high-dimensional generative tasks such as image, video, and audio generation:\nJacobian Computation: Normalizing flow models require computing the Jacobian of the neural network with respect to the full input. Naively, this means that the higher the dimension of your sample space, the more work needs to be done by the autodifferentiation software to compute this value. While there are model design choices that minimize this cost, they serve to constrain the model. On the other hand, inversion in diffusion models is posed as a reverse-time Markov chain where each transition is determined by a mean vector and a covariance matrix, parameterized by the network. The only gradients that need to be computed during training are with respect to network parameters, not with respect to the full input.\nFlexibility: Normalizing flows inherently constrain the kinds of neural architectures that can be used to build them. Any function used to construct a normalizing flow must be differentiable and invertible (with a differentiable inverse). Diffusion models are significantly more flexible in architecture choice. They allow swapping in countless state-of-the-art architectures (including transformers) for predicting the backward transitions.\nSample Quality: Another area where diffusion models win out is sample quality. Sample quality is a measure of how realistic and diverse the generated samples are. 
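One well-known example of such a cost-minimizing design choice is the affine coupling layer popularized by Real NVP: its Jacobian is triangular, so the log-determinant reduces to a sum, at the price of constraining the layer's form. A minimal sketch on plain Python lists, with the scale and shift networks replaced by stand-in functions for illustration:

```python
import math

def coupling_forward(x, s_net, t_net):
    """Affine coupling layer (Real NVP style).

    The first half of x passes through unchanged; the second half is
    scaled and shifted conditioned on the first. The Jacobian is
    triangular, so log|det| is just sum(s) -- no full Jacobian is formed.
    """
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    s, t = s_net(x1), t_net(x1)
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)

def coupling_inverse(y, s_net, t_net):
    # Inversion only needs s_net/t_net evaluated on the untouched half.
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    s, t = s_net(y1), t_net(y1)
    x2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y2, s, t)]
    return y1 + x2
```

Note that `s_net` and `t_net` need not be invertible themselves, which is exactly how the design partially recovers architectural flexibility while keeping the whole layer invertible.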
While sample quality can be measured through human evaluation, it is often measured more objectively with scores such as the Fréchet Inception Distance (FID). Papers consistently find relatively poor FID scores for flows when compared to diffusion models [3][4][5][6].\nExperiments In order to get a hands-on feel for the differences between these two models, I decided to train them on the same image generation task and examine the results visually and empirically with the FID score. I chose to train the models on the AFHQ dataset, a large collection of animal face images spanning dogs, cats, and miscellaneous wild animals.\nModel Choice and Results For the diffusion model, I used a small implementation of the U-Net architecture with MSE loss, and for normalizing flows I used a version of the Real NVP architecture (the original paper has promising generative results) with negative log-likelihood loss. I measured the FID score for both models using the cleanfid python library. Initially, I got really poor results from the normalizing flows model, with the resulting samples being essentially pure noise. After some iteration and back-and-forth with LLMs, I was able to get a final version that showed some promising results. Both models were trained using NVIDIA A100 GPUs with 40GB of memory on a Google Colab Python 3 kernel. They were trained for the same number of iterations on the same training data. The iteration speed for training was also roughly the same, at about 1.11 iterations per second. In their final state, the diffusion model achieved a FID score of roughly 330, and the flow achieved a FID score of roughly 336 (lower is better). It could be the case that due to the limited model and dataset size, it isn\u0026#39;t as easy to distinguish between these two types of models. 
Another important note to keep in mind is that this experiment was more exploratory than definitive, and a deeper investigation would be necessary to obtain conclusive results. For example, I haven\u0026#39;t outlined things such as parameter count or the exact model architecture, each of which would play a significant role in performance.\nSome of the produced images are shown below.\nFirst, we take a look at simple generation.\nFigure: Image generation with diffusion.\nFigure: Image generation with flows.\nWith the limited training size and the fact that the generation is unconditional, we do not expect to see any discernible animals. However, it is clear in both figures that the images in the top row are much noisier than the images in the bottom row, which seem to be significantly more structured. Next, we take a look at an illustrative figure that demonstrates a key difference between these two models: we take a known picture, pass it through the \u0026#34;noising\u0026#34; process (or the normalizing direction for flows), and then \u0026#34;denoise\u0026#34; the result.\nFigure: Noising and denoising with diffusion.\nFigure: Going forward and backward with flows.\nThe funny thing here, of course, is that the \u0026#34;denoising\u0026#34; step in the flow results in a perfect match for the original image. The constraint that flows be invertible ensures that no matter what image we feed into the normalizing direction of the flow model, inverting it will reconstruct that image exactly. On the other hand, the diffusion model progressively destroys information by injecting noise into the image at every step. The learned inverse operator is inexact: it learns how to predict the noise, but it cannot invert it exactly.\nRecent Advancements in Normalizing Flows for Generative Modeling While diffusion models have dominated when it comes to media generation tasks in recent years, advancements in normalizing flows have shown promise. 
Last summer, Apple\u0026#39;s Machine Learning Research team published a paper titled \u0026#34;Normalizing Flows are Capable Generative Models\u0026#34; (linked in an earlier section of this article), where they demonstrated state-of-the-art generative modeling capabilities with normalizing flows. In this paper, Zhai et al. show that implementing autoregressive flows with transformers significantly improves performance in image generation. With their model named TarFlow, the authors demonstrate state-of-the-art performance for likelihood estimation while having very promising and competitive performance on sample quality in image generation tasks. Even visually, some of the images generated by their model look quite impressive.\nA Surprising Connection So far, we have discussed normalizing flows as generative modeling with exact likelihoods and diffusion models as generative modeling with better sampling quality and scaling. During my research for this article, I discovered that there is a surprising connection between the two that also touches on the topic of my previous article, neural ODEs. In \u0026#34;Maximum Likelihood Training of Score-Based Diffusion Models\u0026#34; by Song et al., the authors discuss how it is possible to compute exact likelihoods in a score-based diffusion model by transforming it into a continuous normalizing flow. The score-based formulation of diffusion models is as a reverse-time stochastic differential equation (SDE). Through some math that is far beyond the scope of this article, it is possible to convert the reverse-time SDE into what is called a probability flow ODE, which recovers the marginal probability distribution of the original SDE at any time $t$. This transformation allows us to treat the continuous-time variant of diffusion as a continuous normalizing flow! 
This reformulation not only allows for better likelihood estimation, it also enables maximum likelihood training of the diffusion model (although that may be quite expensive).\nConclusion Diffusion models and normalizing flows both tackle generative modeling, but they do so in strikingly different ways. Each comes with clear trade-offs: normalizing flows choose invertibility and exact likelihood over architectural flexibility, while diffusion models choose flexibility and sample quality at the cost of information preservation. Though diffusion models have come to dominate the space of generative AI, recent research in normalizing flows suggests that they can remain competitive by modern standards. The bridge between normalizing flows and diffusion models through the theory of stochastic differential equations and neural ODEs is fascinating, and could be the topic of an in-depth discussion another day.\nReferences [1] I. Kobyzev, S. J. D. Prince, and M. A. Brubaker, “Normalizing Flows: An Introduction and Review of Current Methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020, doi: https://doi.org/10.1109/TPAMI.2020.2992934.\n[2] L. Yang et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, Nov. 2023, doi: https://doi.org/10.1145/3626235.\n[3] A. Vahdat, K. Kreis, and J. Kautz, “Score-based Generative Modeling in Latent Space,” 2021. Accessed: Feb. 16, 2026. [Online]. Available: https://proceedings.nips.cc/paper/2021/file/5dca4c6b9e244d24a30b4c45601d9720-Paper.pdf\n[4] Q. Zhang and Y. Chen, “Diffusion Normalizing Flow,” 2021. Accessed: Feb. 16, 2026. [Online]. Available: https://proceedings.neurips.cc/paper/2021/file/876f1f9954de0aa402d91bb988d12cd4-Paper.pdf\n[5] S. Sajekar, “Diffusion Augmented Flows: Combining Normalizing Flows and Diffusion Models for Accurate Latent Space Mapping,” 2023. Accessed: Feb. 16, 2026. [Online]. 
Available: https://www.ai.uga.edu/sites/default/files/inline-files/theses/sajekar_soham_202305_ms.pdf\n[6] S. Zhai et al., “Normalizing Flows are Capable Generative Models,” Icml.cc, May 08, 2025. https://icml.cc/virtual/2025/poster/46564 (accessed Feb. 16, 2026).\n[7] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” Lecture Notes in Computer Science, vol. 9351, pp. 234–241, 2015, doi: https://doi.org/10.1007/978-3-319-24574-4_28.\n[8] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using Real NVP,” Openreview.net, Feb. 06, 2017. https://openreview.net/forum?id=HkpbnH9lx (accessed Feb. 16, 2026).\n[9] Y. Song, C. Durkan, I. Murray, and S. Ermon, “Maximum Likelihood Training of Score-Based Diffusion Models,” 2021. Accessed: Feb. 16, 2026. [Online]. Available: https://papers.nips.cc/paper/2021/file/0a9fdbb17feb6ccb7ec405cfb85222c4-Paper.pdf\n","permalink":"https://aarohgokhale.github.io/technical/diffusion-vs-normalizing-flows/","summary":"Diffusion has come to dominate the field of audio-visual media generation, but Normalizing Flows have their own strengths. In this article, I explore the differences between these models, demonstrate them through a small experiment, discuss some promising recent developments in the use of flows for media generation tasks, and explore an interesting bridge that connects these two models.","title":"Generative Modeling: Diffusion vs Normalizing Flows"},{"content":" All code for the demos can be found at https://github.com/gokhaleaaroh/NeuralODE-experiments/\nIntroduction Is it possible to learn a continuous representation of ordered data? Many temporal models such as RNNs and transformers are typically formulated on discrete steps, where the model state is updated at fixed sequence positions or fixed time intervals. 
Neural ODEs take a different approach: they aim to represent evolution in continuous time by learning a dynamics function rather than a discrete update rule. Neural ODEs learn a derivative field $\\dot{x} = f_{\\theta}(x,t)$ whose integral traces the observed trajectories. Training then amounts to integrating the learned dynamics and matching the result against trajectory samples.\nIn this article, we’ll cover the basic design of neural ODEs, including approaches to computing their gradient with respect to their parameterization, and walk through two simple applications: learning the dynamics of a Van der Pol oscillator and training a continuous normalizing flow to match a 2D spiral distribution.\nBasics Neural networks are capable of learning complex, nonlinear functions between high-dimensional spaces through empirical risk minimization. Often, neural networks are designed as a long sequence of differentiable transformations that progressively increase the expressive power of the network. Certain neural networks are defined by adding to an existing hidden state rather than applying a transformation to it directly:\n$$\\mathbf{h}_{t + 1} = \\mathbf{h}_t + f(\\mathbf{h}_t, \\boldsymbol{\\theta}_t)$$\nResNets are a clear example of this type of architecture. Certain normalizing flows, such as planar and radial flows, also obey this residual form. Neural ODEs (Chen et al., 2018) are built on the observation that this form is exactly an Euler-method discretization (with unit step size) of the differential equation\n$$\\dfrac{d\\mathbf{h}}{dt} = f(\\mathbf{h}(t), \\boldsymbol{\\theta}(t), t)$$\nNeural ODEs aim to directly parameterize and learn $f = \\dfrac{d\\mathbf{h}}{dt}$ using a neural network, which then allows for a continuous-time representation of this evolution. Inference with this learned derivative involves integrating it forward in time over a desired set of time steps. 
A key feature of neural ODEs that makes them particularly interesting is that they allow the use of a black-box forward integrator/solver, even though the loss function is a function of the output of the numerical integration. This means that various choices of integrator can be tested with the same code base very quickly, without needing to determine whether they are internally auto-differentiable.\nComputing Gradients The Secret Sauce: the Adjoint Sensitivity Method In the previous paragraph, I essentially claimed that even if the loss is a function of some black-box integrator, i.e. $\\mathcal{L}\\left(I\\left[f_{\\theta}\\right]\\right)$, we can train $f_{\\theta}$ with backpropagation. How is that even possible? If we don\u0026#39;t know how to make the gradients flow back from $I$, aren\u0026#39;t we stuck? The adjoint sensitivity method, introduced by Pontryagin in the early 60s, allows us to compute the gradient by solving a second ODE, backward in time. Consider the following problem: given\n$$\\mathcal{L}\\left(\\text{ODESolve}(\\mathbf{z}(t_0), f, t_0, t_1, \\theta)\\right)$$\nwhere $\\text{ODESolve}$ is a callable forward integrator, $t_0$ is the initial time, $t_1$ is the final time, $\\mathbf{z}(t)$ is the state, and $f$ is the current dynamics network parameterized by $\\theta$, compute $\\dfrac{\\partial \\mathcal{L}}{\\partial \\theta}$.\nFirst, define the adjoint, $\\mathbf{a}(t)$, as the gradient of the loss with respect to the state $\\mathbf{z}(t)$:\n$$\\mathbf{a}(t) = \\dfrac{\\partial \\mathcal{L}}{\\partial \\mathbf{z}(t)}$$\nThe key identity that allows us to find $\\dfrac{\\partial \\mathcal{L}}{\\partial \\theta}$ is:\n$$\\dfrac{\\partial \\mathcal{L}}{\\partial \\theta} = - \\displaystyle\\int_{t_1}^{t_0} \\mathbf{a}(t)^{T} \\dfrac{\\partial f(\\mathbf{z}(t), t, \\theta)}{\\partial \\theta} dt$$\nThis tells us that integrating $-\\mathbf{a}(t)^T \\dfrac{\\partial f(\\mathbf{z}(t), t, \\theta)}{\\partial \\theta}$ backward in time 
allows us to retrieve the derivative of the loss function with respect to the parameters. Now as long as we know how to evaluate the trajectory of $\\mathbf{a}(t)$ backward in time, we can get our gradients for optimization. Thus, it is of interest to know the dynamics of $\\mathbf{a}(t)$. The following identity gives us just that:\n$$\\dot{\\mathbf{a}}(t) = -\\mathbf{a}(t)^{T}\\dfrac{\\partial f(\\mathbf{z}(t), t, \\theta)}{\\partial \\mathbf{z}}$$\nOne small point about the notation here is that in the previous equation defining $\\mathbf{a}(t)$, we had the notation $\\partial \\mathbf{z}(t)$, whereas here we have $\\partial \\mathbf{z}$. The way I interpret this is that, because $\\mathcal{L}$ is a functional of the entire trajectory $\\mathbf{z}([t_0, t_1])$, we need to specify which $\\mathbf{z}$ along the trajectory we\u0026#39;re nudging. On the other hand, $f$ is just a regular function with three inputs, meaning there is only one $\\mathbf{z}$ to nudge to begin with.\nArmed with the knowledge of how both $\\mathbf{z}$ and $\\mathbf{a}$ evolve backward in time, we run our numerical integrator on three sets of dynamics:\n$$(f, -\\mathbf{a}(t)^T \\dfrac{\\partial{f}}{\\partial \\mathbf{z}}, -\\mathbf{a}(t)^T \\dfrac{\\partial{f}}{\\partial \\theta})$$\nwith the third coordinate being of interest for the actual loss gradient. Another interesting point to note here is that since the third coordinate depends on both $\\mathbf{a}(t)$ and $\\mathbf{z}(t)$, there is a choice to be made of whether to use the updated or previous values of those functions when discretizing the evolution.\nI initially considered providing justifications for the above identities, but decided against it to prevent this post from getting excessively long. 
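For a scalar linear system, the whole backward-in-time procedure can be checked against a closed form. The sketch below is my own toy example, not from the paper: it integrates the augmented system backward for $f(z, \\theta) = \\theta z$ with loss $\\mathcal{L} = z(t_1)$, where the true gradient is $z_0 (t_1 - t_0) e^{\\theta (t_1 - t_0)}$:

```python
# Toy adjoint sensitivity check on dz/dt = theta * z, with loss L = z(t1).
# Augmented backward dynamics (all scalar here):
#   dz/dt    = theta * z
#   da/dt    = -a * df/dz     = -a * theta
#   dgrad/dt = -a * df/dtheta = -a * z
import math

theta, z0, t0, t1, dt = 0.5, 1.0, 0.0, 1.0, 1e-4
n = int((t1 - t0) / dt)

# Forward pass: explicit Euler from t0 to t1.
z = z0
for _ in range(n):
    z = z + dt * theta * z

# Backward pass: integrate (z, a, grad) from t1 back down to t0.
a, grad = 1.0, 0.0          # a(t1) = dL/dz(t1) = 1 since L = z(t1)
for _ in range(n):
    grad += dt * a * z      # accumulates -a * df/dtheta, backward in time
    a    += dt * a * theta  # a(t - dt) = a(t) - dt * (-a * theta)
    z    -= dt * theta * z  # rewind the state alongside the adjoint

analytic = z0 * (t1 - t0) * math.exp(theta * (t1 - t0))  # = d z(t1) / d theta
print(grad, analytic)  # the two agree to within the discretization error
```

Note that the state $z$ is rewound alongside the adjoint rather than stored, which is exactly where the memory savings of the method come from.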
Interested readers can refer to Vaibhav Patel\u0026#39;s blog post deriving these identities.\nClassic Backward Propagation If the internal steps of the numerical integrator are known and written to be autodifferentiable, it is possible to compute the gradients by classic backpropagation. In fact, as we will see, though this alternative is not as mathematically elegant, it might actually be preferable in certain scenarios. In the original paper, the adjoint sensitivity method is proposed as a more memory-efficient approach with significantly lower numerical error. However, in practice, I found that the adjoint sensitivity method took a very long time, and that the standard backprop approach yielded results that were just as good, if not better, for the simple problems posed in my experiments.\nApplications The strength of neural ODEs is that they allow continuous time modeling of data. There is no need to specify the exact time-steps for which samples will be generated, since what is being trained is the time derivative of the output, rather than the actual value. Another major benefit of neural ODEs is that they ensure smooth trajectories automatically due to the fact that the learned operator is the derivative of the desired value function.\nI ran some experiments to test out neural ODEs for myself. The following subsections detail their designs and results. Learning Physical Dynamics One of the most natural applications of neural ODEs is to learn the propagation dynamics of physical systems by learning from trajectory data. Given a state vector $\\mathbf{x}$ that evolves according to some hidden dynamics $f(\\mathbf{x}, t)$, we can train a neural ODE to automatically interpolate meaningfully between observed points by training on observed data $(\\mathbf{x}_{t_{k}})_{k = 0}^{k = N}$. The resulting neural ODE can then be used as a surrogate to simulate hypothetical trajectories. 
One important caveat is that the original method of adjoint sensitivity only outlines how we can compute the derivative of a loss that depends on the final state in the trajectory. In order to enforce trajectory consistency throughout the desired time-interval, we need to supervise with intermediate time-steps. The tricks necessary to incorporate intermediate points are beyond the scope of this article. In the following experiment, I trained a neural ODE to learn the dynamics of a Van der Pol oscillator, which is a dynamical system that exhibits very interesting trajectories and evolves according to the following ODE:\n$$\\ddot{x} = \\mu (1 - x^2) \\dot{x} - x$$\nWith $\\mu$ set to $1$, I sampled the phase space coordinates $(x, \\dot{x})$ over a time interval $[0, 20]$ with a step size of $0.1$ and a batch size of $16$, and trained a neural network for $5000$ iterations to learn propagation dynamics that result in trajectories that match these samples.\nA Note on Gradient Computation If the internals of the numerical integration algorithm are known, it is possible to compute the loss gradient by directly backpropagating through the steps of the solver. Initially, I unknowingly did just that, using torchdiffeq.odeint instead of torchdiffeq.odeint_adjoint, the variant that actually uses the adjoint sensitivity method. To be thorough, I also ran the same experiment while using torchdiffeq.odeint_adjoint with the \u0026#34;dopri5\u0026#34; solver option (corresponding to the Dormand-Prince method). What I discovered was that using the adjoint method was much, much slower than backpropagating through the solver steps. The difference was night and day. While the adjoint method eventually did manage to drive down the training and validation loss, it took several hours to do so, as opposed to the several minutes taken by classic backpropagation. This little detour ended up informing my understanding of how neural ODEs should be trained in practice. 
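As an aside, the ground-truth Van der Pol trajectories used as training data are easy to generate with a basic fixed-step RK4 integrator. The sketch below is a generic stdlib-only integrator, not the torchdiffeq code used in the experiment:

```python
# Van der Pol oscillator, mu = 1: phase-space state s = (x, v) with
#   dx/dt = v,  dv/dt = mu * (1 - x^2) * v - x
# integrated with a classic fixed-step RK4 scheme.
mu = 1.0

def f(s):
    x, v = s
    return (v, mu * (1.0 - x * x) * v - x)

def rk4_step(s, dt):
    def add(a, b, c):  # a + c * b, componentwise
        return tuple(ai + c * bi for ai, bi in zip(a, b))
    k1 = f(s)
    k2 = f(add(s, k1, dt / 2))
    k3 = f(add(s, k2, dt / 2))
    k4 = f(add(s, k3, dt))
    return tuple(si + dt / 6 * (a + 2 * b + 2 * c + d)
                 for si, a, b, c, d in zip(s, k1, k2, k3, k4))

def trajectory(s0, dt, steps):
    traj = [s0]
    for _ in range(steps):
        traj.append(rk4_step(traj[-1], dt))
    return traj

traj = trajectory((0.5, 0.0), 0.1, 300)  # t in [0, 30] at the same 0.1 step
```

Regardless of the starting point, the trajectory settles onto the famous limit cycle, whose $x$-amplitude for $\\mu = 1$ is roughly $2$.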
If the internals of the solver are known, classic backprop is certainly worth trying out, and might even be significantly better. In particular, classical backprop seems to be much better suited for the supervision of intermediate points. If the internals of the solver are truly unknown, the adjoint sensitivity method still provides a reliable way to compute gradients.\nThe following visualizations demonstrate how the learned dynamics compare to the ground truth:\nSample trajectory generated by dynamics learned with regular backprop:\nSample trajectory generated by dynamics learned with adjoint sensitivity: Propagating from various initial points (regular backprop): As can be seen above, neural ODEs are capable of learning dynamics that mimic the ground truth dynamics remarkably well from just trajectory data. Generative Modeling with Continuous Normalizing Flows Another cool application mentioned in the original paper is generative modeling with normalizing flows. The paper shows that using a continuous time representation of normalizing flows with neural ODEs allows evaluating flow models in $O(M)$ time, where $M$ is the number of hidden units. This is contrasted with standard normalizing flows, which need $O(M^3)$ operations for the same task. This means that neural ODEs are capable of efficiently evaluating \u0026#34;wide\u0026#34; flow layers where the dynamics might be defined as a sum of several functions. The paper coins the term \u0026#34;Continuous Normalizing Flows\u0026#34; (CNF) for these continuous-time flow models. 
In the following experiment, I trained a CNF to learn how to morph a standard multivariate normal distribution in 2D into a spiral distribution defined by $(x, y)$, where\n$$x \\sim r\\cos{\\theta} + \\mathcal{N}(0, \\sigma^2)$$ $$y \\sim r\\sin{\\theta} + \\mathcal{N}(0, \\sigma^2)$$ $$\\theta \\sim 4\\pi \\cdot \\mathcal{U}_{[0, 1)}$$ $$r = a + b\\theta$$\nThe model generates a target distribution by first taking a sample $z_0 \\sim \\mathcal{N}(0, I_2)$ and then morphing it into a sample $x$ obtained by forward integrating $z_0$ with our dynamics neural network $f_{\\theta}$. Some clever math from the original paper allows us to compute the log-probability of $x$ by finding the accumulated change in log-probability density, obtained by integrating the negative divergence of $f_{\\theta}$ backward in time. We train the network by sampling examples of $x$ from the target distribution defined above, then reverse integrating $(f_{\\theta}, -\\nabla \\cdot f_{\\theta})$ to find the corresponding basepoint $z$ and the accumulated change in log-probability density, and finally computing the parameter gradients. Interestingly enough, here the \u0026#34;forward pass\u0026#34; itself is a backward-in-time integration.\nI trained the model to integrate with $200$ steps over the time interval $[0, 1]$, with $5000$ training loop iterations and a batch (resampled every time with the spiral generator) of size 1024.\nThe following visualizations demonstrate the learned transformation. In the first visual, we see the standard normal distribution in 2D morph into the learned distribution. In the second visual, we see the true spiral distribution superimposed onto the learned distribution. As can be seen above, the CNF was able to roughly capture the shape of the target distribution. Generating a new sample is as simple as sampling from the standard normal distribution and forward propagating to get the corresponding sample in the learned distribution. 
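The spiral generator itself is a one-liner per coordinate. Below is a minimal stdlib sketch of the target sampler; the particular values of $a$, $b$, and $\\sigma$ here are made-up stand-ins, not the ones used in the experiment:

```python
# Sampler for the target spiral distribution:
#   theta ~ 4*pi*U[0,1),  r = a + b*theta,
#   x = r*cos(theta) + N(0, sigma^2),  y = r*sin(theta) + N(0, sigma^2)
import math, random

def sample_spiral(n, a=0.5, b=0.35, sigma=0.02, seed=0):
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        theta = 4 * math.pi * rng.random()
        r = a + b * theta
        x = r * math.cos(theta) + rng.gauss(0.0, sigma)
        y = r * math.sin(theta) + rng.gauss(0.0, sigma)
        pts.append((x, y, theta, r))
    return pts

pts = sample_spiral(1000)  # each point sits near radius r = a + b*theta
```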
Since it is a normalizing flow, we also have the property that any sample from the learned probability distribution can be mapped back to a unique sample in the original standard normal distribution, which is done here with reverse integration.\nConclusion Neural ODEs offer an interesting and mathematically elegant way of modeling continuous-time transformations of data. While their applications in real software are currently limited, they are part of ongoing research efforts in physics-informed machine learning and time-series modeling. This article has explored their basic structure and two simple applications in physical dynamics-learning and normalizing flows. One aspect of neural ODEs that this article hasn\u0026#39;t covered in depth is their ability to incorporate irregular temporal data. Since neural ODEs learn the dynamics operator of a transformation rather than specific values at specific time-steps, they can incorporate data collected at arbitrary points in a trajectory. This ability could be the topic of a future project incorporating real-world time-series data that is irregular and sporadic.\nFurther Readings The original paper proposing neural ODEs came out in 2018. Since then, there has been a growing body of machine learning research that has both improved and leveraged this architecture for various applications:\nA more advanced variation on neural ODEs is modeling stochastic differential equations (SDEs) with neural networks, as discussed in \u0026#34;Scalable Gradients for Stochastic Differential Equations\u0026#34; by Li et al. Neural ODEs also enable incorporating certain interesting inductive biases, such as a Hamiltonian dynamics bias, as discussed in \u0026#34;Symplectic ODE-Net: Learning Hamiltonian Dynamics with Control\u0026#34; by Zhong et al., which also addresses how a control term $u$ can be incorporated into the existing neural ODE architecture, and how prediction error in long time horizons can be mitigated. 
A proposed replacement for neural ODEs is provided in \u0026#34;Neural Flows: Efficient Alternative to Neural ODEs\u0026#34; by Biloš et al. This paper discusses an approach to modeling ODEs that does not require using numerical integration. References Ricky T. Q. Chen et al. “Neural ordinary differential equations”. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18. Montréal, Canada: Curran Associates Inc., 2018, 6572–6583. papers.nips.cc/paper_files/paper/2018/hash/69386f6bb1dfed68692a24c8686939b9-Abstract.html\n","permalink":"https://aarohgokhale.github.io/technical/neural-odes/","summary":"Neural ODEs are a relatively niche deep learning architecture designed to represent continuous-time differential processes. In this post, I provide an introduction to the basics of Neural ODEs and two simple applications to demonstrate their use.","title":"Neural ODEs"},{"content":" In my recent readings, I have encountered several papers that convert the usual momentum-based optimization algorithms such as Polyak\u0026#39;s Heavy Ball Method or Adam to their continuous-time variants in order to perform some form of analysis on them.\nHere, I would like to explore the broad notions used in these papers.\nContinuous Time Forms One of the interesting notions in many of these papers is to look at the continuous-time forms of momentum-based optimizers. The main problem I tend to face is that these papers usually just don\u0026#39;t explain how they arrived at a particular system of differential equations for the continuous-time form of a particular optimizer. A friend of mine suggested that it could be just a matter of looking at the differential equations for which the optimizer update matches Euler\u0026#39;s Method for solving ODEs. For example, the usual vanilla gradient descent rule is\n$$x_{t^{(i + 1)}} = x_{t^{(i)}} - \\eta \\cdot \\nabla f (x_{t^{(i)}})$$\nwhere $f$ is the loss function and $\\eta$ is some learning rate. 
If we instead let $\\eta = t^{(i + 1)} - t^{(i)} = \\Delta t$, we get\n$$x_{t^{(i + 1)}} = x_{t^{(i)}} - \\nabla f (x_{t^{(i)}}) \\Delta t$$\nwhich is indeed the recurrence relation of Euler\u0026#39;s method for the differential equation\n$$x\u0026#39; = -\\nabla f(x)$$\nThe classic momentum update rule described by Polyak is given by\n$$p_{t^{(i)}} = (1 - \\beta)p_{t^{(i - 1)}} - \\beta\\nabla f(x_{t^{(i)}})$$\n$$x_{t^{(i + 1)}} = x_{t^{(i)}} + \\beta \\cdot p_{t^{(i)}}$$\nwhere $\\beta$ is the size of the timestep.\nWe may see that this is just a discretization of the following second order system of differential equations:\n$$p\u0026#39; = -p - \\nabla f(x)$$ $$x\u0026#39; = p$$\nTo see that these do indeed map to the discrete versions, we can use finite differences:\n$$x\u0026#39; \\approx \\dfrac{x_{t^{(i)}} - x_{t^{(i - 1)}}}{\\beta}$$\n$$x\u0026#39;\u0026#39; \\approx \\dfrac{x_{t^{(i + 1)}} - 2x_{t^{(i)}} + x_{t^{(i - 1)}}}{\\beta^2}$$\nThen since $p = x\u0026#39;$, we get:\n$$\\dfrac{x_{t^{(i + 1)}} - 2x_{t^{(i)}} + x_{t^{(i - 1)}}}{\\beta^2} = -\\dfrac{x_{t^{(i)}} - x_{t^{(i - 1)}}}{\\beta} - \\nabla f(x_{t^{(i)}})$$\n$$\\iff x_{t^{(i + 1)}} - 2x_{t^{(i)}} + x_{t^{(i - 1)}} = -\\beta (x_{t^{(i)}} - x_{t^{(i - 1)}}) - \\beta^2 \\nabla f(x_{t^{(i)}})$$\n$$\\iff x_{t^{(i + 1)}} = 2x_{t^{(i)}} - x_{t^{(i - 1)}} -\\beta (x_{t^{(i)}} - x_{t^{(i - 1)}}) - \\beta^2 \\nabla f(x_{t^{(i)}})$$\n$$\\iff x_{t^{(i + 1)}} = x_{t^{(i)}} + x_{t^{(i)}} - x_{t^{(i - 1)}} -\\beta (x_{t^{(i)}} - x_{t^{(i - 1)}}) - \\beta^2 \\nabla f(x_{t^{(i)}})$$\n$$\\iff x_{t^{(i + 1)}} = x_{t^{(i)}} + (1 - \\beta)(x_{t^{(i)}} - x_{t^{(i - 1)}}) - \\beta^2 \\nabla f(x_{t^{(i)}})$$\nSince $p_{t^{(i)}} = (1 - \\beta)p_{t^{(i - 1)}} - \\beta\\nabla f(x_{t^{(i)}})$ in our discretization, we need to show that\n$$\\beta((1 - \\beta)p_{t^{(i - 1)}} - \\beta\\nabla f(x_{t^{(i)}})) = (1 - \\beta)(x_{t^{(i)}} - x_{t^{(i - 1)}}) - \\beta^2 \\nabla f(x_{t^{(i)}})$$\nWhich trivially boils down to 
showing\n$$\\beta(1 - \\beta)p_{t^{(i - 1)}} = (1 - \\beta)(x_{t^{(i)}} - x_{t^{(i - 1)}})$$\nSince $p = x\u0026#39;$, this is true by the first order approximation. Of course, just because the approximation is correct doesn\u0026#39;t mean that the trajectory is of any use. It must be shown that $x(t)$ converges to a stationary point of the loss function $f$. Hamiltonian Dynamics The paper entitled \u0026#34;Hamiltonian Descent Methods\u0026#34; by Maddison et al. notes that this system mirrors a Hamiltonian dynamics system from physics, which is typically stated as\n$$x\u0026#39;_t = \\nabla_p \\mathcal{H}(x_t, p_t) = \\nabla k(p_t)$$\n$$p\u0026#39;_t = -\\nabla_x \\mathcal{H}(x_t, p_t) = -\\nabla f(x_t)$$\nwhere $\\mathcal{H}(x_t, p_t)$ is the total energy in the system as a function of the position and momentum, $k$ is the kinetic energy as a function of the momentum, and $f$ is the potential energy as a function of the position. If we let $k(p_t) = \\langle p_t, p_t \\rangle / 2$, the equations are nearly identical to the continuous-time version of the momentum optimizer derived previously, with the caveat that the equation for $p\u0026#39;$ from the previous section also has the term $-p$, which the paper talks about as a \u0026#34;dissipation field\u0026#34; of the form $p\u0026#39;_t = -\\gamma \\cdot p_t$, thus giving us a final field $(x\u0026#39;_t, p\u0026#39;_t) = F(x_t, p_t) + G(x_t, p_t)$ where $F$ is the Hamiltonian field $(\\nabla k (p_t), -\\nabla f(x_t))$ and $G$ is the dissipation field $(0, -\\gamma \\cdot p_t)$. The paper calls this a \u0026#34;conformal\u0026#34; Hamiltonian field/system. It also allows for a notion of a more general conformal Hamiltonian system, where $k$ is allowed to be an arbitrary nonnegative convex function with $k(0) = 0$ and $\\gamma \u0026gt; 0$. 
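The energy behavior of such a conformal system is easy to check numerically. Below is my own toy setup, not taken from the paper, simulating $x\u0026#39; = p$, $p\u0026#39; = -\\gamma p - \\nabla f(x)$ for $f(x) = x^2/2$ and $k(p) = p^2/2$, and confirming that the total energy $\\mathcal{H} = f + k$ decays:

```python
# Dissipative (conformal) Hamiltonian system for f(x) = x^2/2, k(p) = p^2/2:
#   x' = p,   p' = -gamma * p - grad f(x) = -gamma * p - x
# Along trajectories, dH/dt = -gamma * p^2 <= 0, so the energy decays.
gamma, dt, steps = 1.0, 0.01, 5000
x, p = 2.0, 0.0
H0 = 0.5 * x * x + 0.5 * p * p  # initial total energy

energies = []
for _ in range(steps):
    x, p = x + dt * p, p + dt * (-gamma * p - x)  # explicit Euler step
    energies.append(0.5 * x * x + 0.5 * p * p)

print(H0, energies[-1])  # the final energy is a tiny fraction of H0
```

In physical terms, the dissipation drains the energy until the particle comes to rest at a minimum of the potential $f$.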
This poses the problem of optimization as a problem of computing the trajectory of a particle $x$ placed in a force field defined by the loss function $f$ starting at some position $x_0$ with some velocity $p_0$, where there also exists a dissipation force that scales with the particle\u0026#39;s velocity and makes it an important factor in deciding where the particle will go next. The solutions to the system of differential equations remain in the set $\\{(x, p) : \\mathcal{H}(x, p) = H_0\\}$ for conservative forces $\\nabla f$. However, when we add a dissipation force, such as the one defined by the momentum term in our optimizer, the total energy of the system decreases over time. A result from this paper is that given sufficiently nice conditions on the system, there exists a unique solution $(x_t, p_t)$ given initial conditions $(x_0, p_0)$ and that the position function $x$ converges to a stationary point of $f$.\nConditions The specific conditions for existence are as follows:\n$k$ is nonnegative and convex with $k(0) = 0$ (I stated this above as the domain of functions that $k$ can be chosen from). $\\nabla f$ and $\\nabla k$ are continuous. $\\mathcal{H}$ is radially unbounded: $\\mathcal{H}(x, p) \\to \\infty$ as $||(x, p)||_{2} \\to \\infty$. This notation is a little unclear to me. The best guess I can make is that for any $\\epsilon \u0026gt; 0$, there exists $\\delta \u0026gt; 0$ such that for all $(x, p)$, $||(x, p)||_2 \u0026gt; \\delta \\implies \\mathcal{H}(x, p) \u0026gt; \\epsilon$. 
For uniqueness, the additional condition that $\\nabla f$ and $\\nabla k$ are continuously differentiable is imposed.\nFor convergence to a stationary point of $f$, given a solution $(x_t, p_t)$ to the system with initial conditions $(x_0, p_0) = (x, p)$, the following conditions are imposed:\n$f$ and $k$ are continuously differentiable $k$ is strictly convex with a minimum $k(0) = 0$ $\\mathcal{H}$ is radially unbounded $f$ is bounded below. Given the above conditions, the paper shows that $||\\nabla f(x_t)||_2 \\to 0$.\nLooking at this optimization problem from a physics perspective is perhaps insightful for people who have a strong understanding of and intuition for physics, though it might not be so useful for those that are only really familiar with the mathematics of it. ","permalink":"https://aarohgokhale.github.io/technical/continuous_time_optimizers/","summary":"An exploration of how the common problem of function optimization over $\\mathbb{R}^d$ can be viewed through the lens of physics, and in particular, Hamiltonian mechanics. 
This perspective is taken by many modern papers on optimization algorithms, and I attempt to give a brief and accessible introduction to it here.","title":"A Physics View of Function Optimization"},{"content":" Here, I summarize and try to explain in detail what I learned from the paper entitled \u0026#34;End-to-End Differentiable Proving\u0026#34; by Rocktäschel and Riedel.\nMain Ideas The main idea of this paper is to combine strategies from automated symbolic reasoning and learned vector representations of symbolic entities to get a hybrid model of theorem proving, where proof search is enhanced by indicating that a proof is more likely to succeed if the entities within the goal have high similarity to entities in known proofs.\nBasics The first barrier to entry in this paper is understanding the logic programming framework being used. Theorems are posed as queries to a database; proving a theorem involves systematically substituting the terms in the query until something that already exists in the database is reached, at which point the query is considered a success. If nothing is found, it fails.\nDefinitions: atom: An atom consists of a predicate and a list of terms. For example, something like $[\\text{isCoprimeWith}, 8, 9]$. More generally, $[R, t_1, t_2, \\ldots, t_n]$, where $R$ is an $n$-ary relation called the predicate. In this paper, a term can be either a constant or a variable. The example they use is $[\\text{grandfatherOf}, Q, \\text{BART}]$, where $\\text{grandfatherOf}$ is the predicate, $Q$ is a variable, and $\\text{BART}$ is a constant. rule: A rule is a structure of the form $H \\mathrel{:-} \\mathbb{B}$, where $H$ is an atom called the head and $\\mathbb{B}$ is a possibly empty conjunction of atoms called the body of the rule. From my understanding, a rule of the form $q \\mathrel{:-} [p_1, p_2, \\ldots, p_N]$ describes the implication $p_1 \\land p_2 \\land \\ldots \\land p_N \\rightarrow q$. 
A rule with no free variables (all the variables are universally quantified) is called a ground rule, and a ground rule with no body is called a fact. substitution set: A substitution set is a set of the form $\\psi = \\{X_1/t_1, \\ldots, X_n/t_n\\}$, which represents an assignment of free variables $X_1, \\ldots, X_n$ to terms $t_1, \\ldots, t_n$ respectively. Applying a substitution to an atom replaces all occurrences of variables in the substitution set with their respective terms. Backward Chaining Algorithm: The algorithm used to prove a statement or query. The basic idea of the algorithm is as follows: a function called OR is applied to the query (the goal). It iterates through the set of all rules and finds a unification of the goal with the rule\u0026#39;s head (by trying to substitute the variables in either formula to match each other, and by making sure that when both terms are constants that they are equal). If OR is successful in finding a unification, it calls a function called AND, which then proves all the atoms in the body of that rule (since each rule is an implication based on a conjunction of premises). AND uses the substitution set that was used to unify the goal with the rule head, and applies it to the subgoals (the atoms in the body). It then calls OR on the subgoals one by one to prove them.\nAs someone with no experience in Prolog and this framework of thinking, I found the pseudocode for the backward chaining algorithm a little bit cryptic. 
Below I transliterate the pseudocode from the appendix of the paper in just plain text (the original had $\\LaTeX$):\nor(G, S) = [S\u0026#39; | S\u0026#39; in and(B, unify(H, G, S)) for H :- B in K]\nand(_, FAIL) = FAIL\nand([], S) = S\nand(G : bigG, S) = [S\u0026#39;\u0026#39; | S\u0026#39;\u0026#39; in and(bigG, S\u0026#39;) for S\u0026#39; in or(substitute(G, S), S)]\nunify(_,_,FAIL) = FAIL\nunify([],[],S) = S\nunify([],_,_) = FAIL\nunify(_,[],_) = FAIL\nunify(h : H, g : G, S) = unify(H, G, S + {h/g} if h in V, S + {g/h} if g in V and h not in V, S if g = h, FAIL otherwise)\nsubstitute([], _) = []\nsubstitute(g : G, S) = (x if g/x in S, g otherwise) : substitute(G,S)\nExplanation: The first function, or, is collecting sets S\u0026#39; such that they belong to the result of applying and to the bodies, B, of the rule heads, H, that manage to be unified with the goal, G. The and function, when given a set of subgoals, called bigG, and an existing substitution set S, goes through the subgoals one by one and applies or to them after applying the substitution corresponding to S. If or succeeds, it returns a new substitution set S\u0026#39;, which can then be used for the remaining subgoals in bigG. If and successfully goes through all subgoals, it returns a final set of substitution sets that essentially together contain the proof for G. The logic used for unify and substitute is really straightforward, so I won\u0026#39;t go into that here. The notational quirk that I was somewhat unaware of is the use of the colon : to denote splitting a list\u0026#39;s first element from the rest of the list. 
This is really quite similar to the way Haskell splits lists, with the notation [x:xs] used to denote the list as a whole but with the first element x and the rest of the list xs accessible as values.\nDifferentiable Prover The main thing defined in this paper is the NTP (Neural Theorem Prover), which is a neural network that takes in a goal and tries to prove it using a modified version of Prolog\u0026#39;s backward chaining algorithm, and spits out a success score. NTPs are defined in terms of modules, which are subgraphs that are designed for one particular task that is part of a larger goal of the whole network. Each module takes atoms, rules, and a proof state as input, and returns a list of new proof states. A proof state is a tuple $S = (\\psi, \\rho)$, where $\\psi$ is the substitution set constructed in the proof so far, and $\\rho$ is a neural network that outputs a real-valued success score of a partial proof. Once a module is constructed, it recursively instantiates submodules to continue the proof. The substitution set of a proof state $S$ is denoted $S_{\\psi}$, and the corresponding neural network for calculating proof success is denoted $S_{\\rho}$.\nUnification Module One of the main modifications that NTPs make to the original backward chaining algorithm is that when unifying two atoms, symbol comparison is replaced with a computation that measures the similarity of the vector representations of those two symbols. The example used is the comparison of the predicates grandfatherOf and grandpaOf, which aren\u0026#39;t symbolically the same, but which can have very close learned representations, using something like ComplEx. The unify module updates the input substitution set and creates a neural network for comparing vector representations of non-variable symbols in two sequences of terms (i.e. the terms in the two atoms). 
The module iterates pairwise through the terms of the two atoms being compared, and if one of the symbols is a variable, a substitution is added to the substitution set, and if they are both constants, their vector representations are compared using a Radial Basis Function Kernel.\nThe following is the pseudocode for the unify module taken directly from the paper:\nOne thing that confused me about this is on line 4, where there seem to be two assignments. My educated guess is that they meant to say that $S\u0026#39; = (S\u0026#39;_{\\psi}, S\u0026#39;_{\\rho})$, but accidentally ended up writing $\\text{unify}_{\\theta}(H, G, S\u0026#39;) = (S\u0026#39;_{\\psi}, S\u0026#39;_{\\rho})$, since only the former makes sense as a recursive definition.\nMoving on to actually analyzing the code, the two things of significance are the definitions of $S_{\\psi}\u0026#39;$ and $S_{\\rho}\u0026#39;$. The main pseudocode is very similar to the backward chaining pseudocode from earlier, with a couple of subtle differences. One difference is that there is no pattern match for the case $\\text{unify}_{\\theta}(\\_,\\_,\\text{FAIL})$, since the final pattern match never results in a call to $\\text{unify}_{\\theta}$ with $S\u0026#39; = \\text{FAIL}$ like it did in the original pseudocode. Another difference is that due to the new structure of $S$, the final case constructs the two different components of $S\u0026#39;$, only one of which was seen in the previous case (namely, $S_{\\psi}\u0026#39;$). Another thing to notice is that $S_{\\psi}\u0026#39;$ does not result in $\\text{FAIL}$ even if neither term is a variable and the terms aren\u0026#39;t equal. 
This is because the $S_{\\rho}$ component explicitly contains the calculation of a score for the cases when neither term is a variable, which is the term $$\\exp\\left(\\frac{-||\\mathbf{\\theta}_{h:} - \\mathbf{\\theta}_{g:}||_{2}}{2\\mu^2}\\right)$$\nObserve that when $h$ and $g$ are the same, they will have the same vector representation, and thus $\\mathbf{\\theta}_{g:} = \\mathbf{\\theta}_{h:}$, which results in $S_{\\rho} = e^0 = 1$. The further apart the vectors $\\theta_{g:}$ and $\\theta_{h:}$ are (with respect to the Euclidean metric), the larger the norm of their difference is, which in turn translates to a smaller value of $S_{\\rho}\u0026#39;$. In fact, the decay is exponential, which means that only really similar vectors get high scores. Note that we are taking the minimum of this new score and the old score, which means that by the end of the algorithm, the remaining value of $S^{(n)}_{\\rho}$ will be decided by the pair of terms furthest away from each other in their vector representations.\nCredit: Desmos Another observation pointed out in the paper is that with this new algorithm, the only cases where the $\\text{FAIL}$ output can be achieved is when the two atoms don\u0026#39;t have the same number of terms (i.e. arity mismatch).\nOR Module The or module is defined as\nThe knowledge base, as a set of rules, is denoted by $\\mathfrak{K}$. The or module in this case differs from the original symbolic or function in that it takes a new input, $d \\in \\mathbb{N}$, which defines the maximum proof depth of the neural network, and that it now uses the unify module, defined above, and the and module, defined below. The main difference between the symbolic and the neural or modules is that the neural module can capture similarities between different symbolic terms because it uses the neural unify module.\nAND Module The and module is defined as\nThe new parameter, $d$, introduces one new case, where we automatically fail if $d = 0$. 
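The scoring and min-aggregation rule described above fits in a few lines. In the sketch below, the embeddings and the value of $\\mu$ are made-up stand-ins for illustration; the score uses the same exponential-decay form as the formula above:

```python
# Pairwise soft-unification score: each pair of constant symbols contributes
# exp(-||v_h - v_g|| / (2 * mu^2)); the proof score keeps the minimum, so the
# worst-matching pair of terms decides the final success score.
import math

def rbf_score(vh, vg, mu=1.0):
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vh, vg)))
    return math.exp(-dist / (2 * mu ** 2))

def soft_unify_score(terms_h, terms_g, embed, mu=1.0):
    score = 1.0
    for h, g in zip(terms_h, terms_g):
        score = min(score, rbf_score(embed[h], embed[g], mu))
    return score

embed = {  # toy 2-d embeddings; grandpaOf sits close to grandfatherOf
    "grandfatherOf": (1.0, 0.0),
    "grandpaOf": (0.9, 0.1),
    "isCoprimeWith": (-1.0, -1.0),
}
close = soft_unify_score(["grandfatherOf"], ["grandpaOf"], embed)
far = soft_unify_score(["grandfatherOf"], ["isCoprimeWith"], embed)
```

Identical symbols score exactly $1$, nearby embeddings score close to $1$, and distant ones decay toward $0$, which is what lets grandpaOf soft-unify with grandfatherOf.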
The main case (line 4) itself is different in that it uses the neural versions of and and or (the substitute function is actually the exact same as before), and makes sure that subsequent calls to or and and get lower proof depths.\nFinal Model The final aggregate model for proving a goal $G$ using a Knowledge Base $\\mathfrak{K}$ with parameters $\\mathbf{\\theta}$ and proof depth $d$ is given by\nAnalysis of Final Model The final model takes in two inputs, the goal $G$, and the maximum proof depth, $d$. It then iterates through all the successful solutions produced by $\\text{or}_{\\mathbf{\\theta}}^{\\mathfrak{K}}(G,d,(\\varnothing, 1))$, and finds the one with the highest $S_{\\rho}$ score. The call $\\text{or}_{\\mathbf{\\theta}}^{\\mathfrak{K}}(G,d,(\\varnothing, 1))$ starts by constructing several unify modules, which are then all connected to and modules, which then go through substitute before going back to or with depth $d - 1$. This continues until there is either a successful solution, or until $d = 0$ or unification fails, which only happens if arity doesn\u0026#39;t match.\nTraining Training Objective The paper uses a negative log-likelihood loss function on the proof success score defined above. This paper also uses corrupted fact triples much in the same way that the paper on NTNs used them for training, with the main difference being that the corrupted data is explicitly given a score of $0$. The labeled training data is the set $\\mathcal{T}$. 
The loss function is given by\n$$ \\mathcal{L}_{\\text{ntp}_{\\mathbf{\\theta}}^{\\mathfrak{K}}} = \\sum_{([s,i,j], y) \\in \\mathcal{T}} -y\\log(\\text{ntp}_{\\mathbf{\\theta}}^{\\mathfrak{K}}([s,i,j],d)_{\\rho}) - (1 - y)\\log(1 - \\text{ntp}_{\\mathbf{\\theta}}^{\\mathfrak{K}}([s,i,j],d)_{\\rho})$$\nwhere $[s, i, j]$ is an atom and $y$ is the labeled proof score, which is $1$ for original ground atoms and $0$ for the corrupted ones that were added in later.\nExperimental Results Looking at the table of results, NTP$\\lambda$ (NTP combined with ComplEx) had comparable results with ComplEx across the board, even though the accuracy was slightly higher for NTP$\\lambda$ on most metrics. The paper points out that one advantage that NTPs have is that they are more interpretable, in the sense that their induced rules can be examined.\nBut what is \u0026#34;End-To-End Differentiable\u0026#34;? End-To-End Differentiability, from what I have understood, refers to the fact that each of the modules within the larger ntp module has a derivative with respect to the vector representations of terms, making it possible to perform gradient descent on the loss. According to the appendix, the caveat is that the graph is so large that it becomes infeasible to backpropagate through it to get an exact gradient, which means that they resort to a heuristic approximation of the gradient.\nFinal Thoughts This paper was quite interesting and exposed me to several new concepts in automated reasoning and machine learning. I am not so sure whether this area has potential for future success, but it could remain in my purview. 
","permalink":"https://aarohgokhale.github.io/technical/end-to-end-diff-prove/","summary":"Here, I summarize and try to explain in detail what I learned from the paper entitled \u0026#34;End-to-End Differentiable Proving\u0026#34; by Rocktäschel and Riedel.","title":"A Summary of \"End-to-End Differentiable Proving\""},{"content":" Here, I summarize and try to explain in detail what I read and understood in the paper entitled \u0026#34;Reasoning With Neural Tensor Networks for Knowledge Base Completion\u0026#34; by Socher, Chen, Manning, and Ng from Stanford.\nMain Ideas The overall goal of the paper is to answer whether two entities, $(e_1, e_2)$, are in a given relation $R$. Neural Tensor Network This is a modified neural network architecture that has a bilinear tensor layer instead of a standard linear layer that directly relates the two entities. The aim of this model is compute a score that indicates how likely it is for the two entities to be in a given relationship. The function is defined by:\n$$g(e_1, R, e_2) = u^{T}_{R} f \\left(e_1^T W_{R}^{[1:k]}e_2 + V_{R}\\begin{bmatrix}e_1 \\\\ e_2\\end{bmatrix} + b_R\\right)$$\n$W_{R}^{[1:k]} \\in \\mathbb{R}^{d \\times d \\times k}$ is a tensor, and $e_{1}^{T}W_{R}^{[1:k]}e_2$ is what the paper calls a \u0026#34;bilinear tensor product\u0026#34; (I couldn\u0026#39;t find a formal definition of this anywhere online), which is then added to the output of a standard layer, $V_R$, which is then added to the the bias, $b_R$. 
The whole sum is then passed through $f$, which is elementwise $\\tanh$, and finally multiplied on the left by $u_R^{T}$, where $u_R$ determines how the activated weights are combined to get a single final score $g \\in \\mathbb{R}$.\nThis equation seemed a bit daunting to me at first, so here\u0026#39;s a more careful examination of what is going on:\nFirst, a reminder of what $\\tanh$ looks like:\nCredit: Desmos The following sum is fed into an elementwise $\\tanh$ that operates on a vector in $\\mathbb{R}^{k}$:\n$$e_{1}^{T}W_{R}^{[1:k]}e_2 + V_{R}\\begin{bmatrix}e_1 \\\\ e_2\\end{bmatrix} + b_R$$\nThe easiest thing to identify here is the bias node, which is represented by the vector $b_R$. The next easy thing to identify here is the regular neural network layer, represented by the product $V_{R}\\begin{bmatrix}e_1 \\\\ e_2\\end{bmatrix}$, where $V_{R} \\in \\mathbb{R}^{k \\times 2d}$ represents the weight matrix that specifies how to linearly combine the input in $k$ different ways. The vector $\\begin{bmatrix}e_1 \\\\ e_2 \\end{bmatrix}$ is just a single vector in $\\mathbb{R}^{2d}$ constructed by vertically concatenating the entries of $e_1$ and $e_2$ into one vector. So far so good. The expression $V_{R}\\begin{bmatrix}e_1 \\\\ e_2 \\end{bmatrix} + b_R$ itself is taken straight out of the expression for a single neural network layer in a classical neural network, where this sum is then passed through an activation function and then through the remaining layers. All that remains to parse is the most interesting and different part of the sum, the bilinear tensor product, $e_{1}^{T}W_{R}^{[1:k]}e_{2}$. This notation was slightly confusing, but a diagram from the paper was illustrative: this operation represents stacking $k$ bilinear forms $e_1^{T}W_{R}^{i}e_2, i \\in \\{1, \\ldots, k\\}$ on top of each other to get a vector in $\\mathbb{R}^{k}$. 
The tensor $W_{R}^{[1:k]}$ can be thought of as $k$ slices put together, where each slice is a $d \\times d$ matrix relating entries from $e_1$ to entries in $e_2$. A concrete example might be of use: Let $d, k = 2$, let $e_1 = \\begin{bmatrix}a \\\\ b\\end{bmatrix}$ and let $e_2 = \\begin{bmatrix}c \\\\ d\\end{bmatrix}$. Let $W^{1}_{R} = \\begin{bmatrix} w_{11}^1 \u0026amp; w_{12}^1 \\\\ w_{21}^1 \u0026amp; w_{22}^1 \\end{bmatrix}$ and $W_{R}^{2} = \\begin{bmatrix} w_{11}^2 \u0026amp; w_{12}^2 \\\\ w_{21}^2 \u0026amp; w_{22}^2 \\end{bmatrix}$. Then $$W_{R}^{1}e_2 = \\begin{bmatrix}w_{11}^{1}c + w_{12}^{1}d \\\\ w_{21}^{1}c + w_{22}^{1}d\\end{bmatrix} \\text{ and } W_{R}^{2}e_2 = \\begin{bmatrix}w_{11}^{2}c + w_{12}^{2}d \\\\ w_{21}^{2}c + w_{22}^{2}d\\end{bmatrix}$$ Then\n$$e_{1}^{T}W_{R}^{1}e_2 = a(w_{11}^{1}c + w_{12}^{1}d) + b(w_{21}^{1}c + w_{22}^{1}d) $$\nand\n$$e_{1}^{T}W_{R}^{2}e_2 = a(w_{11}^{2}c + w_{12}^{2}d) + b(w_{21}^{2}c + w_{22}^{2}d)$$\nThe interesting thing to note in both of these forms is that each entry in $e_1$ gets to be multiplied with each entry in $e_2$, and the product of any two individual entries is given a distinct weight. The final \u0026#34;bilinear tensor product\u0026#34; is then\n$$e_{1}^{T}W_{R}^{[1:2]}e_{2} = \\begin{bmatrix} w_{11}^{1}ac + w_{12}^{1}ad + w_{21}^{1}bc + w_{22}^{1}bd \\\\ w_{11}^{2}ac + w_{12}^{2}ad + w_{21}^{2}bc + w_{22}^{2}bd \\end{bmatrix}$$\nWhat I have understood through this example is that the bilinear tensor product term is just $k$ bilinear forms stacked on top of each other. What this means is that each entry in $e_1$ gets to be multiplied with each entry in $e_2$ $k$ times with $k$ different weights. Thus, we get to turn $k$ knobs, where a knob is a $d \\times d$ matrix representing the strength of association between pairs of entries in $e_1$ and $e_2$. 
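The stacked bilinear forms above are easy to compute numerically. Here is a minimal sketch (my own code, not from the paper) that builds the bilinear tensor product with NumPy and checks it against the slice-by-slice expansion worked out above, using the same toy dimensions $d = k = 2$:

```python
import numpy as np

# Toy dimensions matching the worked example: d = k = 2
d, k = 2, 2
rng = np.random.default_rng(0)

e1 = rng.normal(size=d)         # plays the role of (a, b)
e2 = rng.normal(size=d)         # plays the role of (c, d)
W = rng.normal(size=(k, d, d))  # k slices, each a d x d bilinear form

# Bilinear tensor product: entry i of the result is e1^T W[i] e2
bilinear = np.einsum("i,kij,j->k", e1, W, e2)

# Sanity check against computing each bilinear form slice by slice
expected = np.array([e1 @ W[0] @ e2, e1 @ W[1] @ e2])
assert np.allclose(bilinear, expected)
```

The single `einsum` call makes the structure explicit: every pairwise product of an entry of $e_1$ with an entry of $e_2$ gets its own weight, once per slice, and the $k$ scalars are stacked into a vector in $\mathbb{R}^{k}$.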
The paper explains that this bilinear term allows the model to explicitly relate the two inputs multiplicatively, rather than just having an implicit nonlinear association that we would get with this term removed.\nIn summary, not only do we get to control how the stacked input vector is recombined, we also get to control how pairwise products of the vector entries are weighted.\nFinally, once the big sum is passed through the $\\tanh$ activation function, the resulting $k$-vector gets multiplied by $u_{R}^{T}$, which is a row $k$-vector, thus giving us a single score at the very end.\nThe paper points out that the Neural Tensor Network model, as defined above, combines the ideas and strengths from several different model types.\nLoss Function The loss function or training objective in this paper is called a \u0026#34;contrastive max-margin\u0026#34; objective function. The paper describes one main idea used to motivate this objective function: if we have training triplets $T^{(i)} = (e_{1}^{(i)}, R^{(i)}, e_{2}^{(i)})$, each triplet that actually belongs to the training set should receive a higher score than a triplet where one of the entities is replaced randomly with a new entity. This seems like a natural requirement, since the relationships defined by triplets in the training set are known to be true. A triplet where an entity has been replaced by a random entity is called a corrupted triplet. The set of corrupted triplets is denoted by $T_{c}^{(i)} = (e_{1}^{(i)}, R^{(i)}, e_c)$. Here, $e_c$ has been randomly sampled from the set of all entities that can appear at that position in the relation $R^{(i)}$. (‼ one point I was confused about here was whether or not $e_c$ is parameterized by $i$. It seems like it should be, since the possible choices of $e_c$ depend on the relation $R^{(i)}$, which itself is indexed by $i$). What I found a little bit interesting here is that the corruption only happens in one position. 
A relation $R$ doesn\u0026#39;t have to be symmetric, which means that a corruption $(e_1, R, e_c)$ is different from a corruption $(e_c, R, e_2)$. Why, then, do we only corrupt on the right?\nAs we saw earlier, the Neural Tensor Network model itself is parameterized by the choice of relation $R$, and in particular, each relation $R$ has its own set of weight matrices/tensors, $W_R, V_R, u_R, b_R$. Here, I faced another point of confusion. The paper defines $\\mathbf{\\Omega}$ to be the set of NTN parameters for all relationships, and it comprises $\\mathbf{u}$, $\\mathbf{W}$, $\\mathbf{V}$, $\\mathbf{b}$, and $\\mathbf{E}$. While the first four of these are clear, I am a little confused about what $E$ is supposed to be. Is it the set of all entities? Finally, the paper defines the objective function as:\n$$J(\\mathbf{\\Omega}) = \\sum_{i = 1}^{N}\\sum_{c = 1}^{C}\\max\\left(0, 1 - g(T^{(i)}) + g(T_{c}^{(i)})\\right) + \\lambda ||\\mathbf{\\Omega}||_{2}^{2}$$\nwhere $N$ is the number of training points and $C$ is the number of randomly sampled corrupted triplets for each given correct triplet (i.e. each triplet in the training set). The max in the summation forces the minimizer to drive $g(T^{(i)})$ to be as much larger than $g(T_{c}^{(i)})$ as possible, up until it reaches exactly $1$ more than $g(T_{c}^{(i)})$, at which point any additional increase in $g(T^{(i)})$ is meaningless for the output $J$. The $\\lambda ||\\mathbf{\\Omega}||_{2}^{2}$ summand is a standard $L_2$ regularization term that helps prevent overfitting.\nThis equation for the objective function was a little puzzling initially, since it isn\u0026#39;t quite clear what it means to take the $2$-norm of $\\mathbf{\\Omega}$, which itself wasn\u0026#39;t defined very precisely. Though reading the paragraph after that reveals that this ambiguous notation is actually defining a set of five different objective functions (perhaps we can view the final objective as the minimization of their sum?) 
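Notation aside, the hinge part of the objective is simple to make concrete. Below is a minimal sketch of my own (not the paper's code): `g` is any given scoring function, and `params_flat` is an assumed flattening of the parameters in $\mathbf{\Omega}$ for the regularizer, sidestepping the $||\mathbf{\Omega}||_{2}^{2}$ ambiguity discussed above.

```python
import numpy as np

def max_margin_loss(g, triplets, corruptions, params_flat, lam):
    """Contrastive max-margin objective: each correct triplet should
    score at least 1 higher than each of its corrupted counterparts;
    any shortfall is penalized linearly, plus L2 regularization."""
    loss = 0.0
    for t_true, t_corrupted_list in zip(triplets, corruptions):
        for t_c in t_corrupted_list:
            loss += max(0.0, 1.0 - g(t_true) + g(t_c))
    return loss + lam * float(np.dot(params_flat, params_flat))

# Toy usage: a lookup table stands in for a real NTN scorer
scores = {"true": 2.0, "corrupt_easy": 0.5, "corrupt_hard": 1.5}
loss = max_margin_loss(scores.get, ["true"],
                       [["corrupt_easy", "corrupt_hard"]],
                       np.zeros(4), 0.01)
# hinge terms: max(0, 1 - 2.0 + 0.5) = 0 and max(0, 1 - 2.0 + 1.5) = 0.5
```

Note how the first corrupted triplet already sits more than a margin of $1$ below the true score and so contributes nothing, while the second one still incurs a penalty.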
This is still a point of slight unclarity for me. The paper uses the L-BFGS nonlinear optimization method to find a local minimum of the cost function. Vector Representations In the framework being used for this paper, each entity has a vector representation $e \\in \\mathbb{R}^d$. It seems like this framework was being used in multiple papers in the early 2010s, including in \u0026#34;Learning Structured Embeddings of Knowledge Bases\u0026#34; by Bordes, Weston, Collobert, and Bengio, in which a way of assigning entities vector representations is discussed. The NTN paper (the one currently being summarized) states that the NTN model works well with randomly initialized entity vectors, which are then learned for each entity through the training process (since the actual relationships between entity vectors are part of the training data, which then translates to the learned function $g$). The paper also proposes a new scheme for representing entities using the composition of word vectors, which are vectors in $\\mathbb{R}^d$. An entity is represented by the average of the vectors of the words that compose it. For example, $v_{\\textit{homo sapiens}} = 0.5(v_{\\textit{homo}} + v_{\\textit{sapiens}})$. This can then embed some similarities between entities even before training. The example used in the paper is homo erectus. If this entity hasn\u0026#39;t been seen before, a fact about homo sapiens can still be extended to it due to the fact that $v_{\\textit{homo}}$ is in the word compositions for both vector representations, which means that $v_{\\textit{homo erectus}}$ will start out relatively close to $v_{\\textit{homo sapiens}}$ even though $v_{\\textit{erectus}}$ is random.\nThe total number of entities is $N_E$ and the total number of unique words is $N_W$. 
If the training is done on words, the entity embedding is $E \\in \\mathbb{R}^{d \\times N_W}$ and if the training is performed with whole vectors, the entity embedding is $E \\in \\mathbb{R}^{d \\times N_E}$. Experimental Results The experiments performed in the paper were quite successful, achieving accuracies of $86.2\\%$ on the WordNet dataset and $90\\%$ on the FreeBase dataset, though the improvement seemed marginal over an existing model called the Bilinear Model (not quite the same as the NTN, though it uses an idea that inspired the NTN).\nFinal Thoughts This was my first look at Knowledge Base completion. I thought it was quite an interesting area and I might look further into it later. What brought me to this paper was the paper called End-To-End Differentiable Proving by Rocktäschel and Riedel, which I wanted to study as a part of my dive into automated and neurosymbolic reasoning. I will attempt to summarize that paper next.\n","permalink":"https://aarohgokhale.github.io/technical/neural-tensor-kb-completion/","summary":"Here, I summarize and try to explain in detail what I read and understood in the paper entitled \u0026#34;Reasoning With Neural Tensor Networks for Knowledge Base Completion\u0026#34; by Socher, Chen, Manning, and Ng from Stanford.","title":"A Summary of \"Reasoning With Neural Tensor Networks for Knowledge Base Completion\""},{"content":" While looking at computer science research areas that I could find interesting, I stumbled upon formal methods, and more specifically, automated symbolic reasoning, theorem proving, and the integration of modern machine learning with formal reasoning. I decided to read some research papers to get a feel for this area, since it looked quite interesting to me. In this article, I am going to review and outline one interesting paper I read in this area. 
I will continue writing further articles about other papers I read.\nPUTNAMBENCH The first interesting paper I stumbled upon was the PUTNAMBENCH paper by Tsoukalas et al., where the capabilities of modern neural models in proving theorems in the framework of theorem provers such as Lean 4, Isabelle, and Coq are tested. These frameworks can automatically and rigorously verify the correctness of the proofs provided by the neural models. In this paper, the authors formalized hundreds of problems from the William Lowell Putnam Mathematical Competition. The improvement that PUTNAMBENCH makes on existing benchmarks is that it introduces college level problems into the mix, with some problems even requiring ideas from research level mathematics, according to the paper. A few additional reasons cited for the creation of this benchmark were:\nThe limited scope of existing benchmarks\nExisting benchmarks being designed for older frameworks\nPreventing the leakage of benchmark data into the training data for LLMs (in general, the paper claims that this necessitates periodically creating new benchmarks)\nOne issue that PUTNAMBENCH had to address was that Putnam problems often aren\u0026#39;t stated as logical propositions. In fact, more often than not, they require the student to both come up with a closed form solution and then prove that the solution is indeed correct. PUTNAMBENCH addresses this issue by splitting the generation of closed form solutions and the proofs of correctness into two tasks of different difficulty levels, where success in one task likely has high correlation with success in the other. The second task only asks for a proof of correctness of a pre-provided closed form solution. 
The first task is a strict superset of the second task, since it requires not only the generation of a closed form solution, but also a proof of correctness.\nPUTNAMBENCH is claimed in the paper to be the first formalization of a large number of Putnam problems in Lean, Isabelle, or Coq, which is what is used to justify the idea that there isn\u0026#39;t much cross-contamination between the dataset produced by the paper and the data used by large language models for training. I found this to be an interesting claim. Large language models probably have seen Putnam problems and their solutions in their natural language forms, but the claim that they haven\u0026#39;t been exposed to formalizations of these problems and their proofs does seem plausible. It is then an interesting question whether or not seeing the natural language variants would give a language model an unfair advantage in the solving of the formalizations. The paper does acknowledge the possibility of such an indirect form of contamination.\nThe results of running various theorem proving models on these formalizations were quite astonishing to me, as a newcomer to this field. None of the provers were able to solve more than a handful of the problems. I don\u0026#39;t know whether this is typical of benchmarks for formal theorem proving, but it was a surprise to me. It also indicated to me that there is much progress left to be made in this area.\nAfter reading this paper, I was curious to learn about the current state of the art in neurosymbolic reasoning. I wanted to learn how some of the models used (though unsuccessfully) in the PUTNAMBENCH paper worked. I therefore started reading some papers in this area. I also wanted to learn a bit about theorem proving frameworks, so I began reading about those.\n","permalink":"https://aarohgokhale.github.io/technical/putnam-bench/","summary":"A summary of the \u003ca href=\"https://arxiv.org/abs/2407.11214\"\u003ePUTNAMBENCH\u003c/a\u003e paper by Tsoukalas et al. 
In this paper, the authors formalized hundreds of problems from the William Lowell Putnam Mathematical Competition in order to test the capabilities of modern \u003cem\u003eneural models\u003c/em\u003e in proving theorems in the frameworks of theorem provers such as Lean 4, Isabelle, and Coq. These frameworks can automatically and rigorously verify the correctness of the proofs provided by the neural models.","title":"A Summary of PUTNAMBENCH"},{"content":" This is a collection of some of my favorite Hindustani classical music performances on YouTube grouped by Raag. The raags aren\u0026#39;t guaranteed to appear in any particular order, though I have tried to list them cyclically according to the time of day during which they are performed. The section names, i.e. the names of the raags, are clickable links. They lead to pages on https://ragajunglism.org/, which is a great resource for learning more about Hindustani classical music.\nSome of the below groups are short because I haven\u0026#39;t listened to enough recordings of the given raag. Others are short because the listed performances are so incredible that I didn\u0026#39;t feel the need to add any more. 
Differentiating between the two types is left as an exercise for the reader.\nYaman Malini Rajurkar\nDuration: 53:26 Bhimsen Joshi\nDuration: 1:19:32 Amir Khan\nDuration: 54:43 Nikhil Banerjee\nDuration: 1:05:09 Sultan Khan\nDuration: 27:04 Jitendra Abhisheki\nDuration: 59:45 Uday Bhawalkar\nDuration: 1:14:05 Jayateerth Mevundi\nDuration: 43:59 Darbari Kanada Amir Khan\nDuration: 1:06:38 Bhimsen Joshi - 1st Duration: 1:03:39 Bhimsen Joshi - 2nd\nDuration: 1:00:41 Veena Sahasrabuddhe\nDuration: 30:04 Munir Khan\nDuration: 18:34 Nikhil Banerjee\nDuration: 59:32 Gundecha Brothers\nDuration: 1:16:32 Jasraj\nDuration: 1:09:31 Mallikarjun Mansur\nDuration: 14:37 Bhairav Nikhil Banerjee\nDuration: 1:19:49 Uday Bhawalkar\nDuration: 1:29:31 Niloy Ahsan\nDuration: 59:30 Todi Bhimsen Joshi\nDuration: 1:09:03 Kishori Amonkar\nDuration: 1:07:22 Venkatesh Kumar\nDuration: 43:29 Rashid Khan\nDuration: 42:06 Nikhil Banerjee\nDuration: 49:27 Amjad Ali Khan\nDuration: 30:34 Bhimpalasi Ulhas Kashalkhar\nDuration: 50:59 Nikhil Banerjee\nDuration: 35:38 Kumar Gandharva\nDuration: 32:40 Sultan Khan\nDuration: 14:44 Jasraj\nDuration: 1:12:04 Kaushiki Chakraborty\nDuration: 57:30 Ashiwini Bhide-Deshpande\nDuration: 29:49 Aniruddh Aithal\nDuration: 23:38 Multani Bhimsen Joshi\nDuration: 35:54 Kumar Gandharva\nDuration: 41:38 Mallikarjun Mansur\nDuration: 41:47 Uday Bhawalkar\nDuration: 1:18:02 Sabri Khan\nDuration: 21:11 Ali Akbar Khan\nDuration: 19:02 Marwa Amir Khan - 1st\nDuration: 18:35 Amir Khan - 2nd\nDuration: 1:02:51 Vasantrao Deshpande\nDuration: 43:55 Kishori Amonkar\nDuration: 39:40 Rashid Khan\nDuration: 52:14 Chandrakant Limaye\nDuration: 19:51 Jitendra Abhisheki\nDuration: 18:44 Sultan Khan\nDuration: 56:57 Shuddha Kalyan Bhimsen Joshi\nDuration: 58:32 Abdul Karim Khan\nDuration: 4:21 Venkatesh Kumar\nDuration: 49:55 Uday Bhawalkar\nDuration: 1:04:02 Kishori Amonkar\nDuration: 56:05 Amir Khan\nDuration: 39:25 Ajoy Chakraborty\nDuration: 13:04 Malkauns Bhimsen Joshi - 
1st\nDuration: 43:06 Bhimsen Joshi - 2nd\nDuration: 54:18 Kumar Gandharva\nDuration: 57:51 Nikhil Banerjee\nDuration: 41:24 Amir Khan\nDuration: 21:34 Bade Ghulam Ali Khan\nDuration: 17:56 Vasantrao Deshpande\nDuration: 29:25 Uday Bhawalkar\nDuration: 1:28:58 Lalit Amir Khan\nDuration: 22:16 Kesarbai Kerkar\nDuration: 4:18 Bhimsen Joshi\nDuration: 12:13 Rashid Khan\nDuration: 49:16 Vijay Koparkar\nDuration: 56:28 Rajan and Sajan Mishra (Mislabelled by YouTube as Bhairavi)\nDuration: 8:17 Jayateerth Mevundi\nDuration: 25:40 (not the whole video) Jaunpuri Malini Rajurkar\nDuration: 22:06 Venkatesh Kumar\nDuration: 27:36 Bhimsen Joshi\nDuration: 8:51 Vrndavani Sarang Bhimsen Joshi\nDuration: 17:45 Jitendra Abhisheki\nDuration: 17:20 Rashid Khan\nDuration: 42:11 Venkatesh Kumar\nDuration: 19:18 Ulhas Kashalkhar\nDuration: 25:50 Puriya Dhanashree Bhimsen Joshi\nDuration: 15:32 Veena Sahasrabuddhe\nDuration: 47:41 Vasantrao Deshpande\nDuration: 17:14 Haunsadhwani Amir Khan\nDuration: 20:02 Kishori Amonkar\nDuration: 8:58 Jayateerth Mevundi\nDuration: 46:15 Kalavati Salamat and Nazakat Ali Khan\nDuration: 22:13 Sanjeev Abhyankar\nDuration: 25:33 Ajoy Chakraborty\nDuration: 19:32 Prabha Atre\nDuration: 10:35 Nikhil Banerjee\nDuration: 27:28 Bahar Dattatreya Vishnu Paluskar\nDuration: 4:49 Bhimsen Joshi\nDuration: 9:09 Kumar Gandharva\nDuration: 7:40 Jayateerth Mevundi\nDuration: 9:08 Asavari Todi Bhimsen Joshi\nDuration: 56:06 (not the whole video) Amir Khan\nDuration: 45:29 Jayateerth Mevundi\nDuration: 54:12 Venkatesh Kumar\nDuration: 43:42 ","permalink":"https://aarohgokhale.github.io/music/hindustani-classical-music-recommendations/","summary":"\u003cp\u003e\nThis is a collection of some of my favorite Hindustani classical music performances on YouTube grouped by \u003ca href=\"https://en.wikipedia.org/wiki/Raga\"\u003eRaag\u003c/a\u003e. 
The raags aren\u0026#39;t guaranteed to appear in any particular order, though I have tried to list them cyclically according to the time of day during which they are performed. The section names, i.e. the names of the raags, are clickable links. They lead to pages on \u003ca href=\"https://ragajunglism.org/,\"\u003ehttps://ragajunglism.org/,\u003c/a\u003e which is a great resource for learning more about Hindustani classical music.\u003c/p\u003e","title":"Hindustani Classical Music Recommendations"}]