Weekly Updates

SS 2023/2024 - Week 1: 18.2. - 25.2.

This week we:
  • studied the paper on one of the neuro-symbolic RL approaches, namely, NUDGE (Literature entry 1).

SS 2023/2024 - Week 2: 26.2. - 3.3.

This week we:
  • surveyed other recent neuro-symbolic RL approaches (Literature entries 2-7).

SS 2023/2024 - Week 3: 4.3. - 10.3.

This week we:
  • studied the papers on Logical Neural Networks (Literature entry 4) and their application to RL in the context of text-based games (Literature entry 3);
  • found two publications (Literature entries 8 and 9) in which the authors combine neural and symbolic policies operating on different levels of abstraction, which could help us utilize a symbolic representation even in the continuous action and state spaces that naturally arise in robot-control tasks;
  • set up the website for our thesis.

SS 2023/2024 - Week 4: 11.3. - 17.3.

This week we:
  • decided to try using a simple multilayer perceptron with an AND-OR architecture from Literature entry 3 on the meta-controller level of the hierarchical approach described in Literature entries 8 and 9;
  • implemented the core utility classes and functions for creating instances of the Blocks World problem that can be used as an RL environment for experimenting with the above-described type of meta-controller:
    • each state is represented by a set of logical facts that are true in that state (of the form on(X, Y) and top(X));
    • possible actions are of the form move(X, Y) - move block X onto block Y or onto the table;
    • the agent is rewarded for arriving at the goal state and slightly penalized for each step it takes on its way;
    • an episode also ends when an invalid action is performed, for which the agent receives a higher negative reward.
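
A minimal sketch of such a symbolic environment under the representation described above (class and method names as well as the concrete reward values are illustrative and need not match our actual code):

  import itertools


  class SymbolicBlocksWorld:
      """States are sets of facts such as ("on", "a", "table") and ("top", "a")."""

      STEP_PENALTY = -0.1     # slight penalty for every step (illustrative value)
      INVALID_PENALTY = -1.0  # higher negative reward for an invalid action
      GOAL_REWARD = 1.0       # reward for reaching the goal state

      def __init__(self, blocks, goal_facts):
          self.blocks = list(blocks)
          self.goal_facts = frozenset(goal_facts)
          self.reset()

      def reset(self):
          # All blocks start on the table, each with a clear top.
          self.state = {("on", b, "table") for b in self.blocks}
          self.state |= {("top", b) for b in self.blocks}
          return frozenset(self.state)

      def actions(self):
          # move(X, Y): move block X onto block Y or onto the table.
          return [("move", x, y) for x, y in
                  itertools.product(self.blocks, self.blocks + ["table"]) if x != y]

      def step(self, action):
          _, x, y = action
          valid = ("top", x) in self.state and (y == "table" or ("top", y) in self.state)
          if not valid:
              # Invalid action: higher negative reward and immediate termination.
              return frozenset(self.state), self.INVALID_PENALTY, True
          # Update the on(...) and top(...) facts affected by the move.
          source = next(f for f in self.state if f[0] == "on" and f[1] == x)
          self.state.discard(source)
          if source[2] != "table":
              self.state.add(("top", source[2]))
          self.state.add(("on", x, y))
          if y != "table":
              self.state.discard(("top", y))
          done = self.goal_facts <= self.state
          reward = self.GOAL_REWARD if done else self.STEP_PENALTY
          return frozenset(self.state), reward, done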

SS 2023/2024 - Week 5: 18.3. - 24.3.

This week we:
  • prepared a basic implementation of our meta-controller (agent) using the activation functions for AND and OR designed for dNL-ILP (Literature entry 10):
    • the agent works on a propositional level, i.e., takes the truth values (0/1) of each fact from the language for describing the instance of the Blocks World problem being solved;
    • for each possible action, there is a separate AND-OR architecture with a single output neuron, which gives us the truth value or certainty with which the agent wants to perform that action in the given input state;
    • a softmax is then applied over the certainties for all actions;
    • learning is done via the basic REINFORCE algorithm.
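
The following is a minimal PyTorch sketch of this architecture, using the conjunction and disjunction activations as we understand them from dNL-ILP (Literature entry 10); our original implementation differs, and class names, layer sizes, and the number of clauses are illustrative:

  import torch
  import torch.nn as nn


  class DNLAnd(nn.Module):
      """Soft conjunction: output_j = prod_i (1 - m_ji * (1 - x_i)), m = sigmoid(w)."""

      def __init__(self, n_in, n_out):
          super().__init__()
          self.weight = nn.Parameter(torch.randn(n_out, n_in))

      def forward(self, x):                       # x: (batch, n_in), truth values in [0, 1]
          m = torch.sigmoid(self.weight)          # soft membership of each input
          return torch.prod(1 - m * (1 - x.unsqueeze(1)), dim=-1)   # (batch, n_out)


  class DNLOr(nn.Module):
      """Soft disjunction: output_j = 1 - prod_i (1 - m_ji * x_i), m = sigmoid(w)."""

      def __init__(self, n_in, n_out):
          super().__init__()
          self.weight = nn.Parameter(torch.randn(n_out, n_in))

      def forward(self, x):
          m = torch.sigmoid(self.weight)
          return 1 - torch.prod(1 - m * x.unsqueeze(1), dim=-1)


  class MetaController(nn.Module):
      """One AND-OR tower per action; softmax over the resulting certainties."""

      def __init__(self, n_facts, n_actions, n_clauses=4):
          super().__init__()
          self.towers = nn.ModuleList(
              nn.Sequential(DNLAnd(n_facts, n_clauses), DNLOr(n_clauses, 1))
              for _ in range(n_actions))

      def forward(self, facts):                   # facts: (batch, n_facts) of 0/1 values
          certainties = torch.cat([tower(facts) for tower in self.towers], dim=-1)
          return torch.softmax(certainties, dim=-1)   # action probabilities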

SS 2023/2024 - Week 6: 25.3. - 31.3.

This week we:
  • improved/debugged our implementation of both the agent and the environment;
    • a simple normalization is done instead of softmax (for easier and more accurate computation);
    • the agent is also rewarded for reaching a subgoal (building a substack of the desired stack);
  • experimented with our implementation in environments with up to four blocks:
    • the task was to stack initially unstacked blocks onto one another in a predefined order to form a single stack;
    • the agent successfully learned the optimal policy within 100-200 episodes;
  • identified several problems that we would like to address in the future:
    • how to reasonably detect actions that are not useful at all;
    • how to enforce that multiple ANDs connected with the same OR are optimized for different case scenarios;
    • is it possible to train one agent that would generalize to any number of blocks.

SS 2023/2024 - Week 7: 1.4. - 7.4.

This week we:
  • started thinking of how to generalize our agent to predicate logic (i.e., one agent for any number of blocks):
    • should the appropriateness of executing the action move(X, Y) depend only on information about X and Y and on no other block;
    • how to integrate such other blocks into the decision process (existential/universal quantification);
  • noticed that the current agent struggles with learning optimal policies for five blocks and identified two possible causes:
    • only the simple REINFORCE algorithm is used;
    • the state space becomes too large.

SS 2023/2024 - Week 8: 8.4. - 14.4.

This week we:
  • conducted several experiments to test the scalability of the current agent and found that adding one block enlarges the state space so much that the time to convergence increases by at least an order of magnitude (from 100 to 1000 epochs);
  • tried learning an agent with only one MLP for all actions of type move(X, Y) (a sketch of this architecture follows this list):
    • the agent was given the truth values of top(X), top(Y), on(X, Y), and on(Y, X) in both the current and the goal state;
    • the output was calculated as the AND of an AND over these facts in the current state (validity branch) and an AND over these facts in both the current and the goal state (profitability branch);
    • this agent was able to learn the best possible policy for three blocks under these circumstances, i.e., it assigned the same importance to move(B, C) and move(A, B) in the initial state, because with such a simple architecture, both seemed reasonable (there is no mechanism to determine the order in which to take these actions);
  • consulted our progress with doc. Homola, who promised to provide us with literature on the topics of planning and neuro-symbolic AI.
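
A minimal sketch of the shared move(X, Y) scorer with the two AND branches described above (the soft AND uses the same dNL-ILP-style product as before; all names are illustrative):

  import torch
  import torch.nn as nn


  def soft_and(x, weight):
      """prod_i (1 - m_i * (1 - x_i)) with m_i = sigmoid(weight_i); x: (batch, n)."""
      m = torch.sigmoid(weight)
      return torch.prod(1 - m * (1 - x), dim=-1)


  class MoveScorer(nn.Module):
      """Scores move(X, Y) from the facts top(X), top(Y), on(X, Y), on(Y, X)."""

      def __init__(self):
          super().__init__()
          self.validity_weight = nn.Parameter(torch.randn(4))       # current-state facts
          self.profitability_weight = nn.Parameter(torch.randn(8))  # current + goal facts

      def forward(self, current_facts, goal_facts):
          # current_facts, goal_facts: (batch, 4) truth values of
          # [top(X), top(Y), on(X, Y), on(Y, X)] in the respective state.
          validity = soft_and(current_facts, self.validity_weight)
          profitability = soft_and(
              torch.cat([current_facts, goal_facts], dim=-1), self.profitability_weight)
          return validity * profitability   # outer AND of the two branches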

SS 2023/2024 - Week 9: 15.4. - 21.4.

This week we:
  • tried implementing an agent that would learn only the validity of actions:
    • the original Blocks World problem was transformed into a continuing task, in which the goal of the agent is to keep moving blocks for as long as possible without attempting to perform an invalid action;
    • the agent was the same generic agent with one MLP for move(X, Y) having only the validity branch;
    • no experiments were carried out yet, since a reimplementation of the REINFORCE algorithm is required as well;
  • consulted our progress with doc. Homola, who promised to provide us with literature on the topics of planning, abstraction, and neuro-symbolic AI.

SS 2023/2024 - Week 10: 22.4. - 28.4.

This week we:
  • quickly went through the literature provided by doc. Homola and concluded that the most interesting sources could be Literature entries 11 and 12;
  • watched the seminar on planning shared by doc. Homola;
  • thought more deeply about how to generalize in planning while keeping interpretability/explainability:
    • with our current agent, we would like to embed all the reasons for performing a certain action in its preconditions, but what if that is too complex (or even impossible)?
    • even in standard planning, we probably cannot determine what action to take next without having a model of the environment and looking ahead to create at least a part of a plan;
    • the plan itself is probably the only explanation of our actions in the end;
    • there is an approach that uses symbolic planning as the symbolic component of a neuro-symbolic RL agent (Literature entry 8).

SS 2023/2024 - Week 11: 29.4. - 5.5.

This week we:
  • started studying Literature entry 8 more deeply;
  • inspired by Literature entries 8 and 9 and our previous (unfinished) approach to learn only the validity of actions, decided that we should focus on using symbolic methods for planning and abstraction and subsymbolic methods for learning preconditions and effects from interaction with the environment to achieve generalization:
    • subsymbolic elements will be used to perform actions in the environment and perceive the current state;
    • a model of symbolic transitions will be learned based on interactions;
    • symbolic abstraction will be used to obtain generic preconditions and effects;
    • symbolic planning will be used to choose the next action;
    • RL rewards/returns will be used to evaluate the quality of states so that it suffices to search for plans only up to a certain length.

SS 2023/2024 - Week 12: 6.5. - 12.5.

This week we:
  • finished studying Literature entry 8;
  • created a LaTeX template for our thesis;
  • started preparing our presentation for the seminar.

SS 2023/2024 - Week 13: 13.5. - 19.5.

This week we:
  • finalized our presentation for the seminar.

Inter-Academic-Year Period

During this period we:
  • examined more deeply what explainability in RL means and found a few surveys on this matter (Literature entries 13, 14), with the following conclusions:
    • having an NN-based policy enables us to study what the agent would do in a given, hypothetical state;
    • having a learned transition model in addition gives us a means to find reasons behind the agent's decisions by examining what changes to the environment it expects throughout multiple consecutive steps;
    • therefore, we would like to base our final approach on the above-mentioned combination rather than on the combination of symbolic planning with a predefined / learned transition model, which could serve as a benchmark;
  • agreed with the supervisor that apart from explainability, we could also strive for better general learnability and data efficiency with our approach;
  • implemented the REINFORCE algorithm with a baseline (a simple recency-weighted estimate of the value function) in our meta-controller, which led to significant improvements in convergence, even allowing for randomized initial states.
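
A minimal sketch of REINFORCE with such a baseline, assuming a single moving-average estimate of the start-state return and an environment/policy interface like the ones sketched earlier (the actual baseline in our code may be tracked differently):

  import torch


  def reinforce_with_baseline(env, policy, episodes=1000, gamma=0.99,
                              lr=1e-2, baseline_step=0.05):
      # `env` is assumed to expose the state as a tensor of truth values and to
      # accept an action index; `policy(state)` returns action probabilities.
      optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
      baseline = 0.0   # recency-weighted estimate of the expected return
      for _ in range(episodes):
          state, log_probs, rewards, done = env.reset(), [], [], False
          while not done:
              probs = policy(state.unsqueeze(0)).squeeze(0)
              dist = torch.distributions.Categorical(probs)
              action = dist.sample()
              log_probs.append(dist.log_prob(action))
              state, reward, done = env.step(action.item())
              rewards.append(reward)
          # Discounted return for every step of the episode.
          returns, g = [], 0.0
          for r in reversed(rewards):
              g = r + gamma * g
              returns.insert(0, g)
          returns = torch.tensor(returns)
          # Move the baseline toward the observed episode return (recency weighting).
          baseline += baseline_step * (returns[0].item() - baseline)
          loss = -(torch.stack(log_probs) * (returns - baseline)).sum()
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()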

WS 2024/2025 - Week 1: 23.9. - 29.9.

This week we:
  • studied Literature entry 9 and reconsidered its combination with the approach in Literature entry 8 as planned:
    • we concluded that we would like to have a more modular, fine-grained way of learning a symbolic transition model (e.g., NN-based) than presented in Literature entry 9, where standard ILP is employed;
  • briefly looked at another paper on rule learning (Literature entry 15) for an alternative implementation of logical conjunction in NNs that allows each weight to also indicate that a negation of the corresponding input should be true or that the output does not depend on that input at all;
  • started researching the available RL simulation environments / frameworks, where we could train the entire hierarchical agent acting upon the real-world robot NICO (no neuro-symbolic RL gyms were found).

WS 2024/2025 - Week 2: 30.9. - 6.10.

This week we:
  • rewrote the implementation of our meta-controller to use PyTorch for better readability and efficiency;
  • when comparing our implementations, noticed that successful convergence is sensitive to the implementation details (hence, maybe unstable):
    • our original implementation seems to converge with certainty, whereas the results with the PyTorch one are rather disappointing (the model achieves effectively zero success rate), despite the same hyperparameter configuration;
  • installed and had a closer look at myGym - one of the RL environments we consider for training NICO (as it includes the model of NICO and also several example learning tasks);
  • had a brief talk with one of the maintainers / developers of myGym on what the benefits of using myGym would be compared to the perhaps more popular Gymnasium Robotics;
  • planned another meeting for a more in-depth demo of myGym.

WS 2024/2025 - Week 3: 7.10. - 13.10.

This week we:
  • fixed several issues in the PyTorch implementation (even rectifying the learning procedure previously used), which resolved the problems with convergence;
  • implemented an Actor-Critic algorithm in our meta-controller:
    • this further improved the performance, especially the stability of learning;
    • it also allowed us to use a sparse reward signal - invalid action: -2 and immediate termination, goal state: 2, otherwise: -0.1;
  • introduced a new AND activation function based on our current implementation and Literature entry 15, with weights in the range [-1, 1] to allow the discussed features:
    • improved performance when learning from random initial states;
    • allowed negations and full "not relevant" semantics;
    • actually does not require the "top" state atoms;
    • we tried two different versions: one where weights are simply clipped and one where tanh is used;
    • we decided to opt for the tanh version to ensure stability (a sketch of this activation follows this list);
  • replaced normalization with a scaled softmax to define a probability distribution over the actions from raw NN outputs, ensuring better numerical stability;
  • attempted to regularize the learning to eliminate uncertainty in the learned weights:
    • we tried L1, L2, and custom (polynomial / sinusoidal) regularization, but none brought the desired effects;
    • we observed that it may even hinder convergence when only one AND branch is used: the remaining uncertainty probably reflects either that the agent is still uncertain in other situations and hence leaves space for exploration, or that there are multiple possible paths to the goal (even among the shortest ones);
    • OR may help to cope with multiple possibilities;
  • met with one of the developers of myGym for a more in-depth demo of myGym.
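
A minimal sketch of the tanh-parameterized AND referenced above: a scaled weight near 1 requires the input, near -1 requires its negation, and near 0 marks the input as irrelevant. This is one plausible formulation of the idea; the exact form in our code may differ:

  import torch
  import torch.nn as nn


  class SignedAnd(nn.Module):
      """Soft AND with signed membership weights in (-1, 1)."""

      def __init__(self, n_in, n_out):
          super().__init__()
          self.weight = nn.Parameter(0.1 * torch.randn(n_out, n_in))

      def forward(self, x):                       # x: (batch, n_in), truth values in [0, 1]
          m = torch.tanh(self.weight)             # scaled weights in (-1, 1)
          x = x.unsqueeze(1)                      # (batch, 1, n_in) for broadcasting
          # Each factor equals x for m = 1, (1 - x) for m = -1, and 1 for m = 0.
          factors = 1 - torch.relu(m) * (1 - x) - torch.relu(-m) * x
          return torch.prod(factors, dim=-1)      # (batch, n_out)

The clipped variant we also tried would simply replace the tanh with clamping of the raw weights to [-1, 1].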

WS 2024/2025 - Week 4: 14.10. - 20.10.

This week we:
  • identified an inherent flaw in the employed product-based AND:
    • if at least two terms evaluate to 0, gradients are 0 everywhere;
    • this cannot be resolved as long as we insist both on the behavior that 0 annuls everything and on the gradient of the output w.r.t. each input depending on all the other inputs;
  • replaced the product-based AND with a mixture of product-based and min-based (Gödel) ANDs:
    • if the product-based AND outputs a value too close to 0, Gödel AND is used;
    • this solved the issue while still distributing the gradient among all inputs / weights when possible, the lack of which is the main limitation of the pure Gödel AND (a sketch follows this list);
  • experimented with more AND branches under either OR with some similarity regularization or XOR to encourage learning different rules:
    • the similarity regularization seemed to have either a very small positive or too strong a negative impact on learning;
    • XOR is more computationally demanding and did not prove to be more successful;
  • postponed further experimentation with regularization and OR to focus on implementing the hierarchical approach in the first place:
    • we may try adding, removing, freezing, and unfreezing AND branches dynamically to tackle the above-described issues later;
  • concluded that most prebuilt gyms are too cumbersome (including myGym);
  • discovered that MuJoCo (a physics simulator used in some of the gyms) cannot load the model of NICO;
  • therefore decided to prepare our own RL training environment, taking inspiration from the existing ones and using pybullet (a different physics simulator, on which myGym is based and which can import the NICO model).
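
A minimal sketch of the mixed AND described above: the product form is kept as long as its output stays away from 0, otherwise the Gödel (min) form takes over so that at least one input still receives a gradient (the threshold and the exact switching rule are illustrative):

  import torch


  def mixed_and(factors, eps=1e-3):
      """factors: (batch, n_out, n_in) soft truth values of the individual conjuncts."""
      product = torch.prod(factors, dim=-1)        # product-based AND
      goedel = torch.min(factors, dim=-1).values   # Gödel (min-based) AND
      # Where the product collapses toward 0, fall back to the Gödel AND so that
      # the minimal conjunct still receives a gradient.
      return torch.where(product > eps, product, goedel)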

WS 2024/2025 - Week 5: 21.10. - 27.10.

This week we:
  • downloaded the official NICO model and changed it so that only the upper part with the right arm is left;
  • almost prepared the physical environment for Blocks World, with the following features to be implemented:
    • block fall detection;
    • proper gripper operation;
  • started preparing the hierarchical environment for Blocks World.

WS 2024/2025 - Week 6: 28.10. - 3.11.

This week we:
  • added gripping actions and state evaluation to the physical environment;
  • created a transition to a canonical invalid state after an invalid action in the symbolic environment;
  • tried adaptive L1 regularization (w.r.t. the gradient), but without any significant success;
  • started refactoring the symbolic environment;
  • started writing the chapter on RL.

WS 2024/2025 - Week 7: 4.11. - 10.11.

This week we:
  • finished the refactoring of the symbolic environment;
  • refactored the physical environment;
  • improved collision detection in the physical environment;
  • added falling/moving block detection in the physical environment;
  • prepared the hierarchical environment;
  • implemented a simple Actor-Critic (AC) agent for the physical environment;
  • implemented a prototype of a hierarchical agent:
    • versions with a common and separate feature extractor;
    • the ELU activation function performed the best at first, but tanh seemed to be better for value function estimation, so we decided to always use tanh;
    • reaching a closer block is learnable.
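
A minimal sketch of the prototype with a common feature extractor and tanh activations (layer sizes and names are illustrative; the separate-extractor version simply duplicates the feature network for the actor and the critic):

  import torch
  import torch.nn as nn


  class SharedActorCritic(nn.Module):
      """Actor-critic with a common feature extractor and a Gaussian policy head."""

      def __init__(self, obs_dim, action_dim, hidden=128):
          super().__init__()
          self.features = nn.Sequential(
              nn.Linear(obs_dim, hidden), nn.Tanh(),
              nn.Linear(hidden, hidden), nn.Tanh(),
          )
          self.policy_mean = nn.Linear(hidden, action_dim)   # actor head
          self.log_std = nn.Parameter(torch.zeros(action_dim))
          self.value = nn.Linear(hidden, 1)                  # critic head

      def forward(self, obs):
          h = self.features(obs)
          dist = torch.distributions.Normal(self.policy_mean(h), self.log_std.exp())
          return dist, self.value(h).squeeze(-1)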

WS 2024/2025 - Week 8: 11.11. - 17.11.

This week we:
  • fixed minor bugs in loss computation, which led to slightly quicker and more stable learning;
  • implemented batch n-step AC (no visible improvement);
  • implemented PPO (also with eligibility traces), but this still requires fine-tuning;
  • applied maximum joint velocities when moving joints, which further stabilized simulation and learning;
  • even AC (optionally with eligibility traces) might work; however, it needs more time;
  • using RMSProp instead of Adam improved convergence (moment estimation in Adam seems to hinder convergence);
  • RMSProp also improved the learned weights of the meta-controller in the symbolic environment:
    • simpler rewards (goal: 1, invalid action: -1, other action: -0.1);
    • learning episodes are shorter;
    • weight regularization abs(m)*(1-abs(m)), where m is the scaled weight, works (with a single AND); a sketch of this term follows this list;
    • using multiple ANDs before an OR still does not result in learning conditions for different situations;
  • mostly finished describing the standard RL setting in the written thesis.
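
A minimal sketch of the regularization term referenced above: for scaled weights m in [-1, 1], the penalty abs(m)*(1-abs(m)) vanishes at -1, 0, and 1 and peaks in between, pushing every weight toward a crisp "required", "negated", or "irrelevant" decision (the summation and the tanh scaling are assumptions matching the earlier AND sketch):

  import torch


  def crispness_penalty(raw_weights):
      """abs(m) * (1 - abs(m)) summed over all scaled weights m = tanh(raw_weights)."""
      m = torch.tanh(raw_weights)
      return (m.abs() * (1 - m.abs())).sum()

Such a term would typically be added to the policy loss with a small coefficient.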

WS 2024/2025 - Week 9: 18.11. - 24.11.

This week we:
  • added the states of joints as an input for the controller, improving convergence;
  • made minor changes to the description of the standard RL setting;
  • wrote almost the entire section on Hierarchical Reinforcement Learning.

WS 2024/2025 - Week 10: 25.11. - 1.12.

This week we:
  • created figures of the agent-environment interaction in the different settings;
  • changed the input to the controller: the positions of the blocks are now relative to the holding position of the hand (the agent learns a bit more easily);
  • wrote parts on value-based and policy-based solution methods;

WS 2024/2025 - Week 11: 2.12. - 8.12.

This week we:
  • implemented a symbolic-planning solver, whose outputs will serve as the optimal reference in our comparisons (a sketch follows this list);
  • experimented with longer learning periods and with robot actions being delta movements instead of target joint positions:
    • this led to shorter episodes and minor improvements over time;
  • finished the chapter on RL;
  • prepared the figures and the table of the results from the symbolic environment.
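
A minimal sketch of such a solver as a breadth-first forward search over symbolic states (so the returned plans are shortest); our actual solver may be implemented differently, and the apply_action interface is an assumption matching the symbolic-environment sketch above:

  from collections import deque


  def plan(initial_facts, goal_facts, actions, apply_action):
      """Shortest plan from initial_facts to any state containing goal_facts.

      apply_action(state, action) returns the next state, or None if the action
      is invalid in the given state.
      """
      start, goal = frozenset(initial_facts), frozenset(goal_facts)
      parents = {start: None}          # state -> (previous state, action taken)
      queue = deque([start])
      while queue:
          state = queue.popleft()
          if goal <= state:
              # Reconstruct the plan by walking back through the parents.
              steps = []
              while parents[state] is not None:
                  state, action = parents[state]
                  steps.append(action)
              return list(reversed(steps))
          for action in actions:
              nxt = apply_action(state, action)
              if nxt is not None and frozenset(nxt) not in parents:
                  parents[frozenset(nxt)] = (state, action)
                  queue.append(frozenset(nxt))
      return None   # no plan exists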

WS 2024/2025 - Week 12: 9.12. - 15.12.

This week we:
  • discussed the results from the symbolic environment we have in the written thesis;
  • captured video recordings of the agent's behavior in the physical environment when trying to reach different blocks;
  • prepared the end-of-semester presentation of our progress.