Research ideas

Ideas for research. Free for a good home.
Categories: fun

Author: Yuxi Liu

Published: June 11, 2024

Modified: June 11, 2024

Statistics

Correlation gradient ascends to causation

The idea is that by gradient-ascending correlation, we would eventually arrive at a high-correlation pair of variables, which should be quite causally related. Stated this naively, the claim is false: if the underlying structure is a complex causal diagram, then a local maximum of \(r_{X,Y}\) over \(Y\) might be a variable far downstream of \(X\), connected to it only by many weak paths.
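To make this failure mode concrete, here is a minimal simulation sketch (the coefficients and the shape of the graph are my own toy assumptions, not from any real dataset): \(Y\) sits two causal steps downstream of \(X\), reachable only through ten weak paths, yet it ends up more correlated with \(X\) than any of the direct children of \(X\).

```python
import numpy as np

# Toy causal diagram: X -> Z_1..Z_10 (weak edges) -> Y.
# Each path X -> Z_i -> Y is weak, but the ten paths add up,
# so corr(X, Y) exceeds corr(X, Z_i) even though Y is further downstream.
rng = np.random.default_rng(0)
n, k, a = 100_000, 10, 0.3      # samples, number of weak paths, edge strength

X = rng.normal(size=n)
Z = a * X[:, None] + rng.normal(size=(n, k))   # ten weak children of X
Y = Z.sum(axis=1) + rng.normal(size=n)         # Y aggregates the weak paths

corr_XZ = max(np.corrcoef(X, Z[:, i])[0, 1] for i in range(k))
corr_XY = np.corrcoef(X, Y)[0, 1]

print(f"max corr(X, Z_i) ~ {corr_XZ:.2f}")   # about 0.29
print(f"    corr(X, Y)   ~ {corr_XY:.2f}")   # about 0.67
```

So a greedy ascent on correlation starting from \(X\) would walk right past the direct causes and settle on the downstream aggregate \(Y\).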

However, there was a case report that something like this works in biochemistry, where the following sequence was used to discover how to chemically induce meiosis (Meiosis is all you need, Metacelsus 2022):

  1. Take a diploid cell line (probably ESC or iPSC or PGCLC)
  2. Induce meiosis and form many haploid cell lines.
  3. Genotype the haploid lines and select the best ones.
  4. Fuse two haploid cells to re-generate a diploid cell line.
  5. Repeat as desired. At the end, either differentiate the cells into oocytes or perform nuclear transfer into a donor oocyte.

What I think might work is to find \(X, Y\) such that \(\nabla_X r_{X,Y} = 0\) and \(\nabla_Y r_{X,Y} = 0\), where the gradient \(\nabla\) does not literally mean \(d/dx\), but rather measures what happens when we move from \(X\) to an adjacent variable. However, what does “adjacent” mean? We cannot say that “adjacent” means “directly connected on the causal graph”, because if we already knew the causal graph, the problem would be solved!
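As a strawman, here is a sketch of the stationarity check, where “adjacent” is replaced by a placeholder of my own (the few most-correlated other variables), precisely because the real definition is the open problem:

```python
import numpy as np

def local_max_pairs(data: np.ndarray, n_neighbors: int = 3):
    """Return pairs (i, j) that are local maxima of |corr| under
    single-coordinate moves to "adjacent" variables. Adjacency here is a
    placeholder (the n_neighbors most-correlated other variables); what
    adjacency should really mean is exactly the unresolved question."""
    corr = np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    # neighbors[i] = indices of the variables most correlated with variable i
    neighbors = np.argsort(-corr, axis=1)[:, :n_neighbors]

    pairs = []
    d = data.shape[1]
    for i in range(d):
        for j in range(i + 1, d):
            # "gradient zero" in the first coordinate: no adjacent i' improves the pair
            stationary_i = all(corr[ip, j] <= corr[i, j] for ip in neighbors[i])
            # "gradient zero" in the second coordinate: no adjacent j' improves the pair
            stationary_j = all(corr[i, jp] <= corr[i, j] for jp in neighbors[j])
            if stationary_i and stationary_j:
                pairs.append((i, j, corr[i, j]))
    return pairs
```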

AI

Search-aware training

When one does “tree of thought” with an LLM such as Llama 3, the model performs worse than it could, because during training it was “unaware” that it would be used in tree searches at test time. This is a case of train-test mismatch. If it were also used in tree searches during training, it should do much better during test-time tree search.
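A minimal sketch of what search-aware training could look like, in the style of expert iteration: sample several continuations (a one-level “tree”), keep the branches the search would select, and fine-tune on them, so the model sees search-shaped trajectories during training. This is my own illustration, not a worked-out method; the model name and the `score` function are stand-ins.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                        # stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def score(text: str) -> float:
    """Hypothetical verifier/reward; in practice a task-specific checker."""
    return -len(text)                      # placeholder: prefer short answers

def search_then_train(prompt: str, width: int = 8, keep: int = 2):
    # 1. Test-time-style search: sample several continuations of the prompt.
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        outs = model.generate(**inputs, do_sample=True, max_new_tokens=64,
                              num_return_sequences=width,
                              pad_token_id=tok.eos_token_id)
    texts = [tok.decode(o, skip_special_tokens=True) for o in outs]

    # 2. Keep the branches the search itself would have selected.
    best = sorted(texts, key=score, reverse=True)[:keep]

    # 3. Fine-tune on those trajectories, so training matches test-time search.
    for text in best:
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
    opt.step()
    opt.zero_grad()
```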

Intuition for the mismatch: an LLM trained only to predict the next token on the training corpus will have difficulty planning over multiple rollouts, because it has only ever played one-rollout games of language generation.

Sometimes it is very valuable to go through many rollouts being wrong, just to learn exactly why they are wrong, so that one can avoid them. But an LLM trained for one-rollout language generation is trained not to do that. It is “YOLO” (you only language once) in that sense. YOLO leads to conservatism and exploitation, not exploration.

Note: It may still learn multi-rollout behavior through a “side-channel attack”, much as LLMs managed to learn to spell despite using tokenizers (Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens), by implicitly doing cryptanalysis to break the substitution cipher of the tokenizer, but that is obviously very inefficient.