Tomek Korbak
Other names: Tomasz Korbak
UK AI Safety Institute
Verified email at dsit.gov.uk
Title | Cited by | Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
Cited by: 297 | Year: 2023
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ...
arXiv preprint arXiv:2309.12288, 2023
Cited by: 158* | Year: 2023
Pretraining language models with human preferences
T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ...
International Conference on Machine Learning, 17506-17533, 2023
Cited by: 152 | Year: 2023
Towards understanding sycophancy in language models
M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ...
arXiv preprint arXiv:2310.13548, 2023
Cited by: 96 | Year: 2023
Inverse scaling: When bigger isn't better
IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ...
arXiv preprint arXiv:2306.09479, 2023
Cited by: 96* | Year: 2023
Training language models with language feedback at scale
J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez
arXiv preprint arXiv:2303.16755, 2023
Cited by: 78 | Year: 2023
Improving code generation by training with natural language feedback
A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ...
arXiv preprint arXiv:2303.16749, 2023
Cited by: 51 | Year: 2023
Aligning language models with preferences through f-divergence minimization
D Go, T Korbak, G Kruszewski, J Rozen, N Ryu, M Dymetman
arXiv preprint arXiv:2302.08215, 2023
Cited by: 48 | Year: 2023
Foundational challenges in assuring alignment and safety of large language models
U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ...
arXiv preprint arXiv:2404.09932, 2024
Cited by: 47 | Year: 2024
RL with KL penalties is better viewed as Bayesian inference
T Korbak, E Perez, CL Buckley
arXiv preprint arXiv:2205.11275, 2022
Cited by: 40 | Year: 2022
On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting
T Korbak, H Elsahar, G Kruszewski, M Dymetman
Advances in Neural Information Processing Systems 35, 16203-16220, 2022
Cited by: 39 | Year: 2022
Taken out of context: On measuring situational awareness in LLMs
L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ...
arXiv preprint arXiv:2309.00667, 2023
Cited by: 34* | Year: 2023
Computational enactivism under the free energy principle
T Korbak
Synthese 198 (3), 2743-2763, 2021
Cited by: 33 | Year: 2021
Many-shot jailbreaking
C Anil, E Durmus, M Sharma, J Benton, S Kundu, J Batson, N Rimsky, ...
Anthropic, April, 2024
Cited by: 32* | Year: 2024
Controlling conditional language models without catastrophic forgetting
T Korbak, H Elsahar, G Kruszewski, M Dymetman
International Conference on Machine Learning, 11499-11528, 2022
Cited by: 31 | Year: 2022
Interaction history as a source of compositionality in emergent communication
T Korbak, J Zubek, Ł Kuciński, P Miłoś, J Rączaszek-Leonardi
Interaction Studies 22 (2), 212-243, 2021
Cited by: 19* | Year: 2021
Catalytic role of noise and necessity of inductive biases in the emergence of compositional communication
Ł Kuciński, T Korbak, P Kołodziej, P Miłoś
Advances in Neural Information Processing Systems 34, 23075-23088, 2021
Cited by: 15 | Year: 2021
Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data
M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, ...
arXiv preprint arXiv:2404.01413, 2024
Cited by: 13 | Year: 2024
Measuring non-trivial compositionality in emergent communication
T Korbak, J Zubek, J Rączaszek-Leonardi
arXiv preprint arXiv:2010.15058, 2020
Cited by: 10 | Year: 2020
Scaffolded minds and the evolution of content in signaling pathways
T Korbak
Studies in Logic, Grammar and Rhetoric 41 (1), 89-103, 2015
Cited by: 10 | Year: 2015
Articles 1–20