Deception Detection Research Lightning Talks

Apart - Safe AI

54 years ago

69 views

⚡ Lightning talks from the placing submissions of the Deception Detection Apart Hackathon, hosted by Apart Research [https://apartresearch.com/] and Apollo Research [https://www.apolloresearch.ai] in June 2024.

Learn more about the hackathon ⭢ https://apartresearch.com/event/deception

_Moderated and organized by Esben Kran and Apart Research._

*━━━━━ Placing Submissions ━━━━━*
1️⃣ Sandbag Detection through Model Degradation ⭢ https://www.apartresearch.com/project/sandbag-detection-through-model-degradation
2️⃣ Detecting and Controlling Deceptive Representation in LLMs with Representational Engineering ⭢ https://www.apartresearch.com/project/detecting-and-controlling-deceptive-representation-in-llms-with-representational-engineering
3️⃣ Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach ⭢ https://www.apartresearch.com/project/detecting-deception-in-gpt-3-5-turbo-a-metadata-based-approach
4️⃣ Modelling the oversight of automated interpretability against deceptive agents on sparse autoencoders ⭢ https://www.apartresearch.com/project/modelling-the-oversight-of-automated-interpretability-against-deceptive-agents-on-sparse-autoencoders

*━━━━━ Chapters ━━━━━*
00:00 - Intro
03:43 - Sparse autoencoder interpretability oversight
16:20 - Sparse autoencoder interpretability oversight | Questions
17:41 - Metadata based deception detection
30:19 - Metadata based deception detection | Questions
32:16 - Representation engineering for deception
40:34 - Representation engineering for deception | Questions
41:45 - Sandbag detection through model degradation
50:48 - Sandbag detection through model degradation | Questions
54:46 - Honorable Mentions
57:23 - Next steps

*━━━━━ Apart Links ━━━━━*
Learn more about Apart ⭢ https://www.apartresearch.com
Join future hackathons and sprints ⭢ https://apartresearch.com/sprints
Connect with us on Discord ⭢ https://discord.gg/dYUWDm7Ben
Check out potential AI safety projects ⭢ https://aisafetyideas.com
Stay up-to-date on Google Calendar ⭢ https://calendar.google.com/calendar/embed?src=f5bbc369a41ff892f9e919bc0ed2ae64f90c2ec533d5a16a1cd268f553ba10ec%40group.calendar.google.com
Be on the ball with iCal (.ics format) ⭢ https://calendar.google.com/calendar/ical/f5bbc369a41ff892f9e919bc0ed2ae64f90c2ec533d5a16a1cd268f553ba10ec%40group.calendar.google.com/public/basic.ics
Follow on Twitter ⭢ https://twitter.com/apartresearch
Explore code on GitHub ⭢ https://github.com/apartresearch
Get professional on LinkedIn ⭢ https://www.linkedin.com/company/apartresearch

Tags:

#AI #Artificial_Intelligence #Research #Safety #Hackathon #Awards #Lightning_Talks #Deception