TECH PLAY

2026/06/30 (Tue) 16:00 - 17:00

[AI Security and Privacy Team Seminar] Talk by Eric Wong

Online

Event details

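Prof. Eric Wong of the University of Pennsylvania, known for numerous notable contributions to AI security such as JailbreakBench and provable defenses against adversarial examples, will give a talk at the Ookayama Campus of Institute of Science Tokyo. Please consider attending on site as well.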

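Talk title: Understanding Safety & Alignment with Mechanistic Theory
Speaker: Eric Wong (University of Pennsylvania)
Date and time: 6/30 (Tue) 16:00-17:00
Venue: Institute of Science Tokyo, Ookayama Campus, West Building 8E, 10F, Department Meeting Room (1004), and online (Zoom)
Zoom link: The URL is shown only to registered participants.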

Abstract: Why are LLM guardrails fundamentally so easily broken, and how can we enforce them? This talk formalizes a mechanistic theory for studying safety problems. We begin with one-layer transformers, identifying rule-breaking as an inherent architectural vulnerability in the model's attention mechanism. This mechanistic theory framework (LogicBreaks) taught us a critical lesson: if attention is the key to breaking rules, it may also be the key to enforcing them. Building upon this insight, we expand the mechanistic theory to analyze attention-based interventions, arriving at InstaBoost: an incredibly simple yet highly effective steering method that boosts the model's attention on user-provided instructions during generation. This technique, developed from analysis on one-layer transformers, provides state-of-the-art control over large-scale LLMs with just five lines of code.

Profile: Eric Wong is an assistant professor in the Department of Computer and Information Science at the University of Pennsylvania. He leads the Brachio Lab on debugging machine learning and making systems actually do what we want them to do. He is also part of the ASSET Center on safe, explainable, and trustworthy AI systems. Previously, he completed his PhD at CMU, advised by Zico Kolter, and did a postdoc with Aleksander Madry.

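Notes

※ This event information was obtained from an external site.

※ Depending on when it was posted and how often it is updated, the content may differ from the source page. Thank you for your understanding.

※ Please use the source page to check the latest information, register for the event, and make any inquiries about it.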

Doorkeeper