2026/06/30 (Tue) 16:00-17:00 JST

[AI Security and Privacy Team Seminar] Talk by Eric Wong

Name: [AI Security and Privacy Team Seminar] Talk by Eric Wong
Start: 2026-06-30T07:00:00+00:00
End: 2026-06-30T08:00:00+00:00

Online

Security

Event description

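Prof. Eric Wong of the University of Pennsylvania, known for many notable contributions to AI security such as JailbreakBench and provable defenses against adversarial examples, will give a talk at the Ookayama Campus of the Institute of Science Tokyo. Please also consider attending on site.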

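Talk title: Understanding Safety & Alignment with Mechanistic Theory
Speaker: Eric Wong (University of Pennsylvania)
Date and time: 6/30 (Tue) 16:00-17:00 JST
Venue: Institute of Science Tokyo, Ookayama Campus, West Building 8E, 10F, Department Meeting Room (1004), and online (Zoom)
Zoom link: The URL is shown only to registered attendees.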

Abstract: Why are LLM guardrails fundamentally so easily broken, and how can we enforce them? This talk formalizes a mechanistic theory for studying safety problems. We begin with one-layer transformers, identifying rule-breaking as an inherent architectural vulnerability in the model's attention mechanism. This mechanistic theory framework (LogicBreaks) taught us a critical lesson: if attention is the key to breaking rules, it may also be the key to enforcing them. Building on this insight, we expand the mechanistic theory to analyze attention-based interventions, arriving at InstaBoost: an incredibly simple yet highly effective steering method that boosts the model's attention on user-provided instructions during generation. This technique, developed from analysis of one-layer transformers, provides state-of-the-art control over large-scale LLMs with just five lines of code.

Bio: Eric Wong is an assistant professor in the Department of Computer and Information Science at the University of Pennsylvania. He leads Brachio Lab, which works on debugging machine learning and making systems actually do what we want them to do. He is also part of the ASSET Center on safe, explainable, and trustworthy AI systems. Previously, he completed his PhD at CMU, advised by Zico Kolter, and did a postdoc with Aleksander Madry.