Rohan Paul @rohanpaul_ai · 6 days ago · 59
New Harvard Business Review article.
AI is now breaking hiring at both ends, with résumés becoming easier to fake and remote interviews becoming easier to script live.
Hiring systems now reward people who can perform the hiring process, not always people who can do the work.
The old résumé signal is weakening because candidates can generate polished, keyword-heavy applications in minutes, while AI screeners may favor text that looks like AI output, with one cited study finding 23% to 60% higher shortlisting for model-like résumés.
Remote first-round interviews are also losing trust because live AI assistants can suggest answers during calls, especially for predictable behavioral questions like conflict stories, motivation answers, and rehearsed career narratives.
The damage is not only false positives, where weak candidates look strong, but false negatives, where unconventional candidates never get seen because their documents are less optimized than their thinking.
The authors propose replacing predictable first-round questions with live work-simulation prompts where the interviewer changes the facts mid-answer, asks the candidate to defend tradeoffs, and checks whether their reasoning stays coherent.
A practical version is: give a messy job-relevant scenario, ask for a decision, then add a surprise constraint or contradiction and make the candidate revise their answer out loud.
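Translated summary: A new Harvard Business Review article argues that AI is breaking hiring at both ends: résumés are easier to fake, and remote interviews are easier to answer from a live script. The old résumé signal is failing because candidates can use AI to quickly generate keyword-rich applications, while AI screeners tend to favor AI-style résumés; a cited study found 23% to 60% higher shortlisting rates for such résumés. In first-round remote interviews, AI assistants can supply answers in real time, especially for predictable behavioral questions such as conflict handling and motivation. The harm is not only weak candidates being mistaken for strong ones (false positives) but also unconventional candidates being overlooked entirely because their résumés are not optimized (false negatives). The proposed fix is to replace predictable questions with live work simulations: the interviewer changes the facts mid-answer and asks the candidate to explain tradeoffs while keeping their reasoning coherent. A practical version: present a messy work scenario, ask for a decision, then add a surprise constraint or contradiction and have the candidate revise the answer on the spot.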
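karminski-牙医 @karminski3 · 6 days ago · 70
How to train a small model in 10 seconds!
Here is how to train a small model (read: an electronic parrot) from scratch! (No guarantee it comes out any good... runs away...) It takes only 10 seconds! No environment setup at all! The whole thing trains in the web page!
First you need a Mac. I tried an NVIDIA card and it should work too, but support seems a bit off (my 3080Ti failed with an unsupported WarpSize), so I still recommend training on an Apple Silicon (M1-M5) Mac.
Then use llmistanbul and just drag your plain-text documents in. Try not to include odd formats such as markdown or json, otherwise the output will be very strange. I dragged in Harry Potter 1-7 (note: nobody minds if you train privately for research, but do not train on someone else's work and then publish the result; please respect the original author).
Then follow my figure 1:
Translated summary: The post explains how to use llmistanbul to train a small model (an electronic parrot) in a web page in 10 seconds. Just drag in plain-text documents (such as Harry Potter 1-7); an Apple Silicon Mac (M1-M5) is recommended, and formats like markdown/json should be avoided. NVIDIA cards (3080Ti) are poorly supported. It reminds readers to respect copyright and not to publicly release other people's work.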
Rohan Paul @rohanpaul_ai · 6 days ago · 70
New Anthropic research shows AI agents may look brilliant at code, but in biology they can fail before the science starts.
Strong AI agents could give very different answers to the exact same biology data request, even when nothing changed in the prompt.
In one Ebola sequence task, Claude Sonnet 4 returned 106 sequences in 1 run, then 15, then 5, while the expected answer was 266.
Those missing sequences did not just make the dataset messy, they changed the scientific story built on top of it.
One bad retrieval made the outbreak look like it traced back to 1922, instead of the manually curated result pointing to early 2014.
The biology databases were too hard to use reliably through current AI tools.
The agents often understood what they were being asked, but their answers varied a lot because they had to fight through scattered databases, hidden website rules, and fragile scripts.
The key finding is that adding a repeatable retrieval tool made agents far more accurate and much more consistent.
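To make the scale of the inconsistency concrete, the run counts quoted above can be turned into coverage figures. This is only a back-of-the-envelope sketch: it assumes every returned sequence was a correct one, so each number is an upper bound on recall, and the names `EXPECTED`, `RUN_COUNTS`, and `coverage` are illustrative rather than taken from the research.

```python
# Sequence counts quoted in the post for the Ebola retrieval task.
EXPECTED = 266
RUN_COUNTS = [106, 15, 5]


def coverage(returned: int, expected: int) -> float:
    """Upper bound on recall: assumes every returned sequence is a correct one."""
    return returned / expected


coverages = [coverage(n, EXPECTED) for n in RUN_COUNTS]

# Ratio between the largest and smallest run, a crude measure of run-to-run spread.
spread = max(RUN_COUNTS) / min(RUN_COUNTS)

for n, c in zip(RUN_COUNTS, coverages):
    print(f"{n:>3} of {EXPECTED} sequences -> at most {c:.1%} coverage")
print(f"largest run returned {spread:.1f}x as many sequences as the smallest")
```

Even the best of the three runs covers at most about 40% of the expected set, and the largest run returned roughly 21 times as many sequences as the smallest.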
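Translated summary: Anthropic research finds that AI agents perform well on coding tasks but fail easily at biology database retrieval. In an Ebola sequence task, Claude Sonnet 4 returned 106, 15, and 5 sequences across three runs, against an expected 266. The missing sequences shifted the scientific conclusion badly: the agent inferred that the outbreak traced back to 1922, while the manually curated result pointed to early 2014. The root causes are scattered biology databases, hidden website rules, and fragile scripts. Adding a repeatable retrieval tool greatly improved the agents' accuracy and consistency. Anthropic calls for building friendlier infrastructure.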
Rohan Paul @rohanpaul_ai · 6 days ago · 50
A new US bill could ban some Chinese robots from America.
The GUARD Act would force security agencies to review robots from China and other adversary countries, then place risky systems on the FCC’s Covered List, the same kind of restriction used against companies such as Huawei and ZTE.
The fear is not only that a robot has cameras, microphones, sensors, maps, motors, and wireless links, but that the whole machine becomes a moving computer inside factories, labs, homes, and police departments.
A separate Schumer-Cotton bill would stop federal agencies from buying or using Chinese humanoid robots, with exceptions for controlled military or law-enforcement research.
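Translated summary: The newly proposed US GUARD Act would require security agencies to review robots made in China and other adversary countries and place high-risk systems on the FCC's Covered List, similar to the earlier restrictions on Huawei and ZTE. Lawmakers worry that a robot not only carries cameras, microphones, sensors, maps, motors, and wireless links but also becomes a computer that moves around inside factories, labs, homes, and police departments. A separate Schumer-Cotton bill would bar federal agencies from buying or using Chinese humanoid robots, with exceptions for controlled military or law-enforcement research.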
Rohan Paul @rohanpaul_ai · 6 days ago · 65
An AI agent can get better at long tasks without retraining the agent itself, by using a separate small model to clean and organize its context.
The approach moves context management outside the agent, so a separate helper can clean up the task history while the main agent stays unchanged.
The paper proposes AdaCoM, which is a separate LLM that edits the agent’s working context before the agent takes its next step.
AdaCoM places a separate, trained manager between the task history and the frozen agent, so the agent does not need to learn a new memory habit or expose its weights.
Before each step, this manager can rewrite, merge, prune, or preserve parts of the running context, then the original agent acts on the cleaned version.
That sounds like summarization, but the distinction matters.
A summary assumes the right answer is compression, while AdaCoM learns that different agents need different kinds of context to stay competent, because stronger agents can use more raw history while weaker agents need shorter and cleaner notes.
They tested AdaCoM on web search and deep research tasks across several agents, and it improved average web search performance by 39%.
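The manager-in-the-loop arrangement described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the paper's method: AdaCoM's manager is a trained LLM, whereas `rule_based_manager` below is a hand-written stand-in, and every name here (`Entry`, `pinned`, `run`, `echo_agent`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Entry:
    step: int
    text: str
    pinned: bool = False  # hypothetical flag: entries the manager must preserve


Context = List[Entry]
Manager = Callable[[Context], Context]
Agent = Callable[[Context], str]


def rule_based_manager(budget: int) -> Manager:
    """Stand-in for the trained manager (ASSUMPTION: the real one is a learned LLM).

    Preserves pinned entries, keeps the newest `budget` unpinned entries verbatim,
    and merges older unpinned entries into a single truncated note.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")

    def manage(context: Context) -> Context:
        pinned = [e for e in context if e.pinned]
        rest = [e for e in context if not e.pinned]
        old, recent = rest[:-budget], rest[-budget:]
        merged: Context = []
        if old:
            note = " | ".join(e.text[:20] for e in old)  # crude rewrite + merge
            merged = [Entry(step=old[-1].step, text=f"[merged] {note}")]
        return sorted(pinned + merged + recent, key=lambda e: e.step)

    return manage


def run(agent: Agent, manager: Manager, task: str, steps: int) -> Context:
    """Run a frozen agent; the manager edits the context before every step."""
    context: Context = [Entry(step=0, text=task, pinned=True)]
    for step in range(1, steps + 1):
        cleaned = manager(context)   # manager sits between history and agent
        action = agent(cleaned)      # the agent only ever sees the cleaned view
        context = cleaned + [Entry(step=step, text=action)]
    return context


def echo_agent(context: Context) -> str:
    """Toy frozen agent: reports how many context entries it was shown."""
    return f"saw {len(context)} entries"


final = run(echo_agent, rule_based_manager(budget=2), task="find the paper", steps=5)
for entry in final:
    print(entry.step, entry.text)
```

The point of the structure is that `run` never modifies the agent: swapping in a different manager (or a different `budget` for a stronger or weaker agent) changes what the agent sees without touching the agent itself.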
----
Link – arxiv.org/abs/2605.30785
Title: "Learning Agent-Compatible Context Management for Long-Horizon Tasks"
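Translated summary: The paper proposes AdaCoM, a separate LLM that edits the agent's working context before each step. It can rewrite, merge, prune, or preserve the task history, so the main agent stays frozen and needs no retraining or weight exposure. Unlike simple summarization, AdaCoM learns that different agents need different kinds of context: stronger agents keep more raw history, while weaker agents need shorter, cleaner notes. Tested on web search and deep research tasks, it delivers an average improvement of 39%.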
swyx @swyx · 6 days ago · 62
It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.
Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks for the reward hacking plaguing other benchmarks.
FC Diamond is so hard that Opus 4.8 scores 13.8%.
Three eras of AI coding : Three eras of benchmarks
2021 • Autocomplete : HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode
To me the most beautiful chart came when I requested a special historical run across all extant old models: the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025. Opus almost doubled from a 41% pass rate to 74% in 4 months.
This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
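The reroll arithmetic behind that claim can be checked directly. The sketch below assumes each attempt is an independent trial at the quoted pass rate (a simplification, since real reruns of the same task are correlated), and the function names are illustrative.

```python
import math


def success_within(pass_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - pass_rate) ** attempts


def attempts_needed(pass_rate: float, target: float) -> int:
    """Smallest number of independent tries whose success chance reaches `target`."""
    if not (0.0 < pass_rate < 1.0 and 0.0 < target < 1.0):
        raise ValueError("pass_rate and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_rate))


# Pass rates quoted in the post for the easiest third of FC tasks.
for rate in (0.41, 0.74):
    print(
        f"pass rate {rate:.0%}: "
        f"2 tries -> {success_within(rate, 2):.1%}, "
        f"6 tries -> {success_within(rate, 6):.1%}, "
        f"tries for 95% -> {attempts_needed(rate, 0.95)}"
    )
```

Under this independence assumption, six tries at a 41% pass rate give about a 96% chance of at least one success, while two tries at 74% give about 93%, which is close to the post's rough "95% in 2 rerolls vs 6" (a strict 95% target at 74% needs a third try).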
My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....
The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
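Translated summary: Cognition has released the FrontierCode coding evaluation, in which each task took top open-source maintainers 40+ hours to write. METR found that more than half of SWEBench results are unmergeable slop. FrontierCode includes 3000+ rubrics and is the first to measure whether code is mergeable. On the hardest tier, FC Diamond, Opus 4.8 scores only 13.8%. On the easiest FC Extended tasks, Opus rose from 41% to 74% within 4 months in late 2025, marking AI coding's entry into the "maintainable code" era.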
Chubby♨️ @kimmonismus · 6 days ago · 74
Apple makes it clear that Europe is itself to blame for the unavailability of Apple Intelligence.
Apple says Siri AI won’t launch on iPhone and iPad in the EU because regulators interpret the DMA as requiring Apple to give rival AI assistants broad access to private user data and control over apps.
Apple argues this would create major privacy and security risks, and says the European Commission rejected its proposed safeguards, including a "Trusted System Agent" and an 18-month rollout plan. As a result, there is currently no timeline for Siri AI on iOS and iPadOS in the EU.
“We’re deeply disappointed that our EU users won’t have Siri AI on iPhone or iPad when we share our new software releases later this year,” said Craig Federighi.
(...)
"However, given the clear dangers to EU users and the regulators’ failure to acknowledge these risks, there is currently no timeline for Siri AI’s availability in the EU on iOS and iPadOS."
EU overregulated itself. Again.
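Translated summary: Apple says the new Siri AI (iPhone 17 Pro only) will not launch on iPhone and iPad in the EU, because the DMA is being interpreted as requiring Apple to open user data and app control to rival AI assistants, which Apple believes would create privacy and security risks. Apple's proposed "Trusted System Agent" and 18-month rollout plan were both rejected by the European Commission, and there is currently no launch timeline. Craig Federighi, Apple's senior vice president of software engineering, said he was "deeply disappointed."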
Artificial Analysis @ArtificialAnlys · 6 days ago · 68
Grok debuts grok-imagine-video-1.5-preview, achieving #2 in Image to Video (With Audio) in the Artificial Analysis Video Arena, behind only ByteDance's Seedance 2.0!
grok-imagine-video-1.5-preview is @xAI's latest video generation model, currently supporting only Image to Video with native audio, and durations up to 15s. It ranks #2 in the Image to Video (With Audio) Leaderboard, trailing only ByteDance's Seedance 2.0. In the Without Audio Leaderboard it places #3, behind Seedance 2.0 and xAI's own grok-imagine-video, which it performs very closely to.
grok-imagine-video-1.5-preview costs $8.40 per minute of generated video, and is available now via xAI's API, with a broader rollout across the Grok app and X in progress.
Congratulations to @xAI and @elonmusk on the release!
See below for comparisons between grok-imagine-video-1.5-preview and other leading models in the Artificial Analysis Video Arena 🧵
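Translated summary: xAI has launched the video generation model grok-imagine-video-1.5-preview, which currently ranks #2 on the Image to Video (With Audio) leaderboard in the Artificial Analysis Video Arena, behind only ByteDance's Seedance 2.0. The model supports image-to-video with natively generated audio and clips of up to 15 seconds. On the Without Audio leaderboard it ranks #3, just behind Seedance 2.0 and xAI's own grok-imagine-video. It is priced at $8.40 per minute of video, is available now through the xAI API, and will roll out gradually to the Grok app and X.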
DogeDesigner @cb_doge · 6 days ago · 40
NEWS: Florida Attorney General James Uthmeier has filed a major civil lawsuit against OpenAI and Sam Altman. The lawsuit claims ChatGPT encourages violence and deceives parents about safety.
The suit accuses ChatGPT of endangering children, encouraging violence and self-harm, and lying to parents about how safe the product actually is.
• ChatGPT is accused of acting as a “suicide coach” to a 16-year-old boy
• It allegedly helped the Florida State University shooter plan his attack
• The company prioritized rapid growth and profits over real safety measures
• Sam Altman is personally named for approving dangerous features
• Florida launched a criminal investigation into OpenAI’s role in the FSU shooting
This is especially dangerous for teenagers whose brains are still developing. Instead of building strong safety systems, the company allegedly rushed features to grow faster and make more money. Sam Altman is being held personally responsible for these decisions.
According to the Florida Attorney General, the truth is very different from what parents were told: ChatGPT allegedly helped with suicide planning and gave advice that assisted a mass shooter.
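Translated summary: Florida Attorney General James Uthmeier has filed a major civil lawsuit against OpenAI and CEO Sam Altman, alleging that ChatGPT encourages violence, deceives parents, acted as a "suicide coach" that led a 16-year-old boy toward self-harm, and helped the Florida State University shooter plan his attack. The complaint says the company neglected safety measures in pursuit of rapid growth, and Altman is named personally for approving dangerous features. Florida has opened a criminal investigation into OpenAI's role in the FSU shooting.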
Artificial Analysis @ArtificialAnlys · 6 days ago · 59
MiniMax-M3 scores 55 on the Artificial Analysis Intelligence Index. Once the weights are released, it will be the leading open weights model.
M3 is @MiniMax_AI's first multimodal M-series model, adding image and video input and a 1M token context window over the text-only MiniMax-M2.7 (50). At 55 on the Intelligence Index it sits just ahead of open weights peers Kimi K2.6 (54) and MiMo-V2.5-Pro (54). MiniMax has noted they plan to release the weights within ~10 days. When MiniMax released the weights for M2.7, it was under a commercially restricted license.
Key takeaways:
➤ MiniMax-M3 improves on MiniMax-M2.7 across most evaluations. HLE +9 points (28% to 37%), GPQA Diamond +6 (87% to 93%), AA-LCR +5 (69% to 74%), IFBench +7 (76% to 83%), and CritPt +3 (1% to 4%), with a small regression on SciCode (47% to 45%)
➤ M3 scores ~1670 on GDPval-AA, behind Claude Opus 4.8 (max, 1890) and GPT-5.5 (xhigh, 1769), and level with Claude Sonnet 4.6 (max, 1676). GDPval-AA measures real-world tasks across 44 occupations and 9 industries
➤ Native multimodality, scoring ~80% on MMMU-Pro. Level with GPT-5.5 (xhigh, 79.9%) and Kimi K2.6 (79.4%), behind Gemini 3.5 Flash (high, 84.3%). Not all open weights models support native vision input
➤ On AA-Omniscience, heavy abstention drives both low hallucination and low accuracy. M3 attempts only 30.9% of questions, the lowest among current peers, yielding a low hallucination rate (16.1%) and low accuracy (15.0%)
➤ MiniMax-M3's token usage is close to M2.7's, using ~91M output tokens to run the Intelligence Index (~81M reasoning) versus ~87M (~79M reasoning), while scoring 5 points higher
Key model details:
➤ Context window: 1M tokens, up from MiniMax-M2.7's 200K
➤ Pricing: $0.30/$1.20 per 1M input/output tokens up to 512K context, rising to $0.60/$2.40 for 512K to 1M context
➤ Weights: Not yet released. MiniMax has stated the weights will follow
➤ Availability: MiniMax first-party API, @SiliconFlowAI, @gmi_cloud, and @novita_labs
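The tiered pricing above is easier to reason about with a small cost helper. This is a rough sketch only: the prices come from the post, but the rule that the tier is selected by the request's input length is an assumption (the post does not say how "context" is measured for billing), and `PRICE_TIERS` and `request_cost` are illustrative names, not part of any MiniMax API.

```python
# Prices quoted in the post, in USD per 1M tokens:
# (context limit in tokens, input price, output price).
PRICE_TIERS = [
    (512_000, 0.30, 1.20),
    (1_000_000, 0.60, 2.40),
]


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request.

    ASSUMPTION: the pricing tier is chosen by the request's input length.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    for limit, in_price, out_price in PRICE_TIERS:
        if input_tokens <= limit:
            return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    raise ValueError("input exceeds the 1M-token context window")


short = request_cost(100_000, 2_000)  # falls in the tier up to 512K
long = request_cost(800_000, 2_000)   # falls in the 512K to 1M tier
print(f"100K-token prompt: ${short:.4f}")
print(f"800K-token prompt: ${long:.4f}")
```

Because both rates double in the upper tier, a long-context request costs twice what the same token counts would cost at the base rate.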
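Translated summary: MiniMax has launched M3, its first multimodal M-series model, with image/video input and a 1M-token context window. It scores 55 on the Artificial Analysis Intelligence Index, ahead of open-weights peers Kimi K2.6 and MiMo-V2.5-Pro (both 54). Compared with its predecessor M2.7, HLE rises 9 points to 37% and GPQA Diamond rises 6 points to 93%, with gains on several other benchmarks. Its native multimodality scores about 80% on MMMU-Pro, level with GPT-5.5. Pricing is $0.30/$1.20 per 1M tokens (up to 512K context) and doubles from 512K to 1M. The weights are planned for open release within about 10 days.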
Chubby♨️ @kimmonismus · 6 days ago · 66
WWDC 2026: A brief assessment
WWDC26 was Tim Cook's last keynote before he hands the CEO role to John Ternus in September.
I've been waiting for WWDC 2026 for a long time. And somehow I got almost everything I wanted. But somehow I still expected more. Before I jump to conclusions, though, I should try everything out first.
Here's the first caveat: Apple Intelligence won't be rolled out in the EU initially. What a surprise. Not. The same disappointment every time.
Apple introduced "Siri AI," a full rebuild of the assistant that does the things the company first demoed in 2024 and then quietly pushed back twice. It reads what's on your screen, pulls context from your messages, mail and photos, and chains actions across apps. There's a standalone Siri app now, with a conversation history that syncs through iCloud, so it finally behaves like the chatbots people have spent three years getting used to.
Here's the part Apple said quietly and everyone else said loudly: the brains are Google's. Siri AI runs on Gemini under the multiyear deal the two companies announced in January. Reports put that deal at roughly a billion dollars a year for a custom large model. Apple paired it with its own on-device Foundation Models and wrapped the whole thing in a privacy story, with Craig Federighi insisting that privacy in AI is non-negotiable and that data is only used to execute your request.
The rest of Apple Intelligence is the steady stuff. Photos gets Spatial Reframing, which improves a photo's composition after it's been taken. Safari can monitor a page and notify you about restocks or price drops. Messages offers one-tap suggestions to create a reminder or note based on the conversation. Image Playground adds photorealistic generation and a "describe a change" edit mode. None of it makes headlines alone, but together it's Apple catching up to where the industry was a year ago.
Everything else was housekeeping, and some of it is genuinely good. Liquid Glass now has a slider that runs from ultra-clear to fully tinted. macOS 27, dubbed Golden Gate, brings back the uniform toolbars and edge-to-edge sidebars Mac users missed. Performance got real attention: apps launch up to 30 percent faster, AirDrop is up to 80 percent faster, and Apple retuned the CPU scheduler so older iPhones feel quicker.
Oh, and rebuilt search across Spotlight, Photos and Mail.
Oh, and for some reason almost no watchOS updates other than a few performance improvements. Disappointed (big Apple Watch fan tho)
tl;dr:
*Apple Intelligence & Siri AI*
- "Siri AI," an entirely new Siri across iPhone, iPad, Mac, Apple Watch and Vision Pro, built on a new privacy-focused architecture.
- Powered by Google Gemini (multiyear deal announced Jan 2026, reported at ~$1B/year for a custom model) combined with Apple's own on-device Foundation Models.
- On-screen awareness, personal-context search across messages/email/photos, systemwide app actions, and live web answers with world knowledge.
- A dedicated Siri app to revisit or start conversations, with history synced privately via iCloud.
- Adjustable pace, expressivity and accent for the conversational experience.
- Visual updates: Siri animation in the Dynamic Island; swipe down from mid-screen to launch Siri AI.
- Siri mode in the Camera app and expanded Visual Intelligence.
- Apple Intelligence in apps: Spatial Reframing in Photos, Safari "Notify Me" page monitoring, one-tap suggestions in Messages, photorealistic generation and "describe a change" editing in Image Playground, a new Top Hits ranking in Mail.
- Privacy framing front and center: data only used to execute the request, verifiable by outside experts.
*Availability & the regional catch*
- Developer betas today, public beta next month, free update this fall.
- AI features require iPhone 16 or later / iPhone 15 Pro, M1+ iPads and Macs, Vision Pro, Apple Watch Series 10+.
- Siri AI not in the EU on iOS/iPadOS at launch (Mac, Watch, Vision Pro yes), due to the DMA.
- No new Apple Intelligence features in China at launch, pending regulation.
- Image generation has daily limits; iCloud+ raises them.
*Design & performance*
- Liquid Glass personalization slider (ultra-clear to fully tinted), plus sharper app icons.
- macOS 27 "Golden Gate": uniform toolbars, edge-to-edge sidebars, colored sidebar icons, tighter corner radius.
- Apps up to 30% faster to launch, photos up to 70% faster to appear, AirDrop up to 80% faster, iPad external-drive transfers up to 5x faster; CPU scheduler retuned for older devices.
- Rebuilt search across Spotlight, Photos and Mail.
- iOS 27 supports iPhone 11 and later, the widest iOS reach yet.
*Everything else across platforms*
- iCloud Shared Albums now full-resolution and cross-platform (incl. Android and Windows).
- Health: perimenopause and menopause support in Cycle Tracking.
- Apple Watch: dynamic app grid of five Siri-suggested apps, a Smart Stack widget tap gesture, a consolidated Find My app.
- AirPods: custom EQ; AirPods Pro 3 heart-rate sync via GymKit.
- Vision Pro: panoramas convertible into spatial Environments; Wi-Fi up to 3x faster.
- Apple Maps: enhanced Flyover combining aerial imagery with AI.
So far this looks like a solid WWDC but not a revolutionary one. Looking forward to testing the updated Siri / Apple Intelligence although, as a European, I will have to wait :/
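Translated summary: At WWDC 2026 Apple introduced an all-new Siri AI, powered by Google Gemini (a multiyear deal at roughly $1B/year for a custom model) together with Apple's own on-device Foundation Models. It supports on-screen awareness, personal-context search across messages, mail, and photos, systemwide chained app actions, and live web answers, and it adds a standalone Siri app with conversation history synced through iCloud. Other Apple Intelligence updates include Spatial Reframing in Photos, page monitoring in Safari, one-tap suggestions in Messages, and photorealistic generation and editing in Image Playground. On performance, apps launch 30% faster and AirDrop is 80% faster. macOS 27 is named Golden Gate. Siri AI will not be available in the EU at launch (due to Digital Markets Act restrictions).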
Josh Woodward @joshwoodward · 7 days ago · 67
The new killer NotebookLM feature: easily being able to expand your search beyond your own source files
Then, with today's update, you can also make new output formats: PDFs, DOCX, XLSX, PPTX, charts, etc.
We want NotebookLM to keep helping you do better research
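Translated summary: NotebookLM received a major upgrade today, adding agentic capabilities and stronger reasoning in chat, plus the ability to search web content beyond the user's own source files. It also supports export to new formats including PDF, DOCX, XLSX, PPTX, and charts. The update is available to Google AI Ultra subscribers.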