2026-05-29

Step 3.7 Flash

The new frontier is agent efficiency.

A high-efficiency Flash model for real-world agents.


Key Features

  • Native Multimodal Understanding & Acting

    Understands images across the full range — product UIs, documents, charts, and natural scenes — then writes code or calls tools to act on what it sees.

  • Web & Visual Search Enhancement

    Web search reaches further — more sources, deeper follow-up. Visual search recognizes what other systems don't — long-tail entities, freshly emerged concepts.

  • Reliable Tool Use & Orchestration

    Drives terminals, browsers, Office tools, search, and beyond, staying coherent however long the run gets. Less drift, fewer broken tool calls, fewer failed runs.

  • Agent Ecosystem Compatibility

    Works with mainstream harnesses (Claude Code, KiloCode, Hermes Agent, OpenClaw) and Skills — lower integration cost, less workflow rewiring.

Agentic Coding

SWE-Bench Pro

  • Step 3.7 Flash: 56.3 (196B params)
  • Step 3.5 Flash: 51.3 (196B params)
  • DeepSeek V4 Flash: 55.6 (284B params)
  • Gemini 3.5 Flash: 55.1 (params unknown)
  • GPT 5.5: 58.6 (params unknown)
  • Claude Opus 4.7: 64.3 (params unknown)

Terminal-Bench 2.1

  • Step 3.7 Flash: 59.5 (196B params)
  • Step 3.5 Flash: 53.4 (196B params)
  • DeepSeek V4 Flash: 62.0 (284B params)
  • Gemini 3.5 Flash: 76.2 (params unknown)
  • GPT 5.5: 82.7 (params unknown)
  • Claude Opus 4.7: 69.4 (params unknown)

Multimodal

SimpleVQA (with Tool)

  • Step 3.7 Flash: 79.2 (196B params)
  • GLM 5V Turbo: 78.2 (params unknown)
  • Kimi K2.6: 78.2 (params unknown)
  • GPT 5.5: 79.1 (params unknown)

V* (with Python)

  • Step 3.7 Flash: 95.3 (196B params)
  • GLM 5V Turbo: 89.0 (params unknown)
  • Kimi K2.6: 96.9 (params unknown)
  • Gemini 3 Flash: 96.3 (params unknown)

General Agent

GDPval

  • Step 3.7 Flash: 45.8 (196B params)
  • Step 3.5 Flash: 28.0 (196B params)
  • DeepSeek V4 Flash: 44.0 (284B params)
  • Gemini 3.5 Flash: 57.8 (params unknown)
  • GPT 5.5: 63.0 (params unknown)
  • Claude Opus 4.7: 63.0 (params unknown)

Toolathlon

  • Step 3.7 Flash: 49.5 (196B params)
  • Step 3.5 Flash: 33.3 (196B params)
  • DeepSeek V4 Flash: 52.8 (284B params)
  • Gemini 3.5 Flash: 56.5 (params unknown)
  • GPT 5.5: 60.2 (params unknown)
  • Claude Opus 4.7: 65.4 (params unknown)

ClawEval-1.1 (2026-05-09)

  • Step 3.7 Flash: 67.1 (196B params)
  • Step 3.5 Flash: 43.6 (196B params)
  • DeepSeek V4 Flash: 57.8 (284B params)
  • Gemini 3.1 Pro: 57.8 (params unknown)
  • GPT 5.4: 60.3 (params unknown)
  • Claude Opus 4.6: 70.8 (params unknown)

HLE (with Tool)

  • Step 3.7 Flash: 47.2 (196B params)
  • Step 3.5 Flash: 35.7 (196B params)
  • DeepSeek V4 Flash: 45.1 (284B params)
  • Gemini 3.5 Flash: 40.2 (params unknown)
  • GPT 5.5: 52.2 (params unknown)
  • Claude Opus 4.7: 54.7 (params unknown)

Note: On non-multimodal tasks, we organize comparisons in two groups: the first compares Step 3.7 Flash with DeepSeek V4 Flash, an open-source model of comparable Flash-tier scale, while the second places Step 3.7 Flash alongside frontier closed-source models. On Terminal-Bench, Step 3.7 Flash, Gemini 3.5 Flash, and DeepSeek V4 Flash are evaluated on version 2.1 (the DeepSeek V4 Flash score is self-tested), whereas the GPT 5.5 and Claude Opus 4.7 figures are official self-reported Terminal-Bench 2.0 scores. On GDPval, the Step 3.7 Flash score comes from internal pairwise evaluation, while the comparison models' scores are taken from the official Artificial Analysis Leaderboard.

Agentic Coding

Foundation models are shifting from answering questions to taking action, and in the digital world that action takes the form of code. Coding is the substrate of digital agency, the purest form of the plan–execute–observe–iterate loop, and the leading indicator of where a model's broader agentic capability is heading. We invested heavily in this surface for Step 3.7 Flash. Compared to Step 3.5 Flash, it gains 5.0 points on SWE-Bench Pro (51.3 to 56.3) and 6.1 points on Terminal-Bench 2.1 (53.4 to 59.5).

Step-SWE-Bench
Harness | Step 3.7 Flash | Step 3.5 Flash
Hermes Agent | 67.50% | 60.00%
OpenClaw | 67.00% | 47.00%
Claude Code | 71.50% | 73.00%
KiloCode | 67.50% | 59.00%
OpenCode | 64.50% | 57.00%
RooCode | 64.50% | 43.00%
Average | 67.08% | 56.50%

In production, coding agents rarely run on a single scaffold. They live inside a heterogeneous stack of harnesses, each with its own prompting conventions, tool schemas, and orchestration patterns, and a model has to perform reliably across all of them to be genuinely useful. Step 3.7 Flash is markedly more balanced across this stack than Step 3.5 Flash: on our in-house Step-SWE-Bench, the gap between the best and worst harness narrows from 30.0 points (43.00% to 73.00%) to 7.0 points (64.50% to 71.50%), while the average rises from 56.50% to 67.08%.
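The averages and the per-harness gap can be recomputed from the Step-SWE-Bench figures reported above. The snippet below is only a worked check of that arithmetic; the variable and function names are illustrative and not part of any StepFun tooling.

```python
# Step-SWE-Bench scores per harness: (Step 3.7 Flash, Step 3.5 Flash), in percent.
scores = {
    "Hermes Agent": (67.50, 60.00),
    "OpenClaw": (67.00, 47.00),
    "Claude Code": (71.50, 73.00),
    "KiloCode": (67.50, 59.00),
    "OpenCode": (64.50, 57.00),
    "RooCode": (64.50, 43.00),
}


def summarize(column: int) -> tuple[float, float]:
    """Return (average, best-minus-worst spread) for one model column."""
    values = [pair[column] for pair in scores.values()]
    average = round(sum(values) / len(values), 2)
    spread = round(max(values) - min(values), 2)
    return average, spread


avg_37, spread_37 = summarize(0)  # Step 3.7 Flash
avg_35, spread_35 = summarize(1)  # Step 3.5 Flash
print(f"Step 3.7 Flash: avg {avg_37}%, spread {spread_37} points")
print(f"Step 3.5 Flash: avg {avg_35}%, spread {spread_35} points")
```

A smaller spread means the model's result depends less on which harness it runs in, which is the sense of "balanced" used here.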

To push quality further without giving up Flash-tier efficiency, Step 3.7 Flash supports Advisor Mode. Step 3.7 Flash drives the trajectory end-to-end (calling tools, reading results, and iterating) and consults a larger advisor model only at the few inflection points where its own judgment falls short, such as planning or recovering from repeated failures. This is Step's implementation of the advisor strategy described by Anthropic, where a small executor stays in control and escalates to a frontier advisor only when needed, keeping most of the run at executor cost. With Advisor Mode enabled, Step 3.7 Flash reaches 97% of Claude Opus 4.6's coding performance on SWE-Bench Verified (76.3% vs. 78.7%) at roughly one-ninth the per-task cost ($0.19 vs. $1.76 per task).
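The control flow described above can be sketched as a small loop. This is a minimal illustration of the executor-plus-advisor pattern, not StepFun's implementation: the function `run_with_advisor`, its parameters, the `"done"` sentinel, and the choice of escalation triggers (initial planning and a run of consecutive tool failures) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def run_with_advisor(
    task: str,
    executor: Callable[[str, List[str]], str],    # small model: picks the next action
    advisor: Callable[[str, List[str]], str],     # large model: returns guidance text
    run_tool: Callable[[str], Tuple[bool, str]],  # executes an action -> (ok, observation)
    max_steps: int = 20,
    failure_threshold: int = 2,
) -> Tuple[List[str], int]:
    """Run the executor loop, escalating to the advisor only at inflection points."""
    history: List[str] = []
    advisor_calls = 0

    def consult() -> None:
        nonlocal advisor_calls
        history.append("advice: " + advisor(task, history))
        advisor_calls += 1

    consult()  # inflection point 1: initial planning
    consecutive_failures = 0
    for _ in range(max_steps):
        action = executor(task, history)
        if action == "done":
            break
        ok, observation = run_tool(action)
        history.append(f"{action} -> {observation}")
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            consult()  # inflection point 2: recovering from repeated failures
            consecutive_failures = 0
    return history, advisor_calls
```

Because the advisor is called only on these triggers, the number of advisor calls stays small relative to the number of executor steps, which is what keeps most of the run at executor cost.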

Score vs. Cost on SWE-Bench Verified
(Claude Opus 4.6: internal reproduction)
[Figure: score (%) against cost per agentic task ($). Plotted points: 73.7% at $0.12 and 76.3% at $0.19, both labeled Step 3.7 Flash + Advisor; 78.7% at $1.76 for Claude Opus 4.6.]
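The "97%" and "one-ninth" figures follow directly from the plotted SWE-Bench Verified numbers. The short check below reproduces them; the function names are illustrative only.

```python
def relative_score(score: float, baseline: float) -> float:
    """Fraction of the baseline's score that the candidate reaches."""
    return score / baseline


def cost_ratio(baseline_cost: float, cost: float) -> float:
    """How many times cheaper the candidate is per task than the baseline."""
    return baseline_cost / cost


# Step 3.7 Flash + Advisor vs. Claude Opus 4.6 (internal reproduction).
rel = relative_score(76.3, 78.7)
ratio = cost_ratio(1.76, 0.19)
print(f"relative score: {rel:.1%}, cost ratio: {ratio:.1f}x")
```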

Sharpened for Enterprise Tasks

Enterprise work inherently depends on two critical pillars: autonomous task execution in dynamic environments and deep, domain-specific vertical knowledge. Step 3.7 Flash is purpose-built and rigorously optimized across both frontiers to independently drive assignments and ship production-grade deliverables.

The model combines strong agentic execution with precise intent understanding and rich multimodal perception, allowing it to seamlessly bridge the gap between comprehension and action. Users can hand Step 3.7 Flash a complete piece of knowledge work and trust it to independently map out the plan, search across live sources, extract key information, and fluidly orchestrate tools to deliver a ready-to-ship result without intervention. It reads and directly acts on mixed inputs—such as screenshots, complex documents, and dense spreadsheets—parsing visual context and digital assets simultaneously. This long-horizon task execution is validated across diverse environments, where Step 3.7 Flash achieves 49.5% on Toolathlon for multi-tool coordination and 67.1% on ClawEval-1.1 for daily autonomous task execution in realistic environments.

The path from general intelligence to true professional expertise starts with real expert practices. By partnering deeply with domain specialists, we have embedded native industry know-how into the model, validating its capabilities through our own benchmarks in finance, accounting, and data analysis. This expertise extends well beyond specialized domains: Step 3.7 Flash reaches 45.8% on GDPval across 44 occupations, and achieves a pass rate above 98% across reasoning difficulty tiers on Tau2-bench Telecom.

Manufacturing · Production Scheduling

Search Wider and Deeper

For a model at the scale of Step 3.7 Flash, the goal is not to pack every piece of world knowledge into its weights, but to make the model better at calling upon that knowledge when needed. We therefore focus its capabilities on search planning, evidence filtering, and information synthesis, turning search from an external add-on into a native part of the reasoning process.

Step 3.7 Flash delivers strong results across search-heavy benchmarks. It scores 47.20% on HLE with Tools, up from 35.68% (text-only) for Step 3.5 Flash, and outperforms both DeepSeek V4 Flash (45.10%) and Gemini 3.5 Flash (40.20%). It reaches 75.82% on BrowseComp, approaching larger models such as Claude Opus 4.7 and GLM 5.1. On DeepSearchQA, it achieves a 92.82% F1 score, comparable to Kimi K2.6, a model with 1T total and 32B active parameters. On ResearchRubrics, it scores 71.68%, ahead of GPT 5.5 at 61.50% and close to Claude Opus 4.7 at 73.92%. These results show that Step 3.7 Flash combines Flash-level efficiency with strong deep-retrieval and research capabilities.
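DeepSearchQA is reported both as an F1 score and as an accuracy, and the two can diverge when a question expects several answer items. The sketch below illustrates the difference under the assumption that F1 is computed per question over sets of answer items and that accuracy requires an exact match of the full set; the benchmark's official scoring may differ, and the function names and sample data are hypothetical.

```python
def set_f1(predicted: set, expected: set) -> float:
    """Harmonic mean of precision and recall over answer items."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: set, expected: set) -> float:
    """1.0 only when the predicted set equals the expected set."""
    return 1.0 if predicted == expected else 0.0


# Hypothetical question with three expected items; the answer gets two right
# and adds one wrong item.
expected = {"item A", "item B", "item C"}
predicted = {"item A", "item B", "item X"}
print(set_f1(predicted, expected), exact_match(predicted, expected))
```

Under these assumptions a mostly correct answer still earns partial F1 credit while scoring zero on accuracy, which is consistent with the F1 figure being higher than the accuracy figure for every model in the benchmark table.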

The trajectories further highlight both the breadth and depth of its search behavior. In the Ontario lawyer conflict-of-interest case, it expanded its search around domain-specific concepts, combined evidence from papers, course materials, official rules, and case analyses, and caught the key traps in the questions.

Agents That Can SEE

We position Step 3.7 Flash as an agentic foundation model with vision input support, shifting perception and recognition from parametric capacity to test-time scaling with visual tools. As a first step, we strengthen its ability to invoke the Visual Search tool, which compensates for the gaps in parametric knowledge that come with Step 3.7 Flash's limited model size. As shown in the table below, on visual recognition tasks, Step 3.7 Flash with Visual Search performs on par with models five times its size.

Visual Recognition with Visual Search

Flash Level Pro Level
Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | GPT 5.5
SimpleVQA | 79.16% | 78.24%* | 78.20% | 79.11%*
WorldVQA | 58.10% | 55.98%* | 47.81%* | 54.58%*
BC-VL | 58.96% | 57.12%* | 51.90%* | 65.68%*
  1. * denotes a self-tested score.

For a broader set of challenging vision tasks that demand fine-grained perception over high-resolution images or visual reasoning (such as V*, HR-Bench, and VisualProbe), we grant the model an enriched action space to interact with images, including cropping, zooming in and out, and drawing pixels or bounding boxes. These tools are implemented as a unified code interface, commonly referred to in the field as the Python tool. With Python, Step 3.7 Flash reaches 95.29% on V* and 65.05% on VisualProbe, competitive with the other models in the table below.
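To make the action space concrete, the following is a minimal, dependency-free sketch of the three operations named above (crop, zoom, draw a bounding box) acting on a grayscale image stored as a list of rows. It only illustrates the kind of code a model might write through a Python tool; the function names and the image representation are assumptions, and a real setup would more likely use an imaging library.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, row-major
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(img: Image, box: Box) -> Image:
    """Return the sub-image inside the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]


def zoom(img: Image, factor: int) -> Image:
    """Nearest-neighbor upscaling: repeat each pixel `factor` times per axis."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def draw_box(img: Image, box: Box, value: int = 255) -> Image:
    """Return a copy of the image with the box outline set to `value`."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


# Inspect a region of interest at higher magnification, then mark it.
image = [[row * 8 + col for col in range(8)] for row in range(8)]
region = (2, 2, 6, 6)
patch = zoom(crop(image, region), factor=2)  # 4x4 region becomes 8x8
annotated = draw_box(image, region)
```

Exposing these operations as ordinary code rather than as separate fixed tools lets the model chain them freely, for example cropping first and then zooming into the crop.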

Visual Perception with Python Tool

Flash Level Pro Level
Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | Gemini 3 Flash
V* | 95.29% | 96.90% | 89.00% | 96.30%
HR-Bench 4K | 89.13% | 91.25%* | 84.62% | 94.50%
HR-Bench 8K | 86.34% | 90.13%* | 83.12% | 94.80%
VisualProbe | 65.05% | 64.47%* | 53.01% | 69.90%
  1. * denotes a self-tested score.
  2. The GLM results were obtained with crop, search, and other tools, using a setup confirmed with the official GLM team.

One particularly interesting finding is the emergent ability of compositional generalization across visual and other tools. During testing, Step 3.7 Flash seamlessly combined visual tools with non-visual ones to accomplish complex tasks, despite never having been explicitly guided toward such compositional tool use during training.

Visual Reasoning with Python Tool

Compositional Usage across Visual and Non-visual Tools

Operating graphical user interfaces (GUIs) is another foundational visual capability for an agentic model: many real-world tasks live beyond the chatbox and the CLI, and require the agent to see, click, and verify. We extend Step 3.7 Flash with GUI operation, in particular for the phone-use stack, so that it can complete long-horizon tasks across multiple apps. On the Android Daily benchmark, Step 3.7 Flash improves substantially over last year's Step-GUI in stability, robustness, and long-horizon completion. It scores 61.87%, ahead of the larger Kimi K2.6 (53.36%) and of GLM 5V Turbo (51.68%), and close behind Gemini 3 Flash (63.21%).

Score of Android Daily Benchmark

  • Gemini 3 Flash: 63.21%*
  • Step 3.7 Flash: 61.87%
  • Kimi K2.6: 53.36%*
  • GLM 5V Turbo: 51.68%*
  1. * denotes a self-tested score.
  2. Android Daily: https://arxiv.org/abs/2605.27761

The same compositional pattern we observed across visual tools also surfaces here: in the following case, after writing a piece of frontend code, the model autonomously turned to the GUI to test the page it had just produced — inspecting the rendered output, exercising interactive elements, and iterating on its own code based on what it saw. Again, this code-and-GUI compositional behavior was never explicitly demonstrated or rewarded during training, yet emerges robustly in test-time use.

GUI Operation

Benchmarks

In our benchmark table, we provide a detailed, side-by-side comparison of Step 3.7 Flash with today's top-performing open-source and closed-source models. Across a wide range of metrics, Step 3.7 Flash stands out with consistently strong results. Our evaluation focuses on three core dimensions: Reasoning, Coding, and Agentic Capability.

Flash Level Pro Level
Benchmarks | Step 3.7 Flash | Step 3.5 Flash | DeepSeek V4 Flash | Gemini 3.5 Flash | DeepSeek V4 Pro | GPT 5.5 | Claude Opus 4.7 | Kimi K2.6 | GLM 5.1
Total Params | 196B + 1.8B (ViT) | 196B | 284B | — | 1.6T | — | — | 1T | 754B
Active Params | 11B | 11B | 13B | — | 49B | — | — | 32B | 40B
Multi-modal
General Agent
HLE w. tool (acc) | 47.20% (text-only 49.70%) | 35.68% | 45.10% | 40.20% | 48.20% | 52.20% | 54.70% | 54.00% | 52.30%
BrowseComp (acc) 75.82% 69.00% 73.20% 83.40% 90.10% 79.30% 83.20% 79.30%
deepsearchQA (F1) 92.82% 85.48%* 90.61%* 93.98%* 91.74%* 92.50% 91.16%*
deepsearchQA (acc) 81.69% 73.44% 79.76%* 85.31%* 82.31%* 83.00% 81.31%*
ResearchRubrics (score) | 71.68% | 65.30% | 66.17%* | 63.58%* | 68.31%* | 61.50%* | 73.92%* | 62.96%* | 67.90%*
Toolathlon | 49.51% | 33.33% | 52.78%* | 56.50% | 51.80% (56.61%*) | 60.18%* | 65.43%* | 54.63%* | 48.09%*
Claweval-v1.1 (pass^3) 67.07% 43.60% 57.80% 59.80% 62.30% 62.30%
GDPval-Stirrup (rubric-score) | 1415.8 (ii 45.79%) | 1055.0 (ii 27.75%) | 1414.0 (ii 44.00%) | 1656 (ii 57.80%) | 1554 (ii 53.00%) | 1769 (ii 63.00%) | 1753 (ii 63.00%) | 1481 (ii 49.00%) | 1535 (ii 52.00%)
Coding
SWE-MTLG 72.42% 67.40% 73.30% 76.20% 80.50% 76.70%
SWE-Bench Pro | 56.26% | 51.30% | 55.60%* | 55.10% | 55.40% | 58.60% | 64.30% | 58.60% | 58.40%
Terminal-Bench 2.1 59.55% 53.37% 62.00%* 76.20% 72.00% 78.2% ± 2.4 66.1% ± 2.7 69.00%
Long Context
AA-LCR (avg@16/acc) 63.94% 45.50% 63.70% (63.00%*) 66.30% (66.30%*) 69.10% (69.70%*) 64.90% (62.30%*)
  1. "—" indicates the score is not publicly available or not tested. * denotes a self-tested score.
  2. Android Daily: https://arxiv.org/abs/2605.27761
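The Claweval-v1.1 row is reported as pass^3. A minimal sketch of how such a metric can be computed, assuming it follows the pass^k convention introduced with tau-bench, where a task counts as solved only if all k independent trials succeed: with n recorded trials and c successes per task, the unbiased estimate is C(c, k) / C(n, k), averaged over tasks. Whether ClawEval uses exactly this estimator is an assumption, and the function names are illustrative.

```python
from math import comb
from typing import List, Tuple


def pass_hat_k(num_trials: int, num_passed: int, k: int) -> float:
    """Estimate the chance that k trials of one task all pass: C(c, k) / C(n, k)."""
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(num_passed, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks; each entry is (num_trials, num_passed)."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Two hypothetical tasks with three trials each: one always passes, one fails once.
print(benchmark_pass_hat_k([(3, 3), (3, 2)], k=3))
```

Unlike pass@k, which rewards at least one success in k attempts, pass^k penalizes any failure, so it measures consistency across repeated runs rather than best-case ability.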

Meet StepFun

Availability, Deployment, and Ecosystem

Availability

Step 3.7 Flash is available through StepFun Open Platform at platform.stepfun.ai and platform.stepfun.com, as well as partner platforms including OpenRouter and NVIDIA NIM.

Deployment

Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, Step 3.7 Flash can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / Mac Pro devices with at least 128GB unified memory.
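As a rough sanity check on the 128GB figure, the weight footprint can be estimated from the parameter count and the number of bits stored per parameter. The calculation below assumes quantized weights (the source does not state which precision is used for local deployment) and ignores the KV cache, activations, and runtime overhead, so it is a lower bound rather than a sizing guide.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# 196B language-model parameters plus the 1.8B ViT, as listed in the benchmark table.
total_params = 196e9 + 1.8e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(total_params, bits):.1f} GB")
```

Under these assumptions only the 4-bit case (about 99 GB) fits within 128GB of unified memory, while 8-bit weights (about 198 GB) would not.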

Ecosystem

Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. For model development workflows, StepFun model support has landed in the NVIDIA Megatron ecosystem, including Megatron Core and Megatron Bridge.