We put Claude 4 through 40 structured tasks across six weeks — professional emails, long-form articles, bilingual translation, contract redlining, code documentation, and customer-facing copy. The goal wasn't to cherry-pick impressive outputs; it was to find the failure modes.
The short version: Claude 4 is the tool we'd actually recommend for anyone whose job involves writing more than a few hundred words a day.
On English prose, Claude 4 is the most consistent large model we've tested. Its ceiling is slightly lower than GPT-4o's; its floor is considerably higher.
When we asked it to rewrite the same 800-word investor update in five different registers, it nailed four of five on the first pass. The LinkedIn version needed a second prompt to strip out the corporate hedging.
For long-form work specifically, the advantage is real and measurable. We gave it a 40,000-word product spec and asked it to write an executive summary. It correctly identified the three most commercially significant points, none of which were explicitly flagged in the document.
Claude 4's Chinese output has closed much of the gap with its English quality since Claude 3. It doesn't write like a native — there's still a slight foreignness in rhythm — but it's past the threshold where you'd need to rewrite the whole thing before sending it to a client.
Speed is a real issue on long outputs. Where GPT-4o will stream a 2,000-word article in roughly 45 seconds, Claude 4 takes closer to 90. And the lack of native web search is a genuine gap if research is part of your workflow.