This post was provoked by a thread on Bluesky. An InfoSec professional confidently responded to a lawyer about the utility of generative AI for legal work. If only people would get better at building agents, prompting, and checking output.
The change to work under discussion, once you read past the mutual frustration, was experts pivoting to building agents, tuning models, and checking outputs.
Another position, responding to the point that this could be a death knell for much work satisfaction and for the ability to train new cohorts of experts, was 'suck it up, buttercup'. It went on to suggest that those who understand will thrive, using AI as a force multiplier, while those who cannot will get left behind.
I argue there is an interim step. If things get eroded to an overly simplistic and intermittently erroring mean, folk quietly stop using the tools. Not because of principles, ignorance, or fear. Because of the 'Just give it to me!' moment. When the time and money set aside to procure tools, train people, tune systems, and hone prompts have run out. The expectation of 10x productivity is still there and work keeps coming back with hard-to-predict issues.
The InfoSec person went on to challenge various legal interjectors, pointing out how the world shifts. Pointing out that experts need to get better with LLM or SLM (small language model) use, lest they find themselves out of a job.
A history of intermittent errors
What folk seem to be forgetting is a history of intermittent errors. Folk who know what the heck is going on, in depth, for their part of a complex system, are not folk you want to hastily shuffle off in favour of an almost-there language model. Condemning specialists to the role of fact checking and knowledge download. Gains in productivity credited to the magic new system. Absence of gains pinned on inadequate embrace of the tech and work to integrate.
Back in the day a bunch of the servers in a network I looked after were intermittently dropping connection to the internet. Everyone dreads hearing the words 'intermittent error'. Horrid thing to work on in a resource-constrained system. After two days of diagnosis, not doing all the other things I should have been doing, ruling out other causes, it came down to (you guessed it) DNS.
I actually diagnosed that very quickly and put calls in to the people who look after the big academic internet pipes. They, quite correctly, pointed me at Cable and Wireless as holders of the keys to our exit to the rest of the interwebs (intentionally using highly technical terms here). I asked them to check if any DNS config had changed for our subnets in the last week. Cable gave me the usual response: 'Nah, it won't be us'. Nothing I had was budging them. I needed more to back me up. Guilty until proven innocent.
Armed with DNS config and logs from our side, with traces of outages over three days pointing to the specific subnet the issues were limited to, I phoned Cable again. After 30 minutes of being shuffled round the houses I got to speak to a chap who said 'Oh yeah, we updated some of the routing tables'. What he didn't add was 'which buggered up the routing for that subnet'. No apology, just a quiet fix. Two days lost.
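For anyone facing a similar 'prove it' standoff today, here is a minimal sketch of the kind of evidence gathering involved, assuming Python and only the standard library. The hostnames, probe interval, and log file name are illustrative placeholders, not details from this story.

```python
# Minimal sketch: periodically probe name resolution for a few hosts and log
# timestamped failures, building the kind of evidence trail needed to push
# back on an upstream provider. Hostnames, interval, and log file are
# placeholders for illustration only.
import logging
import socket
import time

logging.basicConfig(
    filename="dns_probe.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

HOSTS = ["server1.example.ac.uk", "server2.example.ac.uk"]  # hypothetical names
INTERVAL_SECONDS = 60

def probe(host: str) -> None:
    """Attempt resolution once and record the outcome with a timestamp."""
    try:
        addr = socket.gethostbyname(host)
        logging.info("OK %s -> %s", host, addr)
    except socket.gaierror as exc:
        logging.warning("FAIL %s (%s)", host, exc)

if __name__ == "__main__":
    while True:
        for host in HOSTS:
            probe(host)
        time.sleep(INTERVAL_SECONDS)
```

A few days of that log, correlated with the affected subnet, is roughly the kind of thing that finally got someone at the other end to check their routing tables.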
Another time we had degraded performance and periodic connection drops in admissions. It never happened when I was there to monitor traffic. Everyone lived with that, enormously grumpily, but it was coming up to new intake application time, the busiest time of the year. It had to get fixed.
There was no central traffic analysis for the network edges at this point. This was a network held together by very old duct tape, string, and wishes. I traced every inch of that cable, via different cabinets and roof spaces, until I found a very old router, with a newish mouse nest. The fluctuations were down to a speed mismatch between the router feeding that point and the one feeding the onwards connection; it was intermittently negotiating down to a snail's pace. The outages were likely mouse related. The short fibre run was very gnawed.
That was fixed by replacing the fibre and the old box with a relatively inexpensive router, plus a mouse trap... a relatively humane one. The cause was folk making do and mending with spare boxes to replace ones that periodically died, not knowing enough to foresee the issues, and it being a fairly rural campus.
We were not alone. St Andrews University had (has?) a huge problem with squirrels eating core fibre. Similarly the massive undersea internet cables, with power thrumming through them, are catnip for sharks. Sometimes it is the nature of creatures, people, and their environment vs tech. Mostly it is mismanaged or misunderstood complexity rather than nation state hackers when the internet breaks. Ok, fair play, sometimes it is submarines, or other not unintentional activities.
What the heck has that got to do with LLMs?
At the very core of this is where we delegate accountability and responsibility to monitor for, diagnose, and resolve intermittent and increasingly niche errors.
Will that progress, via model retraining, expert tuning prior to release, tweaks to weightings, system prompts, post-release human tuning, layered RAG (Retrieval Augmented Generation), Chain of Thought (CoT), Knowledge Graphs, etc., to the stage where we can safely depend upon the outputs and easily diagnose the inevitable residual errors?
It all, and I can say this entirely confidently, depends on the context.
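To make that layering concrete, here is a minimal sketch of retrieve-then-generate-then-check, under the heavy caveat that every function is a stand-in (a toy in-memory corpus and a stubbed model call), not any vendor's API.

```python
# Minimal sketch of the kind of layering mentioned above: retrieve supporting
# text, generate an answer, then run a crude grounding check before anything
# is trusted. Every function here is a stand-in for illustration only.
from typing import List

CORPUS = {  # tiny illustrative knowledge base
    "routing": "Routing table updates on the provider side broke one subnet.",
    "dns": "DNS configuration on our side was unchanged during the outage.",
}

def retrieve(question: str, k: int = 2) -> List[str]:
    """Naive keyword-overlap retrieval over the in-memory corpus."""
    scored = sorted(
        CORPUS.values(),
        key=lambda doc: len(set(question.lower().split()) & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(question: str, context: List[str]) -> str:
    """Stand-in for a model call: simply echoes the top retrieved passage."""
    return context[0] if context else "No supporting material found."

def grounding_check(answer: str, context: List[str]) -> bool:
    """Crude filter: flag answers with little overlap with retrieved sources."""
    answer_terms = set(answer.lower().split())
    source_terms = set(" ".join(context).lower().split())
    return len(answer_terms & source_terms) / max(len(answer_terms), 1) > 0.5

question = "What changed to break connectivity for that subnet?"
context = retrieve(question)
answer = generate(question, context)
print(answer if grounding_check(answer, context) else "Needs human review.")
```

Even in this toy form, the interesting question is who owns the 'Needs human review' pile, how often it fills up, and how quickly the person emptying it can tell where in the layers things went wrong.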
Blame the AI intern
If you will excuse the personification of AI for a moment, that is not desperately far removed from the process of training and then delegating work to a brand new intern. The quality and speed of issue diagnosis leans heavily on local system knowledge and domain expertise. A brand new intern has little of either, and a fundamental part of the hiring decision is having the time and spare bodies to build their capability and check their work until they are trustworthy and cost effective.
Recent experience involved a very keen and capable intern providing AI-generated outputs and not acknowledging that. Always a hyper-quick response to asks. Always a sense the content was somehow 'off'. Needing to go through output with a fine-tooth comb, considering details and the content as a whole, to work out where things had gone awry, to course correct and train.
Over time it became apparent these were generalised responses, missing the intent of the ask. The intern could not explain why they chose certain outputs or how the logic carried through to the carefully defined problem. That was incredibly frustrating for all concerned. It required multiple working sessions to realign. That could be blamed on me for not articulating the need clearly, but it could also be blamed on presenting output divorced from the full problem context, without the human chains of thought for each part that provide the means to unpick issues. It was biased towards general internet positions, not the space we were operating in.
The other element here is where you put your 'good enough' marker. We could have published that content from the intern and called it good. We would have had to swallow accusations of it being generic. We could have invested money and time in training them to prompt and in a model with bespoke training for purpose. Or we could have had those working sessions to create content together.
That 'good enough' marker for content can be low for tech firms in a hype-cycle. It has been basement low for content generated simply as a placeholder for advertising in the broader internet ecosystem. That is, in part, the content these models have been trained on. That programmatic SEO equation actively discouraged careful curation. Engagement was king, not accuracy, authoritative sources, or quality.
If the appearance of force multiplication is sufficiently real, firms will capitalise. The downstream nuance of fitness for purpose and issues potentially stored up has little to do with company valuation for the next funding round.
Quoting myself (I know, insufferable thing to do):
When the time and money set aside to procure tools, train people, tune systems, and hone prompts have run out. The expectation of 10x productivity is still there and work keeps coming back with hard-to-predict issues.
Condemning specialists to the role of fact checking and knowledge download. Gains in productivity credited to the magic new system. Absence of gains pinned on inadequate embrace of the tech and work to integrate.
That, I argue, is one of the more harmful dynamics here. Expectations set by PR and reality set by... well... reality. Just like me trying to get to the point where Cable and Wireless would believe my diagnosis, people using these systems and identifying a potential issue may find it incredibly hard to combat the murk around system operation. If they are excessively invested in achieving those productivity gains, perhaps having staked careers on it, there will be massive pressure to 'just make it work'.
Folk generally lack the skills, experience, language, and historical data to unpick many facets of system performance (generative AI at scale, as I'm so fond of saying, is still VERY young). That ignores constant updates to compensate for architectural, training, tuning, configuration, fine tuning, content moderation, and operational issues. Plus the rest of the less novel tech stack and internet plumbing everything is built upon. Almost every facet a moving target.
Can we simplify the core challenges here, pull everyone up to a common level of understanding, to start to address bigger questions, like those around copyright, power consumption, usable data availability, and aggregate impact on the information ecosystem from intended, unintended, and malicious generative AI usage?
Not, I argue, in a desperate hurry.
An attempt to distil, and a call to action
I've been reading my own content and presentation recommendations, so I should leave you with something tighter to consider and some actionable insights. Here goes:
- Complex technical systems with intermittent errors require deep expertise to diagnose and fix
- Current LLM implementations in professional work are no different in that respect
- The jury is still out on whether the probabilistic nature of generative AI architecture can ever be layered with enough filters to justify dependence in areas where precision is paramount.
The key parallels in both cases:
- Surface-level symptoms can mask complex underlying issues
- Deep domain expertise is crucial for proper diagnosis and resolution
- "Making do" with partial solutions can lead to compounding problems
- The true cost of errors isn't immediately apparent
The current push to have experts become LLM "prompt engineers" and output checkers may be missing a crucial point (and encouraging brain drains): the value of deep domain expertise in understanding when something is subtly wrong.
Suggesting that:
- The real issue isn't resistance to change, but recognition that intermittent errors in professional work (like law or InfoSec) can have serious consequences
- The cost of checking and fixing AI outputs may eventually outweigh the perceived productivity benefits (a rough break-even sketch follows this list)
- The role of "fact checking and knowledge download" may be more complex and time-consuming than AI enthusiasts assume
Before reorganising entire professions around AI tools, we need to:
- Properly account for the hidden costs of refining new and pre-existing processes and tools to integrate, monitor, validate outputs, and error correct
- Value and preserve the deep expertise that lets us spot when things are subtly wrong
- Consider the longer-term implications of turning domain experts into AI-output validators in terms of staff retention, reputational risk with poor quality outputs, and ability to train new people with the same level of domain knowledge.
Are the tools useful? They clearly are. The question is whether the current push to reshape professional work around them acknowledges the true complexity and total cost of ownership for something that is still unpredictably 'almost right'.
Potential for a two-tier system in many fields?
This raises the concern that we will see a two-tier system in a lot of different fields, for a lot of different purposes. Human-led work at a premium, the shifting sands of generative AI 'good enough' output for more affordable offerings.
The potential issues there should be apparent. Can 'good enough' ever be appropriate in the medical, manufacturing, defence, education, transport, or legal sectors? We have seen many examples of lawyers being caught out with fictional case law, or poor interpretation of precedent. The only compensation, against a shifting tech solution baseline, is best-efforts validation from available experts in often limited time. Caveat emptor for remaining inconsistencies or errors. Might that quickly ramp up post hoc liability for organisations? Is that baked in as a cost of doing business for vendors, insurers, and some system deployers?
This is not, as some like to claim, a totally novel disruptive challenge. A Chatham House session I attended included an NGO that provides support in the midst of humanitarian crises. They remembered needing tech support and not getting it unless they were willing to tolerate blockchain elements that were not fit for that purpose. The funding pots were driven by the next big thing (TM). A trade-off of inconvenient mandated tech use, with data collection to feed back, in return for funding for basics. In that meeting, those funding pots had just shifted to being conditional on using AI. We see the same thing being played out at huge scale and at huge speed in the education sector.
Tech solutionism is not new and is going nowhere. Fitness for purpose and upstream, downstream, lateral impact can only be understood in local contexts and can only be identified by those with local knowledge.
The quicker we get the people who will use these systems in anger up to speed the better, with allowance for the possibility that the systems may just be a poor fit. But that, we know, depends on risk tolerance. Is it desirable to see people nudged out through attrition due to being demoted to glorified proof-readers?
We are good at calculating the total cost of ownership for staff. Full time vs part time vs zero-hours contracting. Finding candidates, screening, hiring, onboarding, training, incremental productivity uplift as they gain experience. Ways to retain them. Costs associated with them being fragile, complex systems with often hard-to-diagnose intermittent problems, often dependent on the people, processes, and systems they interact with.
We need to get far better at doing that for language models with various wrappers. We need to avoid viewing incredibly novel systems (that lean hugely on our most experienced people) as a quick, cheap, and simple fix.
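To put that in the same terms as the staff exercise, here is a toy line-item version for a language model with wrappers. Every figure below is a placeholder for illustration, not data.

```python
# Sketch of treating an LLM deployment like the staff TCO exercise above:
# sum the line items we usually forget, not just the licence fee.
# All figures are placeholders, chosen only to show the shape of the sum.
annual_costs = {
    "licences_and_api_usage": 60_000,
    "integration_and_wrappers": 40_000,
    "tuning_prompts_and_evals": 25_000,
    "monitoring_and_logging": 15_000,
    "expert_review_time": 90_000,      # the 'fact checking and knowledge download'
    "rework_from_missed_errors": 30_000,
}
print(f"Total cost of ownership: £{sum(annual_costs.values()):,}")
```

The point is not the totals; it is that the biggest lines tend to be the ones leaning on our most experienced people.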
DISCLAIMER: Unlike the last few posts that used no generative AI output, and prior ones where using some LLM output was very explicitly flagged prior to relevant sections, here the summary and call to arms sections are a more subtle use of Claude 3.5 Sonnet (quickly becoming a favourite model for many). The final part about two-tier systems is all me again. Claude proposed a far more optimistic picture. I am not ready to be optimistic about that potential inequity.
All Claude elements were carefully checked. None of that is intended to drive legally or otherwise commercially sensitive decisions or actions. It is general information. You should not trust that section without question. Weigh it against your technical and domain knowledge before you integrate it into any decision making. Do you have time for that? I thought not. So here we go again.