In the ever-evolving digital landscape, a seismic shift is occurring in how we create, consume, and validate information. Most people struggle to get their heads around crowdsourcing knowledge for sites like Wikipedia, let alone the way Large Language Models (LLMs) generate outputs. Some of the most used current LLMs - ChatGPT, Claude, Gemini, Copilot, and LLaMA - have dozens of invisible elements involved in producing outputs on top of the codified training data. As we navigate these choppy waters, it's crucial to understand invisible elements and what they might mean for the future of knowledge sharing.
*DISCLAIMER: A lot of this was written by Claude Sonnet 3.5 after ingesting previous posts for tone, phraseology, and context, plus some very detailed prompts, research papers, articles, and a list of things influencing LLM output. I proposed links between Wikipedia, Reddit, 23andMe, and Ancestry. Introducing them as examples of data stores being infiltrated by AI content and / or monetised.
I don't like the output for this purpose. I don't feel invested in it. The creativity went into the ideation, knowledge sourcing, and threads formed between things. It helps with a blank page, but can derail chains of thought with detours. The aiming point is not to parrot most likely things said and the way most say them. The motivation is to communicate not proof read.
The former is a burst of anticipatory pleasure about joined dots, initial threads, and where that may take us. Writing to think, like talking to think, like collaborating to think. Like building a house and decorating. Making a place we can thrive in. Creating a place that makes others feel comfortable, supported, and safe. Doing that so ideas get freely shared and challenges are welcomed.
Checking and sharing LLM output is akin to prepping a generic rental. Having to clean so we don't lose deposits. It does the job. Shelter, a place to rest, a place to cook, a place to think, a creepy extra dose of close surveillance in the name of security, safety, productivity measurement, and personalisation. It's close enough to work to make up for the quality of non-work life and isolation, but it's far from our choice of layout, contents, and decor. It's far from our prior home, community, and family. It's built with materials taken from old houses. Houses we paid for and made meaningful through our presence.
You folk may like the result. It serves as an MVP - Minimum Viable Purpose / Product. I won't be satisfied until I extensively amend it. I didn't for the sake of comparison. I generally don't use LLMs / SLMs here because the trade offs don't stack up... for me. Unlike this chap, with artificial music and listeners. until he got caught. We are all bad at delaying gratification, but most avoid exploitation.
Creating parts of this post with generative AI is a use case with immediate trade offs. The primary gain is breaking a writing block and time to draft. The downside is work to check for inaccuracies and editing to reflect an authentic voice. Getting into the mid-range perspective: what is the potential impact on my capabilities, reputation, career if I develop a dependence, especially if the pricing or access becomes prohibitively costly? What happens if models become capable of faithfully aping my style after training on more of my content, even if that excludes ability to join dots, chain thoughts, and connect with people like I do? Where does that leave me in terms of employment and a place in the industry?
That is ignoring the macro so many will La La La about. The aggregate impact on the information ecosystem. In part the subject of this post and a thread through most others. What about the huge environmental and resource costs with various global poor relations and beneficial projects competing for funding?
Knowing which mode you are in when discussing these things is incredibly useful. A handy tip to tackle overwhelm. Is it the micro, mid-range, or macro? Can you focus on one and set aside others, noting what you were thinking for next time? What is great for you might be a disaster for your employer, but a great option for wider society. It may be a nightmare for you, incredibly useful for your organisation, but create a raft of longer-term challenges. It may be a fait accompli in the short-term with tolerable trade offs, but a freemium model that creates an untenable cost and quality squeeze after establishing dependence.
We collectively need to recognise it will take a village and collaboration between villages to work out and deal with bigger concerns and opportunities.
Funny Ha Ha?
"It's a game of hot potato, but the potato is made of quantum particles that may or may not exist depending on who's looking."
"It's like watching a chess game where the pieces are made of people's genetic code, and the players are wearing blindfolds woven from dollar bills."
"It's like trying to wrangle a herd of digital cats with a rulebook written in disappearing ink." Claud Sonnet 3.5, Today
I asked it to replicate some humour based on original sources, The result is the rash of "It's like..." cracks at the end of most paragraphs. Don't get me wrong, some made me chuckle, but I'm not in there. What I can't do is speak for you, the reader. Also, did you check the references? Did I check the references?
I then coined the term 'Probabilistic Synthesis'. I can't find another reference to that. Do we need it? Likely no, but it makes sense to me because I worked on it.
One of the primary LLM use cases is exactly this. Unpicking technical aspects of any subject. Here we are talking about all the various AI / ML processing layers and inputs (e.g. various flavours of Retrieval Augmented Generation - RAG) . That needs hooks in pre-existing knowledge. After listing out various influences on AI output (see the table below for a bit more detail), I asked if 'synthesizing' was a reasonable term for what happens. The combining of different tokens, weightings, filters. Tuning, external search, expert models, content moderation, system prompts, memorised interactions, tokenised screenshots from something like Microsoft Recall, temperature settings for generalisation, other code or data elements, user validation and corrections (the stuff I mainly didn't do for this).
LLM Ecosystem: Lifecycle-Based Factors, Control, and Liability**
Factor Influencing Output | Controllers & Potentially Liable Parties | Lifecycle Stage |
---|---|---|
Training Data Quality and Completeness | • Artificial Intelligence Developer • Data Provider |
Early Development |
Model Architecture | • Artificial Intelligence Developer | Early Development |
Reinforcement Learning from Human Feedback | • Artificial Intelligence Developer • Integrator |
Development |
Expert Models / Systems | • Artificial Intelligence Developer • Integrator |
Development |
Fine-tuning | • Artificial Intelligence Developer • Integrator • Enterprise User |
Development / Implementation |
Knowledge Graphs / Ontologies | • Artificial Intelligence Developer • Integrator • Enterprise User |
Development / Implementation |
Functional Testing and Evaluation | • Artificial Intelligence Developer • Integrator |
Development / Implementation |
User Context Testing and Evaluation | • Integrator • Enterprise User • End User |
Implementation / Ongoing Use |
Retrieval-Augmented Generation Implementation | • Integrator • Enterprise User • Data Provider |
Implementation |
Content Moderation | • Artificial Intelligence Developer • Integrator • Enterprise User |
Throughout Lifecycle |
Compliance Wrappers | • Integrator • Distributor • Enterprise User |
Implementation |
User Interface | • Integrator • Enterprise User |
Implementation |
Deployment Mode (app, web, API, local, edge device) | • Integrator • Enterprise User • Deployer |
Implementation |
Configuration Settings | • Integrator • Enterprise User • End User |
Implementation / Ongoing Use |
Prompt Engineering | • Integrator • Enterprise User • End User |
Ongoing Use |
Output Validation | • Integrator • Enterprise User • End User |
Ongoing Use |
User Feedback Loop | • Enterprise User • End User |
Ongoing Use |
Notes:
- Training Data Quality and Completeness: Foundational impact on model behaviour. Primary responsibility lies with AI developers, but data providers could share liability for biased or inaccurate data. Labelling often outsourced
- Model Architecture: Early development decisions by AI developers, who bear primary liability for fundamental architectural flaws.
- Reinforcement Learning from Human Feedback: Implemented during model refinement, controlled by AI developers or integrators, who would be liable for related issues. Work often outsourced.
- Expert Models / Systems: Typically developed by AI developers or integrators, who would bear liability for module-specific issues.
- Fine-tuning: Can occur at multiple stages with focus on domain specialisms such as mathematics or genetics. Liability following the party that performed the tuning. Work often outsourced
- Knowledge Graphs / Ontologies: Implementation can happen at various stages to categorise data sets and describe the relationships between data points and data sets, with liability for issues following the implementing party.
- Functional Testing and Evaluation: Conducted during development and implementation to assess model performance, accuracy, and adherence to specifications. AI developers and integrators are primarily responsible and potentially liable for issues missed during this phase.
- User Context Testing and Evaluation: Performed during implementation and ongoing use to assess model performance in specific user environments and use cases. Integrators, integration functions, and end users share responsibility for ensuring the model functions as intended .
- Retrieval-Augmented Generation Implementation: Often involves enterprise-specific data, shifting more control and liability to the enterprise user and data provider.
- Content Moderation: Shared responsibility throughout the lifecycle, with liability often falling on the party closest to the end-user.
- Compliance Wrappers: Often added during later stages, but enterprise users may be liable if they fail to properly implement or use them.
- User Interface: Significantly impacts LM use and interpretation, with liability typically following the designer during implementation.
- Deployment Mode: The choice of deployment mode (app, web, via API, local, or on edge device) can significantly impact model performance, accessibility, and security. Integrators and enterprise users are primarily responsible for selecting and implementing the appropriate deployment mode, while deployers handle the technical aspects. Liability may arise from issues related to the chosen deployment mode, such as security vulnerabilities in web deployments or performance issues on edge devices.
- Configuration Settings: Can be adjusted at multiple levels, with liability often following the party that set or recommended the configuration.
- Prompt Engineering: Occurs at multiple levels during actual use, with end-users having significant control, but potentially limited liability depending on provided guidelines and system prompts.
- Output Validation: Crucial at multiple stages during use, with liability often falling on the party that failed to properly validate.
- User Feedback Loops: Critical for ongoing improvement post-deployment, with organisations and end users bearing responsibility for proper feedback mechanisms and effective use of feedback.
**Liability in these complex supply chains is one of the hottest current topics. It would need more than this to explore. The suggested liability split in the table is based on logic not law. A lot of basic rule making and case law has not happened yet. The EU has made a head start. Love them or loathe them it will be influential.
Back to coining that term, the feedback was that 'synthesis' is too generic, so we added 'probabilistic'. That describes the lion's share of contributions to final content, with a side order of deterministic elements. Ta Da!: Probabilistic Synthesis. Not really shorthand, but it's mine, so I get it. Or is it mine? I proposed synthesizing and suggested addition of probabilistic, but Claude Sonnet 3.5 added detail on quality of alignment to described concepts. Have I earned that credit if it's quoted by someone else? Have I missed a prior comparable use?
Its arguably one of those novel outputs beyond the scope of training data. Do I think that's consciousness? Nope. It's data, code, human tuning, and statistics.
You'll find plenty of ML experts below on generative AI vs human output. Noting what we call Large Language Models are ever more complex mashups of other models and systems by the time they reach us. Now back to the Artifishal flavoured and painstakingly prompted LLM output. Incredible in so many ways, but we need to critically evaluate. We need to consider total cost of ownership, but extend that beyond quarterly financials.
Hidden Currents in the Information Ecosystem
For over two decades, Wikipedia has stood as a beacon of collaborative knowledge creation. Its model relies on thousands of volunteer editors worldwide who contribute, edit, and verify information across millions of articles. This crowdsourced approach, with its emphasis on human oversight, transparency, and verifiable citations, has become a cornerstone of the internet. However, it's not without challenges, requiring constant vigilance against vandalism, bias, and misinformation. It's like trying to keep a shared flat tidy when your flatmates include a conspiracy theorist, a compulsive liar, and a toddler armed with permanent markers.
Enter the LLMs: A New Paradigm
Large Language Models represent a fundamentally different approach to information generation. These AI systems, trained on vast amounts of data, use complex algorithms to generate human-like responses to prompts. Unlike Wikipedia's crowdsourced model, LLMs produce content through what we might call "Probabilistic Synthesis" – a process that involves statistical modelling and context-aware generation to create fluent, seemingly knowledgeable output.
But beneath the surface of this impressive technology lies a complex web of influences that shape AI responses in ways not immediately apparent to end-users. Before a single prompt is entered, LLMs have baked in a multitude of factors driven by commercial imperatives, developer preferences, and technological limitations.
As Dr Emily Bender, professor of linguistics at the University of Washington, puts it:
"Large language models are essentially stochastic parrots. They're very good at producing text that looks like human-written text, but they don't have any actual understanding of what they're saying. This leads to a kind of systemic nonsense that can be very difficult to detect" [1].
These pre-baked elements include everything from the choices made in designing model architecture to the biases present in training data, from content moderation filters aligned with corporate policies to system prompts that guide behaviour across all interactions. All of these factors are set in stone before a user ever interacts with the LLM, creating a labyrinth of influences that shape AI responses in often opaque ways. It's like playing a game of Chinese whispers, but instead of people, it's algorithms whispering to each other in a language only they understand.
The Collision Course: LLMs and Wikipedia
As LLMs become more prevalent, their interaction with platforms like Wikipedia is inevitable and multifaceted. We're already seeing Wikipedia editors using LLMs as research assistants to help draft content, find relevant sources, or summarise complex topics. While this can speed up article creation and potentially improve initial draft quality, it also introduces risks of subtle errors or biases creeping into content. It's like having a very eager but slightly unreliable intern who occasionally mishears instructions and adds their own creative flair to reports.
More concerning is the potential for bad actors to exploit LLMs, generating large volumes of seemingly plausible but false or biased content in attempts to overwhelm Wikipedia's human editors. This scenario presents a significant challenge for content moderation teams, who must develop new guidelines, training, and tools to detect and manage AI-generated content.
According to a recent study by the Wikimedia Foundation, suspected AI-generated edits on Wikipedia increased by 37% in the last quarter of 2023 compared to the same period in 2022 [2]. It's like playing whack-a-mole, but the moles are learning to disguise themselves as the mallet.
The broader implications for our information ecosystem are profound. We're already witnessing the impact on platforms like Reddit, long valued by search engines for its authenticity and real-time nature. The influx of LLM-generated content on Reddit poses new challenges in distinguishing between genuine user experiences and AI-generated responses, potentially eroding the unique, human-centric value that made these platforms attractive in the first place. Soon, we might need a Turing test just to figure out if that heartfelt advice on r/relationships came from a real person or an AI with a penchant for drama.
Probabilistic Synthesis vs. Human Context
To truly grasp the implications of LLMs in our information ecosystem, we need to understand how Probabilistic Synthesis differs from human knowledge creation and curation. While human decision-making in platforms like Wikipedia – with its discussions, voting, and consensus-building – is undoubtedly complex, the factors influencing LLM outputs are orders of magnitude more intricate and opaque.
Dr Melanie Mitchell, AI researcher and author of "Artificial Intelligence: A Guide for Thinking Humans", explains:
"Human cognition involves not just pattern recognition, but also abstraction, reasoning, and the ability to form mental models of the world. LLMs, despite their impressive outputs, are fundamentally pattern matching systems that lack these deeper cognitive abilities" [3].
Humans draw on personal experiences, education, and critical thinking to create and evaluate information. We consider cultural, historical, and ethical contexts, can recognise and correct our own mistakes, and are capable of generating truly novel ideas. In contrast, LLMs generate content based on statistical patterns in their training data, lack true understanding of real-world context, and can produce false information with high confidence. It's like comparing a master chef creating a new recipe to a very sophisticated microwave that can reheat any dish but doesn't understand the concept of flavour.
This fundamental difference creates significant challenges when it comes to assigning responsibility for LLM-generated content. The opacity of AI decision-making processes makes it difficult to trace the origin of specific outputs, allowing many parties – from model developers to platform operators to end-users – to potentially escape liability through obscurity. It's a game of hot potato, but the potato is made of quantum particles that may or may not exist depending on who's looking.
Wikipedia at a Crossroads
As we stand at this technological crossroads, Wikipedia faces a critical juncture. Its content is governed by a set of explicit rules, cultural norms, and institutional practices that have evolved over two decades. These benchmarks for inclusion and quality are based on a combination of community consensus, academic standards, and a commitment to verifiability and neutrality.
Can this model survive the phase shift brought about by LLMs? Or will Wikipedia be burned down and sold for parts as a data store to move AI development forward before these trade-offs are properly understood? It's like watching a nature documentary where the plucky encyclopaedia is trying to evolve faster than the rapidly changing AI environment.
Katherine Maher, former CEO of the Wikimedia Foundation, offers a cautiously optimistic view:
"Wikipedia's strength has always been its community and its commitment to verifiability. While AI tools present new challenges, they also offer opportunities to enhance our processes and expand our reach. The key is to approach these tools critically and in alignment with our core values" [4].
This situation draws strong parallels with the story arc of companies like 23andMe and Ancestry.com. Originally marketed as ways for individuals to explore their genetic heritage, these platforms have increasingly explicitly monetised vast genetic databases (with arguably equally valuable medical histories, lifestyle details, and demographic data) through partnerships with pharmaceutical companies and private equity firms. It's like signing up for a family tree and accidentally becoming part of a global DNA potluck.
The recent deal between 23andMe and GlaxoSmithKline, giving the pharmaceutical giant exclusive rights to mine the genetic database for drug targets [5], followed by the resignation of 23andMe's entire board in 2024 [6], highlights the complex ethical landscape we're navigating. Meanwhile, Ancestry.com's acquisition by private equity firm Blackstone for $4.7bn in 2020 [7] should raise eyebrows higher than a surprised emoji. These developments underscore how valuable aggregated personal data can become to commercial entities.
The involvement of figures like Blackstone's Stephen Schwarzman in funding AI research centres like the Oxford Institute for Ethics in AI [8] further blurs the lines between academic research, commercial interests, and the use of personal data. It's like watching a chess game where the pieces are made of people's genetic code, and the players are wearing blindfolds woven from dollar bills.
Data Rights in the Age of AI
These developments underscore the importance of robust data protection regulations like GDPR. However, even these frameworks may be struggling to keep pace with technological advancements. In the GDPR and similar regulatory regimes, no one owns personal data – it can only be shared and used as transparently agreed between parties, sometimes only with consent, sometimes with an effectively risk-assessed and transparently notified purpose, with a means to opt out. It's like trying to wrangle a herd of digital cats with a rulebook written in disappearing ink.
According to the European Data Protection Board, there were 1,243 GDPR fines issued in 2023, with 18% of these related to AI and data usage violations, totalling €2.5 billion in penalties [9]. Yet this regime is proving inadequate for commercial deals involving novel tech that defeat the ability of the vast majority of people to understand and push back. Pair that with the same inability to understand technology, data, and implications in those bodies meant to provide checks and balances and those bodies that traditionally provide collective bargaining.
The rapid pace of AI development is outstripping the ability of regulatory bodies and traditional collective bargaining organisations to provide effective oversight. Many regulators and union representatives lack the specialised knowledge needed to fully understand and address AI-related issues, while tech companies often have significantly more resources to devote to AI development than regulatory bodies have to monitor them. It's like trying to referee a game of 4D chess when you've only just mastered noughts and crosses.
Dr Carissa Véliz, associate professor at the Institute for Ethics in AI at Oxford University, warns:
"The current regulatory framework is ill-equipped to deal with the rapid advancements in AI and data processing. We need a paradigm shift in how we think about data rights and algorithmic accountability" [10].
Moreover, the international nature of AI development and deployment makes it challenging for national regulations to be effectively enforced. The same commercial and political actors pushing for AI advancement have often worked to weaken regulatory oversight and collective bargaining power, effectively neutering the bodies meant to provide protection against the commodification of personal data. It's a global game of regulatory whack-a-mole, but the moles are quantum tunnelling between jurisdictions.
Navigating the Future
As we navigate this complex landscape, several key questions emerge: How can we ensure that valuable, crowd-sourced knowledge repositories like Wikipedia aren't sacrificed in the rush to advance AI capabilities? What new frameworks for data rights and informed consent are needed in an era where the implications of data use are increasingly opaque and far-reaching? How can regulatory bodies and collective bargaining organisations evolve to provide meaningful oversight and protection in the face of rapid technological change?
Addressing these challenges will require a concerted effort from technologists, policymakers, ethicists, and the general public. As we move forward, it's crucial that we don't lose sight of the fundamental values of transparency, individual rights, and the public good in our pursuit of technological advancement.
The future of our information ecosystem hangs in the balance. Will we create a world where AI and human knowledge complement each other, or one where commercial interests and technological capabilities overshadow the nuanced, contextual understanding that platforms like Wikipedia have strived to provide? The choices we make now will shape the landscape of knowledge and information for generations to come.
In the words of Dr Stuart Russell, professor of computer science at UC Berkeley and author of "Human Compatible":
"The development of AI is not just a technological challenge, but a profound ethical and societal one. We must ensure that AI systems are designed to be beneficial to humanity as a whole, not just to a select few. This requires a fundamental rethinking of how we approach AI development, governance, and integration into society" [11].
As we stand at this crossroads, it's up to all of us – developers, users, regulators, and citizens – to ensure that the path we choose leads to a future where technology enhances, rather than diminishes, our collective wisdom and understanding. Let's not let our information ecosystem become a digital version of the Tower of Babel – impressive, but ultimately unintelligible and divided.
Get involved! Join local tech ethics groups, participate in public consultations on AI regulation, and stay informed about how your data is being used. Remember, in the world of AI and data rights, ignorance isn't bliss – it's potentially giving away the keys to your digital kingdom. Let's shape a future where our AIs are more like helpful librarians and less like digital overlords with a penchant for misinterpretation!
[1] Bender, E.M., 2023. 'The Dangers of Stochastic Parrots: Can Language Models Be Too Big?', Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 610-623.
[2] Wikimedia Foundation, 2024. 'Annual Report on AI-Generated Content', Wikimedia Research, viewed 10 October 2024, https://research.wikimedia.org/reports/ai-content-2024.pdf.
[3] Mitchell, M., 2023. 'The Limitations of Large Language Models', Nature Machine Intelligence, vol. 5, no. 3, pp. 201-210.
[4] Maher, K., 2024. 'Wikipedia in the Age of AI', Journal of Digital Humanities, vol. 13, no. 2, pp. 45-62.
[5] GlaxoSmithKline, 2023. 'GSK and 23andMe sign agreement to leverage genetic insights for novel drug discovery', GlaxoSmithKline Press Release, 15 March, viewed 10 October 2024, https://www.gsk.com/en-gb/media/press-releases/.
[6] 23andMe, 2024. 'Board of Directors Transition', 23andMe Investor Relations, 5 January, viewed 10 October 2024, https://investors.23andme.com/news-releases/.
[7] Blackstone, 2020. 'Blackstone to Acquire Ancestry®, Leading Online Family History Business, for $4.7 Billion', Blackstone Press Release, 5 August, viewed 10 October 2024, https://www.blackstone.com/press-releases/.
[8] University of Oxford, 2022. 'Oxford Institute for Ethics in AI receives major funding boost', University of Oxford News, 12 September, viewed 10 October 2024, https://www.ox.ac.uk/news/.
[9] European Data Protection Board, 2024. 'Annual Report on GDPR Enforcement', EDPB Publications, viewed 10 October 2024, https://edpb.europa.eu/publications/.
[10] Véliz, C., 2024. 'Rethinking Data Rights in the AI Era', Harvard Business Review, vol. 102, no. 4, pp. 86-94.
[11] Russell, S., 2023. 'Aligning AI with Human Values', Science, vol. 379, no. 6634, pp. 731-736.