Summary
Google (Alphabet), Meta, Microsoft and Apple are among the top five firms in the world by reported R&D expenditure – collectively accounting for more than €100bn of R&D spending.*
It is, of course, all but impossible to know in detail how firms prioritize their R&D budgets – a gap compared with the civilian science budgets of governments, which tend to be relatively legible.
Ahmadov (2022) conducted an analysis of funding statements in a sample of 130k academic papers with a view to understanding the influence of Big Tech on the scientific enterprise.
The author argued that the focus of the firms was, unsurprisingly, on computer science, with notably less emphasis on “socio-political impacts of platformisation, the psychological effects of social media on children, or the environmental cost of large database exploitation”.†
Big Tech staff also write research papers (the extent of output varies by company). My goal is to gradually build a picture of what R&D means to them, such as we can know it in this labyrinth, based on reading these papers.
Research papers (currently notes subject to extensive revision)
Joon Sung Park, et al., 2024, Generative Agent Simulations of 1,000 People (arXiv)
This paper picks up a very old idea of simulating human behavior with computers.†† The goal of the present study was to apply the latest generative-agent methods.
1,000 people, recruited by the market research firm Bovitz, were interviewed verbally by an AI system. The AI asked questions about the participants’ life stories and their views on current societal issues, generating follow-up questions as it went.
The system produced thousands of words of input from each participant which was then fed back with the aim of creating a network that would “imitate the person based on their interview data”.
The imitations were then evaluated “on their ability to predict their source participants’ responses to…surveys and experiments commonly used across social science disciplines”.
The agents predicted participant responses with “an average normalized accuracy of 0.85 (std = 0.11)” in a General Social Survey (GSS) evaluation. The authors reported that their agents were also able to predict the “five personality dimensions: openness, conscientiousness, extroversion, agreeableness, and neuroticism”.
However, the agents did less well at predicting the outcomes of economic games, namely the “Dictator Game, the first and second player Trust Games, the Public Goods Game, and the Prisoner’s Dilemma”, with a normalized correlation of 0.66 (std = 2.83).
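As I read it, “normalized accuracy” divides the agent’s raw agreement with the participant by the participant’s own test-retest agreement (people do not answer identically twice). The paper’s exact normalization may differ, but a minimal sketch of that reading, with hypothetical response arrays, looks like this:

```python
import numpy as np

def normalized_accuracy(agent_answers, answers_wave1, answers_wave2):
    """One plausible reading of 'normalized accuracy': the agent's raw agreement
    with the participant, divided by the participant's own test-retest agreement.
    All arguments are equal-length lists of categorical survey responses."""
    agent = np.asarray(agent_answers)
    wave1 = np.asarray(answers_wave1)
    wave2 = np.asarray(answers_wave2)
    raw_accuracy = np.mean(agent == wave1)        # agent vs. the participant's first answers
    self_consistency = np.mean(wave1 == wave2)    # participant vs. themselves, asked again later
    return raw_accuracy / self_consistency if self_consistency > 0 else float("nan")

# Hypothetical responses to ten GSS-style items.
print(normalized_accuracy(
    ["a", "b", "a", "c", "b", "a", "a", "b", "c", "a"],   # agent's predictions
    ["a", "b", "a", "c", "b", "b", "a", "b", "c", "a"],   # participant, first sitting
    ["a", "b", "a", "c", "a", "b", "a", "b", "c", "a"],   # participant, second sitting
))
```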
The authors argued that “when informing language models about human behavior, interviews are more effective and efficient than survey-based methods”.
Overall, we could say the authors reported some correlations between their imitations and a handful of standard surveys. It is difficult to say if this finding is significant at this juncture.
If the AI interview methodology were to be less arduous for participants than conventional survey methods – or required less human input from investigators – it could in theory be labor-saving (and potentially, cost saving).
However, the paper offered no data on these aspects of the problem.
The authors worked with a management consulting firm, Gemic, holding focus groups with 54 “knowledge workers” across sectors such as law, journalism and education, to uncover “expectations of generative AI’s impact”.
The prevalent belief of the participants, as reported in the paper, was that generative AI would lead to unemployment and deskilling.
The paper evidently does not examine the political economy of the issue. What are actually problems of political economy, such as deskilling, were instead positioned by the authors as “research challenges”.
The work does, nevertheless, tell to my mind an often negative story.
The authors quote one participant saying “I just feel like under capitalism there can be no good AI…the people at the top are always concerned about making a profit and cutting out jobs and whatnot”.
Qian, et al., 2024, Understanding the Dataset Practitioners Behind Large Language Models (arXiv)
The authors interviewed 10 staff members in Google who work on projects related to understanding “unstructured, text-based datasets for LLM development” (termed “dataset practitioners”).
They found that staff reported prioritizing “data quality” while depending on “their own intuition” to assess it.
Staff used relatively old technology such as spreadsheets and files formatted in CSV (dating from the last century) as well as Google Colab (Python coding and annotation features in a web browser).
These processes were not necessarily automated and seemingly drew on few common standards.
The authors identified two main “data exploration patterns”, namely, “visually inspecting data in spreadsheets” and “crafting custom analyses in Python notebooks”.
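As an illustration of the second pattern, a “custom analysis in a Python notebook” might amount to no more than a few lines of pandas over a small sample; the rows and column names below are my own invention, not the paper’s:

```python
import pandas as pd

# Hypothetical rows standing in for an unstructured, text-based LLM dataset.
df = pd.DataFrame({
    "prompt":   ["Summarize this article ...", "Translate to French ...", "ok"],
    "response": ["The article argues that ...", "D'accord, voici ...", ""],
    "source":   ["web_crawl", "human_written", "web_crawl"],
})

# The kind of quick, intuition-driven quality checks the paper describes:
print(df["source"].value_counts())          # where the examples come from
print(df["response"].str.len().describe())  # distribution of response lengths
print(df[df["response"].str.len() < 10])    # eyeball suspiciously short responses
```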
Given that the authors did not reveal the specifics of the data being analyzed, however, it is difficult to draw conclusions as to the overall meaning of the work.
The authors used a deep learning system to make diagnostic predictions of body mass index, blood pressure, cholesterol and white blood cell count, among other metrics, based on smartphone photos of eyes.
The system was initially trained on 123,130 images from 38,398 patients in Los Angeles. It was tested on 798 participants in Taiwan and also trained on these participants. Interpreting the results is difficult.
The authors base their analysis on the area under the receiver operating characteristic curve (AUC).
“It is the probability that a model predicts a higher risk for a randomly selected patient with the outcome of interest than a randomly selected patient without the outcome of interest…If the model has good discrimination and gives estimated risks for all patients with the outcome that are higher than all patients without, then the AUC will be 1. If the model discrimination is no better than a coin toss, then the AUC will be 0.5.”¶
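For concreteness, the AUC can be computed directly from that rank-based definition; a minimal sketch with made-up risk scores:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a randomly chosen case with the outcome
    receives a higher predicted risk than a randomly chosen case without it
    (ties count as 0.5)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Made-up predicted risks: 1.0 means perfect separation, 0.5 a coin toss.
print(auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0]))
```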
The AUC reported by Talreja, et al., was above 0.5 for all metrics, most notably eGFR (kidney function) and hemoglobin (anemia). In theory, therefore, the system could be used to screen patients based on photos of the eyes.
Ocular diagnosis of disease is, however, an expert topic in its own right (including recent literature concerning diagnosis through a deep-learning model). It would therefore have been good to know the authors’ views on how their deep learning model analyzed images of the eyes such as the retina.
Overall, the paper plays into a wider context about how medical professionals and their patients can know that AI-based medical devices “work”.§
Chen, et al., 2024, Designing a Dashboard for Transparency and Control of Conversational AI (arXiv)
The authors identified “how an AI response might depend on its model of the user” as an important aspect of AI transparency, noting as problems “sycophancy, where the system tries to tell users what they are likely to want to hear, based on political and demographic attributes, or sandbagging, where it may give worse answers to users who give indications of being less educated”.
They decided to create a “probe” that would reveal the AI’s view of the user’s age, gender, education, and socioeconomic status. This was apparently very difficult to achieve as it required creating fake conversations between two AIs (I will skip the details).
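For readers who want at least a sketch of what such a probe can look like: one common approach (not necessarily the authors’ exact method) is to train a simple classifier on the chat model’s hidden activations to predict a user attribute. A toy version under that assumption, with synthetic activations standing in for a real model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for hidden-state vectors taken from a chat model during
# conversations, each labelled with the (known) attribute of the simulated user.
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 64))        # 200 conversations x 64-dim activations
user_is_younger = rng.integers(0, 2, size=200)  # 0 = older, 1 = younger (synthetic labels)

# The "probe" is just a linear classifier over activations; at chat time it would be
# run on fresh activations to report what the model appears to assume about the user.
probe = LogisticRegression(max_iter=1000).fit(activations[:150], user_is_younger[:150])
print("held-out probe accuracy:", probe.score(activations[150:], user_is_younger[150:]))
```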
Their system reported back the assumed attributes to 19 users recruited to test the system. Sometimes it was accurate, other times it failed to understand the attributes of the user based on their responses to it.
Whether this system could form some kind of forensic tool that could be wielded in the public interest is an open point that was obviously not raised in the paper.
Rational choice theory – very much the basis of the legal system – would tend to suggest citizens need a suite of forensic tools (see, additionally, below).
Lee, et al., 2024, Extended Abstract: Machine Unlearning Doesn’t Do What You Think
The authors assert that “training data can (often inadvertently) contain sensitive, private, toxic, unsafe, or otherwise risk-inducing data—data that model trainers may want to purge from the model”.
But how to remove this training data from the model after it has been created? This seems quite an interesting problem not only from the angle proposed by the authors.
Citizens, or courts, for that matter, could expunge sensitive and private data directly from models without recourse to intermediaries and also verify that data had been expunged.
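To make the problem concrete: one commonly discussed family of “approximate unlearning” methods (not necessarily what the authors have in mind) fine-tunes the model to raise its loss on the data to be forgotten, for instance by gradient ascent on a “forget set”. A minimal sketch of that idea with a toy model:

```python
import torch

# Toy model standing in for "the model": a linear classifier over 10 features.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Hypothetical "forget set": examples whose influence we want to purge.
forget_inputs = torch.randn(32, 10)
forget_targets = torch.randint(0, 2, (32,))

# One step of crude approximate unlearning: gradient *ascent* on the forget set,
# i.e. push the model away from fitting that data. Whether this genuinely removes
# the data's influence is exactly the kind of claim the paper calls into question.
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(forget_inputs), forget_targets)
(-loss).backward()  # negate the loss so the optimizer step increases it
optimizer.step()
print("loss on forget set before the ascent step:", loss.item())
```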
Satyanarayanan, et al., 2022, Balancing Privacy and Serendipity in CyberSpace (ACM HotMobile)
This paper starts from the premise that the purpose of work is to facilitate “casual collisions between colleagues…recognized as catalysts for creativity and innovation”. The evidence cited is a quote from Steve Jobs.
Accordingly, the authors see their role as designing online methods that facilitate such “casual collisions” and propose a series of video-linked “freezones”.
A worker wishing to have a random encounter must enter a freezone, which then connects them with like-minded staff who have entered other freezones.
The authors raise the problem of confidentiality and offer technical solutions.
Evidently the paper is based on a narrow premise. Most work is not about random encounters. The broader context of the “anthropology of work” is important.
The authors developed a mathematical model to evaluate the effect of content moderation policies on participation and diversity of opinion in online forums.
Unfortunately, the paper is faulty in the way it constructs the problem.
The authors claim, without evidence, that “Gab, Parler, and Truth Social, hope to attract users with permissive moderation policies”. At that point, their arguments lost credibility.
In reality, those organizations originate dangerous right-wing propaganda. This is their salient feature and, indeed, their purpose.
We could say, therefore, that words such as “permissive” and “moderation” have very particular meanings in the Big Tech lexicon that are not much to do with their accepted meanings.
It is, as always, incumbent on progressive scholars to challenge this discourse.
An original sin was to understand platforms as different from publishers. This had practical impacts, such as in law, as well as impacts on the language used around the topic.
Freedom of speech is one thing, freedom to get paid to say what you like, quite another. It is an error to pretend these qualities are the same, albeit a convenient error for those in positions of power.
Platforms thus came to be seen as both samizdat and propaganda for powerful interests, generating political rhetoric that is complicated and, at times, hard to follow and understand.
To gain a full picture of why this awful situation arose we would probably need an account of the intersections of technology and political propaganda in general.||
The authors describe a “smartphone application based on novel computer vision algorithms” (MeasureNet in Amazon Halo) which they say offers “accurate and reliable” estimation of the waist-to-hip circumference ratio in adults – a high ratio presaging risk of heart attack and stroke.
The crucial quantitative comparison was achieved by a Bland-Altman plot of the differences between WHR from “flexible tape measurements made by skilled technicians” and WHR predicted by MeasureNet (with 1200 participants).
“The user inputs their height, weight, and sex into their smartphone. Voice commands from the application then guide the person to capture front-, side-, and back-viewpoint color images” (the images are captured with the arms held out from the sides, making it impossible for users to take the images themselves unless, I guess, they used a tripod).
The mean difference was 0.03 in both women and men. Regrettably, I did not see a confidence interval stated, though perhaps I did not read the paper carefully enough. Furthermore, the authors did not state whether the data were normally distributed nor, as far as I can tell, whether the mean difference changed systematically over the measurement range.
Bland-Altman is an old method to determine agreement between measurements taken with two different medical devices. It is typically calculated with a view to assessing a newer device in terms of performance relative to an older model and, therefore, is an important part of the process of obsolescence – but not the only part.
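For reference, the core Bland-Altman quantities are simply the mean of the paired differences (the bias) and the 95% limits of agreement around it; a minimal sketch with made-up WHR values (the paper’s own analysis may include refinements):

```python
import numpy as np

def bland_altman(reference, candidate):
    """Bias and 95% limits of agreement between two measurement methods,
    e.g. tape-measured WHR vs. a model-predicted WHR."""
    diffs = np.asarray(candidate, float) - np.asarray(reference, float)
    bias = diffs.mean()
    half_width = 1.96 * diffs.std(ddof=1)
    return bias, (bias - half_width, bias + half_width)

# Made-up waist-to-hip ratios for five participants.
tape_measured   = [0.84, 0.91, 0.78, 0.95, 0.88]
model_predicted = [0.86, 0.90, 0.80, 0.97, 0.91]
print(bland_altman(tape_measured, model_predicted))
```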
From the perspective of healthcare workers and their patients, presumably, “acceptable limits must be defined a priori, based on clinical necessity, biological considerations or other goals”.
Broader questions concern, for example, how AI tools might be assessed as would-be medical devices. The overall impression currently is sometimes of a device in search of a use, rather than the other way around. That is not necessarily a problem, but warrants thought.
A second point concerns how we might understand the ways in which software and information technology has changed medical practice and which specific items have been the most significant.
If, for example, we took a prevalent medical procedure such as caesarean section: where has software played an important role, what changed, and what specifically are we talking about – a particular computer algorithm, say?
This is not an easy question to answer across the board without dipping into fields such as ethnography to understand in detail what happens in medical settings. However, it might, possibly, give clues as to which devices would have biggest impact in future.
The authors developed an AI chat-bot, PRO-PILOT, to counsel “low-wage” staff facing “disgruntled clients”.
The bot listened in on artificial conversations with clients (created by combing social media for customer complaints, such as from airline passengers whose checked baggage had been delayed) and then supplied “emphatic messages” back to staff with a view to “regulating” their emotions.
This is a strand – using a would-be Turing machine as a form of psychological support – which goes back to the first chat-bot, ELIZA, created in the 1960s.
However, in the present form, as a comment on contemporary capitalism, it could be satire.
The authors argue that using GenAI can lead to “productivity loss” in software development. They categorize the ways it might do so, such as “workflow restructuring” and “task interruption” and offer what they term design solutions drawing on parallels with older human factors studies such as in airline cockpits and control rooms of paper making plants.
The authors seem to put particular emphasis on the paradigm of cockpit automation.
Without doubt, it is a particularly long-running and, indeed, well-documented example. It is also a field where human factors and political economy met unexpectedly in the writing of famous analysts (e.g., Charles E. Billings, NASA Ames Research Center).
Landmark studies by NASA published in the early 1990s found few significant differences between performance in automated and conventional (non-automated) cockpits, based on pilot evaluation of different models of the McDonnell Douglas DC-9 airliner.
“Clearly one may also conclude that we have not produced a case in favor of high technology cockpits – that the crews of the DC-9, a product of mid-1960 decade technology, performed just as well as those flying a very advanced, very expensive, modern technology aircraft” (p. 127).
“Technology-in-principle…did not work as technology-in-practice”. It did, however, cut the need for radio operators, navigators, and flight engineers, thereby depopulating the cockpit.#
What the authors of the current paper identify as an irony of automation is not really so. Irony implies an unexpected result. In this case, the result is not unexpected.
This is, of course, not to discount all automation as useless. But rather to say, we need to be very sensitive to the political economy of proposals to restructure workflows.
The authors seek to assert that their AI systems “substantially increase productivity” on some common tasks such as writing emails performed by “enterprise information workers” (several hundred freelancers offering services on UpWork and Amazon Mechanical Turk – as well as Microsoft employees).
In the studies reported, productivity was defined by speed (“output per unit time”), quality (mainly defined by accuracy), and effort (degree of exhaustion experienced or perceived energy expended by a participant). The latter was assessed by surveying participants but “future studies will seek to evaluate effort using other techniques, including functional neuroimaging”.
Digitally-enhanced Taylorism of gig workers (potentially with added neurological monitoring) sketches an apocalyptic vision.
One point concerns the ambiguous meanings of productivity and how they are used and abused.
Productivity is obviously a political term of art, and reading across from what Microsoft believes to the wider discourse is going to be misleading. The research described, however detailed, gives us only an insight into the thoughts of Microsoft managers – no more, no less.
Yang, et al., 2021, Local Factor Models for Large-Scale Inductive Recommendation (RecSys ’21)
The authors propose a model for detecting like-minded internet users and clustering them into small groups for content targeting.
Teevan, 2023, How the Web Will Shape the Hybrid Work Era: A Keynote at WWW 2022 (ACM SIGIR Forum)
Jones, et al., 2024, Teaching language models to hallucinate less with synthetic tasks (arXiv)
Singer, et al., 2024, Video Editing via Factorized Diffusion Distillation (arXiv)
Tao Tu, et al., 2024, Towards Generalist Biomedical AI, in: NEJM AI
Xin Luna Dong, 2024, Next-generation Intelligent Assistants for Wearable Devices (KDD ’24)
D’Ambrosio, et al., 2024, Achieving Human Level Competitive Robot Table Tennis (arXiv)
Gemini Team, Google, 2024, Gemini: A Family of Highly Capable Multimodal Models (arXiv)
Caron, et al., 2024, Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach (arXiv)
Wang, et al., 2024, A Case for Moving Beyond “Gold Data” in AI Safety Evaluation (CHI’24)
Haarnoja, et al., 2024, Learning agile soccer skills for a bipedal robot with deep reinforcement learning, in: Science Robotics
Zhang, et al., 2024, Human-aligned Chess with a Bit of Search (arXiv)
Li, et al., 2024, Unbounded: A Generative Infinite Game of Character Life Simulation (arXiv)
Lazaridou, et al., 2024, Augmenting machine learning language models using search engine results (US Patent US20240281659A1)
Kumar, et al., 2024, Training language models to self-correct via reinforcement learning (arXiv)
Notes:
*The 2023 EU Industrial R&D Investment Scoreboard (World 2500). The private sector, via the banks, is empowered to create unlimited amounts of money “from thin air” by issuing credit. This explains the notable volume of spending. Lambert, 2018, Monopoly Capital and Innovation: An Exploratory Assessment of R&D Effectiveness (MPRA Paper No. 89503), is informative on why firms might invest in R&D.
†Ahmadov, 2022, Big Tech and Research Funding: A Bibliometric Approach (Universidade Nova de Lisboa), p. 31.
††Borrelli and Wellmann, 2019, Computer Simulations Then and Now: an Introduction and Historical Reassessment, in: NTM Zeitschrift für Geschichte der Wissenschaften, Technik und Medizin, pp. 408-411. A complicated intellectual history. Simulating human behavior goes back to the origins of AI. Werbos, for example, developed back-propagation as part of a research program to predict political mobilization (funded by the military R&D agency, DARPA).
§Angehrn, et al., 2020, Artificial Intelligence and Machine Learning Applied at the Point of Care, in: Frontiers in Pharmacology; Zrubka, et al., 2023, The Reporting Quality of Machine Learning Studies on Pediatric Diabetes Mellitus: Systematic Review, in: Journal of Medical Internet Research; Howe, et al., 2024, Embedding artificial intelligence in healthcare: An ethnographic exploration of an AI-based mHealth app through the lens of legitimacy, in: Digital Health; Sarim Dawar Khan, et al., 2024, Frameworks for procurement, integration, monitoring, and evaluation of artificial intelligence tools in clinical settings: A systematic review, in: PLOS Digital Health
||In the last century, Switzerland, the Netherlands, the UK, (Weimar) Germany and Sweden banned political uniforms in public places – a telling historical example of legal attempts to stop Fascism (black shirts) but with complicated effects. It offers sidelights on current debates by, as it were, switching out the technology in question (the uniforms were apparently considered the height of modernity in those times, being explicitly compared to “high-speed transport and electric light”). Pollen, 2019, The Public Order Act: defining political uniform in 1930s Britain, in: Uniform: Clothing and Discipline in the Modern World (eds. Tynan and Godson).
#Billings, 1997, Aviation Automation: The Search for A Human-centered Approach, pp. 57, 61. Billings repeatedly cites insights from the classic Forces of Production by Noble.
Please contact me if you would like to know more.
Email: william@resorg.news
Dr. William Burns PhD MSc