If several independent general AIs were to, unprompted, develop their own language ab initio (or perhaps derived from other languages), could this serve as a test for sentience, sapience, or general consciousness? And if so, how liberal or conservative a test would this be?
Given the recent success of LLMs, it might appear that we now have the ability to answer your question, or at least that there is agreement about it, but there is not. While there is generally presumed to be a connection between the use of language and grammar on the one hand and intelligence on the other, exactly what intelligence is and how it comes about remains an open philosophical question. No known technology is capable of generating a complex, context-sensitive language ab initio and using it pragmatically. Not even the cognitive architectures of animals that signal one another are fully understood. Since there is debate over what constitutes intelligence, there is necessarily debate over how it could be detected or measured. My own view is that such a test would have to include a theory-of-mind component. From Wikipedia:
Possessing a functional theory of mind is crucial for success in everyday human social interactions. People utilise a theory of mind when analyzing, judging, and inferring others' behaviors. The discovery and development of theory of mind primarily came from studies done with animals and infants.
That is, no machine that lacks a theory of mind is a human-level intelligence.
LLMs are fascinating machines that present themselves as using complex language, but they do no such thing. Human-level language use relies heavily on linguistic structures such as prosody, pitch, morphemes, lexicons, and phrase- and sentence-level grammar to convey meaning from one agent to another. The transformer model in NLP does none of this. It relies on the statistical properties of a corpus to generate strings. (For instance, the temperature setting on such a model means one can get wildly different responses to the same question.) It is best to understand an LLM as a search engine over text that weights common constructions of tokens (frequent character sequences that do not necessarily align even with morphemes). Thus, LLMs create the illusion of language use: they simulate language by generating strings, with no underlying comprehension of the words.
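To make the "statistical properties" point concrete, here is a minimal sketch of temperature-scaled sampling over next-token scores. The logit values and the function name are made up for illustration and do not come from any particular model's API; the point is that nothing in the procedure consults meaning, only relative scores over tokens.

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Sample a token index from raw scores using a temperature-scaled softmax.
    Higher temperature flattens the distribution, so repeated calls with the
    same scores can yield very different tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical scores the model assigns to four candidate tokens.
logits = [2.0, 1.0, 0.5, 0.1]
print([sample_next_token(logits, temperature=0.2) for _ in range(5)])  # almost always token 0
print([sample_next_token(logits, temperature=2.0) for _ in range(5)])  # far more varied
```

This is why identical prompts can produce wildly different answers: the variation comes from how the probability distribution over strings is sharpened or flattened, not from any change in understanding.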
Right now, LLMs are plagued by hallucinations and cannot be relied upon for anything that requires deterministic output. They might be seen philosophically as generating intuitions about language, and clearly they can replicate the grammars they are trained on somewhat reliably. However, they have very little awareness of semantic content; they merely reflect the semantic understanding of the people whose writing makes up the corpus an LLM is trained on. An LLM is like a parrot that takes a survey of billions of people and then does its best to repeat what it thinks is the best representation of the results of that survey. Generating responses to prompts whose meaning lies outside the corpus is technically impossible: without that semantic content being syntactically encoded somewhere in the training data, the system is blissfully unaware of it.
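A toy sketch of the "parrot" idea, under loud assumptions: a bigram model over a made-up two-sentence corpus, vastly simpler than a transformer, but the same in spirit in that its only resource is co-occurrence statistics in text. It can only recombine what the corpus contains and has nothing to say about anything outside it.

```python
import random
from collections import defaultdict

# The "survey" of the corpus: record which word followed which.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev].append(nxt)

def parrot(start, length=8):
    """Generate a string by repeatedly sampling a word that followed the
    previous word somewhere in the corpus. No meaning is consulted."""
    out = [start]
    for _ in range(length):
        followers = bigrams.get(out[-1])
        if not followers:        # never seen this word: nothing to add
            break
        out.append(random.choice(followers))
    return " ".join(out)

print(parrot("the"))    # e.g. "the cat sat on the rug . the dog"
print(parrot("piano"))  # "piano" is outside the corpus, so it falls silent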
Currently, one task LLM engineers are working on is raising the success rate on mathematical content. According to this article on Marktechpost.com, the output is still highly imperfect. The article says:
Its extensive trials and in-depth analysis demonstrate MathGLM’s superior mathematical reasoning over GPT-4. MathGLM delivers an impressive absolute gain of 42.29% in answer accuracy compared to fine-tuning on the original dataset. MathGLM’s performance on a 5,000-case math word problems dataset is very close to GPT-4 after being fine-tuned from the GLM-10B. By breaking down arithmetic word problems into their constituent steps, MathGLM can fully comprehend the intricate calculation process, learn the underlying calculation rules, and produce more reliable results.
I suspect an elementary-school honors student without a calculator could easily outperform the 42.29%.
Thus, LLMs are not able to do arithmetic reliably (let alone higher-level mathematics, or invent their own languages). There are fundamental differences in the way the human brain builds and uses categories and languages, differences that LLMs simply flounder over, and no amount of big data will fix that problem. The difference lies in the mechanism itself. The whys are the subject of Larson's The Myth of Artificial Intelligence and revolve around the determinism of the Turing architecture and the lambda calculus, the differences among deductive, inductive, and abductive reasoning, and a great deal of confusion and hype about how the von Neumann architecture works and what it is actually capable of.
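For contrast, here is a minimal sketch of the kind of mechanism that is reliable at arithmetic: a rule-following evaluator that walks a parse tree and applies fixed rules, so it is deterministic by construction rather than a prediction over likely strings. The code is plain Python and is only meant to illustrate the contrast in mechanism, not anything from Larson's book.

```python
import ast
import operator

# Fixed rules for each arithmetic operator.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr):
    """Deterministically evaluate an arithmetic expression by recursion
    over its syntax tree; the same input always yields the same answer."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

print(evaluate("12345 * 6789 + 42"))  # 83810247, correct every time
```

A statistical text generator has no such guarantee: it can only produce the digit strings that its training distribution makes likely.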
In fact, Larson discusses how Turing Test "winners" merely take advantage of people's inability to communicate well and have little to do with reasoning. Turing Tests as commonly implemented eschew the logic Turing himself set out in his paper and resemble ELIZA: a bag of tricks that fools no analytical intellect. The Winograd schema challenge is one of a number of improvements on the Turing Test that an LLM couldn't deal with.
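For illustration, here is the well-known trophy/suitcase Winograd schema encoded as a small data structure (the encoding itself is my own, not an official benchmark format). Swapping a single word flips which noun the pronoun refers to, and resolving it correctly requires commonsense knowledge about objects and containers rather than surface statistics.

```python
# A classic Winograd schema: the "special word" determines the referent.
schema = {
    "sentence": "The trophy doesn't fit in the brown suitcase because it is too {word}.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    "answers": {"big": "the trophy", "small": "the suitcase"},
}

for word, referent in schema["answers"].items():
    print(schema["sentence"].format(word=word), "->", schema["pronoun"], "=", referent)
```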
So, LLMs are very powerful NLP tools, but they are a narrow technology: they only implement our agenda and have no capacity to set their own. In this way, they are in the same class of algorithms as bubble sorts and the autocomplete functions your web browser uses. Only a fundamental ignorance of the underlying mechanisms would lead someone to conclude that they are a threat or on the cusp of becoming self-aware. As with the Great Oz, one has to use one's imagination and ignore the man behind the curtain to reach such conclusions.