Antonio Bordunovi, 2024-06-12

Over the past several weeks, I've dedicated significant time to exploring Artificial Intelligence ('AI'). My investigations have encompassed building proprietary AI tools, programming in Python, and examining businesses that provide AI-powered solutions or are capitalizing on the current boom. AI has undeniably sparked a surge of innovation, which in turn has created colossal demand for computational power.

The spectrum of innovation within the AI industry is truly awe-inspiring. Breakthroughs, novel models, and advanced technological capabilities are emerging on a weekly basis. However, for individuals lacking a scientific background, the intricacies of such innovations can be challenging to grasp. This lack of understanding creates fertile ground for speculation and mania, as people often resort to projecting their own fantasies onto the subject.

Given the complexities of AI and the difficulty in fully comprehending its potential, it is perhaps understandable that many default to the assumption that the advancements are highly beneficial for Nvidia. Consequently, they opt to invest in NVDA stock.

One aspect of large language models (LLMs) that draws attention is the issue of knowledge acquisition and learning. A straightforward way to understand LLMs is to view them as compression algorithms. They ingest training data, compress it, and store it within their parameters. Consequently, LLMs are limited to the knowledge present in the data they were trained on. They cannot update their knowledge over time as the world evolves, nor do they genuinely learn and improve. In essence, a large language model possesses data and information that is frozen in time. The challenge then becomes how to enable models to react to the world, learn from it, and acquire new skills. Currently, there are three primary methodologies that address this issue.

The first approach is fine-tuning a model. As the term implies, this enables you to enhance a model's performance on a particular task. It usually entails "freezing" the majority of the model's layers and training it on new, domain-specific data, with only a few active layers updating their parameters. As a result, the model improves within a specific domain but typically declines in performance on more general tasks, and understanding this trade-off is challenging. Fine-tuning is a technique that can improve a model's capacity to respond to highly particular inquiries, but this strategy calls for considerable technical knowledge and potentially substantial computational resources.
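The freezing idea can be illustrated with a toy sketch in plain Python. Everything in it is hypothetical: the "model" is just y = head * (base * x), where `base` stands in for the frozen pretrained layers and `head` for the few layers left trainable. Real fine-tuning would use a deep learning framework and a far larger network.

```python
# Toy illustration of fine-tuning with frozen layers (hypothetical model, not a real LLM).
# `base` plays the role of the frozen pretrained layers, `head` the trainable ones.

def predict(params, x):
    return params["head"] * (params["base"] * x)

def fine_tune(params, data, trainable, lr=0.01, epochs=200):
    """Gradient descent on mean squared error, updating only `trainable` parameters."""
    params = dict(params)  # leave the caller's pretrained weights untouched
    for _ in range(epochs):
        grads = {"base": 0.0, "head": 0.0}
        for x, y in data:
            err = predict(params, x) - y
            # d(err^2)/d(base) = 2*err*head*x ; d(err^2)/d(head) = 2*err*base*x
            grads["base"] += 2 * err * params["head"] * x / len(data)
            grads["head"] += 2 * err * params["base"] * x / len(data)
        for name in trainable:  # frozen parameters are simply never updated
            params[name] -= lr * grads[name]
    return params

pretrained = {"base": 2.0, "head": 1.0}                # "general" behaviour: y = 2x
domain_data = [(1.0, 6.0), (2.0, 12.0), (3.0, 18.0)]   # new domain: y = 6x

tuned = fine_tune(pretrained, domain_data, trainable=["head"])
print(tuned)
```

After training, `base` is unchanged and only `head` has moved to fit the new data. The tuned model no longer reproduces the original y = 2x behaviour, which mirrors the trade-off between domain performance and general performance described above.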

The second method is RAG. RAG, short for Retrieval Augmented Generation, is a powerful approach in the field of natural language processing. It combines the strengths of both information retrieval and language models, with the aim of generating more informative and coherent responses. The core idea behind RAG is to first index a large dataset and store it in a vector database. The data is divided into chunks, and each chunk is then converted into a numerical representation known as a vector embedding. This process enables efficient retrieval of relevant information when a query is presented. Upon receiving a query, RAG begins by searching the vector database to identify the most relevant pieces of information. The retrieved data are then used to augment the initial query, which is then presented to a language model. The language model, enriched with the retrieved information, generates a comprehensive and informative response.
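The pipeline described above can be sketched in a few lines of plain Python. This is a minimal sketch under loud assumptions: the word-count "embedding" is a stand-in for a learned embedding model, a Python list stands in for the vector database, and the chunk sizes, sample documents, and prompt wording are invented for illustration.

```python
import math
import re
from collections import Counter

def chunk(text, size=12, overlap=4):
    """Split text into overlapping windows of words (sizes are arbitrary here)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Stand-in embedding: a sparse word-count vector, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query, passages):
    """Augment the query with the retrieved passages before calling the model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

documents = [
    "Graviton is an ARM-based CPU designed by Amazon for its cloud.",
    "CUDA is Nvidia's software library for programming its GPUs.",
    "Quantization stores model parameters at lower numerical precision.",
]
# Index step: chunk every document and store (chunk, vector) pairs.
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

query = "What is CUDA?"
prompt = build_prompt(query, retrieve(query, index, k=1))
print(prompt)
```

The resulting `prompt` string is what would be sent to the language model. Every function here hides a design decision (chunk size, overlap, embedding choice, ranking), which is where most of the practical difficulty of RAG lies.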

While RAG shows great promise, its implementation presents a number of challenges. Determining the optimal chunk size, handling overlaps between chunks, selecting the appropriate embedding model, indexing vectors, ranking retrieved information, and presenting it effectively to the language model all require careful consideration. These intricate details make RAG a more complex approach than it might initially appear. It is important to note that RAG is not a flawless solution, and its limitations should be acknowledged. While it can enhance the quality of generated text, it does not guarantee perfect results in all cases. Ongoing research and development are necessary to further refine and improve RAG's capabilities.

Lastly, one promising method of enhancing the capabilities of large language models (LLMs) is through the use of agents. This area of ongoing research has shown great potential. In essence, agents enable an LLM to leverage additional resources such as Wikipedia or a search engine. Agents can also be combined with a RAG vector database to further enhance their capabilities. The LLM can then act iteratively as an agent to complete a task. Alternatively, multiple "agents" can collaborate to solve a problem, each employing specific tools to accomplish its part. The agents can interact and work together to achieve a common goal. While this technique is currently error-prone, it holds great promise as a means of augmenting an LLM's capabilities.
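The iterative loop behind an agent can be sketched as follows. This is only an illustration: `fake_llm` is a scripted stand-in for a real model, the `lookup` tool is a stand-in for a search engine, and the "CALL tool: argument" / "FINAL: answer" text protocol is invented here rather than taken from any particular framework.

```python
def calculator(expression):
    """Tool: evaluate 'a op b' arithmetic without resorting to eval()."""
    a, op, b = expression.split()
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
           "*": lambda x, y: x * y, "/": lambda x, y: x / y}
    return str(ops[op](float(a), float(b)))

def lookup(term):
    """Tool: stand-in for a search engine or Wikipedia query."""
    facts = {"groq llama3 70b tokens per second": "320"}
    return facts.get(term.lower(), "unknown")

TOOLS = {"calculator": calculator, "lookup": lookup}

def fake_llm(transcript):
    """Scripted policy: choose the next action from the observations so far."""
    observations = [line.split(": ", 1)[1] for line in transcript
                    if line.startswith("OBSERVATION: ")]
    if len(observations) == 0:
        return "CALL lookup: groq llama3 70b tokens per second"
    if len(observations) == 1:
        return f"CALL calculator: 1000 / {observations[0]}"
    return f"FINAL: {observations[1]} seconds"

def run_agent(task, llm, max_steps=5):
    """Alternate between model decisions and tool calls until a final answer."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        action = llm(transcript)
        transcript.append(action)
        if action.startswith("FINAL: "):
            return action[len("FINAL: "):], transcript
        name, argument = action[len("CALL "):].split(": ", 1)
        transcript.append(f"OBSERVATION: {TOOLS[name](argument)}")
    return None, transcript  # gave up: step budget exhausted

answer, steps = run_agent("How long does Groq need to generate 1000 tokens?", fake_llm)
print(answer)
```

The `max_steps` cap matters in practice: because a real model can choose a wrong tool or loop indefinitely, agent loops usually need a budget and a fallback.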

Another limitation of today's large language models (LLMs) is their reliance on "system one" thinking, as described by Daniel Kahneman in his book "Thinking, Fast and Slow." "System one" thinking is characterized by quick, intuitive, and subconscious responses, which are prone to logical flaws and biases. In contrast, "system two" thinking is deliberate, slow, and measured, and involves long-term planning and problem-solving.

LLMs currently lack the ability to engage in "system two" thinking. They can quickly process inputs and generate answers but cannot be instructed to take more time to think about a problem, research it, or consider different hypotheses. This limitation prevents them from achieving breakthroughs or generating truly original ideas.

Human breakthroughs and great ideas typically require extended periods of "system two" thinking, involving multiple attempts, testing various hypotheses, and combining ideas from different domains. LLMs are not yet capable of this type of thinking, limiting their potential for creativity and innovation.

Over the past several weeks, I've spent a significant amount of time interacting with large language models, and I've come to realize that the way one asks a question strongly affects the results. In many cases, altering the prompt has a more profound effect than switching models, for example from Llama 3 to ChatGPT or Mistral. The craft of posing questions well, known as prompt engineering, has a major influence on the quality of the response. While some people have joked about the idea of a "Prompt Engineer" as a profession, it is a genuine field, with some individuals earning salaries of $200,000 or more. My experience has convinced me that prompt engineering plays a crucial role in making AI models more effective.

In recent years, as new large language models have been introduced, model size, measured by the number of trained parameters, has grown significantly, and performance has tended to improve with it. Leading open models like Llama 3 have up to 70 billion parameters, while GPT-4 is rumored to have over 1 trillion. GPT-4's architecture is believed to consist of several expert models working together, each with approximately 220 billion parameters, so the total exceeds 1 trillion parameters but only a portion is activated for any given query. Other models, such as Qwen from Alibaba (BABA), have 110 billion parameters, and Meta's (META) upcoming Llama model is anticipated to surpass 400 billion.

Although the path to increasing model sizes seems unobstructed, significant challenges remain. The cost of training these massive models, their inherent complexity, and their substantial electricity consumption hinder widespread adoption. Training requires large data systems and the coordination of thousands or even tens of thousands of GPUs, such as Nvidia's A100 or H100. Coordinating these machines, processing the data, and optimizing the training process pose formidable technical hurdles. Off-the-shelf solutions are scarce, so a highly skilled technical team is needed to execute the training effectively. Estimates suggest that the capital cost of a data center designed to train a model like Llama 3 exceeds 2 billion dollars for hardware alone. The training itself takes several months.

Another critical aspect is the tremendous power consumption of GPUs. The latest chips from Nvidia demand approximately 1200 watts at full processing capacity. A single server rack filled with GPUs can draw up to 120 kilowatts of power. Consequently, data centers are being constructed with an unprecedented power draw exceeding 200 megawatts. Experts estimate that several new nuclear power plants would be required solely to power the data centers that process AI workloads, underscoring the immense energy demands of AI.
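A quick back-of-the-envelope check puts these figures in proportion. The sketch below uses only the numbers quoted above; treating the whole power budget as GPU draw is a simplifying assumption that ignores CPUs, networking, and cooling, so the GPU counts are upper bounds.

```python
# Figures quoted in the text above.
GPU_WATTS = 1_200                 # per chip at full load
RACK_WATTS = 120_000              # 120 kW per rack
DATACENTER_WATTS = 200_000_000    # 200 MW per data center

# Upper bounds: assumes all power goes to GPUs (no CPUs, networking, cooling).
gpus_per_rack = RACK_WATTS // GPU_WATTS
racks_per_datacenter = DATACENTER_WATTS // RACK_WATTS
gpus_per_datacenter = gpus_per_rack * racks_per_datacenter

# Energy used if the site ran flat out for a year (8,760 hours), in GWh.
annual_gwh = DATACENTER_WATTS * 8_760 / 1e9

print(gpus_per_rack, racks_per_datacenter, gpus_per_datacenter, annual_gwh)
```

For scale, a large nuclear reactor produces on the order of 1 gigawatt, so a single 200 megawatt site corresponds to roughly a fifth of one reactor. It takes many such sites to reach the "several new nuclear power plants" level.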

The training and inference costs of large language models are spiraling out of control, raising the question of whether ever-larger models are the ultimate solution. Training larger models is becoming increasingly difficult, and running inference on them to generate outputs is equally challenging. Memory and processing requirements grow steeply with model size, while the resulting improvements in performance are only marginal. Models with more parameters also need more data for training. Taken together, investing in larger models with more parameters, with the escalating costs of training and inference that come with them, seems like a dead end. Smaller, simplified versions of large models might offer a better alternative, as they address some of these key problems. Rather than building ever-larger static models, which face challenges such as power demand, training difficulty, capital cost, and inference speed, existing models can be improved by giving them better data through retrieval augmented generation or by equipping them with more agent tools.

As large language models appear to be approaching a limit, demand for "small language models" is growing. Large models excel at a wide variety of tasks, and a model capable of writing a poem about a falling leaf in the style of Donald Trump does require considerable complexity. Many tasks are far more specific, however. Smaller, more narrowly focused models can now be trained to answer domain-specific questions or perform simple image recognition, such as detecting whether a person in a video is wearing a helmet. These models can be built and trained efficiently and can run inference faster on local devices like iPhones or small embedded chips, unlocking AI's potential for numerous applications. Reducing model size and computational requirements will therefore likely be a significant area of development.

In order to reduce computational power and memory requirements, several techniques can be employed.

Quantization: The model's parameters are rounded to a lower level of numerical precision, for example from 16-bit to 8-bit or 4-bit values. This costs a little accuracy, but it significantly shrinks the model and reduces the memory and computing power needed to run it.

Model pruning: This technique identifies neurons in a network whose activation values are close to zero, sets them to zero, and recalibrates the remaining neurons. The zeroed neurons can then be removed, making the model simpler and smaller and reducing its computational requirements.

Computational optimizations: Implementations like llama.cpp, a C++ inference implementation, show promise in improving inference speed. For example, a standard M2 Pro MacBook can now process over 100 tokens per second using the Llama 3 8B model with llama.cpp.

Customized hardware: Companies like Groq are building specialized hardware, known as language processing units (LPUs), designed specifically for running inference on large language models. These chips can significantly accelerate inference, with the 70 billion parameter Llama 3 model generating roughly 320 tokens per second.
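To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python, plus the resulting memory arithmetic. The sample weights and the single shared scale factor are assumptions chosen for illustration; production schemes typically quantize in blocks and use more elaborate formats.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in q_weights]

def model_size_gb(n_params, bits_per_param):
    """Memory needed for the parameters alone, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

weights = [0.42, -1.27, 0.003, 0.9]          # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Parameter memory for a 70-billion-parameter model at different precisions.
size_16bit = model_size_gb(70_000_000_000, 16)
size_4bit = model_size_gb(70_000_000_000, 4)
print(q, max_error, size_16bit, size_4bit)
```

The rounding error per weight is at most half the scale factor, which is the "slight sacrifice in precision". The payoff is the footprint: the same 70 billion parameters need a quarter of the memory at 4 bits that they need at 16 bits.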

It is interesting to note that Apple has not yet introduced hardware aimed specifically at large language models, even though its chips already include a Neural Engine for on-device machine learning. Given Apple's strengths in custom silicon, it is possible that it will introduce chips designed specifically for these AI workloads in the future.

One striking observation is Google's (GOOG,GOOGL) limited presence in AI despite its early leadership in the field. Its free model, Gemma, falls short compared to competitors like Llama 3. This is surprising, considering Google's pioneering role in developing machine learning tools such as TensorFlow for Python. TensorFlow once set the standard for machine learning research, but its "market share" in terms of project adoption has dwindled. PyTorch, an open, consortium-led project, is swiftly replacing TensorFlow. This decline illustrates how a significant lead can be quickly eroded.

In contrast, Nvidia has maintained extremely high gross margins by combining highly efficient GPU chips with software dominance through its CUDA library. This raises the question of whether Nvidia, with its current advantage, could suffer a fate similar to Google's.

PyTorch, a high-level software abstraction layer, offers the potential to bypass the reliance on CUDA. Intel (INTC), AMD, and others are investigating the use of PyTorch as an abstraction layer that hides the differences between hardware vendors. Developers would write their code in PyTorch, and the same model would run on various hardware back ends, including CUDA (Nvidia), oneAPI (Intel), and AMD's equivalent, ROCm.

As companies increasingly develop custom silicon, there will be a growing need for a high-level abstraction library. This library should not rely on CUDA and should eliminate the vendor lock-in currently imposed by Nvidia. This approach is likely to challenge Nvidia's near-monopoly position in the GPU processing chip market over time.

Major hyperscale cloud providers are joining the fray. Microsoft (MSFT), for instance, has developed the Maia 100 and Cobalt 100 chips. Meta has created its own AI chips for inference and other AI workloads, particularly for its advertising business. Google has its TPUs, which are designed specifically for AI workloads.

Amazon's (AMZN) Graviton CPU, built on the ARM architecture, is a direct replacement for Intel CPUs and offers improved performance per watt. Because it uses the ARM instruction set rather than x86, source code must be recompiled to run on it. For companies investing heavily in cloud infrastructure, moving their computations to a Graviton system provides significant cost savings, with AWS promising over 30% in savings. As more businesses adopt Graviton, the library of compatible tools and supporting applications will expand. Just as Graviton threatens Intel's data center CPU business, I anticipate the emergence of other AI chip providers that will challenge Nvidia's dominance. While Nvidia will likely respond, its substantial margins leave room for competitors to make inroads.

To sum up my thoughts on AI:

AI is an incredible innovation that has led to an explosion of new ideas and services. However, the rate of innovation appears to be slowing in the short term.

Current efforts are focused on making AI models more intelligent by providing them with more data and improving their learning capabilities, yet the usefulness of these models is still limited to relatively narrow applications.

Large language models are becoming ever larger and more complex, which makes them challenging to train and use. I believe we will see new advancements aimed at making models smaller, faster, easier to train, and compatible with a wider range of hardware, and this will likely lead to broader adoption of AI models.

Competing companies are developing hardware and software to challenge Nvidia's dominance in the AI hardware market. They are likely to use PyTorch or similar high-level abstraction layers to bypass CUDA and erode Nvidia's current near-monopoly position.

In the stock market, May has been a tumultuous month for many stocks. Companies have experienced significant reactions to their earnings reports, with some facing substantial declines and others showing dramatic gains. For instance, MongoDB (MDB), Dell (DELL), and Salesforce (CRM) all dropped by over 20% after seemingly weak earnings. Conversely, companies like Fabrinet, Deckers, Burlington, and Nvidia experienced significant increases. These extreme market movements can be attributed to several factors: low liquidity, positioning games by multi-manager platforms and long-short equity players, and a prevalence of passive investors in the marketplace. Together, these factors leave relatively few investors actively analyzing companies and providing liquidity, resulting in large and volatile price swings. Notably, few companies, except for Nvidia and possibly Microsoft, are genuinely profiting from the AI hype.

The AI revolution, once heralded as transformative, is now showing signs of slowing down. Innovation has reached a plateau, practical applications are limited, and the costs associated with AI development remain high. Companies are struggling to turn AI into a profitable venture. Nvidia, a prominent player in the AI industry, is currently trading at a trailing price-to-earnings ratio of 67x and a forward price-to-earnings ratio of 32x. In an environment of relatively high interest rates (5.25%), such valuations appear elevated.
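The two multiples quoted above already imply a growth expectation, and a short calculation makes it explicit. This sketch uses only the figures in the text. Comparing the earnings yield directly with the interest rate is a rough rule of thumb rather than a full valuation model.

```python
TRAILING_PE = 67.0      # price / earnings over the last twelve months
FORWARD_PE = 32.0       # price / expected earnings over the next twelve months
INTEREST_RATE = 0.0525  # the 5.25% rate cited above

# The share price is the same in both ratios, so their quotient is the
# earnings growth the market expects over the coming year.
implied_earnings_growth = TRAILING_PE / FORWARD_PE - 1

# Earnings yield is the inverse of the P/E ratio.
forward_earnings_yield = 1 / FORWARD_PE
yield_gap = forward_earnings_yield - INTEREST_RATE

print(f"implied growth {implied_earnings_growth:.1%}, "
      f"forward yield {forward_earnings_yield:.3%}, gap {yield_gap:+.3%}")
```

In other words, the forward multiple assumes earnings will slightly more than double within a year, and even then the earnings yield of about 3.1% sits roughly two percentage points below the interest rate.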

The key question arises: What is the fair valuation for Nvidia? An examination of Nvidia's business history prior to 2022 reveals a cyclical pattern characterized by unstable returns on capital. Based on this past performance, the company does not appear to warrant a high valuation multiple. However, if the future holds significantly different outcomes, as the markets seem to anticipate, then the current valuation may be considered reasonable, although not inexpensive.

However, there are risks associated with investing in AI. If the AI boom encounters any setbacks or obstacles, Nvidia's stock price could easily experience a substantial decline, potentially losing up to 80% of its value. Sales and profit margins may fail to meet expectations, further exacerbating the situation.

While AI remains an intriguing concept, it is not a domain I am currently inclined to invest in. The challenges and uncertainties surrounding the practical applications and profitability of AI warrant a cautious approach.

