The Hidden Battle Behind “Open” AI: Open Source vs Open Weight

In the world of artificial intelligence, a war of words is raging. Behind the technical terms “open source” and “open weight” lie crucial stakes for the future of technology. A deep dive into a distinction that will determine who controls tomorrow’s AI.

Artificial intelligence is going through a defining phase. As generative AI models transform our societies, a fundamental question divides industry players: what does “open AI” really mean? This question, far from being purely semantic, determines who gets access to these technologies and how they will develop.

Two approaches are clashing today. On one side, “open weight” models, favored by many companies. On the other, the authentic “open source” approach, defended by free software organizations. To understand this distinction, we must first grasp how an artificial intelligence model works.

The Fundamentals: How an AI Model Works

An artificial intelligence model relies on three essential elements. First, the “weights”: millions or billions of numerical parameters that determine the model’s responses. These weights are obtained through “training,” a process that progressively adjusts these values. Second, the training data: texts, images, or other content used to teach the model. Third, the source code: the computer programs that orchestrate the training and operation of the model.
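To make these three elements concrete, here is a minimal, illustrative sketch in PyTorch. A toy linear model stands in for the billions of parameters of a real system: the weights are the model’s parameters, the training data is a small random tensor, and the source code is the loop that progressively adjusts the weights.

```python
import torch
import torch.nn as nn

# "Weights": numerical parameters; a tiny linear layer stands in for the
# millions or billions of parameters of a real model.
model = nn.Linear(4, 1)

# "Training data": a toy dataset; real models use vast text or image corpora.
inputs = torch.randn(32, 4)
targets = torch.randn(32, 1)

# "Source code": the program that orchestrates training, progressively
# adjusting the weights to reduce the error on the training data.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()    # compute how each weight should change
    optimizer.step()   # adjust the weights accordingly
```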

This architecture explains why not all “open” models are equal. Depending on which elements are shared or kept secret, the possibilities for use and improvement vary considerably.

Open Weight: Selective Openness

The “open weight” approach consists of publishing only the weights of the trained model. This strategy allows developers to use the model and adapt it to their specific needs. However, it keeps the crucial elements of its creation in the shadows.

Concretely, receiving an “open weight” model is like being handed a fully assembled car without the engineering blueprints, the parts list, or the factory specifications. The user can drive the vehicle and even modify it superficially, but remains unable to understand its internal mechanisms or reproduce its manufacturing.

This limitation is not trivial. Without access to the training data, it’s impossible to evaluate the model’s potential biases or understand its strengths and weaknesses. Without the source code, the training process cannot be reproduced, preventing any independent verification of the claimed performance.
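In practice, working with an “open weight” model often looks like the following sketch, based on the Hugging Face transformers library; the repository name is a hypothetical placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "some-org/open-weight-model"  # hypothetical identifier

# The published artifacts: the weights and the files needed to run them.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# From here the model can be run and fine-tuned, but the training data and
# the training code that produced these weights are typically not published,
# so the training run cannot be independently reproduced or audited.
```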

Added to this are custom licenses, written by the model publishers themselves and presented as “free,” that are often more restrictive than established standards such as Apache or MIT. Meta’s Llama models illustrate these restrictions perfectly. Despite their “open” labeling, some of them remain inaccessible to European users due to legal constraints that the company refuses to lift. This situation reveals the limits of conditional, geographically selective openness.

Authentic Open Source: Total Transparency Required

The Open Source Initiative, the organization that stewards the definition of open source, has established strict criteria for artificial intelligence. A truly “open source” model must provide all of its components: complete weights under a free license, detailed documentation of the training data, source code allowing the training to be reproduced, and exhaustive technical documentation.

This approach is inspired by the four fundamental freedoms of free software, adapted to the AI context. Freedom of use permits using the model without restrictions on application or sector. Freedom of study allows detailed understanding of the model’s operation and decision mechanisms. Freedom of modification allows the model to be adapted to specific needs. Finally, freedom of redistribution encourages sharing improvements with the entire community.

These principles create a virtuous circle of collaborative innovation. Each improvement can be shared, studied, and integrated by other developers, accelerating global technological progress.

The Contrasting Landscape of Current Initiatives

Faced with these definitions, sector players adopt diverse strategies, each with its advantages and risks.

Pioneers of Total Transparency

Organizations like EleutherAI, the Allen Institute for AI, and Hugging Face have chosen the path of maximum transparency. These projects share not only the weights of their models but also the training data and creation processes. Their approach allows complete reproduction of the work and independent verification of results.

However, this transparency comes with significant legal risks. EleutherAI had to remove several components of “The Pile,” its well-known dataset, following copyright challenges. A Dutch project built on Llama was deleted entirely for license violation. These incidents reveal the legal gray areas threatening the open source ecosystem.

The Emergence of Legally Secure Solutions

Faced with these uncertainties, a new generation of initiatives prioritizes legal security. The Common Corpus project, for example, compiles only data whose distribution is legally authorized. This approach minimizes copyright risk and allows redistribution without fear of litigation.
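As an illustration, such a corpus can be inspected with the Hugging Face datasets library. The identifier below is the one under which Common Corpus appears to be distributed on Hugging Face; verify it before relying on it.

```python
import itertools
from datasets import load_dataset

# Streaming avoids downloading the full corpus just to look at it.
corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Print the metadata fields of a few documents (source, license, etc.).
for doc in itertools.islice(corpus, 3):
    print(list(doc.keys()))
```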

Models by Daijobu AI, developed in France, follow a similar philosophy by guaranteeing compliance with European regulations, notably the AI Act and the exceptions provided for text and data mining. While these models are not necessarily “more open” technically, they offer crucial legal security for institutional and commercial adoption.

The Challenges of License Continuation

Some projects experiment with an even stricter approach: “license continuation.” According to this principle, a model trained on Wikipedia should inherit that encyclopedia’s license. This logic, intellectually coherent, proves practically unmanageable.

Combining sources under different licenses (Creative Commons, the GNU Free Documentation License, the French Licence Ouverte) becomes an intractable legal puzzle. This approach is only viable for projects based exclusively on public-domain material, considerably limiting the possibilities for innovation.
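A hypothetical sketch of why the puzzle collapses toward the public domain: once every document’s obligations must be inherited by the model, only licenses with no conflicting conditions survive the filter. The field names and license identifiers here are illustrative, not a legal recommendation.

```python
# Licenses that impose no obligations to inherit.
COMBINABLE = {"public-domain", "cc0-1.0"}

sources = [
    {"name": "wikipedia-extract", "license": "cc-by-sa-4.0"},  # share-alike
    {"name": "gnu-manual", "license": "gfdl-1.3"},  # incompatible with CC BY-SA
    {"name": "old-novels", "license": "public-domain"},
]

# Strict "license continuation" forces the intersection of all obligations,
# which in practice often leaves only public-domain material.
usable = [s for s in sources if s["license"] in COMBINABLE]
print([s["name"] for s in usable])  # ['old-novels']
```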

The DeepSeek Shockwave

DeepSeek’s arrival on the market has disrupted established balances. By publishing its cutting-edge models under the permissive MIT license, this Chinese company demonstrated that a radically open approach was not only possible but also competitive.

This demonstration exposed the limitations of partial openness strategies adopted by other players. When a high-performing model becomes available without restrictions, legal subtleties and artificial limitations lose their economic justification.

The impact goes beyond the technical domain. DeepSeek revealed an uncomfortable reality: many companies exploit the ambiguity between open source and open weight to maximize their benefits. They harvest contributions from the open source community without real reciprocity, preserving their competitive advantage through the proprietary elements they withhold.

The European Regulatory Framework Takes Shape

The European Union is not standing idle in the face of these issues. The AI Act and its accompanying Code of Practice for general-purpose AI are redefining the rules applicable to artificial intelligence models. These texts notably impose mandatory traceability of training data and increased transparency about the sources used.

Respect for the “text and data mining” exception becomes a legal obligation, not just good practice. Developers must now document their sources precisely and honor the opt-out rights expressed by rights holders.
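Opt-outs can be expressed in several machine-readable ways; one common signal is robots.txt. Below is a minimal sketch, using only Python’s standard library, of checking that signal before collecting a page for training data; the URLs and crawler name are placeholders.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt policy.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/articles/some-page"
if rp.can_fetch("MyTrainingCrawler", url):
    print("Allowed to fetch:", url)
else:
    print("Opt-out expressed, skipping:", url)
```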

These regulations, perceived by some as constraints, could paradoxically clean up the market. By imposing clear standards, Europe forces players to choose between authentic transparency and marketing communication about their supposed “openness.” It also paves the way for the emergence of a truly sovereign European AI.

Nevertheless, many uncertainties remain. The use of copyrighted content for training remains a controversial subject, with variable legal interpretations across jurisdictions. This situation discourages innovation and favors organizations with substantial legal resources.

Practical Guide for Developers

In this complex landscape, developers must adopt a methodical approach to choose their tools.

For standardized commercial applications, an “open weight” model may suffice if needs don’t require understanding or modifying training processes. This option offers usage flexibility while maintaining relative legal simplicity.

Conversely, for research, auditing critical systems, or developing innovative solutions, the complete transparency of open source becomes indispensable. Only this approach allows deep understanding of mechanisms and continuous improvement.

In all cases, careful examination of licenses is essential. Restrictions can hide in contractual details, with major implications for final use. Anticipating regulatory change by favoring models that already comply with emerging standards is another wise precaution.
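As a starting point for that examination, a model’s declared license can be read from its Hugging Face metadata; a minimal sketch, with a hypothetical repository id (the full license text should still be reviewed by a human):

```python
from huggingface_hub import model_info

info = model_info("some-org/candidate-model")  # hypothetical repo id

# Hugging Face exposes the declared license as a "license:..." tag.
licenses = [t.split(":", 1)[1] for t in (info.tags or []) if t.startswith("license:")]
print("Declared license:", licenses or "not declared")
```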

Naturally, Daijobu AI can support you in these technological choices, which are central to your company’s development.

A Technological Governance Issue

The distinction between open source and open weight far exceeds technical considerations. It fundamentally determines who will be able to understand, improve, and democratize these technologies that are transforming our societies.

This battle defines the future balance between open innovation and proprietary control. It directly influences the ability of researchers, public institutions, and small companies to participate in artificial intelligence development.

The future is taking shape between two scenarios. The first would see the emergence of a truly open ecosystem, based on transparency and collaboration. The second would maintain the dominance of a few major players using terminological ambiguity to preserve their competitive advantages.