Alibaba ATH-Token Foundry, in collaboration with the Gaoling School of Artificial Intelligence at Renmin University of China, has officially announced the open-source release of LOGOS, the first multi-domain scientific generative foundation model built on a unified scientific syntax. Using pure sequence modeling, the model matches or surpasses traditional domain-specific methods on six representative scientific tasks.

Notably, the model is highly parameter-efficient. LOGOS-1B, with only 1B parameters, outperforms the 8×7B-parameter version of Microsoft's NatureLM language model on multiple core tasks.

First-in-class unified scientific syntax for heterogeneous objects

The LOGOS team built a large pre-training corpus covering seven modalities, including biological macromolecules, chemical entities, and interface interactions, totaling 44.87B tokens. A shared vocabulary encodes previously heterogeneous objects, such as proteins and small molecules, as unified discrete token sequences.

This scientific syntax lets a large model handle different scientific objects autoregressively within the same generation space. LOGOS also introduces a "text description method": spatial interactions are written out as text, so the model can learn complex spatial interaction rules internally through sequence prediction alone, without taking 3D coordinates as input.
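To make the shared-vocabulary idea concrete, here is a minimal sketch of how two different object types could be mapped into one token ID space. It is an illustration under stated assumptions, not the LOGOS tokenizer: the tag names (`<protein>`, `<mol>`), the character-level tokenization, the reduced molecule alphabet, and the functions `build_vocab` and `encode` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical modality tags; the real LOGOS syntax may differ.
TAGS = {
    "protein": ("<protein>", "</protein>"),
    "molecule": ("<mol>", "</mol>"),
}

def build_vocab(alphabets: Dict[str, str]) -> Dict[str, int]:
    """Assign a single ID space to special tokens and every modality's symbols."""
    tokens = ["<bos>", "<eos>"]
    for modality, symbols in alphabets.items():
        tokens.extend(TAGS[modality])
        # Prefix each symbol with its modality so that protein "C" (cysteine)
        # and molecule "C" (carbon) receive distinct IDs.
        tokens.extend(f"{modality}:{s}" for s in symbols)
    return {tok: i for i, tok in enumerate(tokens)}

def encode(objects: List[Tuple[str, str]], vocab: Dict[str, int]) -> List[int]:
    """Encode a list of (modality, text) pairs as one flat token sequence."""
    ids = [vocab["<bos>"]]
    for modality, text in objects:
        open_tag, close_tag = TAGS[modality]
        ids.append(vocab[open_tag])
        ids.extend(vocab[f"{modality}:{ch}"] for ch in text)
        ids.append(vocab[close_tag])
    ids.append(vocab["<eos>"])
    return ids

# Toy alphabets: the 20 standard amino acids and a tiny SMILES subset.
ALPHABETS = {
    "protein": "ACDEFGHIKLMNPQRSTVWY",
    "molecule": "CNO()=",
}
VOCAB = build_vocab(ALPHABETS)

# A protein fragment followed by acetic acid in SMILES notation.
sequence = encode([("protein", "MKT"), ("molecule", "CC(=O)O")], VOCAB)
```

Because both objects end up in the same sequence of integer IDs, a single autoregressive model can be trained on them with an ordinary next-token objective.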

Eliminating the gap between pre-training and application

In traditional research paradigms, moving to a different research stage often means switching models, and deploying each model requires extensive fine-tuning. LOGOS keeps form and purpose consistent: the sequence format of its pre-training data is identical to the input and output formats of downstream tasks.

This alignment effectively closes the gap between pre-training and downstream applications, so the model's generation capabilities can be used directly, without complex adaptation layers. Alibaba has fully open-sourced the model weights, inference code, and technical report.
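The point about identical pre-training and downstream formats can be sketched in a few lines. The example below is a hypothetical illustration, not LOGOS's actual data format: the tags and the helpers `format_pair`, `make_prompt`, and `parse_completion` are assumptions, and the "model output" is simulated by slicing a known record rather than by calling a real model.

```python
def format_pair(protein: str, smiles: str) -> str:
    """A pre-training record: a protein followed by a molecule (hypothetical tags)."""
    return f"<protein>{protein}</protein><mol>{smiles}</mol>"

def make_prompt(protein: str) -> str:
    """A downstream prompt: the same record, cut off where generation should begin."""
    return f"<protein>{protein}</protein><mol>"

def parse_completion(completion: str) -> str:
    """Read the generated molecule up to the closing tag."""
    return completion.split("</mol>", 1)[0]

record = format_pair("MKT", "CC(=O)O")
prompt = make_prompt("MKT")

# The prompt is a literal prefix of the pre-training record, so the model
# sees nothing at inference time that it did not see during pre-training.
is_prefix = record.startswith(prompt)

# Stand-in for model output: the remainder of the known record.
completion = record[len(prompt):]
molecule = parse_completion(completion)
```

Under this assumption, a conditional generation task is just continuation of a familiar sequence, which is why no task-specific adaptation layer is needed in the sketch.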