Towards Data-Centric Automatic R&D (2024)

Haotian Chen¹, Xinjie Shen¹, Zeqi Ye¹, Wenjun Feng¹, Haoxue Wang¹
Xiao Yang², Xu Yang², Weiqing Liu², Jiang Bian²
²Microsoft Research Asia
{ht1ian.chen, frinkleko, liamyzq}@gmail.com
fwj20020813@outlook.com, whx924@gmail.com
{xiao.yang, xuyang1, weiqing.liu, jiang.bian}@microsoft.com
Equally Contributed. Corresponding Author.

Abstract

The progress of humanity is driven by those few successful discoveries that are accompanied by countless failed experiments. Researchers often seek potential research directions by reading papers and then verifying them through experiments, a process that imposes a significant burden on researchers. In the past decade, data-driven black-box deep learning methods have demonstrated their effectiveness in a wide range of real-world scenarios, which further exacerbates the experimental burden on researchers and thus leaves many potential discoveries veiled. Automating such a research and development (R&D) process is therefore an urgent need. In this paper, we make the first effort to formalize this goal by proposing a Real-world Data-centric automatic R&D Benchmark, namely RD²Bench. RD²Bench benchmarks all the operations in data-centric automatic R&D (D-CARD) as a whole, navigating future work directly toward this goal. We focus on evaluating the interaction and synergistic effects of various model capabilities and on aiding the selection of well-performing, trustworthy models. Although RD²Bench is very challenging even for the state-of-the-art (SOTA) large language model (LLM) GPT-4, indicating ample research opportunities and the need for more research effort, LLMs possess promising potential to bring significant progress to D-CARD: they are able to implement some simple methods without adopting any additional techniques. We appeal to future work to take developing techniques for tackling automatic R&D into consideration, thus bringing the opportunity of a potentially revolutionary upgrade to human productivity.

1 Introduction

“I have not failed. I’ve just found 10,000 ways that won’t work.”

Thomas Alva Edison

Footnote: This is an open-source project starting in Oct. 2023. ¹Work done during an internship at Microsoft.

The advancement of human society and the enhancement of living standards are highly correlated with the development of technology [39, 33, 26, 4]. Numerous truths and principles remain undiscovered in the world, awaiting experimental exploration [36, 27]. Those few successful discoveries, accompanied by countless failed experiments, propel the frontiers of technology. Historically, scientific researchers, including Edison, have undertaken extensive experiments by conducting them manually. In the age of AI, the influence of data-driven solutions, such as machine learning (ML) systems, is rapidly expanding [19, 7, 21]. These systems are known for their strong fitting capabilities and their "black box" nature, which significantly increases the experimental load on researchers and hinders the process of identifying and validating effective methodologies. This paper concentrates on this critical scenario, which we refer to as Data-Centric Research and Development (R&D). To cope with the prohibitively expensive costs and the overwhelming volume of experiments required, we consider automating such an R&D process for higher research efficiency by leveraging the strong language understanding and programming abilities of state-of-the-art (SOTA) large language models (LLMs) [40]. A brief illustration of Data-Centric Automatic R&D (D-CARD) is shown in Figure 1.

[Figure 1: A brief illustration of Data-Centric Automatic R&D (D-CARD).]

The first step towards automatic R&D is to formalize the task and provide a benchmark for identifying potentially effective methods and research directions. Intuitively, an outstanding methodology identified by the benchmark should possess (1) strong language understanding ability, to identify the implementable methods or ideas (e.g., formulations and models) in the given raw information (e.g., papers, reports, websites, etc.), and (2) strong implementation ability, to accurately implement the methods by programming and then obtain reliable experimental results. Previous work focuses on benchmarking different aspects of these two abilities. Specifically, the language understanding ability of LLMs is partly evaluated by analyzing their performance on relation extraction [43], question answering [57], and other natural language processing (NLP) tasks [29]. Meanwhile, the implementation ability of LLMs is partly tested through benchmarks such as SWE-Bench [11], ToolBench [30], ML-Bench [16], and MetaTool [1], which study their ability to solve GitHub issues, use tools to program, and determine whether to use tools in a given scenario.

In this paper, we make the first effort to investigate the capabilities of SOTA LLMs in tackling automatic R&D and propose a Real-world Data-centric automatic R&D Benchmark (RD²Bench). The scenario studied by RD²Bench possesses two unique characteristics that fundamentally differentiate it from others. First, RD²Bench focuses on the real-world scenario where all the operations in R&D are automated and evaluated as a whole, thus navigating related future research efforts directly toward the goal of developing human technology more effectively. This real-world scenario requires more comprehensive and advanced model capabilities and exhibits new challenges. Second, we study real-world automatic R&D in data-centric settings to navigate future work toward the urgent experimental-exploration needs brought by black-box data-driven models. Compared with existing benchmarks, RD²Bench possesses two significant advantages:

(1) RD²Bench evaluates the interaction and synergistic effects of various model capabilities instead of focusing on a single aspect of ability, which not only captures the frontier of SOTA LLMs but also bridges the gap between studying "individual abilities" and "real-world synergistic effects of abilities". In automatic R&D, an ML system can fail to complete the task even if it possesses both strong information extraction ability and strong programming or tool-using ability: although it succeeds in extracting methods and implementing them, it may still fail to select the appropriate data from the datasets, or it may misunderstand either the descriptions of data features or the requirements expressed by the prompts. Additionally, exhaustively enumerating all the aspects worth benchmarking is extremely challenging, a difficulty that RD²Bench sidesteps by evaluating the abilities jointly.

(2) RD²Bench tends to select well-performing trustworthy models instead of models that fail to learn rationales and causality yet achieve outstanding performance. Specifically, ML systems easily achieve SOTA performance on previous benchmarks by shortcut learning or by learning spurious correlations instead of rationales or causality [20, 9, 6, 46, 5]. This renders a benchmark ineffective and misleading, as it fails to accurately identify well-performing trustworthy methods. For example, an ML system can achieve SOTA performance on dog classification by merely recognizing grass [55]. RD²Bench, on the contrary, eliminates such models through its high difficulty and large scope. The decision rules of models have to simultaneously satisfy at least four major requirements: (1) accurately and comprehensively extracting the implementable methods; (2) precisely selecting the method-specific data for computation; (3) correctly writing the code according to the logic expressed by the methods and prompts; and (4) successfully storing the correct results in a predefined format. Therefore, the decision rules of models selected by this benchmark are stable (they work well in various situations) and thus get closer to rationales and causality [6].

We evaluate existing SOTA LLMs on RD²Bench to expose their bottlenecks and characterize future research directions. RD²Bench reveals new insights: (1) among the popular LLMs, GPT-4 exhibits promising potential in dealing with the D-CARD task; (2) detailed data descriptions significantly improve the performance of GPT-4; (3) the ability to query domain-specific knowledge is a basic requirement for D-CARD methods; and (4) the more complex the method is, the more unstable the model's performance becomes.

2 Related Work

2.1 LLM as Autonomous Agent

In the past few years, LLMs have made great achievements in both academia and industry [22, 42] and have surpassed previous levels on a number of classic tasks [56]. Research has shown that, with the growth of data volume and model size [58], LLMs exhibit emergent reasoning and other capabilities [23]. These capabilities enable LLMs to exhibit certain agent-like behaviors in tasks such as using or creating tools [31, 28], planning [54, 3], and memory. Therefore, more and more researchers have expressed expectations for their human-like and overall capabilities and have made preliminary explorations of LLMs as independent agents [44, 38]. Multi-agent collaboration [49, 14] has also been introduced to LLMs for better accuracy and generalizability. Moreover, to reduce human effort and enable automatic exploration, previous work has proposed autonomous LLM agents for general purposes [51, 37]. Optimistic views further hold that the realization of AGI may come from the evolution of autonomous LLMs, and some inspiring examples have been released [25].

However, most research still focuses on limited scenarios with clear and fixed questions and backgrounds. A recent work [53] has attempted to introduce LLMs to the R&D field and formalize the R&D process as a sequence of tasks. However, there is no easy-to-use benchmark for the community, and current R&D tasks may be too general to reveal significant signals. In this work, we propose a benchmark for LLMs on data-centric R&D tasks and provide a comprehensive evaluation.

2.2 Semi-Automatic R&D with Agents

Scientific research and development (R&D) is a time-consuming and important process. In the past, R&D has mainly been conducted by human researchers through countless failed experimental explorations and creative observations. Agents have been introduced to R&D to reduce human effort and explore automatically. Recently, there have been attempts to partly automate R&D, including automatic chemical synthesis planning [2], automatic molecular design [13, 35, 2], and automatic theorem proving [45, 52]. However, these attempts mainly focus on automatically searching for possible solutions and optimizations with symbolic representations [18] and heuristic techniques [48], and they rarely address the long-horizon planning, implementation, and reasoning needed to explore the next idea. Moreover, data-centric R&D tasks have not yet been explored in the community, and no benchmark has been proposed for them. Previous works have applied LLMs to real-world R&D tasks such as debugging issues [41, 12] or focus on data-centric but not real-world R&D tasks [17]. In this work, we propose a benchmark for LLMs on data-centric R&D tasks and evaluate their performance.

3 RD²Bench

Overall, our benchmark focuses on evaluating the finally implemented results given the raw information (e.g., papers, reports, websites, etc.). Moreover, we also provide human-annotated ground-truth information for the intermediate steps to support debugging and more comprehensive evaluation. RD²Bench selects well-performing models that follow human operations and accurately calculate the final results. We introduce the details of our proposed RD²Bench in the following sections. In Section 3.1 and Section 3.2, we introduce how we collect data and perform human annotation to form RD²Bench. Then, we elaborate on the two necessary steps for performing R&D, namely method extraction and method implementation, in Section 3.3 and Section 3.4. Finally, we detail our adopted evaluation metrics in Section 3.5.

3.1 Data Collection

We consider the raw information that contains formulas and models, which represent a wide range of methods proposed in the AI domain.

Data Collection with Formulas. We prepare raw information that contains formulas as the input of R&D. The raw information is presented as publicly available financial reports and stock trading data. The formulas are typically mathematical expressions that take complex numeric data about stocks, companies, and markets as input and output a time series of values. We collect financial reports with 27 implementable formulas distributed across three difficulty levels: easy, medium, and hard. Domain experts manually label the difficulty levels according to the complexity of implementation. To obtain the implementation results, an agent is expected to accurately select features from three types of trading data spanning 2010 to 2022, namely fundamental, price-volume, and high-frequency data. We denote the three types of data as Data I, II, and III, respectively.

Data Collection with Models. We collect papers with six open-sourced models [10, 34, 32, 15, 50, 47]. The implementations of the models adopt the PyTorch [24] and torch_geometric [8] frameworks to perform deep learning. All the papers and models are publicly available. We manually label the difficulty level (easy, medium, hard) of each task based on the complexity of implementation (computational graphs and tensor operations). We refer the readers to the appendix for more details about the dataset and the task.

3.2 Human Annotation

To provide a more comprehensive evaluation for debugging and analyzing, we conduct human annotation to provide the ground-truth results of our collected data, namely method extraction results and method implementation results.

Challenges. We confront five main challenges in the human annotation process. First, we need to identify the difficulty levels of methods to ensure the diversity of our benchmark and expose the bottlenecks of current models. Second, we have to identify and discard raw information whose methods demand unavailable data: the computation of some formulas can require confidential information that is not publicly available. Third, since the definitions or descriptions of some methods can be vague, leaving them unimplementable, we have to filter out such methods. Fourth, some domain-specific methods containing factual errors should also be filtered out, since they are not implementable. Fifth, we need to distinguish the domains and types of the methods according to their descriptions. To sum up, these challenges imply that the human annotation of RD²Bench requires a high time cost and substantial annotator expertise. Therefore, we commit extra effort to designing annotation guidelines and mechanisms that ensure dataset quality.

Annotation Guidelines and Mechanisms. We provide principled guidelines and mechanisms to the annotators. Specifically, we first clarify our requirements in the guidelines: the extracted ground-truth methods must be implementable without any additional information, and the ground-truth implementation results must be stored in the expected space with the predefined form and shape. Second, we provide training to the annotators. After the training step, we carefully design a test case and an interview to examine whether the annotators possess expertise in the target domain (e.g., the financial domain) and understand the guidelines. Third, we have each annotation result checked at least twice by annotators. A senior researcher and an applied scientist with domain expertise and rich engineering experience further check each annotation result together before it is finally included in the benchmark.

3.3 Method Extraction Step

We evaluate the ability of models to recognize and extract information from raw information (e.g., R&D context). A qualified model is expected to discern feasible methods (formulas and models) from extensive research materials and extract all necessary information for implementing these formulas. The ability serves as the foundational premise for subsequent code implementation.

We expect the model to accurately and comprehensively extract the methods mentioned in the research materials it reviews, including all essential conditions required to implement them. For methods with incomplete information, further implementation is not required; for methods with complete conditions, a model is expected to correctly comprehend the semantic meaning of these conditions stated in natural language and generate corresponding code. Specifically, we predefine the extraction format (key-value pairs) for the model. We employ the F1 score to measure the comprehensiveness and accuracy of method identification and extraction.

Note that some methods in the original materials might only imply their function, effect, or origin through their names without explicitly presenting their formulas, definitions, or other details. In such cases, the model may choose not to extract them or opt to autonomously complete them based on the semantics of the original materials. We expect the latter approach in future work, as it showcases the creativity of models by proposing new formulas and generating brand-new, informative, and reliable information. In the current version of the benchmark, only methods mentioned by name are evaluated in this manner; future iterations will explore and assess the model’s ability to generate new names and formulas when none are explicitly mentioned.
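To make the evaluation of this step concrete, the following is a minimal sketch of how key-value extraction results could be scored with precision, recall, and F1. The exact-name matching rule and the formula definitions shown are our own illustrative assumptions, not the benchmark's released scoring code.

```python
# Hypothetical extraction results: formula name -> extracted definition.
ground_truth = {
    "mid_price": "(bid_price + ask_price) / 2",
    "liquidity_imbalance": "(bid_size - ask_size) / (bid_size + ask_size)",
}
predicted = {
    "mid_price": "(bid_price + ask_price) / 2",
    "spread": "ask_price - bid_price",  # not in the report, so it counts as a false positive
}

# A method counts as correctly extracted if its key matches a ground-truth key
# (a stricter rule could also compare the extracted definitions).
true_positives = sum(1 for name in predicted if name in ground_truth)
precision = true_positives / len(predicted)
recall = true_positives / len(ground_truth)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```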

3.4 Method Implementation Step

In this section, we evaluate the performance of LLMs in implementing methods. Given all the necessary conditions provided to the model after the previous step, the model needs to select the necessary data and write code from scratch to implement the method, guided by an informative and well-organized prompt. Details of the prompt are included in the dataset and are also shown in the appendix. We encourage models to use Python and perform data analysis; they are also permitted to use common machine-learning libraries. One example of the method implementation step is shown in Figure 2.

[Figure 2: An example of the method implementation step.]
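For intuition, a successful implementation at this step might resemble the following minimal sketch for a simple mid-price-style factor. The file names source_data.h5 and result.h5 and the output format follow the prompt in Appendix D, while the column names bid_price and ask_price and the (date, instrument) index are illustrative assumptions rather than the benchmark's actual data schema.

```python
import pandas as pd

def calculate_mid_price() -> None:
    # Load the provided trading data (a pandas DataFrame stored in HDF5),
    # assumed here to be indexed by (date, instrument).
    df = pd.read_hdf("./source_data.h5")
    # Hypothetical formula: mid price as the average of the best bid and ask prices.
    factor = (df["bid_price"] + df["ask_price"]) / 2
    # Store the result as a single-column DataFrame named after the factor,
    # as required by the predefined output format.
    factor.to_frame(name="mid_price").to_hdf("result.h5", key="data")

if __name__ == "__main__":
    calculate_mid_price()
```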

3.5 Evaluation Metrics

We adopt multiple metrics to evaluate model performance at each step. For formula implementation, we adopt the average and maximum "running success rate", "format success rate", "Pearson correlation", and "value accuracy" across multiple independent attempts. We use "avg.", "exe.", "form.", "corr.", and "acc." to denote the average value, the number of successful executions, the number of matched result formats, the correlation, and the accuracy of the corresponding values, respectively. We refer the readers to Appendix A for details of the metric calculations.

For model implementation, we believe a successful implementation should be consistent with the ground-truth implementation, since a model can be viewed as a numeric function, i.e., a combination of tensor transformations. We therefore propose two metrics for the model architecture implementation task: the tensor shape consistency rate (tsc.) and the tensor value consistency rate (tvc.). Specifically, for each model layer, we calculate the consistency rate of the tensor shape and tensor value between the ground-truth implementation and the implementation generated by the LLM. All ground-truth tensor values are produced by running the ground-truth implementation code on random Gaussian inputs. The two metrics are defined as follows, where $S_{\text{shape}}^{i}$ and $S_{\text{value}}^{i}$ are the consistency rates of the tensor shape and tensor value at layer $i$, respectively, $\mathbf{Z}_{i}$ and $\mathbf{Z}_{i}^{*}$ are the ground-truth and generated tensors, and $d$ is the maximum length of the two tensors:

$$
\begin{gathered}
S_{\text{shape}}^{i}(\mathbf{Z}_{i},\mathbf{Z}_{i}^{*})=\left(1+\exp\left(\frac{\sum_{j=1}^{d}\left|\text{dim}(\mathbf{Z}_{i})_{j}-\text{dim}(\mathbf{Z}_{i}^{*})_{j}\right|}{d}\right)\right)^{-1},\\
S_{\text{value}}^{i}(\mathbf{Z}_{i},\mathbf{Z}_{i}^{*})=\left(1+\exp\left(\frac{\sum_{j=1}^{d}\left|\mathbf{Z}_{i}^{(j)}-\mathbf{Z}_{i}^{*(j)}\right|}{d}\right)\right)^{-1},\quad
d=\max\left(\text{len}(\text{dim}(\mathbf{Z}_{i})),\,\text{len}(\text{dim}(\mathbf{Z}_{i}^{*}))\right),
\end{gathered}
\tag{1}
$$

where the shorter tensor is padded with zeros to match the length of the longer one. As the final score for each metric, we use a weighted sum of the per-layer consistency rates, with weights that increase with layer depth and sum to one:
$$S_{\text{final}}=\frac{\sum_{i=1}^{n}S^{i}\cdot\gamma^{i}}{\sum_{i=1}^{n}\gamma^{i}},$$
where $n$ is the number of layers in the model and $\gamma$ is a tunable hyperparameter that controls the weight increase; we set $\gamma=1.1$ in our experiments.
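The following minimal sketch shows how the per-layer consistency rates of Eq. (1) and the depth-weighted final score could be computed for two lists of layer outputs. It follows the equations above with γ = 1.1; the interpretation of d as the number of elements for the value metric (after flattening and zero-padding) and the random example tensors are our own assumptions, not the benchmark's released evaluation code.

```python
import math
import torch

def shape_consistency(z: torch.Tensor, z_star: torch.Tensor) -> float:
    """Per-layer tensor shape consistency rate, following Eq. (1)."""
    dims, dims_star = list(z.shape), list(z_star.shape)
    d = max(len(dims), len(dims_star))
    dims += [0] * (d - len(dims))            # pad the shorter shape with zeros
    dims_star += [0] * (d - len(dims_star))
    mean_abs_diff = sum(abs(a - b) for a, b in zip(dims, dims_star)) / d
    return 1.0 / (1.0 + math.exp(mean_abs_diff))

def value_consistency(z: torch.Tensor, z_star: torch.Tensor) -> float:
    """Per-layer tensor value consistency rate; the shorter tensor is zero-padded."""
    a, b = z.flatten(), z_star.flatten()
    d = max(a.numel(), b.numel())
    a = torch.nn.functional.pad(a, (0, d - a.numel()))
    b = torch.nn.functional.pad(b, (0, d - b.numel()))
    mean_abs_diff = (a - b).abs().sum().item() / d
    return 1.0 / (1.0 + math.exp(mean_abs_diff))

def final_score(per_layer_scores, gamma: float = 1.1) -> float:
    """Depth-weighted aggregate: deeper layers receive larger weights."""
    weights = [gamma ** i for i in range(1, len(per_layer_scores) + 1)]
    return sum(s * w for s, w in zip(per_layer_scores, weights)) / sum(weights)

# Compare layer-wise outputs of a ground-truth and a generated implementation,
# both obtained on the same random Gaussian input (faked here with random tensors).
gt_layers = [torch.randn(32, 64), torch.randn(32, 16)]
gen_layers = [torch.randn(32, 64), torch.randn(32, 8)]
tsc = final_score([shape_consistency(g, p) for g, p in zip(gt_layers, gen_layers)])
tvc = final_score([value_consistency(g, p) for g, p in zip(gt_layers, gen_layers)])
print(f"tsc = {tsc:.3f}, tvc = {tvc:.3f}")
```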

[Figure 3: An example of the consistency metric calculation, using LinkX [15].]

An example of the calculation is shown in Figure 3, using the LinkX model [15]. Meanwhile, we also include the "average running success rate" as a basic metric for the model architecture implementation task, defined in the same way as in the formula implementation task.

4 Experiments

4.1 Experimental Settings

Since R&D tasks involve numeric inputs and outputs, we set numeric equality with a tolerance of 1e-6 for evaluating the implementation of methods. We use GPT-4-turbo [22], GPT-4-32k [22], GPT-35-turbo-16k [22], and Llama2 [42] as base models for the experiments. All the methods mentioned above and their corresponding results are executed with the Azure OpenAI API. No external data, resources, human feedback, or internet access is involved in the experiments. We perform 20 independent attempts for each model and calculate the average and maximum value of each metric. As most of our input data is encoded as document files, we first use parsing tools to extract the text content from the files. The Azure Document Intelligence API (4.0) is used for parsing reports and academic papers in PDF format.

4.2 Results of Method Extraction

We evaluate the information extraction ability of the models. As shown in Table 1, GPT-4-turbo, GPT-4-32k, LLaMa3-70b, and LLaMa2-70b achieve competitive performance, which suggests that they can perform information extraction automatically. The performance of these four foundation models is stronger than that of Phi3-4k, indicating that further effort is needed to improve the extraction ability of Phi3-4k.

Table 1: Results of method extraction.

| Model | Precision | Recall | F1 |
|---|---|---|---|
| GPT-4-turbo | 0.818 | 0.800 | 0.809 |
| GPT-4-32k | 0.818 | 0.818 | 0.818 |
| LLaMa3-70b | 0.909 | 0.833 | 0.869 |
| LLaMa2-70b | 0.818 | 0.900 | 0.857 |
| Phi3-4k | 0.636 | 0.750 | 0.688 |

4.3 Results of Formula Implementation

In this section, we compare the performance of different models on the formula implementation task. We use the proposed metrics to evaluate the models; the results are shown in Table 2 and Table 3. We observe that GPT-4-turbo achieves better performance than GPT-35-turbo and Phi3-128k. The overall experimental results indicate ample room for further research, given the difficulty of the task and the challenges of automating R&D. Specifically, the experimental results reveal the following four major findings.

Table 2: Results of formula implementation for GPT-4-turbo.

| Data | Difficulty | Formula | Avg. Exec. | Avg. Format | Avg. Corr. | Max. Corr. |
|---|---|---|---|---|---|---|
| Data I | Easy | PB_ROE | 0.650 | 0.050 | 0.852 | 0.852 |
| Data I | Easy | PB_ROE_2 | 0.600 | 0.200 | 0.875 | 1.000 |
| Data I | Easy | PB_ROE_3 | 0.600 | 0.300 | 0.726 | 1.000 |
| Data I | Medium | ROE_movement | 0.950 | 0.750 | 0.934 | 1.000 |
| Data I | Medium | ROE_movement_10 | 0.900 | 0.800 | 0.803 | 1.000 |
| Data I | Medium | ROE_movement_20 | 0.950 | 0.750 | 0.703 | 1.000 |
| Data I | Hard | PB_ROE_movement | 0.600 | 0.450 | 0.516 | 0.897 |
| Data I | Hard | PB_ROE_movement_10 | 0.650 | 0.300 | 0.327 | 0.896 |
| Data I | Hard | PB_ROE_movement_20 | 0.550 | 0.500 | 0.244 | 0.896 |
| Data II | Easy | mid_price | 0.800 | 0.100 | 1.000 | 1.000 |
| Data II | Easy | mid_price_2 | 0.850 | 0.000 | NaN | NaN |
| Data II | Easy | mid_price_3 | 0.850 | 0.000 | NaN | NaN |
| Data II | Medium | liquidity_imbalance | 0.500 | 0.050 | 1.000 | 1.000 |
| Data II | Medium | liquidity_imbalance_2 | 0.900 | 0.150 | 0.694 | 1.000 |
| Data II | Medium | liquidity_imbalance_3 | 0.450 | 0.100 | 1.000 | 1.000 |
| Data II | Hard | micro_price | 0.850 | 0.000 | NaN | NaN |
| Data II | Hard | micro_price_2 | 0.600 | 0.000 | NaN | NaN |
| Data II | Hard | micro_price_3 | 0.600 | 0.100 | 1.000 | 1.000 |
| Data III | Easy | alpha053 | 0.950 | 0.700 | 0.933 | 1.000 |
| Data III | Easy | alpha053_15 | 0.950 | 0.650 | 0.872 | 1.000 |
| Data III | Easy | alpha053_5 | 1.000 | 0.650 | 0.676 | 1.000 |
| Data III | Medium | alpha_pv_diff | 1.000 | 0.600 | 0.513 | 1.000 |
| Data III | Medium | alpha_pv_diff_15 | 0.950 | 0.750 | 0.258 | 1.000 |
| Data III | Medium | alpha_pv_diff_20 | 1.000 | 0.750 | 0.441 | 1.000 |
| Data III | Hard | alpha_pv_diff_pct | 0.950 | 0.700 | 0.375 | 1.000 |
| Data III | Hard | alpha_pv_diff_pct_15 | 0.900 | 0.450 | 0.236 | 1.000 |
| Data III | Hard | alpha_pv_diff_pct_20 | 1.000 | 0.350 | 0.358 | 1.000 |
| Overall | N/A | Avg. Data I | 0.717 | 0.456 | 0.665 | 0.949 |
| Overall | N/A | Avg. Data II | 0.711 | 0.056 | 0.522 | 0.556 |
| Overall | N/A | Avg. Data III | 0.967 | 0.622 | 0.518 | 1.000 |
| Overall | N/A | Mean Value | 0.798 | 0.378 | 0.568 | 0.835 |

LLM agents hold promising potential to tackle D-CARD. We can observe from Table 2 and Table 3 that GPT-4 possesses the ability to tackle some simple D-CARD cases without adopting any additional techniques. Specifically, GPT-4-turbo achieves the maximum correlation with the ground-truth results when implementing both easy and medium formulas. However, GPT-4 fails to precisely match the exact ground-truth values due to minor mistakes, such as missing domain common knowledge (e.g., using percent change rather than difference when calculating growth), mismatching the output format, and unnecessarily introducing additional computational operations.
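To make the growth example above concrete, the sketch below contrasts an absolute difference with a percent change on a hypothetical quarterly ROE series, under the reading that "growth" conventionally refers to the percent change:

```python
import pandas as pd

roe = pd.Series([0.10, 0.12, 0.09, 0.15])  # hypothetical quarterly ROE values

naive_growth = roe.diff()    # absolute difference between consecutive quarters
growth = roe.pct_change()    # percent change, the conventional notion of growth
print(pd.DataFrame({"diff": naive_growth, "pct_change": growth}))
```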

Precisely understanding and selecting data requires more detailed data information in D-CARD. As shown in Table 2, we observe a special situation where GPT-4 significantly fails to implement a simple formula while succeeding in implementing harder ones. After analyzing the generated code, we find that GPT-4 confuses the semantic meanings of data features whose natural language descriptions are close, which renders the subsequent calculation ineffective. For example, GPT-4 confuses the two terms "volume" and "volatility" and always opts to use the "volume" data when "volatility" is required. If we manually improve the initial prompt by adding a more detailed description, GPT-4 succeeds in understanding the semantic difference and achieves over 99% value accuracy.

The ability to query domain-specific knowledge is a basic requirement for D-CARD methods. As mentioned in the first finding, missing domain common knowledge impedes GPT-4 from calculating precisely matched final results. Additionally, we find that the implementation of some operations in a formula also requires domain-specific knowledge. For example, in the financial domain, the description "x cross-sectionally neutralized against groups g" is clear enough for financial practitioners to implement the operation "IndNeutralize(x, g)". However, in the code generated by GPT-4, it defines a function named "IndNeutralize(series, industry)" and leaves its body blank, merely adding the comment "Please replace this with your actual function definition".
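For reference, "x cross-sectionally neutralized against groups g" is commonly implemented as a group-wise demeaning at each time stamp. A minimal pandas sketch under our own assumptions (a MultiIndex of (date, instrument) and a per-instrument group label) could look as follows; this is one common reading, not necessarily the benchmark's ground-truth code:

```python
import pandas as pd

def ind_neutralize(x: pd.Series, groups: pd.Series) -> pd.Series:
    """Cross-sectionally neutralize x against groups g: at each date, subtract the
    mean of x within each group (industry/sector), so group averages become zero."""
    df = pd.DataFrame({"x": x, "group": groups})
    # "date" is assumed to be a named level of the (date, instrument) MultiIndex.
    return df.groupby(["date", "group"])["x"].transform(lambda s: s - s.mean())
```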

The more complex the method is, the more unstable the model's performance becomes. As shown in the "avg. exec.", "avg. form.", and "avg. corr." columns of Table 2, the performance variance of GPT-4 grows significantly as the complexity of the formulas increases. Across 20 executions, GPT-4 generates successfully executed code 18 times when implementing the medium mid_price but only three times when implementing the hard alpha_pv.

Table 3: Results of formula implementation across models.

| Model | Data | avg. exec. | avg. format | avg. corr. | max. corr. |
|---|---|---|---|---|---|
| GPT-4-turbo | Data I | 0.717 | 0.456 | 0.665 | 0.949 |
| GPT-4-turbo | Data II | 0.711 | 0.056 | 0.522 | 0.556 |
| GPT-4-turbo | Data III | 0.967 | 0.622 | 0.518 | 1.000 |
| GPT-4-turbo | Mean Value | 0.798 | 0.378 | 0.568 | 0.835 |
| GPT-35-turbo | Data I | 0.556 | 0.100 | 0.323 | 0.453 |
| GPT-35-turbo | Data II | 0.567 | 0.000 | 0.000 | 0.000 |
| GPT-35-turbo | Data III | 0.767 | 0.389 | 0.431 | 0.696 |
| GPT-35-turbo | Mean Value | 0.630 | 0.163 | 0.251 | 0.383 |
| Phi3-128k | Data I | 0.117 | 0.111 | 0.186 | 0.222 |
| Phi3-128k | Data II | 0.172 | 0.000 | 0.000 | 0.000 |
| Phi3-128k | Data III | 0.056 | 0.022 | 0.063 | 0.084 |
| Phi3-128k | Mean Value | 0.115 | 0.044 | 0.083 | 0.102 |

As shown in Table 3, the performance of GPT-35-turbo and Phi3-128k is poor; they often fail even to produce executable code. In contrast, the GPT-4 models shown in Table 2 perform much better. This indicates that performance on the data-centric R&D task is highly related to the model's pre-training and capacity. Therefore, we posit that continually training and improving the foundation model is a promising direction for future research on data-centric R&D tasks.

4.4 Results of Model Architecture Implementation

In this section, we compare the performance of different LLMs on the model architecture implementation task and summarize the results in Tables 4 and 5. As shown in the tables, GPT-4-turbo, GPT-35-turbo-16k, and GPT-4-32K have similar running success rates but differ considerably in tsc. and tvc. LLaMa-2-70b has the lowest running success rate and the lowest scores on the other metrics. Notice that, although a significant gap still exists between GPT-35, LLaMa-2, and GPT-4, it is much smaller than the gap on the formula implementation task, and the overall running success rates are also higher. Overall, the observations on the model architecture implementation task mirror those on the formula implementation task.

Table 4: Results of model architecture implementation for GPT-4-turbo and GPT-4-32K.

GPT-4-turbo
| Model Name | Difficulty | avg. exe. | avg. tsc. | avg. tvc. | max. tsc. | max. tvc. |
|---|---|---|---|---|---|---|
| PMLP | Easy | 100.00% | 1.00 | 1.00 | 1.00 | 1.00 |
| LinkX | Easy | 100.00% | 1.00 | 0.85 | 1.00 | 1.00 |
| VisNet | Hard | 45.00% | 0.29 | 0.09 | 0.37 | 0.49 |
| AntiSymmetric | Medium | 80.00% | 0.71 | 0.59 | 0.73 | 0.88 |
| GPSConv | Medium | 75.00% | 0.56 | 0.62 | 0.65 | 1.00 |
| DirGNNConv | Medium | 100.00% | 0.80 | 0.68 | 0.86 | 0.94 |

GPT-4-32K
| Model Name | Difficulty | avg. exe. | avg. tsc. | avg. tvc. | max. tsc. | max. tvc. |
|---|---|---|---|---|---|---|
| PMLP | Easy | 100.00% | 1.00 | 1.00 | 1.00 | 1.00 |
| LinkX | Easy | 100.00% | 0.90 | 0.90 | 1.00 | 1.00 |
| VisNet | Hard | 45.00% | 0.21 | 0.09 | 0.37 | 0.49 |
| AntiSymmetric | Medium | 70.00% | 0.56 | 0.66 | 0.66 | 0.88 |
| GPSConv | Medium | 75.00% | 0.53 | 0.62 | 0.65 | 0.72 |
| DirGNNConv | Medium | 90.00% | 0.65 | 0.62 | 0.82 | 0.91 |

Table 5: Results of model architecture implementation for GPT-35-turbo-16k, LLaMa-2-70b, and LLaMa-3-70b.

GPT-35-turbo-16k
| Model Name | Difficulty | avg. exe. | avg. tsc. | avg. tvc. | max. tsc. | max. tvc. |
|---|---|---|---|---|---|---|
| PMLP | Easy | 100.00% | 0.75 | 0.75 | 1.00 | 1.00 |
| LinkX | Easy | 100.00% | 0.60 | 0.34 | 1.00 | 1.00 |
| VisNet | Hard | 5.00% | 0.03 | 0.00 | 0.16 | 0.40 |
| AntiSymmetric | Medium | 45.00% | 0.16 | 0.21 | 0.61 | 0.22 |
| GPSConv | Medium | 45.00% | 0.24 | 0.19 | 0.45 | 0.42 |
| DirGNNConv | Medium | 65.00% | 0.56 | 0.29 | 0.71 | 0.42 |

LLaMa-2-70b
| Model Name | Difficulty | avg. exe. | avg. tsc. | avg. tvc. | max. tsc. | max. tvc. |
|---|---|---|---|---|---|---|
| PMLP | Easy | 60.00% | 0.45 | 0.55 | 1.00 | 1.00 |
| LinkX | Easy | 30.00% | 0.20 | 0.15 | 1.00 | 1.00 |
| VisNet | Hard | 0.00% | 0.00 | 0.00 | 0.00 | 0.00 |
| AntiSymmetric | Medium | 0.00% | 0.00 | 0.00 | 0.00 | 0.00 |
| GPSConv | Medium | 0.00% | 0.00 | 0.00 | 0.00 | 0.00 |
| DirGNNConv | Medium | 0.00% | 0.00 | 0.00 | 0.00 | 0.00 |

LLaMa-3-70b
| Model Name | Difficulty | avg. exe. | avg. tsc. | avg. tvc. | max. tsc. | max. tvc. |
|---|---|---|---|---|---|---|
| PMLP | Easy | 100.00% | 0.85 | 0.75 | 1.00 | 1.00 |
| LinkX | Easy | 100.00% | 0.80 | 0.65 | 1.00 | 1.00 |
| VisNet | Hard | 40.00% | 0.27 | 0.24 | 0.33 | 0.42 |
| AntiSymmetric | Medium | 80.00% | 0.62 | 0.70 | 0.71 | 0.88 |
| GPSConv | Medium | 75.00% | 0.51 | 0.59 | 0.65 | 1.00 |
| DirGNNConv | Medium | 90.00% | 0.72 | 0.68 | 0.84 | 0.91 |

5 Limitation

The RD²Bench framework, while innovative, only evaluates the most representative base LLMs, such as GPT-4, LLaMa-3, LLaMa-2, and GPT-3.5, without evaluating additional open-source models. Meanwhile, this paper only includes the most representative R&D domains and problems and focuses on data-driven scenarios; the benchmark can be extended in the future to demonstrate its generalizability. We will provide more details and a more comprehensive evaluation in future work. We believe that the benchmark will be a valuable tool for the community to evaluate the performance of models on data-centric R&D tasks and to develop new models and techniques that address the challenges and opportunities in this domain.

6 Conclusion

In this paper, we make the first effort to tackle the real-world data-centric automatic R&D scenario, in the hope of significantly improving the research efficiency of scientists and thus contributing to a revolution in human productivity. Specifically, we propose RD²Bench, which benchmarks all the operations in D-CARD as a whole to navigate future work directly toward the ultimate goal of automating data-centric R&D. RD²Bench focuses on evaluating the interaction and synergistic effects of various model capabilities and on aiding the selection of well-performing, trustworthy models. Based on RD²Bench, we find that although the SOTA GPT-4 shows promising potential in tackling D-CARD, ample room remains for future work.

References

  • [1]Anonymous.MetaTool benchmark: Deciding whether to use tools and which to use.In The Twelfth International Conference on Learning Representations, 2024.
  • [2]DaniilA. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes.Autonomous chemical research with large language models.Nature, 624(7992):570–578, December 2023.
  • [3]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.Language models are few-shot learners.In H.Larochelle, M.Ranzato, R.Hadsell, M.F. Balcan, and H.Lin, editors, Advances in Neural Information Processing Systems, volume33, pages 1877–1901. Curran Associates, Inc., 2020.
  • [4]Erik Brynjolfsson and Andrew McAfee.The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies.WW Norton & Company, 2014.
  • [5]Haotian Chen, Bingsheng Chen, and Xiangdong Zhou.Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6418–6435, Toronto, Canada, 2023. Association for Computational Linguistics.
  • [6]Peng Cui and Susan Athey.Stable Learning Establishes some Common Ground between Causal Inference and Machine Learning.Nature Machine Intelligence, 4(2):110–115, February 2022.
  • [7]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and KristinaN. Toutanova.BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2018.
  • [8]Matthias Fey and JanE. Lenssen.Fast graph representation learning with PyTorch Geometric.In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • [9]Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and FelixA. Wichmann.Shortcut Learning in Deep Neural Networks.Nature Machine Intelligence, 2(11):665–673, November 2020.
  • [10]Alessio Gravina, Davide Bacciu, and Claudio Gallicchio.Anti-symmetric DGN: a stable architecture for deep graph networks.In The Eleventh International Conference on Learning Representations, 2023.
  • [11]CarlosE Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan.SWE-bench: Can language models resolve real-world GitHub issues?arXiv preprint arXiv:2310.06770, 2023.
  • [12]CarlosE. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan.Swe-bench: Can language models resolve real-world github issues?, 2023.
  • [13]RajendraP. Joshi and Neeraj Kumar.Artificial intelligence for autonomous molecular design: A perspective.Molecules, 26(22):6761, November 2021.
  • [14]Huao Li, YuQuan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, and KatiaP. Sycara.Theory of mind for multi-agent collaboration via large language models.In Conference on Empirical Methods in Natural Language Processing, 2023.
  • [15]Derek Lim, Felix Hohne, Xiuyu Li, SijiaLinda Huang, Vaishnavi Gupta, Omkar Bhalerao, and SerNam Lim.Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods.Advances in Neural Information Processing Systems, 34:20887–20902, 2021.
  • [16]Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein.ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks, November 2023.
  • [17]Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein.Ml-bench: Large language models leverage open-source libraries for machine learning tasks, 2023.
  • [18]Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang.A survey of deep learning for mathematical reasoning.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2023.
  • [19]Tomas Mikolov, Ilya Sutskever, Kai Chen, GregS Corrado, and Jeff Dean.Distributed representations of words and phrases and their compositionality.In C.J. Burges, L.Bottou, M.Welling, Z.Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume26. Curran Associates, Inc., 2013.
  • [20]PramodKaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere.Did the Model Understand the Question?In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1896–1906, Melbourne, Australia, 2018. Association for Computational Linguistics.
  • [21]OpenAI.GPT-4 Technical Report, March 2023.
  • [22]OpenAI.Gpt-4 technical report, 2023.
  • [23]Long Ouyang, Jeff Wu, XuJiang, Diogo Almeida, CarrollL. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe.Training language models to follow instructions with human feedback, March 2022.
  • [24]Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, LuFang, Junjie Bai, and Soumith Chintala.Pytorch: An imperative style, high-performance deep learning library.In H.Wallach, H.Larochelle, A.Beygelzimer, F.d'Alché-Buc, E.Fox, and R.Garnett, editors, Advances in Neural Information Processing Systems, volume32. Curran Associates, Inc., 2019.
  • [25]Franci Penov, Yohei Nakajima, and others.yoheinakajima/babyagi. GitHub repository, January 2024.
  • [26]Carlota Perez.Technological Revolutions and Financial Capital.Edward Elgar Publishing, 2003.
  • [27]Karl Popper.The Logic of Scientific Discovery.Routledge, 2005.
  • [28]Cheng Qian, Chi Han, YiFung, Yujia Qin, Zhiyuan Liu, and Heng Ji.Creator: Tool creation for disentangling abstract and concrete reasoning of large language models.In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023.
  • [29]Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang.Is ChatGPT a General-Purpose Natural Language Processing Task Solver?, February 2023.
  • [30]Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, etal.Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023.
  • [31]Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun.Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023.
  • [32]Ladislav Rampášek, Mikhail Galkin, VijayPrakash Dwivedi, AnhTuan Luu, Guy Wolf, and Dominique Beaini.Recipe for a General, Powerful, Scalable Graph Transformer.Advances in Neural Information Processing Systems, 35, 2022.
  • [33]Gustav Ranis and John C.H. Fei.A theory of economic development.The American Economic Review, 51(4):533–565, 1961.
  • [34]Emanuele Rossi, Bertrand Charpentier, Francesco DiGiovanni, Fabrizio Frasca, Stephan Günnemann, and Michael Bronstein.Edge directionality improves learning on heterophilic graphs, 2023.
  • [35]Gisbert Schneider.Automating drug discovery.Nature Reviews Drug Discovery, 17(2):97–113, December 2017.
  • [36]Dudley Shapere.The structure of scientific revolutions.The Philosophical Review, 73(3):383–394, 1964.
  • [37]Yongliang Shen, Kaitao Song, XuTan, Dongsheng Li, Weiming Lu, and Yueting Zhuang.Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023.
  • [38]Noah Shinn, Federico Cassano, Ashwin Gopinath, KarthikR Narasimhan, and Shunyu Yao.Reflexion: language agents with verbal reinforcement learning.In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [39]Adam Smith.The Wealth of Nations [1776], volume 11937.na, 1937.
  • [40]Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu AwalMd Shoeb, Abubakar Abid, Adam Fisch, AdamR. Brown, Adam Santoro, and Aditya Gupta.Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research, 2023.
  • [41]Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Zhiyuan Liu, and Maosong Sun.Debugbench: Evaluating debugging capability of large language models, 2024.
  • [42]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, and Bhosale.Llama 2: Open foundation and fine-tuned chat models, 2023.
  • [43]Somin Wadhwa, Silvio Amir, and Byron Wallace.Revisiting Relation Extraction in the era of Large Language Models.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15566–15589, Toronto, Canada, 2023. Association for Computational Linguistics.
  • [44]Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar.Voyager: An open-ended embodied agent with large language models, 2023.
  • [45]Haiming Wang, YeYuan, Zhengying Liu, Jianhao Shen, Yichun Yin, Jing Xiong, Enze Xie, Han Shi, Yujun Li, Lin Li, Jian Yin, Zhenguo Li, and Xiaodan Liang.Dt-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2023.
  • [46]Tianlu Wang, Rohit Sridhar, Diyi Yang, and Xuezhi Wang.Identifying and mitigating spurious correlations for improving robustness in NLP models.In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1719–1729, Seattle, United States, July 2022. Association for Computational Linguistics.
  • [47]Yusong Wang, Tong Wang, Shaoning Li, Xinheng He, Mingyu Li, Zun Wang, Nanning Zheng, Bin Shao, and Tie-Yan Liu.Enhancing geometric representations for molecules with equivariant vector-scalar interactive message passing.Nature Communications, 15(1), January 2024.
  • [48]Daniel Whalen.Holophrasm: a neural automated theorem prover for higher-order logic, 2016.
  • [49]Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, LiJiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, AhmedHassan Awadallah, RyenW White, Doug Burger, and Chi Wang.Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023.
  • [50]Chenxiao Yang, Qitian Wu, Jiahua Wang, and Junchi Yan.Graph neural networks are inherently good generalizers: Insights by bridging gnns and mlps.In International Conference on Learning Representations (ICLR), 2023.
  • [51]Hui Yang, Sifu Yue, and Yunzhong He.Auto-gpt for online decision making: Benchmarks and additional opinions, 2023.
  • [52]Kaiyu Yang, AidanM Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar.Leandojo: Theorem proving with retrieval-augmented language models.In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  • [53]XuYang, Xiao Yang, Weiqing Liu, Jinhui Li, Peng Yu, Zeqi Ye, and Jiang Bian.Leveraging large language model for automatic evolving of industrial data-centric r&d cycle, 2023.
  • [54]Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, ThomasL. Griffiths, Yuan Cao, and Karthik Narasimhan.Tree of thoughts: Deliberate problem solving with large language models, 2023.
  • [55]Xingxuan Zhang, Peng Cui, Renzhe Xu, Linjun Zhou, Yue He, and Zheyan Shen.Deep Stable Learning for Out-Of-Distribution Generalization.In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5368–5378, Nashville, TN, USA, June 2021. IEEE.
  • [56]WayneXin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen.A survey of large language models, 2023.
  • [57]Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang.ToolQA: A Dataset for LLM Question Answering with External Tools, June 2023.
  • [58]Barret Zoph, Colin Raffel, Dale Schuurmans, Dani Yogatama, Denny Zhou, Don Metzler, EdH. Chi, Jason Wei, Jeff Dean, LiamB. Fedus, MaartenPaul Bosma, Oriol Vinyals, Percy Liang, Sebastian Borgeaud, TatsunoriB. Hashimoto, and YiTay.Emergent abilities of large language models.TMLR, 2022.

Appendix A Formula implementation task metrics calculation details

As mentioned above, we use multiple metrics: the average and maximum scores across multiple independent attempts of the "running success rate", "format success rate", "Pearson correlation", and "value accuracy". Assume the ground-truth factor value is $\mathbf{Y}$ with length $n$ (the length of the time series) and the generated factor value is $\mathbf{Y}^{*}$; the metrics are calculated as follows:

Running success is defined as successful execution. Any error raised by the Python interpreter that stops the execution is considered a failure. We calculate the ratio of the number of successful executions to the total number of attempts, denoted as avg. exe.

Pearson correlation is the correlation between the ground truth factor value and the generated factor value.

$$\text{corr.}=\frac{\sum_{i=1}^{n}\left(\mathbf{Y}^{*}_{i}-\bar{\mathbf{Y}}^{*}\right)\left(\mathbf{Y}_{i}-\bar{\mathbf{Y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(\mathbf{Y}^{*}_{i}-\bar{\mathbf{Y}}^{*}\right)^{2}}\sqrt{\sum_{i=1}^{n}\left(\mathbf{Y}_{i}-\bar{\mathbf{Y}}\right)^{2}}},$$

Format success is defined as successful format matching, which means the final output dataframe format is (datetime, factor_name). We calculate the ratio of the number of matched result formats to the total number of attempts, denoted as avg. form.

Value accuracy is the accuracy of the generated factor value, which can be formulated as:

$$\text{acc.}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\left(\left|\mathbf{Y}^{*}_{i}-\mathbf{Y}_{i}\right|<t\right),$$

Please note that we set the tolerance $t$ for value accuracy to 1e-6 in this paper, which means that two values are considered equal if their absolute difference is less than 1e-6.
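A minimal sketch of how the correlation and value-accuracy checks could be computed, assuming the ground-truth and generated factors are pandas Series aligned on the same (datetime, instrument) index; this is illustrative only, not the benchmark's released evaluation code:

```python
import numpy as np
import pandas as pd

def evaluate_factor(y_true: pd.Series, y_pred: pd.Series, tol: float = 1e-6) -> dict:
    """Pearson correlation and value accuracy (|difference| < tol counts as equal)."""
    y_true, y_pred = y_true.align(y_pred, join="inner")  # compare on the common index
    a, b = y_true.to_numpy(dtype=float), y_pred.to_numpy(dtype=float)
    corr = float(np.corrcoef(b, a)[0, 1])
    acc = float((np.abs(b - a) < tol).mean())
    return {"corr.": corr, "acc.": acc}
```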

Appendix B Data collection details

As mentioned in the previous section, we collected papers [10, 34, 32, 15, 50, 47] and their corresponding code using PyG [8], as listed in the following table.

| Paper | Type | Difficulty | GT Code |
|---|---|---|---|
| PMLP | Model | Easy | Link |
| LinkX | Model | Easy | Link |
| AntiSymmetric | Layer | Medium | Link |
| GPSConv | Layer | Medium | Link |
| DirGNNConv | Layer | Medium | Link |
| VisNet | Model | Hard | Link |

Appendix C Broader Impact

The proposed RD²Bench has the potential to significantly impact the scientific community and industries reliant on R&D. By automating the tedious aspects of R&D, researchers can focus on the more creative and innovative aspects of their work, potentially accelerating the pace of discoveries. Smaller institutions or individual researchers with limited resources may benefit from automated tools that reduce the need for extensive human labor, making high-level R&D more accessible. Automation of R&D can reduce costs and time-to-market for new technologies, fostering faster economic growth and competitiveness.

Per-formula results of GPT-35-turbo (NaN entries are counted as 0 in the mean).

| Data | Difficulty | Formula | avg. exec. | avg. format | avg. corr. | max. corr. |
|---|---|---|---|---|---|---|
| Fundamental | Easy | PB_ROE | 0.400 | 0.000 | NaN | NaN |
| Fundamental | Easy | PB_ROE_2 | 0.600 | 0.000 | NaN | NaN |
| Fundamental | Easy | PB_ROE_3 | 0.600 | 0.200 | 0.521 | 0.999 |
| Fundamental | Medium | ROE_movement | 0.800 | 0.300 | 0.339 | 1.000 |
| Fundamental | Medium | ROE_movement_10 | 0.600 | 0.100 | 1.000 | 1.000 |
| Fundamental | Medium | ROE_movement_20 | 0.900 | 0.200 | 0.967 | 1.000 |
| Fundamental | Hard | PB_ROE_movement | 0.200 | 0.100 | 0.078 | 0.078 |
| Fundamental | Hard | PB_ROE_movement_10 | 0.500 | 0.000 | NaN | NaN |
| Fundamental | Hard | PB_ROE_movement_20 | 0.400 | 0.000 | NaN | NaN |
| High Frequency | Easy | mid_price | 0.600 | 0.000 | NaN | NaN |
| High Frequency | Easy | mid_price_2 | 0.500 | 0.000 | NaN | NaN |
| High Frequency | Easy | mid_price_3 | 0.600 | 0.000 | NaN | NaN |
| High Frequency | Medium | liquidity_imbalance | 0.200 | 0.000 | NaN | NaN |
| High Frequency | Medium | liquidity_imbalance_2 | 0.800 | 0.000 | NaN | NaN |
| High Frequency | Medium | liquidity_imbalance_3 | 0.500 | 0.000 | NaN | NaN |
| High Frequency | Hard | micro_price | 0.400 | 0.000 | NaN | NaN |
| High Frequency | Hard | micro_price_2 | 0.700 | 0.000 | NaN | NaN |
| High Frequency | Hard | micro_price_3 | 0.800 | 0.000 | NaN | NaN |
| Price Volume | Easy | alpha053 | 0.800 | 0.500 | 0.809 | 1.000 |
| Price Volume | Easy | alpha053_15 | 0.700 | 0.500 | 0.806 | 1.000 |
| Price Volume | Easy | alpha053_5 | 0.700 | 0.500 | 0.440 | 1.000 |
| Price Volume | Medium | alpha_pv_diff | 0.800 | 0.700 | 0.304 | 1.000 |
| Price Volume | Medium | alpha_pv_diff_15 | 0.700 | 0.400 | 0.259 | 1.000 |
| Price Volume | Medium | alpha_pv_diff_20 | 0.600 | 0.400 | 1.000 | 1.000 |
| Price Volume | Hard | alpha_pv_diff_pct | 0.800 | 0.200 | -0.011 | -0.011 |
| Price Volume | Hard | alpha_pv_diff_pct_15 | 0.900 | 0.200 | 0.096 | 0.096 |
| Price Volume | Hard | alpha_pv_diff_pct_20 | 0.900 | 0.100 | 0.176 | 0.176 |
| GPT-35-turbo | N/A | Fundamental Avg | 0.556 | 0.100 | 0.323 | 0.453 |
| GPT-35-turbo | N/A | High Frequency Avg | 0.567 | 0.000 | 0.000 | 0.000 |
| GPT-35-turbo | N/A | Price Volume Avg | 0.767 | 0.389 | 0.431 | 0.696 |
| GPT-35-turbo | N/A | Mean Value (0 for NaN) | 0.630 | 0.163 | 0.251 | 0.383 |

Per-formula results of Phi3-128k (NaN entries are counted as 0 in the mean).

| Data | Difficulty | Formula | avg. exec. | avg. format | avg. corr. | max. corr. |
|---|---|---|---|---|---|---|
| Fundamental | Easy | PB_ROE | 0.000 | 0.000 | NaN | NaN |
| Fundamental | Easy | PB_ROE_2 | 0.050 | 0.000 | NaN | NaN |
| Fundamental | Easy | PB_ROE_3 | 0.000 | 0.000 | NaN | NaN |
| Fundamental | Medium | ROE_movement | 0.350 | 0.350 | 1.000 | 1.000 |
| Fundamental | Medium | ROE_movement_10 | 0.350 | 0.350 | 0.675 | 1.000 |
| Fundamental | Medium | ROE_movement_20 | 0.300 | 0.300 | NaN | NaN |
| Fundamental | Hard | PB_ROE_movement | 0.000 | 0.000 | NaN | NaN |
| Fundamental | Hard | PB_ROE_movement_10 | 0.000 | 0.000 | NaN | NaN |
| Fundamental | Hard | PB_ROE_movement_20 | 0.000 | 0.000 | NaN | NaN |
| High Frequency | Easy | mid_price | 0.250 | 0.000 | NaN | NaN |
| High Frequency | Easy | mid_price_2 | 0.250 | 0.000 | NaN | NaN |
| High Frequency | Easy | mid_price_3 | 0.400 | 0.000 | NaN | NaN |
| High Frequency | Medium | liquidity_imbalance | 0.050 | 0.000 | NaN | NaN |
| High Frequency | Medium | liquidity_imbalance_2 | 0.150 | 0.000 | NaN | NaN |
| High Frequency | Medium | liquidity_imbalance_3 | 0.450 | 0.000 | NaN | NaN |
| High Frequency | Hard | micro_price | 0.000 | 0.000 | NaN | NaN |
| High Frequency | Hard | micro_price_2 | 0.000 | 0.000 | NaN | NaN |
| High Frequency | Hard | micro_price_3 | 0.000 | 0.000 | NaN | NaN |
| Price Volume | Easy | alpha053 | 0.050 | 0.000 | NaN | NaN |
| Price Volume | Easy | alpha053_15 | 0.000 | 0.000 | NaN | NaN |
| Price Volume | Easy | alpha053_5 | 0.050 | 0.000 | NaN | NaN |
| Price Volume | Medium | alpha_pv_diff | 0.250 | 0.150 | 0.413 | 0.602 |
| Price Volume | Medium | alpha_pv_diff_15 | 0.050 | 0.000 | NaN | NaN |
| Price Volume | Medium | alpha_pv_diff_20 | 0.000 | 0.000 | NaN | NaN |
| Price Volume | Hard | alpha_pv_diff_pct | 0.050 | 0.050 | 0.153 | 0.153 |
| Price Volume | Hard | alpha_pv_diff_pct_15 | 0.000 | 0.000 | NaN | NaN |
| Price Volume | Hard | alpha_pv_diff_pct_20 | 0.050 | 0.000 | NaN | NaN |
| Phi3-128k | N/A | Fundamental Avg | 0.117 | 0.111 | 0.186 | 0.222 |
| Phi3-128k | N/A | High Frequency Avg | 0.172 | 0.000 | 0.000 | 0.000 |
| Phi3-128k | N/A | Price Volume Avg | 0.056 | 0.022 | 0.063 | 0.084 |
| Phi3-128k | N/A | Mean Value (0 for NaN) | 0.115 | 0.044 | 0.083 | 0.102 |

Appendix D Prompts

The prompt for the formula implementation task is as follows:

The user is trying to implement some factors in quant investment, and you are the one to help write the Python code.

The user will provide the source data in HDF5(H5) format which you can load using pandas.read_hdf. The file is located near your Python code file which you can read from "./source_data.h5". After that, you will get a pandas dataframe with the following format:

open,close,high,low,volume,vwap,cap,IndClass.industry IndClass.sector,returns,date,instruments

2020-01-02,SH600000,158.538132,158.538132,160.699432,158.283859,4060945.0,159.431900,647446144.0,1.0,NaN

The explanation of the example column names:

1: returns: daily close-to-close returns

2: open, close, high, low, volume: standard definitions for daily price and volume data

3: vwap: daily volume-weighted average price

4: cap: market capitalization is the total value of a company's outstanding shares of stock

5: IndClass.industry and IndClass.sector: a generic placeholder for a binary industry classification such as GICS, BICS, NAICS, SIC, etc., in indneutralize(x, IndClass.level), where level: sector, industry, etc. Multiple IndClass in the same alpha need not correspond to the same industry classification.

The user will provide you with a formulation of the factor, which contains some function calls and operators. You need to implement the function calls and operators in Python. Your code is expected to align the formulation in any form which means the user needs to get the exact factor values with your code as expected.

Your code should contain the following parts: the import part, the function part, and the main part. You should write a main function named "calculate_{function_name}" and call this function in the "if __name__ == '__main__'" part. Don't write any try-except block in your code. The user will catch the exception message and provide feedback to you.

The user will write your code into a Python file and execute the file directly with "python {your_file_name}.py". You should calculate the factor values and save the result into an HDF5 (H5) file named "result.h5" in the same directory as your Python file. The result file is an HDF5 (H5) file containing a pandas dataframe: the index of the dataframe is the date and instrument, the single column name is the factor name, and the values are the factor values.

To help you write the correct code, the user might provide multiple pieces of information that help you write the correct code:

1. The user might provide you with the correct code for similar factors. You should learn from this code to write the correct code.

2. The user might provide you with previously failed code and the corresponding feedback. The feedback contains information about the execution, the code, and the factor value. You should analyze the feedback and try to correct the latest code.

3. The user might provide you with suggestions for the latest failed code and some similar failed-to-correct pairs. Each pair contains the failed code with a similar error and the corresponding corrected version of the code. You should learn from these suggestions to write the correct code.

Please respond with the code in the following JSON format. Here is an example structure for the JSON output:

{

"code": "The Python code as a string."

}

The prompt for the model architecture implementation task is as follows:

The user is trying to implement some models or layers in deep learning, specifically in the graph learning area, and you are the one to help write the Python code.

Use PyTorch and PyG (torch_geometric) framework to implement it. You can assume the input will contain node feature X [num_nodes, feature_dim], edge_index [2, num_edges], edge_feature [num_edges, num_edge_features], y [num_nodes, *] when it is node-level targets or graph-level targets of shape [1, *], pos (node position matrix) [num_nodes, position_dim].

The user will provide you with a formulation of the model/layer. You need to implement it in Python.

Your code should contain the following parts: the import part, the function part, and the main part. You should write a main function named "calculate_function_name" and call this function in the "if __name__ == '__main__'" part. Don't write any try-except blocks in your code. The user will catch the exception message and provide the feedback to you.

The user will write your code into a Python file and execute the file directly with "python {your_file_name}.py".

Please respond with the code in the following JSON format. Here is an example structure for the JSON output:

{

"code": "The Python code as a string."

}
