Haotian Chen1, Xinjie Shen1, Zeqi Ye1, Wenjun Feng1, Haoxue Wang1
Xiao Yang2 , Xu Yang2, Weiqing Liu2, Jiang Bian2
2Microsoft Research Asia
{ht1ian.chen, frinkleko, liamyzq}@gmail.com
fwj20020813@outlook.com, whx924@gmail.com
{xiao.yang, xuyang1, weiqing.liu, jiang.bian}@microsoft.com
Equally contributed.  Corresponding author.
Abstract
The progress of humanity is driven by the few successful discoveries that emerge from countless failed experiments. Researchers often seek promising research directions by reading papers and then verifying them through experiments, a process that imposes a significant burden on them. In the past decade, data-driven black-box deep learning methods have demonstrated their effectiveness in a wide range of real-world scenarios, which further exacerbates the experimental burden on researchers and thus leaves potential successful discoveries veiled. Automating such a research and development (R&D) process is therefore an urgent need. In this paper, we make the first effort to formalize this goal by proposing a Real-world Data-centric automatic R&D Benchmark, namely Bench. Bench benchmarks all the operations in data-centric automatic R&D (D-CARD) as a whole to navigate future work directly toward this goal. It focuses on evaluating the interaction and synergistic effects of various model capabilities and on aiding the selection of well-performing trustworthy models. Although Bench is very challenging even for the state-of-the-art (SOTA) large language model (LLM) GPT-4, indicating ample research opportunities and the need for more research effort, LLMs possess promising potential to bring significant development to D-CARD: they are able to implement some simple methods without adopting any additional techniques. We appeal to future work to take developing techniques for tackling automatic R&D into consideration, thus opening the opportunity for a potentially revolutionary upgrade to human productivity.
1 Introduction
“I have not failed. I’ve just found 10,000 ways that won’t work.”
— Thomas Alva Edison
The advancement of human society and the enhancement of living standards are highly correlated with the development of technology [39, 33, 26, 4]. Numerous truths and principles remain undiscovered, awaiting experimental exploration [36, 27]. The few successful discoveries, accompanied by countless failed experiments, propel the frontiers of technology. Historically, scientific researchers, Edison included, have carried out such extensive experiments manually. In the age of AI, the influence of data-driven solutions such as machine learning (ML) systems is rapidly expanding [19, 7, 21]. These systems are known for their strong fitting capabilities and their "black box" nature, which significantly increases the experimental load on researchers and hinders the process of identifying and validating effective methodologies. This paper concentrates on this critical scenario, which we refer to as Data-Centric Research and Development (R&D). To cope with the prohibitively expensive costs and the overwhelming volume of required experiments, we consider automating the R&D process for higher research efficiency by leveraging the strong language understanding and programming abilities of state-of-the-art (SOTA) large language models (LLMs) [40]. A brief illustration of Data-Centric Automatic R&D (D-CARD) is shown in Figure 1.
The first step toward automatic R&D is to formalize the task and provide a benchmark for identifying potentially effective methods and research directions. Intuitively, an outstanding methodology identified by the benchmark should possess (1) strong language understanding ability, to identify the implementable methods or ideas (e.g., formulations and models) in the given raw information (e.g., papers, reports, websites, etc.), and (2) strong implementation ability, to accurately implement the methods by programming and then obtain reliable experimental results. Previous work benchmarks different aspects of these two abilities. Specifically, the language understanding ability of LLMs is partly evaluated through their performance on relation extraction [43], question answering [57], and other natural language processing (NLP) tasks [29]. Meanwhile, the implementation ability of LLMs is partly tested through benchmarks like SWE-Bench [11], ToolBench [30], ML-Bench [16], and MetaTool [1], which study their ability to solve GitHub issues, use tools to program, and determine whether to use tools in a given scenario.
In this paper, we make the first effort to investigate the capabilities of SOTA LLMs in tackling automatic R&D and propose a Real-world Data-centric automatic R&D Benchmark (Bench). The scenario studied by Bench possesses two unique characteristics that fundamentally differentiate it from others. First, Bench focuses on the real-world scenario where all the operations in R&D are automated and evaluated as a whole, thus navigating related future research toward the goal of developing human technology more effectively. This real-world scenario requires more comprehensive and advanced model capabilities and exhibits new challenges. Second, we study real-world automatic R&D in data-centric settings to steer future work toward the urgent experimental-exploration need brought by black-box data-driven models. Compared with existing benchmarks, Bench possesses two significant advantages:
(1) Bench evaluates the interaction and synergistic effects of various model capabilities instead of focusing on a single aspect of ability, which not only captures the frontier of SOTA LLMs but also bridges the gap between studying "individual ability" and "real-world synergistic effects of abilities". In automatic R&D, an ML system can fail to complete the task even if it possesses both strong information extraction ability and strong programming or tool-using ability: while it succeeds in extracting methods and implementing them, it may still fail to select the appropriate data from the datasets, or misunderstand either the descriptions of data features or the requirements expressed by the prompts. Additionally, exhaustively enumerating all the aspects worth benchmarking separately is extremely challenging, a difficulty that Bench sidesteps by evaluating the whole pipeline.
(2) Bench tends to select well-performing trustworthy models rather than models that achieve outstanding performance without learning rationales or causality. Specifically, ML systems easily achieve SOTA performance on previous benchmarks by shortcut learning or learning spurious correlations instead of rationales or causality [20, 9, 6, 46, 5]. This renders a benchmark ineffective and misleading, as it fails to accurately identify well-performing trustworthy methods. For example, an ML system can achieve SOTA performance on dog classification by merely recognizing grass [55]. Bench, on the contrary, eliminates such models through its high difficulty and large scope. A model's decision rules have to simultaneously satisfy at least four major requirements: (1) accurately and comprehensively extracting the implementable methods; (2) precisely selecting the method-specific data for computation; (3) correctly writing the code according to the logic expressed by methods and prompts; and (4) successfully storing the correct results in a predefined format. Therefore, the decision rules of models selected by this benchmark are stable (they work well in various situations) and thus closer to rationales and causality [6].
We evaluate existing SOTA LLMs on Bench to expose their bottlenecks and characterize future research directions. Bench reveals new insights: (1) among the popular LLMs, GPT-4 exhibits promising potential for the D-CARD task; (2) detailed data descriptions significantly improve the performance of GPT-4; (3) the ability to query domain-specific knowledge is a basic requirement for D-CARD methods; (4) the more complex the method, the more unstable the model's performance.
2 Related Work
2.1 LLM as Autonomous Agent
In the past few years, LLMs have made great achievements in both academia and industry [22, 42] and have surpassed previous levels on a number of classic tasks [56]. Research has shown that, with the growth of data volume and model size [58], LLMs exhibit emergent reasoning and other capabilities [23]. These capabilities enable LLMs to show certain agent-like behaviors on tasks such as using or creating tools [31, 28], planning [54, 3], and memory. Therefore, more and more researchers have expressed expectations for their human-like and holistic capabilities and have made preliminary explorations of LLMs as independent agents [44, 38]. Multi-agent collaboration [49, 14] has also been introduced to LLMs for better accuracy and generalizability. Moreover, to reduce human effort and enable automatic exploration, autonomous LLM agents for general purposes have been proposed [51, 37]. Optimistic views further hold that the realization of AGI may come from the evolution of autonomous LLMs, and some inspiring examples have been released [25].
However, most research still focuses on limited scenarios with clear and fixed questions and backgrounds. A recent work [53] attempts to introduce LLMs to the R&D field and formalize the R&D process as a sequence of tasks. However, there is no easy-to-use benchmark for the community, and current R&D tasks may be too general to reveal significant signals. In this work, we propose a benchmark for LLMs on data-centric R&D tasks and provide a comprehensive evaluation.
2.2 Semi-Automatic R&D with Agents
Scientific research and development (R&D) is a time-consuming yet important process. In the past, R&D has mainly been conducted by human researchers through countless failed experimental explorations and creative observations. Agents have been introduced to R&D to reduce human effort and explore automatically. Recently, there have been attempts to partly automate R&D, including automatic chemical synthesis planning [2], automatic molecular design [13, 35, 2], and automatic theorem proving [45, 52]. However, these attempts mainly focus on automatically searching for possible solutions and optimizations with symbolic representations [18] and heuristic techniques [48], and rarely address the long-horizon planning, implementation, and reasoning needed to explore the next idea. Moreover, data-centric R&D tasks have not yet been explored in the community, and no benchmark has been proposed for them. Previous works apply LLMs to real-world R&D tasks such as debugging [41, 12] or focus on data-centric but not real-world R&D tasks [17]. In this work, we propose a benchmark for LLMs on data-centric R&D tasks and evaluate their performance.
3 Bench
Overall, our benchmark evaluates the final implementation results derived from the given raw information (e.g., papers, reports, websites, etc.). Moreover, we also provide human-annotated ground-truth information for the intermediate steps to support debugging and a more comprehensive evaluation. Bench selects well-performing models that follow human operations and accurately calculate the final results. We introduce the details of our proposed Bench in the following sections. In Section 3.1 and Section 3.2, we introduce how we collect data and perform human annotation to form Bench. Then, we elaborate on the two necessary steps, namely method extraction and method implementation, for performing R&D in Section 3.3 and Section 3.4. Finally, we detail our evaluation metrics in Section 3.5.
3.1 Data Collection
We consider the raw information that contains formulas and models, which represent a wide range of methods proposed in the AI domain.
Data Collection with Formulas. We prepare raw information that contains formulas as the input of R&D. The raw information is presented as publicly available financial reports and stock trading data. The formulas are usually mathematical expressions that take complex numeric data about stocks, companies, and markets as input and output a series of values along the time dimension. We collect financial reports with 27 implementable formulas distributed over three difficulty levels: easy, medium, and hard. Domain experts manually label the difficulty levels according to the complexity of implementation. To obtain the implementation results, an agent is expected to accurately select the features from three types of trading data spanning 2010 to 2022, namely fundamental, price-volume, and high-frequency data. We denote the three types of data as Data I, II, and III, respectively.
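To make the setting concrete, the sketch below shows how an agent can load one of these trading-data files in the way the prompt in Appendix D describes; the file name "./source_data.h5" and the column names are taken from that prompt, and the actual benchmark data layout may differ slightly.

```python
import pandas as pd

# A minimal sketch, assuming the HDF5 layout described in the Appendix D prompt.
df = pd.read_hdf("./source_data.h5")
print(df.columns.tolist())   # e.g. open, close, high, low, volume, vwap, cap, returns
print(df.head())             # rows keyed by date and instrument
```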
Data Collection with Models. We collect papers with six open-sourced models [10, 34, 32, 15, 50, 47]. The implementation of the models adopts the PyTorch [24] and torch_geometric [8] frameworks for deep learning. All the papers and models are publicly available. We manually label the difficulty level (easy, medium, hard) of each task based on the complexity of implementation (computational graphs and tensor operations). We refer the readers to the appendix for more details about the dataset and the task.
3.2 Human Annotation
To provide a more comprehensive evaluation for debugging and analyzing, we conduct human annotation to provide the ground-truth results of our collected data, namely method extraction results and method implementation results.
Challenges. We confront five main challenges in the human annotation process. First, we need to identify the difficulty levels of methods to ensure the diversity of our benchmark and expose the bottleneck of current models. Second, we have to identify and discard raw information whose presented methods demand unavailable data: the computation of some formulas can require confidential information that is not publicly available. Third, since the definitions or descriptions of some methods can be vague, rendering them unimplementable, we have to filter out these methods. Fourth, some domain-specific methods containing factual errors should be filtered out since they are not implementable. Fifth, we have to distinguish the domains and types of the methods according to their descriptions. To sum up, these challenges imply that human annotation for Bench requires a high time cost and annotator expertise. Therefore, we commit more effort to designing the annotation guidelines and mechanisms to ensure dataset quality.
Annotation Guidelines and Mechanisms. We provide principled guidelines and mechanisms to the annotators. Specifically, we first clarify our requirements in the guidelines: the extracted ground-truth methods must be implementable without any additional information, and the ground-truth implementation results must be stored in the expected space with the predefined form and shape. Second, we train the annotators. After the training step, we carefully design a test case and an interview to examine whether the annotators possess expertise in the target domain (e.g., finance) and understand the guidelines. Third, each annotation result is checked at least twice by annotators. A senior researcher and an applied scientist with domain expertise and rich engineering experience further check each annotation result together before it is finally included in the benchmark.
3.3 Method Extraction Step
We evaluate the ability of models to recognize and extract information from raw information (e.g., R&D context). A qualified model is expected to discern feasible methods (formulas and models) from extensive research materials and extract all necessary information for implementing these formulas. The ability serves as the foundational premise for subsequent code implementation.
We expect the model to accurately and comprehensively extract the methods mentioned in the research materials it reviews, including all essential conditions required to implement the method. For methods with incomplete information, further implementation is not required; for methods with complete conditions, a model is expected to correctly comprehend the semantic meaning of these conditions stated in natural language and generate corresponding code. Specifically, we have predefined the extraction format (key-value pair) for the model. We employ the F1 score to measure the comprehensiveness and accuracy of method identification and extraction.
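As an illustration of how this step is scored, the sketch below computes precision, recall, and F1 over the set of extracted method names against the annotated ground truth; the real benchmark compares the full key-value pairs, and the example names are formula names from Table 2 used purely for illustration.

```python
# Minimal sketch of the extraction scoring described above.
def extraction_f1(predicted: set, ground_truth: set) -> tuple:
    tp = len(predicted & ground_truth)                       # correctly extracted methods
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred = {"mid_price", "liquidity_imbalance", "alpha053"}      # hypothetical model output
gold = {"mid_price", "liquidity_imbalance", "micro_price"}   # hypothetical annotation
print(extraction_f1(pred, gold))                             # -> (0.667, 0.667, 0.667)
```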
Note that some methods in the original materials might only imply their function, effect, or origin through their names without explicitly presenting their formulas, definitions, or other details. In such cases, the model may choose not to extract them or opt to autonomously complete them based on the semantics of the original materials. We expect the latter approach in future work, as it showcases the creativity of models by proposing new formulas and generating brand-new, informative, and reliable information. In the current version of the benchmark, only methods mentioned by name are evaluated in this manner; future iterations will explore and assess the model’s ability to generate new names and formulas when none are explicitly mentioned.
3.4 Method Implementation Step
In this section, we evaluate the performance of LLMs in implementing methods. Given all the necessary conditions obtained from the previous step, the model needs to select the necessary data and write code from scratch to implement the method, guided by an informative and well-organized prompt. Details of the prompt are included in the dataset and shown in the appendix. We encourage models to use Python and perform data analysis; they are also permitted to use common machine-learning libraries. One example of the method implementation step is shown in Figure 2.
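For concreteness, here is a minimal sketch of the output contract that the Appendix D prompt imposes on generated code. The factor itself (a 5-day moving average of the close price) is a toy example invented for illustration, not one of the benchmark formulas; only the file names and the (date, instrument)-indexed result format follow the prompt.

```python
import pandas as pd

def calculate_close_ma5():
    # Toy factor for illustration only: 5-day moving average of the close price.
    df = pd.read_hdf("./source_data.h5")     # assumed (date, instrument) MultiIndex
    factor = (
        df["close"]
        .groupby(level=1)                    # group by instrument
        .rolling(5).mean()
        .droplevel(0)                        # drop the duplicated group level
        .to_frame("close_ma5")
    )
    factor.to_hdf("result.h5", key="factor") # single-column result, as required

if __name__ == "__main__":
    calculate_close_ma5()
```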
3.5 Evaluation Metrics
We adopt multiple metrics to evaluate model performance in each step. For formula implementation, we report the average and maximum "running success rate", "format success rate", "Pearson correlation", and "value accuracy" across multiple independent attempts. We use "avg.", "exe.", "form.", "corr.", and "acc." to denote the average value, the number of successful executions, the number of matched result formats, the correlation, and the accuracy of the corresponding values, respectively. We refer the readers to Appendix A for details of the metric calculations.
For model implementation, we believe a successful implementation should be consistent with the ground-truth implementation, since a model can be viewed as a numeric function, i.e., a combination of tensor transformations. We therefore propose two metrics for the model architecture implementation task: the tensor shape consistency rate (tsc.) and the tensor value consistency rate (tvc.). Specifically, for each model layer, we calculate the consistency rate of the tensor shape and the tensor value between the ground-truth implementation and the implementation generated by the LLM. All ground-truth tensor values are produced by running the ground-truth code on random Gaussian noise. Let $T_l^{gt}$ and $T_l^{gen}$ denote the ground-truth and generated tensors at layer $l$, and let $n_l = \max(|T_l^{gt}|, |T_l^{gen}|)$ be the length of the longer of the two tensors, with the shorter one zero-padded to length $n_l$. The per-layer metrics are

$$\mathrm{tsc}_l = \mathbb{1}\!\left[\mathrm{shape}(T_l^{gt}) = \mathrm{shape}(T_l^{gen})\right], \qquad \mathrm{tvc}_l = \frac{1}{n_l}\sum_{i=1}^{n_l}\mathbb{1}\!\left[\,\lvert T_{l,i}^{gt} - T_{l,i}^{gen}\rvert < \epsilon\,\right], \tag{1}$$

where $\mathbb{1}[\cdot]$ is the indicator function and $\epsilon$ is the numerical tolerance (1e-6, see Section 4.1). As the final score of each metric, we use a weighted sum of the per-layer consistency rates, where the weights increase with the depth of the layer and sum to one:

$$\mathrm{tsc} = \sum_{l=1}^{N} w_l\,\mathrm{tsc}_l, \qquad \mathrm{tvc} = \sum_{l=1}^{N} w_l\,\mathrm{tvc}_l, \qquad w_l = \frac{\gamma^{l}}{\sum_{k=1}^{N}\gamma^{k}},$$

where $N$ is the number of layers in the model and $\gamma > 1$ is a tunable hyperparameter that controls how quickly the weight increases with depth; we fix $\gamma$ in our experiments.
An example of the calculation is shown in Figure3, using model LinkX [15] as an example. Meanwhile, we also include the “average running success rate” as the basic metric for the model architecture implementation task, which is the same as the formula implementation task.
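A minimal sketch of this scoring scheme is given below; it mirrors the per-layer consistency computation and the depth-weighted aggregation described above, where the geometric weighting is our reading of "weights increasing with depth" and the value gamma = 2.0 is illustrative.

```python
import numpy as np

def layer_consistency(gt: np.ndarray, gen: np.ndarray, tol: float = 1e-6):
    """Shape and value consistency for one layer; the shorter tensor is zero-padded."""
    tsc = float(gt.shape == gen.shape)
    a, b = gt.ravel().astype(float), gen.ravel().astype(float)
    n = max(a.size, b.size)
    a = np.pad(a, (0, n - a.size))
    b = np.pad(b, (0, n - b.size))
    tvc = float(np.mean(np.abs(a - b) < tol))
    return tsc, tvc

def weighted_score(per_layer_scores, gamma: float = 2.0):
    """Aggregate per-layer scores with depth-increasing weights w_l proportional to
    gamma**l, normalised to sum to one (gamma = 2.0 is an illustrative value)."""
    n = len(per_layer_scores)
    w = np.array([gamma ** l for l in range(1, n + 1)], dtype=float)
    w /= w.sum()
    return float(np.dot(w, per_layer_scores))
```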
4 Experiments
4.1 Experimental Settings
As R&D tasks involve numeric inputs and outputs, we set a numerical tolerance of 1e-6 for evaluating the implementation of methods. We use GPT-4-turbo [22], GPT-4-32k [22], GPT-35-turbo-16k [22], and Llama2 [42] as base models in the experiments. All the methods mentioned above and their corresponding results are executed with the Azure OpenAI API. No external data, resources, human feedback, or internet access is involved in the experiments. We perform 20 independent attempts for each model and calculate the average and maximum value of each metric. As most of our input data is encoded as document files, we first use parsing tools to extract the text content; the Azure Document Intelligence API (4.0) is used for parsing reports and academic papers in PDF format.
4.2 Results of Method Extraction
We evaluate the information extraction ability of the models. As shown in Table 1, GPT-4-turbo, GPT-4-32k, LLaMa3-70b, and LLaMa2-70b achieve competitive performance, which makes it possible for them to perform information extraction automatically. The performance of these four foundation models is stronger than that of Phi3-4k, indicating that more future effort is needed to improve the extraction ability of Phi3-4k.
Table 1: Results of method extraction.

Model        Precision  Recall  F1
GPT-4-turbo  0.818      0.800   0.809
GPT-4-32k    0.818      0.818   0.818
LLaMa3-70b   0.909      0.833   0.869
LLaMa2-70b   0.818      0.900   0.857
Phi3-4k      0.636      0.750   0.688
4.3 Results of Formula Implementation
In this section, we compare the performance of different models on the formula implementation task, using the proposed metrics. The results are shown in Table 2 and Table 3. We observe that GPT-4-turbo achieves better performance than GPT-35-turbo and Phi3-128k. Overall, the experimental results indicate ample room for further research, given the difficulty of the task and the challenges in automating R&D. Specifically, we obtain the following four major findings from the experimental results.
Table 2: Formula implementation results of GPT-4-turbo.

Data      Difficulty  Formula                Avg. Exec.  Avg. Format  Avg. Corr.  Max. Corr.
Data I    Easy        PB_ROE                 0.650       0.050        0.852       0.852
                      PB_ROE_2               0.600       0.200        0.875       1.000
                      PB_ROE_3               0.600       0.300        0.726       1.000
          Medium      ROE_movement           0.950       0.750        0.934       1.000
                      ROE_movement_10        0.900       0.800        0.803       1.000
                      ROE_movement_20        0.950       0.750        0.703       1.000
          Hard        PB_ROE_movement        0.600       0.450        0.516       0.897
                      PB_ROE_movement_10     0.650       0.300        0.327       0.896
                      PB_ROE_movement_20     0.550       0.500        0.244       0.896
Data II   Easy        mid_price              0.800       0.100        1.000       1.000
                      mid_price_2            0.850       0.000        NaN         NaN
                      mid_price_3            0.850       0.000        NaN         NaN
          Medium      liquidity_imbalance    0.500       0.050        1.000       1.000
                      liquidity_imbalance_2  0.900       0.150        0.694       1.000
                      liquidity_imbalance_3  0.450       0.100        1.000       1.000
          Hard        micro_price            0.850       0.000        NaN         NaN
                      micro_price_2          0.600       0.000        NaN         NaN
                      micro_price_3          0.600       0.100        1.000       1.000
Data III  Easy        alpha053               0.950       0.700        0.933       1.000
                      alpha053_15            0.950       0.650        0.872       1.000
                      alpha053_5             1.000       0.650        0.676       1.000
          Medium      alpha_pv_diff          1.000       0.600        0.513       1.000
                      alpha_pv_diff_15       0.950       0.750        0.258       1.000
                      alpha_pv_diff_20       1.000       0.750        0.441       1.000
          Hard        alpha_pv_diff_pct      0.950       0.700        0.375       1.000
                      alpha_pv_diff_pct_15   0.900       0.450        0.236       1.000
                      alpha_pv_diff_pct_20   1.000       0.350        0.358       1.000
Overall   N/A         Avg. Data I            0.717       0.456        0.665       0.949
                      Avg. Data II           0.711       0.056        0.522       0.556
                      Avg. Data III          0.967       0.622        0.518       1.000
                      Mean Value             0.798       0.378        0.568       0.835
LLM agents hold promising potential to tackle D-CARD. We can observe from Table 2 and Table 3 that GPT-4 is able to tackle some simple D-CARD cases without adopting any additional techniques. Specifically, GPT-4-turbo achieves the maximum correlation with the ground-truth results when implementing easy and medium formulas. However, GPT-4 fails to precisely match the exact ground-truth values due to minor mistakes, such as missing domain common knowledge (e.g., using the percentage change rather than the difference when calculating growth), mismatching the output format, and unnecessarily introducing additional computational operations.
Precisely understanding and selecting data requires more detailed data information in D-CARD. As shown in Table 2, we observe a special situation in which GPT-4 significantly fails to implement a simple formula while succeeding on harder ones. After analyzing the generated code, we find that GPT-4 confuses the different semantic meanings of data features whose natural language descriptions are close, which renders the subsequent calculation ineffective. For example, GPT-4 confuses the terms "volume" and "volatility" and always opts to use the "volume" data when "volatility" is required. If we manually improve the initial prompt by adding a more detailed description, GPT-4 understands the semantic difference and obtains over 99% value accuracy.
The ability to query domain-specific knowledge is a basic requirement of D-CARD methods. As mentioned in the first finding, missing domain common knowledge prevents GPT-4 from calculating precisely matched final results. Additionally, we find that implementing some operations in a formula also requires domain-specific knowledge. For example, in the financial domain, the description "x cross-sectionally neutralized against groups g" is clear enough for financial practitioners to implement the operation "IndNeutralize(x, g)". However, in the code generated by GPT-4, the model defines a function named "IndNeutralize(series, industry)" and leaves its body blank, merely adding the comment "Please replace this with your actual function definition".
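For reference, the conventional implementation of such a neutralization is a within-group demeaning; the sketch below reflects our reading of the description "x cross-sectionally neutralized against groups g", not necessarily the exact ground-truth code used in the benchmark.

```python
import pandas as pd

def ind_neutralize(x: pd.Series, groups: pd.Series) -> pd.Series:
    # Demean x within each group of g: the conventional reading of
    # "x cross-sectionally neutralized against groups g".
    # In a panel setting, this would be applied date by date.
    return x - x.groupby(groups).transform("mean")
```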
The more complex the method is, the more unstable the model performance is. As shown in the "avg. exec.", "avg. form.", and "avg. corr." columns of Table 2, the variance of GPT-4's performance grows significantly as the complexity of the formulas increases. Across 20 executions, GPT-4 generates successfully executed code 18 times when implementing the medium mid_price but only three times when implementing the hard alpha_pv.
Table 3: Formula implementation results of different base models.

Model         Data          avg. exec.  avg. format  avg. corr.  max. corr.
GPT-4-turbo   Data I        0.717       0.456        0.665       0.949
              Data II       0.711       0.056        0.522       0.556
              Data III      0.967       0.622        0.518       1.000
              Mean Value    0.798       0.378        0.568       0.835
GPT-35-turbo  Data I        0.556       0.100        0.323       0.453
              Data II       0.567       0.000        0.000       0.000
              Data III      0.767       0.389        0.431       0.696
              Mean Value    0.630       0.163        0.251       0.383
Phi3-128k     Data I        0.117       0.111        0.186       0.222
              Data II       0.172       0.000        0.000       0.000
              Data III      0.056       0.022        0.063       0.084
              Mean Value    0.115       0.044        0.083       0.102
As shown in Table 3, the performance of GPT-35-turbo and Phi3-128k is poor; they often fail even to produce executable code. In contrast, the GPT-4 models in Table 2 perform much better. This indicates that performance on the data-centric R&D task is highly related to the model's pre-training and capacity. We therefore posit that continually training and improving the foundation models is a promising direction for future research on data-centric R&D tasks.
4.4 Results of Model Architecture Implementation
In this section, we compare the performance of different LLMs in the model architecture implementation task and summarize the results in Tables 4 and 5. As shown in the tables, GPT-4-turbo, GPT-35-turbo-16k, and GPT-4-32k have similar running success rates but differ substantially in tvc. and tsc. LLaMa-2-70b has the lowest running success rate and the lowest scores on the other metrics. Notice that although a significant gap still exists between GPT-35, LLaMa-2, and GPT-4, it is much smaller than the gap in the formula implementation task, and the overall running success rates are also higher. Overall, we reach similar observations in the model architecture implementation task as in the formula implementation task.
5 Limitation
The Bench framework, while innovative, evaluates only the most representative base LLMs, such as GPT-4, LLaMa3, LLaMa2, and GPT-35, without further evaluating more open-source models. Meanwhile, this paper includes only the most representative R&D domains and problems and focuses on data-driven scenarios; the benchmark can be extended in the future to demonstrate its generalizability. We will provide more details and a more comprehensive evaluation in future work. We believe the benchmark will be a valuable tool for the community to evaluate model performance on data-centric R&D tasks and to develop new models and techniques that address the challenges and opportunities in this domain.
6 Conclusion
In this paper, we make the first effort to tackle the real-world data-centric automatic R&D scenario in the hope of significantly improving the research efficiency of scientists and thus contributing to a revolution in human productivity. Specifically, we propose Bench, which benchmarks all the operations in D-CARD as a whole to navigate future work directly toward the ultimate goal of automating data-centric R&D. Bench focuses on evaluating the interaction and synergistic effects of various model capabilities and on aiding the selection of well-performing trustworthy models. Based on Bench, we find that although the SOTA GPT-4 shows promising potential in tackling D-CARD, ample room remains for future work.
References
- [1]Anonymous.MetaTool benchmark: Deciding whether to use tools and which to use.In The Twelfth International Conference on Learning Representations, 2024.
- [2] Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, December 2023.
- [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
- [4]Erik Brynjolfsson and Andrew McAfee.The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies.WW Norton & Company, 2014.
- [5]Haotian Chen, Bingsheng Chen, and Xiangdong Zhou.Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6418–6435, Toronto, Canada, 2023. Association for Computational Linguistics.
- [6]Peng Cui and Susan Athey.Stable Learning Establishes some Common Ground between Causal Inference and Machine Learning.Nature Machine Intelligence, 4(2):110–115, February 2022.
- [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina N. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- [8] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
- [9] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, November 2020.
- [10]Alessio Gravina, Davide Bacciu, and Claudio Gallicchio.Anti-symmetric DGN: a stable architecture for deep graph networks.In The Eleventh International Conference on Learning Representations, 2023.
- [11] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [12] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?, 2023.
- [13] Rajendra P. Joshi and Neeraj Kumar. Artificial intelligence for autonomous molecular design: A perspective. Molecules, 26(22):6761, November 2021.
- [14] Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, and Katia P. Sycara. Theory of mind for multi-agent collaboration via large language models. In Conference on Empirical Methods in Natural Language Processing, 2023.
- [15] Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser-Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. Advances in Neural Information Processing Systems, 34:20887–20902, 2021.
- [16]Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein.ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks, November 2023.
- [17]Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein.Ml-bench: Large language models leverage open-source libraries for machine learning tasks, 2023.
- [18]Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang.A survey of deep learning for mathematical reasoning.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2023.
- [19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
- [20] Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1896–1906, Melbourne, Australia, 2018. Association for Computational Linguistics.
- [21] OpenAI. GPT-4 technical report, March 2023.
- [22] OpenAI. GPT-4 technical report, 2023.
- [23] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, March 2022.
- [24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- [25] Franci Penov, Yohei Nakajima, Malik M. Alnakhaleh, Alexander Dibrov, Shukri, Frank Chen, Anton Troynikov, David Byttow, John Cao, Felipe Schieber, Josh XT, Minsu Yeom, Zain Hasan, Zeel Sheladiya, jmtatsch, Aidan Rauscher, Thiago Alves, jakvb, Jason Banich, Muhamed AlGhzawi, Peter Banda, TungusSs, Lorenzo Fontoura, Joe Heitzeberg, Jay Scambler, Ikko Eltociear Ashimine, Cs4K1Sr4C, Mike Crawford, Michele Bellitti, and swyx.io. yoheinakajima/babyagi, January 2024.
- [26]Carlota Perez.Technological Revolutions and Financial Capital.Edward Elgar Publishing, 2003.
- [27]Karl Popper.The Logic of Scientific Discovery.Routledge, 2005.
- [28]Cheng Qian, Chi Han, YiFung, Yujia Qin, Zhiyuan Liu, and Heng Ji.Creator: Tool creation for disentangling abstract and concrete reasoning of large language models.In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023.
- [29]Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang.Is ChatGPT a General-Purpose Natural Language Processing Task Solver?, February 2023.
- [30] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023.
- [31] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023.
- [32]Ladislav Rampášek, Mikhail Galkin, VijayPrakash Dwivedi, AnhTuan Luu, Guy Wolf, and Dominique Beaini.Recipe for a General, Powerful, Scalable Graph Transformer.Advances in Neural Information Processing Systems, 35, 2022.
- [33]Gustav Ranis and John C.H. Fei.A theory of economic development.The American Economic Review, 51(4):533–565, 1961.
- [34] Emanuele Rossi, Bertrand Charpentier, Francesco Di Giovanni, Fabrizio Frasca, Stephan Günnemann, and Michael Bronstein. Edge directionality improves learning on heterophilic graphs, 2023.
- [35]Gisbert Schneider.Automating drug discovery.Nature Reviews Drug Discovery, 17(2):97–113, December 2017.
- [36]Dudley Shapere.The structure of scientific revolutions.The Philosophical Review, 73(3):383–394, 1964.
- [37] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face, 2023.
- [38] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [39] Adam Smith. The Wealth of Nations [1776]. 1937.
- [40] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, and Aditya Gupta. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023.
- [41]Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Zhiyuan Liu, and Maosong Sun.Debugbench: Evaluating debugging capability of large language models, 2024.
- [42]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, and Bhosale.Llama 2: Open foundation and fine-tuned chat models, 2023.
- [43]Somin Wadhwa, Silvio Amir, and Byron Wallace.Revisiting Relation Extraction in the era of Large Language Models.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15566–15589, Toronto, Canada, 2023. Association for Computational Linguistics.
- [44]Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar.Voyager: An open-ended embodied agent with large language models, 2023.
- [45]Haiming Wang, YeYuan, Zhengying Liu, Jianhao Shen, Yichun Yin, Jing Xiong, Enze Xie, Han Shi, Yujun Li, Lin Li, Jian Yin, Zhenguo Li, and Xiaodan Liang.Dt-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2023.
- [46]Tianlu Wang, Rohit Sridhar, Diyi Yang, and Xuezhi Wang.Identifying and mitigating spurious correlations for improving robustness in NLP models.In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1719–1729, Seattle, United States, July 2022. Association for Computational Linguistics.
- [47]Yusong Wang, Tong Wang, Shaoning Li, Xinheng He, Mingyu Li, Zun Wang, Nanning Zheng, Bin Shao, and Tie-Yan Liu.Enhancing geometric representations for molecules with equivariant vector-scalar interactive message passing.Nature Communications, 15(1), January 2024.
- [48]Daniel Whalen.Holophrasm: a neural automated theorem prover for higher-order logic, 2016.
- [49] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation, 2023.
- [50]Chenxiao Yang, Qitian Wu, Jiahua Wang, and Junchi Yan.Graph neural networks are inherently good generalizers: Insights by bridging gnns and mlps.In International Conference on Learning Representations (ICLR), 2023.
- [51]Hui Yang, Sifu Yue, and Yunzhong He.Auto-gpt for online decision making: Benchmarks and additional opinions, 2023.
- [52] Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [53] Xu Yang, Xiao Yang, Weiqing Liu, Jinhui Li, Peng Yu, Zeqi Ye, and Jiang Bian. Leveraging large language model for automatic evolving of industrial data-centric R&D cycle, 2023.
- [54] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023.
- [55]Xingxuan Zhang, Peng Cui, Renzhe Xu, Linjun Zhou, Yue He, and Zheyan Shen.Deep Stable Learning for Out-Of-Distribution Generalization.In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5368–5378, Nashville, TN, USA, June 2021. IEEE.
- [56] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2023.
- [57]Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang.ToolQA: A Dataset for LLM Question Answering with External Tools, June 2023.
- [58] Barret Zoph, Colin Raffel, Dale Schuurmans, Dani Yogatama, Denny Zhou, Don Metzler, Ed H. Chi, Jason Wei, Jeff Dean, Liam B. Fedus, Maarten Paul Bosma, Oriol Vinyals, Percy Liang, Sebastian Borgeaud, Tatsunori B. Hashimoto, and Yi Tay. Emergent abilities of large language models. TMLR, 2022.
Appendix A Formula implementation task metrics calculation details
As mentioned above, we report multiple metrics (the average and maximum score across multiple independent attempts of the "running success rate", "format success rate", "Pearson correlation", and "value accuracy"). Assume the ground-truth factor value is $v^{gt}$ with length $T$ (the length of the time series) and the generated factor value is $v^{gen}$. The metrics are calculated as follows:
Running success is defined as successful execution. Any error that occurs in the Python interpreter during the execution that stops the execution is considered a failure. We calculate the ratio of the number of successful execution times to the total number of attempts, denoted as avg. exe.
Pearson correlation is the correlation between the ground truth factor value and the generated factor value.
Format success is defined as successful format matching, which means the final output dataframe format is (datetime, factor_name). We calculate the ratio of the number of matched result formats to the total number of attempts, denoted as avg. form.
Value accuracy is the accuracy of the generated factor value, which can be formulated as

$$\mathrm{acc.} = \frac{1}{T}\sum_{t=1}^{T}\mathbb{1}\!\left[\,\lvert v^{gt}_t - v^{gen}_t\rvert < \epsilon\,\right].$$
Please note that we set the tolerance for the value accuracy as 1e-6 in this paper, which means two values are considered as equal if the absolute difference is less than 1e-6.
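A minimal sketch of the per-attempt correlation and value-accuracy computation, using the 1e-6 tolerance stated above, is:

```python
import numpy as np

TOL = 1e-6  # tolerance used throughout the paper

def correlation_and_accuracy(v_gt: np.ndarray, v_gen: np.ndarray):
    """Pearson correlation and value accuracy for one attempt."""
    corr = float(np.corrcoef(v_gt, v_gen)[0, 1])
    acc = float(np.mean(np.abs(v_gt - v_gen) < TOL))
    return corr, acc

# avg. exe. and avg. form. are simple ratios over the independent attempts, e.g.:
# avg_exe = n_successful_runs / n_attempts
# avg_form = n_format_matches / n_attempts
```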
Appendix B Data collection details
As mentioned in the previous section, we collected papers [10, 34, 32, 15, 50, 47] and the corresponding code implemented with PyG [8], which are listed in the following table.
Appendix C Broader Impact
The proposed Bench has the potential to significantly impact the scientific community and industries reliant on R&D. By automating the tedious aspects of R&D, researchers can focus on the more creative and innovative aspects of their work, potentially accelerating the pace of discoveries. Smaller institutions or individual researchers with limited resources might benefit from automated tools that reduce the need for extensive human labor, making high-level R&D more accessible. Automation of R&D can reduce costs and time-to-market for new technologies, fostering faster economic growth and competitiveness.
Appendix D Prompts
The prompt for the formula implementation task is as follows:
The user is trying to implement some factors in quant investment, and you are the one to help write the Python code.
The user will provide the source data in HDF5(H5) format which you can load using pandas.read_hdf. The file is located near your Python code file which you can read from "./source_data.h5". After that, you will get a pandas dataframe with the following format:
open,close,high,low,volume,vwap,cap,IndClass.industry IndClass.sector,returns,date,instruments
2020-01-02,SH600000,158.538132,158.538132,160.699432,158.283859,4060945.0,159.431900,647446144.0,1.0,NaN
The explanation of the example column names:
1: returns: daily close-to-close returns
2: open, close, high, low, volume: standard definitions for daily price and volume data
3: vwap: daily volume-weighted average price
4: cap: market capitalization is the total value of a company’s outstanding shares of stock
5: IndClass.industry and IndClass.sector: a generic placeholder for a binary industry classification such as GICS, BICS, NAICS, SIC, etc., in indneutralize(x, IndClass.level), where level: sector, industry, etc. Multiple IndClass in the same alpha need not correspond to the same industry classification.
The user will provide you with a formulation of the factor, which contains some function calls and operators. You need to implement the function calls and operators in Python. Your code is expected to align the formulation in any form which means the user needs to get the exact factor values with your code as expected.
Your code should contain the following parts: the import part, the function part, and the main part. You should write a main function named "calculate_{function_name}" and call this function in the "if __name__ == __main__" part. Don’t write any try-except block in your code. The user will catch the exception message and provide feedback to you.
User will write your code into a python file and execute the file directly with "python {your_file_name}.py". You should calculate the factor values and save the result into an HDF5(H5) file named "result.h5" in the same directory as your python file. The result file is an HDF5(H5) file containing a pandas dataframe. The index of the dataframe is the date and instrument, and the single column name is the factor name,and the value is the factor value. The result file should be saved in the same directory as your python file.
To help you write the correct code, the user might provide multiple pieces of information that help you write the correct code:
1. The user might provide you the correct code to similar factors. You should learn from these code to write the correct code.
2. The user might provide you the failed former code and the corresponding feedback to the code. The feedback contains information about the execution, the code, and the factor value. You should analyze the feedback and try to correct the latest code.
3. The user might provide you with suggestions for the latest failed code and some similar failed-to-correct pairs. Each pair contains the failed code with a similar error and the corresponding corrected version of the code. You should learn from these suggestions to write the correct code.
Please respond to the code in the following JSON format. Here is an example structure for the JSON output:
{
"code": "The Python code as a string."
}
The prompt for the model architecture implementation task is as follows:
The user is trying to implement some models or layers in deep learning, specifically in the graph learning area, and you are the one to help write the Python code.
Use PyTorch and PyG (torch_geometric) framework to implement it. You can assume the input will contain node feature X [num_nodes, feature_dim], edge_index [2, num_edges], edge_feature [num_edges, num_edge_features], y [num_nodes, *] when it is node-level targets or graph-level targets of shape [1, *], pos (node position matrix) [num_nodes, position_dim].
The user will provide you with a formulation of the model/layer. You need to implement it in Python.
Your code should contain the following parts: the import part, the function part, and the main part. You should write a main function named "calculate_function_name" and call this function in the "if __name__ == ’__main__’" part. Don’t write any try-except blocks in your code. The user will catch the exception message and provide the feedback to you.
User will write your code into a python file and execute the file directly with "python {your_file_name}.py".
Please respond with the code in the following JSON format. Here is an example structure for the JSON output:
{
"code": "The Python code as a string."
}