Hugging Face Clones OpenAI's Deep Research In 24 Hours
Posted by BeauHD on Thursday February 06, 2025 @04:08PM from the that-was-quick dept.

An anonymous reader quotes a report from Ars Technica:

On Tuesday, Hugging Face researchers released an open source AI research agent called "Open Deep Research," created by an in-house team as a challenge 24 hours after the launch of OpenAI's Deep Research feature, which can autonomously browse the web and create research reports. The project seeks to match Deep Research's performance while making the technology freely available to developers. "While powerful LLMs are now freely available in open-source, OpenAI didn't disclose much about the agentic framework underlying Deep Research," writes Hugging Face on its announcement page. "So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!"

Similar to both OpenAI's Deep Research and Google's implementation of its own "Deep Research" using Gemini (first introduced in December -- before OpenAI), Hugging Face's solution adds an "agent" framework to an existing AI model so it can perform multi-step tasks, such as collecting information and incrementally building the report that it presents to the user at the end. The open source clone is already racking up comparable benchmark results. After only a day's work, Hugging Face's Open Deep Research has reached 55.15 percent accuracy on the General AI Assistants (GAIA) benchmark, which tests an AI model's ability to gather and synthesize information from multiple sources. OpenAI's Deep Research scored 67.36 percent accuracy on the same benchmark with a single-pass response (OpenAI's score rose to 72.57 percent when 64 responses were combined using a consensus mechanism).

As Hugging Face points out in its post, GAIA includes complex multi-step questions such as this one: "Which of the fruits shown in the 2008 painting 'Embroidery from Uzbekistan' were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film 'The Last Voyage'? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o'clock position. Use the plural form of each fruit." To answer that type of question correctly, the AI agent must seek out multiple disparate sources and assemble them into a coherent answer. Many of the questions in GAIA are no easy task even for a human, so they test agentic AI's mettle quite well.

Open Deep Research "builds on OpenAI's large language models (such as GPT-4o) or simulated reasoning models (such as o1 and o3-mini) through an API," notes Ars. "But it can also be adapted to open-weights AI models. The novel part here is the agentic structure that holds it all together and allows an AI language model to autonomously complete a research task." The code has been made public on GitHub.
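The agentic pattern Ars describes boils down to a loop in which a language model plans a step, calls a tool such as web search, folds the result back into its context, and repeats until it can write the report. The snippet below is only an illustrative sketch of that control flow under stated assumptions, not Hugging Face's actual framework; the call_llm and web_search helpers are hypothetical placeholders for whatever model API and search backend you plug in.

import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to any LLM API and return its text reply."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical placeholder: return a text summary of web results for `query`."""
    raise NotImplementedError

def research_agent(question: str, max_steps: int = 10) -> str:
    """Minimal agent loop: the model alternates between searching and writing a report."""
    notes: list[str] = []
    for _ in range(max_steps):
        # Ask the model for its next action as JSON: either another search or a final report.
        decision = call_llm(
            "You are a research agent. Question: " + question + "\n"
            "Notes so far:\n" + "\n".join(notes) + "\n"
            'Reply with JSON: {"action": "search", "query": "..."} '
            'or {"action": "answer", "report": "..."}'
        )
        step = json.loads(decision)
        if step["action"] == "search":
            # Execute the tool call and add the result to the agent's working notes.
            notes.append(f"Search '{step['query']}': {web_search(step['query'])}")
        else:
            return step["report"]
    return "Step budget exhausted; partial notes:\n" + "\n".join(notes)

Production frameworks add structured tool schemas, page browsing, retries, and answer validation on top of this skeleton, but the core is the same multi-step gather-then-report loop the article describes.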
OpenAI Researchers Find That Even the Best AI Is "Unable To Solve the Majority" of Coding Problems

OpenAI researchers have admitted that even the most advanced AI models still are no match for human coders — even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year. In a new paper, the company's researchers found that even frontier models, or the most advanced and boundary-pushing AI systems, "are still unable to solve the majority" of coding tasks.

The researchers used a newly developed benchmark called SWE-Lancer, built on more than 1,400 software engineering tasks from the freelance site Upwork. Using the benchmark, OpenAI put three large language models (LLMs) — its own o1 reasoning model and flagship GPT-4o, as well as Anthropic's Claude 3.5 Sonnet — to the test. Specifically, the new benchmark evaluated how well the LLMs performed on two types of tasks from Upwork: individual tasks, which involved resolving bugs and implementing fixes for them, and management tasks, which saw the models trying to zoom out and make higher-level decisions. (The models weren't allowed to access the internet, meaning they couldn't simply crib similar answers that had been posted online.)

The models took on tasks cumulatively worth hundreds of thousands of dollars on Upwork, but they were only able to fix surface-level software issues, while remaining unable to actually find bugs in larger projects or trace their root causes. These shoddy and half-baked "solutions" are likely familiar to anyone who's worked with AI — which is great at spitting out confident-sounding information that often falls apart on closer inspection.

Though all three LLMs were often able to operate "far faster than a human would," the paper notes, they also failed to grasp how widespread bugs were or to understand their context, "leading to solutions that are incorrect or insufficiently comprehensive." As the researchers explained, Claude 3.5 Sonnet performed better than the two OpenAI models pitted against it and earned more money than o1 and GPT-4o. Still, the majority of its answers were wrong, and according to the researchers, any model would need "higher reliability" to be trusted with real-life coding tasks.

Put more plainly, the paper seems to demonstrate that although these frontier models can work quickly and solve narrow, zoomed-in tasks, they're nowhere near as skilled at handling them as human engineers. Though these LLMs have advanced rapidly over the past few years and will likely continue to do so, they're not yet skilled enough at software engineering to replace real-life people — not that that's stopping CEOs from firing their human coders in favor of immature AI models.
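The grading setup the article implies (give a model a real freelance task, apply its proposed fix, and check whether the work is actually done, crediting the task's dollar value if so) can be sketched in a small harness. This is a hedged illustration only, not OpenAI's SWE-Lancer code; the repository layout, patch format, test command, and payout fields are assumptions made for the example.

import shutil
import subprocess
import tempfile
from dataclasses import dataclass

@dataclass
class FreelanceTask:
    """One benchmark item: a repo snapshot, a hidden test command, and the task's payout."""
    repo_path: str      # assumed: local checkout of the project at the buggy revision
    test_command: list  # assumed: e.g. ["pytest", "-q"] or ["npm", "test"]
    payout_usd: float   # dollar value the task was worth on the freelance site

def grade_patch(task: FreelanceTask, patch_file: str) -> float:
    """Apply the model's patch in an isolated copy of the repo and run the hidden tests.

    Returns the dollars 'earned': the full payout if the tests pass, otherwise 0.
    """
    workdir = tempfile.mkdtemp()
    try:
        shutil.copytree(task.repo_path, workdir, dirs_exist_ok=True)
        # Apply the model-generated unified diff; a malformed patch earns nothing.
        applied = subprocess.run(["git", "apply", patch_file], cwd=workdir)
        if applied.returncode != 0:
            return 0.0
        # Run the task's end-to-end tests; a passing suite counts as a solved task.
        try:
            result = subprocess.run(task.test_command, cwd=workdir, timeout=1800)
        except subprocess.TimeoutExpired:
            return 0.0
        return task.payout_usd if result.returncode == 0 else 0.0
    finally:
        shutil.rmtree(workdir, ignore_errors=True)

Summing grade_patch over all tasks yields the "money earned" style of metric the article alludes to when it says Claude 3.5 Sonnet made more money than o1 and GPT-4o.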
OpenAI Plots Charging $20,000 a Month For PhD-Level Agents
Posted by msmash on Wednesday March 05, 2025 @12:00PM from the up-next dept.

OpenAI is preparing to launch a tiered pricing structure for its AI agent products, with high-end research assistants potentially costing $20,000 per month, according to The Information. The AI startup, which already generates approximately $4 billion in annualized revenue from ChatGPT, plans three service levels: $2,000 monthly agents for "high-income knowledge workers," $10,000 monthly agents for software development, and $20,000 monthly PhD-level research agents. OpenAI has told some investors that agent products could eventually constitute 20-25% of company revenue, the report added.
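To put that 20-25% revenue share in rough perspective, here is a back-of-envelope calculation using only the figures quoted above plus one simplifying assumption (that all agent revenue came from the top tier, which is certainly not the plan); the seat counts it prints are illustrative, not from the report.

# Back-of-envelope: how many $20,000/month agent seats would it take for agent
# products to reach 20-25% of revenue, if total revenue stayed near the
# reported ~$4B annualized figure? (Assumption: top tier only.)
annual_revenue = 4_000_000_000   # ~$4B annualized revenue, per the report
top_tier_annual = 20_000 * 12    # $240,000 per PhD-level agent per year

for share in (0.20, 0.25):
    agent_revenue = annual_revenue * share
    seats = agent_revenue / top_tier_annual
    print(f"{share:.0%} of revenue -> ${agent_revenue / 1e9:.1f}B/yr -> ~{seats:,.0f} top-tier seats")

Under those assumptions, a few thousand top-tier subscriptions would already reach that share; in practice a mix across the $2,000 and $10,000 tiers would change the count substantially.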