How Elon Musk’s Supercomputer Freaked Out AI Rivals

Elon Musk has stunned rivals by building a supercomputer for xAI bigger and faster than ever before, fueling a race to supersize data centers by OpenAI and others.

Anissa Gardizy · 2024.11.13

On a sunny day last month, a propeller plane made multiple passes in the air over a large industrial building surrounded by grassy fields near downtown Memphis, Tenn., as its passengers snapped photographs and videos of the facility.

It was a secret reconnaissance mission. Elon Musk had recently converted the building from a former manufacturing plant for home appliances into a data center that housed one of the largest clusters of servers in the world for training artificial intelligence models. The speed with which Musk had built that AI supercomputer for his latest startup, xAI, had sparked anxiety and confusion among leaders at rivals such as OpenAI.

The passengers on the plane, employees of a data center competitor, were attempting to gain whatever insights they could about the operations of the closely guarded facility, according to someone with direct knowledge of the flight and photos viewed by The Information. They took note of the bevy of gas-powered turbines Musk had trucked in to provide the facility with power, and they looked for clues about how xAI was controlling the heat put out by the servers inside the building, the person said.

The Takeaway

Elon Musk has managed to create a giant supercomputer for xAI in record time by disregarding some of the customary methods and precautions companies use to build data centers.

The spy plane in Memphis is one sign of the high stakes involved in one of the most expensive races in the history of technology.
Microsoft, Meta Platforms, Google and Amazon are each pouring tens of billions of dollars into new data centers to power the advanced new forms of AI that undergird ChatGPT and other applications.

It's a risky bet based on a simple belief that the bigger a cluster of servers is, the better the AI it can produce. The scramble to supersize those clusters began in late 2022 with OpenAI's launch of ChatGPT, the chatbot whose popularity sent shock waves across the tech industry.

Musk, who co-founded and initially funded OpenAI but later split from it, joined the data center race after it had already started. But through a combination of ambition, indefatigability and disregard for some of the conventional methods of building data centers, he has still managed to make a major splash.

Two things about Musk's supercomputer have jolted competitors: its size and the speed with which xAI built it. The supercomputer, fittingly known as Colossus, consists of 100,000 graphics processing units, the chips best suited to training and running AI software. That is several times bigger than similar supercomputers built in the past by Meta and other tech giants.

Stringing together so many GPUs into a single supercomputer isn't as simple as it sounds because of how much power the servers consume and bottlenecks in the networking equipment used to connect the chips to each other. And completing the project as quickly as xAI did is unheard of.

Musk and Nvidia, the AI chip powerhouse that supplied the GPUs for Colossus, said the data center and supercomputer were built in just 122 days. On a recent podcast, Nvidia CEO Jensen Huang said a GPU cluster of that size would normally take three years to plan and design and an additional year to get working.

"No question that nobody slept," Huang said of the Colossus project.

"As far as I know, there's only one person in the world who could do that," Huang added. "Elon is singular in this understanding of engineering and construction and large systems and marshaling resources."

Musk appears to have built the Memphis data center so quickly in part by cutting a few key corners, for example by moving forward without having secured enough power from the electric grid to run Colossus. But defying such norms is part of the playbook Musk has used again and again at his other companies.

At Tesla, for example, he once sidestepped the need for permits to expand a car factory in California by setting up an assembly line in a parking lot for its Model 3 vehicle. And at SpaceX, he constantly pushes engineers to get rid of parts on its rockets that he views as unnecessary or to use cheaper components that were not designed for space applications.

Even though xAI's AI tools are still far behind those of OpenAI, the speed with which he built his supercomputer raised alarms with Sam Altman, OpenAI's CEO. After Musk posted on X about it, Altman got into an argument with infrastructure executives at Microsoft, telling them he was concerned xAI was moving faster than Microsoft, according to a person who heard his remarks.

He worried that xAI could soon have a more powerful supercomputer than OpenAI did. That concern has prompted OpenAI to seek alternatives to Microsoft for the first time.

Right now, one of those alternatives is under construction along a dusty stretch of flat land in Abilene, Texas, roughly three hours west of Dallas, where a group of companies is preparing the site for a data center that will eventually house a 100,000-chip cluster for OpenAI next year.

A data center under construction in Abilene, Texas, that will house a 100,000-chip OpenAI supercomputer next year. Photo by DPR Construction

Construction at the site is moving quickly. On a recent tour, a guide who worked for one of the project's contractors pointed out how most of the buildings don't yet have walls on all of their sides.
The contractors are building most of the facility's components off-site so that they can install parts quickly when they arrive.

It may not be long before even the supercomputers in Abilene and Memphis look relatively small. Some big tech companies, including Microsoft, have discussed data center projects that would contain millions of GPUs and cost more than $100 billion each.

The one-upmanship is likely to continue, as nearly everyone in the data center business is keeping close tabs on what their competitors are up to.

"The data center market is very small, and everybody is paying attention to what is going on," said John Arcello, who leads the advanced data center team at DPR Construction, which builds data centers for large companies including Meta and is working on the Abilene project.

'Gigafactory of Compute'

Early this year, Musk began putting together the computing horsepower he needed to build xAI, a company he had founded in 2023, into a serious contender in AI. At the time, he was already renting GPUs from Oracle to train the initial version of Grok, xAI's large language model.

To improve Grok's quality, he needed access to much more computing capacity. In May, he held a video call with prospective xAI investors as part of an effort to raise billions of dollars for the startup. He laid out a vision to them of building the world's largest supercomputer, which he called the "gigafactory of compute" (a reference to Tesla's gargantuan factories around the world), according to an investor who attended the meeting.

Huddled around a table with fewer than a dozen other xAI employees, Musk revealed his plan to connect 100,000 of Nvidia's H100s, the most advanced GPUs from the company then on the market, into a single cluster.
A chart on the screen said xAI would build its supercomputer in about one-fifth the time it would take most companies to do so.

One of the xAI slides said that the company was operating at "ludicrous speed" and promised that "Elon is personally responsible for delivering the data center on time."

Musk told investors he hadn't yet decided whether xAI would partner with a cloud provider on the project or proceed on its own.

A few weeks later, a handful of Oracle executives logged onto a video meeting with Musk to discuss the first option. Musk proposed that Oracle (whose founder, Larry Ellison, is a close friend of Musk) build xAI's supercomputer, which would have made the AI startup one of Oracle's largest customers, The Information previously reported.

The former Electrolux factory in Memphis that is now home to xAI's supercomputer. Photo by Bloomberg via Getty

Musk wanted the xAI data center to be located in a former Electrolux appliance manufacturing facility in Memphis, and he wanted it done by fall 2024. But on the call, Oracle executives told Musk they didn't think they could build it as quickly as he was demanding, according to someone who attended the meeting.

Oracle staff noted that the building Musk wanted to use didn't have access to enough power to support the number of chips he wanted Oracle to put in it, the person said. Musk quickly grew frustrated with the pushback from the Oracle executives.

Ultimately Musk decided xAI would work on the Memphis data center without Oracle.

"Oracle is a great company...but, when our fate depends on being the fastest by far, we must have our own hands on the steering wheel, rather than be a backseat driver," he said on X
following publication of a story in The Information about how the talks broke down.

Power Plans

To meet his aggressive timeline, Musk pushed local officials in Memphis to approve the data center in record time. Luckily for xAI, Memphis was eager to give him what he wanted so the city could attract his business.

"We worked even longer hours, took every text and phone call at any hour of the day so that we were reflecting this sense of drive that matches this company and its expectations," Ted Townsend, chair of the Greater Memphis Chamber, told the Daily Memphian.

In early June, Townsend announced publicly that Musk had picked Memphis as the location for xAI's supercomputer.

Over the next few weeks, Musk and his team at xAI gutted the Memphis manufacturing facility to make room for rows and rows of racks that would house the Nvidia GPUs. They set up electrical, mechanical and plumbing equipment and installed a water cooling system for the servers.

One major hitch for Musk's breakneck construction schedule could have been electrical power. Initially, there wasn't enough of it at the Memphis site for all of xAI's power-hungry GPUs. Normally, that's the kind of problem that could derail or delay a data center project.

But Musk came up with a stopgap solution: He brought in mobile, natural gas-powered turbines to provide supplemental power while he waited for local authorities to approve a request for an additional 100 megawatts of power at the site. The Tennessee Valley Authority agreed to that request last week.

Musk's move prompted immediate pushback from a coalition of local environmental groups, who wrote to the local health department that xAI was polluting the air by operating gas combustion turbines without permits.
An executive who works on data centers at Microsoft said there's no way the company would have been able to do something similar, given its climate goals and initiatives.

"To have basically an un-permitted power plant come in and set up shop is alarming and really disrespectful to the community," said Amanda Garcia, a senior attorney at the Southern Environmental Law Center, which opposed the Tennessee Valley Authority's decision. "Air pollution is a huge challenge in southwest Memphis."

Other factors likely helped Musk finish the project quickly. For example, executives in the data center business said Colossus almost certainly didn't have to undergo and pass any compliance tests before xAI could start using the cluster. That's mainly because xAI was planning to use the supercomputer for its own needs rather than renting it out to other clients.

In contrast, Microsoft has to undergo several data security tests before handing over servers to OpenAI or other Azure cloud customers, who expect a certain level of uptime or privacy standards, according to someone with direct knowledge of the process.

"We have all these different industry certifications that we have to pass," said Raul Martynek, CEO of data center operator DataBank. "I'll guarantee you the [xAI] data center can't pass those types of things."

Musk's effort to stand up Colossus has encountered a fair amount of skepticism. Several data center executives said it is extremely difficult to retrofit buildings, such as manufacturing plants, for GPU servers and liquid cooling systems. Over the past few months, the data center has experienced outages, according to two people who have spoken with xAI employees.

The problems haven't appeared to slow xAI down.
Musk and Nvidia have said they started the first training run for the next Grok model only 19 days after they brought the first server rack into the data center.

In a recent interview, Antonio Gracias, a close friend of Musk's and a longtime investor in his companies, including xAI, said xAI is rethinking the entire process of building data centers "from first principles" and trying to make it cheaper, better, faster.

"I've seen this movie at Tesla, SpaceX, other companies, where it's Elon, but it is also dozens of engineers being led in a mission to create the very best, most effective system possible," he said.

Peer Pressure

As word of Musk's supercomputer progress spread this summer, senior data center executives at Amazon, Microsoft and Google began calling staffers at Nvidia with versions of the same question: How was Musk able to move so quickly on his supercomputer?

Officials from some of these and other companies, including Meta, also called a small competing cloud provider to see whether the company could provide them with data center capacity faster than what they could build themselves, according to someone who spoke to the companies.

Their eagerness to unpack the mysteries of the project has only increased as more information about the Memphis data center has trickled out. Data center and cloud executives have closely studied images from the facility to see what they can glean about its design.

Musk has published several photos from inside the data center on X. And last month, a blogger published a video on YouTube after taking a tour of the Colossus data center (the video, an unusual peek behind the curtain of the facility, was sponsored by Super Micro Computer, which supplied some of the data center servers to Musk).

Inside xAI's Memphis data center. Photo via YouTube/ServeTheHome

Meanwhile, Oracle, xAI's would-be partner on the Memphis project, signed its deal to provide computing capacity to OpenAI not long after the talks with xAI broke down over the summer.
The Information was first to report that the new OpenAI data center would be located in Abilene, where Oracle had signed an agreement to work with startups Crusoe and Lancium to develop the site.

Last month, Crusoe raised more than $3 billion to develop the initial phase of the data center, which will contain 100,000 of Nvidia's forthcoming GPUs, known as GB200s.

As Musk did in Memphis, Crusoe is pushing to get the project done quickly. Arcello from DPR, whose firm is contracting with Crusoe on the project, said it's one of the fastest builds he's ever worked on. The firms began discussing the data center design in March and broke ground on the project in June.

A few weeks ago, construction crews at the Abilene site were busy cutting down trees to make room for a new electrical substation and pouring as much concrete as they could truck in on a daily basis. OpenAI has asked its partners on the project to consider using gas turbines in case there are any issues getting power to the site on time, according to three people with direct knowledge of the request.

On a recent tour at the site, a guide was asked why there was so much fuss about building the data center so quickly.

"Whoever can get their [supercomputer] faster...could pretty much rule the world," the guide said.

Anissa Gardizy is a reporter at The Information covering cloud computing. She was previously a tech reporter at The Boston Globe. Anissa is based in San Francisco and can be reached at anissa@theinformation.com or on Twitter at @anissagardizy8