OpenAI's Mira Murati Questioned About Sora's Training Data Sources

Mira Murati, OpenAI's chief technology officer, gave only vague answers about the data used to train Sora, the company's forthcoming video-generating artificial intelligence (AI) model, during an interview with The Wall Street Journal on March 13. Asked where the data behind Sora came from, Murati said only that the $80 billion company trained the model on publicly available and licensed data.

Pressed by The Wall Street Journal's Joanna Stern on whether that data came from specific sources, such as social media platforms like YouTube, Instagram, or Facebook, Murati said she was not sure. She did not identify where the data originated, though she acknowledged that content from those platforms might have been used if it was publicly available.

Before moving on, Stern pointed to OpenAI's partnership with stock photo company Shutterstock and asked whether its data had been used to train Sora. Murati declined to give details about the data sources, repeating only that the model was trained on publicly available or licensed data.

It was later confirmed to The Wall Street Journal that Sora was indeed trained on data from Shutterstock. AI models like Sora depend on large datasets to learn from, which is what enables them to recognize patterns, predict outcomes, and comprehend language. Murati, who joined OpenAI in 2018, has led several prominent projects for the company, including the image generator DALL-E 3, the speech recognition tool Whisper, and GPT-4, the latest model behind the chatbot ChatGPT.

Murati briefly served as OpenAI's interim CEO in November 2023 after the OpenAI board removed Sam Altman. Alongside its technological advances, the company has faced legal challenges over the data used to train its AI models. In July 2023, authors Sarah Silverman, Richard Kadrey, and Christopher Golden sued OpenAI, alleging copyright infringement by ChatGPT.

The New York Times also sued Microsoft and OpenAI in December 2023, accusing the companies of using its content without authorization to train AI chatbots. Another lawsuit, filed in California, alleged that OpenAI unlawfully scraped private user data from the internet to train ChatGPT. Together, the cases underscore the complex ethical and legal questions surrounding AI development.