The Synthetic Data Vault

Software Development

Boston, MA 647 followers

Solving the data availability problem - one step at a time.

About us

The Synthetic Data Vault is a set of open source software systems built to help enterprises generate synthetic data that mimics real data. The project was launched at MIT. Visit our GitHub at: https://github.com/sdv-dev/SDV

Website
https://sdv.dev/
Industry
Software Development
Company size
2-10 employees
Headquarters
Boston, MA
Type
Privately Held

Updates

  • The Synthetic Data Vault reposted this

    DataCebo

    852 followers

    2024 has been the biggest year for synthetic data so far! Perhaps the greatest lesson of this year is that synthetic data has gone beyond being just a “stand in” for real data (to overcome privacy concerns) and has come into its own. Use cases like testing software, developing more accurate predictive models (such as ones that detect fraud or hate speech), and training AI agents have all come front and center. Here is our take on the current state of synthetic data, and some predictions for the future. (Link below)

    Using language models to create synthetic data to overcome data shortages (or shortages of annotated data) in training LLMs themselves has become a theme. Google, Microsoft, NVIDIA, OpenAI, and Meta have all demonstrated how they are using #syntheticdata to train their models. The circular relationship of using language models to create data to train language models has sparked a widespread debate over whether this will cause mode collapse or perpetuate biases, and over how long this can work before we run out of data.

    Snowflake and Google entered the synthetic tabular data domain with their launches in June and December, respectively. We at DataCebo welcome this development and we say, let's rumble!

    Synthetic data has found a new use: training AI agents! A team from Massachusetts Institute of Technology created "bootstrapped exemplars" to train an #AI #agent that creates narratives for machine learning model explanations.

    We highlight these developments and many more in our article, and share our predictions for 2025! We welcome feedback! Please leave comments here or on the blog article itself. If we missed something significant, we’ll be sure to update.

    Link to the article: https://lnkd.in/e2ZRdjv4

    #syntheticdata #generativeai #openai #sdv #machinelearning #2025

  • The Synthetic Data Vault reposted this

    Today, we are excited to introduce a very powerful new framework to The Synthetic Data Vault: constraint augmented generation (#CAG for short). CAG addresses the shortcomings of generative models in capturing the context buried in enterprise data stores, with human input. (Link to the announcement: https://lnkd.in/eq4FgtbS)

    ❎ Generative AI models fail to capture deterministic relationships between columns, rows, and tables. We call such relationships database context. Database context describes hard and fast rules under which data is created and stored. What is even harder is that usually, this context is not explicitly stored within the database schema itself, but data teams know that it exists. Downstream applications process this data based on the context using logic within the application software. When generative models are used to create #syntheticdata, the expectation is that the #syntheticdata will also follow the database context.

    ✅ When we launched The Synthetic Data Vault, a system to enable enterprises to build generative models for their own #multitable data, we provided the ability to include context via what we called #constraints.

    🔥 Over the years, constraints have become one of the most popular features of our SDV Enterprise product.

    💪 With CAG we are doubling down on this focus. To use this new and powerful framework, users can just pick the pre-defined pattern that corresponds to their database context and tell SDV where to apply it. SDV will then augment your synthesizer directly with this information. The result is 100% valid #syntheticdata.

    Read more about CAG, what it means for you, and how to access it in our latest product announcement here: https://lnkd.in/eq4FgtbS

    Happy holidays and enjoy synthesizing! - from all of the DataCebo team

    #syntheticdata #generativeai #data #machinelearning #ml #ai
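
To make the idea of "database context" concrete, here is a minimal sketch in plain Python. It is not SDV's CAG implementation or API. The rule (`checkout_after_checkin`), the stand-in generator, and the use of simple reject sampling are all assumptions made for illustration: a rule that lives outside the schema is written as a predicate, and only rows that satisfy it are kept.

```python
import random

# Hypothetical "database context": a hard rule every valid row must satisfy.
# Here: a hotel stay must check out after it checks in.
def checkout_after_checkin(row):
    return row["checkout_day"] > row["checkin_day"]

def naive_generator(rng):
    """Stand-in for a generative model that does not know the rule."""
    return {
        "checkin_day": rng.randint(1, 30),
        "checkout_day": rng.randint(1, 30),
    }

def sample_valid_rows(n, rule, rng):
    """Keep sampling until n rows satisfy the rule (simple reject sampling)."""
    rows = []
    while len(rows) < n:
        row = naive_generator(rng)
        if rule(row):
            rows.append(row)
    return rows

rng = random.Random(0)  # fixed seed so the sketch is repeatable
valid_rows = sample_valid_rows(100, checkout_after_checkin, rng)

# Every kept row follows the rule, because invalid rows were discarded.
print(all(checkout_after_checkin(r) for r in valid_rows))
```

Reject sampling is only the simplest way to enforce such a rule, and it can be slow when few samples pass. The post describes augmenting the synthesizer itself with the context, which is a different mechanism.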

  • 🚀🔥 #CTGAN has been downloaded over 2.5 million times. 🔥🚀

    Released #thisweek in 2019: version 0.1.0 of #CTGAN, a deep learning-based #syntheticdata generator for single-table data that can learn from real data and generate synthetic data with high fidelity. Since then:

    🙌 It continues to be the go-to model for many #fortune500 companies that want to create #syntheticdata for training #AI models.
    👍 It has been used for a wide variety of use cases in domains ranging from #energy and #healthcare to #education, #insurance, and many others.
    🔥 It has been used to create #syntheticdata for competitions, to improve the predictive accuracy of healthcare models, and to accurately predict fraud, to name a few.
    🤝 Data created using #CTGAN has been used by more than 30,000 data science teams.
    ❤️ Thank you to all our users who used it and gave a ton of feedback, which has helped us keep building it.

    With its demand surpassing that of any other generative AI model for tabular data, we will be releasing more features for CTGAN in the near future. Check it out here: https://lnkd.in/ey-SJZVq

    #syntheticdata #datascience #dataanalytics #DS #sdv

    Happy synthesizing! - The DataCebo Team

  • In 1956, storing 5 MB required a hard disk that weighed a ton. In 2024, a generative model can capture the salient properties of terabytes of data in an entire database within a single file and recreate it on demand - what we now call #syntheticdata.

    #otd in 1956 IBM launched the first commercial hard-disk drive, the Model 350 RAMAC, which weighed a ton and stored the equivalent of roughly 5 MB. In comparison, today's largest commercial hard drive - Seagate's Exos X Mozaic - has 6 million times more space, at 30 TB.

    And … in 2024 with generative AI: a generative model with a file size of a few GB can capture the salient properties of the data and recreate, on the fly, 30 TB of #syntheticdata that has the same statistical properties and looks like the real data.

    Read the original article about IBM's first hard disk here: https://lnkd.in/e6mD9svy
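
The "6 million times" figure can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes). The 3 GB model size is a placeholder assumption for "a few GB", not a number from the post.

```python
RAMAC_BYTES = 5 * 10**6       # roughly 5 MB, per the post (decimal megabytes)
MODERN_BYTES = 30 * 10**12    # 30 TB, per the post (decimal terabytes)

capacity_ratio = MODERN_BYTES // RAMAC_BYTES
print(f"{capacity_ratio:,}")  # 6,000,000

# Assumption for illustration only: treat "a few GB" as 3 GB.
ASSUMED_MODEL_BYTES = 3 * 10**9
data_to_model_ratio = MODERN_BYTES // ASSUMED_MODEL_BYTES
print(f"{data_to_model_ratio:,}")  # 10,000
```

Under that assumed model size, the model file would be about four orders of magnitude smaller than the data it can regenerate.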

  • "This synthetic data must meet two requirements: 1️⃣ First, it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists. 2️⃣ Second, it must also formally and structurally resemble the original data, so that any software written on top of it can be reused. In order to meet these requirements, the data must be statistically modeled in its original form, so that we can sample from and recreate it. In our case and in most cases, that form is the database itself. Thus, modeling must occur before any transformations and aggregations are applied."

    From the 2016 paper "The Synthetic Data Vault", whose camera-ready version was submitted #otd in 2016 from Massachusetts Institute of Technology.

    Today, #sdv counts millions of downloads and thousands of users, and many additional modules have been added to evaluate #syntheticdata, #benchmark models, and much more. You can find the original paper here: https://lnkd.in/evSmnZz8

    #syntheticdata #generativeai #tabulardata #ai #machinelearning #datascience

    ----
    Neha Patki, Roy Wedge, and Kalyan Veeramachaneni; MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); MIT Laboratory for Information and Decision Systems (LIDS); MIT Schwarzman College of Computing; MIT Data-to-AI Lab

  • Released #otd in 2018: our RDT library (v 0.1), which makes generative AI for #tabulardata possible! RDT has been downloaded more than 3.5 million times.

    Reversible Data Transforms is a #python library developed for the Synthetic Data Vault. It:
    ➡️ helps transform real-world data into data that's ready for generative AI models (forward), and
    ⬅️ then helps reverse the samples generated by the model to produce realistic-looking data, both in format and structure (reverse).

    🔥 Its 3.5 million downloads (>10K/day as of today) are a testament to its significance for #generativeai modeling and #syntheticdata generation.

    Check it out here: https://docs.sdv.dev/rdt

    #syntheticdata #preprocessing #datascience #dataanalytics #DS #sdv
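
The forward/reverse idea can be sketched in a few lines of plain Python. This is not the RDT API. The class name `CategoryToNumber` and its methods are hypothetical, chosen only to show what "reversible" means: categories become numbers a model can consume, and numeric model output is mapped back to valid categories.

```python
class CategoryToNumber:
    """Toy reversible transform (hypothetical, not part of RDT)."""

    def fit(self, values):
        # Learn a stable category <-> integer code mapping from real data.
        self.categories = sorted(set(values))
        self.codes = {c: i for i, c in enumerate(self.categories)}
        return self

    def transform(self, values):
        # Forward: categories -> floats that a numeric model can consume.
        return [float(self.codes[v]) for v in values]

    def reverse_transform(self, numbers):
        # Reverse: model output may be fractional or out of range,
        # so round to the nearest code and clip to the valid range.
        last = len(self.categories) - 1
        return [self.categories[min(max(round(x), 0), last)] for x in numbers]

plans = ["basic", "premium", "basic", "free"]
transformer = CategoryToNumber().fit(plans)

numeric = transformer.transform(plans)
round_trip = transformer.reverse_transform(numeric)

# Pretend these floats came from a generative model.
model_samples = [0.2, 1.7, 2.4, -0.3]
restored = transformer.reverse_transform(model_samples)
print(restored)
```

The round trip on real data returns the original values, and arbitrary numeric samples always map back to a category seen during `fit`, which is what keeps synthetic output in the original format.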

  • The Synthetic Data Vault reposted this

    We're intrigued by a recent paper on a promising new use case for #syntheticdata in clinical trials. While still very preliminary, an Italian team led by Matteo G Della Porta showed the potential for creating “synthetic patients” as a control group for clinical trials - making such trials much more cost-effective and quicker to run. ✅ Researchers used real patient data to train a model to generate a “mirror cohort” of synthetic records of patients with the blood disease myelodysplastic syndrome (MDS). ❇️ They've developed an associated tool that allows those working on MDS to generate cohorts of up to 10,000 synthetic patients, complete with clinical, genomic, and follow-up data. ⭐️ Della Porta says that that’s a potential game-changer for research on disorders for which “very few data are publicly available.” Obviously we have to proceed carefully here, but this is a very encouraging sign for ways that synthetic data can be used that can concretely help with further research and development of new medical treatments. PNAS overview: https://lnkd.in/ew5pM-Ca Paper: https://lnkd.in/eNZaJRPQ #clinicaldata #researchdata #syntheticdata #healthcaredata

  • The Synthetic Data Vault reposted this

    Most enterprises store data across multiple tables, with each table storing data for an entity and its attributes. Each single table captures just one part of a complex pattern. For example, as Neha Patki and Frances Hartwell explain, a customer's age may correlate with the amount of money they spent on a purchase. Even though the "age" and "amount per purchase" columns are present in different tables, the connection between the tables means that intertable trends exist. To fully understand these trends, and emulate how an entity and its data connect and evolve, a generative model must work across tables and learn these patterns.

    We at The Synthetic Data Vault developed our first multi-table approach, popularly known as the Hierarchical Modeling Algorithm (HMA), in 2016. Over the years, we have improved it and created many different generative modeling techniques for multi-table data. Each of our approaches strikes a delicate balance between modeling speed and how exhaustively the model learns the patterns. That balance affects the usability of the tool and the quality of the #syntheticdata generated, and it lets users find an approach that matches their needs.

    Read our primer on multi-table synthesizers and the SDV on our blog: https://lnkd.in/etjqnJ9f. As always, we learn and evolve based on your feedback.

    Join our Slack: https://lnkd.in/ePnJgHhc
    Try out SDV Community: https://docs.sdv.dev/sdv
    Send us a request for SDV Enterprise: https://lnkd.in/e97Ce89i
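
The age versus purchase amount example can be illustrated with a small sketch. The tables, column names, and values below are made up for illustration. The point is only that the correlation becomes visible once the two tables are joined on their shared key, which is why a model that sees each table in isolation cannot learn it.

```python
# Hypothetical parent table: one row per customer.
customers = [
    {"customer_id": 1, "age": 22},
    {"customer_id": 2, "age": 35},
    {"customer_id": 3, "age": 51},
    {"customer_id": 4, "age": 64},
]

# Hypothetical child table: many purchases per customer.
purchases = [
    {"customer_id": 1, "amount": 18.0},
    {"customer_id": 1, "amount": 25.0},
    {"customer_id": 2, "amount": 40.0},
    {"customer_id": 3, "amount": 75.0},
    {"customer_id": 3, "amount": 60.0},
    {"customer_id": 4, "amount": 90.0},
]

# Join the tables on the key that connects them.
age_by_id = {c["customer_id"]: c["age"] for c in customers}
pairs = [(age_by_id[p["customer_id"]], p["amount"]) for p in purchases]

def pearson(pairs):
    """Pearson correlation coefficient of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / (var_x * var_y) ** 0.5

r = pearson(pairs)
print(round(r, 2))  # strongly positive for this made-up data
```

Neither table holds both columns, so the trend exists only across the relationship. A multi-table synthesizer has to preserve that relationship for the synthetic join to show a similar correlation.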

