Data Sources in Entrepreneurship Research
Whether you are testing a theory or creating one, at some point you will need to collect data. This is challenging in entrepreneurship research, because data about early-stage firms are typically sparse, noisy, and hard to access. I put together a list of common data sources that may be helpful for research in this space. I start with data useful for building a sample. Then I discuss data useful for describing some key features of ventures: team, intellectual property, financing, and exit. I conclude by discussing alternative directions. This is NOT a comprehensive list; it is just my view on the topic. This blog was published on April 16th, 2023, and is going through numerous iterations. If you have any feedback, I’d love to hear from you at andreacontigiani2019@gmail.com.
1 — Sample
Whether you are testing a theory or creating one, at some point you need to get to the data. To do that, you need to identify a population of units that fit the theory you have in mind. In entrepreneurship research, those units are typically early-stage firms, though sometimes you can go more micro (e.g., teams or individuals) or more macro (e.g., regions or countries).
Once you have a population in mind, you typically need to draw a sample from it, for feasibility reasons. There are several ways to do that.
Ideally, you want to start from something broad (i.e., the “universe”) and select firms that satisfy the criteria relevant to your question (e.g., geography, industry, or age). The caveat is that the database has to be focused on early-stage firms, and such databases aren’t always easy to find. Here are some common databases:
While entrepreneurship is about young firms, sometimes it’s helpful to look at more established firms as well. Here are some common databases:
Another approach is to start from products. Doing that often requires focusing on a specific sector. Here are some common databases:
Another approach is to start from survey data. The challenge is that these datasets are not easily connectable to other datasets. However, they may be helpful for specific studies, especially at the micro level. Here are some common datasets:
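As a rough illustration of the first approach in this section (start from a broad "universe" and keep the firms that satisfy your criteria), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Firm` fields, the `select_sample` helper, and the toy records are hypothetical and do not reflect the schema of any specific database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Firm:
    # Hypothetical fields; real databases use their own schemas.
    name: str
    state: str
    industry: str
    founding_year: int


def select_sample(universe, states, industries, max_age, as_of_year):
    """Keep firms located in `states`, active in `industries`,
    and at most `max_age` years old as of `as_of_year`."""
    return [
        firm
        for firm in universe
        if firm.state in states
        and firm.industry in industries
        and 0 <= as_of_year - firm.founding_year <= max_age
    ]


# Toy universe: every record below is made up.
universe = [
    Firm("Alpha", "CA", "software", 2019),
    Firm("Beta", "NY", "biotech", 2021),
    Firm("Gamma", "CA", "software", 2005),
    Firm("Delta", "CA", "biotech", 2020),
]

sample = select_sample(
    universe,
    states={"CA"},
    industries={"software", "biotech"},
    max_age=5,
    as_of_year=2023,
)
print([firm.name for firm in sample])  # ['Alpha', 'Delta']
```

Writing the criteria down as explicit arguments makes it easier to document, and later defend, how the sample was constructed.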
2 — Team
A new venture starts with a founder or, more commonly, a founding team. So, it is often helpful to get information about the people that mark the beginning of the firm.
Here are some common data sources:
- Crunchbase. This is a good starting point to get a sense of who is on the founding team. However, the information is incomplete and may not be very accurate.
- LinkedIn. This is the obvious place to get information about education and experience, once you have the names of the people.
- Glassdoor. This is sometimes helpful to get information about compensation and culture inside firms.
- Lightcast (formerly Burning Glass). This is helpful to get information about salaries and positions.
Here are some examples of research using some of these datasets:
3 — Intellectual Property
A lot of entrepreneurship research examines the role of intellectual property (aka IP). Much of this work focuses on patents. While patent data has challenges, its observability makes it an incredibly valuable resource. Plus, recently, research has increasingly looked at non-patent IP as well.
Here are some common data sources for research on patents:
- USPTO PatentsView. This is perhaps the most common patent dataset. Managed directly by the USPTO.
- NBER PDP. A historical source of patent data. No longer updated, but still a great resource.
- Harvard PND. Another historical source of patent data. Again, no longer updated, but still excellent.
- PATSTAT. This dataset, managed by the EPO, has broad coverage. So it is useful for research on patents in settings outside the US.
- USPTO PatEx. This is a database of patent applications.
- KPSS. This dataset is helpful to connect patent data to other common datasets, such as CRSP and Compustat.
- IPRoduct. This dataset is useful to connect patent data to product data.
- Silverman’s IPC-SIC Concordance. This dataset is helpful to connect patent data to industry sectors. No longer updated, but still a great resource.
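Several of the datasets above are crosswalks: they map patents to an identifier used in another dataset. A minimal Python sketch of how such a link is typically used follows. The identifiers, names, and the `patent_counts` helper are hypothetical, and the toy values are made up; real crosswalks have their own keys and formats.

```python
from collections import defaultdict

# Toy crosswalk: patent number -> firm identifier (made-up values).
patent_to_firm = {"P001": 101, "P002": 101, "P003": 202, "P004": 999}

# Toy firm-level table keyed by the same identifier (made-up values).
firms = {
    101: {"name": "Alpha Corp"},
    202: {"name": "Beta Inc"},
    303: {"name": "Gamma LLC"},
}


def patent_counts(crosswalk, firm_table):
    """Count matched patents per firm; firms with no match get 0."""
    counts = defaultdict(int)
    for firm_id in crosswalk.values():
        if firm_id in firm_table:  # ignore patents linked to firms outside the sample
            counts[firm_id] += 1
    return {firm_id: counts.get(firm_id, 0) for firm_id in firm_table}


counts = patent_counts(patent_to_firm, firms)
print(counts)  # {101: 2, 202: 1, 303: 0}
```

Keeping the zero-patent firms in the output matters: dropping them would silently restrict the sample to patenting firms.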
Here are some common data sources for research on non-patent IP:
- USPTO Trademark Dataset. This dataset provides data about trademarks.
- Png’s Trade Secrecy Index. This dataset provides an index measuring the strength of trade secrecy protection across US states.
Here are some examples of research using some of these datasets:
4 — Financing
A key variable in entrepreneurship research is financing. While only a small fraction of ventures get it, venture capital (aka VC) is a central topic in this space, so a lot of research has looked at the drivers and consequences of VC financing.
Here are some common data sources:
- Crunchbase. This dataset is comprehensive and financially accessible, but not extraordinarily precise.
- Pitchbook. This is what many view as the top quality dataset for VC data. Hard to get.
- CB Insights.
- Preqin.
- Refinitiv (formerly Thomson Reuters VentureXpert). This product includes what used to be VentureXpert, a historical source for VC data.
Here are some examples of research using some of these datasets:
5 — Exit
A common outcome variable in entrepreneurship research is whether a venture “exits”. This means, primarily, that the venture either gets acquired or goes public.
Here are some common data sources:
- Refinitiv (formerly Thomson Reuters SDC Platinum). This product includes what used to be SDC Platinum, a historical source for data on M&A, JVs, and IPO filings.
- Orbis M&A. This is another comprehensive dataset for M&A, JVs, and IPO filings.
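Once you have event records from a source like these, the exit outcome described above (acquired or gone public) can be coded as a simple indicator. The following Python sketch is only an illustration under stated assumptions: the event labels, the `exit_indicator` helper, and the toy records are hypothetical and do not match the format of any specific database.

```python
# Event types treated as an exit (hypothetical labels).
EXIT_EVENTS = {"acquisition", "ipo"}


def exit_indicator(firm_ids, events):
    """Return 1 for firms with at least one exit event, 0 otherwise."""
    exited = {firm for firm, event_type in events if event_type in EXIT_EVENTS}
    return {firm: int(firm in exited) for firm in firm_ids}


# Toy event records as (firm, event type); all values are made up.
events = [
    ("Alpha", "acquisition"),
    ("Beta", "funding_round"),
    ("Delta", "ipo"),
]

outcome = exit_indicator(["Alpha", "Beta", "Gamma", "Delta"], events)
print(outcome)  # {'Alpha': 1, 'Beta': 0, 'Gamma': 0, 'Delta': 1}
```

Note that a 0 here means "no exit observed in the data", which is not the same as failure: the firm may still be operating, or the event may simply be missing from the source.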
Here are some examples of research using some of these datasets:
6 — Alternative Directions
This is just a partial list of (secondary) data sources useful for research in entrepreneurship. It is not comprehensive, for several reasons. 1) I obviously don’t know all of the sources. 2) I am based in the US, so this is unavoidably US-centric. 3) A comprehensive list just can’t exist, because data sources change over time.
In fact, I would argue that the most interesting research is often based on novel data sources. So, in some sense, this list is useful because it tells you which data sources you should NOT use. Or, at a minimum, which ones you should NOT use as the core of your dataset.
Here are some interesting papers using different approaches to do primary research:
- Survey: Bennett & Chatterji 2023
- Lab experiments: Kagan et al. 2017
- Field experiments: Bernstein et al. 2017
- Text: Carlson 2023
- Image processing: Banerjee et al. 2023
- Neural activity: Frydman et al. 2014
Thank you for reading all the way through. I hope this information was helpful. If you have feedback, please reach out at andreacontigiani2019@gmail.com. Avanti tutta!