Description

High-quality, domain-specific data is pivotal for training robust ML models. We aim to build a dynamic, ML-driven pipeline that streamlines the collection of domain-specific datasets and feeds them directly into the training of Large Language Models (LLMs).

The project has three parts. First, we will collect data through web crawling, using simulation models to avoid being stopped by bot blockers, and process it with a combination of decision trees, clustering algorithms, and CNNs to determine relevance. Next, we plan to integrate this freshly curated dataset into our LLM training framework. Finally, we will use feedback mechanisms to refine the data collection process.
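As a rough illustration of how the relevance-filtering and feedback stages might fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Document`, `relevance_score`, and `Pipeline` are hypothetical, the keyword-overlap score is only a stand-in for the decision trees, clustering algorithms, and CNNs mentioned above, and the feedback rule (adjusting a threshold) is one possible design, not the project's settled approach.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A crawled page (hypothetical structure)."""
    url: str
    text: str


def relevance_score(doc: Document, keywords: set[str]) -> float:
    """Fraction of domain keywords that appear in the document.

    Placeholder for a learned relevance model.
    """
    if not keywords:
        return 0.0
    words = set(doc.text.lower().split())
    return len(words & keywords) / len(keywords)


@dataclass
class Pipeline:
    keywords: set[str]
    threshold: float = 0.5
    dataset: list[Document] = field(default_factory=list)

    def filter(self, crawled: list[Document]) -> list[Document]:
        """Keep documents scoring at or above the threshold and add them to the dataset."""
        kept = [d for d in crawled
                if relevance_score(d, self.keywords) >= self.threshold]
        self.dataset.extend(kept)
        return kept

    def apply_feedback(self, useful_fraction: float, step: float = 0.1) -> None:
        """Raise the threshold if training found little of the data useful,
        lower it otherwise (assumed feedback rule)."""
        if useful_fraction < 0.5:
            self.threshold = min(1.0, self.threshold + step)
        else:
            self.threshold = max(0.0, self.threshold - step)


# Example run with made-up documents.
pipeline = Pipeline(keywords={"protein", "enzyme"})
crawled = [
    Document("https://example.org/a", "Enzyme kinetics and protein folding"),
    Document("https://example.org/b", "Weekend football results"),
]
kept = pipeline.filter(crawled)
pipeline.apply_feedback(useful_fraction=0.3)
```

In this sketch the feedback signal is a single number reported by the training stage; a real system would likely feed back richer information, such as per-source quality estimates.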

We are looking for developers who are proficient in Python and have experience with core ML libraries. The following skills would be an asset: familiarity with web-scraping techniques, experience with LLMs, and experience training models. If you have any questions or would like to reach out, please contact Sarah Walker (sarahl.walker@mail.utoronto.ca, Discord @quartzified) or Arihant Bapna (a.bapna@mail.utoronto.ca, Discord @ari_b). Thank you for your interest!

Proposal