Most verticals aren't clean, well-oiled SaaS databases; the reality is ugly documents, proprietary schemas, implicit workflows, and long-running tasks that most general-purpose models struggle with.
This prompted construction project management company Trunk Tools to build a specialized, three-layer architecture (perception, semantics, agents) based on highly detailed data to support high-accuracy, highly relevant industry automation.
Their purpose-built stack has shrunk review cycles from months to days, prevented costly field errors, and given autonomous agents the ability to reason over millions of pages of documentation, Trunk says.
"We really set out to take the data from dispersed systems, pre-process it, structure it, go through our ontology into a knowledge graph, and then train AI models," said Sarah Buchner, Trunk's founder and CEO and a former carpenter.
For builders in other verticals, Trunk's approach could serve as a blueprint for transforming data chaos into agent-ready, industry-specific workflows.
Where general-purpose LLMs break down on industry data
Foundation LLMs, while powerful, are optimized for breadth, not always depth.
"General-purpose LLMs are trained to be okay at everything, so they're weak at anything niche," said Kriti Faujdar, a senior product manager working in AI infrastructure, agentic AI, security, and LLM platforms. Examples include rare terms, domain-specific reasoning, and the unspoken context that any practitioner "just knows."
Web, app, and software developer Sébastien De Bollivier agreed that the biggest bottleneck is reliability on data that is "jargon-dense, abbreviation-heavy, and format-specific."
"A GPT-4-class model can understand a French legal contract, but will fumble the specific article references practitioners need to cite," he said.
Besides, the most valuable enterprise data never made it into pretraining anyway, Faujdar pointed out. It's sitting in internal systems and proprietary formats. "RAG helps a little," she said. "But it's just giving better facts to a model that still can't reason properly in the domain."
Pre-training on domain data is critical; enterprises should then fine-tune on good task examples and build their own evals. "A few thousand examples from real practitioners beats millions of scraped, noisy ones," Faujdar said.
Mixture-of-experts (MoE) can provide specialization without inference costs blowing up. Pairing RAG with fine-tuning also works well; RAG handles the factual long tail while fine-tuning fixes vocabulary and reasoning.
De Bollivier pointed to the advantage of hybrid stacks: A general-purpose model for reasoning and orchestration, a smaller fine-tuned model (or dense retrieval over a curated corpus) for domain-specific extraction. He advised: "Don't fine-tune to make the model 'smarter' about a domain; fine-tune to make it more reliable on the specific output format your workflow requires."
The trades and construction are certainly industries seeing traction with these techniques, as are legal and healthcare, De Bollivier said. These verticals have "high stakes for errors plus standardized document formats, equaling clear domain-training ROI."
One honest caveat worth mentioning, Faujdar said: Specialized models often fall apart outside their domain, so they are of limited use beyond their area of expertise unless they are retrained.
Perception, semantics, agents: inside Trunk's three-layer stack
In highly specialized domains like construction, "data dumps" into large language models (LLMs) don't cut it, said Trunk's CTO Amrish Kapoor. This is because most transformers are probabilistic models: When given an image, they report back that it is "probably" a tree, or "probably" a child playing next to a tree.
This makes them insufficient for high-precision symbolic interpretation. For instance, in construction documents, a 2-millimeter-wide symbol has a vastly different meaning depending on where it's placed.
Further, constrained by context limits, probabilistic models struggle with long-term project memory. "I don't mean a context window of a few tokens," Kapoor said. "I'm talking about long-term memory that stretches across months and years, because this is how long some of these projects are."
Instead, Trunk's three-layer system breaks workflows into:
Perception (reading and extracting data from messy docs like PDFs, drawings, or scans).
A semantic/graph layer (making sense of that data and understanding its relationships).
LLMs and agents on top.
Construction drawings are typically symbolic, Buchner said. A door isn't always labeled "door." Sometimes it's simply an arc on a wall that a trained eye learns to read based on years of practice.
"The perception layer is what teaches AI to read that language," she said. The semantic layer then gives that information meaning; for instance, connecting the door to the drawing that details it, the spec that governs it, and the trade that installs it. This helps answer project engineers' critical questions: Not "is there a door here?" but "does this door create a problem down the line?"
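As a rough illustration of what a semantic layer of this kind might store, the sketch below links one door to its detail drawing, governing spec, and installing trade in a tiny in-memory graph. The class, the relation names, and every identifier are hypothetical; this is not Trunk's schema or ontology, only one simple way to represent the relationships described above.

```python
from collections import defaultdict


class ProjectGraph:
    """Toy knowledge graph: (subject, relation) -> list of linked objects."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        """Record a directed, labeled link from subject to obj."""
        self._edges[(subject, relation)].append(obj)

    def related(self, subject, relation):
        """Return the objects linked to subject by one relation."""
        return list(self._edges.get((subject, relation), []))

    def context(self, subject):
        """Return everything directly linked to subject, keyed by relation."""
        return {
            rel: list(objs)
            for (subj, rel), objs in self._edges.items()
            if subj == subject
        }


# Hypothetical entities and relations, for illustration only.
graph = ProjectGraph()
graph.add("door-101", "detailed_in", "drawing-A501")
graph.add("door-101", "governed_by", "spec-doors")
graph.add("door-101", "installed_by", "trade-doors-and-hardware")
```

With links like these in place, an agent asking about `door-101` could call `graph.context("door-101")` to pull the related drawing, spec, and trade in a single lookup instead of searching each document separately.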
Particularly in construction, that shift matters because the cost of a problem compounds with time. "A conflict caught in design is relatively low cost to address," Buchner said, "whereas the same problem caught in the field might cost tens of thousands of dollars."
At a high level, the system identifies the document type and begins extracting information based on content (drawing, schedules, paragraph text). This data is then "transformed and augmented" in the platform, which triggers agentic workflows like knowledge graph relationships and end-user workflows.
For instance, an agent might review an architecture bulletin and produce a visual overlay comparing an older version and a newer version (flagging additions and removals), then generate written narratives that describe what those changes are in simple terms. This helps users understand what's changed and coordinate with trade partners on updated pricing and change orders.
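The comparison step can be sketched in a few lines, assuming each revision has already been reduced to a collection of labeled elements by an upstream extraction step. The function names and the element labels are hypothetical, and the plain-text summary stands in for the narrative an LLM might write; the source does not describe how Trunk implements this.

```python
def diff_revisions(old_items, new_items):
    """Compare two revisions and report additions and removals."""
    old, new = set(old_items), set(new_items)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }


def narrate(changes):
    """Turn a diff into a short plain-language summary, one change per line."""
    lines = [f"Added: {item}" for item in changes["added"]]
    lines += [f"Removed: {item}" for item in changes["removed"]]
    return "\n".join(lines) or "No changes detected."


# Hypothetical element labels extracted from two versions of a drawing.
old_revision = ["door-101", "door-102", "window-201"]
new_revision = ["door-101", "window-201", "window-202"]

changes = diff_revisions(old_revision, new_revision)
summary = narrate(changes)
```

Set difference only catches elements that appear or disappear; a moved or resized element would need its attributes compared as well.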
The scale of construction's data problem
Construction workflows are "ripe with implicit assumptions and connections between data in its myriad of sources," Buchner said. And the amount of unstructured data is "humanly impossible" to process or make sense of.
Buchner estimated the average high-rise building generates about 3.6 million pages of corresponding documentation. "If you print it into a stack of papers it would be as high as the building itself."
All three layers of Trunk's stack (perception, semantic, LLM) are trained on "very specific datasets" from customers with "explicit permissions" and auto-labeling/IP, Kapoor explained. Customers who don't want Trunk training on their data can opt out.
Data is deidentified and aggregated, and Trunk also collects "tons more" labeled data through other pipelines like 3D building information modeling (BIM).
Trunk says it only ships agents that achieve around 95% accuracy. The team maintains continuous evaluation pipelines based on ground truth data from customers and experts. They also employ an LLM-as-a-judge model.
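A release gate of this kind might look like the sketch below, which scores an agent's answers against ground-truth labels and checks them against a threshold. The function names, the exact-match comparison, and the hard 0.95 cutoff are assumptions made for illustration; the source says only that the bar is "around 95%."

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that exactly match the ground-truth labels."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


def ready_to_ship(predictions, ground_truth, threshold=0.95):
    """Hypothetical gate: pass only if accuracy meets the threshold."""
    return accuracy(predictions, ground_truth) >= threshold


# Hypothetical evaluation set: 20 labeled cases, one of them answered wrongly.
truth = ["compliant"] * 20
answers = ["compliant"] * 19 + ["noncompliant"]
```

Here `ready_to_ship(answers, truth)` passes because 19 of 20 answers match; a second wrong answer would drop the agent below the bar.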
"This notion of an LLM as a judge is to score how well you're doing, both subjectively as well as objectively," Kapoor said. Objectivity can be an easy "right" or "not right," but subjectivity requires more nuance.
For instance, when creating an email, narrative, or explanation, an LLM-as-a-judge framework can create a composite score: a numerical value that aggregates different metrics to assess a model's performance or risk.
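One common way to build a composite score is a weighted average of per-metric scores, sketched below. The metric names and weights are hypothetical, and the source does not say how Trunk combines its metrics; in practice the individual scores would come from a judge model or from objective checks.

```python
def composite_score(scores, weights):
    """Weighted average of per-metric scores, each expected in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same metrics")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical judge output for one generated narrative.
judge_scores = {"factual_accuracy": 1.0, "clarity": 0.5}
metric_weights = {"factual_accuracy": 3, "clarity": 1}

score = composite_score(judge_scores, metric_weights)  # (3*1.0 + 1*0.5) / 4 = 0.875
```

Weighting factual accuracy more heavily than clarity reflects the idea that an objectively wrong answer should cost more than an awkwardly worded one, though the right weights depend on the workflow.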
There can be challenges, though, particularly with latency, Buchner noted; any time the reasoning capacity of underlying models increases, the risk of latency goes up, too. Trunk maintains a set of evaluation criteria to objectively measure latency whenever changes are made to underlying infrastructure, agents, and API calls.
Then, "before we release to customers, we ensure marginal changes to the end-user experience are well worth the performance enhancements," Buchner said.
From 60 days to 10: the measurable payoff
Trunk's platform powers seven AI agents purpose-built for construction tasks, such as analyzing request for information (RFI) responses, providing overviews of bids, or reviewing drawings and submittals.
The submittal agent, for instance, flags missing, conflicting, or noncompliant information in product specs and RFIs. While submittal review is an essential step in the construction process, "it's a super annoying workflow," Buchner said, because human reviewers have to compare documents "with a bunch of other parts of documents."
But the agent is able to do this in seconds, and Trunk says it has reduced submittal cycles from between 50 and 60 days to 10, "which has massive schedule and financial implications."
Trunk is now at a place where these agents are communicating directly with each other, which is "quite exciting," Buchner said. So, for example, one agent will review an architectural drawing for accuracy, then autonomously hand it over to agents handling RFIs and asking follow-up questions.
"If the drawings have problems, the RFI agent is taking over and is actively reaching out for clarification," Buchner explained.
Trunk says its customers report savings of 20 to 40 minutes per field question. Buchner said that users in the field know better than anyone how much of a "time suck" it is to go back and forth from office trailers, dig through project documents in scattered systems or printed PDFs, reconcile discrepancies, and return to coordinate with trade partners.
Trunk says its customers report these additional outcomes:
Average 8-minute time savings for single-document retrieval (status checks, location lookups, quantity queries).
Average 20-minute time savings for standard referencing (cross-referencing 2 to 3 spec sections to form an answer).
Average 40-minute time savings for multi-document research (listing and filtering queries, mapping relationships, analyzing RFIs and submittals across 4 to 6 documents).
Average 75-minute time savings for complex tasks (creating RFIs and other communication materials, deep cross-referencing across documents, change tracking).
In one instance, Trunk's drawing review agent flagged that a structural beam had been moved up 8.5 inches. However, this was not documented by the architect. If the change hadn't been caught, the project manager would likely have had to strip out and reinstall the right size beam, Buchner said. This rework would have added $10,000 or more to the budget, and "certainly there would have been implications on the schedule."
Buchner also pointed to other examples: an agent flagged $60,000 in exaggerated pricing with no justification from landscaping subcontractors; identified a fireplace that needed to be sealed prior to drywall installation, saving around $100,000 in labor, materials, and delays; and called out that an electric door required a panel that wasn't included in electrical drawings.
Learnings for other industries
Trunk's approach to building agents is applicable to any vertical working with high volumes of unstructured, industry-specific data.
Builders working in specific verticals must understand the industry-specific data challenges their end users face and build technical infrastructure that can transform unstructured data into something an "LLM can traverse and understand," Buchner said.
"Only then can you build the connections between data points that ultimately feed agentic workflows."
A lot of money is being invested in foundational models, so enterprises should build modular systems that can leverage the strengths of various models as they continue to improve, Buchner advised.
Then, "build your technical advantage where the generic models are not investing and not performing well," she said.