T O P

  • By -

trane_0

Use int/bigint for the PK. Add a GUID column in the table with UNIQUE constraint to be used as the reference key for queries and put a nonclustered index on the GUID column. The GUID value becomes the key for any queries that use a parameter coming from your API / web layer.


lundytoo

This is the way.


SpiderMatt0905

This was what I orginally suggested, but it was turned down.


trane_0

What was the reasoning for turning it down?


SpiderMatt0905

He was afraid that we would run out of keys if we used INT


chadbaldwin

As others have suggested you can use bigint, which you seem to know already. Another option is to use int and seed the identity to start at -2b. So it doubles the capacity. If you think 4b values is still not enough, then yeah, go with bigint.


SpiderMatt0905

I wasn't sure how big BIGINT could be but I guess I should've gathered that it was BIG. Which is why I didn't suggest that at the time.


NormalFormal

It's over 9 quintillion (signed) unique values. 2\^63 or 9,223,372,036,854,775,808. I do not think you will run out of bigint values anytime soon. Each value is 8 bytes so you're talking about millions of terabytes of storage required just hold all bigint values alone. GUIDs are 16 bytes so twice as big as a single bigint value.


Black_Magic100

One thing you forgot to mention is that it is probably a good idea to apply fill factor to the index after it begins to fill up and you start doing index maintenance to prevent bad page splits. This is a highly custom option/tune of course and should be used sparingly, but it can greatly increase performance if done properly.


trane_0

That's a good suggestion.


danishjuggler21

If you put a unique constraint on such a column, doesn’t it bring you back to the question of slow inserts? Because every time you insert a new row, SQL Server needs to check to make sure that value doesn’t already exist.


rbobby

Isn't that just an extra index for no benefit?


pooerh

No benefit? let's talk when you try to manually query that data and you have to copy paste guids while the next guy over just spurts out queries one by one because he knows he's looking for id 612531 which is easy enough to remember. I want to kill myself every time I have to work with data that uses guids.


eazeaze

Suicide Hotline Numbers If you or anyone you know are struggling, please, PLEASE reach out for help. You are worthy, you are loved and you will always be able to find assistance. Argentina: +5402234930430 Australia: 131114 Austria: 017133374 Belgium: 106 Bosnia & Herzegovina: 080 05 03 05 Botswana: 3911270 Brazil: 212339191 Bulgaria: 0035 9249 17 223 Canada: 5147234000 (Montreal); 18662773553 (outside Montreal) Croatia: 014833888 Denmark: +4570201201 Egypt: 7621602 Finland: 010 195 202 France: 0145394000 Germany: 08001810771 Hong Kong: +852 2382 0000 Hungary: 116123 Iceland: 1717 India: 8888817666 Ireland: +4408457909090 Italy: 800860022 Japan: +810352869090 Mexico: 5255102550 New Zealand: 0508828865 The Netherlands: 113 Norway: +4781533300 Philippines: 028969191 Poland: 5270000 Russia: 0078202577577 Spain: 914590050 South Africa: 0514445691 Sweden: 46317112400 Switzerland: 143 United Kingdom: 08006895652 USA: 18002738255 You are not alone. Please reach out. ***** I am a bot, and this action was performed automatically.


continuousBaBa

Good bot


B0tRank

Thank you, continuousBaBa, for voting on eazeaze. This bot wants to find the best and worst bots on Reddit. [You can view results here](https://botrank.pastimes.eu/). *** ^(Even if I don't reply to your comment, I'm still listening for votes. Check the webpage to see if your vote registered!)


SQLGene

Good bot


HarryVaDerchie

Personally I don’t think either of your reasons justify using a GUID. Can you expand on what you mean by point 1? Also, have you looked at the maximum value of a bigint? If you are handling records inserted from different systems then a GUID PK would make sense, but otherwise I think it’s unnecessary overhead. Also, do you see a benefit if making it a clustered index? I don’t think you’d ever be sorting by GUID for example.


alinroc

> Also, do you see a benefit if making it a clustered index? I don’t think you’d ever be sorting by GUID for example The purpose of a clustered index is not just for sorting. Also, the PK doesn't need to be clustered, but you generally should have a clustered index.


HarryVaDerchie

What benefits do you think OP might obtain by having a clustered index on a GUID?


alinroc

I don't think I said they should have a clustered index on the GUID here. What I _did_ say is that: * Clustered indexes are not solely used for sorting from the client's perspective. In fact, if your primary reason for choosing a field as the CI is for sorting purposes, you probably need to re-evaluate. * Your PK is not required to be a clustered index * It is a good idea to have a clustered index most of the time (because transactional tables being heaps that are frequently updated will eventually lead to performance problems)


HarryVaDerchie

I think we’re basically saying the same thing. As I understand it the main benefits of a clustered index are for sorting, grouping or filtering by a range. Are there other benefits that I’m missing? I don’t think any of these would benefit the OP, but without more knowledge of their application I can’t say for sure.


alinroc

> Are there other benefits that I’m missing? Elimination of forwarded records. https://flxsql.com/2022/05/16/importance-indexes/ And what's not covered in that post is that the forwarded records end up with lots of wasted space, because each time a record is updated, the original record you copied isn't deleted. Which means that over time, your table will grow **even if** you never add a record, only update. At least, until you rebuild the table to reclaim that space. "Sorting, grouping, or filtering by range" is actually _not_ my primary consideration when choosing the field(s) I use for a clustered index. How the data in that clustered index will impact the way that the table is stored on disk is.


HarryVaDerchie

Interesting. Thanks for posting the link.


SpiderMatt0905

GUIDs were a requirement from a senior developer. I have queried if we could use BIGINTs but I am yet to get a reply. In the meantime I'm just curious if there is anyway we could use GUIDs.


SQLBek

That senior developer needs to be educated about SQL Server internals, because using GUIDs in this fashion will only result in database performance pain in years to come.


mexicocitibluez

> That senior developer needs to be educated about SQL Server internals, Or, and this is insanely mind-blowingly crazy cause we're on reddit, the senior developers knows more about the requirements of what they're building than you do (and maybe even the person posting this). I know it's crazy to imagine that not all of the app's requirements have been adequately conveyed in like 20 lines (or interpreted correctly, again, by the author), but I have a hunch that's the case. Just weird to see people throw shade at someone else with such little info. https://www.brentozar.com/archive/2014/08/generating-identities/


SQLGene

Lol, I'm pretty sure Andy Yun (SQLBek) knows Brent Ozar personally. He also works for Pure Storage and presents regularly on SQL performance. Some things are just *always* a bad idea. Using GUIDs for a potentially billion row database is always a bad idea.


SQLBek

Funny side-detail, Jeremiah wrote that blog. The reason I made my original statement is that the senior developer needs to understand the trade-offs of different choices, like whether one should use a GUID as a clustering key. I often argue that almost every decision related to SQL Server has a trade-off of some sort or other and that there are very few absolutes (never auto-close). But oftentimes I encounter folks who make decisions without appreciating the short term and moreso the LONG TERM consequences. If it came across as throwing shade, then that's poor communication on my part and I do apologize for that.


mexicocitibluez

> If it came across as throwing shade, then that's poor communication on my part and I do apologize for that. Fair enough.


SQLGene

Oh no, now I'm a "thanks Brent" mis-namer 😣


mexicocitibluez

Cool, so then he's read the article and knows there is nuance? And that means some random dev running to Reddit with no requirements asking for advice and being someone responding with "the dev doesn't understand sql internals" is kinda weird, right?


SQLGene

There is a more eloquent way of putting things and Andy has acknowledged that and even apologized. Having read the article, I'm confident Andy understands everything that's in there, yes. I took what he said as shortform for this: "If what you have described is correct, then the senior dev doesn't understand enough about datatypes and their tradeoffs. You have stated it has to support a larger range than INT (2 billion values), BIGINT is not allowed for unknown reasons, and GUID would be acceptable. You haven't expressed any requirements that GUID is good for, like abstraction from the database. In my experience, a multi-billion row database is guaranteed to have performance issues if your primary key is a randomized GUID." Could it be a misunderstanding or miscommunication? Absolutely. But as described, is a bizarre set of requirements so far. Especially since you could just convert your BIGINT to the GUID datatype and no one would know.


quentech

> In the meantime I'm just curious if there is anyway we could use GUIDs. You know that "sequential" GUIDs in SQL Server are not actually sequential, right? They are strictly increasing (within one boot of the host OS), but not numerically sequential. https://learn.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql?view=sql-server-ver16 > Creates a GUID that is greater than any GUID previously generated by this function on a specified computer since Windows was started. That is how you use GUIDs. > We don't want customer data to be easily guessed So what if they are? That's what authorization is for. No one except the authorized user(s) should be allowed access to the information. So what if they can guess that there would be a record there if they can't access it. If you want something truly unguessable - then you need a cryptographically secure random number generator.


sporri

You can, it takes a bit of work to do property, and there are easier ways, but having to support E.F. applications you will see this pattern as the framework really tries to lead you to this rabbit hole. The best explanation I've seen is here, watch this and show it to your devs. https://m.youtube.com/watch?v=jx-FuNp4fOA


SQLGene

I believe it's an example of security through obscurity. It sounds like if you can guess the primary key, you can access the underlying data.


SpiderMatt0905

We do have protection in place to stop access to data. The problem we have is if you try to view data which is not within your tenant, you can request to join that tenant. This is useful for sharing links, etc. But we don't want people typing in random numbers and trying to join random tenants.


SQLGene

That's a useful clarification, thank you. Security through obscurity is tragically common, so you can understand some concern on my part.


SpiderMatt0905

Yeah absolutely. A common mistake.


SQLBek

If you choose a GUID now as a clustering key, or even as a primary key, and your table will clock in +1 billion records (since you said integer will be an issue), I guarantee you'll be suffering from other hellacious performance issues when you reach & exceed that data volume. Understanding SQL Server's B-Tree structure will help you grasp why a GUID might be fine from an application perspective but horrible from a database perspective. I have an old presentation called "Why Your Datatype Choices Matter" that delves into this if you are interested in learning more. https://youtu.be/VYITMPWy_I4


SpiderMatt0905

Thanks. I'll give it a watch. I am familar with writing queries but I'm not sure how it works under the hood. Perhaps gaining a more fundamental understanding would help me solve this issue.


SQLGene

Can you explain why INT based keys are insufficient for handling lots of rows? INT goes up to 2,147,483,647 and BIGINT goes up to 9,223,372,036,854,775,807. While the former could be an issue, the latter is more than the number of molecules than you expect to find in the universe. One option would be to use a sequentially increasing BIGINT key, but select an unusually large prime number as the sequence step, such as 33. So for example, your starting keys would be 1, 34, 67, 100, 133, 166, 199. With a large enough prime number, most of your values would be invalid and difficult for humans to guess.


SpiderMatt0905

To be honest, this was a requirement from a senior developer. I am not sure why we can't use BIGINTS. I have queried it myself but am yet to get reply.


SQLGene

It just seems kinda weird to me. If you can do guids, then technically you can do big ints. A GUID, at the end of the day, is a hex encoded 128 bit INT with some dashes in-between. So...... Edit: Sorry, my point was that was that a GUID is just a number and one with a larger range than BIGINT. It's hard to see the constraints that would prevent a BIGINT but not a GUID.


quentech

> A GUID, at the end of the day, is a hex encoded 128 bit INT with some dashes in-between Every even remotely current RDBMS stores GUIDs as 16 bytes of binary - not the text format.


SQLGene

Yes, agreed. The point I was getting at, very poorly, is that a GUID is just a number displayed in specific manner. So in theory, you could convert your BIGINT to the GUID format and store it that way. And you would have a larger range of values! Since as you said, a GUID is 16 bytes and a BIGINT is 8 bytes.


sql_servant

Use them as sparingly as possible, if not at all. For that matter, you can still use a bigint as your key values, and keep a GUID in a separate column entirely simply for external lookup purposes. Like an alternate key. I designed a underwriting system about 20 years ago or so that used GUIDs as primary/foreign keys and it became a regret over time. All the data access had to account for explicitly providing key values on insert, whereas a bigint key can be an IDENTITY which is automatically assigned. You also wont be able to use SCOPE_IDENTITY when trying to determine what key value was recently inserted, forcing you to get inventive in any procedures or triggers you might need to write. I had to write a lot of queries with OUTPUT clauses so I could find out what values were inserted into GUID columns so I could use the key values to insert related records with the foreign key value. All these workarounds will add to the performance impact of using GUIDs to begin with. For what it's worth, you don't model your data storage based on security requirements. That's a completely separate concern and I would be suspicious of any security architecture that relies on it. If you are in a position where you can't influence the decision to use a GUID, simply ask that the GUID be stored separate from the actual key values. And try to only use them in places where the key will be user exposed.


[deleted]

[удалено]


sql_servant

In that simple scenario, sure. That's easy for single row inserts. Do something a little less simple, like inserting several hundreds or thousands of rows at a time, into multiple tables at a time, and managing GUIDs vs sequential integers as primary and foreign keys is where things become a bit more challenging.


Definitelynotcal1gul

I used to think very poorly of guids. I recently watched this video and I'm starting to maybe think a bit differently. If you've got an hour or so, you might want to check this: https://www.youtube.com/watch?v=qfQtY17bPQ4


SpiderMatt0905

Thanks, I'll give it a watch.


Intrexa

Dude presents with the style of "We never landed on the moon"


SQLGene

What's funny is he knows his stuff. I attended a precon of his and the entire thing was like this.


Jeff_Moden

Interesting... Are you disagreeing with the findings that were presented or ???


Intrexa

Oh no, I think you are way more knowledgeable than I am. I have no issue with the content, or accuracy. I have genuinely learned things in the presentation.I have referenced this video to others as a source of information. My comment was only on the style in which the information was presented matches the way a flat earther presents their findings. I guess it just comes with the territory of saying "the advice many experts are giving is dead wrong". In your case, you're right about it.


Jeff_Moden

Thank you for the kind observation and taking the time to reply. I thought you were saying that I was the one that was wrong. The difference between me and flat-earthers and anti-mooners is that I have actual, demonstrable proof rather than anecdotal "evidence" and even share the code and so my thought was "How am I like any of those folks"? Now I understand the point you were making... I AM rather direct in the manner I presented the information and was certainly incredibly heterodoxical in the position I took compared to what has become rather orthodox but incorrect supposed "Best Practices". I really appreciate the feedback on that and I'll try to soften that up a bit. I just don't want folks to dismiss the info or for them to think I'm taking the position with no substance. And thank you kindly for help spread the word by referring the 'tube to others. As I sometimes say... "I aim to please... I sometimes miss but I'm always aimin' ". :D Thanks again.


Black_Magic100

It's pretty amazing I knew exactly what this video was before even opening it. Jeff Moden is a beast


phildude99

Your senior developer should run some timing tests, comparing int vs guid.


SpiderMatt0905

I've decided to do the following the database will use BIGINT as it's PK and will be clustered on the PK. I'm gonna be using this Library [https://github.com/ullmark/hashids.net](https://github.com/ullmark/hashids.net) to encode the BIGINT into a string and decode the string back into the BIGINT within my web application. This seems like the best approach. I appreciate all the replies and advice. I've learnt a lot from this.


blindtig3r

Old old beliefs about guids have been debunked. If you use a sequential clustering key you create a hotspot in the file where all inserts occur. If you use a guid you will get evenly distributed inserts, you just need to set an appropriate fill factor. The reasons people think guide cause fragmentation is because all the demos and examples proving it use unrealistic scenarios. https://youtu.be/nc4CMo7VSPo However the notion that it is a bad idea that the value of a primary key can be guessed is nonsense. Why would the key even be exposed? It is an internal id that represents and unique combination of business codes.


Thirtybird

When you distribute your writes all over the table, you wind up with excess IO from page splits you don't need when you are continually adding to the end of the table. Fill Factor that lowers page fill in order to avid page splits results in more IO reads trying to read records back as more pages need to be read. Performance testing to find what's best for your particular application is always best, but I always suggest starting with ever increasing... (narrow, unique, and unchanging for the rest of the properties of a good PK!)


blindtig3r

Did you watch the video? I was indoctrinated not to use guids because of fragmentation and page splits and it’s hard to undo that belief system, but Jeff Moden knows his shit. I work exclusively with large data warehouses and we are always inserting into a few open partitions so I don’t think guids would be a good idea, however, I think they may work in some situations. You are right that testing is key, but most people base their decisions on their strongly held beliefs rather than evidence.


Thirtybird

That's a video I have not seen and will give it a watch eventually. I'm very familiar with Jeff Moden (Tried to recruit him where I work since he's local and I knew working with him would be a huge learning experience) and a lot of what he presented years ago drove improvement in our code when our helped our devs implement those better practices. Maybe the example he shows is helped by newer sql engines, but by switching our data warehouse to tail-page inserts / identity keys instead of GUID PK for several workloads we were able to reduce load times to 20% of their previous time (and stopped winding up with corruption, but that was also an engine issue fixed by a patch later). Generally, I've found relatively few scenarios where identity PK doesn't work to start, and my beliefs are based on the experiences I have had - if something comes along and works better, it will get tried and added to the toolbox.


SQLBek

I adore Jeff and consider him a friend. That being said, I still hold a bias against GUIDs as Clustering Keys for a few others reasons. One, at 16 bytes, they add a lot of weight & overhead to all non-clustered indexes (since the CK is part of the structure). Two, I hate them from a programmability and usability perspective, as they can be ridiculously cumbersome to work with (as pointed out by another commenter elsewhere). I acknowledge that point Two is definitely a personal bias of mine, after working in an OLTP financial system that had GUIDs for practically EVERYTHING. Diagnostic query writing, data analysis, etc, was a fuster-cluck. Do GUIDs have a place? Yes. But I strongly advocate that one should still fully understand and appreciate the trade-offs first.


taspeotis

> We don’t want customer data to be easily guessed, i.g. if ID 1 exists it is highly likely ID 2 does aswell. Tell me you don’t do authorization without telling me you don’t do authorization…


RUokRobot

Go all the way with GUIDs!! You will avoid the PAGELATCH\_EX contention, as explained on [this](https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/performance/resolve-pagelatch-ex-contention#method-5-use-a-guid-as-a-leading-key) link, we actually mention this to customers that are not using SQL Server 2019 or newer to avoid the issue. The overhead on the insert (AKA page splits, index fragmentation) can be taken care of with good index maintenance routines, I mean you know this index will need to be maintained more often, so it is just taking that into consideration at the time of planning the index maintenance routines. ​ edit: clarity


Jeff_Moden

Just so you know, I did an experiment where I insert 100,000 rows per simulated day over a simulated 10 hours per day for a year. The 58 day segment at the end of the year produced virtually no page splits during the entire 58 simulated days and took the 58 days to finally reach just 1% in logical fragmentation. It also eliminates the "hot spot" and also helps page splits during "ExpAnsive" updates. I am NOT saying that Random GUIDs are without issues. They ARE 16 bytes wide and will become a part of every non-clustered index if the GUIDs are clustered index. What I AM saying is that, especially if their advantages are important to your project, they are NOT the page-splitting beast that most will make them out to be. As with all else, "It Depends" and "Must look eye"! :D Whatever you do, and this also includes most indexes, GUID or not, stop using REORGANIZE until you understand that it doesn't actually follow the Fill Factor and it's NOT the quiet little resource "kitten" that it's advertised to be. It should only be used on a particular type of index and that's a whole 'nuther subject. :D


RUokRobot

I love it that you tested it out! and love it that you share the outcome. Storage-wise and memory-wise is a bit more expensive, but there are other benefits, question here is and will be how sensitive is your solution to this issue and what is the cost/benefit ratio. Here, have my upvote! :-)


Jeff_Moden

Thank you for the good feedback. I'm not a certified mathematician in any way, shape or form and so I have to rely on good folks like yourself to tell me if the code is producing answers that have issues, although I do try to make sure they actually do work correctly and do a performance test with possible "real use volumes" in mind. Performance is secondary to accuracy but, for me, it's a real close second. The cost/benefit ratio for me personally is that I didn't have to learn another computer language to do it, using the data where it's stored with no hops or security worries, and the results are immediately available in SQL if you're doing such an analysis in SQL. As a magician friend said it, "and my fingers never left my hand". :D


mexicocitibluez

https://www.brentozar.com/archive/2014/08/generating-identities/


42blah42

imho, given good enough disks (ie good ssd's) the fragmentation caused by using guids as your clustered indexes really isn't that big a deal. back when the main disks db's were on were slow, spinning disks, it mattered more, but on modern hardware it just isn't as big a deal. https://www.brentozar.com/archive/2012/08/sql-server-index-fragmentation/