I. Preface
Recently, while building a tool to save our security-services team some manual work, I needed a module that relies on web fingerprints. My first thought was to use an open-source fingerprint library, but the amount of data in those is not large, so I decided to build my own. Looking around, I found very little detailed writing on how to actually extract a site's fingerprint; most articles just recommend ready-made tools. So this post organizes my approach to capturing fingerprints and building a personal fingerprint library. If you are interested, you can use the same ideas to extract and submit fingerprints whenever a new vulnerability comes out. For sites where fingerprints have not yet been collected, the routine operation is to work them out slowly by hand.
A more advanced version is to automate the capture with scripts, or to train an algorithm to do it. The difference between the two: a script can only capture according to the fixed rules you wrote, so its error rate is higher, while a trained algorithm can improve the success rate on top of what the script achieves. Each has its benefits; which to use is a matter of preference.
II. Fingerprinting
Let's start with the idea of fingerprint extraction. A fingerprint is a unique feature that a system has; based on that feature, we can identify the framework or components an application is built on. The most common sources of fingerprints are:
Site home page source code
Packet header
Site icon hash
2.1 Home page source code fingerprint extraction
Take the CMS Discuz! as an example. Under normal circumstances it can render interfaces in quite different styles; it may look like this:
It could also look like this (the screenshots are blurred here, so don't worry about the details; it is enough to see that the two differ).
On the surface these two sites have almost nothing in common. Finding a fingerprint means digging out the points that these two different sites share, and this process is called fingerprint extraction.
Fingerprint extraction usually requires us to inspect the page source code, icons, packets, static file contents and their hashes, or unique URL paths. For the two sites above, let's check the source code.
Both sites contain a block of code in which the word Discuz appears, on the generator line. On the first site the version is 3.5 and on the second it is 3.4. The word Discuz also appears elsewhere in the source, however, so not every occurrence is usable.
What makes a feature a good fingerprint is that it is present on most sites running the software, easy to extract, and not subject to interference from other data; how to obtain such features scientifically will be discussed later. Once we pick `<meta name="generator" content="Discuz!` as the feature, we have produced a fingerprint that identifies the Discuz framework: the page source contains `<meta name="generator" content="Discuz!`. From there we only need to search an asset search engine for sites whose source contains this string to find other deployments of the same CMS. A few points to note:
1. This is not 100% accurate. What the search engine returns for this feature is not necessarily a real deployment; it may be a honeypot or a scam/gambling site. Such sites deliberately write more than one product's fingerprint features into their source code to increase the likelihood of being found.
2. To raise the matching success rate you can add more distinguishing lines of code. In the Discuz example above, if a new vulnerability affects version 3.4 while 3.5 is unaffected, then besides the most important generator line you can compare other code that differs between the versions as a secondary fingerprint, or match fingerprints on other static files.
3. Generally speaking, visiting a site directly shows the home page, but on a live network that is not always the case. If the page redirects, automating this part of the capture gets a little harder: you need to determine whether the jump is a 301, a 302 or a JS redirect, follow it, and take the HTML of the page you land on (or of its HTTP response) as the body to match against.
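As a rough illustration of point 3, HTTP 301/302 redirects are followed automatically by `requests.get(url)` (since `allow_redirects` defaults to `True`), so only client-side jumps need extra handling. Below is a minimal stdlib sketch of detecting a JS or `<meta refresh>` jump; the regexes and the function name are my own illustration, not a standard recipe, and real pages will need more patterns:

```python
import re

# Common client-side jump patterns: location.href assignment and <meta refresh>.
JS_JUMP_RE = re.compile(
    r'(?:location\.href|window\.location(?:\.href)?)\s*=\s*[\'"]([^\'"]+)[\'"]')
META_REFRESH_RE = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*url=([^\'">\s]+)', re.I)

def extract_client_redirect(html):
    """Return the target of a JS or <meta refresh> jump, or None if absent."""
    for pattern in (JS_JUMP_RE, META_REFRESH_RE):
        m = pattern.search(html)
        if m:
            return m.group(1)
    return None

print(extract_client_redirect('<script>location.href="/forum.php"</script>'))
# /forum.php
print(extract_client_redirect('<html><body>no jump</body></html>'))
# None
```

If a jump target is found, fetch it (resolving it against the current URL) and use that response's HTML as the body for fingerprint matching.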
Here are two small demos: the first simply fetches the page source, and the second fetches the source and also parses it so that specific elements can be extracted. Use whichever fits your need.
# 1. Just get the source code of the page
import requests

url = ''  # replace with the address of the page whose source you want
response = requests.get(url)
# Get the page source
html_code = response.text
# Print the page source
print(html_code)

# 2. Get the page source and parse it with BeautifulSoup, which makes it
# easier to extract, manipulate and search for specific elements in the page
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'}
url = ''
r = requests.get(url, headers=headers)
html = r.content.decode('utf-8', 'ignore')
my_page = BeautifulSoup(html, 'lxml')
print(my_page)
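Once the source is in hand, pulling out the generator fingerprint discussed above is a one-liner. A minimal stdlib-only sketch (the regex and function name are my own illustration; real pages may order attributes differently, so treat this as a starting point):

```python
import re

# Matches <meta name="generator" content="..."> and captures the content value.
GENERATOR_RE = re.compile(
    r'<meta\s+name=["\']generator["\']\s+content=["\']([^"\']+)["\']', re.I)

def extract_generator(html):
    """Return the content of the <meta name="generator"> tag, or None."""
    m = GENERATOR_RE.search(html)
    return m.group(1) if m else None

page = '<head><meta name="generator" content="Discuz! X3.4" /></head>'
print(extract_generator(page))  # Discuz! X3.4
```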
2.2 Packet header fingerprinting
The headers of the home-page response can also be matched. There are often special fields among the response headers, and by matching and comparing those fields across return packets we obtain fingerprints. The headers also tell us whether a proxy or cache is in use, which can be judged briefly from the Via field. In addition, some developers define their own fields in the cookie section, such as JeecmsSession="xxxxx" or cookies="phpcms_xxxx".
For example, given the cookie field rememberMe shown above, there is a good chance the site uses Shiro; verify it with a tool:
Attached here is a small demo for obtaining the headers of the response packet. Once we have the headers, we can compare their differences to decide whether the fingerprint we want is present.
import requests
url = '' # Replace the URL with the address of the web page from which you want to get the packet header.
response = requests.get(url)
# Get the packet headers
headers = response.headers
# Print the headers
print(headers)
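With the headers in hand, matching them against fingerprint rules is straightforward. A minimal sketch of the idea, using the cookie fields mentioned above; the rule list and function name are my own illustration, not a real fingerprint database:

```python
# Each rule is (product name, predicate over a headers dict).
HEADER_RULES = [
    ('Shiro',  lambda h: 'rememberMe' in h.get('Set-Cookie', '')),
    ('JEECMS', lambda h: 'JeecmsSession' in h.get('Set-Cookie', '')),
    ('PHPCMS', lambda h: 'phpcms' in h.get('Set-Cookie', '').lower()),
]

def match_header_fingerprints(headers):
    """Return the names of all products whose header rule matches."""
    return [name for name, rule in HEADER_RULES if rule(headers)]

headers = {'Server': 'nginx', 'Set-Cookie': 'rememberMe=deleteMe; Path=/'}
print(match_header_fingerprints(headers))  # ['Shiro']
```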
2.3 Site icon hash value
A site usually has a fixed icon (favicon) next to its title, and sites built on the same framework or components often ship the same icon. By searching on the icon's hash value we can find all the sites that never changed the default icon. One caveat: the icon carries no version information, so you still need to determine further whether a matched site is actually one we want, i.e. whether it has the vulnerability.
The site icon hash can be obtained with the following code, which hashes the base64-encoded icon with murmur3, the same scheme asset search engines such as Shodan and FOFA use.
import base64
import mmh3  # third-party package: pip install mmh3
import requests

url = ''  # replace with the address of the site's icon, e.g. https://example.com/favicon.ico
response = requests.get(url)
# Base64-encode the icon bytes, then take the murmur3 hash
favicon_b64 = base64.encodebytes(response.content)
icon_hash = mmh3.hash(favicon_b64)
# Print the icon hash
print(icon_hash)
III. Evaluation criteria
When we have collected a certain number of fingerprints, how do we judge whether they are actually good, and how accurate they are? This brings in the evaluation criteria. The current evaluation system for fingerprints and POCs relies mainly on two indicators: precision and recall. Precision is defined with respect to the prediction results: it indicates how many of the samples predicted to be positive really are positive. A positive prediction has two possible origins, a positive class predicted as positive (TP) and a negative class predicted as positive (FP), i.e. P = TP/(TP + FP)
Predicted \ Actual | Positive | Negative
---|---|---
Positive | TP | FP
Negative | FN | TN
Recall indicates how many of the positive examples in the sample were predicted correctly. Again there are two possibilities: an originally positive class predicted as positive (TP), and an originally positive class predicted as negative (FN), i.e. R = TP/(TP + FN). In simple terms, precision measures how many of the fingerprint-matching results are accurate and how many are false positives: the more false positives, the lower the precision. Recall evaluates how many of the results that should have been matched were missed: the higher the recall, the lower the miss rate.
When matching fingerprints we of course hope that both the miss rate and the false-positive rate are as low as possible, so that precision and recall both approach 100%. We therefore need a single metric that balances precision against recall, and here I choose the F-score.
The F-score is an index that combines precision and recall: precision only describes how accurate the predictions are, while recall reflects how complete they are. On the same dataset the two constrain each other; in general, improving precision means recall drops, so there is a negative correlation between them. The F-score formula is: F1 = 2 × Precision × Recall / (Precision + Recall)
Here Precision denotes the precision rate and Recall the recall rate; Predict_true denotes the correctly predicted data of each class, class_Testtext the number of samples of each class in the test dataset, and predict_class_Testtext the number of each class in the prediction result.
When precision and recall are not equally important, the generalized F-score can be used as the metric. Its formula is: Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
Here β indicates the relative weight of precision and recall. In general β = 1 is selected, meaning precision and recall are equally important; β > 1 means recall matters more than precision, and β < 1 means precision matters more than recall. In this article the F-score with β = 1, where the two are equally important, is used to measure the model's effect. Besides the F-score, judgments are also made on the basis of the null precision and the precision on the test dataset.
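The formulas above are simple enough to compute directly from raw counts. A minimal sketch (the counts in the example are made up for illustration):

```python
def precision(tp, fp):
    """P = TP / (TP + FP): how many predicted positives are real."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = TP / (TP + FN): how many real positives were found."""
    return tp / (tp + fn)

def f_score(p, r, beta=1.0):
    """Generalized F-score; beta=1 weighs precision and recall equally."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Example: a fingerprint matched 100 sites, of which 90 really run the CMS,
# while 30 real deployments were missed.
p = precision(90, 10)   # 0.9
r = recall(90, 30)      # 0.75
print(p, r)
print(f_score(p, r))    # F1, roughly 0.818
```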
When recall and precision are not high enough, we can do a concatenated matching operation: select a special, unique JS or CSS file name, or a special unique color value, and add it to our fingerprint as an extra condition. To a certain extent this increases the recall and precision of the fingerprint.
IV. Summary
To summarize, the most commonly used fingerprint sources are: the site home-page source code, the packet headers, and the site icon hash. There are other approaches too, such as static file contents and static file hashes; any file, code or hash unique to a product can serve as a fingerprint. But fetching such a file requires launching an extra request, which makes the scan easier to detect, so I do not recommend that approach. Fingerprint capture itself is not difficult; the focus is on how to guarantee the precision and recall of the fingerprints we obtain, and on what kind of metric to use as the evaluation criterion.
This article mainly covered the basics of fingerprint capture, how to capture fingerprints with code, and how to judge, once capture is complete, whether the captured fingerprints can accurately identify the asset. A later post will cover the research process of automated fingerprint capture: how to automate it through code and valid samples.